[fpc-pascal] Case insensitive comparison of strings with non-ascii characters

theo xpde at theo.ch
Sat Jul 25 17:46:39 CEST 2009


@Luiz Americo

Your code
WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
will work, but it is rather slow if speed matters.
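
For reference, that baseline can be wrapped as below. This is only a
sketch: the name SlowUTF8CompareText is made up here, and pulling in the
cwstring unit is my assumption for Unix systems, where WideCompareText
needs a widestring manager installed to handle non-ASCII characters.

```pascal
program BaselineCompare;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager for Unix (assumption)
  SysUtils;

// Hypothetical wrapper: decode both UTF-8 strings to UTF-16 first,
// then let WideCompareText do the case-insensitive comparison.
function SlowUTF8CompareText(const Key, Str: UTF8String): Integer;
begin
  Result := WideCompareText(UTF8Decode(Key), UTF8Decode(Str));
end;

begin
  // ASCII-only literals keep the example independent of source encoding.
  WriteLn(SlowUTF8CompareText('abc', 'ABC'));
  WriteLn(SlowUTF8CompareText('abc', 'abd') < 0);
end.
```

Each call decodes both strings into new temporaries before comparing,
which is likely where much of the time goes.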

I've tried to make a faster function for UTF-8:

uses unicodeinfo, LCLProc;

function UTF8CompareText(s1, s2: UTF8String): Integer;
var u1, u2: Ucs4Char;
  u1l, u2l: longint;
  BytePos1, Len1, SLen1: integer;
  BytePos2, Len2, SLen2: integer;
begin
  Result := 0;
  BytePos1 := 1;
  BytePos2 := 1;
  SLen1 := System.Length(s1);
  SLen2 := System.Length(s2);

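  // Assumes the lower- and uppercase forms of a character have the same
  // UTF-8 byte length, so strings of different byte length cannot match.
  if SLen1 <> SLen2 then
  begin
    if SLen1 > SLen2 then Result := 1 else Result := -1;
    exit;
  end;

  // Two empty strings are equal; exit before the loop indexes s1[1].
  if SLen1 = 0 then exit;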

  repeat
    u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
    inc(BytePos1, Len1);
    u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
    inc(BytePos2, Len2);
    if u1 <> u2 then
    begin
      {$IFDEF useunicodinfo}
      u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
      if u1l <> -1 then u1 := u1l;
      u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
      if u2l <> -1 then u2 := u2l;
      {$ELSE}
      // WideChar() keeps only 16 bits, so this branch is limited to the BMP.
      u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
      u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
      {$ENDIF}
      if u1 <> u2 then
      begin
        Result := u1 - u2;
        exit;
      end;
    end;
  until (BytePos1 > SLen1) or (BytePos2 > SLen2)
end;
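
One caveat on the length shortcut at the top of the function: the
assumption does not hold for every case pair in Unicode. A small sketch
to illustrate (UTF8EncodedLength is a helper made up for this example,
not part of LCLProc or unicodeinfo):

```pascal
program CaseByteLength;
{$mode objfpc}{$H+}

// Number of bytes UTF-8 needs to encode one code point.
function UTF8EncodedLength(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $80 then Result := 1
  else if CodePoint < $800 then Result := 2
  else if CodePoint < $10000 then Result := 3
  else Result := 4;
end;

begin
  // U+212A KELVIN SIGN lowercases to U+006B 'k': 3 bytes versus 1.
  WriteLn(UTF8EncodedLength($212A), ' ', UTF8EncodedLength($006B));
  // U+023A lowercases to U+2C65: 2 bytes versus 3.
  WriteLn(UTF8EncodedLength($023A), ' ', UTF8EncodedLength($2C65));
end.
```

Strings that differ only in such a pair are reported as different by the
length check before any case mapping happens, so the shortcut trades
correctness on these rare characters for speed.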


Some numbers for my system (Linux). WideCompareText is the function
you use now, WideUpperCase is the above function without the define, and
unicodeinfo is the above function with useunicodinfo defined. See here
http://wiki.lazarus.freepascal.org/Theodp


Comparing identical Strings of 322 Chars 10000 times
WideCompareText: 785ms
unicodeinfo: 75ms
WideUpperCase: 74ms

Comparing Strings of 322 Chars 10000 times where the 3rd char differs
WideCompareText: 268ms
unicodeinfo: 3ms
WideUpperCase: 8ms

Comparing identical Text of 322 Chars 10000 times where one Text is all
uppercase
WideCompareText: 810ms
unicodeinfo: 121ms
WideUpperCase: 1076ms
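
A timing loop along these lines can be used to reproduce such numbers.
It is a rough sketch, not the exact test program: TCompareFunc,
TimeCompare and BaselineCompare are names made up here, the cwstring
unit is an assumption for Unix, and only the WideCompareText baseline is
timed. Any function with a matching signature can be passed instead.

```pascal
program CompareTiming;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager for Unix (assumption)
  SysUtils, DateUtils;

type
  TCompareFunc = function(const a, b: UTF8String): Integer;

// Stand-in for the function under test.
function BaselineCompare(const a, b: UTF8String): Integer;
begin
  Result := WideCompareText(UTF8Decode(a), UTF8Decode(b));
end;

// Hypothetical helper: call Compare Iterations times, return elapsed ms.
function TimeCompare(Compare: TCompareFunc; const a, b: UTF8String;
  Iterations: Integer): Int64;
var
  StartTime: TDateTime;
  i: Integer;
begin
  StartTime := Now;
  for i := 1 to Iterations do
    Compare(a, b); // result is discarded, only the time matters
  Result := MilliSecondsBetween(Now, StartTime);
end;

var
  s1, s2: UTF8String;
begin
  // Same shape as the third test: 322 chars, one text all uppercase.
  s1 := StringOfChar('a', 322);
  s2 := StringOfChar('A', 322);
  WriteLn('Baseline: ', TimeCompare(@BaselineCompare, s1, s2, 10000), 'ms');
end.
```

Absolute times depend on the machine and the installed widestring
manager, so only the ratios between functions are meaningful.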

Regards Theo



