[fpc-pascal] Case insensitive comparison of strings with non-ascii characters

Luiz Americo Pereira Camara pascalive at bol.com.br
Thu Jul 23 14:02:38 CEST 2009


theo wrote:
> @Luiz Americo
>
> Your code
> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
> will work, but if speed matters, then it's rather bad.
>   

Hi, I'm aware that the performance is bad, although I have not tested it like 
you did. At this point, though, I'd like to stick with a solution that FPC 
provides natively, since it's being used in an FPC component 
(TSqlite3Dataset).

In the last revision I switched to the ANSI versions of the functions to 
avoid converting the Key at each comparison. See 
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/fcl-db/src/sqlite/customsqliteds.pas?view=log#rev13431
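
To make the difference concrete, here is a minimal sketch of the two variants. It assumes that "the ANSI version" means SysUtils.AnsiCompareText; the function names CompareDecoded and CompareAnsi are made up for illustration and are not taken from customsqliteds.pas.

```pascal
program CompareSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // installs a widestring manager on Unix
  SysUtils;

// Earlier variant: both UTF-8 strings are decoded on every call.
function CompareDecoded(const Key, Str: String): PtrInt;
begin
  Result := WideCompareText(UTF8Decode(Key), UTF8Decode(Str));
end;

// Assumed current variant: no explicit decoding in the caller.
function CompareAnsi(const Key, Str: String): Integer;
begin
  Result := AnsiCompareText(Key, Str);
end;

begin
  WriteLn(CompareDecoded('abc', 'ABC'));
  WriteLn(CompareAnsi('abc', 'ABC'));
end.
```

How AnsiCompareText treats non-ASCII characters depends on the installed widestring manager and the system locale, so results for such input may differ between platforms.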

Anyway, it is clear that functions to handle UTF-8 and Unicode in general 
are missing in FPC...
> I've tried to make a faster function for UTF-8:
>   

... maybe your function can be used as a base for future development. Add 
a new function to the widestringmanager?

Luiz
> uses unicodeinfo, LCLProc;
>
> function UTF8CompareText(s1, s2: UTF8String): Integer;
> var u1, u2: Ucs4Char;
>   u1l, u2l: longint;
>   BytePos1, Len1, SLen1: integer;
>   BytePos2, Len2, SLen2: integer;
> begin
>   Result := 0;
>   BytePos1 := 1;
>   BytePos2 := 1;
>   SLen1 := System.Length(s1);
>   SLen2 := System.Length(s2);
>
>   // Assuming lower/uppercase representations have the same byte length
>   if SLen1 <> SLen2 then
>   begin
>     if SLen1 > SLen2 then Result := 1 else Result := -1;
>     exit;
>   end;
>   if SLen1 = 0 then exit;  // both strings empty: equal (Result is already 0)
>
>   repeat
>     u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
>     inc(BytePos1, Len1);
>     u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
>     inc(BytePos2, Len2);
>     if u1 <> u2 then
>     begin
>       {$IFDEF useunicodinfo}
>       u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
>       if u1l <> -1 then u1 := u1l;
>       u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
>       if u2l <> -1 then u2 := u2l;
>       {$ELSE}
>       u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
>       u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
>       {$ENDIF}
>       if u1 <> u2 then
>       begin
>         Result := u1 - u2;
>         exit;
>       end;
>     end;
>   until (BytePos1 > SLen1) or (BytePos2 > SLen2)
> end;
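
The early exit on differing byte lengths relies on the assumption stated in the comment, and that assumption does not hold for every character. A minimal sketch of one counterexample: dotless i (U+0131) takes two bytes in UTF-8, while its uppercase form, the plain ASCII 'I', takes one. The byte escapes are used so the example does not depend on the source file encoding; the constant names are made up for illustration.

```pascal
program LengthAssumption;
{$mode objfpc}{$H+}
const
  DotlessI: String = #$C4#$B1;  // U+0131 LATIN SMALL LETTER DOTLESS I in UTF-8
  CapitalI: String = 'I';       // its uppercase mapping, U+0049
begin
  // Byte lengths differ, so the length check in UTF8CompareText would
  // report these as unequal before any case folding takes place.
  WriteLn(Length(DotlessI), ' ', Length(CapitalI));
end.
```

For most text this will not matter, but it means the function can report a difference for strings that a full case-insensitive comparison would treat as equal.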
>
>
> Some numbers from my system (Linux). WideCompareText is the function
> you use now, WideUpperCase is the function above, and unicodeinfo is
> the function above with useunicodinfo defined. See
> http://wiki.lazarus.freepascal.org/Theodp
>
>
> Comparing identical Strings of 322 Chars 10000 times
> WideCompareText: 785ms
> unicodeinfo: 75ms
> WideUpperCase: 74ms
>
> Comparing Strings of 322 Chars 10000 times where the 3rd char differs
> WideCompareText: 268ms
> unicodeinfo: 3ms
> WideUpperCase: 8ms
>
> Comparing identical Text of 322 Chars 10000 times where one Text is all
> uppercase
> WideCompareText: 810ms
> unicodeinfo: 121ms
> WideUpperCase: 1076ms
>
> Regards Theo
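
For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing loop. It is not the test program used for the numbers above: the strings are ASCII placeholders of the quoted length, only WideCompareText is timed, and GetTickCount64 needs a reasonably recent FPC. Timings will differ from system to system.

```pascal
program BenchSketch;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // installs a widestring manager on Unix
  SysUtils;

const
  Iterations = 10000;  // repetition count quoted above
  TextLen = 322;       // string length quoted above

var
  A, B: String;
  i: Integer;
  Acc: PtrInt;
  T0: QWord;
begin
  A := StringOfChar('a', TextLen);  // placeholder text, not the original data
  B := StringOfChar('A', TextLen);  // same text, all uppercase
  Acc := 0;
  T0 := GetTickCount64;
  for i := 1 to Iterations do
    // Accumulate the results so the calls cannot be optimised away.
    Acc := Acc + WideCompareText(UTF8Decode(A), UTF8Decode(B));
  WriteLn('WideCompareText: ', GetTickCount64 - T0, 'ms (sum ', Acc, ')');
end.
```

To compare another function, replace the call inside the loop and run the program again with the same input.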