[fpc-pascal] Case insensitive comparison of strings with non-ascii characters
Luiz Americo Pereira Camara
pascalive at bol.com.br
Thu Jul 23 14:02:38 CEST 2009
theo wrote:
> @Luiz Americo
>
> Your code
> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
> will work, but if speed matters, then it's rather bad.
>
Hi, I'm aware that the performance is bad, although I had not measured it
as you did. At this point I'd like to stick with a solution that fpc
provides natively, since it's being used in an fpc component
(TSqlite3Dataset).
In the last revision I switched to the ANSI versions of the functions to
save the conversion of the Key at each comparison. See
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/fcl-db/src/sqlite/customsqliteds.pas?view=log#rev13431
Anyway, it is clear that functions to handle UTF-8 and Unicode in general
are missing in fpc...
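
To make the trade-off concrete, here is a minimal sketch of the two call
styles being discussed. It is not the code from the revision linked
above: the helper names SameTextDecodeBoth and SameTextAnsi are made up
for illustration, and the ANSI variant only handles UTF-8 input properly
if the system encoding is UTF-8 (an assumption about the environment).

```pascal
program CompareSketch;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // installs a Unicode-aware widestring manager on Unix
  SysUtils;

// Hypothetical helper: decodes BOTH operands from UTF-8 on every call,
// then compares the resulting wide strings case-insensitively.
function SameTextDecodeBoth(const Key, Str: AnsiString): Boolean;
begin
  Result := WideCompareText(UTF8Decode(Key), UTF8Decode(Str)) = 0;
end;

// Hypothetical helper: uses the ANSI routine, with no explicit UTF8Decode
// in the caller. Assumes the system encoding is UTF-8.
function SameTextAnsi(const Key, Str: AnsiString): Boolean;
begin
  Result := AnsiCompareText(Key, Str) = 0;
end;

begin
  // ASCII-only sample data, so the source file encoding does not matter.
  WriteLn(SameTextDecodeBoth('abc', 'ABC'));
  WriteLn(SameTextAnsi('abc', 'ABC'));
end.
```

Both helpers still rely on the installed widestring manager for anything
beyond ASCII, so results for non-ASCII text depend on the platform and
locale.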
> I've tried to make a faster function for UTF-8:
>
... maybe your function can be used as a basis for future development.
Perhaps add a new function to the widestringmanager?
Luiz
> uses unicodeinfo, LCLProc;
>
> function UTF8CompareText(s1, s2: UTF8String): Integer;
> var
>   u1, u2: Ucs4Char;
>   u1l, u2l: longint;
>   BytePos1, Len1, SLen1: integer;
>   BytePos2, Len2, SLen2: integer;
> begin
>   Result := 0;
>   BytePos1 := 1;
>   BytePos2 := 1;
>   SLen1 := System.Length(s1);
>   SLen2 := System.Length(s2);
>
>   // Assuming lower/uppercase representations have the same byte length
>   if SLen1 <> SLen2 then
>   begin
>     if SLen1 > SLen2 then Result := 1 else Result := -1;
>     exit;
>   end;
>
>   repeat
>     u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
>     inc(BytePos1, Len1);
>     u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
>     inc(BytePos2, Len2);
>     if u1 <> u2 then
>     begin
>       {$IFDEF useunicodinfo}
>       u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
>       if u1l <> -1 then u1 := u1l;
>       u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
>       if u2l <> -1 then u2 := u2l;
>       {$ELSE}
>       u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
>       u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
>       {$ENDIF}
>       if u1 <> u2 then
>       begin
>         Result := u1 - u2;
>         exit;
>       end;
>     end;
>   until (BytePos1 > SLen1) or (BytePos2 > SLen2)
> end;
>
>
> Some numbers for my system (Linux) where WideCompareText is the function
> you use now, WideUppercase is the above function and unicodeinfo is
> the above function with useunicodinfo defined. See here
> http://wiki.lazarus.freepascal.org/Theodp
>
>
> Comparing identical Strings of 322 Chars 10000 times
> WideCompareText: 785ms
> unicodeinfo: 75ms
> WideUpperCase: 74ms
>
> Comparing Strings of 322 Chars 10000 times where the 3rd char differs
> WideCompareText: 268ms
> unicodeinfo: 3ms
> WideUpperCase: 8ms
>
> Comparing identical Text of 322 Chars 10000 times where one Text is all
> uppercase
> WideCompareText: 810ms
> unicodeinfo: 121ms
> WideUpperCase: 1076ms
>
> Regards Theo
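
For anyone who wants to repeat this kind of measurement, a rough timing
loop could look like the sketch below. It is not the benchmark used for
the figures above: the test strings are ASCII placeholders (the original
test data is not shown in the post), only the WideCompareText variant is
timed, and GetTickCount64 is assumed to be available in the SysUtils
unit of the compiler version in use.

```pascal
program CompareTiming;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // Unicode-aware widestring manager on Unix
  SysUtils;

const
  Iterations = 10000;  // same repeat count as in the quoted figures
  TextLen = 322;       // same string length as in the quoted figures

var
  A, B: AnsiString;
  i, Mismatches: Integer;
  StartTick: QWord;
begin
  // Placeholder data: one all-lowercase and one all-uppercase string.
  A := StringOfChar('a', TextLen);
  B := StringOfChar('A', TextLen);

  Mismatches := 0;
  StartTick := GetTickCount64;
  for i := 1 to Iterations do
    // Decode both operands each time, as in the WideCompareText variant.
    if WideCompareText(UTF8Decode(A), UTF8Decode(B)) <> 0 then
      Inc(Mismatches);

  WriteLn('WideCompareText: ', GetTickCount64 - StartTick, 'ms, mismatches: ',
    Mismatches);
end.
```

Counting mismatches keeps the comparison result in use inside the loop;
to time another variant, swap the call in the loop body.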