[fpc-devel] for-in-index loop

Fri Jan 25 11:39:10 CET 2013

On Fri, 25 Jan 2013, Michael Schnell wrote:

> On 01/25/2013 11:12 AM, Michael Van Canneyt wrote:
>> 
>> Pchar ?
>> 
> You seem to miss my point: the n'th printable character in a UTF-8 encoded 
> string (whether it is stored as a PChar or a string) starts at the m'th byte 
> (m >= n).
>
> To find m for a given n you need to scan all the bytes before position m.
>
> Thus a loop such as
>
> for I := 1 to 100000 do begin
>  n := Random(100000) + 1;  // strings are 1-based, Random(k) yields 0..k-1
>  c := myString[n];
> end;
>
> is rather fast with ANSI-encoded strings.
>
> When myString is encoded in UTF-8, it obviously yields meaningless code 
> bytes instead of printable characters, and replacing the expression 
> myString[n] with a straightforward function that searches for the n'th 
> printable character will be very slow.

Yes, it is a known fact that this is a weakness of UTF-8. Consider 
transforming the string to UTF-16 (fixed width only as long as no 
surrogate pairs occur), UTF-32 or even an internal data structure before 
doing the random access.
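
As a rough illustration of the "internal data structure" idea, here is a 
minimal sketch that scans the string once and records the byte offset of 
every code point, so that later lookups are a plain array access. The names 
BuildIndex, CodePointAt and TOffsetArray are made up for this example and 
are not part of the RTL. It assumes valid UTF-8 input and treats one code 
point as one character, so combining sequences are not handled.

```pascal
program Utf8IndexSketch;
{$mode objfpc}{$H+}

type
  TOffsetArray = array of SizeInt;

{ Hypothetical helper: returns the 1-based byte offset of every code point
  in S. A byte starts a code point unless it is a UTF-8 continuation byte
  (bit pattern 10xxxxxx). Assumes S holds valid UTF-8. }
function BuildIndex(const S: AnsiString): TOffsetArray;
var
  I, Count: SizeInt;
begin
  Result := nil;
  SetLength(Result, Length(S));   // upper bound: one code point per byte
  Count := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
    begin
      Result[Count] := I;
      Inc(Count);
    end;
  SetLength(Result, Count);       // shrink to the real number of code points
end;

{ Hypothetical helper: returns the bytes of the N'th code point (1-based)
  without rescanning S. N must be in the range 1..Length(Index). }
function CodePointAt(const S: AnsiString; const Index: TOffsetArray;
  N: SizeInt): AnsiString;
var
  First, Next: SizeInt;
begin
  First := Index[N - 1];
  if N < Length(Index) then
    Next := Index[N]
  else
    Next := Length(S) + 1;
  Result := Copy(S, First, Next - First);
end;

var
  S: AnsiString;
  Index: TOffsetArray;
begin
  // 'a', U+00E9 (2 bytes), 'b', U+20AC (3 bytes), written as raw bytes
  S := 'a'#$C3#$A9'b'#$E2#$82#$AC;
  Index := BuildIndex(S);                      // one O(n) pass
  WriteLn('code points: ', Length(Index));
  WriteLn('4th starts at byte: ', Index[3]);
  WriteLn('4th byte length: ', Length(CodePointAt(S, Index, 4)));
end.
```

The index costs one SizeInt per code point, which is more memory than a 
UTF-32 copy on 64-bit targets, so it only pays off if the original UTF-8 
bytes have to be kept around anyway.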

Random access inside UTF-8 is an algorithmic time complexity issue: each 
lookup needs a linear scan from the start of the string. A language 
extension can only be a band-aid for that.

Daniël