[fpc-devel] Unicode and UTF8String

Mon Dec 1 16:39:09 CET 2008

Martin Friebe wrote:
> Marco van de Voort wrote:
>> In our previous episode, Martin Friebe said:  
>>> I agree, using RTLString will probably help fpc to optimize your exe 
>>> for each OS.
>>>
>>> But, using RTLString means you do not know, if you have UTF8 or not. 
>>>     
>> Correct.
>>  
>>> Because UTF8 behaves slightly differently from other strings, many 
>>> operations cannot be performed on RTLString
>>>
>>> foo[1], copy, pos ... simply because you do not know, if the result 
>>> is a char, a codepoint or a subcodepoint (single utf8 byte)
>>>     
>> You don't know that about UTF-16 either. Even though that is no 
>> problem in
>>   
> True, good point
>> most cases, it is slowly time to abandon too simplistic thinking about
>> strings. The best solution is to minimize editing, and localize them in
>> certain parts of the code, keeping most of the code encoding agnostic.
>>   
> True, too. But we are talking Pascal, not some other language. 
> string[index], copy, pos, length have always been part of Pascal.
>
> Of course they are still there, to be used in the few parts of your 
> code, where you specialize on whatever string type you deal with.
> But otherwise, using RTLString IMHO will abandon this part of Pascal 
> syntax. A function whose result can change at compile time cannot be 
> relied on, so such a function cannot be used (or we will have buffer 
> overflows, code injection and more ...).

To use RTLString safely, at first glance it would be sufficient to use 
the overloaded functions from the Character unit (introduced in Delphi 
2009). See http://www.jacobthurman.com/?p=30 for how you can use them to 
get Copy and Pos behavior.
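
As a rough illustration of the indexing problem discussed above, here is
a small sketch. It is written in Python rather than Pascal only so that
it is easy to run, and cp_copy and cp_pos are made-up names for this
mail, not the API of the Character unit. It shows that the same text has
a different length per encoding, and how Copy/Pos could work on code
points instead of code units.

```python
# Illustrative sketch only: cp_copy and cp_pos are hypothetical helpers,
# not functions from Delphi's Character unit or from the FPC RTL.

def cp_copy(utf8: bytes, index: int, count: int) -> bytes:
    """Copy `count` code points starting at 1-based code point `index`."""
    text = utf8.decode("utf-8")
    return text[index - 1:index - 1 + count].encode("utf-8")


def cp_pos(sub: bytes, utf8: bytes) -> int:
    """Return the 1-based code point position of `sub`, or 0 if absent."""
    found = utf8.decode("utf-8").find(sub.decode("utf-8"))
    return found + 1  # str.find returns -1 when absent, which maps to 0


# 'a', U+00F1 (n with tilde) and U+1D11E (musical G clef).
text = "a\u00f1\U0001D11E"
s = text.encode("utf-8")

byte_len = len(s)                                  # UTF-8 code units
utf16_len = len(text.encode("utf-16-le")) // 2     # UTF-16 code units
codepoint_len = len(text)                          # code points

# Indexing by code unit, like foo[2] on a UTF-8 string, yields only the
# lead byte of a two-byte sequence, which is not a character by itself.
second_byte = s[1:2]

# Indexing by code point returns the complete encoded character.
second_char = cp_copy(s, 2, 1)
clef_pos = cp_pos("\U0001D11E".encode("utf-8"), s)
```

Here the three lengths all differ (bytes, UTF-16 units, code points), so
code that takes an index from one string type and applies it to another
will cut through the middle of a character.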

Next week, I'll implement those functions for UTF-16 and UTF-8 and do 
some tests.

Luiz