[fpc-devel] Unicode support in RTL - Roadmap

Fri Nov 21 15:27:55 CET 2008

On 21 Nov 2008, at 14:50, Michael Schnell wrote:

>> If Length() would return its value in chars, what length in *bytes*  
>> would the following call set:
>>
>> SetLength(utfstring_1, Length(utfstring_2));
>>
> I don't really understand your question.
>
> I think we would need to have two different functions
>
> UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String),
> the first giving the string length in code elements (bytes) and the
> second giving the length in code points (Unicode characters).
>
> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would
> be 1.

Or 2, depending on whether it's precomposed or decomposed.
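
To make the difference concrete, here is a minimal sketch. CountCodePoints
is a made-up helper standing in for the proposed UTF8PointLength, not an
existing RTL routine, and the byte sequences are written out explicitly so
the result does not depend on the source file encoding. It assumes the
input is valid UTF-8.

```pascal
program CodePointDemo;
{$mode objfpc}{$H+}

{ Hypothetical helper, not an RTL routine: counts UTF-8 code points by
  counting every byte that is not a continuation byte (10xxxxxx).
  Assumes S holds valid UTF-8. }
function CountCodePoints(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$9C;    { U+00DC, a single code point }
  Decomposed := 'U'#$CC#$88;  { U+0055 followed by U+0308 (combining diaeresis) }

  { Length() counts bytes here; CountCodePoints counts code points. }
  WriteLn(Length(Precomposed), ' ', CountCodePoints(Precomposed)); { 2 1 }
  WriteLn(Length(Decomposed), ' ', CountCodePoints(Decomposed));   { 3 2 }
end.
```

Both strings display as the same character, yet neither the byte count nor
the code point count agrees between them.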

> I think we should have a third function Length(UTF8String) that can
> be selected by the user (e.g. via a {$ option) to be mapped to either
> of the two.

He's simply talking about the case where Length is mapped to your  
proposed UTF8PointLength.

> I do see that there in fact is a compatibility problem when porting  
> old code with the setting of UTF8Count=Point.
>
> here
>
> SetLength(utfstring_1, Length(utfstring_2)); would be translated as
> UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
>
> which does not make sense if UTF8PointLength(utfstring_1) is smaller  
> than UTF8PointLength(utfstring_2).

It does not make any sense under any circumstances, because there is  
no way for "UTF8PointSetLength" to know how many bytes it has to  
allocate when you pass a value (any value, regardless of where it  
comes from) to it.
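
A small sketch of why: the three strings below all contain 3 code points
but occupy 3, 6 and 9 bytes. CountCodePoints is again a hypothetical helper
in the spirit of the proposed UTF8PointLength, not an RTL routine, and it
assumes valid UTF-8 input.

```pascal
program SameCountDifferentBytes;
{$mode objfpc}{$H+}

{ Hypothetical helper, not an RTL routine: counts the bytes that are not
  UTF-8 continuation bytes (10xxxxxx). Assumes S holds valid UTF-8. }
function CountCodePoints(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Ascii, Latin, Euro: AnsiString;
begin
  Ascii := 'aaa';                                { 3 x U+0061, 1 byte each }
  Latin := #$C3#$9C#$C3#$9C#$C3#$9C;             { 3 x U+00DC, 2 bytes each }
  Euro  := #$E2#$82#$AC#$E2#$82#$AC#$E2#$82#$AC; { 3 x U+20AC, 3 bytes each }

  { Same code point count in every case, different byte length. }
  WriteLn(CountCodePoints(Ascii), ' ', Length(Ascii)); { 3 3 }
  WriteLn(CountCodePoints(Latin), ' ', Length(Latin)); { 3 6 }
  WriteLn(CountCodePoints(Euro), ' ', Length(Euro));   { 3 9 }
end.
```

Given only the number 3, a code-point-based SetLength cannot tell which of
these byte sizes (or up to 12, with 4-byte sequences) it should allocate.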

Jonas