Sergei Gorelkin sergei_gorelkin at mail.ru
Fri Nov 21 15:46:08 CET 2008

Michael Schnell wrote:

> I don't really understand your question.
> I think we would need to have two different functions,
> UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), the first 
> giving the string length in code elements (bytes) and the second giving the 
> length in code points (Unicode characters).
> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.
> I think we should have a third function, Length(UTF8String), that can be 
> selected by the user (e.g. via a {$ option) to be mapped to either of the 
> two.
> The same would be necessary for the SetLength function
> e.g.
> (1) UTF8ElementSetLength(utfstring_1, UTF8ElementLength(utfstring_2));
> or
> (2) UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
> (2) would work as expected if the purpose is to delete all but the first 
> n characters in a string.
> I don't see a decent use for (1) other than creating a string long 
> enough to use as a buffer for e.g. TStream.read.
> I do see that there in fact is a compatibility problem when porting old 
> code with the setting of UTF8Count=Point.
> here
> SetLength(utfstring_1, Length(utfstring_2)); would be translated as
> UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));
> which does not make sense if UTF8PointLength(utfstring_1) is smaller 
> than UTF8PointLength(utfstring_2).

The SetLength function is used mostly for allocating storage for 
new strings. Yes, it can also be used for truncating overlong strings, 
but truncating can be done perfectly well with Delete (or UTF8Delete).

As you mentioned yourself, allocating UTF-8 strings using a length in 
code points makes no sense. This is exactly what I wanted to say initially.

It follows that for calls like SetLength(str1, Pos('foo', str2)) 
you also cannot freely change the return value of Pos() from elements to 
code points. And so on, and so forth.
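
The Pos() case can be sketched the same way. This assumes the current 
byte-based Pos on an AnsiString; the variable names are only illustrative:

```pascal
program PosAsByteIndex;
{$mode objfpc}{$H+}

var
  Str1, Str2: AnsiString;
  N: SizeInt;
begin
  // 'Üfoo' written as explicit UTF-8 bytes; the 'Ü' occupies bytes 1..2.
  Str2 := #$C3#$9C'foo';

  // Byte-based Pos finds 'foo' at index 3. A code-point-based Pos
  // would report 2 for the same string.
  N := Pos('foo', Str2);
  WriteLn('Pos = ', N);

  // Allocate N bytes and copy the prefix up to and including the 'f'.
  // This is only correct because N is measured in bytes.
  SetLength(Str1, N);
  Move(Str2[1], Str1[1], N);
  WriteLn('prefix copied correctly: ', Str1 = Copy(Str2, 1, N));
end.
```

If Pos returned 2 here while SetLength and Move kept counting bytes, Str1 
would end up holding only the 'Ü', without the 'f'.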


