[fpc-pascal] Unicode chars losing information

Sun Mar 7 18:48:56 CET 2021

On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote:
>
>> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>>
>> Yes it is. And there are about 1114000 unicode codepoints, or about 17 times what fits in a 2-byte wide char.
>>
>> https://en.wikipedia.org/wiki/Code_point
>>
>> https://en.wikipedia.org/wiki/UTF-16
> I thought unicode strings "just worked" but maybe that's UTF-8 and the character I want is maybe UTF-16. What are you supposed to do then? UnicodeString knows how to print the full string so all the data is there but I can't index to get characters unless I know their size.

It depends on what you mean by "just working". UnicodeString is a
UTF-16 encoded string and a WideChar is just a single UTF-16 code unit.
Both UTF-8 and UTF-16 are variable-length encodings; UTF-16 is just
simpler to decode. Note also that, even though a single Unicode code
point might need two UTF-16 code units (i.e. two WideChars, a surrogate
pair), a single code point is still not enough to represent what users
perceive as a character, because there are also plenty of Unicode
combining characters. What most users perceive as a character is called
an Extended Grapheme Cluster, and it is a sequence of one or more
Unicode code points. There's an algorithm (an enumerator) that splits a
string into grapheme clusters, and that's implemented in FPC trunk in
the GraphemeBreakProperty unit. It implements this algorithm:

http://www.unicode.org/reports/tr29/
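
To make the difference between code units and code points concrete,
here is a minimal sketch. It is not the code from the
GraphemeBreakProperty unit and it does not do grapheme cluster
splitting; CountCodePoints is a hypothetical helper written only for
this illustration, and it assumes the default short-circuit boolean
evaluation.

```pascal
program CodeUnitsVsCodePoints;
{$mode objfpc}{$H+}

{ Hypothetical helper: counts Unicode code points in a UTF-16 string
  by treating each valid high+low surrogate pair as one code point. }
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (I < Length(S)) and
       (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   { surrogate pair: two WideChars, one code point }
    else
      Inc(I);     { BMP code point (or an unpaired surrogate) }
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  { U+1F600 encoded as the surrogate pair D83D DE00, followed by
    'e' (U+0065) and U+0301 (combining acute accent). }
  SetLength(S, 4);
  S[1] := WideChar($D83D);
  S[2] := WideChar($DE00);
  S[3] := WideChar($0065);
  S[4] := WideChar($0301);

  WriteLn('UTF-16 code units: ', Length(S));
  WriteLn('Code points:       ', CountCodePoints(S));
end.
```

This should report 4 code units and 3 code points, while a user would
perceive only two characters (the emoji and the accented e). Closing
that last gap is what the grapheme cluster algorithm is for.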

This was done by me for the Unicode Free Vision port in the unicodekvm
SVN branch, but it has already been committed to trunk (the rest of the
Unicode Free Vision port still hasn't been), because it's a new unit
that is relatively self-contained and provides new functionality (so it
won't break existing code) that wasn't provided by the RTL before.

Note that most programs don't actually need to split a string into
grapheme clusters, unless they implement something like a UI toolkit or
a text editor. That's why it was needed for the Unicode Free Vision
port.

Nikolay