[fpc-pascal] Unicode chars losing information
Nikolay Nikolov
nickysn at gmail.com
Sun Mar 7 18:48:56 CET 2021
On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote:
>
>> On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>>
>> Yes it is. And there are about 1114000 unicode codepoints, or about 17 times what fits in a 2-byte wide char.
>>
>> https://en.wikipedia.org/wiki/Code_point
>>
>> https://en.wikipedia.org/wiki/UTF-16
> I thought unicode strings "just worked" but maybe that's UTF-8 and the character I want is maybe UTF-16. What are you supposed to do then? UnicodeString knows how to print the full string so all the data is there but I can't index to get characters unless I know their size.
It depends on what you mean by "just working". UnicodeString is a
UTF-16 encoded string and a WideChar is just a UTF-16 code unit. Both
UTF-8 and UTF-16 are variable length encodings; UTF-16 is just
simpler to decode. Note also that, even though a single Unicode codepoint
might need two UTF-16 code units (i.e. WideChars), that is still not
enough to represent what users perceive as a character. There are also
plenty of Unicode combining characters. What most users perceive as a
character is called an Extended Grapheme Cluster and is, in general,
a sequence of several Unicode code points. There's an algorithm (an
enumerator) that splits a string into grapheme clusters, and that's
implemented in FPC trunk in the GraphemeBreakProperty unit. It
implements this algorithm:
http://www.unicode.org/reports/tr29/
This was done by me for the Unicode Free Vision port in the unicodekvm
SVN branch, but it was already committed to trunk (the rest of the
Unicode Free Vision still isn't), because it's a new unit that is
relatively self-contained and provides new functionality (so, won't
break existing code) that wasn't provided by the RTL before.
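To make the difference between code units and code points concrete, here
is a minimal sketch that uses only plain UnicodeString indexing.
CountCodePoints is a hypothetical helper written for this illustration,
not an RTL function, and it deliberately does not use the
GraphemeBreakProperty unit. It assumes the default short-circuit boolean
evaluation ({$B-}) and counts an unpaired surrogate as one code point.

```pascal
program CodePointDemo;

{$mode objfpc}{$H+}

{ Hypothetical helper: counts Unicode code points in a UTF-16 string.
  A high surrogate ($D800..$DBFF) followed by a low surrogate
  ($DC00..$DFFF) is counted as a single code point. }
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)  { surrogate pair: two WideChars, one code point }
    else
      Inc(I);    { BMP character or unpaired surrogate }
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  { Build the string from explicit code units to avoid any dependence
    on the source file encoding. }
  SetLength(S, 4);
  S[1] := WideChar($D83D);  { high surrogate of U+1F600 }
  S[2] := WideChar($DE00);  { low surrogate of U+1F600 }
  S[3] := WideChar($0065);  { 'e' }
  S[4] := WideChar($0301);  { combining acute accent }

  WriteLn('UTF-16 code units: ', Length(S));          { 4 }
  WriteLn('Code points:       ', CountCodePoints(S)); { 3 }
end.
```

Length reports 4 WideChars and the helper reports 3 code points, yet a
user would see only 2 characters here, because 'e' followed by the
combining accent forms one grapheme cluster. Closing that last gap is
what the TR29 segmentation algorithm is for.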
Note that normally, most programs wouldn't actually need to split a
string into grapheme clusters, unless they implement something like a UI
toolkit or a text editor or something of that sort. That's why it was
needed for the Unicode Free Vision.
Nikolay