[fpc-pascal] Parse unicode scalar

José Mejuto joshyfun at gmail.com
Mon Jul 3 14:41:31 CEST 2023


On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:
> 
> Right now I've just read the file into an AnsiString and indexing assuming a fixed character size, which breaks of course if non-1 byte characters exist
> 
>   I also need to know if I come across something like \u1F496 I need to convert that to a unicode character.
> 

Hello,

You are mixing up several concepts: ASCII, Unicode, graphemes, 
representation, content, etc.

When talking about Unicode you must forget ASCII. The text is a sequence 
of bytes encoded in a specific format (UTF-8, UTF-16, UTF-32, ...), and 
it must be displayed on screen using Unicode rendering rules, which are 
not the same as those of ASCII.

Just to keep this message fairly short, think of a text with only one 
"letter":

"á"

This text (a text, not a letter; Unicode is about texts) can be 
transmitted or stored using any of the Unicode encodings, each of which 
has its own rules for turning the information into a sequence of bytes. 
In hexadecimal:

UTF8: C3 A1
UTF16LE: E1 00
UTF16BE: 00 E1
UTF32BE: 00 00 00 E1
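
As a quick check of these byte values, here is a minimal sketch in 
Python (chosen only because it is short; the bytes are the same in any 
language). The helper name hex_bytes is made up for this example. It 
also shows how the byte order changes the UTF-16 and UTF-32 output:

```python
# U+00E1 LATIN SMALL LETTER A WITH ACUTE, written as an escape so the
# encoding of the source file itself cannot affect the result.
text = "\u00e1"

def hex_bytes(s: str, encoding: str) -> str:
    """Encode s and return its bytes as space-separated uppercase hex."""
    return s.encode(encoding).hex(" ").upper()

for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{enc:10} {hex_bytes(text, enc)}")
# utf-8      C3 A1
# utf-16-le  E1 00
# utf-16-be  00 E1
# utf-32-le  E1 00 00 00
# utf-32-be  00 00 00 E1
```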

You must know the encoding in advance to get the text back from the 
byte sequence. There is also the BOM (Byte Order Mark), which is 
sometimes placed at the start of a file as a header to indicate the 
encoding, but in general it is not used.
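
A minimal sketch of BOM sniffing, again in Python. The function name 
sniff_bom and the fallback to UTF-8 when no BOM is present are 
assumptions made for this example, not a rule; without a BOM the 
encoding really has to be known by other means:

```python
import codecs

# Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the same
# two bytes as the UTF-16LE BOM (FF FE), so UTF-32 is tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload without BOM); use default if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data  # no BOM: this is only a guess

encoding, payload = sniff_bom(b"\xff\xfe\xe1\x00")
print(encoding, payload.decode(encoding) == "\u00e1")  # utf-16-le True
```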

Decoding that sequence of bytes with the right encoding gives you a 
text which represents the letter "a" with an acute accent. But Unicode 
is *not* that *simple*: the same text can also be shown on screen using 
the letter "a" followed by "combining acute accent", and the byte 
sequence is then totally different. It is different at the encoding 
level but identical at the rendering level. So these two UTF-8 sequences:

"C3 A1" and "61 CC 81"

differ at the code point and encoding levels but are identical at the 
rendering level (Unicode treats them as canonically equivalent).
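
A short Python sketch of this, using the standard unicodedata module. 
The variable names are made up for the example. The two strings compare 
as different until both are brought to the same normalization form (NFC 
here):

```python
import unicodedata

precomposed = bytes.fromhex("C3 A1").decode("utf-8")    # one code point
decomposed = bytes.fromhex("61 CC 81").decode("utf-8")  # two code points

print([f"U+{ord(c):04X}" for c in precomposed])  # ['U+00E1']
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0061', 'U+0301']

# A plain comparison works on code points, so the strings differ.
print(precomposed == decomposed)  # False

# After normalizing to NFC (composed form) they are equal.
same = unicodedata.normalize("NFC", decomposed) == precomposed
print(same)  # True
```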

Just as a final note, this is the UTF-8 sequence of bytes for one single 
"character" on screen:

F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 
F3 A0 81 BF
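
Decoding those bytes shows what is going on. A minimal Python sketch 
(variable names are made up for the example): the 28 bytes are 7 code 
points, a black flag (U+1F3F4) followed by "tag" characters that mirror 
ASCII letters and a final cancel tag. The tags spell "gbsct", which is 
the emoji tag sequence for the flag of Scotland:

```python
raw = bytes.fromhex(
    "F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 "
    "F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF"
)
text = raw.decode("utf-8")

code_points = [f"U+{ord(c):04X}" for c in text]

# Tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E, so
# subtracting 0xE0000 recovers the letters they stand for.
tag_letters = "".join(
    chr(ord(c) - 0xE0000) for c in text if 0xE0020 <= ord(c) <= 0xE007E
)

print(len(raw), len(text))  # 28 7
print(code_points)
print(tag_letters)          # gbsct
```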

Unicode is far, far from easy.

Have a nice day.

