[fpc-pascal] Parse unicode scalar
joshyfun at gmail.com
Mon Jul 3 14:41:31 CEST 2023
On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:
> Right now I've just read the file into an AnsiString and indexing assuming a fixed character size, which breaks of course if non-1 byte characters exist
> I also need to know if I come across something like \u1F496 I need to convert that to a unicode character.
You are mixing up a lot of concepts: ASCII, Unicode, grapheme,
representation, content, etc.
When talking about Unicode you must forget ASCII. The text is stored as a
sequence of bytes encoded in a specific format (UTF-8, UTF-16, UTF-32,...),
and it must be rendered on screen using Unicode rendering rules,
which are not the same as for ASCII.
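To make the "sequence of bytes in a special format" point concrete, here is a minimal sketch in Python (chosen only because it is compact; the same logic ports directly to Pascal). It splits a UTF-8 byte string into one chunk per code point by looking at the lead byte. The function names are made up for this illustration, and the sketch leans on Python's own decoder to reject malformed input rather than validating every continuation byte itself.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its lead byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII range
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx
    raise ValueError(f"invalid UTF-8 lead byte: 0x{lead:02X}")


def split_code_points(data: bytes) -> list:
    """Split UTF-8 bytes into a list of byte chunks, one per code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        chunk.decode("utf-8")         # raises UnicodeDecodeError if malformed or truncated
        chunks.append(chunk)
        i += n
    return chunks


sample = "a\u00e1\u20ac\U0001F496".encode("utf-8")
for chunk in split_code_points(sample):
    print(chunk.hex(" ").upper(), "-> U+%04X" % ord(chunk.decode("utf-8")))
```

Indexing the raw bytes as if every character were one byte wide only works for the first chunk here; the other three occupy 2, 3 and 4 bytes.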
Just to keep this message fairly short, think of a text with only one
letter, "á" (code point U+00E1).
This text (text, not one letter; Unicode is about texts) can be
transmitted or stored using Unicode encoding rules, which turn it into a
sequence of bytes, each encoding with its own rules. For this text the
bytes are:
UTF8: C3 A1
UTF16BE: 00 E1 (UTF16LE: E1 00)
UTF32BE: 00 00 00 E1 (UTF32LE: E1 00 00 00)
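The byte values can be checked with a few lines of code. This is only an illustrative sketch in Python; the helper name `hex_bytes` is made up here, and the codec names are the ones Python's standard library uses. Note how the byte order (big endian or little endian) changes the result for UTF-16 and UTF-32.

```python
text = "\u00e1"  # one code point: LATIN SMALL LETTER A WITH ACUTE


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

# Encode the same one-letter text with every encoding and keep the hex dump.
table = {name: hex_bytes(text.encode(name)) for name in encodings}

for name, dump in table.items():
    print(f"{name:10} {dump}")
```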
You must know the encoding in advance to recover the text from the
byte sequence. There is also a BOM (Byte Order Mark), which is sometimes
used in files as a header to indicate the encoding, but it is optional,
so you cannot rely on it being present.
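A BOM check could look roughly like the sketch below (Python, for illustration only). The function names and the UTF-8 fallback are assumptions of this example, not a rule: when no BOM is found, the encoding still has to come from somewhere else. The BOM constants are the ones defined in Python's standard `codecs` module.

```python
import codecs

# Longer BOMs are tested first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so the order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload without BOM); fall back to `default`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data  # no BOM: the default is only a guess


def decode_bytes(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes using the BOM if present, else the assumed default."""
    encoding, payload = sniff_bom(data, default)
    return payload.decode(encoding)


print(decode_bytes(b"\xff\xfe\xe1\x00") == "\u00e1")
```

Even this is not bulletproof: a UTF-16LE file whose first character after the BOM is U+0000 would be taken for UTF-32LE, which is one more reason to know the encoding beforehand.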
Now, decoding that sequence of bytes with the right decoder, you
get a text which represents the letter "a" with an acute accent. But
Unicode is *not* so *simple*: the same text could also be represented
as letter "a" + "combining acute accent", and then the byte sequence is
totally different; different at the encoding level but identical at the
rendering level. So these two UTF8 sequences:
"C3 A1" and "61 CC 81"
are different at the code point level and the encoding level but identical
at the rendering level.
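Comparing the two sequences in code shows the problem and the usual remedy, Unicode normalization. A small sketch in Python, using the standard `unicodedata` module; the variable names are just for this example.

```python
import unicodedata

precomposed = bytes.fromhex("C3 A1").decode("utf-8")    # U+00E1
decomposed = bytes.fromhex("61 CC 81").decode("utf-8")  # U+0061 U+0301

# Different code points, so a plain comparison says "not equal".
print([f"U+{ord(c):04X}" for c in precomposed])
print([f"U+{ord(c):04X}" for c in decomposed])
print(precomposed == decomposed)

# Normalizing both sides to the same form makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)
```

So a parser that compares or searches text usually has to pick one normalization form (NFC or NFD) and apply it to everything first.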
Just as a final note, this is the UTF-8 sequence of bytes for one single
"character" on screen:
F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4
F3 A0 81 BF
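Decoding those bytes shows what is inside. The sketch below (Python, illustration only) lists the code points; the sequence turns out to be a black flag (U+1F3F4) followed by invisible "tag" characters spelling a region code and a final cancel tag, which is how the flag of Scotland emoji is encoded. The names printed come from whatever Unicode database ships with the Python in use.

```python
import unicodedata

raw = bytes.fromhex(
    "F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 "
    "F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF"
)
text = raw.decode("utf-8")

print(len(raw), "bytes,", len(text), "code points")
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# Tag characters mirror ASCII at an offset of 0xE0000; strip the flag at
# the start and the cancel tag at the end to read the region code.
region = "".join(chr(ord(ch) - 0xE0000) for ch in text[1:-1])
print(region)
```

One visible "character", seven code points, 28 bytes: counting bytes or even code points does not tell you how many characters the user sees.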
Unicode is far, far from easy.
Have a nice day.