[fpc-pascal] Parse unicode scalar

Nikolay Nikolov nickysn at gmail.com
Sun Jul 2 21:15:34 CEST 2023


On 7/2/23 20:38, Martin Frb via fpc-pascal wrote:
> On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote:
>> On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
>>> I'm interested in parsing unicode scalars (I think they're called) 
>>> to byte sized values but I'm not sure where to start. First thing I 
>>> did was choose the unicode scalar U+1F496 (💖).
>>
>> There's no such thing as "unicode scalar" in Unicode terminology:
>>
>> https://unicode.org/glossary/
> There seems to be
> https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G7404
Too bad it's not included in the Unicode glossary. :( So, it's basically 
a Unicode code point that is not a high-surrogate or low-surrogate, i.e. 
anything outside the range U+D800..U+DFFF. And if you want to know what 
"high-surrogate" and "low-surrogate" mean, you should read about UTF-16.
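
To make that definition concrete, here is a minimal sketch in Python (chosen only for brevity; the same arithmetic ports directly to Pascal). The function names is_scalar_value and utf16_surrogates are made up for this illustration and do not come from any library.

```python
def is_scalar_value(cp):
    """True if cp is a code point that is not a surrogate."""
    # Valid code points are 0..0x10FFFF; surrogates occupy 0xD800..0xDFFF.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                  # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low


print(is_scalar_value(0x1F496))                      # True
print(is_scalar_value(0xD83D))                       # False
print([hex(u) for u in utf16_surrogates(0x1F496)])   # ['0xd83d', '0xdc96']
```

So U+1F496 is itself a scalar value, while the two 16-bit units that UTF-16 uses to store it (D83D and DC96) are surrogates and are not.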
>
>
>>
>>>
>>> Next I cheated and ask ChatGPT. :) Amazingly from my question it was 
>>> able to tell me the scaler is comprised of these 4 bytes:
>>>
>>>   240 159 146 150
>
> That is an utf-8 encoded representation of such a value.
>
> You can find them on https://www.compart.com/en/unicode/U+0041
> (using the hex for whatever codepoint interests you)

Or just learn about Unicode encodings, such as UTF-8, UTF-16, etc.

https://en.wikipedia.org/wiki/UTF-8

https://en.wikipedia.org/wiki/UTF-16

https://en.wikipedia.org/wiki/UTF-32

Both UTF-8 and UTF-16 are frequently used and are important to know. 
UTF-32 is rarely used, but is very simple and easy to understand: every 
code point is stored as a single 32-bit unit. That makes it wasteful for 
most text, hence its rarity. :)
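
For the four bytes quoted earlier in the thread (240 159 146 150), a rough sketch of the UTF-8 bit layout may help. It is written in Python for brevity, and the helper name utf8_encode is hypothetical, not a library routine. It is meant as an illustration of the encoding rules, not as a replacement for whatever conversion functions your RTL already provides.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as a list of UTF-8 byte values."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                     # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


print(utf8_encode(0x1F496))           # [240, 159, 146, 150]

# Cross-check against Python's built-in encoder.
assert bytes(utf8_encode(0x1F496)) == chr(0x1F496).encode("utf-8")
```

For this particular character all three encodings happen to take 4 bytes: four 8-bit units in UTF-8, two 16-bit units in UTF-16, and one 32-bit unit in UTF-32. For plain ASCII text the difference is 1, 2 and 4 bytes per character respectively.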

Nikolay

