[fpc-devel] Memory consumed by strings

Sun Nov 23 15:02:16 CET 2008

On 2008-11-23 14:49, Daniël Mantione wrote:
>
> Op Sun, 23 Nov 2008, schreef Jonas Maebe:
>>
>> On 23 Nov 2008, at 13:31, Daniël Mantione wrote:
>>
>>> For an IDE, this is a little bit more complicated. I.e. searching for
>>> a ç in a source file needs to find both the composed and the
>>> decomposed variant, and in the case of UTF-8, this character can be
>>> encoded in 1, 2, 3 or 4 bytes which all need to be found. This is
>>> where UTF-16 and UTF-32 start to make sense.
>>
>> Characters can also be decomposed in UTF-16 and in UTF-32 (for the
>> same reasons as in UTF-8).
>
> I am aware of that, but the combining cedilla is not in the "easy to
> process range" of UTF-8. In other words, you cannot do
> "if char[i]=combining_cedilla" in UTF-8.
>
> Instead, in UTF-8, you need to make sure the string has enough
> characters left, and then compare multiple characters. Heck, you even
> need to take care of the fact that the combining cedilla can be
> encoded in 2, 3 or 4 bytes.

This is one of the million and one small details that one has to keep in 
mind while programming.
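
To make the check described above concrete, here is a minimal sketch.
It assumes well-formed UTF-8, where the combining cedilla (U+0327) is
the two bytes $CC $A7. The function name and the byte-array parameter
are my own invention for illustration, not anything from the RTL.

```pascal
program CombiningCedillaCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: True if the UTF-8 encoding of U+0327
// (bytes $CC $A7, assuming well-formed input) starts at Index.
function IsCombiningCedillaAt(const Buf: array of Byte;
  Index: Integer): Boolean;
begin
  // First make sure two bytes are left, then compare both of them.
  // Relies on the default short-circuit Boolean evaluation ({$B-}).
  Result := (Index >= 0) and (Index + 1 < Length(Buf))
            and (Buf[Index] = $CC) and (Buf[Index + 1] = $A7);
end;

begin
  // "c" followed by U+0327, i.e. a decomposed c-cedilla
  WriteLn(IsCombiningCedillaAt([$63, $CC, $A7], 1)); // bytes 1..2 match
  WriteLn(IsCombiningCedillaAt([$63, $CC, $A7], 2)); // only one byte left
end.
```

With one array element per code point, the same test would shrink to
a single comparison against $0327.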

What I think would be more sensible is, instead of dealing with all
these variable sizes, to simply use 4-byte-per-char strings and compose
(in the Unicode sense) everything into that string.

You do this once, when importing/loading text into your app. From then
on, everything is just like the good old string, except that it is a
4-byte-per-char string instead of a 1-byte one.
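
As a rough sketch of the "convert once at import" step, something like
the following might do. The type name TFourByteString and the function
DecodeUtf8 are hypothetical. A plain dynamic array stands in for the
string type here: dynamic arrays are reference counted, but they are
not copy-on-write the way String is. The sketch assumes the input is
valid UTF-8 and only decodes it; it does not perform the composition
(normalization) step.

```pascal
program Utf8ToFourByteSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical type: one 32-bit element per code point.
  TFourByteString = array of LongWord;

// Decode UTF-8 bytes into one element per code point.
// Assumption: Src is well-formed UTF-8; malformed input is not detected.
// No Unicode composition (normalization) is done here.
function DecodeUtf8(const Src: array of Byte): TFourByteString;
var
  i, n, Extra: Integer;
  cp: LongWord;
begin
  Result := nil;
  SetLength(Result, Length(Src)); // upper bound: one code point per byte
  i := 0;
  n := 0;
  while i < Length(Src) do
  begin
    // The lead byte tells how many continuation bytes follow.
    if Src[i] < $80 then
      begin cp := Src[i]; Extra := 0; end
    else if Src[i] < $E0 then
      begin cp := Src[i] and $1F; Extra := 1; end
    else if Src[i] < $F0 then
      begin cp := Src[i] and $0F; Extra := 2; end
    else
      begin cp := Src[i] and $07; Extra := 3; end;
    Inc(i);
    // Each continuation byte contributes its low 6 bits.
    while (Extra > 0) and (i < Length(Src)) do
    begin
      cp := (cp shl 6) or (Src[i] and $3F);
      Inc(i);
      Dec(Extra);
    end;
    Result[n] := cp;
    Inc(n);
  end;
  SetLength(Result, n); // trim to the number of code points decoded
end;

var
  Chars: TFourByteString;
  k: Integer;
begin
  // UTF-8 bytes for a precomposed c-cedilla (U+00E7) followed by "a"
  Chars := DecodeUtf8([$C3, $A7, $61]);
  for k := 0 to High(Chars) do
    if Chars[k] = $E7 then            // plain per-element comparison
      WriteLn('c-cedilla at index ', k);
end.
```

After this step a search is an ordinary loop over array elements, at
the cost of four bytes per character.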

Now, my question is this: How would I create a 'FourByteString' type, 
reference counted etc. just like the usual 'String'?

How hard is it?

Can someone like me, who does not speak assembler, do it?

If so, where do I begin copy&pasting from 'string'?