[fpc-pascal] Unicode file routines proposal
Mattias Gaertner
nc-gaertnma at netcologne.de
Tue Jul 1 09:56:29 CEST 2008
On Tue, 01 Jul 2008 09:35:35 +0200
Luca Olivetti <luca at ventoso.org> wrote:
> Marco van de Voort wrote:
> >>> They have a UTF-16/UCS-2 internal representation, same as MSEgui
> >>> which works very well and is fast and handy BTW.
> >> And len, slicing, etc. work as expected.
> >> Note that if you need characters beyond $ffff you have to compile
> >> it with wide unicode support, and in that case every character
> >> will use 4 bytes.
> >>
> > That's IMHO a faulty system. It requires you to choose between an
> > incomplete solution or making strings a horrible memory hog.
>
> OTOH using variable length characters will make string operations
> expensive (since you can't just multiply the index by 2 or 4 but you
> have to examine the string from the beginning, and the length in
> bytes isn't the same as the length in characters).
It's amazing that this argument comes up again and again. But I know
hardly any code that needs this index-to-char mapping. And the code
that does need it is seldom time-critical.
(I must admit, I feared the same some years ago. But the extra cost is
practically a myth.)
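
To illustrate, here is a minimal Python sketch that treats a bytes
object as a UTF-8 buffer. The helper name byte_offset_of_char is
hypothetical, and the code assumes valid UTF-8 input. It shows the
linear scan that an index-to-char mapping needs, and that a common
operation such as substring search works on bytes without that mapping.

```python
def byte_offset_of_char(buf: bytes, char_index: int) -> int:
    """Return the byte offset where code point number char_index (0-based) starts.

    Hypothetical helper for illustration; assumes buf holds valid UTF-8.
    """
    seen = 0
    for offset, b in enumerate(buf):
        # Bytes of the form 10xxxxxx are continuation bytes;
        # every other byte starts a new code point.
        if (b & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("char_index out of range")


text = "naïve €5".encode("utf-8")

# The index-to-char mapping has to walk the buffer from the start: O(n).
offset_of_v = byte_offset_of_char(text, 3)   # 'v' starts at byte 4
offset_of_5 = byte_offset_of_char(text, 7)   # '5' starts at byte 10

# Substring search needs no such mapping: it compares bytes and
# returns a byte offset that can be used directly for slicing.
euro_at = text.find("€".encode("utf-8"))     # byte 7
tail = text[euro_at:].decode("utf-8")        # "€5"
```

The scan is only needed when a caller really asks for "the n-th
character". Searching, splitting and concatenating stay on byte offsets.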
> > But maybe that doesn't
> > matter for mere scripting languages (though I wonder then why they
> > didn't choose UTF-32 directly)
> >
> > Surrogates are not nice, but they were invented for a reason.
>
> Well, yes, they're a trade-off between performance and memory
> consumption, but I fear we're losing one of the advantages that
> Pascal has over C: fast and simple string handling.
Most code only needs the number of bytes, and getting that still costs
O(1) in Pascal.
In fact, if a UTF8String or UTF16String were added, I would say it
would be a waste of memory to store an extra PtrInt for the number of
characters.
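
A small Python sketch of the same point, again using a bytes object as
a stand-in for a length-prefixed UTF-8 string. The function name
count_code_points is hypothetical, and valid UTF-8 input is assumed.
The byte length is stored with the buffer and is enough for copying,
while the number of code points can be computed on demand in the rare
cases where it is wanted.

```python
def count_code_points(buf: bytes) -> int:
    """Count code points by counting every byte that is not a continuation byte.

    Hypothetical helper for illustration; assumes buf holds valid UTF-8.
    """
    return sum(1 for b in buf if (b & 0xC0) != 0x80)


s = "Grüße".encode("utf-8")

# The byte length is kept alongside the data, so this is O(1).
byte_len = len(s)                  # 7: 'ü' and 'ß' take two bytes each

# Allocating and copying only needs the byte length.
copy = bytearray(byte_len)
copy[:] = s

# The code point count is an O(n) scan, done only when asked for.
char_len = count_code_points(s)    # 5
```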
Mattias