[fpc-devel] String and UnicodeString and UTF8String
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Wed Jan 12 13:38:14 CET 2011
Jeff Wormsley wrote:
> On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
>>
>> UTF-8 combines a single (byte-based) storage type with lossless
>> encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look*
>> easier to handle in user code, but both will fail and require special
>> code whenever characters outside the assumed codepage may occur.
>
> Preface: I don't write international apps, and probably won't for the
> foreseeable future...
Then you may be bound to some legacy compiler version when string
handling changes at some future time, as happened to Delphi users.
Continued support of the AnsiString type(s) is not enough, because
legacy code can be broken by (possibly) required changes to "set of
char", sizeof(char) and PChar, sizeof(string) as opposed to
Length(string), upper/lower conversion, and many more not so obvious
consequences.
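As a rough illustration of two of those consequences (length versus
storage size, and upper/lower conversion), here is a small sketch. It
is written in Python only because Unicode strings are built in there;
it is not FPC or Delphi code, and the helper name `describe` is made up
for this example.

```python
def describe(s: str) -> dict:
    """Report the logical length and two physical sizes of a string."""
    return {
        "codepoints": len(s),                        # logical characters
        "utf8_bytes": len(s.encode("utf-8")),        # physical size as UTF-8
        "utf16_bytes": len(s.encode("utf-16-le")),   # physical size as UTF-16
    }

before = "stra\u00dfe"   # "strasse" spelled with a sharp s (U+00DF)
after = before.upper()   # full Unicode case mapping turns U+00DF into "SS"

print(describe(before))
print(describe(after))
```

Code that assumes upper-casing keeps the length, or that the byte count
equals the character count, goes wrong on exactly this kind of input.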
> Isn't all of this concentration on trying to make strings have single
> byte characters (who cares how they are encoded), using the argument
> that it is somehow faster, incorrect for just about any modern
> processor, including embedded CPU's such as ARM? It was my
> understanding that 32 bit aligned access was always faster than byte
> aligned access on just about any CPU FPC still supports.
See Marco's comment about data size etc.
> The argument holds just fine for memory, but I don't really get the
> speed argument. Maybe I'm missing something.
FPC (the compiler) still uses ShortStrings wherever possible, because
that was found to be the most efficient string representation. This is
partially due to the ASCII encoding of source code, except for string
literals. But like you, I'm not sure that this argument still holds on
modern hardware.
Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)
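To make the last point concrete, here is a minimal sketch of what
"give me character number N" costs on UTF-8 data. It is plain Python
for illustration, not RTL code; the function name `codepoint_at` is
hypothetical, and the input is assumed to be valid UTF-8.

```python
def codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th codepoint of valid UTF-8 data by linear scan."""
    if index < 0:
        raise IndexError("negative index")
    seen = -1      # number of the codepoint we are currently inside
    start = None   # byte offset where the wanted codepoint begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; every other byte
        # starts a new codepoint.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("codepoint index out of range")
    return data[start:].decode("utf-8")

# a, e-acute, euro sign, an emoji, z
text = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
print(len(text))              # physical size in bytes
print(codepoint_at(text, 3))  # must walk over all preceding bytes first
```

With a fixed-size element type the same lookup is a single address
computation; here every lookup is a scan from the start of the string.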
[1] All encodings of variable "character" size discourage indexed access
to strings. Then "char" can have multiple meanings, representing either
the (physical) string/array *element* size or the (logical) size of a
*codepoint*. Until now most users, including you, most probably haven't
realized that difference between physical and logical characters, and
assume that sizeof(char) always is 1, and possibly that
sizeof(WideChar) is 2. IMO variables of type "char" should have at
least 3 (better 4) bytes in a Unicode environment, in order to maintain
the correspondence between physical and logical characters. As already
suggested, the "packed" keyword could be applied to strings and char
arrays, to clearly signal to the user that indexed access should not
be used with such variables, unless a speed penalty is acceptable.
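The element-versus-codepoint distinction, and the "3 (better 4) bytes"
figure, can be checked with a few lines. Again this is only a Python
sketch for illustration; `encoded_sizes` is a made-up helper, and the
sample characters are my own choice.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed to store one codepoint in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# A, e-acute, euro sign, an emoji outside the 16-bit range
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# The highest codepoint is U+10FFFF, so a fixed-size "char" needs this
# many bits (which fits in 3 bytes, while 4 bytes keeps it aligned).
print((0x10FFFF).bit_length())
```

Only the 4-byte form uses one element per codepoint for all samples; in
UTF-8 and UTF-16 the element count and the character count diverge as
soon as the text leaves the range the code silently assumed.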
DoDi