[fpc-devel] String and UnicodeString and UTF8String

Wed Jan 12 13:38:14 CET 2011

Jeff Wormsley wrote:
> On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
>>
>> UTF-8 combines a single (byte-based) storage type with lossless 
>> encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* 
>> easier to handle in user code, but both will fail and require special 
>> code whenever characters outside the assumed codepage may occur.
> 
> Preface: I don't write international apps, and probably won't for the 
> foreseeable future...

Then you may be bound to some legacy compiler version once string 
handling changes at some point in the future, as happened to Delphi 
users. Continued support of the AnsiString type(s) is not enough, because 
legacy code can be broken by (possibly) required changes to "set of 
char", sizeof(char) and PChar, sizeof(string) as opposed to 
Length(string), upper/lower conversion, and many more not so obvious 
consequences.
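
To make the upper/lower point concrete, here is a minimal sketch in 
Python (not FPC code; the variable names are only illustrative). It 
shows that case conversion is neither a per-byte operation nor 
guaranteed to preserve the string length, which is the kind of 
assumption legacy code tends to make:

```python
# Case conversion on Unicode text, compared with a byte-wise approach.

word = "stra\u00dfe"           # "strasse" written with a sharp s, 6 codepoints
upper = word.upper()           # full case mapping expands the sharp s to "SS"

e_acute = "\u00e9"             # small e with acute accent
e_utf8 = e_acute.encode("utf-8")   # two bytes: 0xC3 0xA9

# bytes.upper() only touches ASCII letters, so the UTF-8 sequence is
# left unchanged, while the codepoint-aware str.upper() converts it.
bytewise_upper = e_utf8.upper()
logical_upper = e_acute.upper()

print(len(word), len(upper))             # length changes under upper()
print(bytewise_upper == e_utf8)          # byte-wise conversion did nothing
print(logical_upper == "\u00c9")         # codepoint-aware conversion worked
```

Code that reserves Length(s) elements for the uppercased result, or 
that converts element by element, fails on input like this.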

> Isn't all of this concentration on trying to make strings have single 
> byte characters (who cares how they are encoded), using the argument 
> that it is somehow faster, incorrect for just about any modern 
> processor, including embedded CPU's such as ARM?  It was my 
> understanding that 32 bit aligned access was always faster than byte 
> aligned access on just about any CPU FPC still supports.

See Marco's comment about data size etc.

> The argument holds just fine for memory, but I don't really get the 
> speed argument.  Maybe I'm missing something.

FPC (the compiler) still uses ShortStrings wherever possible, because 
that was found to be the most efficient string representation. This is 
partially due to the ASCII encoding of source code, except for string 
literals. But like you, I'm not sure that this argument still holds on 
modern hardware.

Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)
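
The "total byte count" item can be illustrated with a small sketch, 
here in Python rather than Pascal. The helper name encoded_sizes is 
made up for this example; it only reports how many bytes the same text 
occupies in each encoding:

```python
def encoded_sizes(text):
    """Return the codepoint count and the byte count per encoding."""
    return {
        "codepoints": len(text),
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants avoid counting a byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Pure ASCII, as in most source code outside string literals.
print(encoded_sizes("begin end"))
# Text with two non-ASCII letters.
print(encoded_sizes("Gr\u00fc\u00dfe"))
# A single codepoint outside the Basic Multilingual Plane.
print(encoded_sizes("\U0001F600"))
```

For ASCII-only text, UTF-8 needs one byte per character, UTF-16 two 
and UTF-32 four, so the amount of data to be moved differs by a factor 
of up to four.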

[1] All encodings of variable "character" size discourage indexed access 
to strings. Then "char" can have multiple meanings, representing either 
the (physical) string/array *element* size, or the (logical) size of a 
*codepoint*. Until now most users, probably including you, don't realize 
the difference between physical and logical characters, and assume that 
sizeof(char) is always 1, and possibly that sizeof(WideChar) is always 2. 
IMO variables of type "char" should have at least 3 (better 4) bytes in 
a Unicode environment, in order to maintain the correspondence between 
physical and logical characters. As already suggested, the "packed" 
keyword could be applied to strings and char arrays, to signal clearly 
to the user that indexed access should not be used with such variables, 
unless a speed penalty is acceptable.
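
The difference between physical elements and logical codepoints, and 
the speed penalty of indexed access, can be sketched as follows. This 
is Python, not FPC code, and codepoint_at is a hypothetical helper 
written for this mail. It assumes well-formed UTF-8 input and does not 
validate continuation bytes:

```python
def codepoint_at(data, index):
    """Return the index-th codepoint of UTF-8 bytes.

    The string has to be scanned from the start, so the cost grows
    with the index instead of being constant as for array access.
    """
    pos = 0    # physical position (byte offset)
    seen = 0   # logical position (codepoints passed so far)
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("codepoint index out of range")

text = "a\u00e9\u20ac\U0001F600z"   # codepoints of 1, 2, 3, 4 and 1 bytes
data = text.encode("utf-8")

print(len(text), len(data))         # logical versus physical length
print(hex(data[1]))                 # physical index 1: only a lead byte
print(codepoint_at(data, 1))        # logical index 1: the whole codepoint

# UTF-16 has the same problem outside the BMP: one codepoint, two elements.
utf16_elements = len("\U0001F600".encode("utf-16-le")) // 2
print(utf16_elements)
```

With 4 byte elements the physical and logical index coincide, which is 
the correspondence argued for above.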

DoDi