[fpc-devel] Re: enumerators

Hans-Peter Diettrich DrDiettrich1 at aol.com
Thu Nov 18 00:33:21 CET 2010


Marco van de Voort wrote:

> It's a user's own choice not to be Unicode compliant in his apps (e.g. if he
> knows he will never enter the East Asian market etc.), but a runtime should
> be as Unicode compliant as reasonably possible.

IMO there exist levels of compliance.

The bottom level supplies storage facilities for Unicode. Strings are
only stored and processed as a whole, never analyzed or modified. This
level supports e.g. text display and storage in databases. A single
internal Unicode representation is sufficient, e.g. UTF-16 on Windows
and UTF-8 elsewhere.

The next level adds dedicated string handling for everyday use, such as
splitting and composing filenames; this is where basic iteration and
character classification support enters the scene, mostly for internal
use. Separator characters can be assumed to be ASCII, so that they can
be found by a dumb byte/char scan; only a few encodings have to be
recognized and handled, based on the char size: MBCS (UTF-8...),
WideChars (UTF-16/UCS2) and UTF-32.

The levels mentioned so far can IMO be considered part of the RTL.
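
To illustrate the "dumb byte/char scan" for ASCII separators, here is a
minimal sketch in Python (not FPC code). The helper names are
hypothetical and only serve the illustration. It relies on two known
properties: in UTF-8 every byte of a multi-byte sequence is >= 0x80, and
in UTF-16 surrogate code units lie in 0xD800..0xDFFF, so an ASCII value
can only match a real separator as long as the scan works on whole
units of the encoding's char size.

```python
from typing import List


def split_utf8_path(raw: bytes, sep: bytes = b"/") -> List[bytes]:
    """Split UTF-8 encoded bytes on an ASCII separator without decoding."""
    # Bytes of multi-byte UTF-8 sequences are all >= 0x80, so a byte
    # below 0x80 can never be part of another character.
    if len(sep) != 1 or sep[0] >= 0x80:
        raise ValueError("separator must be a single ASCII byte")
    return raw.split(sep)


def utf16_units(text: str) -> List[int]:
    """Return the 16-bit code units of text (little-endian UTF-16)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def find_sep_utf16(units: List[int], sep: str = "/") -> List[int]:
    """Indices of an ASCII separator in a list of UTF-16 code units."""
    # Whole code units are compared; surrogates (0xD800..0xDFFF) can
    # never equal an ASCII value. Scanning single BYTES of UTF-16 data
    # would be wrong, since e.g. U+2F00 contains the byte 0x2F.
    target = ord(sep)
    return [i for i, unit in enumerate(units) if unit == target]


parts = split_utf8_path("home/\u00e9t\u00e9/\u4e2d\u6587.txt".encode("utf-8"))
names = [p.decode("utf-8") for p in parts]
```

The scan has to operate on units of the char size: one byte for UTF-8,
one 16-bit word for UTF-16, one 32-bit word for UTF-32. That is why only
the char size needs to be known at this level, not the full encoding.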

Next comes codepage-specific handling, usable by coders who are
familiar with the specific language. Here more basic parsing features
are added, e.g. for whitespace, punctuation, words, numbers etc. Also
added are character classification and conversion (upper/lower), which
requires an implementation of Unicode character sets as a replacement
for the restricted 256-element sets. The Unicode BMP can possibly be
treated as one such codepage, so that the many Unicode separators
(spaces, dashes...) can be handled in a uniform way.

This level can be implemented in language/codepage-specific packages,
whose maintenance requires more than just coding skills.
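
As a rough sketch of what a Unicode character set replacing a
256-element set could look like, the following Python snippet builds a
set of all BMP code points classified as separators and uses it to
split text. Which general categories count as separators (Zs for
spaces, Pd for dashes) is an assumption made for this example, and the
function names are hypothetical.

```python
import unicodedata

# Assumption for this sketch: space separators (Zs) and dash
# punctuation (Pd) are the characters treated as word separators.
SEPARATOR_CATEGORIES = {"Zs", "Pd"}


def build_bmp_separator_set():
    """Collect all BMP code points whose category is a separator."""
    return frozenset(
        cp for cp in range(0x10000)
        if unicodedata.category(chr(cp)) in SEPARATOR_CATEGORIES
    )


# Plays the role of a "set of char", but covers the whole BMP.
BMP_SEPARATORS = build_bmp_separator_set()


def split_words(text):
    """Split text at any separator from the set, dropping empty parts."""
    words, current = [], []
    for ch in text:
        if ord(ch) in BMP_SEPARATORS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# EM SPACE (U+2003), EM DASH (U+2014) and a plain space all separate.
words = split_words("foo\u2003bar\u2014baz qux")
```

Membership tests against such a set handle the ASCII space and the many
other Unicode spaces and dashes in the same way. The classification
data itself comes from the Unicode character database, which is one
reason why maintaining this level takes more than coding.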

The top level is text processing, where knowledge of the language is
indispensable and special libraries are required. Only at this level is
string composition and decomposition at the single-character level
required, taking into account ambiguous encodings, ligatures, and the
other hassles introduced by full Unicode. The grammar of a language
becomes important, e.g. for proper spelling and breaking of words.
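
The ambiguous encodings and ligatures mentioned above can be
demonstrated with Unicode normalization. This is only a sketch using
Python's standard unicodedata module; the helper names are hypothetical,
and real text processing needs far more than normalization.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def expand_ligatures(text):
    """Replace compatibility characters such as ligatures (NFKC).

    Note: NFKC folds all compatibility characters, not only ligatures.
    """
    return unicodedata.normalize("NFKC", text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Both render the same, but differ as code point sequences.
differ_raw = precomposed != combining
equal_normalized = same_text(precomposed, combining)

# Canonical decomposition (NFD) yields two code points.
decomposed_len = len(unicodedata.normalize("NFD", precomposed))

# U+FB01 is the single-character ligature "fi".
plain = expand_ligatures("\ufb01le")
```

A plain code point comparison treats the two spellings of the accented
letter as different strings, which is harmless at the lower levels, where
strings are stored or scanned for ASCII separators, but matters as soon
as single characters are compared or edited.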

DoDi



