[fpc-devel] Re: enumerators
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Thu Nov 18 00:33:21 CET 2010
Marco van de Voort wrote:
> It's a user's own choice not to be Unicode compliant in his apps (e.g. if he
> knows he never goes to the East Asian market etc.), but a runtime should
> be as Unicode compliant as reasonably possible.
IMO there exist levels of compliance.
The bottom level supplies storage facilities for Unicode. Strings are
only stored and processed as a whole, never analyzed or modified. This
level supports e.g. text display and storage in databases. A single
internal Unicode representation is sufficient, e.g. UTF-16 for Windows
or UTF-8 elsewhere.
In the next level, dedicated string handling is added for everyday use,
such as splitting and composing filenames; this is where basic
iteration and character classification support enter the scene, mostly
for internal use. Separator characters can be assumed to be ASCII, so
they can be found by a dumb scan over bytes/chars; only a few encodings
have to be recognized and handled, based on the char size: MBCS (UTF-8...),
WideChars (UTF-16/UCS2) and UTF-32.
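
To illustrate the "dumb scan" idea, here is a minimal sketch in Python (chosen only for brevity; it is not RTL code). The function name split_on_ascii and its parameters are hypothetical. It assumes the separator is a single ASCII character and that the caller knows the code unit size (1 for UTF-8, 2 for UTF-16, 4 for UTF-32) and the byte order. The scan compares whole code units, never single bytes of a wider unit.

```python
def split_on_ascii(raw: bytes, sep: str, unit_size: int,
                   byteorder: str = "little") -> list[bytes]:
    """Split encoded text on an ASCII separator by scanning code units.

    unit_size is 1 (UTF-8), 2 (UTF-16) or 4 (UTF-32). No decoding is
    done: an ASCII value never occurs as a code unit inside a multi-unit
    sequence in any of these encodings, so a plain comparison is enough.
    """
    target = ord(sep)
    if target > 0x7F:
        raise ValueError("separator must be ASCII")
    parts = []
    start = 0
    # Step through the buffer one code unit at a time.
    for i in range(0, len(raw), unit_size):
        if int.from_bytes(raw[i:i + unit_size], byteorder) == target:
            parts.append(raw[start:i])
            start = i + unit_size
    parts.append(raw[start:])
    return parts


# Usage: the same routine handles all three char sizes.
utf8_parts = split_on_ascii("home/Jörg/notes.txt".encode("utf-8"), "/", 1)
utf16_parts = split_on_ascii("a/\u012f/b".encode("utf-16-le"), "/", 2)
```

Note that the step size matters for the wide encodings: U+012F is stored in UTF-16LE as the bytes 2F 01, so a byte-wise search for "/" (0x2F) would produce a false hit there, while the unit-wise scan does not.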
The levels mentioned above can IMO be considered part of the RTL.
Next comes codepage-specific handling, usable by coders who are
familiar with the specific language. Here more basic parsing features
are added, e.g. for whitespace, punctuation, words, numbers etc. Also
added are character classification and conversion (upper/lower), which
require an implementation of Unicode character sets as a replacement for
the restricted 256-element sets. The Unicode BMP can possibly be treated
as one such codepage, so that the many Unicode separators (spaces,
dashes...) can be handled in a uniform way.
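
As a rough sketch of what classification by Unicode property (instead of a 256-element set) could look like, the following Python snippet uses the standard unicodedata module. The helper names is_space_or_dash and split_words are hypothetical, and the choice of the general categories Zs (space separators) and Pd (dash punctuation) as "separators" is an assumption made only for this example.

```python
import unicodedata

# General categories treated as separators in this sketch (an assumption):
# Zs = space separator, Pd = dash punctuation.
SEPARATOR_CATEGORIES = ("Zs", "Pd")


def is_space_or_dash(ch: str) -> bool:
    """Classify one character by its Unicode general category."""
    return unicodedata.category(ch) in SEPARATOR_CATEGORIES


def split_words(text: str) -> list[str]:
    """Split text at any Unicode space or dash, dropping empty pieces."""
    words = []
    current = []
    for ch in text:
        if is_space_or_dash(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# Usage: EM SPACE (U+2003) and EM DASH (U+2014) are handled like ASCII
# space and hyphen, without listing every separator by hand.
words = split_words("foo\u2003bar\u2014baz qux-end")
```

The point of the design is that the set membership test is replaced by a property lookup, so all separators of a kind are treated the same way regardless of their code point.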
This level can be implemented in language/codepage specific packages,
whose maintenance requires more than only coding skills.
The top level is text processing, where knowledge of the language is
indispensable and special libraries are required. Only at this level
is string composition and decomposition at the single-character level
required, taking into account ambiguous encodings, ligatures, and the
other hassles introduced by full Unicode. The grammar of a language
becomes important, e.g. for proper spelling and breaking of words.
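
A small sketch of the "ambiguous encodings" and ligature problem, again in Python with the standard unicodedata module. The helper names same_text and expand_ligatures are hypothetical. Using NFC for comparison and NFKC for expanding ligatures is my assumption for this example, not the only possible choice, and real text processing needs far more than this.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC).

    The same visible character may be stored precomposed (U+00E9) or as
    a base letter plus combining mark (U+0065 U+0301); a plain code
    point comparison would treat these as different strings.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature
    (U+FB01) by their plain equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# Usage: two encodings of the same word, and a ligature in a word.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
equal = same_text(precomposed, decomposed)
plain = expand_ligatures("\ufb01le")
```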
DoDi