[fpc-devel] ansistrings and widestrings

Sun Jan 9 12:53:53 CET 2005

> peter green wrote:
>  
> > it should be noted that pascal classes are really not suited to doing
> > strings.
> 
> IMO we should distinguish Strings, as containers, from Text as an
> interpretation of data as, ahem, text of some language, in some
> encoding, possibly with attributes...
> 
> > to do strings with classes you really need language features which fpc
> > doesn't have.
> 
> Please explain?
> 
> > doing strings with non garbage collected heap based classes would make
> > something that was as painful to work with as pchars and that was totally
> > different from any string handling pascal has seen before.
> 
> FPC has reference counted string and array types, so that GC is
> available.

Peter probably means that to make custom string types, you need a way to
define operations and conversions. In Java and C++ this is possible afaik.

In C++ because the string type is a template, in Java because the compiler
manages the classes.
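
To make "operations and conversions" a bit more concrete, here is a minimal
sketch in Python (not Pascal, and not a proposal for FPC). The class name
EncodedString and its methods are made up for illustration; the only point is
that the type itself defines what "+" and "=" mean and converts its operands
on its own.

```python
class EncodedString:
    """Hypothetical custom string type: raw bytes plus the name of their encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    @classmethod
    def from_text(cls, text: str, encoding: str) -> "EncodedString":
        return cls(text.encode(encoding), encoding)

    def convert(self, encoding: str) -> "EncodedString":
        # Conversion between encodings, defined by the type itself.
        if encoding == self.encoding:
            return self
        text = self.data.decode(self.encoding)
        return EncodedString(text.encode(encoding), encoding)

    def __add__(self, other: "EncodedString") -> "EncodedString":
        # User-defined "+": the right operand is converted to the
        # encoding of the left operand before the bytes are joined.
        joined = self.data + other.convert(self.encoding).data
        return EncodedString(joined, self.encoding)

    def __eq__(self, other):
        # User-defined "=": compare the decoded text, not the raw bytes.
        if not isinstance(other, EncodedString):
            return NotImplemented
        return self.data.decode(self.encoding) == other.data.decode(other.encoding)

    def __str__(self) -> str:
        return self.data.decode(self.encoding)


a = EncodedString.from_text("Grü", "utf-8")
b = EncodedString.from_text("ße", "utf-16-le")
c = a + b  # mixed encodings; the result takes the encoding of the left operand
```

Without such hooks, every concatenation or comparison would have to be
written out as explicit method calls by the user of the type.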

> IMO we must distinguish between the handling of Characters, Strings and
> Text. For the alphabets (character sets) of natural languages it should
> be possible to implement functions to compare and convert characters;
> such support often is built into the OS, for selected languages. 

That's problem 1: on Unix that part of the OS exists, but is not
standardised. This lack of standardisation is the main reason for avoiding
linking every program to these libs.

> This is the level where multibyte characters can come in, so that just a
> Character can be different from any fixed-size data type, and that the
> same Character can have multiple representations - remember your umlaut
> example? Nonetheless the rules on the Character level at least are quite
> well defined, so that it's possible to implement according standard
> procedures for comparison and conversion.

> Of course these procedures
> require parameters like the language and the encoding of the characters,
> so that IMO exchangeable and configurable classes are the best containers
> for characters.

The problem with string classes is that you lose all automatism. This
complicates each and every operation where new strings are created from old
ones. This is what Peter was hinting at.

Personally, I still think it would be best to have 2 types of widestrings
(UTF-8 and UTF-16), with automatic conversions between them. GNU is a UTF-8
world; Windows typically uses its own encoding that is more UTF-16-like.
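
As a rough illustration of what converting between the two would involve,
here is a small Python sketch. The helper names utf8_to_utf16 and
utf16_to_utf8 are made up for this example; the codecs are Python's standard
ones, and "utf-16-le" is used so that no byte order mark is added.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    # Decode the UTF-8 bytes to code points, then re-encode as UTF-16 (little endian).
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    # The reverse direction; the round trip is lossless for valid input.
    return data.decode("utf-16-le").encode("utf-8")


text = "Zürich"
as_utf8 = text.encode("utf-8")      # "ü" takes 2 bytes, the other letters 1 byte each
as_utf16 = utf8_to_utf16(as_utf8)   # every character here takes one 16-bit unit

# A character outside the 16-bit range needs two 16-bit units (a surrogate pair).
clef_utf16 = "\U0001D11E".encode("utf-16-le")
```

Both encodings are variable-length, so neither lets you index the n-th
character directly; the conversion itself is mechanical and needs no OS
support.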

(UTF-32 is rarely used, since afaik it is mostly needed for dead languages
and uncommon writing styles of East Asian languages. Moreover, afaik it
doesn't even have the often cited advantage of fixed-length chars: diacritic
modifiers exist here too. However, since most combinations also have a formal
codepoint, I don't know if that can be solved, e.g. by merging them.)
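
A small Python sketch of the diacritics point, using the standard unicodedata
module. The sample strings are my own choice. Normalisation form NFC is the
standard way to "merge" a base letter and a modifier into one codepoint where
such a codepoint exists.

```python
import unicodedata

# "u" followed by COMBINING DIAERESIS (U+0308): one visible character,
# but two code points, so two 32-bit units in UTF-32.
decomposed = "u\u0308"
units_utf32 = len(decomposed.encode("utf-32-le")) // 4

# NFC merges the pair into the precomposed code point U+00FC.
composed = unicodedata.normalize("NFC", decomposed)

# "q" followed by COMBINING TILDE (U+0303) has no precomposed code point,
# so NFC leaves it as two code points.
unmerged = unicodedata.normalize("NFC", "q\u0303")
```

So merging helps for the common combinations, but it does not give a
guaranteed one-unit-per-character encoding.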