[fpc-devel] Unicode in the RTL (my ideas)

Graeme Geldenhuys graemeg.lists at gmail.com
Mon Aug 20 18:22:24 CEST 2012


...Continuing the discussion of a Unicode RTL in a new thread as promised...


I obviously have a lot of issues with the RTL suggestions that have been
thrown around in the past. E.g. I have heard lots about the RTL most
likely being UTF-16 only, or being split into 3 versions: AnsiString,
UTF-16 and UTF-8 (a maintenance nightmare). Why? Why can't you have code
as follows:


   {$IFDEF WINDOWS}
      UnicodeString = type AnsiString(CP_UTF16);
   {$ELSE}
      // probably not strictly correct, but assuming *nix here.
      // But you get the idea
      UnicodeString = type AnsiString(CP_UTF8);
   {$ENDIF}

   String = type UnicodeString;
   // the maximum size of a Unicode codepoint is 4 bytes
   Char = type String[4];


Now the RTL can have something like


     Exception = class
     public
         property Message: string read....
     end;


     TStrings = class(...)
     public
         ....
         function Add(const AText: String): Integer;
         ....
         // I'm not 100% sure about the actual signature, but UTF-8 is
         // probably a very safe bet for the default, because 99.9999%
         // of Unicode text is stored in UTF-8, and ANSI text could
         // safely load too. If the developer knows otherwise, they can
         // always pass a different encoding constant to the function.
         procedure LoadFromFile(const AFilename: String;
             AEncoding: TEncoding = cp_UTF8);
     end;


This should be pretty "delphi compatible", meaning Delphi code could
probably compile under FPC Windows without much need for change. As
far as I know "delphi compatibility" is only meant for the Windows
platform, and Delphi code moving to FPC (not the other way round).

Also, now the locale variables can have things like the Russian
thousand separator (U+00A0) character stored in a Char too. For those
that didn't know, the Russian locale uses the non-breaking space as a
thousand separator, which in UTF-8 is 'C2 A0' (bytes) and takes up 2
bytes of memory. There might be similar locale variables in other
languages that take up even more bytes per character.
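
To make the byte counts concrete, here is a minimal sketch of how a
single code point maps to UTF-8 bytes. EncodeUtf8 and TByteArray are
hypothetical names made up for this illustration, not existing RTL
routines, and the helper does no validation (surrogates and values above
U+10FFFF are not rejected).

```pascal
program Utf8Width;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TByteArray = array of Byte;

// Hypothetical helper, not an RTL routine: encodes one Unicode code
// point as UTF-8. No validation is done (assumption for brevity).
function EncodeUtf8(CodePoint: Cardinal): TByteArray;
begin
  if CodePoint < $80 then
  begin
    // U+0000..U+007F: 1 byte
    SetLength(Result, 1);
    Result[0] := CodePoint;
  end
  else if CodePoint < $800 then
  begin
    // U+0080..U+07FF: 2 bytes
    SetLength(Result, 2);
    Result[0] := $C0 or (CodePoint shr 6);
    Result[1] := $80 or (CodePoint and $3F);
  end
  else if CodePoint < $10000 then
  begin
    // U+0800..U+FFFF: 3 bytes
    SetLength(Result, 3);
    Result[0] := $E0 or (CodePoint shr 12);
    Result[1] := $80 or ((CodePoint shr 6) and $3F);
    Result[2] := $80 or (CodePoint and $3F);
  end
  else
  begin
    // U+10000..U+10FFFF: 4 bytes, the maximum
    SetLength(Result, 4);
    Result[0] := $F0 or (CodePoint shr 18);
    Result[1] := $80 or ((CodePoint shr 12) and $3F);
    Result[2] := $80 or ((CodePoint shr 6) and $3F);
    Result[3] := $80 or (CodePoint and $3F);
  end;
end;

var
  Bytes: TByteArray;
  I: Integer;
begin
  // The non-breaking space used as the Russian thousand separator
  Bytes := EncodeUtf8($00A0);
  Write('U+00A0 -> ');
  for I := 0 to High(Bytes) do
    Write(IntToHex(Bytes[I], 2), ' ');
  WriteLn('(', Length(Bytes), ' bytes)');
  // prints: U+00A0 -> C2 A0 (2 bytes)
end.
```

The last branch is why the Char sketched earlier reserves 4 bytes: no
code point needs more than that in UTF-8.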



In general encoding conversions will be reduced on each platform, or
no conversion is needed at all, because the native encoding is always
used.


-- 
Regards,
  - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net


