[fpc-pascal] ord() of a string index returns wrong value.

Jonas Maebe jonas.maebe at elis.ugent.be
Sat Jan 29 18:59:24 CET 2011


On 29 Jan 2011, at 18:34, Pew (piffle.the.cat) wrote:

> On 01/30/2011 03:13 AM, Jonas Maebe wrote:
>> 
>> On 29 Jan 2011, at 17:05, Pew (piffle.the.cat) wrote:
>> 
>>> I have a problem where ord() of a character (single string index) returns the wrong value. the character is a 'o' which is a 111 value but the ord of it returns 121 into an integer. What am I doing wrong?
>> 
>> Is that Lazarus code? If so, the string will be utf-8 encoded and you cannot assume that str[i] corresponds to the i'th character of the string. Even if it's not Lazarus code, it could still be utf-8 encoded depending on what the source of the string is and/or the locale settings of the system.
> 
> Yes, it is Lazarus code. Okay so I think that we have found the problem. Now how do I fix it?

If you want to access individual characters, it's probably easiest to convert the string to a unicodestring first:

var
  utxt: unicodestring;
begin
  ..
  utxt:=utf8decode(txt);
  { now perform all operations on utxt instead of on txt }
  ..
end;
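
To see the difference between the two, here is a small self-contained sketch (not your code; the program name and the test string are made up for illustration). It builds the UTF-8 bytes for 'é' followed by 'o' by hand, so the result does not depend on the encoding of the source file:

```pascal
program utf8demo;
{$mode objfpc}{$H+}
var
  txt: ansistring;
  utxt: unicodestring;
begin
  { 'é' (U+00E9) is the two bytes $C3 $A9 in UTF-8; 'o' is one byte }
  setlength(txt, 3);
  txt[1] := #$C3;
  txt[2] := #$A9;
  txt[3] := 'o';

  writeln(length(txt));    { 3: bytes, not characters }
  writeln(ord(txt[2]));    { 169: the second byte of 'é', not 'o' }

  utxt := utf8decode(txt);
  writeln(length(utxt));   { 2: UTF-16 code units }
  writeln(ord(utxt[2]));   { 111: now index 2 really is 'o' }
end.
```

So txt[2] lands in the middle of the encoded 'é', while utxt[2] is the 'o' you expected.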

Note:
1) even in UTF-16 (which is the encoding of a unicodestring), a single character may take up more than one code unit (surrogate pairs) or even more than one code point (decomposed characters), so this is not 100% safe yet either. If you want a guarantee that string[i] corresponds to exactly one "character", you
a) have to normalize the unicode string to remove decomposed characters, and then
b) convert it to a UTF-32 string. You can use this routine for the unicodestring to UTF-32 conversion: http://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html (note that UCS4String is a dynamic array, not a string type)

I don't know whether Lazarus contains platform-independent wrappers for a). FPC itself at least doesn't at this time.

2) you will have to make sure that your "Rects_low" and "LastCharacterDefined" are defined in terms of UTF-16. Unless they are all plain ASCII characters (i.e., with an ordinal value <=127), using a simple range is unlikely to work correctly.
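
As a rough sketch of step 1b) only (it does not do the normalization from 1a), and the program name and test string are again made up), this is how the UTF-32 conversion could look. It assumes an FPC version whose utf8decode and UnicodeStringToUCS4String handle characters outside the basic multilingual plane:

```pascal
program ucs4demo;
{$mode objfpc}{$H+}
var
  txt: ansistring;
  utxt: unicodestring;
  u4: UCS4String;
  i: integer;
begin
  { U+1D11E (musical G clef) as its four UTF-8 bytes, followed by 'o' }
  setlength(txt, 5);
  txt[1] := #$F0;
  txt[2] := #$9D;
  txt[3] := #$84;
  txt[4] := #$9E;
  txt[5] := 'o';

  utxt := utf8decode(txt);
  { UTF-16 needs a surrogate pair for U+1D11E, so utxt[1] alone is not a character }
  writeln('UTF-16 code units: ', length(utxt));

  u4 := UnicodeStringToUCS4String(utxt);
  { UCS4String is a 0-based dynamic array; its last element is a terminating 0 }
  for i := 0 to length(u4) - 2 do
    writeln('code point ', i, ': ', u4[i]);
end.
```

In the UCS4String the surrogate pair should collapse into a single array element, so each index before the terminating zero is one code point. Keep in mind the indices start at 0 there, unlike with string types.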


A simpler alternative, with fairly high chances of data loss, is something like this:

var
  mytxt: ansistring;
begin
  mytxt:=utf8decode(txt);
  { now perform all operations on mytxt instead of on txt }
  ...
end;

This will first decode the UTF-8 encoded string to a UTF-16 encoded unicodestring, and then convert this unicodestring to a plain ansistring. Data loss can happen if the string contains characters that cannot be represented in the "ansi" (~ default) code page of the system the program is running on. Such non-representable characters will be replaced by '?'.


In summary, unless you are an expert at working with unicode, you should not work with such strings at the character/code point level, and should use higher level helpers instead to achieve what you want to do. You may want to ask for help about that on the Lazarus mailing list (subscription information at http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus), by describing what exactly it is you want to do rather than showing how you are currently doing it.


Jonas

