[fpc-pascal] UTF-8 versions of Copy() and Length()

Rimgaudas Laucius rimga at ktl.mii.lt
Sat May 19 12:31:22 CEST 2007


Storage:
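UTF-8 < UTF-16 only for most Latin scripts.
Greek, Cyrillic and Arabic letters take 2 bytes in both encodings,
so for those UTF-8 is about equal to UTF-16.
For other scripts (Chinese, Indic, ...) UTF-8 > UTF-16
(3 bytes per character instead of 2).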
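
A quick way to check size comparisons like these is to encode a short
sample in both encodings and count the bytes. The sketch below is in
Python only for brevity; the sample strings and the helper name
encoded_sizes are illustrative assumptions, not something from this thread.

```python
# Sample strings are illustrative assumptions, not taken from the thread.
samples = {
    "Latin (English)": "Hello",
    "Greek": "αβγδε",
    "Cyrillic": "привет",
    "Chinese": "你好世界",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

Real text usually mixes in ASCII spaces, digits and punctuation, which
take 1 byte in UTF-8 and 2 in UTF-16, so measured ratios depend on the text.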

Performance:
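Computing Length() of a UTF-8 string costs about as much as one
UTF-8 -> UTF-16 conversion, so two Length() calls on a UTF-8 string
already cost more than a single conversion.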
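
The reason a UTF-8 Length() is not free is that it has to scan every
byte to find where characters start. A minimal sketch of that scan,
written in Python for brevity and assuming valid UTF-8 input; the names
utf8_length and utf8_copy are hypothetical and are not the functions
discussed in this thread:

```python
def utf8_length(data: bytes) -> int:
    """Count code points: every byte that is not a continuation byte
    (bit pattern 10xxxxxx) starts a new code point."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_copy(data: bytes, index: int, count: int) -> bytes:
    """Return `count` code points starting at 1-based code point `index`,
    in the spirit of Pascal's Copy(). Assumes valid UTF-8."""
    # Byte offsets at which each code point starts, plus an end sentinel.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    first = min(max(index - 1, 0), len(starts) - 1)
    last = min(first + max(count, 0), len(starts) - 1)
    return data[starts[first]:starts[last]]


s = "naïve €".encode("utf-8")
print(len(s), utf8_length(s))            # byte count vs. code point count
print(utf8_copy(s, 3, 2).decode("utf-8"))
```

Both helpers are linear in the number of bytes, which is the same order
of work as a conversion to UTF-16.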

Fixed 4-byte characters are what UTF-32 uses. UTF-16 uses a sequence of
2 code units from the surrogate area (a surrogate pair) to express
characters outside the Basic Multilingual Plane, and those are very rarely
used (actually I do not know of any program that implements that).
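
To make the surrogate mechanism concrete, the sketch below lists the
UTF-16 code units of one character inside the BMP and one outside it.
It is in Python for brevity; the helper name utf16_code_units and the
two sample characters are illustrative choices.

```python
def utf16_code_units(text):
    """Return the list of 16-bit UTF-16 code units for text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
astral_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

for ch in (bmp_char, astral_char):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X}: UTF-16 code units = {units}")
```

The second character needs two code units, so code that assumes one
16-bit unit per character will miscount it.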



----- Original Message ----- 
From: "Felipe Monteiro de Carvalho" <felipemonteiro.carvalho at gmail.com>
To: "FPC-Pascal users discussions" <fpc-pascal at lists.freepascal.org>
Sent: Saturday, May 19, 2007 12:57 PM
Subject: Re: [fpc-pascal] UTF-8 versions of Copy() and Length()


> On 5/19/07, Rimgaudas Laucius <rimga at ktl.mii.lt> wrote:
>> It is not useful to have functions for both encodings, because these
>> encodings are interconvertible and it is more efficient to use UTF-16 for
>> data processing
>
> I disagree. The conversion impacts performance heavily. It will also
> require memory to store the converted string, and after you perform an
> operation you need to convert back.
>
> Further, UTF-16 contains both 2-byte characters and 4-byte characters,
> so I don't see how it would be any faster to process than a UTF-8
> string.
>
> About being easier to implement, that's irrelevant, because the
> functions are already done.
>
> -- 
> Felipe Monteiro de Carvalho