[fpc-devel] Patch to speed up Uppercase/Lowercase functions
daniel at deadlock.et.tudelft.nl
Sat Jun 11 14:17:19 CEST 2005
On Sat, 11 Jun 2005, Michael Van Canneyt wrote:
> On Sat, 11 Jun 2005, Daniël Mantione wrote:
> > On Fri, 10 Jun 2005, Florian Klaempfl wrote:
> > > Joost van der Sluis wrote:
> > >
> > > > Hi all,
> > > >
> > > > I don't know if rtl-optimisation patches have a high priority,
> > >
> > > It depends if someone does it ;)
> > >
> > > > but
> > > > nevertheless this patch improves the speed of the sysutils.uppercase and
> > > > lowercase functions.
> > >
> > > What about creating a table which does direct mapping? It's a lot faster.
> > It would be faster, but it would require two 256-byte tables, which will
> > help make people complain about code size even more. I would use a table
> > in an inner loop, but upper/lower case conversions are seldom called in
> > an inner loop. It also causes some cache thrashing, which is an often
> > ignored speed issue in programming.
> Well. Discussion is nice, but what does the real world show ?
> To compare, I made 6 versions of Lowercase:
> 1 - Sysutils
> 2 - Sysutils with Joost's improvement.
> 3 - Sysutils with Joost's improvement, but forward loop.
> 4 - Using PChar.
> 5 - Using PChar with lookup table and if check
> 6 - Using PChar with lookup table and no check.
> Result on an AMD 64 3000:
> Lowercase time to execute: 00:00:01.563
> Lowercase2 Time to execute: 00:00:01.363
> Lowercase3 Time to execute: 00:00:01.394
> Lowercase4 Time to execute: 00:00:00.999
> Lowercase5 Time to execute: 00:00:01.021
> Lowercase6 Time to execute: 00:00:00.948
> So, judge for yourself. I think this is worth the 256 byte lookup table.
0.948/0.999 = 95 %
So we get a 5% speed improvement from using a table; this is much worse
than I thought, and it can easily be undone in the real world by the
increased cache thrashing. Of course any speed improvement is welcome, but
IMHO this is not worth the size increase.
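
For concreteness, the two approaches being weighed here can be sketched
roughly as follows. This is C rather than Pascal, and it is not the code
Michael benchmarked; the names lower_table, init_lower_table, lower_check
and lower_table_nocheck are made up for illustration, and only ASCII A..Z
is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 256-byte table: lower_table[c] is the lowercase form of c. */
static unsigned char lower_table[256];

/* Fill the table once; must be called before lower_table_nocheck(). */
static void init_lower_table(void)
{
    for (int i = 0; i < 256; i++)
        lower_table[i] = (unsigned char)((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
}

/* Range-check variant: a compare per character, no extra data. */
static void lower_check(char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (char)(s[i] + ('a' - 'A'));
}

/* Table variant: one unconditional load per character, at the cost of
   256 bytes of data that compete for cache with everything else. */
static void lower_table_nocheck(char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        s[i] = (char)lower_table[(unsigned char)s[i]];
}
```

Both routines produce the same result; the difference is only where the
cost lands (a branch per character versus a table that has to stay in
cache).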
Remember, this is just 1 procedure, and 256 bytes extra is nothing compared
to the whole unit.
But if we start doing this kind of optimization across the entire unit,
we'll get a horribly bloated unit.
Also, if speed is really important, nothing can beat a hand-optimized
assembler routine that does the operation without jumps and, by means of
32-bit registers, does 4 chars at once. We have hand-optimized string
routines in the rtl; I don't see why it cannot be done in sysutils.
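
To illustrate the "4 chars at once, without jumps" idea, here is a minimal
sketch in portable C rather than assembler. It is only one possible way to
do it, not an existing RTL routine; the names lower4 and lower_swar are
hypothetical, and only ASCII A..Z is converted (bytes with the high bit set
are left alone).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lowercase four ASCII bytes packed in one 32-bit word, with no
   per-character branch. Works lane by lane, so byte order is irrelevant. */
static uint32_t lower4(uint32_t w)
{
    uint32_t low7  = w & 0x7f7f7f7fu;       /* drop bit 7 so adds cannot carry across bytes */
    uint32_t ge_A  = low7 + 0x3f3f3f3fu;    /* bit 7 set where byte >= 'A' (0x41) */
    uint32_t gt_Z  = low7 + 0x25252525u;    /* bit 7 set where byte >  'Z' (0x5a) */
    uint32_t upper = ge_A & ~gt_Z & ~w & 0x80808080u; /* 0x80 in each A..Z lane */
    return w | (upper >> 2);                /* 0x80 >> 2 = 0x20, the case bit */
}

/* Lowercase a buffer in place: 4 bytes per step, then a scalar tail. */
static void lower_swar(char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    for (; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, s + i, 4);               /* avoids alignment and aliasing problems */
        w = lower4(w);
        memcpy(s + i, &w, 4);
    }
    for (; i < len; i++) {                  /* at most 3 leftover bytes */
        unsigned char c = (unsigned char)s[i];
        /* comparison result (0 or 1) shifted into the case bit; whether the
           compiler emits a jump here is up to the compiler */
        s[i] = (char)(c | (((unsigned)(c - 'A') < 26u) << 5));
    }
}
```

No table is needed, so there is no extra data to keep in cache; the cost is
a handful of extra ALU operations per 4 bytes.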