[fpc-devel] Fast ascii upper/lowercase

Christian Iversen chrivers at iversen-net.dk
Sat Jun 11 23:29:10 CEST 2005


On Saturday 11 June 2005 20:38, Daniël Mantione wrote:
> On Sat, 11 Jun 2005, Christian Iversen wrote:
> > With all the discussion about speed of lowercase/uppercase recently, I
> > thought I'd chip in. I don't have the time to actually implement and test
> > this, but I hope somebody else will.
> >
> > This semi-pseudocode should be a bit faster than what is currently done:
> >
> > function lowercase(const value: string): string;
> > var
> >   X, C: Integer;
> >   L: LongWord;
> >   B: Char;
> > begin
> >   C := Length(value);
> >   SetLength(result, C);
> >   for X := 0 to (C div 4) - 1 do
> >   begin
> >     // one dword fetch covers four characters
> >     L := PLongWord(@value[(X shl 2) + 1])^;
> >     // Char 1/4
> >     B := Char(L and $FF);
> >     if B in ['A'..'Z'] then
> >       result[(X shl 2) + 1] := Chr(Ord(B) or $20)
> >     else
> >       result[(X shl 2) + 1] := B;
> >     // Char 2/4
> >     B := Char((L shr 8) and $FF);
> >     if B in ['A'..'Z'] then
> >       result[(X shl 2) + 2] := Chr(Ord(B) or $20)
> >     else
> >       result[(X shl 2) + 2] := B;
> >     ..etc..
> >   end;
> >   ..fix remaining 0-3 chars..
> > end;
> >
> > Basically, a simple loop unroll, with the twist that the fetch is
> > collapsed into one dword load per four characters. This, combined with
> > the dynamic scheduling of modern CPUs, should increase the speed
> > considerably. Since the steps within each iteration have few data
> > dependencies on one another, this could work out to be quite fast.
> >
> > I could be wrong, of course :-)
> >
> > Does somebody want to try this?
>
> I expect it would be hard to get any speed advantage using this approach,
> since the overhead of expressions like char((L shr 8) and $FF) is quite
> severe.

I was aiming for a very low level of data dependencies, which (on modern 
hardware anyway) can more or less amortize the cost of the extra 
calculations. Do not underestimate the power of the force^H^H^H^Hpipelining :-)
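
To illustrate with a fragment of the inner loop (B0..B3 are Chars, names 
invented for the example): the four byte extractions depend only on L, not 
on each other, so an out-of-order CPU is free to overlap them:

  B0 := Char(L and $FF);
  B1 := Char((L shr 8) and $FF);
  B2 := Char((L shr 16) and $FF);
  B3 := Char(L shr 24);

Only the load of L itself is on the critical path; the shifts and masks can 
run in parallel in separate ALUs.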

> The real penalty is the expression B in ['A'..'Z']. This expression needs
> two branches, neither of which can be predicted by the processor. Because
> they depend on the input, the processor's branch caches don't help here.
> You get a pipeline stall in 50% of the cases. This is the reason why a
> branch-less solution like the one I posted is so much faster.

On older processors, I'd expect you to be right. However, from the 
impression I got reading the Intel Pentium 4 Optimization Handbook, this is 
less true for the P4 and similar processors. They do speculative execution 
across branches, for crying out loud!
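
I don't have your branch-free version in front of me, but for reference, the 
per-character test can be made branch-free with plain integer arithmetic - 
just a sketch of the general idea, not necessarily what you posted:

  function LoCaseChar(c: Char): Char; inline;
  var
    x, y, up: LongWord;
  begin
    x := LongWord(Ord(c)) - LongWord(Ord('A'));  // wraps around for c < 'A'
    y := LongWord(Ord('Z')) - LongWord(Ord(c));  // wraps around for c > 'Z'
    // bit 31 of (x or y) is clear exactly when 'A' <= c <= 'Z'
    up := (not (x or y)) shr 31;                 // 1 for uppercase, 0 otherwise
    Result := Chr(Ord(c) or (up shl 5));         // setting bit 5 lowercases
  end;

The unsigned wraparound turns both subtractions into range checks, so the 
whole thing is a handful of ALU instructions with no jumps.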

In general, it comes down to which is more expensive: data stalls or branch 
stalls. You could very well be right - branch stalls are expensive as hell - 
but I'd still time it to find out.
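
Something along these lines would settle it (the function names are 
hypothetical, standing in for the two versions discussed above; DateUtils 
provides the timing):

  program timecase;
  uses SysUtils, DateUtils;
  var
    s: string;
    i: Integer;
    t0: TDateTime;
  begin
    s := StringOfChar('A', 512) + StringOfChar('a', 512);
    t0 := Now;
    for i := 1 to 100000 do
      LowercaseUnrolled(s);      // hypothetical: the dword-unrolled version
    WriteLn('unrolled:   ', MilliSecondsBetween(Now, t0), ' ms');
    t0 := Now;
    for i := 1 to 100000 do
      LowercaseBranchless(s);    // hypothetical: the branch-free version
    WriteLn('branchless: ', MilliSecondsBetween(Now, t0), ' ms');
  end.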

> Of course, working with dwords can be a huge advantage, but only if you
> can process data in parallel or if you are memory bound. At least, this
> is what I expect, and theory and practice don't always agree.

And that's exactly the point. You _can_ process data in parallel (or did I 
miss something?). With the ever-diverging speeds of CPU and memory, we may 
or may not be memory bound.
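
For what it's worth, the case test can even be done branch-free on a whole 
dword at once, SIMD-within-a-register style - again just a sketch:

  function LowerCaseDWord(w: LongWord): LongWord; inline;
  var
    heptets, geA, gtZ, isAscii, isUpper: LongWord;
  begin
    heptets := w and $7F7F7F7F;           // per-byte values, bit 7 cleared
    geA     := heptets + $3F3F3F3F;       // bit 7 set in each byte >= 'A' ($41)
    gtZ     := heptets + $25252525;       // bit 7 set in each byte >  'Z' ($5A)
    isAscii := (not w) and $80808080;     // bit 7 set in each byte < $80
    isUpper := isAscii and geA and (not gtZ);
    Result  := w or (isUpper shr 2);      // $80 shr 2 = $20: sets the case bit
  end;

That removes both the branches and the per-byte extraction from the inner 
loop, so it would also be a fair way to test whether we really are memory 
bound.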

I appreciate the input.

-- 
Regards,
Christian Iversen



