[fpc-devel] x86_64 SHA1 implementation

Sat Sep 16 17:29:51 CEST 2023

Thanks for the resources - these will prove very useful!  Later Intel and
AMD processors also have specialised SHA instructions.  I know AMD Zen
supports them, but I'm not sure which Intel models were the first to.

Currently I'm sticking with pure SSE2, since it is the latest
instruction set that is guaranteed to be available on all x86_64
processors.  I can write versions for SSSE3 and AVX later, but for now
I'm trying to identify the mysterious performance drops.
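
As a rough sketch of how the later versions could be selected at run
time, the C fragment below picks an implementation from raw CPUID
feature bits, falling back to SSE2.  The enum and function names are
hypothetical and are not part of the FPC sources; only the bit positions
(CPUID leaf 1 ECX bit 9 for SSSE3, bits 27 and 28 for OSXSAVE and AVX,
and leaf 7 EBX bit 29 for SHA) come from the vendor documentation.  The
caller is assumed to have executed CPUID and XGETBV itself.

```c
#include <stdint.h>

/* Hypothetical identifiers for illustration only. */
typedef enum {
    SHA1_IMPL_SSE2,    /* baseline: guaranteed on every x86_64 CPU */
    SHA1_IMPL_SSSE3,
    SHA1_IMPL_AVX,
    SHA1_IMPL_SHA_EXT  /* dedicated SHA instructions */
} sha1_impl;

#define CPUID1_ECX_SSSE3   (1u << 9)   /* CPUID.1:ECX bit 9  */
#define CPUID1_ECX_OSXSAVE (1u << 27)  /* CPUID.1:ECX bit 27 */
#define CPUID1_ECX_AVX     (1u << 28)  /* CPUID.1:ECX bit 28 */
#define CPUID7_EBX_SHA     (1u << 29)  /* CPUID.(7,0):EBX bit 29 */

/*
 * leaf1_ecx    : ECX returned by CPUID leaf 1
 * leaf7_ebx    : EBX returned by CPUID leaf 7, subleaf 0 (pass 0 if the
 *                leaf is not supported)
 * os_saves_ymm : nonzero if XGETBV shows the OS saves XMM and YMM state
 *                (XCR0 bits 1 and 2), which AVX code requires
 */
sha1_impl choose_sha1_impl(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
                           int os_saves_ymm)
{
    /* Prefer the dedicated SHA instructions when present. */
    if (leaf7_ebx & CPUID7_EBX_SHA)
        return SHA1_IMPL_SHA_EXT;

    /* AVX needs the CPU flag, OSXSAVE, and OS support for YMM state. */
    if ((leaf1_ecx & CPUID1_ECX_AVX) &&
        (leaf1_ecx & CPUID1_ECX_OSXSAVE) &&
        os_saves_ymm)
        return SHA1_IMPL_AVX;

    if (leaf1_ecx & CPUID1_ECX_SSSE3)
        return SHA1_IMPL_SSSE3;

    /* No check needed: SSE2 is part of the x86_64 baseline. */
    return SHA1_IMPL_SSE2;
}
```

Taking the register values as parameters rather than executing CPUID
inside the function keeps the selection logic testable on any machine.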

Kit

On 16/09/2023 16:18, Wayne Sherman wrote:
> J. Gareth Moreton via fpc-devel <fpc-devel at lists.freepascal.org> wrote:
>> So this past week I've been building on Rika's work by adding an
>> assembly version of SHA-1 for x86_64 to complement Rika's i386 version.
>> So far I've successfully made a version that runs twice as fast as the
>> Pascal code.  I hoped to go even faster by making use of the SSE2
>> instruction set...
> In 2010 Intel published SSSE3 code to improve SHA1 performance.  Later
> that year it was incorporated into OpenSSL ASM code.  The OpenSSL code
> also includes AVX and SHA acceleration extensions.
>
> Intel Article:
> https://www.intel.com/content/www/us/en/developer/articles/technical/improving-the-performance-of-the-secure-hash-algorithm-1.html
>
> Brief on Intel SHA extensions (also works for AMD Zen and later CPUs)
> https://en.wikipedia.org/wiki/Intel_SHA_extensions
>
> OpenSSL x86 64-bit assembly code and performance chart
> https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha1-x86_64.pl
>
> ######################################################################
> # Current performance is summarized in following table. Numbers are
> # CPU clock cycles spent to process single byte (less is better).
> #
> #               x86_64         SSSE3            AVX[2]
> # P4            9.05           -
> # Opteron       6.26           -
> # Core2         6.55           6.05/+8%         -
> # Westmere      6.73           5.30/+27%        -
> # Sandy Bridge  7.70           6.10/+26%        4.99/+54%
> # Ivy Bridge    6.06           4.67/+30%        4.60/+32%
> # Haswell       5.45           4.15/+31%        3.57/+53%
> # Skylake       5.18           4.06/+28%        3.54/+46%
> # Bulldozer     9.11           5.95/+53%
> # Ryzen         4.75           3.80/+24%        1.93/+150%(**)
> # VIA Nano      9.32           7.15/+30%
> # Atom          10.3           9.17/+12%
> # Silvermont    13.1(*)        9.37/+40%
> # Knights L     13.2(*)        9.68/+36%        8.30/+59%
> # Goldmont      8.13           6.42/+27%        1.70/+380%(**)
> #
> # (*) obviously suboptimal result, nothing was done about it,
> # because SSSE3 code is compiled unconditionally;
> # (**) SHAEXT result
>
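
For reading the table above: the percentage beside each figure appears
to be the speed-up over the plain x86_64 column, i.e. baseline cycles
per byte divided by the optimised cycles per byte, minus one.  This is
an inference from the numbers, not something stated in the OpenSSL
comment, and the function name below is made up for illustration.

```c
/* Relative gain, in percent, of an optimised routine over a baseline,
 * both given in CPU cycles per byte (lower is better). */
double speedup_percent(double baseline_cpb, double optimised_cpb)
{
    return (baseline_cpb / optimised_cpb - 1.0) * 100.0;
}
```

For example, the Westmere row (6.73 against 5.30) gives about 27, and
the Haswell AVX entry (5.45 against 3.57) gives about 53, matching the
table to within rounding.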