My current idea (formed without looking at 
any existing theory) is to use a kind of 
sliding window, similar to how ZIP and 
other LZ77-based algorithms find repeated 
strings when compressing: look backwards 
in the current block for a matching 
instruction, then scan forward from both 
positions. If the scan gets up to the 
instruction right before the starting 
point, then the block is a candidate for 
vectorisation. Using the previous example:

movss 16(%rsp),%xmm0
addss 32(%rsp),%xmm0
movss %xmm0,(%rax)
movss 20(%rsp),%xmm0
addss 36(%rsp),%xmm0
movss %xmm0,4(%rax)

Starting at the 4th instruction, it looks 
back and finds a match in the 1st 
instruction, albeit with an address that 
differs by 4. As it scans forward, it 
finds similar matches in the subsequent 
instructions, and eventually realises the 
entire block could potentially be 
vectorised. If it continues, it finds that 
the code fragment repeats 4 times and can 
be vectorised with little difficulty. It 
helps that every instruction involved is 
an SSE instruction.
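
To make the scan concrete, here is a rough sketch in Python (chosen only for brevity; it is not compiler code). The names parse, match, same_block and find_candidate are made up for illustration, and the sketch assumes AT&T syntax, a fixed stride of 4 bytes, and that a purely textual match is enough. A real pass would also have to check register dependencies and aliasing before vectorising anything.

```python
import re

# Memory operand in AT&T syntax: optional displacement, then "(...)".
MEM = re.compile(r"^(-?\d+)?(\(.*\))$")
# Split operands on commas that are not inside parentheses.
COMMA = re.compile(r",\s*(?![^()]*\))")

def parse(line):
    """Return (opcode, [(displacement or None, rest), ...])."""
    parts = line.split(None, 1)
    operands = parts[1] if len(parts) > 1 else ""
    parsed = []
    for op in COMMA.split(operands):
        op = op.strip()
        m = MEM.match(op)
        if m:
            parsed.append((int(m.group(1) or 0), m.group(2)))
        else:
            parsed.append((None, op))   # register or immediate
    return parts[0], parsed

def match(a, b):
    """Displacement deltas if b mirrors a, otherwise None."""
    (op_a, ops_a), (op_b, ops_b) = a, b
    if op_a != op_b or len(ops_a) != len(ops_b):
        return None
    deltas = []
    for (disp_a, rest_a), (disp_b, rest_b) in zip(ops_a, ops_b):
        if rest_a != rest_b:
            return None
        if disp_a is not None:
            deltas.append(disp_b - disp_a)
    return deltas

def same_block(code, a, b, period, stride):
    """Scan forward: does code[b:b+period] mirror code[a:a+period]?"""
    if b + period > len(code):
        return False
    for k in range(period):
        deltas = match(code[a + k], code[b + k])
        if deltas is None or any(d != stride for d in deltas):
            return False
    return True

def find_candidate(lines, stride=4):
    """Return (start, period, copies) for the first repeating block."""
    code = [parse(line) for line in lines]
    for i in range(1, len(code)):
        for j in range(i - 1, -1, -1):      # look backwards from i
            period = i - j
            copies = 1
            while same_block(code, j + (copies - 1) * period,
                             j + copies * period, period, stride):
                copies += 1
            if copies > 1:
                return j, period, copies
    return None

example = [
    "movss 16(%rsp),%xmm0",
    "addss 32(%rsp),%xmm0",
    "movss %xmm0,(%rax)",
    "movss 20(%rsp),%xmm0",
    "addss 36(%rsp),%xmm0",
    "movss %xmm0,4(%rax)",
]
print(find_candidate(example))  # (0, 3, 2)
```

On the six-instruction excerpt above, this reports a block starting at instruction 0 with a period of 3 and 2 copies, since only two copies are shown here; fed a longer listing, the same loop would keep counting copies for as long as the stride holds.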


P.S. I did look at the loop unrolling 
code, but it almost never triggers because 
of the small instruction cache that's 
assumed. For x86-64, is it safe to assume 
a cache length of 60 instead of 30, given 
that almost all modern Intel and AMD 
processors have 56 or more entries in 
their queues?

