[fpc-devel] Vectorization

J. Gareth Moreton gareth at moreton-family.com
Tue Dec 12 03:31:39 CET 2017


Okay, sit back everyone - this is a long read!

----

I'm starting with the problem as reported in https://bugs.freepascal.org/view.php?id=27870, using the source
code provided there, but with {$codealign varmin=16} and {$codealign localmin=16} added at the top.

I'm running the latest version of the compiler with the following parameters "-O3 -va -CfSSE64 -a -Sv".  Find attached the source file and the generated assembly.

The first thing to note is that no vectorisation occurs when elements are set individually - e.g. the
v1[0] := 0.2 lines are assembled as follows:

movl	_$TESTFILE$_Ld1(%rip),%eax
movl	%eax,48(%rsp)
movl	_$TESTFILE$_Ld1(%rip),%eax
movl	%eax,52(%rsp)
movl	_$TESTFILE$_Ld1(%rip),%eax
movl	%eax,56(%rsp)
movl	_$TESTFILE$_Ld1(%rip),%eax
movl	%eax,60(%rsp)

(_$TESTFILE$_Ld1 refers to the 32-bit representation of 0.2, namely $3E4CCCCD, which is stored in memory as the
bytes CD CC 4C 3E.  I'm surprised the optimizer doesn't notice that %eax is reloaded redundantly each time.)
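
The constant can be checked by hand.  The C sketch below (C is used only as a neutral illustration; the helper
name float_bits is made up for this example) reinterprets a single-precision value as a 32-bit integer.  It
assumes IEEE 754 single precision, which is the format SSE operates on.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 bit pattern of a single-precision value.
   memcpy is used instead of a pointer cast to stay clear of aliasing rules. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

float_bits(0.2f) gives 0x3E4CCCCD; on a little-endian x86-64 machine those four bytes sit in memory as
CD CC 4C 3E.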

The line "v3 := v1 + v2;" is vectorised because the compiler can identify all the operands as
vector types, but as already suspected, the instruction that writes %xmm0 back to the stack is missing.

movdqa	48(%rsp),%xmm0
addps	64(%rsp),%xmm0

The next instruction is "call fpc_get_output", which begins the call to "WriteLn", so the result in %xmm0 is
never stored.

There is also a minor flaw in the generated code.  "movdqa" is an integer move, not a floating-point
move.  Combined with the floating-point "addps" that follows, it can incur a performance penalty on processors
that add a delay when data crosses between the integer and floating-point domains - "movaps" should be used
instead.
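
For comparison, here is a sketch in C with SSE intrinsics of the complete load/add/store sequence one would
expect for a four-element single-precision add.  The function name vec4_add is hypothetical, and the
instruction names in the comments are what compilers typically choose for these intrinsics, not a guarantee.
All three pointers are assumed to be 16-byte aligned.

```c
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* dst = a + b for four packed singles; every pointer must be 16-byte aligned. */
void vec4_add(float *dst, const float *a, const float *b)
{
    __m128 sum = _mm_load_ps(a);             /* aligned FP load, typically movaps  */
    sum = _mm_add_ps(sum, _mm_load_ps(b));   /* packed single add, addps           */
    _mm_store_ps(dst, sum);                  /* aligned FP store, typically movaps */
}
#else
/* Scalar fallback so the sketch still builds on non-x86 targets. */
void vec4_add(float *dst, const float *a, const float *b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}
#endif
```

The final store is the step that has no counterpart in the listing above; without it the sum only ever
exists in the register.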

Regarding alignment, the stack is correctly aligned because, while no stack frame is set up, the instruction
"pushq %rbx" brings the stack to a 16-byte boundary. Depending on how easy or tricky it is to enforce the
stack alignment, it might be possible to avoid switching to the unaligned move instructions.
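
The arithmetic behind that can be sketched in a few lines of C.  This assumes the calling convention keeps
%rsp 16-byte aligned at every call instruction (both the System V AMD64 and Microsoft x64 conventions do);
the function names and the example address are invented for illustration.

```c
#include <stdint.h>

/* Nonzero if addr is a multiple of 16, i.e. safe for movaps/movdqa. */
int is_aligned16(uint64_t addr)
{
    return (addr & 15u) == 0;
}

/* %rsp on function entry: "call" pushes the 8-byte return address. */
uint64_t rsp_after_call(uint64_t rsp_at_call_site)
{
    return rsp_at_call_site - 8;
}

/* %rsp after a 64-bit push such as "pushq %rbx": the stack grows down by 8. */
uint64_t rsp_after_pushq(uint64_t rsp)
{
    return rsp - 8;
}
```

Starting from an aligned %rsp at the call site, the value on entry is 8 bytes off a 16-byte boundary, and the
single push restores alignment.  Offsets that are multiples of 16 from there, such as 48(%rsp) and 64(%rsp),
are then aligned as well, provided any further adjustment of %rsp is itself a multiple of 16.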

Once I've figured out how the compiler emits the vector instructions, I'll make sure that it includes the missing
movaps store.  Initially I'll probably switch to movups to ensure no segmentation faults occur, and then
migrate back to movaps if I can automatically enforce the correct byte alignment with no input from the
programmer.  This might be done by detecting that the variables are vector types and aligning them to a 16-byte
boundary if SSE is selected.  I'll let you know how it goes.
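
As an illustration of alignment that follows the type rather than a per-file directive, here is the C11
analogue.  The type name vec4f and the helper are hypothetical and say nothing about how FPC represents
vector types internally.

```c
#include <stdalign.h>
#include <stdint.h>

/* Four packed singles; the alignment requirement is part of the type,
   so every variable of this type is placed on a 16-byte boundary. */
typedef struct {
    alignas(16) float lane[4];
} vec4f;

/* Nonzero if v sits on a 16-byte boundary, i.e. movaps may be used on it. */
int vec4f_is_aligned(const vec4f *v)
{
    return ((uintptr_t)v % 16u) == 0;
}
```

With the requirement attached to the type, locals, globals and array elements of that type all satisfy it
without the programmer writing anything extra, which is the behaviour aimed for above.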


Kit

----

P.S. Depending on how the optimizer is structured, I might suggest a kind of "Deep Optimizer" that is part
of -O3 (or -O4 if it's a little risky).  It would run after all of the other compilation and optimisation
stages, immediately prior to writing the assembler/object file, and would do things like remove the
redundant writes to %eax as well as other optimisations that the peephole optimizer misses.  In the .s file,
there are snippets of code akin to the following:

movq	%rax,%rbx
leaq	_$TESTFILE$_Ld3(%rip),%r8
movq	%rbx,%rdx

Because of the leaq instruction in between, the peephole optimizer doesn't notice the performance penalty that
comes from writing to %rbx and then immediately reading it again to copy into %rdx.  Suppose it were detected and
changed to the following:

movq	%rax,%rbx
leaq	_$TESTFILE$_Ld3(%rip),%r8
movq	%rax,%rdx

Changing %rbx to %rax in the second movq instruction removes its dependency on the first movq, and with it the
performance penalty, and takes advantage of modern processors' multiple ALUs (leaq does not modify any of the
registers other than the unrelated %r8 in this instance, so it's safe), thus likely letting this group of three
instructions execute in a single CPU cycle instead of 2.
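
The rewrite is a form of copy propagation.  Below is a toy model of it in C, purely to show the safety check
involved; the types and the function propagate_copies are invented for this sketch and do not reflect the
compiler's actual data structures.  It handles only straight-line code and only register-to-register moves.

```c
#include <stddef.h>

enum reg { RAX, RBX, RDX, R8, NO_REG };
enum op  { OP_MOV, OP_LEA };

struct insn {
    enum op  op;
    enum reg src;   /* NO_REG when the source is not a plain register */
    enum reg dst;
};

/* For each "mov a,b", rewrite later "mov b,c" into "mov a,c", as long as
   neither a nor b has been written in between.  Returns the number of
   instructions rewritten. */
size_t propagate_copies(struct insn *code, size_t n)
{
    size_t rewritten = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != OP_MOV || code[i].src == NO_REG)
            continue;
        enum reg orig = code[i].src;
        enum reg copy = code[i].dst;
        if (orig == copy)
            continue;
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == OP_MOV && code[j].src == copy) {
                code[j].src = orig;   /* read the original, not the copy */
                rewritten++;
            }
            /* A write to either register ends the window in which
               the two are known to hold the same value. */
            if (code[j].dst == orig || code[j].dst == copy)
                break;
        }
    }
    return rewritten;
}
```

Applied to the three instructions above (movq %rax,%rbx; leaq ...,%r8; movq %rbx,%rdx), the pass changes the
source of the last move to %rax.  Had the leaq written to %rax or %rbx instead of %r8, the inner loop would
stop there and leave the last move alone.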
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testfile.pp
Type: application/octet-stream
Size: 1407 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20171212/b343de4c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testfile.s
Type: application/octet-stream
Size: 15863 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20171212/b343de4c/attachment-0001.obj>

