[fpc-devel] Vectorization
J. Gareth Moreton
gareth at moreton-family.com
Tue Dec 12 03:31:39 CET 2017
Okay, sit back everyone - this is a long read!
----
I'm starting with the problem as listed in https://bugs.freepascal.org/view.php?id=27870, using the source
code provided there, but with {$codealign varmin=16} and {$codealign localmin=16} added at the top.
I'm running the latest version of the compiler with the following parameters "-O3 -va -CfSSE64 -a -Sv". Find attached the source file and the generated assembly.
First thing to note is that no vectorisation occurs for the individual setting of elements - e.g. the v1[ 0]
:= 0.2 lines are assembled as follows:
movl _$TESTFILE$_Ld1(%rip),%eax
movl %eax,48(%rsp)
movl _$TESTFILE$_Ld1(%rip),%eax
movl %eax,52(%rsp)
movl _$TESTFILE$_Ld1(%rip),%eax
movl %eax,56(%rsp)
movl _$TESTFILE$_Ld1(%rip),%eax
movl %eax,60(%rsp)
(_$TESTFILE$_Ld1 refers to the 32-bit representation of 0.2, namely $3E4CCCCD, stored in memory as the
little-endian bytes CD CC 4C 3E, and I'm surprised the optimizer doesn't notice the redundant reloading of %eax)
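As a quick sanity check of that constant, here is a minimal sketch in Python (not part of the original test case; the helper name single_bits is made up for illustration). It shows that the byte sequence CD CC 4C 3E is the little-endian memory layout of the single-precision value 0.2, whose numeric bit pattern is $3E4CCCCD.

```python
import struct

def single_bits(value):
    """Return (numeric bit pattern, bytes in little-endian memory order)
    for `value` rounded to an IEEE 754 single."""
    raw = struct.pack("<f", value)        # 4 bytes as they sit in memory on x86
    (bits,) = struct.unpack("<I", raw)    # the same 4 bytes read as a 32-bit integer
    return bits, raw

bits, raw = single_bits(0.2)
print(f"${bits:08X}")       # numeric value of the 32-bit constant
print(raw.hex().upper())    # byte order as stored in memory
```

The same four bytes are loaded and stored four times in the listing above, which is why a single 16-byte vector store would be the natural replacement.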
The line "v3 := v1 + v2;" is vectorised because the compiler can identify all the operands as
vector types, but, as already suspected, the command that writes %xmm0 back to the stack (into v3) is missing:
movdqa 48(%rsp),%xmm0
addps 64(%rsp),%xmm0
The next operation after these two is "call fpc_get_output", which begins the call to "WriteLn".
Also, there is a very slight bug with the generated code. "movdqa" is an integer move, not a floating-point
move. With the floating-point "addps" that follows, this can incur a small performance penalty on some
processors, because the data crosses between the integer and floating-point domains - "movaps" should be used instead.
Regarding alignment, the stack is correctly aligned because, while no stack frame is set up, the command
"pushq %rbx" aligns the stack to a 16-byte boundary (the return address leaves %rsp 8 bytes off such a
boundary on entry). Depending on how easy or tricky it is to enforce the stack alignment, it might be
possible to avoid switching to the unaligned move commands.
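The alignment argument can be checked with a little arithmetic. The sketch below is only a model, under two assumptions that are not taken from the attached files: that the calling convention keeps %rsp 16-byte aligned just before a "call" (so it is 8 bytes off on entry), and that the local area is 96 bytes. The entry address and the helper names are hypothetical too.

```python
VECTOR_ALIGN = 16  # alignment required by movaps/movdqa memory operands

def rsp_after_prologue(rsp_at_entry, pushes, frame_bytes):
    """Model %rsp after `pushes` 8-byte pushq commands and a local-area allocation."""
    return rsp_at_entry - 8 * pushes - frame_bytes

def is_aligned(address, alignment=VECTOR_ALIGN):
    return address % alignment == 0

# Hypothetical entry value: 8 bytes past a 16-byte boundary, because the
# "call" that got us here pushed an 8-byte return address.
entry_rsp = 0x7FFFFFFFE008

# One "pushq %rbx" plus an assumed 96-byte local area (a multiple of 16).
rsp = rsp_after_prologue(entry_rsp, pushes=1, frame_bytes=96)

for offset in (48, 64):  # the offsets used by the vector commands
    print(offset, is_aligned(rsp + offset))
```

Without the push (pushes=0), the same offsets would land 8 bytes off a 16-byte boundary, which is the case where the aligned move commands would fault.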
Once I've figured out how it emits the vector commands, I'll see that it includes the missing movaps
command. Initially I'll probably switch to using movups to ensure no segmentation faults occur, and then
migrate back to movaps if I can automatically enforce the correct byte alignment with no input from the
programmer. This might be done by detecting that the variables are vector types and aligning them to a
16-byte boundary if SSE is selected. I'll let you know how it goes.
Kit
----
P.S. Depending on how the optimizer is structured, I might suggest a kind of "Deep Optimizer" that is a part
of -O3 (or -O4 if it's a little risky) and is done after all of the other compilation and optimisation
stages and immediately prior to writing the assembler/object file, which does things like remove the
redundant writes to %eax and also other optimizations that the peephole optimizer misses. In the .s file,
there are snippets of code akin to the following:
movq %rax,%rbx
leaq _$TESTFILE$_Ld3(%rip),%r8
movq %rbx,%rdx
Because of the leaq command in between, the peephole optimizer doesn't notice the performance penalty that
comes from writing to %rbx and then immediately reading it again to copy into %rdx. This could be detected
and changed to the following:
movq %rax,%rbx
leaq _$TESTFILE$_Ld3(%rip),%r8
movq %rax,%rdx
Changing %rbx to %rax in the second movq command removes its dependency on the first one and takes
advantage of modern processors' multiple ALUs (leaq does not modify any of the registers other than the
unrelated %r8 in this instance, so it's safe), so this group of three commands can likely execute in a
single CPU cycle instead of two.
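To make the idea concrete, here is a minimal sketch of such a pass in Python. It is not how FPC's peephole optimizer is written. The tuple representation, the function name forward_copies, the SAFE_OPS set and the window size are all assumptions made for illustration. It only handles straight-line code and gives up at any command it does not understand.

```python
# Commands this sketch understands: each reads its source and writes only
# its destination. Anything else (calls, jumps, mul/div, ...) stops the scan.
SAFE_OPS = {"movq", "leaq"}

def is_reg(operand):
    return operand.startswith("%")

def forward_copies(instrs, window=4):
    """instrs is a list of (opcode, source, destination) tuples in AT&T order.

    After "movq A,B", a later "movq B,C" within `window` commands is
    rewritten to "movq A,C", provided neither A nor B is overwritten in between.
    """
    out = list(instrs)
    for i, (op, src, dst) in enumerate(out):
        if op != "movq" or not (is_reg(src) and is_reg(dst)):
            continue
        for j in range(i + 1, min(i + 1 + window, len(out))):
            op2, src2, dst2 = out[j]
            if op2 not in SAFE_OPS:
                break                      # unknown side effects: be conservative
            if op2 == "movq" and src2 == dst:
                out[j] = (op2, src, dst2)  # read the original register instead
            if dst2 in (src, dst):
                break                      # A or B overwritten: stop forwarding
    return out

before = [
    ("movq", "%rax", "%rbx"),
    ("leaq", "_$TESTFILE$_Ld3(%rip)", "%r8"),
    ("movq", "%rbx", "%rdx"),
]
for op, src, dst in forward_copies(before):
    print(f"{op} {src},{dst}")
```

The first movq is deliberately left in place, since %rbx may still be needed later. Removing it would require liveness information that this sketch does not have.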
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testfile.pp
Type: application/octet-stream
Size: 1407 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20171212/b343de4c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testfile.s
Type: application/octet-stream
Size: 15863 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20171212/b343de4c/attachment-0001.obj>