<div dir="ltr"><div dir="ltr"><div dir="ltr">On Wed, Mar 27, 2019 at 11:32 AM J. Gareth Moreton <<a href="mailto:gareth@moreton-family.com">gareth@moreton-family.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>So with the false start that was pure inline assembly, I like to talk about how to move forward with FPC, or at least with x86_64.</div></blockquote><div><br></div><div>It occurred to me today: aren't you the person who fixed the -Sv compiler flag so that it actually works? I'd say expanding that functionality would be more widely useful than just about anything else I can think of with regard to optimization (because it's so easy to use, and yet so powerful).</div><div><br></div><div>Maybe start by making it fully use AVX instructions for the operations? IIRC, currently, even if you use the AVX or AVX2 compiler flags, it will always generate stuff like this:</div><div><br></div><div><div>vmovups<span style="white-space:pre">      </span>(%rdx),%xmm0</div><div>addps<span style="white-space:pre">     </span>(%r8),%xmm0</div><div>vmovups<span style="white-space:pre">    </span>%xmm0,(%rax)</div></div><div><br></div><div>That is, a legacy SSE addps sitting between VEX-encoded moves, rather than vaddps.</div><div><br></div><div>From there you could make it support arrays larger than 4 elements, etc.</div></div></div></div>