<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 04/01/2022 10:31, Marco van de Voort
via fpc-devel wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9bb809ec-40c7-58a0-9e95-069b510e1875@pascalprogramming.org">
<br>
<br>
Weird as mine is inlined with -Cpcoreavx -O4, with no special
handling for 0. But that does put some things on shaky ground.
Maybe zero the result before hand?
<br>
</blockquote>
<br>
Same here.<br>
<br>
----------------------------------------<br>
About UTF8LengthFast()<br>
<br>
Well, before I get to this, I noticed something weird.<br>
<br>
Two runs, compiled with the same compiler (3.2.3) and the same
settings; the only difference is -gw3 or no -gw3.<br>
=> And the timing differed: 600 (with dwarf) vs 700 (no dwarf),
reproducibly (lower is faster).<br>
<br>
So then<br>
=> I compiled the app, with the same 3.2.3 (and I compiled
WITHOUT -a / though with -al I get the same results)<br>
=> I compiled once with -gw3 , and once without dwarf info<br>
=> I used objdump to dis-assemble the exe.<br>
=> I diffed the output (and searched for 0x101010...., which is only
used in the ...Fast code)<br>
--> <u><b>The assembler is identical.</b></u><br>
Yet one is faster. (the one WITH dwarf)<br>
The calling code in the main body is the same too (as far as I could
see), except that the address of the callee is different (but it is
just 20 calls per measurement)<br>
<br>
I did those runs OUTSIDE the IDE. <br>
So no debugger in the background.<br>
<br>
Win64 / 64 bit<br>
Core I7 8600K<br>
<br>
Using 3.3.1 the speed is equal, regardless of whether dwarf is
generated or not.<br>
(I did not compare the asm for that...)<br>
<br>
---------------------<br>
Clutching at straws, there is one (maybe) diff in the 3.2.3
with/without dwarf assembler.<br>
*** I am totally out of my depth here ****<br>
<br>
Alignment. 16 vs 32 byte. Can that make a difference?<br>
According to:
<a class="moz-txt-link-freetext" href="https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache">https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache</a><br>
<blockquote type="cite"> The Decoded ICache consists of 32 sets.
Each set contains eight Ways. Each Way can hold up to six
micro-ops.<br>
<br>
All micro-ops in a Way represent instructions which are
statically contiguous in the code and have their EIPs within <u><b>the
same aligned 32-byte region</b></u>.<br>
</blockquote>
<br>
So the alignment of the 2 procedures differs by 16 bytes<br>
The proc entry is at<br>
With dwarf 100001870<br>
without 100001860 // actually this is 32byte aligned (but slower)<br>
<br>
Yet, maybe it matters which statements in the big loop happen to
fall into the same 32byte block???<br>
<br>
The loop starting with<br>
for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do<br>
<br>
With DWARF (faster):<br>
1000018f0: 49 83 c2 01 add $0x1,%r10<br>
1000018f4: 4c 8b 19 mov (%rcx),%r11<br>
1000018f7: 4d 89 d8 mov %r11,%r8<br>
1000018fa: 48 bf 80 80 80 80 80 movabs
$0x8080808080808080,%rdi<br>
100001901: 80 80 80 <br>
100001904: 49 21 f8 and %rdi,%r8<br>
100001907: 49 c1 e8 07 shr $0x7,%r8<br>
10000190b: 49 f7 d3 not %r11<br>
10000190e: 49 c1 eb 06 shr $0x6,%r11<br>
100001912: 4d 21 d8 and %r11,%r8<br>
100001915: 4c 89 c3 mov %r8,%rbx<br>
100001918: 49 bb 01 01 01 01 01 movabs
$0x101010101010101,%r11<br>
10000191f: 01 01 01 <br>
100001922: 4d 0f af c3 imul %r11,%r8<br>
100001926: 49 c1 e8 38 shr $0x38,%r8<br>
10000192a: 4c 01 c6 add %r8,%rsi<br>
10000192d: 48 83 c1 08 add $0x8,%rcx<br>
100001931: 4d 39 d1 cmp %r10,%r9<br>
100001934: 7f ba jg 1000018f0
&lt;P$PROGRAM_$$_UTF8LENGTHFAST$PCHAR$INT64$$INT64+0x80&gt;<br>
<br>
WITHOUT:<br>
1000018e0: 49 83 c2 01 add $0x1,%r10<br>
1000018e4: 4c 8b 19 mov (%rcx),%r11<br>
1000018e7: 4d 89 d8 mov %r11,%r8<br>
1000018ea: 48 bf 80 80 80 80 80 movabs
$0x8080808080808080,%rdi<br>
1000018f1: 80 80 80 <br>
1000018f4: 49 21 f8 and %rdi,%r8<br>
1000018f7: 49 c1 e8 07 shr $0x7,%r8<br>
1000018fb: 49 f7 d3 not %r11<br>
1000018fe: 49 c1 eb 06 shr $0x6,%r11<br>
100001902: 4d 21 d8 and %r11,%r8<br>
100001905: 4c 89 c3 mov %r8,%rbx<br>
100001908: 49 bb 01 01 01 01 01 movabs
$0x101010101010101,%r11<br>
10000190f: 01 01 01 <br>
100001912: 4d 0f af c3 imul %r11,%r8<br>
100001916: 49 c1 e8 38 shr $0x38,%r8<br>
10000191a: 4c 01 c6 add %r8,%rsi<br>
10000191d: 48 83 c1 08 add $0x8,%rcx<br>
100001921: 4d 39 d1 cmp %r10,%r9<br>
100001924: 7f ba jg 0x1000018e0<br>
<br>
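To make the "which statements fall into the same 32byte block" question a bit more concrete, here is a rough sketch (Python, purely illustrative; the helper name <code>windows</code> is made up). It takes the instruction lengths read off the two listings above and counts how many instructions start in each aligned 32-byte window. It only describes the layout and assumes nothing about how the uop cache really treats it.<br>
<pre>
```python
# Instruction lengths in bytes, read off the objdump listing of the loop.
# 17 instructions, 70 bytes in total; the bytes are identical in both builds.
LENGTHS = [4, 3, 3, 10, 3, 4, 3, 4, 3, 3, 10, 4, 4, 3, 4, 3, 2]


def windows(loop_start, lengths=LENGTHS, size=32):
    """Map each aligned window base to the number of instructions starting in it."""
    counts = {}
    addr = loop_start
    for n in lengths:
        base = addr - addr % size          # start of the aligned window
        counts[base] = counts.get(base, 0) + 1
        addr += n                          # next instruction follows directly
    return counts


# Loop start addresses taken from the two listings.
for label, start in (("with dwarf", 0x1000018F0), ("without", 0x1000018E0)):
    layout = {hex(base): n for base, n in windows(start).items()}
    print(label, layout)
```
</pre>
With the two loop start addresses from the listings, the counts per window come out as 4/7/6 (with dwarf) versus 8/7/2 (without), so the same 70 bytes are split quite differently. Whether that explains the timing is still an open question.<br>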
<br>
----------------------------------------<br>
*** written before I got into the above......<br>
-------<br>
<br>
About UTF8LengthFast()<br>
<br>
I notice 3 differences.<br>
Though I only compare the O4 result, and have no idea what is
pre-peephole and post-peephole.<br>
<br>
1) As you say: different registers (but the same statements, in the
same order).<br>
No idea if that affects the CPU.<br>
<br>
2) One extra statement in 3.3.1<br>
<br>
movq %r10,%r8 ////
&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; not in
3.2.3<br>
movq $72340172838076673,%r10<br>
imulq %r10,%r8<br>
<br>
3) I do have "dwarf" enabled, even though at O4 that is not
expected to do any good.<br>
I noted that in 3.3.1 this leads to way more asm-labels than in
3.2.3.<br>
Those labels are only referred to by dwarf line info (some asm
statements are reported to be on the "begin" line, even though they
clearly are not).<br>
- It could be a result of the peephole opt.<br>
- But it could also be that the peephole is affected by the presence
of those labels.<br>
<br>
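For reference, the per-word step that the loop in the listings performs (and / shr 7, not / shr 6, and, imul, shr 0x38) can be sketched as follows. This is my reading of the assembler, written in Python for illustration only. The function names are made up, and I am assuming the accumulated count is subtracted from the byte count afterwards, which the listings do not show.<br>
<pre>
```python
HI = 0x8080808080808080      # top bit of every byte (first movabs constant)
ONES = 0x0101010101010101    # 72340172838076673, the imul constant
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit register width


def continuation_bytes(word):
    """Count bytes of the form 10xxxxxx in one 64-bit word."""
    bit7 = (word & HI) >> 7               # 0x01 in each byte whose bit 7 is set
    not_bit6 = ((~word) & MASK64) >> 6    # bit 0 of each byte = inverted bit 6
    marks = bit7 & not_bit6               # 0x01 per continuation byte
    # Multiplying by ONES sums all eight bytes into the top byte.
    return ((marks * ONES) & MASK64) >> 56


def utf8_length(data):
    """Code points in data, assuming well-formed UTF-8 (hypothetical helper)."""
    cont = 0
    whole = len(data) - len(data) % 8
    for off in range(0, whole, 8):
        cont += continuation_bytes(int.from_bytes(data[off:off + 8], "little"))
    for b in data[whole:]:                # leftover tail, one byte at a time
        if (b & 0xC0) == 0x80:
            cont += 1
    return len(data) - cont               # assumption: length = bytes - continuations
```
</pre>
The multiply works as a horizontal add because each marker byte is 0 or 1, so the sum of eight of them cannot overflow the top byte.<br>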
</body>
</html>