[fpc-pascal] FPC 2.6.2 throws SEGV in fpc_AnsiStr_Decr_Ref(). How is this possible?

Thu May 9 05:19:19 CEST 2013

So here are some more diagnostics at the point of the SEGV:

(gdb) disass
Dump of assembler code for function _$SYSTEM$_Ll1637:
=> 0x0118ace1 <+0>:     cmpl   $0x0,(%edx)
End of assembler dump.
(gdb) i reg
eax            0xb6c77158       -1228443304
ecx            0xb6c76c04       -1228444668
edx            0xfffffff8       -8
ebx            0x12adbf8        19586040
esp            0xb6c75f5c       0xb6c75f5c
ebp            0xb6c75f70       0xb6c75f70
esi            0xb6c77020       -1228443616
edi            0xb6c77020       -1228443616
eip            0x118ace1        0x118ace1 <_$SYSTEM$_Ll1637>
eflags         0x210293 [ CF AF SF IF RF ID ]
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51
(gdb) p $eax^
$4 = 0

This tells me that the test at the top of fpc_AnsiStr_Decr_Ref:

        cmpl $0,(%eax)
        jne .Ldecr_ref_continue
        ret
.Ldecr_ref_continue:

passed (i.e. (%eax) was NOT nil) but at some point during the execution of
the following code:

// Temps allocated between ebp-24 and ebp+0
        subl    $4,%esp
// Var S located in register
// Var l located in register
        movl    %eax,(%esp)
// [101] l:=@PAnsiRec(S-FirstOff)^.Ref;
        movl    (%eax),%edx
        subl    $8,%edx
// [102] If l^<0 then exit;
        cmpl    $0,(%edx)

the variable (%eax) MUST have been changed (to nil) BY ANOTHER THREAD.
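
As a sanity check on the register dump, the value in %edx is exactly what
the `movl (%eax),%edx` / `subl $8,%edx` pair produces if the second read of
(%eax) returned nil. The sketch below just redoes that arithmetic in Python
(chosen only as a neutral modelling language). FIRST_OFF = 8 is an assumption
taken from the `subl $8` in the listing, and the function names are
illustrative, not RTL identifiers.

```python
FIRST_OFF = 8            # assumption: offset used by `subl $8,%edx` on 32-bit i386
MASK32 = 0xFFFFFFFF      # registers are 32 bits wide


def ref_field_address(s: int) -> int:
    """Address of the Ref field as computed above: S - FirstOff, wrapped to 32 bits."""
    return (s - FIRST_OFF) & MASK32


def as_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as signed, as in gdb's third `i reg` column."""
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    edx = ref_field_address(0)          # S read back as nil
    print(hex(edx), as_signed32(edx))   # 0xfffffff8 -8
```

That matches `edx 0xfffffff8 -8` above, so the faulting `cmpl $0x0,(%edx)` is
dereferencing nil minus 8 rather than some random corrupted pointer.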

Is there any other plausible explanation I may have missed?

If there is no other explanation, then it means I need to find out how the
string variable referred to by (%eax) could have been accessed (or
even known to exist) by any other thread in the same address space.

If that variable is local to a function (i.e. foo's Result, with the SEGV
occurring on its assignment as soon as it first comes into scope, per my
earlier email) then, absent a bug in FPC's handling of string references and
allocation, it seems impossible that it could be known to or referenced by
any other thread.

I'm reasonably confident there's no other way it could be overwritten by
another thread (i.e. I don't think there are any range or buffer pointer
errors anywhere else) so logic tells me I must have the wrong thesis or
there's a string handling error in FPC.

Any clues or insight, gratefully received :-)

Cheers, Bruce.

PS: I can't use valgrind in practice for a variety of reasons, not the
least of which is that I'm unlikely to see the error for an extraordinarily
long time: slight changes to the (execution time of the) code made so far
have had a dramatic effect on how often the problem occurs at all. It's
clearly some sort of race condition over unprotected memory somewhere.
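
To make the suspected window concrete, here is a rough single-threaded model
of the entry sequence, again in Python. The string variable is read twice
(once for the nil test, once to compute the Ref address) and the `between`
callback is a hypothetical hook standing in for whatever else writes the
variable in between. The names, the hook and the address 0x1000 are all
illustrative assumptions, not anything from the RTL.

```python
from typing import Callable, List, Optional

FIRST_OFF = 8  # assumption: same offset as `subl $8,%edx`


def decr_ref_entry(var: List[int],
                   between: Optional[Callable[[], None]] = None) -> Optional[int]:
    """Model of the entry sequence; var[0] plays the string variable at (%eax).

    Returns None when the nil test exits early, otherwise the address that
    `cmpl $0,(%edx)` would go on to dereference.
    """
    if var[0] == 0:                        # cmpl $0,(%eax): first read
        return None                        # ret
    if between is not None:
        between()                          # hypothetical: another writer runs here
    s = var[0]                             # movl (%eax),%edx: second read
    return (s - FIRST_OFF) & 0xFFFFFFFF    # subl $8,%edx


if __name__ == "__main__":
    undisturbed = [0x1000]                 # arbitrary non-nil address
    print(hex(decr_ref_entry(undisturbed)))        # 0xff8

    raced = [0x1000]

    def clear() -> None:
        raced[0] = 0                       # the other writer sets the variable to nil

    print(hex(decr_ref_entry(raced, clear)))       # 0xfffffff8
```

Only a write landing between the two reads reproduces the observed %edx. A
write before the first read just takes the early `ret`.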

On Thu, May 9, 2013 at 9:47 AM, Bruce Tulloch <pascal at causal.com> wrote:

> I've not managed to trap it again, but based on the information I have
> from the last time it occurred I can say the error happened here:
>
> --- a/rtl/i386/i386.inc
> +++ b/rtl/i386/i386.inc
> @@ -1523,7 +1523,7 @@
>          movl    (%eax),%edx
>          subl    $8,%edx
>  // [102] If l^<0 then exit;
>          cmpl    $0,(%edx) <-- SEGV OCCURS HERE
>          jl      .Lj3596
>  .Lj3603:
>  // [104] If declocked(l^) then
>
> That is, when testing the string's reference count (the Ref field), the
> address of that field appears to be duff.
>
> I don't know what %edx was pointing to at the time (I hope to know next
> time I trap it) but it was obviously wrong.
>
> -b
>
>
> On Thu, May 9, 2013 at 9:33 AM, Bruce Tulloch <pascal at causal.com> wrote:
>
>> Thanks Jonas, that confirms what I suspected. Next time I trap an
>> instance of this (rare) fault I will inspect exactly which CPU instruction
>> raised the SEGV inside fpc_AnsiStr_Decr_Ref in search of a source of memory
>> corruption.
>>
>>
>> Bruce.
>>
>>
>> On Wed, May 8, 2013 at 11:49 PM, Jonas Maebe <jonas.maebe at elis.ugent.be> wrote:
>>
>>>
>>> On 08 May 2013, at 08:13, Bruce Tulloch wrote:
>>>
>>>  After a random but very long period of time (i.e. very many successful
>>>> calls) I get a SEGV in the built-in function fpc_AnsiStr_Decr_Ref.
>>>>
>>>> GDB reports the argument to fpc_AnsiStr_Decr_Ref (the string whose
>>>> reference count is to be decremented) is nil (i.e. 0x0).
>>>>
>>>> Prima facie, that's the reason for the SEGV, but how is it possible that
>>>> the compiler would pass a nil pointer to this function in the first place?
>>>>
>>>
>>> The first thing fpc_AnsiStr_Decr_Ref does is check whether its parameter
>>> is nil, and if so it immediately exits. It can be nil in case the
>>> ansistring contains an empty string.
>>>
>>> That routine itself also sets its argument to nil in case this was not
>>> the case initially (it's a var-parameter), and I assume your crash happens
>>> after this has been done.
>>>
>>>
>>>  To put this into context, I'm running FPC 2.6.2 on a 32 bit Linux system
>>>> executing in a multi-threaded application (which uses python threads and
>>>> fpc threads). I have not found obvious evidence of memory corruption from
>>>> other execution contexts or shared memory handling problems.
>>>>
>>>
>>> It's nevertheless most likely memory corruption. You can try compiling
>>> with -gv and running your program under valgrind to see whether it finds
>>> anything (you will probably get some false positives about certain RTL
>>> pchar routines such as strscan and strlen, but you can ignore those).
>>>
>>>
>>> Jonas