DevHeads IoT Integration Server•10mo ago

Debugging Persistent Segmentation Fault in Multi-threaded C++ Program on AMD Barcelona CPUs

I have been wrestling with a persistent segmentation fault in a multi-threaded C++ program running on a cluster of AMD Barcelona CPUs (Linux/x86_64). The crashing code is in a heavily used function, and under load, running 1000 instances of the same optimized binary produces 1 to 2 crashes per hour. Here's the interesting part: the crashes happen on different machines within the cluster (the machines themselves are almost identical), and they all share the same characteristics: same crash address and same call stack.

Solution:

Tools like GDB can help you track memory access patterns across threads, and mutexes are the synchronization mechanism to add around the critical sections of Foo so that access to shared data is thread-safe. How's it going, @Marvee Amasi?
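To make the mutex suggestion concrete, here is a minimal sketch. The thread never shows the source of Foo, so `SharedState`, `foo_update`, and `run_workers` are hypothetical names standing in for whatever shared data Foo actually touches.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the state that Foo reads and writes.
struct SharedState {
    std::mutex mtx;   // guards `total`
    long total = 0;
};

// Hypothetical critical section: a read-modify-write of shared data.
void foo_update(SharedState& state, int delta) {
    std::lock_guard<std::mutex> lock(state.mtx);  // unlocked at scope exit
    state.total += delta;
}

// Starts `threads` workers that each call foo_update `iterations` times,
// waits for all of them, and returns the final total.
long run_workers(int threads, int iterations) {
    SharedState state;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&state, iterations] {
            for (int i = 0; i < iterations; ++i) {
                foo_update(state, 1);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return state.total;  // safe to read: every worker has been joined
}
```

With the lock in place, `run_workers(4, 10000)` adds up to exactly 40000. Without it, the increments from different threads could interleave and the result would be undefined.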


7 Replies

Marvee Amasi•10mo ago

Crash details:
Signal: Segmentation fault (SIGSEGV)
Faulting instruction address: 0x17bd9fc (mid-instruction in function Foo)

Marvee Amasi•10mo ago

My code around the crash location:

(gdb) x/6i $pc-12
0x17bd9f1: mov    (%rbx),%eax
0x17bd9f3: mov    %rbx,%rdi
0x17bd9f6: callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>: cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>: mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>: jge    0x17bd97e

Marvee Amasi•10mo ago

The crash happens in the middle of the instruction at 0x17bd9fc, right after a call to a virtual function through the pointer stored at offset 0x70 from the address in %rax. Examining the virtual table shows that it is not corrupted and that it points to the expected function, Foo::Get(). Foo::Get() itself is simple and well-behaved (see the disassembly below). The return address on the stack ($rsp-8) points to the correct instruction after the call to Foo::Get().

Marvee Amasi•10mo ago

Disassembly of Foo::Get():

(gdb) disas 0x2d3d7b0
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax  ; Load the 32-bit value at offset 0x70 from %rdi (this) into %eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.
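For readers less used to reading virtual calls in assembly, here is a rough sketch of the kind of C++ that could compile to the call sequence and getter shown above. The real source was never posted, so `Base`, `Impl`, `value`, `call_get`, and `read_through_base` are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical reconstruction; the names and layout are assumptions.
struct Base {
    virtual ~Base() = default;
    virtual int Get() const = 0;  // dispatched through the vtable
};

struct Impl : Base {
    explicit Impl(int v) : value(v) {}
    // A trivial getter: a single load from a member of `this`, which is
    // what `mov 0x70(%rdi),%eax` does in the dump.
    int Get() const override { return value; }
    int value;
};

// Call site analogous to the one in Foo: the compiler loads the vtable
// pointer from the object, passes the object as `this`, and calls
// indirectly through a vtable slot.
int call_get(const Base* obj) {
    return obj->Get();
}

// Small helper so the sketch can be exercised end to end.
int read_through_base(int v) {
    Impl impl(v);
    return call_get(&impl);
}
```

In the listing, `mov (%rbx),%eax` plays the role of the vtable pointer load, `mov %rbx,%rdi` passes the object as `this`, and `callq *0x70(%rax)` is the indirect call. If vtable entries are 8-byte pointers, offset 0x70 would correspond to slot 14, although the sketch above has far fewer virtual functions than that.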

It's as if during the return from Foo::Get(), something increments the program counter (%rip) by 4 bytes, leading to the crash mid-instruction in Foo. Has anyone encountered anything similar? Any suggestions on how to approach debugging this further?

Solution

UC GEE•10mo ago

Marvee Amasi•10mo ago

Yeah @UC GEE, I was able to identify the issue as a data race within the Foo function. Multiple threads were accessing or modifying shared data concurrently, which caused the corruption and the crash.

Marvee Amasi•10mo ago

I synchronized the threads with a semaphore around the critical sections of Foo that access shared data, to ensure that only one thread can access that data at a time and so prevent race conditions.
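The actual fix was not posted, but a semaphore used this way might look roughly like the sketch below. It assumes a POSIX unnamed semaphore (`sem_t` from `<semaphore.h>`, available on Linux) with an initial count of 1, so it behaves as a binary semaphore. `GuardedCounter` and `run_with_semaphore` are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <semaphore.h>
#include <thread>
#include <vector>

// Hypothetical shared data guarded by a binary semaphore.
struct GuardedCounter {
    sem_t sem;
    long value = 0;

    // Initial count 1: at most one thread is inside the critical section.
    GuardedCounter() { sem_init(&sem, /*pshared=*/0, /*value=*/1); }
    ~GuardedCounter() { sem_destroy(&sem); }
    GuardedCounter(const GuardedCounter&) = delete;
    GuardedCounter& operator=(const GuardedCounter&) = delete;

    void add(long delta) {
        // sem_wait can be interrupted by a signal; retry in that case.
        while (sem_wait(&sem) == -1 && errno == EINTR) {
        }
        value += delta;   // critical section
        sem_post(&sem);   // let the next waiting thread in
    }
};

// Starts `threads` workers that each add 1 `iterations` times and
// returns the final value once all workers have finished.
long run_with_semaphore(int threads, int iterations) {
    GuardedCounter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.add(1);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.value;  // all workers joined, so this read is safe
}
```

For plain mutual exclusion like this, a `std::mutex` is usually the simpler choice in C++. A semaphore becomes more useful when more than one thread may hold the resource at once, or when the thread that releases it differs from the one that acquired it.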
