(gdb) break *0x972

Run GDB until the Application Segfaults

Tuesday, November 25, 2014

I'm trying to debug a race condition that crashes the application only once every ~15 runs of 3-4 minutes. I want a GDB prompt on that crash, and not only a core dump.

The naive way is to start gdb, run the application, wait for its termination, and restart it if it didn't crash.

A better alternative is to automate it:

(gdb) py gdb.events.exited.connect(lambda evt : gdb.post_event(lambda : gdb.execute("run")))

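Step by step: gdb.events.exited.connect() registers a callback for the exit event. The callback uses gdb.post_event() to queue an asynchronous command instead of running it directly (otherwise each restart would be nested inside the previous callback, and the recursion might end up in a stack overflow). The queued command, gdb.execute("run"), restarts the execution. When the application receives SIGSEGV instead of exiting, GDB stops on the signal, no exit event fires, and the prompt is left at the crash.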
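The one-liner restarts forever and after any kind of exit. A minimal sketch of the same idea as a script, loaded from the GDB prompt with "source", could add a cap on the number of runs and only restart after a clean exit. The names MAX_RUNS, should_restart and on_exit are made up for this sketch, and it assumes that a normal run of the application exits with code 0:

```python
# Sketch only: load inside GDB with `source <this file>`, then type `run`.
# MAX_RUNS, should_restart and on_exit are hypothetical names.
try:
    import gdb  # only available inside GDB's embedded Python
except ImportError:
    gdb = None  # lets the helper below be imported and tested outside GDB

MAX_RUNS = 50   # assumed safety cap, adjust to taste
runs_done = 1   # the first "run" is typed by hand


def should_restart(exit_code, runs_done, max_runs=MAX_RUNS):
    """Restart only after a clean exit (assumed to be code 0), under the cap."""
    return exit_code == 0 and runs_done < max_runs


def on_exit(event):
    global runs_done
    # exit_code is missing from the event when GDB could not determine it.
    exit_code = getattr(event, "exit_code", None)
    if should_restart(exit_code, runs_done):
        runs_done += 1
        # Queue "run" on GDB's event loop instead of calling it from here.
        gdb.post_event(lambda: gdb.execute("run"))


if gdb is not None:
    gdb.events.exited.connect(on_exit)
```

With this variant, a run that exits with a non-zero code also leaves the prompt available, which may or may not be what you want for your application.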

Finding a Bug with GDB (and mcGDB)

Saturday, October 25, 2014

Yesterday, I had to go back to some OpenCL code I wrote 6 months ago, for a trivial update. After making my few modifications, I ran the code to test it, and it failed.

$ bin/xspecfem3D
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7FFD9B489BC5
#1  0x4CA9DD in prepare_cleanup_device_ at prepare_mesh_constants_gpu.c:2472
#2  0x40473A in xspecfem3d at specfem3D.F90:473
#3  0x7FFD9930BB44
Segmentation fault

First reflex: run it with GDB:

$ gdb bin/xspecfem3D
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bd6bc5 in clReleaseMemObject () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
(gdb) where
#0  0x00007ffff7bd6bc5 in clReleaseMemObject () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#1  0x00000000004ca9de in prepare_cleanup_device_ () at src/gpu/prepare_mesh_constants_gpu.c:2472
#2  0x000000000040473b in xspecfem3d () at src/specfem3D/specfem3D.F90:473
#3  main () at src/specfem3D/specfem3D.F90:32
#4  0x00007ffff5a58b45 in __libc_start_main () at libc-start.c:287
#5  0x0000000000404780 in _start ()

We can see that the crash is in function clReleaseMemObject, around these lines in the source file:

#ifdef USE_OPENCL
    if (run_opencl) RELEASE_PINNED_BUFFER_OCL (station_seismo_field); // <---- segfault here
#endif
#ifdef USE_CUDA
    if (run_cuda) cudaFreeHost(mp->h_station_seismo_field);
#endif

RELEASE_PINNED_BUFFER_OCL is a function-like preprocessor macro defined like this:

#define RELEASE_PINNED_BUFFER_OCL(_buffer_) \
     clCheck(clEnqueueUnmapMemObject(mocl.command_queue, mp->h_pinned_##_buffer_, \
                                                           mp->h_##_buffer_, 0, NULL, NULL)); \
     clCheck(clReleaseMemObject (mp->h_pinned_##_buffer_))

Okay, so we crash in that second line. What is the value of cl_mem mp->h_pinned_station_seismo_field?

(gdb) print mp->h_pinned_station_seismo_field
$1 = (cl_mem) 0x3fd71113a0000000
(gdb) print *mp->h_pinned_station_seismo_field
$2 = <incomplete type>

The pointer appears to be valid, but we don't know much about it. Let's try to set a breakpoint on the function call before the segfault:

(gdb) break clEnqueueUnmapMemObject
(gdb) run
Program received signal SIGSEGV, Segmentation fault.

Hmm? We wanted to stop before the segfault; the application was not supposed to crash yet. Let's examine the state of the application when it crashed a bit further:

#ifdef USE_OPENCL
   if (run_opencl) RELEASE_PINNED_BUFFER_OCL (station_seismo_field); // <---- segfault here
#endif
#ifdef USE_CUDA
   if (run_cuda) cudaFreeHost(mp->h_station_seismo_field);
#endif

Both USE_OPENCL and USE_CUDA are defined, I know it. run_opencl should be true as well, let's check that:

(gdb) print run_opencl
$1 = 0

Oh oh, what? So I'm not in an OpenCL run? (OpenCL and CUDA are mutually exclusive.) Let's make sure that we're in CUDA:

(gdb) print run_cuda
$2 = 1

Alright, that's clear now!

* OpenCL was not supposed to run,
* it crashes in clReleaseMemObject nevertheless,
* the breakpoint on the first function of the macro didn't trigger, ...

The if test is not doing what was expected of it! Here is what it really does:

if (run_opencl) clCheck(clEnqueueUnmapMemObject(mocl.command_queue, mp->h_pinned_station_seismo_field, 
                                                           mp->h_station_seismo_field, 0, NULL, NULL)); 
     clCheck(clReleaseMemObject (mp->h_pinned_station_seismo_field));

The second function call is not part of the conditional execution ...

The problem is easy to solve, either by protecting the if:

if (run_opencl) {
    RELEASE_PINNED_BUFFER_OCL (station_seismo_field);
}

or by protecting the macro:

#define ALLOC_PINNED_BUFFER_OCL(_buffer_, _size_) do { ....} while (0)

I originally didn't protect the macro, because all my ifs are protected (and I didn't pay enough attention to being future-proof). But the code isn't mine, and someone changed the coding convention (and didn't test the OpenCL branch of the code).

In mcGDB, understanding the state of the application would have been easier. Instead of:

(gdb) print mp->h_pinned_station_seismo_field
$1 = (cl_mem) 0x3fd71113a0000000
(gdb) print *mp->h_pinned_station_seismo_field
$2 = <incomplete type>

The pointer appears to be valid, but we don't know much about it.

we could have had:

(mcGDB) opencl info buffer mp->h_pinned_station_seismo_field
No OpenCL buffer at @0x....

and instead of: run_opencl should be true as well, let's check that:

(gdb) print run_opencl
$1 = 0

we would have noticed that no OpenCL event was displayed in the debugger before the crash (like "New kernel created", "New buffer created"), or more explicitly:

(mcgdb) opencl show activity
No OpenCL activity recorded