Since quite a while now, I'm a big fan of GDB, the debugger of the GNU
project. I did my Master project
(User level DB: a debugging API for user-level thread libraries
, PDF) and PhD thesis
(Programming-Model Centric Debugging for multicore embedded systems)
on the topic of source-level interactive debugging, so I'm
getting quite confident in how to use GDB. I find its command-line
interface (CLI) easy to use, intuitive, powerful, etc., but I know
that's not a commonly shared opinion :-) So I'd like to describe, with
this 'real-life' bug, how I use it, and at the same time how I did this
debugging investigation.
The Minimal Application
I wrote for this article a minimal working example (MWE) that mimics the behavior
of the buggy application. Don't forget that at
the time I write this post, I already know where the bug is, so I'll go straight to the point,
and skip some of the dead-ends of the initial investigation. That will save us time
:-) Also note that I did not write the original code myself, and I
don't know anything about libevent. I just got
my hand on it, and wanted to get rid of this bug.
You can get the source code from this
GIST,
with the fix
or
without
(if you want to try to solve it yourself :-). Compile it with:
g++ bug.cpp -std=c++11 -levent -levent_pthreads -lpthread -g -O2
You'll need libevent debugging information if you want to reproduce
the investigation.
Its (expected) behavior is quite simple:
- Configure and enable libevent.
- Create an second thread dispatching the events. We'll call it the
dispatcher thread.
- Wait WAIT_TIME seconds to mockup the application execution.
- Decide it's time to leave and tell the other thread/libevent to
quit the event dispatching loop.
- Wait for the other thread to report its end of activity, join it
and quit.
The Visible Error
Step 5. is the one with the visible error: the main thread loops
forever, waiting for the dispatcher thread:
PID is 19425
Setting thread usage a bit late
Sleeping for 2 seconds ...
Running the dispatcher ...
And killing the application.
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
^C
It won't terminate, so we call it with a ^C.
Understand the Situation: The Downwards Investigation
In the first part of the investigation, we'll try to understand why
the application is blocked. We'll dive into the code to
figure out what drives the execution into that bug.
Let's run the application with GDB.
$ gdb a.aout
(gdb) run
PID is 24468
Setting thread usage a bit late
[New Thread 0x7ffff6b37700 (LWP 24472)]
Running the dispatcher ...
Sleeping for 2 seconds ...
And killing the application.
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
...
^C
The execution is now interrupted in the middle of the infinite loop,
so let's check the state of the threads.
(gdb) thread 1
$ where n
#0 0x... in nanosleep () from /usr/lib/libpthread.so.0
#1 0x... in sleep_for<long, std::ratio<1l, 1000l> > (...)
#2 main () at bug.cpp:82
(gdb) frame 2
(gdb) l # for list
81 while (true) {
82 std::this_thread::sleep_for(std::chrono::milliseconds(1000));
83 if (not m_active_thread.load()) {
84 break;
85 }
86 std::cout << "Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)" << std::endl;
Right, thread 1 is inside the waiting loop, that was expected.
(gdb) thread 2
(gdb )where 4
#0 0x... in epoll_wait () from /usr/lib/libc.so.6
#1 0x... in epoll_dispatch (...) at epoll.c:407
#2 0x... in event_base_loop (...) at event.c:1633
#3 0x... in server_thread () at bug.cpp:21
(More stack frames follow...)
Hum, thread 2 is still blocked in epoll_wait, that's why it
doesn't return from event_base_loop and unlocks the main thread from
the waiting loop:
void server_thread(){
std::cout << "Running the dispatcher ..." << std::endl;
event_base_dispatch(base); // still in this function
event_base_free(base);
m_active_thread.store(false);
std::cout << "Dead" << std::endl;
}
Clue 1: the dispatcher thread is blocked in epoll_wait system
call.
Side remark: I won't go into details into
the probe effect that can happen here, check my previous post.
New lead: why the dispatcher thread doesn't quit the dispatching
loop?
A quick search on the Internet tells us that event_base_loopbreak should
break that loop.
Hypothese to verify: Is event_base_loopbreak actually called?
(gdb) break event_base_loopbreak
(gdb) run
...
Breakpoint 1, event_base_loopbreak (event_base=0x617c50) at event.c:1511
(gdb) up
#1 0x... in main () at bug.cpp:79
79 event_base_loopbreak(base);
Yes, at bug.cpp:79.
Hypothese to verify: Is event_base_loopbreak action actually done?
int
event_base_loopbreak(struct event_base *event_base)
{
int r = 0;
if (event_base == NULL)
return (-1);
EVBASE_ACQUIRE_LOCK(event_base, th_base_lock);
event_base->event_break = 1;
if (EVBASE_NEED_NOTIFY(event_base)) {
r = evthread_notify_base(event_base);
} else {
r = (0);
}
EVBASE_RELEASE_LOCK(event_base, th_base_lock);
return r;
}
Is base not NULL? (otherwise we'll return too quickly)
(gdb) p base # for print
$1 = (event_base *) 0x617c50
(gdb) p *base
$2 = {
evsel = 0x7ffff7dd9140 <epollops>,
evbase = 0x617ee0,
changelist = {
changes = 0x0,
n_changes = 0,
changes_size = 0
},
evsigsel = 0x7ffff7dd9180 <evsigops>,
sig = {
ev_signal = {
...
Yes, and its content seem right (base->evsel is epollops, which
makes sense, so it's not a dangling pointer).
So the event_break flag will be set:
(gdb) n # for next
1513 if (event_base == NULL)
(gdb) n
1516 EVBASE_ACQUIRE_LOCK(event_base, th_base_lock);
(gdb) n
1519 if (EVBASE_NEED_NOTIFY(event_base)) {
(gdb) n
1517 event_base->event_break = 1;
Wait, what!? event_break = 1 should be before NEED_NOTIFY! Don't
pay attention to that, it's because of compiler
optimizations. Recompile libevent with -O0 flag or ignore it!
(gdb) n
1519 if (EVBASE_NEED_NOTIFY(event_base)) {
This one is the real test ;-)
Now let's make sure that it's the right flag we set. With a quick
search in event.c, we found:
int
event_base_got_break(struct event_base *event_base)
{
int res;
EVBASE_ACQUIRE_LOCK(event_base, th_base_lock);
res = event_base->event_break;
EVBASE_RELEASE_LOCK(event_base, th_base_lock);
return res;
}
That looks good. Is it ever called?
(gdb) start
(gdb) b event_base_got_break
Breakpoint 2 at 0x7ffff7ba05a0: file event.c, line 1530.
(gdb) cont
Running the dispatcher ...
Sleeping for 2 seconds ...
And killing the application.
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
^C
Program received signal SIGINT, Interrupt.
0x00007ffff778465d in nanosleep () from /usr/lib/libpthread.so.0
Not before the infinite loop apparently,
(gdb) cont
Dead
[Thread 0x7ffff6b39700 (LWP 20947) exited]
Bye everybody.
[Inferior 1 (process 20852) exited normally]
... and not after either. (That's the probe effect we see here). So that's certainly a dead-end.
That could also be the reason of the bug, but the fact that the debugger
interrupt makes the application finish properly tells me to let that idea aside.
Next candidate?
int
event_base_loop(struct event_base *base, int flags)
{
...
base->event_gotterm = base->event_break = 0;
...
while (!done) {
if (base->event_break) {
break;
}
...
res = evsel->dispatch(base, tv_p);
...
}
}
(gdb) p evsel->dispatch
$1 = (int (*)(struct event_base *, struct timeval *)) 0x7ffff7bb7d80 <epoll_dispatch>
Yes, that seems to be the right flag: it breaks the loop when set, and
just a bit after, the epoll_dispatch function is called (that's the
one that blocks forever on epoll_wait).
New lead: How is epoll supposed to quit its polling loop?
(gdb) thread 2
Thread ID 2 not known.
Damn, it's gone, that's the probe effect ...
(gdb) run # that restarts the execution from the beginning
...
Sleeping for 2 seconds ...
^C
Program received signal SIGINT, Interrupt.
0x... in nanosleep () from /usr/lib/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff6b39700 (LWP 3962))]
#0 0x... in epoll_wait () from /usr/lib/libc.so.6
(gdb) where 3
#0 0x... in epoll_wait () at /usr/lib/libc.so.6
#1 0x... in epoll_dispatch (base=0x617c50, tv=<optimized out>) at epoll.c:407
#2 0x... in event_base_loop (base=0x617c50, flags=0) at event.c:1629
#3 0x... in server_thread() () at bug.cpp:21
(gdb) up
#1 0x... in epoll_dispatch (base=0x617c50, tv=<optimized out>) at epoll.c:407
407 res = epoll_wait(epollop->epfd, events, epollop->nevents, timeout);
(gdb) p timeout
$1 = <optimized out>
Nope, we can't get the timeout value ... Let's try to see how
it's computed:
(gdb) l 397
391 long timeout = -1;
392
393 if (tv != NULL) {
394 timeout = evutil_tv_to_msec(tv);
395 if (timeout < 0 || timeout > MAX_EPOLL_TIMEOUT_MSEC) {
396 /* Linux kernels can wait forever if the timeout is
397 * too big; see comment on MAX_EPOLL_TIMEOUT_MSEC. */
398 timeout = MAX_EPOLL_TIMEOUT_MSEC;
399 }
400 }
(gdb) p tv
$2 = <optimized out>
Nope again, but let's try to outsmart it!
(gdb) up
#2 0x00007ffff7ba2a0d in event_base_loop (base=0x617c50, flags=0) at event.c:1629
1629 res = evsel->dispatch(base, tv_p);
(gdb) p tv_p
$3 = (struct timeval *) 0x0
Soon enough! So timeout remains -1. And that's a valid value for
epoll_wait:
The timeout argument specifies the minimum number of milliseconds
that epoll_wait() will block. [...] Specifying a timeout of -1 causes
epoll_wait() to block indefinitely [...].
So epoll_wait won't return until it receives something from its event
listeners ...
Clue 2: The dispatcher thread waits for a notification.
We saw something like that earlier on, where was that ... Oh, yes,
EVBASE_NEED_NOTIFY(event_base) inside event_base_loopbreak. But
the name looks like a preprocessor macro ... Indeed:
(gdb) p EVBASE_NEED_NOTIFY
No symbol "EVBASE_NEED_NOTIFY" in current context.
So let's get back to the source:
$ cd $PATH_TO_LIBEVENT_SRC
$ grep "define EVBASE_NEED_NOTIFY" . --after-context=2 --include='*.h' -r
./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) \
./evthread-internal.h- (_evthread_id_fn != NULL && \
./evthread-internal.h- (base)->running_loop && \
--
./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) \
./evthread-internal.h- ((base)->running_loop && \
./evthread-internal.h- ((base)->th_owner_id != _evthreadimpl_get_id()))
--
./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) 0
./evthread-internal.h-#define EVBASE_ACQUIRE_LOCK(base, lock) _EVUTIL_NIL_STMT
./evthread-internal.h-#define EVBASE_RELEASE_LOCK(base, lock) _EVUTIL_NIL_STMT
According to the macro names, it seems to be the first one. If you can
recompile the code, you can add a #error to ensure it:
#error "Compilation will crash if we pass here."
#define EVBASE_NEED_NOTIFY(base) ...
If you can recompile the code, while you're at it, you can do a bit of
printf debugging (that's what I did :-/) to get it's value:
printf("need notify ? %d\n", EVBASE_NEED_NOTIFY(event_base));
Otherwise, let's check the values with GDB:
(gdb) p _evthread_id_fn
$5 = (unsigned long (*)(void)) 0x7ffff7991d50 <evthread_posix_get_id>
(gdb) p base->running_loop
$6 = 1
Looks good. We can continue that investigation:
(gdb) b evthread_notify_base
(gdb) continue
Breakpoint 2, evthread_notify_base (base=0x617c50) at event.c:2039
2039 {
Where are we? at the begiining of this function:
(gdb) l
2038 evthread_notify_base(struct event_base *base)
2039 {
2040 EVENT_BASE_ASSERT_LOCKED(base);
2041 if (!base->th_notify_fn) {
2042 return -1;
2043 }
2044 if (base->is_notify_pending)
2045 return 0;
2046
2047 base->is_notify_pending = 1;
2048 return base->th_notify_fn(base);
2049 }
Great, so, do we have a notification function?
(gdb) p base->th_notify_fn
$7 = (int (*)(struct event_base *)) 0x0
No, there is no notification function!
Now it makes sense:
Intermediate conclusion regarding the visible error:
- the dispatcher thread is blocked forever:
- and waits for an epoll notification that never comes,
- then it should check a flag for breaking its listening loop
- the main thread sets the loop-breaking flag,
- but fails to notify epoll, so the dispatcher thread never checks the flag
- then it waits forever for the dispatcher thread completion.
Problem solved! Or maybe not, maybe we just understood the
situation.
Understanding the Reason of the Invalid State: Backwards Investigation
Now that we are at the bottom of the callstacks, we need to go
backwards in time: The state of the application is invalid, the notify
function pointer has to be set, so why is it empty? which
decision(s) set it to NULL instead of the actual notification
function?
New lead: Why is this th_notify_fn empty?
$ cd $PATH_TO_LIBEVENT_SRC
grep th_notify_fn * -n
grep: autom4te.cache: Is a directory
grep: compat: Is a directory
event.c:2037: if (!base->th_notify_fn) {
event.c:2045: return base->th_notify_fn(base);
event.c:2829: base->th_notify_fn = notify;
event-internal.h:293: int (*th_notify_fn)(struct event_base *base);
The instruction at event.c:2829 looks interesting. It is inside
function evthread_make_base_notifiable. Is this function ever called?
(gdb) start
(gdb) break evthread_make_base_notifiable
Breakpoint 2 at 0x7ffff7ba6220: file event.c, line 2780.
(gdb) cont
PID is 28680
Setting thread usage a bit late
[New Thread 0x7ffff6b39700 (LWP 28778)]
Running the dispatcher ...
Sleeping for 2 seconds ...
And killing the application.
Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)
^C
Apparently, no. So th_notify_fn is empty because it is never set.
New lead: How to 'make the base notifiable'?
$ grep evthread_make_base_notifiable *.c -n
event.c:640: r = evthread_make_base_notifiable(base);
event.c:884: res = evthread_make_base_notifiable(base);
event.c:2775:evthread_make_base_notifiable(struct event_base *base)
Line 2775 is the function definition, line 884 is inside a
function called event_reinit, so that's certainly not what we
can. Line 640 looks more interesting:
struct event_base *
event_base_new_with_config(const struct event_config *cfg)
{
...
#ifndef _EVENT_DISABLE_THREAD_SUPPORT
if (EVTHREAD_LOCKING_ENABLED() &&
(!cfg || !(cfg->flags & EVENT_BASE_FLAG_NOLOCK))) {
int r;
EVTHREAD_ALLOC_LOCK(base->th_base_lock,
EVTHREAD_LOCKTYPE_RECURSIVE);
base->defer_queue.lock = base->th_base_lock;
EVTHREAD_ALLOC_COND(base->current_event_cond);
r = evthread_make_base_notifiable(base);
if (r<0) {
event_warnx("%s: Unable to make base notifiable.", __func__);
event_base_free(base);
return NULL;
}
}
#endif
Macro _EVENT_DISABLE_THREAD_SUPPORT is not defined (so the block
is executed), but what is EVTHREAD_LOCKING_ENABLED()?
grep EVTHREAD_LOCKING_ENABLED * --after-context=1 --include='*.h'
evthread-internal.h:#define EVTHREAD_LOCKING_ENABLED() \
evthread-internal.h- (_evthread_lock_fns.lock != NULL)
... and this thread-locking function table is ... empty!
(gdb) p _evthread_lock_fns.lock
$1 = (int (*)(unsigned int, void *)) 0x0
(gdb) p _evthread_lock_fns
$2 = {
lock_api_version = 0,
supported_locktypes = 0,
alloc = 0x0,
free = 0x0,
lock = 0x0,
unlock = 0x0
}
So to make the base notifiable ...
New lead: Thread locking should be enabled.
Let's check if it already done ..., and let's do it with GDB, instead
of grep, with a memory watchpoint:
(gdb) start
(gdb) p _evthread_lock_fns
$1 = { lock_api_version = 0, supported_locktypes = 0, ... }
# that's the defaut value
(gdb) watch _evthread_lock_fns
Watchpoint 2: _evthread_lock_fns
(gdb) run
... takes forever ...
Wait, there is a problem here, it takes forever. And
it doesn't say 'Hardware watchpoint', although I know my CPU is
capable of it ...
Side remark: That's because _evthread_lock_fns is a structure,
and HW watchpoints can only watch a few addresses. So software
watchpoints are used instead, which are very very very slow (each time
a machine instruction is executed, GDB analyses the memory
address(es) involved, and continues the execution if they are not
covered by a watchpoint).
Let's try again:
(gdb) start
(gdb) p &_evthread_lock_fns
$1 = (struct evthread_lock_callbacks *) 0x7ffff7dda320 <_evthread_lock_fns>
(gdb) watch *0x7ffff7dda320
Hardware watchpoint 2: *0x7ffff7dda320
# good, a HW breakpoint
# it only watches the first bits of the structure, but that's enough
# OR that would do the same:
(gdb) watch _evthread_lock_fns->lock_api_version
Hardware watchpoint 3: _evthread_lock_fns->lock_api_version
(gdb) cont
Hardware watchpoint 2: *0x7ffff7dda320
Old value = 0
New value = 1
Hardware watchpoint 3: _evthread_lock_fns->lock_api_version
Old value = 0
New value = 1
0x00007ffff7ba7472 in evthread_set_lock_callbacks (cbs=cbs@entry=0x7fffffffe4b0) at evthread.c:101
101 memcpy(target, cbs, sizeof(_evthread_lock_fns));
(gdb) p *cbs
$3 = {
lock_api_version = 1,
supported_locktypes = 1,
alloc = 0x7ffff7991e60 <evthread_posix_lock_alloc>,
free = 0x7ffff7991e40 <evthread_posix_lock_free>,
lock = 0x7ffff7991e20 <evthread_posix_lock>,
unlock = 0x7ffff7991e10 <evthread_posix_unlock>
}
So we do set the thread locking functions ...
New lead: but why were they empty earlier on? when/where do we
set them?
(gdb) where
#0 0x00007ffff7ba7390 in evthread_set_lock_callbacks (cbs=cbs@entry=0x7fffffffe4b0) at evthread.c:76
#1 0x00007ffff799202b in evthread_use_pthreads () at evthread_pthread.c:185
#2 0x000000000040162a in main() () at bug.cpp:69
oh, that's from the main, at bug.cpp:69. Where were we when we saw
them empty?
(gdb) break event_base_new_with_config
(gdb) run
(gdb) where
#0 0x00007ffff7ba63c0 in event_base_new_with_config (cfg=cfg@entry=0x617c20) at event.c:553
#1 0x00007ffff7ba680b in event_base_new () at event.c:452
#2 0x0000000000401553 in main() () at bug.cpp:50
Oh, I see, we were at bug.cpp:50, so 19 lines before.
Defect explanation and fix: it happens that the application
created and configured libevent's base (event_base_new())
before configuring its thread-support. So the notification could not
be sent, and hence the dispatcher thread remained blocked in epoll
forever.
This is obviously an 'application bug', libevent was used
incorrectly. Furthermore, event_base_loopbreak did returned at error
code, but that was not checked and hence never noticed. We can verify that
very quickly with GDB:
(gdb) start
(gdb) break bug.cpp:79
(gdb) cont
Breakpoint 2, main () at bug.cpp:79
79 event_base_loopbreak(base);
(gdb) p event_base_loopbreak(base)
$1 = -1
This way, we ask GDB to call the function event_base_loopbreak and print its result.
Another way is to set a breakpoint inside the function, and ask GDB to finish it:
(gdb) break event_base_loopbreak
(gdb) cont
Breakpoint 2, event_base_loopbreak (event_base=0x617c50) at event.c:1511
1511 {
(gdb) finish
Run till exit from #0 event_base_loopbreak (event_base=0x617c50) at event.c:1511
main () at bug.cpp:82
82 std::this_thread::sleep_for(std::chrono::milliseconds(1000));
Value returned is $1 = -1
Both ways show that the return value is -1...
Of course the application from which I base this example was in
development, with no bug-free guarantee, so no one to blame! I
actually quite enjoyed this work of investigation :)
You can change #define DO_BUG to false at the beginning
the source code, and see that is solves the problem.
Conclusion
In this article I tried to highlight the usage of GDB with a real
debugging example, on which I spend my last Friday afternoon. I showed
the way I performed this investigation, although I must admit that I
used a bit more printf than I show here. That's mainly because of
the
probe effect
that occurs with this bug, that complicates the use of GDB.
If ever you don't like interactive debuggers, or you would have done
this debugging with different tools and techniques, let me know, I'm
curious! Just keep in mind that I didn't write the original code and I
didn't want to spend my afternoon reading libevent documentation!
(I didn't even know for sure that the bug was related to this library).
Also, this is not an 'advanced' usage of GDB, I only used the very
basic functionalities. Maybe one day I'll encounter a trickier bug
that requires a more subtle juggling with GDB commands, and I'll
post a new article on the topic!