Debugging is like being the detective in a crime movie ...
Monday, October 12, 2015 - No comments
(from @fortes on Tweeter)
(from @fortes on Tweeter)
Since quite a while now, I'm a big fan of GDB, the debugger of the GNU project. I did my Master project (User level DB: a debugging API for user-level thread libraries , PDF) and PhD thesis (Programming-Model Centric Debugging for multicore embedded systems) on the topic of source-level interactive debugging, so I'm getting quite confident in how to use GDB. I find its command-line interface (CLI) easy to use, intuitive, powerful, etc., but I know that's not a commonly shared opinion :-) So I'd like to describe, with this 'real-life' bug, how I use it, and at the same time how I did this debugging investigation.
I wrote for this article a minimal working example (MWE) that mimics the behavior of the buggy application. Don't forget that at the time I write this post, I already know where the bug is, so I'll go straight to the point, and skip some of the dead-ends of the initial investigation. That will save us time :-) Also note that I did not write the original code myself, and I don't know anything about libevent. I just got my hand on it, and wanted to get rid of this bug.
You can get the source code from this GIST, with the fix or without (if you want to try to solve it yourself :-). Compile it with:
g++ bug.cpp -std=c++11 -levent -levent_pthreads -lpthread -g -O2
You'll need libevent debugging information if you want to reproduce the investigation.
Its (expected) behavior is quite simple:
Step 5. is the one with the visible error: the main thread loops forever, waiting for the dispatcher thread:
PID is 19425 Setting thread usage a bit late Sleeping for 2 seconds ... Running the dispatcher ... And killing the application. Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) ^C
It won't terminate, so we call it with a ^C.
In the first part of the investigation, we'll try to understand why the application is blocked. We'll dive into the code to figure out what drives the execution into that bug.
Let's run the application with GDB.
$ gdb a.aout (gdb) run PID is 24468 Setting thread usage a bit late [New Thread 0x7ffff6b37700 (LWP 24472)] Running the dispatcher ... Sleeping for 2 seconds ... And killing the application. Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) ... ^C
The execution is now interrupted in the middle of the infinite loop, so let's check the state of the threads.
(gdb) thread 1 $ where n #0 0x... in nanosleep () from /usr/lib/libpthread.so.0 #1 0x... in sleep_for<long, std::ratio<1l, 1000l> > (...) #2 main () at bug.cpp:82 (gdb) frame 2 (gdb) l # for list 81 while (true) { 82 std::this_thread::sleep_for(std::chrono::milliseconds(1000)); 83 if (not m_active_thread.load()) { 84 break; 85 } 86 std::cout << "Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT)" << std::endl;
Right, thread 1 is inside the waiting loop, that was expected.
(gdb) thread 2 (gdb )where 4 #0 0x... in epoll_wait () from /usr/lib/libc.so.6 #1 0x... in epoll_dispatch (...) at epoll.c:407 #2 0x... in event_base_loop (...) at event.c:1633 #3 0x... in server_thread () at bug.cpp:21 (More stack frames follow...)
Hum, thread 2 is still blocked in epoll_wait, that's why it doesn't return from event_base_loop and unlocks the main thread from the waiting loop:
void server_thread(){ std::cout << "Running the dispatcher ..." << std::endl; event_base_dispatch(base); // still in this function event_base_free(base); m_active_thread.store(false); std::cout << "Dead" << std::endl; }
Clue 1: the dispatcher thread is blocked in epoll_wait system call.
Side remark: I won't go into details into the probe effect that can happen here, check my previous post.
New lead: why the dispatcher thread doesn't quit the dispatching loop?
A quick search on the Internet tells us that event_base_loopbreak should break that loop.
Hypothese to verify: Is event_base_loopbreak actually called?
(gdb) break event_base_loopbreak (gdb) run ... Breakpoint 1, event_base_loopbreak (event_base=0x617c50) at event.c:1511 (gdb) up #1 0x... in main () at bug.cpp:79 79 event_base_loopbreak(base);
Yes, at bug.cpp:79.
Hypothese to verify: Is event_base_loopbreak action actually done?
int event_base_loopbreak(struct event_base *event_base) { int r = 0; if (event_base == NULL) return (-1); EVBASE_ACQUIRE_LOCK(event_base, th_base_lock); event_base->event_break = 1; if (EVBASE_NEED_NOTIFY(event_base)) { r = evthread_notify_base(event_base); } else { r = (0); } EVBASE_RELEASE_LOCK(event_base, th_base_lock); return r; }
Is base not NULL? (otherwise we'll return too quickly)
(gdb) p base # for print $1 = (event_base *) 0x617c50 (gdb) p *base $2 = { evsel = 0x7ffff7dd9140 <epollops>, evbase = 0x617ee0, changelist = { changes = 0x0, n_changes = 0, changes_size = 0 }, evsigsel = 0x7ffff7dd9180 <evsigops>, sig = { ev_signal = { ...
Yes, and its content seem right (base->evsel is epollops, which makes sense, so it's not a dangling pointer).
So the event_break flag will be set:
(gdb) n # for next 1513 if (event_base == NULL) (gdb) n 1516 EVBASE_ACQUIRE_LOCK(event_base, th_base_lock); (gdb) n 1519 if (EVBASE_NEED_NOTIFY(event_base)) { (gdb) n 1517 event_base->event_break = 1;
Wait, what!? event_break = 1 should be before NEED_NOTIFY! Don't pay attention to that, it's because of compiler optimizations. Recompile libevent with -O0 flag or ignore it!
(gdb) n 1519 if (EVBASE_NEED_NOTIFY(event_base)) {
This one is the real test ;-)
Now let's make sure that it's the right flag we set. With a quick search in event.c, we found:
int event_base_got_break(struct event_base *event_base) { int res; EVBASE_ACQUIRE_LOCK(event_base, th_base_lock); res = event_base->event_break; EVBASE_RELEASE_LOCK(event_base, th_base_lock); return res; }
That looks good. Is it ever called?
(gdb) start (gdb) b event_base_got_break Breakpoint 2 at 0x7ffff7ba05a0: file event.c, line 1530. (gdb) cont Running the dispatcher ... Sleeping for 2 seconds ... And killing the application. Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) ^C Program received signal SIGINT, Interrupt. 0x00007ffff778465d in nanosleep () from /usr/lib/libpthread.so.0
Not before the infinite loop apparently,
(gdb) cont Dead [Thread 0x7ffff6b39700 (LWP 20947) exited] Bye everybody. [Inferior 1 (process 20852) exited normally]
... and not after either. (That's the probe effect we see here). So that's certainly a dead-end.
That could also be the reason of the bug, but the fact that the debugger interrupt makes the application finish properly tells me to let that idea aside.
Next candidate?
int event_base_loop(struct event_base *base, int flags) { ... base->event_gotterm = base->event_break = 0; ... while (!done) { if (base->event_break) { break; } ... res = evsel->dispatch(base, tv_p); ... } } (gdb) p evsel->dispatch $1 = (int (*)(struct event_base *, struct timeval *)) 0x7ffff7bb7d80 <epoll_dispatch>
Yes, that seems to be the right flag: it breaks the loop when set, and just a bit after, the epoll_dispatch function is called (that's the one that blocks forever on epoll_wait).
New lead: How is epoll supposed to quit its polling loop?
(gdb) thread 2 Thread ID 2 not known.
Damn, it's gone, that's the probe effect ...
(gdb) run # that restarts the execution from the beginning ... Sleeping for 2 seconds ... ^C Program received signal SIGINT, Interrupt. 0x... in nanosleep () from /usr/lib/libpthread.so.0 (gdb) thread 2 [Switching to thread 2 (Thread 0x7ffff6b39700 (LWP 3962))] #0 0x... in epoll_wait () from /usr/lib/libc.so.6 (gdb) where 3 #0 0x... in epoll_wait () at /usr/lib/libc.so.6 #1 0x... in epoll_dispatch (base=0x617c50, tv=<optimized out>) at epoll.c:407 #2 0x... in event_base_loop (base=0x617c50, flags=0) at event.c:1629 #3 0x... in server_thread() () at bug.cpp:21 (gdb) up #1 0x... in epoll_dispatch (base=0x617c50, tv=<optimized out>) at epoll.c:407 407 res = epoll_wait(epollop->epfd, events, epollop->nevents, timeout); (gdb) p timeout $1 = <optimized out>
Nope, we can't get the timeout value ... Let's try to see how it's computed:
(gdb) l 397 391 long timeout = -1; 392 393 if (tv != NULL) { 394 timeout = evutil_tv_to_msec(tv); 395 if (timeout < 0 || timeout > MAX_EPOLL_TIMEOUT_MSEC) { 396 /* Linux kernels can wait forever if the timeout is 397 * too big; see comment on MAX_EPOLL_TIMEOUT_MSEC. */ 398 timeout = MAX_EPOLL_TIMEOUT_MSEC; 399 } 400 } (gdb) p tv $2 = <optimized out>
Nope again, but let's try to outsmart it!
(gdb) up #2 0x00007ffff7ba2a0d in event_base_loop (base=0x617c50, flags=0) at event.c:1629 1629 res = evsel->dispatch(base, tv_p); (gdb) p tv_p $3 = (struct timeval *) 0x0
Soon enough! So timeout remains -1. And that's a valid value for epoll_wait:
The timeout argument specifies the minimum number of milliseconds that epoll_wait() will block. [...] Specifying a timeout of -1 causes epoll_wait() to block indefinitely [...].
So epoll_wait won't return until it receives something from its event listeners ...
Clue 2: The dispatcher thread waits for a notification.
We saw something like that earlier on, where was that ... Oh, yes, EVBASE_NEED_NOTIFY(event_base) inside event_base_loopbreak. But the name looks like a preprocessor macro ... Indeed:
(gdb) p EVBASE_NEED_NOTIFY No symbol "EVBASE_NEED_NOTIFY" in current context.
So let's get back to the source:
$ cd $PATH_TO_LIBEVENT_SRC $ grep "define EVBASE_NEED_NOTIFY" . --after-context=2 --include='*.h' -r ./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) \ ./evthread-internal.h- (_evthread_id_fn != NULL && \ ./evthread-internal.h- (base)->running_loop && \ -- ./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) \ ./evthread-internal.h- ((base)->running_loop && \ ./evthread-internal.h- ((base)->th_owner_id != _evthreadimpl_get_id())) -- ./evthread-internal.h:#define EVBASE_NEED_NOTIFY(base) 0 ./evthread-internal.h-#define EVBASE_ACQUIRE_LOCK(base, lock) _EVUTIL_NIL_STMT ./evthread-internal.h-#define EVBASE_RELEASE_LOCK(base, lock) _EVUTIL_NIL_STMT
According to the macro names, it seems to be the first one. If you can recompile the code, you can add a #error to ensure it:
#error "Compilation will crash if we pass here." #define EVBASE_NEED_NOTIFY(base) ...
If you can recompile the code, while you're at it, you can do a bit of printf debugging (that's what I did :-/) to get it's value:
printf("need notify ? %d\n", EVBASE_NEED_NOTIFY(event_base));
Otherwise, let's check the values with GDB:
(gdb) p _evthread_id_fn $5 = (unsigned long (*)(void)) 0x7ffff7991d50 <evthread_posix_get_id> (gdb) p base->running_loop $6 = 1
Looks good. We can continue that investigation:
(gdb) b evthread_notify_base (gdb) continue Breakpoint 2, evthread_notify_base (base=0x617c50) at event.c:2039 2039 {
Where are we? at the begiining of this function:
(gdb) l 2038 evthread_notify_base(struct event_base *base) 2039 { 2040 EVENT_BASE_ASSERT_LOCKED(base); 2041 if (!base->th_notify_fn) { 2042 return -1; 2043 } 2044 if (base->is_notify_pending) 2045 return 0; 2046 2047 base->is_notify_pending = 1; 2048 return base->th_notify_fn(base); 2049 }
Great, so, do we have a notification function?
(gdb) p base->th_notify_fn $7 = (int (*)(struct event_base *)) 0x0
No, there is no notification function!
Now it makes sense:
Intermediate conclusion regarding the visible error:
Problem solved! Or maybe not, maybe we just understood the situation.
Now that we are at the bottom of the callstacks, we need to go backwards in time: The state of the application is invalid, the notify function pointer has to be set, so why is it empty? which decision(s) set it to NULL instead of the actual notification function?
New lead: Why is this th_notify_fn empty?
$ cd $PATH_TO_LIBEVENT_SRC grep th_notify_fn * -n grep: autom4te.cache: Is a directory grep: compat: Is a directory event.c:2037: if (!base->th_notify_fn) { event.c:2045: return base->th_notify_fn(base); event.c:2829: base->th_notify_fn = notify; event-internal.h:293: int (*th_notify_fn)(struct event_base *base);
The instruction at event.c:2829 looks interesting. It is inside function evthread_make_base_notifiable. Is this function ever called?
(gdb) start (gdb) break evthread_make_base_notifiable Breakpoint 2 at 0x7ffff7ba6220: file event.c, line 2780. (gdb) cont PID is 28680 Setting thread usage a bit late [New Thread 0x7ffff6b39700 (LWP 28778)] Running the dispatcher ... Sleeping for 2 seconds ... And killing the application. Waiting for the thread to die ... (try ^z/fg, SIGSTOP/SIGINT) ^C
Apparently, no. So th_notify_fn is empty because it is never set.
New lead: How to 'make the base notifiable'?
$ grep evthread_make_base_notifiable *.c -n event.c:640: r = evthread_make_base_notifiable(base); event.c:884: res = evthread_make_base_notifiable(base); event.c:2775:evthread_make_base_notifiable(struct event_base *base)
Line 2775 is the function definition, line 884 is inside a function called event_reinit, so that's certainly not what we can. Line 640 looks more interesting:
struct event_base * event_base_new_with_config(const struct event_config *cfg) { ... #ifndef _EVENT_DISABLE_THREAD_SUPPORT if (EVTHREAD_LOCKING_ENABLED() && (!cfg || !(cfg->flags & EVENT_BASE_FLAG_NOLOCK))) { int r; EVTHREAD_ALLOC_LOCK(base->th_base_lock, EVTHREAD_LOCKTYPE_RECURSIVE); base->defer_queue.lock = base->th_base_lock; EVTHREAD_ALLOC_COND(base->current_event_cond); r = evthread_make_base_notifiable(base); if (r<0) { event_warnx("%s: Unable to make base notifiable.", __func__); event_base_free(base); return NULL; } } #endif
Macro _EVENT_DISABLE_THREAD_SUPPORT is not defined (so the block is executed), but what is EVTHREAD_LOCKING_ENABLED()?
grep EVTHREAD_LOCKING_ENABLED * --after-context=1 --include='*.h' evthread-internal.h:#define EVTHREAD_LOCKING_ENABLED() \ evthread-internal.h- (_evthread_lock_fns.lock != NULL)
... and this thread-locking function table is ... empty!
(gdb) p _evthread_lock_fns.lock $1 = (int (*)(unsigned int, void *)) 0x0 (gdb) p _evthread_lock_fns $2 = { lock_api_version = 0, supported_locktypes = 0, alloc = 0x0, free = 0x0, lock = 0x0, unlock = 0x0 }
So to make the base notifiable ...
New lead: Thread locking should be enabled.
Let's check if it already done ..., and let's do it with GDB, instead of grep, with a memory watchpoint:
(gdb) start (gdb) p _evthread_lock_fns $1 = { lock_api_version = 0, supported_locktypes = 0, ... } # that's the defaut value (gdb) watch _evthread_lock_fns Watchpoint 2: _evthread_lock_fns (gdb) run ... takes forever ...
Wait, there is a problem here, it takes forever. And it doesn't say 'Hardware watchpoint', although I know my CPU is capable of it ...
Side remark: That's because _evthread_lock_fns is a structure, and HW watchpoints can only watch a few addresses. So software watchpoints are used instead, which are very very very slow (each time a machine instruction is executed, GDB analyses the memory address(es) involved, and continues the execution if they are not covered by a watchpoint).
Let's try again:
(gdb) start (gdb) p &_evthread_lock_fns $1 = (struct evthread_lock_callbacks *) 0x7ffff7dda320 <_evthread_lock_fns> (gdb) watch *0x7ffff7dda320 Hardware watchpoint 2: *0x7ffff7dda320 # good, a HW breakpoint # it only watches the first bits of the structure, but that's enough # OR that would do the same: (gdb) watch _evthread_lock_fns->lock_api_version Hardware watchpoint 3: _evthread_lock_fns->lock_api_version (gdb) cont Hardware watchpoint 2: *0x7ffff7dda320 Old value = 0 New value = 1 Hardware watchpoint 3: _evthread_lock_fns->lock_api_version Old value = 0 New value = 1 0x00007ffff7ba7472 in evthread_set_lock_callbacks (cbs=cbs@entry=0x7fffffffe4b0) at evthread.c:101 101 memcpy(target, cbs, sizeof(_evthread_lock_fns)); (gdb) p *cbs $3 = { lock_api_version = 1, supported_locktypes = 1, alloc = 0x7ffff7991e60 <evthread_posix_lock_alloc>, free = 0x7ffff7991e40 <evthread_posix_lock_free>, lock = 0x7ffff7991e20 <evthread_posix_lock>, unlock = 0x7ffff7991e10 <evthread_posix_unlock> }
So we do set the thread locking functions ...
New lead: but why were they empty earlier on? when/where do we set them?
(gdb) where #0 0x00007ffff7ba7390 in evthread_set_lock_callbacks (cbs=cbs@entry=0x7fffffffe4b0) at evthread.c:76 #1 0x00007ffff799202b in evthread_use_pthreads () at evthread_pthread.c:185 #2 0x000000000040162a in main() () at bug.cpp:69
oh, that's from the main, at bug.cpp:69. Where were we when we saw them empty?
(gdb) break event_base_new_with_config (gdb) run (gdb) where #0 0x00007ffff7ba63c0 in event_base_new_with_config (cfg=cfg@entry=0x617c20) at event.c:553 #1 0x00007ffff7ba680b in event_base_new () at event.c:452 #2 0x0000000000401553 in main() () at bug.cpp:50
Oh, I see, we were at bug.cpp:50, so 19 lines before.
Defect explanation and fix: it happens that the application created and configured libevent's base (event_base_new()) before configuring its thread-support. So the notification could not be sent, and hence the dispatcher thread remained blocked in epoll forever.
This is obviously an 'application bug', libevent was used incorrectly. Furthermore, event_base_loopbreak did returned at error code, but that was not checked and hence never noticed. We can verify that very quickly with GDB:
(gdb) start (gdb) break bug.cpp:79 (gdb) cont Breakpoint 2, main () at bug.cpp:79 79 event_base_loopbreak(base); (gdb) p event_base_loopbreak(base) $1 = -1
This way, we ask GDB to call the function event_base_loopbreak and print its result. Another way is to set a breakpoint inside the function, and ask GDB to finish it:
(gdb) break event_base_loopbreak (gdb) cont Breakpoint 2, event_base_loopbreak (event_base=0x617c50) at event.c:1511 1511 { (gdb) finish Run till exit from #0 event_base_loopbreak (event_base=0x617c50) at event.c:1511 main () at bug.cpp:82 82 std::this_thread::sleep_for(std::chrono::milliseconds(1000)); Value returned is $1 = -1
Both ways show that the return value is -1...
Of course the application from which I base this example was in development, with no bug-free guarantee, so no one to blame! I actually quite enjoyed this work of investigation :)
You can change #define DO_BUG to false at the beginning the source code, and see that is solves the problem.
In this article I tried to highlight the usage of GDB with a real debugging example, on which I spend my last Friday afternoon. I showed the way I performed this investigation, although I must admit that I used a bit more printf than I show here. That's mainly because of the probe effect that occurs with this bug, that complicates the use of GDB.
If ever you don't like interactive debuggers, or you would have done this debugging with different tools and techniques, let me know, I'm curious! Just keep in mind that I didn't write the original code and I didn't want to spend my afternoon reading libevent documentation! (I didn't even know for sure that the bug was related to this library).
Also, this is not an 'advanced' usage of GDB, I only used the very basic functionalities. Maybe one day I'll encounter a trickier bug that requires a more subtle juggling with GDB commands, and I'll post a new article on the topic!
Probe effect is unintended alteration in system behavior caused by measuring that system. (according to Wikipedia)
I had such a problem yesterday while debugging a tool based on libevent :
But that didn't work, because the helper thread never quits, and the main one waits forever.
Here is a slightly simplified reconstruction of this bug. Don't pay attention to the epoll configuration, I don't know how to use it and didn't learn!
#include <sys/epoll.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #define MAXEVENTS 64 #define TIMEOUT -1 void main () { struct epoll_event event; struct epoll_event *events; int fd = 0; int efd = epoll_create1 (0); printf("OH, HAI!\n"); printf("BTW pid = %d\n", getpid()); /* I never learnt how to use epoll... */ event.events = EPOLLIN | EPOLLET; event.data.fd = fd; epoll_ctl (efd, EPOLL_CTL_ADD, fd, &event); events = calloc (MAXEVENTS, sizeof event); /* Wait forever for an event. */ epoll_wait (efd, events, MAXEVENTS, TIMEOUT); printf("KTHXBYE\n"); }
then run it:
$ ./a.out OH, HAI! BTW pid = 4900
yes, it's blocked, that was expected. Ctrl-C to quit.
Now run it again, and attach GDB to it:
$ gdb --pid $THE_PID GNU gdb (GDB) 7.9.50.20150505-cvs ... 0x00007fdb33ac7703 in __epoll_wait_nocancel () from /usr/lib/libc.so.6 (gdb) continue KTHXBYE [Inferior 1 (process 6957) exited with code 010]
What !? the application now prints KTHXBYE and exits normally ... ! That's the probe effect !
There is another way to get get the application finish properly, it's with a SIGSTOP/SIGCONT combo:
$ ./a.out OH, HAI! BTW pid = 7997 ^Z [1] + 7997 suspended ./a.out $ fg [1] + 7997 continued ./a.out KTHXBYE
So that probe effect is not actually due to the probe, but to ...
... to the fact that GDB (and signals) interrupt system calls !
That's (obviously) in epoll_wait man page:
ERRORS
EINTR : The call was interrupted by a signal handler before either (1) any of the requested events occurred or (2) the timeout expired; see signal(7).
In my libevent application, the thread was blocked in this epoll_wait syscall, which was interrupted by the debugger/signals. Then it checks for the exit flag and terminates as expected.
(It was a configuration problem, not a bug in libevent. The function that should notify the epoll listener was not correctly configured.)
Debuggers (and signals) may and will interrupt system calls. Your code should take that in consideration, that's a normal behavior of system calls. But that may also introduce "probe effects" when you try to study your code with a debugger!
When extending through its Python API (or directly in C), the situation where you have to use threads may pop up. For instance with GUIs, or when using another library or module.
Unfortunately, GDB doesn't like that much. First of all, you cannot call GDB Python functions from another thread. GDB itself not multithreaded, and hence not thread safe. Python is though, so you should be able to block the main thread in Python, and call GDB functions in the other thread, but outside from that, GDB will simply crash! And there is nothing to do against that, as far as I know.
But worth (kind of), GDB doesn't support that your code spawns a thread.
In C (GDB bug #17247, patch and discussion):
sigemptyset (&sigchld_mask); sigaddset (&sigchld_mask, SIGCHLD); sigprocmask (SIG_BLOCK, &sigchld_mask, &prev_mask); scm_with_guile (call_initialize_gdb_module, NULL); sigprocmask (SIG_SETMASK, &prev_mask, NULL);
In Python:
import pysigset, signal with pysigset.suspended_signals(signal.SIGCHLD): # start threads, they will inherit the signal mask pass
When you create (in Python) a thread, and then run the application (in my case it happens mainly when the application itself spawns threads), GDB freezes with the following callstack:
(gdb) where #0 sigsuspend () from /usr/lib/libc.so.6 #1 wait_lwp (lp=lp@entry=0x21f63b0) at ../../gdb/gdb/linux-na #2 stop_wait_callback (lp=0x21f63b0, data=<optimized out>) at #3 iterate_over_lwps (filter=..., callback=callback@entry=0x4 #4 linux_nat_wait_1 (ops=<optimized out>, target_options=1, o #5 linux_nat_wait (ops=<optimized out>, ptid=..., ourstatus=0 #6 thread_db_wait (ops=<optimized out>, ptid=..., ourstatus=0 ...
The function in which GDB is blocked is sigsuspend:
NAME
sigsuspend, rt_sigsuspend - wait for a signal
SYNOPSIS
int sigsuspend(const sigset_t * mask);
DESCRIPTION
sigsuspend() temporarily replaces the signal mask of the calling process with the mask given by mask and then suspends the process until delivery of a signal whose action is to invoke a signal handler or to terminate a process.
GDB uses this function to wait for new events from the application execution: when something occurs in the debuggee (see how debuggers work), the kernel will inform GDB of it by sending a SIGCHLD signal. When it's received, GDB awakes and check what happened.
However, the signal is delivered to GDB process, but not necessarily to its main thread. And it practise, it occurs often that it's delivered to the second thread, who doesn't care about it (that's the default behavior), and continues its life as if nothing occurred.
We cannot change the behavior of the thread. However, we have a bit of control over its default signal handling behavior: it is inherited from its parent! So, in Python, we can go this way:
import pysigset, signal # with SIGCHLD blocked, with pysigset.suspended_signals(signal.SIGCHLD): # start threads, # they will inherit the signal mask pass # SIGCHLD is unblocked after the with statement, # so that GDB can operate properly afterwards
Since the beginning of my PhD program in 2015, I'm working on the campus(es) of the University of Grenoble. Grenoble is located in a basin in the middle of 3 mountain ranges, Belledonne (the highest, with many rocky summit above 2000m) and Chartreuse and Vercors, more modest but also greener.
One of the Computer Science "campus" is located in Montbonnot, aside from the city:
The main campus is closer to the city center, but also lets you enjoy the view of the mountains:
It's also close to the Isere river:
Lately (one year ago), we moved to the GIANT campus, on the other side of the city, inside the CEA facilities. We're now closer to Chartreuse and Vercors mountain ranges, although they're behind fences and buildings ...
When I script GDB to develop model-centric debugging support, I often need to execute code upon specific events, namely after breakpoints hits. It looks like that:
import gdb class MyBP(gdb.Breakpoint): def __init__(self): gdb.Breakpoint.__init__(self, "a_function") self.silent = True def stop(self): print "##### breakpoint" return True # stop the execution at this point MyBP()
However, in this Breakpoint::stop callback, you're not free to do what ever you want:
- alter the execution state of the inferior (i.e., step, next, etc.),
- alter the current frame context (i.e., change the current active frame), or
- alter, add or delete any breakpoint.
As a general rule, you should not alter any data within gdb or the inferior at this time.
That's a "general rule" in the documentation, however it's not enforced at runtime, so in practice, there are many things you can do, but the result in not guaranteed!
I sometimes have to delete breakpoint, and I thought that gdb.post_event(this_bp.delete) would be safe. And it is, if your GDB is configured with set height 0. You don't see the link? Fair enough, it's very tricky! set height 0 tells GDB not to stop when the screen is full, which is an artifact of the past.
When GDB does stop at the end of the window, it processes the "posted events", and actually deletes the breakpoint structure. However the breakpoint is not removed from all GDB low-level lists, and than leads to a segmentation fault when GDB tries to access it. Here are the stack traces (deletion and invalid access) from Valgrind that helped me to figure out what was happening:
Address 0x18c0bd40 is 0 bytes inside a block of size 200 free'd # stack of the free(breakpoint) at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x763558: xfree (common-utils.c:98) # here the breakpoint structure is freed by 0x58D49D: delete_breakpoint (breakpoint.c:14074) by 0x525F92: bppy_delete_breakpoint (py-breakpoint.c:287) # that's the breakpoint delete function by 0x3BB384A0D2: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0) by 0x3BB38DC026: PyEval_CallObjectWithKeywords (in /usr/lib64/libpython2.7.so.1.0) by 0x522C9C: gdbpy_run_events (python.c:934) # process Python events by 0x4AC258: run_async_handler_and_reschedule (ser-base.c:137) by 0x4AC328: fd_event (ser-base.c:182) by 0x61C281: handle_file_event (event-loop.c:762) by 0x61B768: process_event (event-loop.c:339) by 0x61B80A: gdb_do_one_event (event-loop.c:391) Invalid read of size 8 # stack of the illegal memory access at 0x57C2B9: bpstat_explains_signal (breakpoint.c:4430) # here the breakpoint structure is accessed by 0x5FCFE3: handle_signal_stop (infrun.c:4474) by 0x5FC484: handle_inferior_event (infrun.c:4110) by 0x5FA831: fetch_inferior_event (infrun.c:3261) by 0x61E506: inferior_event_handler (inf-loop.c:57) by 0x4BC10E: handle_target_event (linux-nat.c:4448) by 0x61C281: handle_file_event (event-loop.c:762) by 0x61B768: process_event (event-loop.c:339) by 0x61B80A: gdb_do_one_event (event-loop.c:391) by 0x727F0D: maybe_wait_sync_command_done (top.c:386) by 0x728199: execute_command (top.c:478) by 0x6148E7: catch_command_errors (main.c:373)
Among the other thing that cannot be done (and that I wanted to do), we find thread switching:
def stop(self): print("##### breakpoint") gdb.selected_inferior().threads()[1].switch() gdb.selected_inferior().threads()[0].switch() return False # don't stop
which leads to another segfault:
[New Thread 0x7ffff7fca700 (LWP 22661)] ##### breakpoint [Thread 0x7ffff7fca700 (LWP 22661) exited] [Inferior 1 (process 22657) exited normally] [2] 22651 segmentation fault (core dumped) gdb-fedora -ex "source test.py" ./thread -ex run
or altering the instruction pointer of the inferior:
[New Thread 0x7ffff7fca700 (LWP 23100)] ##### breakpoint Traceback (most recent call last): File "test.py", line 11, in stop gdb.execute("next") gdb.error: Cannot execute this command while the selected thread is running.
But what if you really what to do it ?! Then you have to find alternatives! and there are a few (but they assume you want to stop the execution at this breakpoint, and give the prompt back to the user).
class MyBP(gdb.Breakpoint): def stop(self): print("##### breakpoint") gdb.events.stop.connect(stop_event) gdb.prompt_hook = prompt gdb.post_event(posted_event) return True
Subscribing to the stop events
def stop_event(evt): print("#### stop event") gdb.events.stop.disconnect(stop_event)
Hooking the prompt:
def prompt(current): print("#### prompt") gdb.prompt_hook = current
Posting events (which seem to work well when height=0)
def posted_event(): print("#### posted event")
These callbacks are executed in this order:
##### breakpoint #### stop event #### prompt (gdb) #### posted event
Note that the prompt hook is executed right before giving the control to the user, so gdb.execute("<command>") works very close to what you can expect from the command line. That saved the work of my afternoon yesterday!
When you're scubadiving, one important thing is to keep track of the air you've got left in your tank, and more importantly, how fast it goes down.
Just a quick reminder for non-scubadiver readers: the deeper you are, the more air you consume, because of the increased pressure. At surface level (1atm/~1bar), your lungs can hold ~7l of air. At 10m (2atm), they'll hold the equivalent of 14l of air. That's 21l at 20m, 28l at 30m and 35l at 40m. That means that if you hold your breath from 40m to surface, you'll literally explode! It also means that you don't consume the same quantity for air depending of your diving depth.
And what I wanted to do is estimating the air consumption speed. Of course I could buy a diving computing connected to the tank gauge, but that's out of my budget. So I programmed it :-) With my dive computer, I do have a report of the dive depth, every 20s :
I also know the volume of my tank (12l or 15l, carved on the tank), the initial pressure (usually ~200/230bar) and the end-of-dive pressure (100bar here), read on the gauge. Of course this is approximate, but that's the poor man way! I retrieve the figures of the dive depth and time inside my Subsurface XML divelog, as well as the dive tank information:
<divecomputer model='Suunto Vyper' deviceid='5aa559db' diveid='8f75b186'> <depth max='10.058 m' mean='6.601 m' /> <temperature air='17.0 C' water='17.0 C' /> <sample time='0:00 min' depth='0.0 m' /> <sample time='0:20 min' depth='1.829 m' /> <sample time='0:40 min' depth='2.134 m' /> <sample time='1:00 min' depth='2.743 m' /> <sample time='1:20 min' depth='3.353 m' /> <sample time='1:40 min' depth='3.658 m' /> <sample time='2:00 min' depth='4.267 m' /> <sample time='2:20 min' depth='4.572 m' /> <sample time='2:40 min' depth='4.572 m' /> <! --... -->
And now, I can start computing my air consumption.
What I know for my calculation:
Now, I need to distribute Vconso over the dive, but paying attention to different dive depths. So, for all Ti and Pi,
For my last dives, this gives:
and I didn't write down enough details for the previous ones.
It's hard to tell if the computation is good enough, the last 4 dives (top of the list), where during the same weekend and the consumption are every homogeneous, but the 2 previous ones are surprising. Maybe the cave dive was a bit stressful, even if I didn't notice it, maybe I didn't write correctly the tank sizes correctly (I also had to breath on 2 independent tanks), and for the dive at 27l/min, my suit was too large, I could feel cold water each time I was breathing, I don't know if that explains the figures twice as high ... or my computations are just wrong !
def get_conso(xml_dive, volume): surface_eq_time = 0.0 def subsurface_time_to_minutes(sub_time): mn, sc = [int(ent) for ent in sub_time.split(":")] return mn + sc/60 time, depth = 0.0, 0.0 for sample in xml_dive.find("divecomputer").findall("sample"): new_time, new_depth = [sample.get(entry).split()[0] for entry in ("time", "depth")] new_time = subsurface_time_to_minutes(new_time) duraction = new_time - time new_depth = float(new_depth) sample_depth = get_sample_depth(depth, new_depth) presure = abs(sample_depth) / 10 + 1 # meters to atm presure surface_eq_time += presure * duraction time, depth = new_time, new_depth return int(volume/surface_eq_time)
Actual function is there: http://git.0x972.info/subsurface2jinja.git/tree/subsurface2conso.py
In my previous post on what I like with *nix OSes, programmability and composability, I didn't mention one point that is globally orthogonal, but still participates to programmability aspect: the is a very large set of Free, Libre and Open Source Software (FLOSS). I don't contribute much to these projects (I'd like to but I lack time, I already spend too much time working on my own projects), but in "exchange", I do my best to show (to myself mainly) that the FLOSS ecosystem is mature enough to use it daily. I won't say I use it exclusively, but anytime I can choose, I favor FLOSS software and FLOSS-friendly hardware.
Why would I crack (or pay for) an application when someone happily made it for free? Nowadays, after quite a longtime in this mindset, I'm honestly surprised and astonished when I hear someone saying, "I use a cracked version; I can find you a crack for that tool". In my computing world this doesn't make sense anymore!
At the end of my PhD thesis redaction, I tried to enumerate the FLOSS tools I used my research work, and during the redaction of the manuscript:
Let me also give a practical example of why I FLOSS tools:
One month ago, I read this article on Hacker News: Signing in to websites with SSH. Although what they discuss is too advanced for my needs, I loved the idea, and I wanted it in some the tools I use.
Step 1: ssh connection
cat >> ~/.bashrc << EOF ssh_connect_me () { ssh 0x972.info /home/kevin/.ssh_log_me } EOF
Step 2: remember ssh connection
cat > ~/.ssh_log_me << EOF #!/bin/bash WHERE=/var/www/alternc/k/kevin/.ssh_autologin echo $SSH_CONNECTION | cut -d" " -f1 > $WHERE echo 'Logged in !' $(cat $WHERE) EOF chmod u+x ~/.ssh_log_me
Step 3: adapt tools for autologin
// autologin for SSH authenticated ips $AUTOLOGIN_FILE = "/var/www/alternc/k/kevin/.ssh_autologin"; $autologin = FALSE; if (file_exists($AUTOLOGIN_FILE) //&& (time() - filemtime($AUTOLOGIN_FILE)) < 60*60*12 // file is newer than 12h && $_SERVER["REMOTE_ADDR"] == str_replace("\n", "", file_get_contents($AUTOLOGIN_FILE))) { $autologin = TRUE; touch($AUTOLOGIN_FILE); $_POST['nom_utilisateur'] = $GLOBALS['identifiant']; $_POST['mot_de_passe'] = $GLOBALS['mdp']; }
// autologin for SSH authenticated ips $AUTOLOGIN_FILE = "/var/www/alternc/k/kevin/.ssh_autologin"; if (file_exists($AUTOLOGIN_FILE) // && (time() - filemtime($AUTOLOGIN_FILE)) < 60*60*12 // file is newer than 12h && $_SERVER["REMOTE_ADDR"] == str_replace("\n", "", file_get_contents($AUTOLOGIN_FILE))) { touch($AUTOLOGIN_FILE); $this->loggedin = true; }
That's nothing complicated, but quite convenient for me! Each time my IP changes, I run ssh_connect_me, and I'm logged in immediately. I'm not sure how secure it is, I'm got daily backups for both databases and nothing confidential accessible to them, so I guess this is safe enough. I initially put a time limit, but I don't see any real interest after all.
Since almost two years, I rent a bit of space on the Internet to host my services, and an address (or two) for "my place". 0x972.info and kevin.pouget.me (actually, pouget.me) is where you have to go to find me.
During the first year, I used L'Autre hosting facilities, but I realized that I frustrated not to have a complete control over the server. I guess if you just need a simple HTTP/PHP server, and/or limited Linux administration knowledge, L'Autre would be a nice platform, with a friendly and responsive community.
So last year, I switched to PulseHeberg vps, with which I'm quite happy (although it's not always that fast, but I can't say if it comes from the VPS power, I may 7e/month, I could upgrade, or if it's their network ... ).
During my L'Autre experience, I discovered AlternC (english presentation here), a free web-hosting management platform for Debian. AlternC takes the control of your Debian server; and customizes it so that you can administrate it from its control panel. The main features I use/like are:
One thing I was not happy with, with the current version of AlternC (v3.2.3), it's that it doesn't support the HTTPS protocol (maybe because it's complicated to manage user certificates?). But it doesn't matter, as AlternC is easy to tweak.
All the configuration files start with this header:
###BEGIN OF ALTERNC AUTO-GENERATED FILE - DO NOT EDIT MANUALLY###
So let's try to understand how Apache vhost are configured.
$ cat /etc/apache2/conf.d/alternc.conf .... # Now we include all the generated configuration Include /var/lib/alternc/apache-vhost/vhosts_all.conf
Indeed, this file contains what I configured in the web-inteface:
<VirtualHost *:80> ServerName phd.kevin.pouget.me DocumentRoot "/var/www/alternc/k/kevin/www/0x972.info/blog/" AssignUserId #1000 #1000 SetEnv LOGIN "1000-kevin" </VirtualHost> <Directory "/var/www/alternc//k/kevin/www/0x972.info/cgit/"> php_admin_value open_basedir "/var/www/alternc//k/kevin/:/usr/share/php/" php_admin_value upload_tmp_dir /var/www/alternc//k/kevin//tmp php_admin_value sendmail_path '/usr/lib/alternc/sendmail "..." ' php_admin_flag mail.add_x_header on Options +MultiViews -FollowSymLinks +SymLinksIfOwnerMatch AllowOverride AuthConfig FileInfo Limit Options Indexes </Directory>
Where is the template for this file?
$ grep DocumentRoot /etc/alternc -r ... /etc/alternc/templates/apache2/vhost.conf: DocumentRoot "%%document_root%%"
There we are :-) I just have to duplicate the template, change the port (can configure Apache to talk https to that port), and I've got my secure web-pages !
$ cat | sudo tee -a /etc/alternc/templates/apache2/vhost.conf <VirtualHost *:443> ServerName %%fqdn%% DocumentRoot "%%document_root%%" AssignUserId #%%UID%% #%%GID%% SetEnv LOGIN "%%UID%%-%%LOGIN%%" </VirtualHost>
Currently the pages are served under a self-signed certificate, which is not the best as people need to accept the certificate first, and not everybody understands what this step means, but wild-card certificates (for *.0x972.info) cost a lot!
I can't deny I still do most of my management from the command line, and that's good because that's the reason I left l'Autre!
In some cases like the configuration of my Owncloud server, or Mozilla Sync, I need to pass some special options to Apache, so AlternC is too generic. For instance for Sync, I need to tell Apache how to run Python code:
WSGIPassAuthorization On WSGIScriptAlias / /var/www/alternc/k/kevin/www/0x972.info/sync/sync.wsgi WSGIProcessGroup kevin WSGIDaemonProcess kevin user=kevin group=kevin processes=2 threads=25 WSGISocketPrefix run/wsgi
and a few I setup but don't use anymore:
Ten years ago, I started university and learned how to program and use Linux. A couple of years after, I kicked Windows (XP) out of my computers, and never get back to it since.
I'd like to come back on one aspect that makes me love Linux-based environments: they are easily to program (but by easy, I only include myself, I know it's not true for everyone!).
Computers are good at repeating things and following rules. That's the only thing they actually do, so I want to be able to exploit it, and I do it quite often, not just as part of my development work, but also to improve my daily use of my computer.
Let me give some examples.
Composability is ability to connect different tools together. That's actually part of the "Unix Philosophy", combining "small, sharp tools" to accomplish larger tasks. What did I last use ?
Linux/Unix tools are great for that, although when it's become too complicated, it may be worth starting writing a bash script, or even a Python script. Just to name of few of them:
Software design with such usages in mind also tend to export some of their functionalities to the command line. For instance, Dia or LibreOffice:
$ dia --help Usage: dia [OPTION…] [FILE...] Application Options: -e, --export=OUTPUT Export loaded file and exit -t, --filter=TYPE Select the filter/format out of: png, jpg, svg, tex, ... -s, --size=WxH Export graphics size $ libreoffice --help LibreOffice 4.3.6.2 430m0(Build:2) --convert-to output_file_extension[:output_filter_name[:output_filter_options]] [--outdir output_dir] files Batch convert files. If --outdir is not specified then current working dir is used as output_dir. Eg. --convert-to pdf *.doc --convert-to pdf:writer_pdf_Export --outdir /home/user *.doc --convert-to "html:XHTML Writer File:UTF8" *.doc --convert-to "txt:Text (encoded):UTF8" *.doc --print-to-file [-printer-name printer_name] [--outdir output_dir] files Batch print files to file. If --outdir is not specified then current working dir is used as output_dir. Eg. --print-to-file *.doc --print-to-file --printer-name nasty_lowres_printer --outdir /home/user *.doc
Last week, I had to 1/ modify a file 2/ save it 3/ go to a console window 4/ hit the up key (ie, go to previous command) and 5/ hit enter to generate an image.
Points 1/ and 2/ are not that repetitive, but 3-5 are. So let's see if we can automatize this a bit.
inotifywait is the tool for that!
$ man inotifywait NAME inotifywait - wait for changes to files using inotify DESCRIPTION inotifywait efficiently waits for changes to files using Linux's inotify(7) interface. It is suitable for waiting for changes to files from shell scripts. It can either exit once an event occurs, or continually execute and output events as they occur. OUTPUT inotifywait will output diagnostic information on standard error and event information on standard output. The event output can be configured, but by default it consists of lines of the following form: ...
So I wrote this little script:
$ cat ~/bin/autorun #!/bin/bash WATCH=/tmp/autorun touch $WATCH while [[ -f $WATCH ]] do echo Run $* $* echo Waiting ... inotifywait $WATCH -qq done echo Done
and I combined it with i3 bindings:
bindsym F9 exec touch /tmp/autorun bindsym Shift+F9 exec rm -f /tmp/autorun
and finally, in my console, I just run $ autorun $MY_COMMAND, and hit F9 to trigger my command execution (that was more convenient/easier than a compile-on-save).
That's nothing complicated, advanced, nor really meant to reuse after I forget about it, but it surely saved my some time and concentration (it's not the few seconds I took to do it manually that are the problem, but rather that I had to get out of my problem during that time and possibility lost my thread of through).
I wanted daily backups of my MySQL databases, into separate files, keeping the last 7 backups, but only if the database was modified since the last backup.
# password-less connection (with sudo) # ask MySQL to list its databases # excluse system databases databases=$(mysql --defaults-file=/etc/mysql/debian.cnf -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema|performance_schema)") for db in $databases; do dbname=$(echo $db | sed 's/\./@002e/g') # points in databases complicate everything ... # in /var/lib/mysql/$dbname dir, find the most recent table most_recent_db_file=$(find /var/lib/mysql/$dbname -type f -exec stat --format '%Y +%n' {} \; | sort -nr | cut -d+ -f2 | head -1) back=db_$db.sql # if the backup file doesn't exist or the database is newer than the backup if [[ ! -f $back.0 || $most_recent_db_file -nt $back.0 ]]; then # dump database $db into backup file $back mysqldump --force --opt --databases $db > $back # debian tool that saves/deletes/compresses backup files $back[.gz]{.0-7} savelog -c 7 $back > /dev/null fi
done
When software tools evolve again and again, they accumulate a lot of functionalities that are provided to their users. And frequently, these functionalities can be combined to create "higher level" functionalities. Either you click on the first feature, configure it and run it, and the same for the second, and the third, and you start over when you want to replay the high-level feature. Or your tool can be programmed (or scripted, if it differs).
Emacs can be configured and extended with through its Lisp engine. What do I do with that?
1/ easy one, copied from the Internet (Ctrl-C Ctrl-X quits Emacs)
;; ask confirmation before exiting ;; C-x C-c are soo close, y'know what I mean ... (defun confirm-exit-emacs () "ask for confirmation before exiting emacs" (interactive) (if (yes-or-no-p "Are you sure you want to exit? ") (save-buffers-kill-emacs))) (global-unset-key "\C-x\C-c") (global-set-key "\C-x\C-c" 'confirm-exit-emacs)
Again nothing complicated, I'm not happy with the default behavior, quiting is too easy if everything is saved (otherwise it asks for confirmation), I can change it myself. Here I just add a confirmation, but I could do what ever I want: delete temporary files, log the time I stopped working, call my mum, ...
2/ In my configuration, Shift+Tab tries to complete the current word with words found in other buffers (ie open files). While writing my PhD thesis, I used that:
(defun load-english-completion () "Load english dictionary in background for dynamic completion" (interactive) (find-file "~/Documents/redactions/english") (make-buffer-uninteresting) (switch-to-prev-buffer))
~/Documents/redactions/english is a dump of all the words known by aspell spell checker. (make-buffer-uninteresting) adds a space in front of the buffer name, so that it doesn't appear in the buffer lists.
3/ For compiling Latex documents, the default configuration in C-x C-s (save buffer) C-c C-c Enter (run "LaTeX" command). That's too long for me, so:
;; compile LaTeX document without any confirmation (defun LaTeX-do-build () (interactive) ;; save current Latex documents (let ((TeX-save-query nil)) (TeX-save-document (TeX-master-file))) ;; run LaTeX command on the master file (setq build-proc (TeX-command "LaTeX" 'TeX-master-file -1)) (set-process-sentinel build-proc 'build-sentinel)) (global-set-key (quote [f5]) 'LaTeX-do-build)