(gdb) break *0x972

Debugging, GNU/Linux and WebHosting and ... and ...

[Dev-tools configuration] GDB's .gdbinit

Thursday, December 04, 2014 - 1 comment

In the next posts, I'll share the configuration files of my development and debugging tools. I think having a nice dev environment is key to good-quality programming. At least, it makes your experience better. It's also verrrry nice for the people who will have to interact with you and your code!

I'll start with my .gdbinit:

  • No window size, we can scroll:

    set height 0
    set width 0
    
  • Allow pending breakpoints, we know what we're doing:

    (gdb) break foobar
    Function "foobar" not defined.
    Breakpoint 9 (foobar) pending.

    means that there is no "foobar" currently loaded, but maybe you know it will appear later.

    set breakpoint pending on
    
  • We're doing Python development, so print the full stack trace when it crashes:

    set python print-stack full
    
  • GDB, please don't complain, I know what I'm doing! ... but beware, you may lose your session if you enter start in the middle of your debugging:

    set confirm off
    
  • Make structures easier to read:

    set print pretty
    
  • Save history, up-arrow and ctrl-r can save a lot of time:

    set history filename ~/.gdb_history
    set history save
    

We can also add a few Python shortcut commands. They all rely on the gdb Python module:

import gdb

  • quickly print Python objects: pp instead of py print:

    class pp(gdb.Command):
        """Python print its arg"""
        def __init__(self):
            gdb.Command.__init__ (self, "pp", gdb.COMMAND_DATA,
                                  completer_class=gdb.COMPLETE_SYMBOL)
        def invoke (self, arg, from_tty):
            gdb.execute("python print %s" % arg)
    pp()
    
  • quickly list attributes of a Python object: ppd

    class ppd(gdb.Command):
        """Python print dir() of its arg"""
        def __init__(self):
            gdb.Command.__init__ (self, "ppd", gdb.COMMAND_DATA, completer_class=gdb.COMPLETE_SYMBOL)
    
        def invoke (self, arg, from_tty):
            gdb.execute("python print dir(%s)" % arg)
    ppd()
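
For the record, a quick usage sketch of these two commands (the exact output depends on your GDB version and the program state):

    (gdb) pp 40 + 2
    42
    (gdb) ppd gdb.selected_frame()
    ['__class__', ..., 'find_sal', 'name', 'older', 'pc', ...]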
    

And finally some code I retrieved from ratmice@gitorious, an extended prompt for GDB, which shows the current source file, line and function, all that with colors (my version is here):

(src/ls.c:1242 main gdb)
$
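
A minimal sketch of the same idea (without the colors), based on GDB's gdb.prompt_hook; the naming is mine, the linked version does more:

import gdb

def where_prompt(current_prompt):
    # decorate the prompt with file:line and function of the selected frame
    try:
        frame = gdb.selected_frame()
        sal = frame.find_sal()
        return "(%s:%s %s gdb) " % (sal.symtab.filename, sal.line, frame.name())
    except (gdb.error, AttributeError):
        return "(gdb) "  # no frame selected yet (e.g. before 'run')

gdb.prompt_hook = where_prompt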

Zetetics and debugging: Occam's razor

Monday, December 01, 2014 - No comments

Today, I'm making the link between the zetetics course I followed 3 years ago (a general-culture class for the PhD; zetetics is the French take on scientific skepticism) and my research work: Occam's razor:

In substance, it says: Pluralitas non est ponenda sine necessitate. In plain words: why make things complicated when they can be simple? Roughly, what this razor says is that when several hypotheses compete, it's better to pick the cognitively least "costly" ones.

Or with an example:

This razor can also prove useful for analyzing so-called conspiracy theories. It is not impossible that 9/11 was the fruit of an orchestration planned by the secret services, requiring great discretion from the accomplices, a whole lot of precautions and the erasure of all the evidence, in order to declare the fight against the Axis of Evil and trigger the second Gulf war. It's an appealing scenario, especially when you're anti-Bush. But a bit of historical culture makes this hypothesis rather costly.

Applied to debugging, there are also "costly" hypotheses better left aside, such as bugs in the compiler, the OS or the processor (or the debugger :-).

And here is what it looks like when you don't use the razor: :-)

Kaamelott (Season 4, Episode 6 – Les pisteurs) © CALT / DIES IRAE / SHORTCOM – 2006

Linux Kernel System Debugging, part 1: System Setup

Thursday, November 27, 2014 - No comments

In this post series, I'll explain how to set up and play with system debugging of the Linux kernel. We'll have to build the environment: a Qemu virtual machine, a Linux kernel and a Busybox filesystem. I'll show how to play with the debugger in the next post.

All the explanations here are for building an x86/x86_64 kernel. Everything will work the same with another instruction set, but you'll have to set up the cross-compiling environment yourself!

Compile Linux Kernel in Debugging Mode

# Download and extract the sources of the kernel
KVERSION=3.17.4
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-$KVERSION.tar.xz;
tar xvf linux-$KVERSION.tar.xz
cd linux-$KVERSION

# configure the kernel
cp /boot/config-$(uname -r) .config # copy Fedora kernel configuration
make oldconfig                      # set new options
make menuconfig

Make sure that these options are checked/unchecked:

Kernel hacking
--> Compile-time checks and compiler options
--> [X] Compile the kernel with debug info
--> [ ] Strip assembler-generated symbols during link
[X] Kernel debugging

make

... and go grab a coffee, it will take a while!

Compile Qemu in Debugging Mode

If you want to study how Qemu communicates with the debugger, build it now with the debugging information (enabled by default); otherwise, install it from your distribution packages.

QVERSION=2.1.2
wget http://wiki.qemu-project.org/download/qemu-$QVERSION.tar.bz2
tar xvf qemu-$QVERSION.tar.bz2
mkdir qemu-{build,install}
cd qemu-build
../qemu-$QVERSION/configure --prefix=$(readlink -f ../qemu-install) --disable-kvm --target-list="i386-softmmu x86_64-softmmu"
make && make install

Compile Busybox Filesystem

Busybox and initrd preparation come almost directly from mgalgs, thanks!

If you want to have a chance to study the link between user-level applications and the kernel, build Busybox with debugging information. Otherwise, just grab the precompiled binaries.

BVERSION=1.19.4
wget http://busybox.net/downloads/busybox-$BVERSION.tar.bz2
tar xf busybox-$BVERSION.tar.bz2
cd busybox-$BVERSION/
make menuconfig

and make sure that the debugging options are checked:

Busybox Settings
--> Debugging Options

then compile and install (by default in the _install sub-directory):

make
make install

Build Initrd Filesystem

Initrd provides an early filesystem; I assume it's preloaded in memory by the BIOS (that is, Qemu here):

mkdir initramfs
cd initramfs
# create standard filesystem directories
mkdir -pv bin lib dev etc mnt/root proc root sbin sys
# create standard file devices
sudo cp -va /dev/{null,console,tty} dev/
sudo mknod dev/sda b 8 0
# import busybox filesystem
cp ../busybox-$BVERSION/_install/* . -rv

We didn't recompile the glibc, so we need to import it from our local system (adapt this to what you see in the ldd output):

# copy relevant shared libraries
ldd bin/busybox
# linux-vdso.so.1 =>  (0x00007fff9fdfe000) (virtual)
# libm.so.6 => /lib64/libm.so.6 (0x0000003d49e00000)
# libc.so.6 => /lib64/libc.so.6 (0x0000003d49200000)
# /lib64/ld-linux-x86-64.so.2 (0x0000003d48e00000)

mkdir -p lib     # already created above
ln -s lib lib64  # make lib and lib64 identical
cp /lib64/libm.so.6 lib  # symlink to libm-2.18.so
cp /lib64/libm-2.18.so lib
cp /lib64/libc.so.6 lib # symlink to libc-2.18.so
cp /lib64/libc-2.18.so lib
cp /lib64/ld-linux-x86-64.so.2 lib

You can ensure that your shared libraries are correctly imported by running sudo chroot . /bin/sh in your initramfs directory.

Finally, prepare an init script that will setup the user-space environment:

cat > init << EOF
#!/bin/sh

/bin/mount -t proc none /proc
/bin/mount -t sysfs sysfs /sys
/bin/mount -t ext2 /dev/sda /mnt/root

exec /bin/sh
EOF
chmod 755 init

Prepare and Run the Virtual Machine

The initrd filesystem has to be packaged in a cpio archive. Each time you modify a file in initramfs you'll have to rebuild the archive:

cd initramfs && \
find . -print0 | cpio --null -ov --format=newc > ../my-initramfs.cpio \    
&& cd ..

The last step consists in creating a hard disk for the system and formatting it in ext2:

SIZE=512M
qemu-img create disk.img $SIZE
mkfs.ext2 -F disk.img

Now you're ready to boot the virtual machine!

qemu-install/bin/qemu-system-x86_64 -nographic -hda disk.img -kernel linux-$KVERSION/arch/x86_64/boot/bzImage -initrd my-initramfs.cpio -append "console=ttyS0"

Options -nographic and -append "console=ttyS0" redirect the virtual machine output to the console, instead of creating a dedicated window. The others are straightforward: they pass the hard-disk file, the kernel image (bzImage, a self-extracting compressed kernel; the "bz" stands for "big zImage", not bzip2) and the initrd filesystem.

Hit Ctrl-A C to enter the Qemu monitor console, and type quit to exit.

Debug Linux Kernel

Run Qemu with its gdbserver listening (notice the -s, equivalent to -gdb tcp::1234):

qemu-install/bin/qemu-system-x86_64 -nographic -hda disk.img -kernel linux-$KVERSION/arch/x86_64/boot/bzImage -initrd my-initramfs.cpio -append "console=ttyS0" -s

and connect GDB to that port, with the uncompressed kernel as symbol file:

gdb linux-$KVERSION/vmlinux -ex "target remote localhost:1234"
...
Reading symbols from linux-3.17.4/vmlinux...done.
Remote debugging using localhost:1234
(gdb) where
#0  native_safe_halt () at .../irqflags.h:50
#1  arch_safe_halt () at .../paravirt.h:111
#2  default_idle () at .../process.c:311
#3  arch_cpu_idle () at .../process.c:302
#4  cpuidle_idle_call () at .../idle.c:120
#5  cpu_idle_loop () at .../idle.c:220
#6  cpu_startup_entry (state=<optimized out>) at .../idle.c:268
#7  rest_init () at init/main.c:418
#8  start_kernel () at init/main.c:680
#9  x86_64_start_reservations (real_mode_data=<optimized out>) at .../head64.c:193
#10 x86_64_start_kernel (real_mode_data=<optimized out>) at .../head64.c:182


Run GDB until the Application Segfaults

Tuesday, November 25, 2014 - No comments

I'm trying to debug a race condition that crashes the application only once every ~15 runs of 3-4 minutes. I want a GDB prompt on that crash, and not only a core dump.

The naive way is to start gdb, run the application, wait for its termination, and restart it if it didn't crash.

A better alternative is to automate it:

(gdb) py gdb.events.exited.connect(lambda evt : gdb.post_event(lambda : gdb.execute("run")))

Step by step: gdb.events.exited.connect() registers an exit-event callback; the callback posts an asynchronous command with gdb.post_event() (running "run" directly from the callback would create a recursion, and maybe end up with a stack overflow); and that command restarts the execution with gdb.execute("run").
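
The same one-liner, written out as a sketch for a .gdbinit python block; note that a segfault does not fire the exited event (GDB stops at the signal prompt instead), which is exactly why the loop ends on the crash:

import gdb

def restart(event):
    # fires only when the inferior actually exits;
    # on a crash, GDB keeps the prompt at the faulting instruction
    gdb.post_event(lambda: gdb.execute("run"))

gdb.events.exited.connect(restart)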

Computing, standards and open formats

Monday, November 24, 2014 - No comments

This weekend I signed the Appel pour l'interopérabilité dans l'Éducation nationale (a call for interoperability in the French national education system), and before passing the link along, I wanted to come back on the reasons why this matters to me.

First of all, what is an open format, and a standard?

It's a protocol for exchanging or communicating. In real life (IRL), you can compare it to ...

  • hex keys and the screw heads that go with them,
    • the size and angle of the flats are clearly specified; across the world they will always be the same
  • light bulb sockets
  • the language used to talk
    • if we don't speak the same language, we don't understand each other

In computing, it's much the same. When you visit a web page, your computer doesn't know the web server, but if they speak the same language, they can talk to each other and fetch the page. Same for opening a document: as long as the key matches the screw head, you're fine.

Why does it have to be "open"?

"Open", here, means that the specifications (the size and angle of the flats) are public and can be consulted/used freely, without having to pay millions. Imagine a road, built by the State, on which only Chevrolet (™) cars could drive, because everybody uses those cars. Or Renault or Peugeot, same thing, except one could then think it's out of national preference. No: Chevrolet makes square wheels, so the State builds rounded roads (so that the squares can turn freely).

The State couldn't possibly do that? IRL, no ... in computing, yes!

Microsoft makes square wheels (Word and .doc-x, Excel and .xls-x), and it doesn't want to make round wheels (the OpenDocument formats .odt .ods, used by LibreOffice for example). When the Éducation Nationale signs agreements with MS, it accepts and promotes their monopoly: we keep the square wheels and the rounded roads! We give millions of dollars to MS, which in exchange gives us Windows at a reduced price. What comparison can we find this time ... a new screw-head system:

So here are a few elements on why you should sign the Appel pour l'interopérabilité dans l'Éducation nationale: to avoid square wheels! Or at least to help computing break free from the grip of the monopolies/multinationals that block the development of alternatives (free or not) by forcing the use of non-free formats.

(because of course, MS Word could support the OpenDocument formats, but they don't want to, as a political choice. Since when do they support exporting to PDF, without going through a PDF printer and all that? ... actually I don't know, do they support it? ^^)


More generally in computing, the openness of formats and communication protocols is essential; it's thanks to it that the Internet reached its current state. For instance, displaying this page required standards for ...

  • the web (HTML and CSS)
  • the network (HTTP, TCP, IP, Ethernet, ...)
  • databases (SQL)

All these different protocols allow the Internet to work. On the other hand, when your data is locked inside a single company, there is no more need for communication or interoperability. Facebook, Twitter or WhatsApp messages don't really travel across the Internet: they stay "at Facebook's", and sooner or later this walls everyone in, just like Microsoft and its square wheels. How many times do we hear on the radio/TV "come discuss on our Twitter account, #-tag ..... or on our Facebook page ..."? even though Twitter, with its ~140-character limit, is far from suited for comments; but no, its interface is closed, no interoperability whatsoever, so you must use Twitter to benefit from the network ...

Printing corrupted (scanned) PDF

Thursday, November 20, 2014 - No comments

My printer is ... special. From its web interface, you can scan a document and get a PDF, but you can't print it!

It generates "nature-friendly" PDFs! Only white pages come out of the printer.

THINK BEFORE YOU PRINT: Please consider the environment before printing this email.

It doesn't work with Evince, nor pdf2ps, nor evince > print to file > print, nor convert.

But Evince does print some useful information:

Syntax Error (5404808): Illegal character '>'
Corrupt JPEG data: premature end of data segment
Corrupt JPEG data: premature end of data segment
Corrupt JPEG data: premature end of data segment

The JPEG images contained in the PDF are corrupted. For some reason, Evince can display them onscreen, but not translate them to PS for the printer ... There's certainly a PDF library down there, used to transform PDFs into other formats, that doesn't handle invalid images.

Fortunately, Poppler's pdfimages doesn't rely on that "broken" library, and it can extract all the images of a PDF!

When you try to export the images in JPEG format (option -j), it still doesn't work, as it just extracts the invalid images out of the PDF. Eye of Gnome can't display them and explains why:

Error interpreting JPEG image file (Maximum supported image dimension is 65500 pixels)

However, pdfimages can also export PPM images (portable pixmap file format), which are not invalid! yay! :-)

pdfimages $PDF $PREFIX

and with ImageMagick's convert, you can rebuild your PDF:

convert-pdf() {
  PDF=$1
  PREFIX=convert-
  TMP=$(mktemp -d)
  WD=$(pwd)
  cp $PDF $TMP
  mv $PDF $PDF.bak  # keep a backup of the original
  cd $TMP
  pdfimages $PDF $PREFIX
  # convert ppm to jpg, that saves a lot of space!
  for i in $PREFIX*.ppm; do
    convert $i $(basename $i .ppm).jpg
  done
  convert $PREFIX*.jpg $PDF
  mv $PDF $WD
  cd $WD
  rm -rf $TMP
}
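
Usage, with a hypothetical file name: convert-pdf my-scan.pdf rebuilds the PDF in place and keeps the original as my-scan.pdf.bak.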


Solving administration problems with debugging tools (strace)

Sunday, November 16, 2014 - No comments

This week, we wanted to set up a printer on a colleague's computer. It worked in the CUPS web interface (http://localhost:631), but Gnome control center was freezing when we tried to access the printer configuration.

(screenshot: the Gnome control center, frozen)

How can you get a clue about what's going on?

GDB might be a bit of overkill, even if your distribution provides Gnome's source code and debug information.

But strace can be helpful!

 $ strace gnome-control-center
 execve("/usr/bin/gnome-control-center", ["gnome-control-center"], [/* 37 vars */]) = 0
 brk(0)                                  = 0x1ee9000
 access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=264676, ...}) = 0
 mmap(NULL, 264676, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7486506000
 [...]
 <click on 'Printers'>
 [...]

 connect(15, {sa_family=AF_INET, sin_port=htons(631), sin_addr=inet_addr("192.168.1.25")}, 16) = -1 EINPROGRESS (Operation now in progress)
 fcntl(15, F_SETFL, O_RDWR)              = 0
 poll([{fd=15, events=POLLIN|POLLOUT}], 1, 250) = 0 (Timeout)
 poll([{fd=15, events=POLLIN|POLLOUT}], 1, 250) = 0 (Timeout)

Here it is: Gnome tries to connect to a network address, and the data polls time out. In fact, the colleague had configured his system to connect to the company CUPS server, which was not reachable from our lab, so Gnome was trying again and again to connect to this address, unsuccessfully.

To go one step further, and find where Gnome picks this address, you can check what files the program opens:

$ strace -e open,connect gnome-control-center
[...]
open("/home/kevin/.cups/client.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/cups/client.conf", O_RDONLY) = 15
open("/home/kevin/.cups/client.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
connect(15, {sa_family=AF_INET, sin_port=htons(631), sin_addr=inet_addr("192.168.1.25")}, 16) = -1 EINPROGRESS (Operation now in progress)

Good catch: /etc/cups/client.conf is opened right before the connect call, easy peasy! (but it's not always that simple ;-)

$ cat /etc/cups/client.conf 
# see 'man client.conf'
#ServerName /run/cups/cups.sock #  alternative: ServerName hostname-or-ip-address[:port] of a remote server
ServerName 192.168.1.25

(I knew it, I just changed it 5 mins ago to recreate the problem!)


Different problem, same solution. I use open2300 to access the data of my weather station. I usually access it from the Raspberry Pi that I set up last year, but today I need to connect it to my desktop computer ... and it doesn't work:

$ ./interval2300 0 0
Unable to open serial device /dev/ttyUSB1

indeed, the weather station is on ttyUSB0, not ttyUSB1. The quick and dirty solution is cd /dev; sudo ln -s ttyUSB0 ttyUSB1, but that disappears on reboot (and I told myself not to create a udev rule just for that!). So, I had to understand where open2300 gets that file name: strace, there you go!

$ strace -e open ./interval2300 0 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("open2300.conf", O_RDONLY)         = -1 ENOENT (No such file or directory)
open("/usr/local/etc/open2300.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/open2300.conf", O_RDONLY)    = 3
open("/dev/ttyUSB1", O_RDWR|O_NONBLOCK) = -1 ENOENT (No such file or directory)

Unable to open serial device /dev/ttyUSB1
+++ exited with 1 +++

Three calls for dynamic libraries, and three for the configuration file (current directory and /usr/local/etc are missing, and finally /etc/open2300.conf is found). Thanks again strace!

Abort early or crash late

Sunday, November 16, 2014 - No comments

Last week I was discussing with a colleague a problem I had with Python's assert statement, which can't be disabled selectively (only globally, with the interpreter's -O option). Let's consider this code snippet that sums up the situation*:

# expects digit as an integer, 
#         unfrequentFlag as a boolean
def fct(digit, unfrequentFlag=False):
  assert type(digit) is int
  print("you gave digit #{}".format(digit))

  if unfrequentFlag:
    digit += 1

  return digit

# test cases
fct(1)
fct("2")
fct("3", unfrequentFlag=True)

From the bug taxonomy presented earlier, we can say that:

  • cases #2 and #3 are code defects that infect the program state,
  • the program instruction digit += 1 causes the failure if the state is infected and the unfrequentFlag is set,
  • the assertion type(digit) is int ensures that the state is correct when entering the function body.

In Python, asserts can't be disabled at a finer grain than the whole interpreter run, and that's a problem for me, because I wanted to have the ability:

  • during development, to abort early, that is, as soon as I know the program state is infected
  • in production, to avoid crashes as much as possible.

In case #2, the execution won't crash (without the assert I mean). The state is invalid, but the instruction causing the failure is not executed, so it goes unnoticed, everybody's happy!
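
To make this concrete, here is what a hypothetical session looks like, with the snippet above saved as example.py (with CPython, the -O option strips the assert statements):

$ python example.py          # asserts enabled: abort early, at case #2
you gave digit #1
Traceback (most recent call last):
  ...
AssertionError

$ python -O example.py       # asserts stripped: case #2 goes unnoticed
you gave digit #1
you gave digit #2
you gave digit #3
Traceback (most recent call last):
  ...
TypeError: cannot concatenate 'str' and 'int' objects

Case #2 silently returns the string "2" to its caller, propagating the infection; case #3 crashes late, inside fct(), far from the defective call site.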

Is this reasoning flawed? Did I miss something in duck typing? I guess so, otherwise assertions could be disabled more easily in Python ...

In statically typed languages like Java, such problems are detected even earlier, by the type system of the compiler, but that's not the discussion here! Java also faces the problem I described here, for instance with a null object: assert(obj != null), and later on the dereference of the object in question.

I also know that unit testing is the solution, but who writes unit tests for code that is in no way critical? (I had to write non-regression tests for the patches I contributed to GDB, and along with the documentation, it can take longer to write than the patch itself, so you must have good motivations to slow down the development by a factor of two!) Automatic tools like the compiler parser give you the first level of guarantee for free!

* I know this is not pythonic code, that's just an example ;-)

How Does a C Debugger Work? (GDB Ptrace/x86 example)

Thursday, November 13, 2014 - 15 comments

When you use GDB, you can see that it has complete control over your application process. Hit Ctrl-C while the application is running: the process execution stops, and GDB shows its current location, stack trace, etc.

But how can it do it?

How do they not work?

Let's start first with how it doesn't work. It doesn't simulate the execution by reading and interpreting the binary instructions. It could, and that would work (that's the way the Valgrind memory debugger works, and also the way virtual machines like Qemu work), but it would be too slow: Valgrind slows the application down ~1000x, GDB doesn't.

So, what's the trick? Black magic! ... no, that would be too easy.

Another guess? ... ? Hacking! yes, there's a good deal of that, plus help from the OS kernel.

First of all, there's one thing to know about Linux processes: parent processes can get additional information about their children, in particular the ability to ptrace them. And, as you can guess, the debugger is the parent of the debuggee process (or it becomes its parent: processes can adopt a child in Linux :-).

Linux Ptrace API

The Linux ptrace API allows a (debugger) process to access low-level information about another process (the debuggee). In particular, the debugger can:

  • read and write the debuggee's memory: PTRACE_PEEKTEXT, PTRACE_PEEKUSER, PTRACE_POKE...
  • read and write the debuggee's CPU registers: PTRACE_GETREGSET, PTRACE_SETREGS,
  • be notified of system events: PTRACE_O_TRACEEXEC, PTRACE_O_TRACECLONE, PTRACE_O_EXITKILL, PTRACE_SYSCALL (you can recognize the exec syscall, clone, exit, and all the other syscalls)
  • control its execution: PTRACE_SINGLESTEP, PTRACE_KILL, PTRACE_INTERRUPT, PTRACE_CONT (notice the CPU single-stepping here)
  • alter its signal handling: PTRACE_GETSIGINFO, PTRACE_SETSIGINFO
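
To make this API a bit more concrete, here is a minimal sketch (x86_64 Linux, error handling omitted) where the parent traces its child, stops it on the exec, and peeks at its registers and memory:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let my parent trace me */
        execlp("ls", "ls", NULL);              /* the exec raises a SIGTRAP */
    } else {
        int status;
        struct user_regs_struct regs;

        waitpid(child, &status, 0);            /* wait for the exec trap */

        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("child stopped, %%rip = %llx\n", regs.rip);

        /* read one word of the debuggee's memory, at its current IP */
        long word = ptrace(PTRACE_PEEKTEXT, child, (void *) regs.rip, NULL);
        printf("instruction word at %%rip: %lx\n", word);

        ptrace(PTRACE_CONT, child, NULL, NULL); /* resume the execution */
        waitpid(child, &status, 0);
    }
    return 0;
}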

How is Ptrace implemented?

Ptrace implementation is outside of the scope of this post, but I don't want to move the black-box one step above, so let me explain quickly how it works (I'm no kernel expert, please correct me if I'm wrong and excuse me if I simplify too much :-).

Ptrace is part of the Linux kernel, so it has access to all the kernel-level information about the process: its memory mapping, the registers saved when it was scheduled out, its signal-handling state, and the tracing hooks on the syscall entry/exit paths.

What about systems without Ptrace?

The explanation above targeted Linux native debugging, but it's valid for most of the other environments. To get a clue about what GDB asks of its different targets, you can take a look at the operations of its target stack.

In this target interface, you can see all of the high-level operations required for C debugging:

struct target_ops
{
  struct target_ops *beneath;   /* To the target under this one.  */
  const char *to_shortname;     /* Name this target type */
  const char *to_longname;      /* Name for printing */
  const char *to_doc;           /* Documentation.  Does not include trailing
                                   newline, and starts with a one-line descrip-
                                   tion (probably similar to to_longname).  */

  void (*to_attach) (struct target_ops *ops, const char *, int);
  void (*to_fetch_registers) (struct target_ops *, struct regcache *, int);
  void (*to_store_registers) (struct target_ops *, struct regcache *, int);
  int (*to_insert_breakpoint) (struct target_ops *, struct gdbarch *,
                               struct bp_target_info *);
  int (*to_insert_watchpoint) (struct target_ops *,
                               CORE_ADDR, int, int, struct expression *);
  ...
};

The generic part of GDB calls these functions, and the target-specific parts implement them. It is (conceptually) shaped as a stack, or a pyramid: the top of the stack is quite generic, while the bottom layers deal with the actual execution environment (native ptrace debugging, the remote protocol, core files, ...).

The remote target is interesting, as it splits the execution stack between two "computers", through a communication protocol (TCP/IP, serial port).

The remote part can be gdbserver, running on another Linux box. But it can also be an interface to a hardware-debugging port (JTAG) or a virtual machine hypervisor (e.g., Qemu), which will play the role of the kernel+ptrace. Instead of querying the OS kernel structures, the remote debugging stub will query the hypervisor structures, or directly the hardware registers of the processor.
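
As an illustration, here are a few simplified packets of that remote protocol, reusing the addresses of the ls example further down; each packet is framed as $payload#checksum (xx stands for the two checksum digits left out here):

-> $m402c60,4#xx      read 4 bytes of the debuggee's memory at 0x402c60
<- $41574156#xx       reply: the bytes (push %r15; push %r14)
-> $g#67              read all the CPU registers
-> $Z0,402c60,1#xx    insert a software breakpoint at 0x402c60
-> $c#63              continue the execution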

For further reading about this remote protocol, Embecosm wrote a detailed guide about the different messages. Gdbserver's event-processing loop is there, and Qemu's gdb-server stub is also online.

To sum up

We can see here that all the low-level mechanisms required to implement a debugger are there, provided by this ptrace API:

  • Catch the exec syscall and block the start of the execution,
  • Query the CPU registers to get the process's current instruction and stack location,
  • Catch for clone/fork events to detect new threads,
  • Peek and poke data addresses to read and alter memory variables.

But is that all a debugger does? No, those are just the very low-level parts ... It also deals with symbol handling: that's the link between the binary code and the program sources. And one thing is still missing, maybe the most important one: breakpoints! I'll first explain how breakpoints work, as it's quite interesting and tricky, then I'll come back to symbol management.

Breakpoints are not part of Ptrace API

As we've seen above, breakpoints are not part of the ptrace API services. But we can alter the memory and receive the debuggee's signals. You can't see the link? That's because breakpoint implementation is quite tricky and hacky! Let's examine how to set a breakpoint at a given address (a code sketch follows the list):

  1. The debugger reads (ptrace peek) the binary instruction stored at this address, and saves it in its data structures.
  2. It writes a trapping instruction at this location. This instruction can be a dedicated debugging instruction (INT3/0xCC on x86, ebreak on RISC-V), or any invalid instruction for the given CPU.
  3. When the debuggee reaches this instruction (or, put more correctly, when the CPU, configured with the debuggee's memory context, reaches it), it won't be able to execute it normally, and it will trap.
  4. In modern multitask OSes, an invalid instruction doesn't crash the whole system; it gives control back to the OS kernel, by raising an interruption (or a fault).
  5. This interruption is translated by Linux into a SIGTRAP signal, and transmitted to the process ... or to its parent, as the debugger asked for.
  6. The debugger gets the information about the signal, and checks the value of the debuggee's instruction pointer (i.e., where the trap occurred). If the IP address is in its breakpoint list, that means it's a debugger breakpoint (otherwise, it's a fault in the process, just pass the signal and let it crash).
  7. Now that the debuggee is stopped at the breakpoint, the debugger can let its user do whatever s/he wants, until it's time to continue the execution.
  8. To continue, the debugger needs to 1/ write the correct instruction back in the debuggee's memory, 2/ single-step it (continue the execution for one CPU instruction, with ptrace single-step) and 3/ write the invalid instruction back (so that the execution can stop again next time). And 4/, let the execution flow normally.

Neat, isn't it? As a side remark, you can notice that this algorithm will not work if not all the threads are stopped at the same time (because running threads may pass the breakpoint while the valid instruction is in place). I won't detail the way the GDB developers solved it, but it's discussed in detail in this paper: Non-stop Multi-threaded Debugging in GDB. Put briefly, they write the instruction somewhere else in memory, set the instruction pointer to that location and single-step the processor. But the problem is that some instructions are address-related, for example the jumps and conditional jumps ...

Symbol and debug information handling

Now, let's come back to the symbol and debug information handling aspect. I didn't study that part in detail, so I'll only present an overview.

First of all, can we debug without debug information and symbol addresses? The answer is yes: as we've seen above, all the low-level commands deal with CPU registers and memory addresses, not source-level information. Hence, the link with the sources is only for the user's convenience. Without debug information, you'll see your application the way the processor (and the kernel) sees it: as binary (assembly) instructions and memory bits. GDB doesn't need any further information to translate binary data into CPU instructions:

(gdb) x/10x $pc # heXadecimal representation
0x402c60:   0x56415741  0x54415541  0x55f48949  0x4853fd89
0x402c70:   0x03a8ec81  0x8b480000  0x8b48643e  0x00282504
0x402c80:   0x89480000  0x03982484
(gdb) x/10i $pc # Instruction representation
=> 0x402c60:    push   %r15
0x402c62:   push   %r14
0x402c64:   push   %r13
0x402c66:   push   %r12
0x402c68:   mov    %rsi,%r12
0x402c6b:   push   %rbp
0x402c6c:   mov    %edi,%ebp
0x402c6e:   push   %rbx
0x402c6f:   sub    $0x3a8,%rsp
0x402c76:   mov    (%rsi),%rdi

Now if we add symbol handling information, GDB can match addresses with symbol names:

(gdb) p $pc
$1 = (void (*)()) 0x402c60 <main>

You can list the symbols of an ELF binary with nm -a $file:

nm -a /usr/lib/debug/usr/bin/ls.debug | grep " main"
0000000000402c60 T main

GDB will also be able to display the stack trace (more on that later), but with limited interest:

(gdb) where
#0  write ()
#1  0x0000003d492769e3 in _IO_new_file_write ()
#2  0x0000003d49277e4c in new_do_write ()
#3  _IO_new_do_write ()
#4  0x0000003d49278223 in _IO_new_file_overflow ()
#5  0x00000000004085bb in print_current_files ()
#6  0x000000000040431b in main ()

We've got the PC addresses and the corresponding functions, but that's it. Inside a function, you'll need to debug in assembly!

Now let's add debug information: that's the DWARF standard, gcc's -g option. I'm not very familiar with this standard, but I know it provides:

  • address to line and line to address mapping
  • data type definitions, including typedefs and structures
  • local variables and function parameters, with their type

Try dwarfdump to see the information embedded in your binaries. addr2line also uses this information:

$ dwarfdump /usr/lib/debug/usr/bin/ls.debug | grep 402ce4
0x00402ce4  [1289, 0] NS
$ addr2line -e /usr/lib/debug/usr/bin/ls.debug  0x00402ce4
/usr/src/debug/coreutils-8.21/src/ls.c:1289

Many source-level debugging commands rely on this information, like the next command, which sets a breakpoint at the address of the next line, or the print command, which relies on the type definitions to display variables with their right type (char, int, float, instead of raw binary/hexadecimal!).

Last words

We've seen many aspects of a debugger's internals, so I'll just say a few words about the last points:

  • the stack trace is "unwound" from the current frame ($sp and $bp/$fp) upwards, one frame at a time. Functions' names, parameters and local variables are found in the debug information.
  • watchpoints are implemented (if available) with the help of the processor: write in its registers which addresses should be monitored, and it will raise an exception when the memory is read or written. If this support is not available, or if you request more watchpoints than the processor supports ... then the debugger falls back to "hand-made" watchpoints: execute the application instruction by instruction, and check if the current operation touches a watchpointed address. Yes, that's very slow!
  • Reverse debugging can be done this way too: record the effect of each instruction, and apply it backward for reverse execution.
  • Conditional breakpoints are normal breakpoints, except that, internally, the debugger checks the conditions before giving the control to the user. If the condition is not matched, the execution is silently continued.

And play with gdb gdb, or better (way better actually), gdb --pid $(pidof gdb), because two debuggers in the same terminal is insane :-). Another great thing for learning is system debugging:

qemu-system-i386 -gdb tcp::1234
gdb --pid $(pidof qemu-system-i386)
gdb /boot/vmlinuz -ex "target remote localhost:1234"

but I'll keep that for another article!

Bug(ging) and debugging

Monday, November 10, 2014 - No comments

At the beginning of my PhD, I read two interesting books about debuggers. One by J. Rosenberg, How Debuggers Work: Algorithms, Data Structures, and Architecture, which describes the internal algorithms of interactive debuggers; and another by A. Zeller, Why Programs Fail: A Guide to Systematic Debugging, which discusses how programmers introduce bugs in their applications. In particular, in the latter book, Zeller explains what a bug is, through four different steps:

1. The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because every code is initially written by a programmer, the defect is technically created by the programmer.

2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended. A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.

3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.

4. The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.


It's important to have these four steps in mind when you develop and debug: even though you may have a problem (a defect) in your code, if step 2 (or 4) never happens, it won't be visible in your application .... until the execution takes another code path.

Likewise, it may be easy to 'fix' a failure, but that doesn't mean that your problem is actually resolved.



I just read another thought on debugging that I quite like, which compares it with Sherlock Holmes' investigation technique:

How do you debug?

> Most people, if you describe a train of events to them, will tell you what the result would be. They can put those events together in their minds, and argue from them that something will come to pass. There are few people, however, who, if you told them a result, would be able to evolve from their own inner consciousness what the steps were which led up to that result. This power is what I mean when I talk of reasoning backwards, or analytically.
>
> Sherlock Holmes, A Study in Scarlet, by Sir Arthur Conan Doyle


Debugging is indeed reasoning backwards: you see consequences, a failure (or a murder), and you investigate what the causes can be.

I don't agree with his conclusion though,

The Holmes method of debugging is superior, I think, to the scientific method of debugging because debugging isn’t just a science. There’s an art to knowing where to look and what data is needed. This comes from experience and is as much intuition as it is logic. Practice debugging and you will be a better debugger.

as I think the two methods are simply complementary. You investigate what the causes could be, then you make hypotheses and try to validate or disprove them. The better investigator you are, the easier it will be to formulate hypotheses and prove them right and useful!