(gdb) break *0x972

Debugging, GNU± Linux and WebHosting and ... and ...

Printing corrupted (scanned) PDF

My printer is ... special. From its web interface, you can scan a document and get a PDF, but you can't print it!

It generates "nature-friendly" PDFs! Only blank white pages come out of the printer.

THINK BEFORE YOU PRINT: Please consider the environment before printing this email.

Printing doesn't work from Evince, nor with pdf2ps, nor via Evince > Print to File > print, nor with convert.

But Evince does print some useful information:

Syntax Error (5404808): Illegal character '>'
Corrupt JPEG data: premature end of data segment
Corrupt JPEG data: premature end of data segment
Corrupt JPEG data: premature end of data segment

The JPEG images contained in the PDF are corrupted. For some reason, Evince can display them on screen, but cannot translate them to PostScript for the printer ... There is presumably a PDF library underneath, used to convert PDFs into other formats, that doesn't handle invalid images.

Fortunately, Poppler's pdfimages doesn't rely on that "broken" library, and it can extract all the images from a PDF!

When you try to export the images in JPEG format (option -j), it still doesn't work, as that just copies the invalid images out of the PDF unchanged. Eye of GNOME can't display them and explains why:

Error interpreting JPEG image file (Maximum supported image dimension is 65500 pixels)

However, pdfimages can also export PPM images (portable pixmap file format), and those are not invalid! Yay! :-)

pdfimages $PDF $PREFIX
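Since a broken scan can also yield no usable images at all, it may help to check what the extraction produced before going further. The following is only a sketch: extract_ppm is a hypothetical helper name, and it assumes that pdfimages names its output files PREFIX-NNN.ppm.

```shell
# Hypothetical helper (not part of Poppler): extract the images of a PDF
# as PPM files and print how many were written.
#   $1 = PDF file, $2 = output prefix
extract_ppm() {
  pdf=$1
  prefix=$2
  # Without -j, pdfimages decodes each image instead of copying the raw
  # (here: corrupted) JPEG stream out of the PDF.
  pdfimages "$pdf" "$prefix" || return 1
  # Count the PPM files; an unmatched glob stays literal and is skipped.
  count=0
  for f in "$prefix"-*.ppm; do
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

For example, extract_ppm "$PDF" "$PREFIX" prints the number of PPM files found; a result of 0 suggests the images were written in another format, or not at all.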

and with ImageMagick's convert, you can rebuild your PDF:

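convert-pdf() {
  PDF=${1:-$PDF}           # PDF in the current directory (argument or $PDF)
  PREFIX=${PREFIX:-img}    # prefix for the extracted images
  WD=$PWD                  # remember where we started
  TMP=$(mktemp -d)
  cp "$PDF" "$TMP"
  mv "$PDF" "$PDF.bak"     # keep the original as a backup
  cd "$TMP" || return 1
  pdfimages "$PDF" "$PREFIX"
  # convert ppm to jpg, that saves a lot of space!
  for i in "$PREFIX"*.ppm
  do
    convert "$i" "$(basename "$i" .ppm).jpg"
  done
  # rebuild the PDF from the valid JPEGs, one image per page
  convert "$PREFIX"*.jpg "$PDF"
  mv "$PDF" "$WD"
  cd "$WD" || return 1
  rm -rf "$TMP"
}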
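One caveat with the loop over *.ppm: pdfimages can also write other netpbm files, such as .pbm for 1-bit images, and those would be silently left out of the rebuilt PDF. A possible variant of the conversion step is sketched below; netpbm_to_jpg is a hypothetical helper name, and it assumes the PREFIX-NNN.ext naming used by pdfimages.

```shell
# Hypothetical helper: convert every extracted netpbm image to JPEG.
#   $1 = prefix that was given to pdfimages
netpbm_to_jpg() {
  prefix=$1
  for i in "$prefix"-*.ppm "$prefix"-*.pgm "$prefix"-*.pbm; do
    # A pattern that matched nothing stays literal; skip it.
    [ -e "$i" ] || continue
    # ${i%.*} strips the extension, so img-000.ppm becomes img-000.jpg.
    convert "$i" "${i%.*}.jpg" || return 1
  done
}
```

The JPEG names keep the image number, so a later glob such as "$PREFIX"*.jpg still lists the pages in their original order.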