Processing scanned papers, books, and notes

May 21, 2010

It’s been a long time since I posted anything. You can probably forgive me. I’ve been on the job market and just recently secured a job for next year. Yay.

I’ve been processing various scanned papers, books, and notes for easy reading on a (certain unnamed) eBook reading “device” I recently acquired and curl up with nightly before bedtime. Said device reads PDFs really well but struggles with other formats, the DjVu format in particular. A DjVu reader exists for said device, but it’s slow, featureless, and junky (e.g. no text search).

The DjVu format is nice: it provides high-fidelity scanned images packaged with searchable text. I want PDF files though, so here’s what I’m doing to batch-convert my .djvu files to optimized, searchable (OCR’d) PDFs.

  1. Convert a folder full of .djvu files to PostScript. Here’s a Bash shell script that calls the djvups program, which comes with the Mac OS X packaged version of DjView (and DjVuLibre).

    for file in *.djvu; do
        name="${file%.djvu}"   # strip the .djvu extension
        djvups "${name}.djvu" "${name}.ps"
    done

  2. Convert the PostScript .ps files to PDF files. I’m using Adobe Distiller because I have access to a machine that has it installed (as well as Acrobat Pro), but you can use the free and open source ‘ps2pdf’ command (along with a modified version of the script above). ps2pdf is part of Ghostscript, which often gets installed alongside TeX Live or another modern LaTeX distribution.
  3. Batch optimize the PDF files, OCR them, and embed the text. I’m using Adobe Acrobat Pro, which is slick but definitely not free. I’d like to figure out a free / open source solution to this part.
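
For the free route in step 2, a modified version of the step 1 script might look roughly like the sketch below. I haven't run this against my own files, so treat it as a starting point: it assumes ps2pdf (from Ghostscript) is on your PATH, and the pdf_name helper is just a name I made up for illustration.

```shell
#!/bin/sh
# Sketch: convert every .ps file in the current folder to a PDF with the same base name.
# Assumption: Ghostscript's ps2pdf is installed and on the PATH.

# Hypothetical helper: strip a trailing .ps and append .pdf.
pdf_name() {
    printf '%s.pdf\n' "${1%.ps}"
}

for file in *.ps; do
    # If no .ps files exist, the pattern stays literal; skip it.
    [ -e "$file" ] || continue
    ps2pdf "$file" "$(pdf_name "$file")"
done
```

Run it from inside the folder that holds the .ps files produced in step 1. The quoting around the file names is there so that names with spaces survive.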
