May 14, 2013
 

Every remotely relevant reference I came across during the last 15 years or so resides in a single bibtex file. That is not a problem. The problem is that I’m moving into a shiny, new but somewhat smaller office, together with hundreds of copies of journal articles and hundreds of PDFs. Wouldn’t it be good to know which physical copies are effectively redundant (unreadable comments in the margins aside) and can therefore stay behind?

The trouble is that bibtex files have a rather flexible, human-readable format. Each entry begins with the @ sign, followed by a type (book, article, etc.), a reference name, lots of key/value pairs (fields) in arbitrary order, and even more curly braces.
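To make that structure concrete, here is a made-up entry together with a minimal Python sketch that pulls out the type, the reference name and the field names. The entry, the helper name describe_entry and the regular expressions are assumptions for illustration only; they cope with this simple one-field-per-line layout, not with the full bibtex grammar.

```python
import re

# A made-up entry for illustration; "binder" is a custom field, "file" points to a PDF.
SAMPLE = """@article{doe2001example,
  author = {Doe, Jane},
  title  = {An {Example} Title},
  year   = {2001},
  binder = {A3},
  file   = {doe2001.pdf}
}"""


def describe_entry(text):
    """Return (type, reference name, field names) for one simply formatted entry."""
    # Header: "@type{name," at the very start of the entry.
    header = re.match(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", text)
    # Fields: assumed to start a line and look like "name = ...".
    fields = re.findall(r"^\s*(\w+)\s*=", text, flags=re.MULTILINE)
    return header.group(1).lower(), header.group(2), fields


print(describe_entry(SAMPLE))
```

Real files are messier than this (fields sharing a line, quoted values, nested braces), which is exactly why a dedicated tool is attractive.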

grep @ full.bib|wc -l tells me that I have 2914 references in total. grep binder full.bib|wc -l (binder is a custom field that I use to keep track of the location of my copies) shows that I have printed out/copied 712 texts over the years, and grep file full.bib|wc -l indicates that there are 504 PDFs residing on my filesystem. But what is the magnitude of the intersection?

My first inclination was to look for a suitable Python parser/library. Pybtex looked good in principle but is underdocumented and had trouble reading full.bib, because that file is encoded in Latin-1. So endless hours of amateurish coding and procrastination lay ahead. Then I remembered the “do one thing, and do it really well” mantra of old. Enter bibtool, which is a fast and reasonably stable bibtex file filter and pretty printer. Bibtool reads “resource files”, which are really just short scripts containing filtering/formatting directives. select = {binder ".+"} keeps those references whose “binder” field contains at least one character (.+ is a regular expression that matches any non-empty string). select = {file ".+"} selects all references for which I have a PDF. But when both directives appear in the same resource file, bibtool applies a logical OR to these conditions, while I’m interested in finding those references that meet both criteria.

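The quick solution is to store each statement in a resource file of its own and apply bibtool twice, using a pipeline for extra efficiency: bibtool -r find-binder.rsc full.bib|bibtool -r find-pdf.rsc >intersection.bib does the trick. The second run only sees the references that survived the first, so the two conditions are effectively combined with a logical AND. This solves my problem in under a minute, without any coding.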
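Why chaining two filters gives the intersection can be shown with a rough Python sketch on made-up data. This is not how bibtool works internally; the sample entries and the helpers split_entries, has_field and keep are hypothetical, and the sketch assumes that every entry starts with @ at the beginning of a line and that every field sits on a line of its own.

```python
import re


def split_entries(text):
    """Split bibtex source into entries, assuming each starts with '@' at line start."""
    return re.findall(r"(?ms)^@.*?(?=^@|\Z)", text)


def has_field(entry, name):
    """True if the entry has a non-empty field written as name = {...} or name = "..."."""
    pattern = r"(?mi)^\s*" + re.escape(name) + r'\s*=\s*[{"]\s*[^}"\s]'
    return re.search(pattern, entry) is not None


def keep(entries, name):
    """One filtering pass, analogous to one bibtool run with a single select rule."""
    return [e for e in entries if has_field(e, name)]


# Made-up entries: one printed and scanned, one printed only, one PDF only, one neither.
SAMPLE = """@article{a1,
  title = {Printed and scanned},
  binder = {A3},
  file = {a1.pdf}
}
@book{b2,
  title = {Printed only},
  binder = {B1}
}
@article{c3,
  title = {PDF only},
  file = {c3.pdf}
}
@misc{d4,
  title = {Neither}
}
"""

entries = split_entries(SAMPLE)
# Both conditions in one pass, combined with OR: the union.
either = [e for e in entries if has_field(e, "binder") or has_field(e, "file")]
# Two passes in a row: the second only sees what the first kept, so this is AND.
both = keep(keep(entries, "binder"), "file")
print(len(entries), len(either), len(both))
```

For a real file, the text could be read with open("full.bib", encoding="latin-1").read(), which sidesteps the encoding problem mentioned above.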

As it turns out, there were just 65 references in both groups. Apparently, I stopped printing (or at least filing away) some time ago. Eventually, I binned two copies, but it is the principle that matters.