My wife is currently writing her HDR thesis (in France, this is an "accreditation to supervise research"). As part of this, she asked me if it would be possible to split her bibliography into two parts: one containing her own publications and another for the rest of her references.
After a tiny bit of searching, I found this TeX Stack Exchange answer: https://tex.stackexchange.com/a/407363
However, the answer uses the biblatex package, and my wife was using plain BibTeX. (Dun dun duuun!)
No matter, we can probably switch to biblatex, right? We only had about 6k lines of LaTeX source code and 3k lines worth of BibTeX data, how hard could it be?
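For the curious, the split itself is not much code once biblatex is in play. A minimal sketch, assuming your own entries carry a keywords = {own} field in the .bib file (the keyword name, citation keys, and section titles below are placeholders of mine, not necessarily what the linked answer uses):
\documentclass{article}
\usepackage[backend=biber]{biblatex}
\addbibresource{bibliography.bib}
\begin{document}
Citing my own work \cite{myown2018} and someone else's \cite{other2015}.
% Entries tagged with keywords = {own} ...
\printbibliography[keyword=own, title={Publications by the author}]
% ... and everything else.
\printbibliography[notkeyword=own, title={References}]
\end{document}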
To cut a slightly long story short, I ended up using the biber backend; see https://tex.stackexchange.com/a/25702/ for a great overview of the various LaTeX bibliography tools and packages. Biber can use the same bibliography (.bib) file, but it has a few differences in how it does its processing and default formatting, which in turn means that slightly different bits of your bibliography file end up getting output in the final document.
Because of the way that LaTeX bibliographies work, the data from your .bib file doesn't end up getting output directly into your document -- there is a roundabout process where you first have to run LaTeX (pdflatex in my case) to find out which references are used, then you have to run biber to (I think) extract the relevant references from your .bib into an auxiliary .bbl file, and then finally you run pdflatex once more to have it actually include the data from the .bbl in your "References" section.
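Concretely, the cycle looks something like this (the same sequence shows up again below inside the interestingness test):
$ pdflatex main    # first pass: records which references are cited
$ biber main       # extracts those entries from the .bib into main.bbl
$ pdflatex main    # second pass: typesets the "References" section from main.bbl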
The bottom line is: this whole process means that if you have a weird .bib entry that perhaps contains some special bit of markup, and that markup cannot be used in the final context where the bibliography appears in the output document, you will get errors. Errors that just point to your main LaTeX file, at the line where your \printbibliography command is. Which, for a 3k-line .bib file, is rather inconvenient to debug.
So what do you do when your pdflatex run ends with something like this:
[]\OT1/bch/m/n/12 BSI. | []. AIS 20 / AIS
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
\egroup \UL@putbox \fi \if...
l.474
?
Enter C-Reduce...
C-Reduce
According to its website, "C-Reduce is a tool that takes a large C, C++, or OpenCL file that has a property of interest (such as triggering a compiler bug) and automatically produces a much smaller C/C++ file that has the same property".
But how is that relevant here, given that we are dealing with LaTeX? Well, it turns out that C-Reduce can also work with non-C/C++ files, meaning that we now have a way to "reduce" our document (or, well, bibliography file) until it contains ONLY the bits that are actually causing us problems.
The way C-Reduce works is that it takes two inputs: an "interestingness" test (which is really just a shell script) and the file that you would like to reduce. The interestingness test should return either 0 (success) or 1 (failure) depending on whether the document C-Reduce gave it has the property you are searching for.
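In skeleton form, an interestingness test is just a few lines; the build command, file name, and grep pattern here are placeholders (the real test for this problem comes further down):
#! /bin/bash
# Build the candidate file that C-Reduce placed in the current directory.
some-build-command file-being-reduced > build.log 2>&1 || true
# Exit 0 ("interesting") only if the behaviour we care about is still there.
grep -q 'the error we care about' build.log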
In our case, the property we want is that LaTeX prints the error we originally encountered. We can find all those errors simply by grepping the pdflatex log file. Note that the first pdflatex run, as well as the biber run, will both succeed without errors, as the error only appears when the bibliography is actually printed in the final document:
$ pdflatex main
[...]
$ biber main
[...]
$ pdflatex -interaction=nonstopmode main
[...]
$ grep -A1 '^!' main.log
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup
--
! Extra }, or forgotten \endgroup.
\UL@stop ... \UL@putbox \fi \else \egroup \egroup
--
! Missing } inserted.
<inserted text>
--
! Missing } inserted.
<inserted text>
--
! Undefined control sequence.
\namepartfamily ->\addtext
--
! Undefined control sequence.
<argument> \addtext
Since we want the errors to remain the same, we can make our interestingness test check that this output remains stable. A quick way to do that is to just hash the output of the command above and ensure the hash doesn't change:
$ grep -A1 '^!' main.log | sha1sum
8ab121373e6b0232f8789f093db4bf20f3bb32c9 -
In the interestingness test shell script we'd then put:
[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1
This will succeed when the grep output is what we expect -- and return 1 when it changes.
It can be worth playing with different combinations of grep options. The ones I found most useful in this kind of context are:
- -m N (stop processing after N matches)
- -A N (output N lines following a match)
- -B N (output N lines preceding a match)
If there are contextual clues that should remain the same (for example the []\OT1/bch/m/n/12 BSI. | []. AIS 20 / AIS line in the original error I got), then you can adjust the grep command accordingly.
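For example, to also pin down that contextual line before the first error, the check could become something like this (the hash is a placeholder to be filled in from a known-bad run):
[ "$(grep -m 1 -B 1 -A 1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "<expected sha1>" ] || exit 1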
Multi-file projects
C-Reduce only knows how to reduce a single file at a time, which poses a small problem for our multi-file project. However, it's easy to solve: C-Reduce will start your interestingness test shell script in a new (temporary) directory every time, so all we need to do is to copy in the extra files at the start of the script. In my case I only needed the main .tex file (as the file I was minimizing was the .bib file, and C-Reduce will take care to get that one for you on its own):
# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .
That said, it can be worthwhile to hand-optimize your document a little bit at the start to reduce the compilation time of files that you know are irrelevant to the error and which won't be reduced by C-Reduce. In my particular case, the chapters were split out into separate files and it was easy enough to comment out the lines that said \input{chapter1}, etc. -- meaning that we don't actually need C-Reduce to compile the full document every run; I already knew the problem was with the line that said \printbibliography right at the end of the document. However, commenting out the chapters also removed the citations, which meant that the printed bibliography would be empty, so I also had to add \nocite{*}, which includes all bibliography entries whether they are cited or not.
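In spirit, the trimmed-down driver file ended up looking something like this (an illustrative sketch -- the document class, package options, and chapter names are guesses, not the actual thesis):
\documentclass{book}
\usepackage[backend=biber]{biblatex}
\addbibresource{bibliography.bib}
\begin{document}
%\input{chapter1}  % chapters commented out: irrelevant to the error and slow to compile
%\input{chapter2}
\nocite{*}         % keep every .bib entry in the bibliography despite the missing citations
\printbibliography
\end{document}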
Running C-Reduce
Putting it all together:
$ cat test.sh
#! /bin/bash
# error out by default
set -e
# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .
# try to compile the document
pdflatex main
biber main
pdflatex -interaction=nonstopmode main || true  # this run is expected to hit the errors we are hunting for, so don't let set -e abort here
# check that the original errors are still present
[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1
We can then run C-Reduce with (the --not-c option tells it not to assume the input is C/C++ and to skip the C-specific reduction passes):
creduce --not-c test.sh bibliography.bib
After about 20 minutes, the 3,400-line bibliography.bib had been reduced down to about 47 lines, where it was quite easy to spot the problems by hand: an \addtext around an author name, a stray ~ in a journal name, and a stray # in a month name.
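To give a flavour of what that means, here is a made-up entry (not one from the actual file) exhibiting the same three kinds of problems:
@article{hypothetical2021,
  author  = {\addtext{Doe}, Jane},
  title   = {A Perfectly Ordinary Title},
  journal = {Journal~of Made-Up Results},
  year    = {2021},
  month   = {6#},
}
None of these are syntax errors at the .bib level, which is why biber happily accepts them; the trouble only shows up once the values reach LaTeX via \printbibliography.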
Conclusion
C-Reduce was not made for LaTeX or BibTeX, but was surprisingly efficient at locating hard-to-find sources of compilation errors. It's true that writing interestingness tests can be unintuitive (AKA "Why is my testcase empty?"). Fortunately, I've used C-Reduce quite a bit in the past for C and C++ so it was straightforward to see how to apply it to this particular problem.
One interesting thing to note is that we didn't ask the tool to fix our problem, quite the opposite: We asked it to remove as much as possible that didn't have anything to do with the errors we were seeing, effectively isolating the problem to just the problematic few lines of code.
In general I think isolation is a very powerful debugging technique. It brings clarity to a problem where you can only see the symptoms. That's why Stack Overflow generally asks for "MWEs" (Minimal Working Examples) -- remove confounding variables and everything that is immaterial to the problem at hand; get to the essence of the thing.
On Twitter, some people pointed out a couple of other tools that are like C-Reduce in that they can also minimize files/testcases:
- Sergey Bronnikov mentions halfempty by Tavis Ormandy
- Alexander Potapenko mentions multidelta
I didn't try either of these tools for this specific problem, but I have used halfempty in the past and it's a good tool that's worth getting familiar with. A few years ago I did a simple benchmark of C-Reduce vs. halfempty on C++ source and -- without reading too much into this simplistic comparison -- I think the main takeaway was that halfempty seems to have the potential to be faster when run on fewer cores.