Thursday, July 7, 2022

Stigmergy in programming

Ants are known to leave invisible pheromones on their paths in order to inform both themselves and their fellow ants where to go to find food or signal that a path leads to danger. In biology, this phenomenon is known as stigmergy: the act of modifying your environment to manipulate the future behaviour of yourself or others. From the Wikipedia article:

Stigmergy (/ˈstɪɡmərdʒi/ STIG-mər-jee) is a mechanism of indirect coordination, through the environment, between agents or actions. The principle is that the trace left in the environment by an individual action stimulates the performance of a succeeding action by the same or different agent. Agents that respond to traces in the environment receive positive fitness benefits, reinforcing the likelihood of these behaviors becoming fixed within a population over time.

For ants in particular, stigmergy is useful as it alleviates the need for memory and more direct communication; instead of broadcasting a signal about where a new source of food has been found, you can instead just leave a breadcrumb trail of pheromones that will naturally lead your community to the food.

We humans also use stigmergy in a lot of ways: most notably, we write things down. From post-it notes posted on the fridge to remind ourselves to buy more cheese to writing books that can potentially influence the behaviour of a whole future generation of young people.

Let's face it: We don't have infinite brains and we need to somehow alleviate the need to remember everything. If you remember the movie Memento, the protagonist Leonard has lost his ability to form new long-term memories and relies on stigmergy to inform his future actions; everything that's important he writes down in a place he's sure to come across it again when needed. His most important discoveries he turns into tattoos that he cannot lose or avoid seeing when he wakes up in the morning.

Perhaps a biologist would object and say this is stretching the definition of stigmergy, but I contest that it fits: leaving a trace in the environment in order to stimulate a future action.

For stigmergy to be effective, it must be placed in the right location so that whoever comes across it will perform the correct action at that time. If we return briefly to the shopping list example, we typically keep the list close to the fridge because that is often where we are when we need to write something down -- or when we go to check what we need to buy.

Let's take an example from computing: Have you ever seen a line at the top of a file that says "AUTOMATICALLY GENERATED; DON'T MODIFY THIS"? Well, that's stigmergy. Somebody made sure that line would be there in order to influence the behaviour of whomever came across that file. A little note from the past placed in the environment to manipulate future actions.

In programming, stigmergy mainly manifests as comments scattered throughout the code -- the most common form is perhaps leaving a comment to explain what a piece of code is there to do, where we know somebody will find it and, hopefully, be able to make use of it. Another one is leaving a "TODO" comment where something isn't quite finished -- you may not know that a piece of code isn't handling some corner case just by glancing at it, but a "TODO" comment stands out and may even contain enough information to complete the implementation. In other cases, we see the opposite: "here be dragons"-type comments instructing the reader not to change something, perhaps because the code is known to be complicated, complex, brittle, or prone to breaking.

Stigmergy is a powerful idea, and once you are aware of it you can consciously make use of it to help yourself and others down the line. We're not robotic ants and we can make deliberate choices regarding when, where, and how we modify our environment in order to most effectively influence future behaviour.

Monday, December 13, 2021

Using C-Reduce to debug LaTeX errors

My wife is currently writing her HDR thesis (in France, this is an "accreditation to supervise research"). As part of this, she asked me if it would be possible to split her bibliography into two parts: one containing her own publications and another for the rest of her references.

After a tiny bit of searching, I found this stackoverflow answer:

However, the answer uses the biblatex package, and my wife was using plain BibTeX. (Dun dun duuun!)

No matter, we can probably switch to biblatex, right? We only had about 6k lines of LaTeX source code and 3k lines worth of BibTeX data, how hard could it be?

To cut a slightly long story short, I ended up using the biber backend; see for a great overview of the various LaTeX bibliography tools and packages. Biber can use the same bibliography (.bib) file, but has a few differences in how it does its processing and default formatting. Which in turn means that slightly different bits of your bibliography file end up getting output in the final document.

Because of the way that LaTeX bibliographies work, the data from your .bib file doesn't end up getting output directly into your document -- there is a roundabout way where you first have to run LaTeX (pdflatex in my case) to find out which references are used, then you have to run biber to (I think) extract the relevant references from your .bib into an auxiliary .bbl file, and then finally you run pdflatex once more to have it actually include the data from the .bbl into your "References" section.

The bottom line is: This whole process means that if you have a weird .bib entry that perhaps contains some special bit of a markup and that markup cannot be used in the final context where the bibliography appears in the output document, you will get errors. Errors that just point to your main LaTeX file where your \printbibliography command is. Which, for a 3k lines long .bib file is rather inconvenient to debug.

So what do you do when your pdflatex run ends with something like this:

[]\OT1/bch/m/n/12 BSI. |   []. AIS 20 / AIS
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup 
                                                  \egroup \UL@putbox \fi \if...

Enter C-Reduce...


According to its website, "C-Reduce is a tool that takes a large C, C++, or OpenCL file that has a property of interest (such as triggering a compiler bug) and automatically produces a much smaller C/C++ file that has the same property".

But how is that relevant here, given that we are dealing with LaTeX? Well, it turns out that C-Reduce can also work with non-C/C++ files, meaning that we now have a way to "reduce" our document (or, well, bibliography file) until it contains ONLY the bits that are actually causing us problems.

The way C-Reduce works is that it takes two inputs: an "interestingness" test (which is really just a shell script) and the file that you would like to reduce. The interestingness test should return either 0 (success) or 1 (failure) depending on whether the document C-Reduce gave it has the property you are searching for.

In our case, the property we want is that LaTeX prints the error we originally encountered. We can find all those errors simply by grepping the pdflatex log file. Note that the first pdflatex run, as well as the biber run, will both succeed without errors, as the error only appears when the bibliography is actually printed in the final document:

$ pdflatex main
$ biber main
$ pdflatex -interaction=nonstopmode main
$ grep -A1 '^!' main.log 
! Extra }, or forgotten \endgroup.
\UL@stop ...z@ \else \UL@putbox \fi \else \egroup 
! Extra }, or forgotten \endgroup.
\UL@stop ... \UL@putbox \fi \else \egroup \egroup 
! Missing } inserted.
<inserted text> 
! Missing } inserted.
<inserted text> 
! Undefined control sequence.
\namepartfamily ->\addtext 
! Undefined control sequence.
<argument> \addtext 

Since we want the errors to remain the same, we can make our interestingness test check that this output remains stable. A quick way to do that is to just hash the output of the command above and ensure the hash doesn't change:

$ grep -A1 '^!' main.log | sha1sum
8ab121373e6b0232f8789f093db4bf20f3bb32c9  -

In the interestingness test shell script we'd then put:

[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

This will succeed when the grep output is what we expect -- and return 1 when it changes.

It can be worth playing with different combinations of grep options. The ones I found most useful in this kind of context are:

  • -m N (stop processing after N matches)
  • -A N (output N lines following a match)
  • -B N (output N lines preceding a match)

If there are contextual clues that should remain the same (for example the []\OT1/bch/m/n/12 BSI. | []. AIS 20 / AIS line in the original error I got), then you can adjust the grep command accordingly.

Muti-file projects

C-Reduce only knows how to reduce a single file at a time, which poses a small problem for our multi-file project. However, it's merely a small problem, and it's easy to solve. C-Reduce will start your interestingness test shell script in a new (temporary) directory every time, so all we need to do is to copy in the extra files at the start of the script. In my case I only needed the main .tex file (as the file I was minimizing was the .bib file, and C-Reduce will take care to get that one for you on its own):

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

That said, it can be worthwhile to hand-optimize your document a little bit at the start to reduce the compilation times of files that you know are irrelevant to the error and which won't be reduced by C-Reduce. In my particular case, chapters were split out into separate files and it was easy enough to comment out the lines that said \input{chapter1}, etc. -- meaning that we don't actually need C-Reduce to compile the full document every run; I already knew it was a problem with the line that said \printbibliography right at the end of the document. However, removing the citations meant that the printed bibliography would be empty as well, so I also had to add \nocite{*}, which includes all bibliography entries whether they are cited or not.

Running C-Reduce

Putting it all together:

$ cat
#! /bin/bash

# error out by default
set -e

# get whatever extra files you need to build
cp /home/vegard/hdr/main.tex .

# try to compile the document
pdflatex main
biber main
pdflatex -interaction=nonstopmode main

# check that the original errors are still present
[ "$(grep -A1 '^!' main.log | sha1sum | cut -d ' ' -f 1)" == "8ab121373e6b0232f8789f093db4bf20f3bb32c9" ] || exit 1

We can then run C-Reduce with:

creduce --not-c bibliography.bib

After about 20 minutes, the 3,400-line bibliography.bib had been reduced down to about 47 lines where it was quite easy to spot the problems by hand: \addtext around an author name, a stray ~ in a journal name, and a stray # in a month name.


C-Reduce was not made for LaTeX or BibTeX, but was surprisingly efficient at locating hard-to-find sources of compilation errors. It's true that writing interestingness tests can be unintuitive (AKA "Why is my testcase empty?"). Fortunately, I've used C-Reduce quite a bit in the past for C and C++ so it was straightforward to see how to apply it to this particular problem.

One interesting thing to note is that we didn't ask the tool to fix our problem, quite the opposite: We asked it to remove as much as possible that didn't have anything to do with the errors we were seeing, effectively isolating the problem to just the problematic few lines of code.

In general I think isolation is a very powerful debugging technique. It brings clarity to a problem where you can only see the symptoms. That's why stackoverflow generally asks for "MWEs" (Minimal Working Examples) -- remove confounding variables and everything that is immaterial to the problem at hand; get to the essence of the thing.

On Twitter, some people pointed out a couple of other tools that are like C-Reduce in that they can also minimize files/testcases:

I didn't try either of these tools for this specific problem, but I have used halfempty in the past and it's a good tool that's worth getting familiar with. A few years ago I did a simple benchmark of C-Reduce vs. halfempty on C++ source and -- without putting too much into this simplistic comparison -- I think the main takeaway was that halfempty seems to have the potential to be faster when run on fewer cores.

Sunday, June 24, 2018

Compiler fuzzing, part 1

Much has been written about fuzzing compilers already, but there is not a lot that I could find about fuzzing compilers using more modern fuzzing techniques where coverage information is fed back into the fuzzer to find more bugs.

If you know me at all, you know I'll throw anything I can get my hands on at AFL. So I tried gcc. (And clang, and rustc -- but more about Rust in a later post.)

Levels of fuzzing

First let me summarise a post by John Regehr called Levels of Fuzzing, which my approach builds heavily on. Regehr presents a very important idea (which stems from earlier research/papers by others), namely that fuzzing can operate at different "levels". These levels correspond somewhat loosely to the different stages of compilation, i.e. lexing, parsing, type checking, code generation, and optimisation. In terms of fuzzing, the source code that you pass to the compiler has to "pass" one stage before it can enter the next; if you give the compiler a completely random binary file, it is unlikely to even get past the lexing stage, never mind to the point where the compiler is actually generating code. So it is in our interest (assuming we want to fuzz more than just the lexer) to generate test cases more intelligently than just using random binary data.

If we simply try to compile random data, we're not going to get very far.
In a "naïve" approach, we simply compile gcc with AFL instrumentation and run afl-fuzz on it as usual. If we give a reasonable corpus of existing C code, it is possible that the fuzzer will find something interesting by randomly mutating the test cases. But more likely than not, it is mostly going to end up with random garbage like what we see above, and never actually progress to more interesting stages of compilation. I did try this -- and the results were as expected. It takes a long time before the fuzzer hits anything interesting at all. Now, Sami Liedes did this with clang back in 2014 and obtained some impressive results ("34 distinct assertion failures in the first 11 hours"). So clearly it was possible to find bugs in this way. When I tried this myself for GCC, I did not find a single crash within a day or so of fuzzing. And looking at the queue of distinct testcases it had found, it was very clear that it was merely scratching the very outermost surface of the input handling in the compiler -- it was not able to produce a single program that would make it past the parsing stage.

AFL has a few built-in mutation strategies: bit flips, "byte flips", arithmetic on bytes, 2-bytes, and 4-bytes, insertion of common boundary values (like 0, 1, powers of 2, -1, etc.), insertions of and substitution by "dictionary strings" (basically user-provided lists of strings), along with random splicing of test cases. We can already sort of guess that most of these strategies will not be useful for C and C++ source code. Perhaps the "dictionary strings" is the most promising for source code as it allows you to insert keywords and snippets of code that have at least some chance of ending up as a valid program. For the other strategies, single bit flips can change variable names, but changing variable names is not that interesting unless you change one variable into another (which both have to exist, as otherwise you would hit a trivial "undeclared" error). They can also create expressions, but if you somehow managed to change a 'h' into a '(', source code with this mutation would always fail unless you also inserted a ')' somewhere else to balance the expression. Source code has a lot of these "correspondances" where changing one thing also requires changing another thing somewhere else in the program if you want it to still compile (even though you don't generate an equivalent program -- that's not what we're trying to do here). Variable uses match up with variable declarations. Parantheses, braces, and brackets must all match up (and in the right order too!).

These "correspondences" remind me a lot of CRCs and checksums in other file formats, and they give the fuzzer problems for the exact same reason: without extra code it's hard to overcome having to change the test case simultaneously in two or more places, never mind making the exact change that will preserve the relationship between these two values. It's a game of combinatorics; the more things we have to change at once and the more possibilities we have for those changes, the harder it will be to get that exact combination when you're working completely at random. For checksums the answer is easy, and there are two very good strategies: either you disable the checksum verification in the code you're fuzzing, or you write a small wrapper to "fix up" your test case so that the checksum always matches the data it protects (of course, after mutating an input you may not really know where in the file the checksum will be located anymore, but that's a different problem).

For C and C++ source code it's not so obvious how to help the fuzzer overcome this. You can of course generate programs with a grammar (and some heuristics), which is what several C random code generators such as Csmith, ccg, and yarpgen do. This is in a sense on the completely opposite side of the spectrum when it comes to the levels of fuzzing. By generating programs that you know are completely valid (and correct, and free of undefined behaviour), you will breeze through the lexing, the parsing, and the type checking and target the code generation and optimization stages. This is what Regehr et al. did in "Taming compiler fuzzers", another very interesting read. (Their approach does not include instrumentation feedback, however, so it is more of a traditional black-box fuzzing approach than AFL, which is considered grey-box fuzzing.)

But if you use a C++ grammar to generate C++ programs, that will also exclude a lot of inputs that are not valid but nevertheless accepted by the compiler. This approach relies on our ability to express all programs that should be valid, but there may also be programs non-valid programs that crash the compiler. As an example, if our generator knows that you cannot add an integer to a function, or assign a value to a constant, then the code paths checking for those conditions in the compiler would never be exercised, despite the fact that those errors are more interesting than mere syntax errors. In other words, there is a whole range of "interesting" test cases which we will never be able to generate if we restrict ourselves only to those programs that are actually valid code.

Please note that I am not saying that one approach is better than the other! I believe we need all of them to successfully find bugs in all the areas of the compiler. By realising exactly what the limits of each method are, we can try to find other ways to fill the gaps.

Fuzzing with a loose grammar

So how can we fill the gap between the shallow syntax errors in the front end and the very deep of the code generation in the back end? There are several things we can do.

The main feature of my solution is to use a "loose" grammar. As opposed to a "strict" grammar which would follow the C/C++ specs to the dot, the loose grammar only really has one type of symbol, and all the production rules in the grammar create this type of symbol. As a simple example, a traditional C grammar will not allow you to put a statement where an expression is expected, whereas the loose grammar has no restrictions on that. It does, however, take care that your parantheses and braces match up. My grammar file therefore looks something like this (also see the full grammar if you're curious!):
"[void] [f] []([]) { [] }"
"[]; []"
"{ [] }"
"[0] + [0]"
Here, anything between "[" and "]" (call it a placeholder) can be substituted by any other line from the grammar file. An evolution of a program could therefore plausibly look like this:
void f () { }           // using the "[void] [f] []([]) { [] }" rule
void f () { ; }         // using the "[]; []" rule
void f () { 0 + 0; }    // using the "[0] + [0]" rule
void f ({ }) { 0 + 0; } // using the "{ [] }" rule
Wait, what happened at the end there? That's not valid C. No -- but it could still be an interesting thing to try to pass to the compiler. We did have a placeholder where the arguments usually go, and according to the grammar we can put any of the other rules in there. This does quickly generate a lot of nonsensical programs that stop the compiler completely dead in its track at the parsing stage. We do have another trick to help things along, though...

AFL doesn't care at all whether what we pass it is accepted by the compiler or not; it doesn't distinguish between success and failure, only between graceful termination and crashes. However, all we have to do is teach the fuzzer about the difference between exit codes 0 and 1; a 0 means the program passed all of gcc's checks and actually resulted in an object file. Then we can discard all the test cases that result in an error, and keep a corpus of test cases which compile successfully. It's really a no-brainer, but makes such a big difference in what the fuzzer can generate/find.

Enter prog-fuzz

prog-fuzz output

If it's not clear by now, I'm not using afl-fuzz to drive the main fuzzing process for the techniques above. I decided it was easier to write a fuzzer from scratch, just reusing the AFL instrumentation and some of the setup code to collect the coverage information. Without the fork server, it's surprisingly little code, on the order of 15-20 lines of code! (I do have support for the fork server on a different branch and it's not THAT much harder to implement, but I simply haven't gotten around to it yet; and it also wasn't really needed to find a lot of bugs).

You can find prog-fuzz on GitHub:

The code is not particularly clean, it's a hacked-up fuzzer that gets the job done. I'll want to clean that up at some point, document all the steps to build gcc with AFL instrumentation, etc., and merge a proper fork server. I just want the code to be out there in case somebody else wants to have a poke around.


From the end of February until some time in April I ran the fuzzer on and off and reported just over 100 distinct gcc bugs in total (32 of them fixed so far, by my count):
Now, there are a few things to be said about these bugs.

First, these bugs are mostly crashes: internal compiler errors ("ICEs"), assertion failures, and segfaults. Compiler crashes are usually not very high priority bugs -- especially when you are dealing with invalid programs. Most of the crashes would never occur "naturally" (i.e. as the result of a programmer trying to write some program). They represent very specific edge cases that may not be important at all in normal usage. So I am under no delusions about the relative importance of these bugs; a compiler crash is hardly a security risk.

However, I still think there is value in fuzzing compilers. Personally I find it very interesting that the same technique on rustc, the Rust compiler, only found 8 bugs in a couple of weeks of fuzzing, and not a single one of them was an actual segfault. I think it does say something about the nature of the code base, code quality, and the relative dangers of different programming languages, in case it was not clear already. In addition, compilers (and compiler writers) should have these fuzz testing techniques available to them, because it clearly finds bugs. Some of these bugs also point to underlying weaknesses or to general cases where something really could go wrong in a real program. In all, knowing about the bugs, even if they are relatively unimportant, will not hurt us.

Second, I should also note that I did have conversations with the gcc devs while fuzzing. I asked if I should open new bugs or attach more test cases to existing reports if I thought the area of the crash looked similar, even if it wasn't the exact same stack trace, etc., and they always told me to file a new report. In fact, I would like to praise the gcc developer community: I have never had such a pleasant bug-reporting experience. Within a day of reporting a new bug, somebody (usually Martin Liška or Marek Polacek) would run the test case and mark the bug as confirmed as well as bisect it using their huge library of precompiled gcc binaries to find the exact revision where the bug was introduced. This is something that I think all projects should strive to do -- the small feedback of having somebody acknowledge the bug is a huge encouragement to continue the process. Other gcc developers were also very active on IRC and answered almost all my questions, ranging from silly "Is this undefined behaviour?" to "Is this worth reporting?". In summary, I have nothing but praise for the gcc community.

I should also add that I played briefly with LLVM/clang, and prog-fuzz found 9 new bugs (2 of them fixed so far):
In addition to those, I also found a few other bugs that had already been reported by Sami Liedes back in 2014 which remain unfixed.

For rustc, I will write a more detailed blog post about how to set it up, as compiling rustc itself with AFL instrumentation is non-trivial and it makes more sense to detail those exact steps apart from this post.

What next?

I mentioned the efforts by Regehr et al. and Dmitry Babokin et al. on Csmith and yarpgen, respectively, as fuzzers that generate valid (UB-free) C/C++ programs for finding code generation bugs. I think there is work to be done here to find more code generation bugs; as far as I can tell, nobody has yet combined instrumentation feedback (grey-box fuzzing) with this kind of test case generator. Well, I tried to do it, but it requires a lot of effort to generate valid programs that are also interesting, and I stopped before finding any actual bugs. But I really think this is the future of compiler fuzzing, and I will outline the ideas that I think will have to go into it:
  • iterative program generation with instrumentation feedback: As opposed to generating one huge program and hoping that it will tickle some interesting path in the compiler you start with a valid program and you apply transformation rules that will gradually introduce complexity in the program. This allows you to use instrumentation feedback to tell exactly which transformations are valuable in terms of new code paths taken, and will give you a corpus of interesting test cases as well as speeding up the full generate/compile/test cycle.
  • have the program perform a calculation with a known result: Instead of compiling the same program with two different compilers or configurations and checking that the resulting binary outputs the same thing (which is what Regehr et al. did in "Taming compiler fuzzers"), we can test one compiler/configuration at a time and simply check that the output matches the known solution.
I don't have the time to continue working on this at the moment, but please do let me know if you would like to give it a try and I'll do my best to answer any questions about the code or the approach.


Thanks to John Regehr, Martin Liška, Marek Polacek, Jakub Jelinek, Richard Guenther, David Malcolm, Segher Boessenkool, and Martin Jambor for responding to my questions and bug reports!

Thanks to my employer, Oracle, for allowing me to do part of this fuzzing effort using company time and resources.

Friday, March 17, 2017

Fuzzing the OpenSSH daemon using AFL

(EDIT 2017-03-25: All my patches to make OpenSSH more amenable to fuzzing with AFL are available at This also includes improvements to the patches found in this post.)

American Fuzzy Lop is a great tool. It does take a little bit of extra setup and tweaking if you want to go into advanced usage, but mostly it just works out of the box.

In this post, I’ll detail some of the steps you need to get started with fuzzing the OpenSSH daemon (sshd) and show you some tricks that will help get results more quickly.

The AFL home page already displays 4 OpenSSH bugs in its trophy case; these were found by Hanno Böck who used an approach similar to that outlined by Jonathan Foote on how to fuzz servers with AFL.

I take a slightly different approach, which I think is simpler: instead of intercepting system calls to fake network activity, we just run the daemon in “inetd mode”. The inet daemon is not used very much anymore on modern Linux distributions, but the short story is that it sets up the listening network socket for you and launches a new process to handle each new incoming connection. inetd then passes the network socket to the target program as stdin/stdout. Thus, when sshd is started in inet mode, it communicates with a single client over stdin/stdout, which is exactly what we need for AFL.

Configuring and building AFL

If you are just starting out with AFL, you can probably just type make in the top-level AFL directory to compile everything, and it will just work. However, I want to use some more advanced features, in particular I would like to compile sshd using LLVM-based instrumentation (which is slightly faster than the “assembly transformation by sed” that AFL uses by default). Using LLVM also allows us to move the target program’s “fork point” from just before entering main() to an arbitrary location (known as “deferred forkserver mode” in AFL-speak); this means that we can skip some of the setup operations in OpenSSH, most notably reading/parsing configs and loading private keys.

Most of the steps for using LLVM mode are detailed in AFL’s llvm_mode/README.llvm. On Ubuntu, you should install the clang and llvm packages, then run make -C llvm_mode from the top-level AFL directory, and that’s pretty much it. You should get a binary called afl-clang-fast, which is what we’re going to use to compile sshd.

Configuring and building OpenSSH

I’m on Linux so I use the “portable” flavour of OpenSSH, which conveniently also uses git (as opposed to the OpenBSD version which still uses CVS – WTF!?). Go ahead and clone it from git://

Run autoreconf to generate the configure script. This is how I run the config script:

./configure \
    CC="$PWD/afl-2.39b/afl-clang-fast" \
    CFLAGS="-g -O3" \
    --prefix=$PWD/install \
    --with-privsep-path=$PWD/var-empty \
    --with-sandbox=no \

You obviously need to pass the right path to afl-clang-fast. I’ve also created two directories in the current (top-level OpenSSH directory), install and var-empty. This is so that we can run make install without being root (although var-empty needs to have mode 700 and be owned by root) and without risking clobbering any system files (which would be extremely bad, as we’re later going to disable authentication and encryption!). We really do need to run make install, even though we’re not going to be running sshd from the installation directory. This is because sshd needs some private keys to run, and that is where it will look for them.

(EDIT 2017-03-25: Passing --without-pie to configure may help make the resulting binaries easier to debug since instruction pointers will not be randomised.)

If everything goes well, running make should display the AFL banner as OpenSSH is compiled.

You may need some extra libraries (zlib1g-dev and libssl-dev on Ubuntu) for the build to succeeed.

Run make install to install sshd into the install/ subdirectory (and again, please don’t run this as root).

We will have to rebuild OpenSSH a few times as we apply some patches to it, but this gives you the basic ingredients for a build. One particular annoying thing I’ve noticed is that OpenSSH doesn’t always detect source changes when you run make (and so your changes may not actually make it into the binary). For this reason I just adopted the habit of always running make clean before recompiling anything. Just a heads up!

Running sshd

Before we can actually run sshd under AFL, we need to figure out exactly how to invoke it with all the right flags and options. This is what I use:

./sshd -d -e -p 2200 -r -f sshd_config -i

This is what it means:

“Debug mode”. Keeps the daemon from forking, makes it accept only a single connection, and keeps it from putting itself in the background. All useful things that we need.
This makes it log to stderr instead of syslog; this first of all prevents clobbering your system log with debug messages from our fuzzing instance, and also gives a small speed boost.
-p 2200
The TCP port to listen to. This is not really used in inetd mode (-i), but is useful later on when we want to generate our first input testcase.
This option is not documented (or not in my man page, at least), but you can find it in the source code, which should hopefully also explain what it does: preventing sshd from re-execing itself. I think this is a security feature, since it allows the process to isolate itself from the original environment. In our case, it complicates and slows things down unnecessarily, so we disable it by passing -r.
-f sshd_config
Use the sshd_config from the current directory. This just allows us to customise the config later without having to reinstall it or be unsure about which location it’s really loaded from.
“Inetd mode”. As already mentioned, inetd mode will make the server process a single connection on stdin/stdout, which is a perfect fit for AFL (as it will write testcases on the program’s stdin by default).

Go ahead and run it. It should hopefully print something like this:

$ ./sshd -d -e -p 2200 -r -f sshd_config -i
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2g  1 Mar 2016
debug1: private host key #0: ssh-rsa SHA256:f9xyp3dC+9jCajEBOdhjVRAhxp4RU0amQoj0QJAI9J0
debug1: private host key #1: ssh-dss SHA256:sGRlJclqfI2z63JzwjNlHtCmT4D1WkfPmW3Zdof7SGw
debug1: private host key #2: ecdsa-sha2-nistp256 SHA256:02NDjij34MUhDnifUDVESUdJ14jbzkusoerBq1ghS0s
debug1: private host key #3: ssh-ed25519 SHA256:RsHu96ANXZ+Rk3KL8VUu1DBzxwfZAPF9AxhVANkekNE
debug1: setgroups() failed: Operation not permitted
debug1: inetd sockets after dupping: 3, 4
Connection from UNKNOWN port 65535 on UNKNOWN port 65535

If you type some garbage and press enter, it will probably give you “Protocol mismatch.” and exit. This is good!

Detecting crashes/disabling privilege separation mode

One of the first obstacles I ran into was the fact that I saw sshd crashing in my system logs, but AFL wasn’t detecting them as crashes:

[726976.333225] sshd[29691]: segfault at 0 ip 000055d3f3139890 sp 00007fff21faa268 error 4 in sshd[55d3f30ca000+bf000]
[726984.822798] sshd[29702]: segfault at 4 ip 00007f503b4f3435 sp 00007fff84c05248 error 4 in[7f503b3a6000+1bf000]

The problem is that OpenSSH comes with a “privilege separation mode” that forks a child process and runs most of the code inside the child. If the child segfaults, the parent still exits normally, so it masks the segfault from AFL (which only observes the parent process directly).

In version 7.4 and earlier, privilege separation mode can easily be disabled by adding “UsePrivilegeSeparation no” to sshd_config or passing -o UsePrivilegeSeaparation=no on the command line.

Unfortunately it looks like the OpenSSH developers are removing the ability to disable privilege separation mode in version 7.5 and onwards. This is not a big deal, as OpenSSH maintainer Damien Miller writes on Twitter: “the infrastructure will be there for a while and it’s a 1-line change to turn privsep off”. So you may have to dive into sshd.c to disable it in the future.

(EDIT 2017-03-25: I’ve pushed the source tweak for disabling privilege separation for 7.5 and newer to my OpenSSH GitHub repo. This also obsoletes the need for a config change.)

Reducing randomness

OpenSSH uses random nonces during the handshake to prevent “replay attacks” where you would record somebody’s (encrypted) SSH session and then you feed the same data to the server again to authenticate again. When random numbers are used, the server and the client will calculate a new set of keys and thus thwart the replay attack.

In our case, we explicitly want to be able to replay traffic and obtain the same result two times in a row; otherwise, the fuzzer would not be able to gain any useful data from a single connection attempt (as the testcase it found would not be usable for further fuzzing).

There’s also the possibility that randomness introduces variabilities in other code paths not related to the handshake, but I don’t really know. In any case, we can easily disable the random number generator. On my system, with the configure line above, all or most random numbers seem to come from arc4random_buf() in openbsd-compat/arc4random.c, so to make random numbers very predictable, we can apply this patch:

diff --git openbsd-compat/arc4random.c openbsd-compat/arc4random.c
--- openbsd-compat/arc4random.c
+++ openbsd-compat/arc4random.c
@@ -242,7 +242,7 @@ void
 arc4random_buf(void *buf, size_t n)
-       _rs_random_buf(buf, n);
+       memset(buf, 0, n);
 # endif /* !HAVE_ARC4RANDOM_BUF */

One way to test whether this patch is effective is to try to packet-capture an SSH session and see if it can be replayed successfully. We’re going to have to do that later anyway in order to create our first input testcase, so skip below if you want to see how that’s done. In any case, AFL would also tell us using its “stability” indicator if something was really off with regards to random numbers (>95% stability is generally good, <90% would indicate that something is off and needs to be fixed).

Increasing coverage

Disabling message CRCs

When fuzzing, we really want to disable as many checksums as we can, as Damien Miller also wrote on twitter: “fuzzing usually wants other code changes too, like ignoring MAC/signature failures to make more stuff reachable”. This may sound a little strange at first, but makes perfect sense: In a real attack scenario, we can always1 fix up CRCs and other checksums to match what the program expects.

If we don’t disable checksums (and we don’t try to fix them up), then the fuzzer will make very little progress. A single bit flip in a checksum-protected area will just fail the checksum test and never allow the fuzzer to proceed.

We could of course also fix the checksum up before passing the data to the SSH server, but this is slow and complicated. It’s better to disable the checksum test in the server and then try to fix it up if we do happen to find a testcase which can crash the modified server.

The first thing we can disable is the packet CRC test:

diff --git a/packet.c b/packet.c
--- a/packet.c
+++ b/packet.c
@@ -1635,7 +1635,7 @@ ssh_packet_read_poll1(struct ssh *ssh, u_char *typep)
        cp = sshbuf_ptr(state->incoming_packet) + len - 4;
        stored_checksum = PEEK_U32(cp);
-       if (checksum != stored_checksum) {
+       if (0 && checksum != stored_checksum) {
                error("Corrupted check bytes on input");
                if ((r = sshpkt_disconnect(ssh, "connection corrupted")) != 0 ||
                    (r = ssh_packet_write_wait(ssh)) != 0)

As far as I understand, this is a simple (non-cryptographic) integrity check meant just as a sanity check against bit flips or incorrectly encoded data.

Disabling MACs

We can also disable Message Authentication Codes (MACs), which are the cryptographic equivalent of checksums, but which also guarantees that the message came from the expected sender:

diff --git mac.c mac.c
index 5ba7fae1..ced66fe6 100644
--- mac.c
+++ mac.c
@@ -229,8 +229,10 @@ mac_check(struct sshmac *mac, u_int32_t seqno,
        if ((r = mac_compute(mac, seqno, data, dlen,
            ourmac, sizeof(ourmac))) != 0)
                return r;
+#if 0
        if (timingsafe_bcmp(ourmac, theirmac, mac->mac_len) != 0)
                return SSH_ERR_MAC_INVALID;
        return 0;

We do have to be very careful when making these changes. We want to try to preserve the original behaviour of the program as much as we can, in the sense that we have to be very careful not to introduce bugs of our own. For example, we have to be very sure that we don’t accidentally skip the test which checks that the packet is large enough to contain a checksum in the first place. If we had accidentally skipped that, it is possible that the program being fuzzed would try to access memory beyond the end of the buffer, which would be a bug which is not present in the original program.

This is also a good reason to never submit crashing testcases to the developers of a program unless you can show that they also crash a completely unmodified program.

Disabling encryption

The last thing we can do, unless you wish to only fuzz the unencrypted initial protocol handshake and key exchange, is to disable encryption altogether.

The reason for doing this is exactly the same as the reason for disabling checksums and MACs, namely that the fuzzer would have no hope of being able to fuzz the protocol itself if it had to work with the encrypted data (since touching the encrypted data with overwhelming probability will just cause it to decrypt to random and utter garbage).

Making the change is surprisingly simple, as OpenSSH already comes with a psuedo-cipher that just passes data through without actually encrypting/decrypting it. All we have to do is to make it available as a cipher that can be negotiated between the client and the server. We can use this patch:

diff --git a/cipher.c b/cipher.c
index 2def333..64cdadf 100644
--- a/cipher.c
+++ b/cipher.c
@@ -95,7 +95,7 @@ static const struct sshcipher ciphers[] = {
 # endif /* OPENSSL_NO_BF */
 #endif /* WITH_SSH1 */
-       { "none",       SSH_CIPHER_NONE, 8, 0, 0, 0, 0, 0, EVP_enc_null },
+       { "none",       SSH_CIPHER_SSH2, 8, 0, 0, 0, 0, 0, EVP_enc_null },
        { "3des-cbc",   SSH_CIPHER_SSH2, 8, 24, 0, 0, 0, 1, EVP_des_ede3_cbc },
 # ifndef OPENSSL_NO_BF
        { "blowfish-cbc",

To use this cipher by default, just put “Ciphers none” in your sshd_config. Of course, the client doesn’t support it out of the box either, so if you make any test connections, you have to have to use the ssh binary compiled with the patched cipher.c above as well.

You may have to pass pass -o Ciphers=none from the client as well if it prefers to use a different cipher by default. Use strace or wireshark to verify that communication beyond the initial protocol setup happens in plaintext.

Making it fast

afl-clang-fast/LLVM “deferred forkserver mode”

I mentioned above that using afl-clang-fast (i.e. AFL’s LLVM deferred forkserver mode) allows us to move the “fork point” to skip some of the sshd initialisation steps which are the same for every single testcase we can throw at it.

To make a long story short, we need to put a call to __AFL_INIT() at the right spot in the program, separating the stuff that doesn’t depend on a specific input to happen before it and the testcase-specific handling to happen after it. I’ve used this patch:

diff --git a/sshd.c b/sshd.c
--- a/sshd.c
+++ b/sshd.c
@@ -1840,6 +1840,8 @@ main(int ac, char **av)
        /* ignore SIGPIPE */
        signal(SIGPIPE, SIG_IGN);
+       __AFL_INIT();
        /* Get a connection, either from inetd or a listening TCP socket */
        if (inetd_flag) {
                server_accept_inetd(&sock_in, &sock_out);

AFL should be able to automatically detect that you no longer wish to start the program from the top of main() every time. However, with only the patch above, I got this scary-looking error message:

Hmm, looks like the target binary terminated before we could complete a
handshake with the injected code. Perhaps there is a horrible bug in the
fuzzer. Poke <> for troubleshooting tips.

So there is obviously some AFL magic code here to make the fuzzer and the fuzzed program communicate. After poking around in afl-fuzz.c, I found FORKSRV_FD, which is a file descriptor pointing to a pipe used for this purpose. The value is 198 (and the other end of the pipe is 199).

To try to figure out what was going wrong, I ran afl-fuzz under strace, and it showed that file descriptors 198 and 199 were getting closed by sshd. With some more digging, I found the call to closefrom(), which is a function that closes all inherited (and presumed unused) file descriptors starting at a given number. Again, the reason for this code to exist in the first place is probably in order to reduce the attack surface in case an attacker is able to gain control the process. Anyway, the solution is to protect these special file descriptors using a patch like this:

diff --git openbsd-compat/bsd-closefrom.c openbsd-compat/bsd-closefrom.c
--- openbsd-compat/bsd-closefrom.c
+++ openbsd-compat/bsd-closefrom.c
@@ -81,7 +81,7 @@ closefrom(int lowfd)
        while ((dent = readdir(dirp)) != NULL) {
            fd = strtol(dent->d_name, &endp, 10);
            if (dent->d_name != endp && *endp == '\0' &&
-               fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
+               fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp) && fd != 198 && fd != 199)
                (void) close((int) fd);
        (void) closedir(dirp);

Skipping expensive DH/curve and key derivation operations

At this point, I still wasn’t happy with the execution speed: Some testcases were as low as 10 execs/second, which is really slow.

I tried compiling sshd with -pg (for gprof) to try to figure out where the time was going, but there are many obstacles to getting this to work properly: First of all, sshd exits using _exit(255) through its cleanup_exit() function. This is not a “normal” exit and so the gmon.out file (containing the profile data) is not written out at all. Applying a source patch to fix that, sshd will give you a “Permission denied” error as it tries to open the file for writing. The problem now is that sshd does a chdir("/"), meaning that it’s trying to write the profile data in a directory where it doesn’t have access. The solution is again simple, just add another chdir() to a writable location before calling exit(). Even with this in place, the profile came out completely empty for me. Maybe it’s another one of those privilege separation things. In any case, I decided to just use valgrind and its “cachegrind” tool to obtain the profile. It’s much easier and gives me the data I need without hassles of reconfiguring, patching, and recompiling.

The profile showed one very specific hot spot, coming from two different locations: elliptic curve point multiplication.

I don’t really know too much about elliptic curve cryptography, but apparently it’s pretty expensive to calculate. However, we don’t really need to deal with it; we can assume that the key exchange between the server and the client succeeds. Similar to how we increased coverage above by skipping message CRC checks and replacing the encryption with a dummy cipher, we can simply skip the expensive operations and assume they always succeed. This is a trade-off; we are no longer fuzzing all the verification steps, but allows the fuzzer to concentrate more on the protocol parsing itself. I applied this patch:

diff --git kexc25519.c kexc25519.c
--- kexc25519.c
+++ kexc25519.c
@@ -68,10 +68,13 @@ kexc25519_shared_key(const u_char key[CURVE25519_SIZE],
        /* Check for all-zero public key */
        explicit_bzero(shared_key, CURVE25519_SIZE);
+#if 0
        if (timingsafe_bcmp(pub, shared_key, CURVE25519_SIZE) == 0)
                return SSH_ERR_KEY_INVALID_EC_VALUE;
        crypto_scalarmult_curve25519(shared_key, key, pub);
        dump_digest("shared secret", shared_key, CURVE25519_SIZE);
diff --git kexc25519s.c kexc25519s.c
--- kexc25519s.c
+++ kexc25519s.c
@@ -67,7 +67,12 @@ input_kex_c25519_init(int type, u_int32_t seq, void *ctxt)
        int r;
        /* generate private key */
+#if 0
        kexc25519_keygen(server_key, server_pubkey);
+       explicit_bzero(server_key, sizeof(server_key));
+       explicit_bzero(server_pubkey, sizeof(server_pubkey));
        dump_digest("server private key:", server_key, sizeof(server_key));

With this patch in place, execs/second went to ~2,000 per core, which is a much better speed to be fuzzing at.

(EDIT 2017-03-25: As it turns out, this patch is not very good, because it causes a later key validity check to fail (dh_pub_is_valid() in input_kex_dh_init()). We could perhaps make dh_pub_is_valid() always return true, but then there is a question of whether this in turn makes something else fail down the line.)

Creating the first input testcases

Before we can start fuzzing for real, we have to create the first few input testcases. Actually, a single one is enough to get started, but if you know how to create different ones taking different code paths in the server, that may help jumpstart the fuzzing process. A few possibilities I can think of:

  • ssh -A for ssh agent forwarding
  • ssh -R to enable arbitrary port forwarding
  • ssh -Y to enable X11 forwarding
  • scp to transfer a file
  • password vs. pubkey authentication

The way I created the first testcase was to record the traffic from the client to the server using strace. Start the server without -i:

./sshd -d -e -p 2200 -r -f sshd_config
Server listening on :: port 2200.

Then start a client (using the ssh binary you’ve just compiled) under strace:

$ strace -e trace=write -o strace.log -f -s 8192 ./ssh -c none -p 2200 localhost

This should hopefully log you in (if not, you may have to fiddle with users, keys, and passwords until you succeed in logging in to the server you just started).

The first few lines of the strace log should read something like this:

2945  write(3, "SSH-2.0-OpenSSH_7.4\r\n", 21) = 21
2945  write(3, "\0\0\4|\5\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0010curve25519-sha256,,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\0\0\1\",,,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,,,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\0\0\0\4none\0\0\0\4none\0\0\0\,,,,,,,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\,,,,,,,hmac-sha2-256,hmac-sha2-512,hmac-sha1\0\0\0\32none,,zlib\0\0\0\32none,,zlib\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1152) = 1152

We see here that the client is communicating over file descriptor 3. You will have to delete all the writes happening on other file descriptors. Then take the strings and paste them into a Python script, something like:

import sys
for x in [

When you run this, it will print a byte-perfect copy of everything that the client sent to stdout. Just redirect this to a file. That file will be your first input testcase.

You can do a test run (without AFL) by passing the same data to the server again (this time using -i):

./sshd -d -e -p 2200 -r -f sshd_config -i < testcase 2>&1 > /dev/null

Hopefully it will show that your testcase replay was able to log in successfully.

Before starting the fuzzer you can also double check that the instrumentation works as expected using afl-analyze:

~/afl-2.39b/afl-analyze -i testcase -- ./sshd -d -e -p 2200 -r -f sshd_config -i

This may take a few seconds to run, but should eventually show you a map of the file and what it thinks each byte means. If there is too much red, that’s an indication you were not able to disable checksumming/encryption properly (maybe you have to make clean and rebuild?). You may also see other errors, including that AFL didn’t detect any instrumentation (did you compile sshd with afl-clang-fast?). This is general AFL troubleshooting territory, so I’d recommend checking out the AFL documentation.

Creating an OpenSSH dictionary

I created an AFL “dictionary” for OpenSSH, which is basically just a list of strings with special meaning to the program being fuzzed. I just used a few of the strings found by running ssh -Q cipher, etc. to allow the fuzzer to use these strings without having to discover them all at once (which is pretty unlikely to happen by chance).


Just save it as openssh.dict; to use it, we will pass the filename to the -x option of afl-fuzz.

Running AFL

Whew, it’s finally time to start the fuzzing!

First, create two directories, input and output. Place your initial testcase in the input directory. Then, for the output directory, we’re going to use a little hack that I’ve found to speed up the fuzzing process and keep AFL from hitting the disk all the time: mount a tmpfs RAM-disk on output with:

sudo mount -t tmpfs none output/

Of course, if you shut down (or crash) your machine without copying the data out of this directory, it will be gone, so you should make a backup of it every once in a while. I personally just use a bash one-liner that just tars it up to the real on-disk filesystem every few hours.

To start a single fuzzer, you can use something like:

~/afl-2.39b/afl-fuzz -x sshd.dict -i input -o output -M 0 -- ./sshd -d -e -p 2100 -r -f sshd_config -i

Again, see the AFL docs on how to do parallel fuzzing. I have a simple bash script that just launches a bunch of the line above (with different values to the -M or -S option) in different screen windows.

Hopefully you should see something like this:

                         american fuzzy lop 2.39b (31)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│        run time : 0 days, 13 hrs, 22 min, 40 sec     │  cycles done : 152    │
│   last new path : 0 days, 0 hrs, 14 min, 57 sec      │  total paths : 1577   │
│ last uniq crash : none seen yet                      │ uniq crashes : 0      │
│  last uniq hang : none seen yet                      │   uniq hangs : 0      │
├─ cycle progress ────────────────────┬─ map coverage ─┴───────────────────────┤
│  now processing : 717* (45.47%)     │    map density : 3.98% / 6.67%         │
│ paths timed out : 0 (0.00%)         │ count coverage : 3.80 bits/tuple       │
├─ stage progress ────────────────────┼─ findings in depth ────────────────────┤
│  now trying : splice 4              │ favored paths : 117 (7.42%)            │
│ stage execs : 74/128 (57.81%)       │  new edges on : 178 (11.29%)           │
│ total execs : 74.3M                 │ total crashes : 0 (0 unique)           │
│  exec speed : 1888/sec              │   total hangs : 0 (0 unique)           │
├─ fuzzing strategy yields ───────────┴───────────────┬─ path geometry ────────┤
│   bit flips : n/a, n/a, n/a                         │    levels : 7          │
│  byte flips : n/a, n/a, n/a                         │   pending : 2          │
│ arithmetics : n/a, n/a, n/a                         │  pend fav : 0          │
│  known ints : n/a, n/a, n/a                         │ own finds : 59         │
│  dictionary : n/a, n/a, n/a                         │  imported : 245        │
│       havoc : 39/25.3M, 20/47.2M                    │ stability : 97.55%     │
│        trim : 2.81%/1.84M, n/a                      ├────────────────────────┘
└─────────────────────────────────────────────────────┘          [cpu015: 62%]

Crashes found

In about a day of fuzzing (even before disabling encryption), I found a couple of NULL pointer dereferences during key exchange. Fortunately, these crashes are not harmful in practice because of OpenSSH’s privilege separation code, so at most we’re crashing an unprivileged child process and leaving a scary segfault message in the system log. The fix made it in CVS here:


Apart from the two harmless NULL pointer dereferences I found, I haven’t been able to find anything else yet, which seems to indicate that OpenSSH is fairly robust (which is good!).

I hope some of the techniques and patches I used here will help more people get into fuzzing OpenSSH.

Other things to do from here include doing some fuzzing rounds using ASAN or running the corpus through valgrind, although it’s probably easier to do this once you already have a good sized corpus found without them, as both ASAN and valgrind have a performance penalty.

It could also be useful to look into ./configure options to configure the build more like a typical distro build; I haven’t done anything here except to get it to build in a minimal environment.

Please let me know in the comments if you have other ideas on how to expand coverage or make fuzzing OpenSSH faster!


I’d like to thank Oracle (my employer) for providing the hardware on which to run lots of AFL instances in parallel :-)

  1. Well, we can’t fix up signatures we don’t have the private key for, so in those cases we’ll just assume the attacker does have the private key. You can still do damage e.g. in an otherwise locked down environment; as an example, GitHub uses the SSH protocol to allow pushing to your repositories. These SSH accounts are heavily locked down, as you can’t run arbitrary commands on them. In this case, however, we do have have the secret key used to authenticate and sign messages.

Wednesday, August 31, 2016

Debugging a kernel crash found by syzkaller

Having done quite a bit of kernel fuzzing and debugging lately I’ve decided to take one of the very latest crashes and write up the whole process from start to finish as I work through it. As you will see, I'm not very familiar with the site of this particular crash, the block layer. Being familiar with some existing kernel code helps, of course, since you recognise a lot of code patterns, but the kernel is so large that nobody can be familiar with everything and the crashes found by trinity and syzkaller can show up almost anywhere.

So I got this with syzkaller after running it for a few hours:
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff880110762cc0 task.stack: ffff880102290000
RIP: 0010:[<ffffffff81f04b7a>]  [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880102297cd0  EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90000bb4000
RDX: 0000000000000097 RSI: 0000000000000000 RDI: 00000000000004b8
RBP: ffff880102297cd8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88011a010a90
R13: ffff88011a594568 R14: ffff88011a010890 R15: 7fffffffffffffff
FS:  00007f2445174700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200047c8 CR3: 0000000107eb5000 CR4: 00000000000006f0
DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
 1ffff10020452f9e ffff880102297db8 ffffffff81508daa 0000000000000000
 0000000041b58ab3 ffffffff844e89e1 ffffffff81508b30 ffffed0020452001
 7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff
Call Trace:
 [<ffffffff81508daa>] __filemap_fdatawrite_range+0x27a/0x2e0
 [<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
 [<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
 [<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
 [<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
 [<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
 [<ffffffff817341b4>] iterate_bdevs+0x194/0x210
 [<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
 [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
 [<ffffffff817196be>] sys_sync+0xce/0x160
 [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
 [<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
 [<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
 [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
 [<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00 
RIP  [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
 RSP <ffff880102297cd0>
The very first thing to do is to look up the code in the backtrace:
$ addr2line -e vmlinux -i ffffffff81f04b7a ffffffff81508daa ffffffff81508e36 ffffffff817191b0 ffffffff817341b4 ffffffff817196be
The actual site of the crash is this:
 842 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 843 {
 844         return bdev->bd_disk->queue;    /* this is never NULL */
 845 }
Because we’re using KASAN we can’t look at CR2 to find the bad pointer because KASAN triggers before the page fault (or to be completely honest, KASAN tries to access the shadow memory for the bad pointer, which is itself a bad pointer and causes the GPF above).

Let’s look at the “Code:” line to try to find the exact dereference causing the error:
$ echo 'Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00 ' | scripts/decodecode 
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
All code
   0:   89 fa                   mov    %edi,%edx
   2:   48 c1 ea 03             shr    $0x3,%rdx
   6:   80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
   a:   75 35                   jne    0x41
   c:   48 8b 9b e0 00 00 00    mov    0xe0(%rbx),%rbx
  13:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  1a:   fc ff df 
  1d:   48 8d bb b8 04 00 00    lea    0x4b8(%rbx),%rdi
  24:   48 89 fa                mov    %rdi,%rdx
  27:   48 c1 ea 03             shr    $0x3,%rdx
  2b:*  80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
  2f:   75 17                   jne    0x48
  31:   48 8b 83 b8 04 00 00    mov    0x4b8(%rbx),%rax
  38:   5b                      pop    %rbx
  39:   5d                      pop    %rbp
  3a:   48 05 10 02 00 00       add    $0x210,%rax
I’m using CONFIG_KASAN_INLINE=y so most of the code above is actually generated by KASAN which makes things a bit harder to read. The movabs with a weird 0xdffff… address is how it generates the address for the shadow memory bytemap and the cmpb that crashed is where it tries to read the value of the shadow byte.

The address is %rdx + %rax and we know that %rax is 0xdffffc0000000000. Let’s look at %rdx in the crash above… RDX: 0000000000000097; yup, that’s a NULL pointer dereference all right.

But the line in question has two pointer dereferences, bdev->bd_disk and bd_disk->queue, and which one is the crash? The lea 0x4b8(%rbx), %rdi is what gives it away, since that gives us the offset into the structure that is being dereferenced (also, NOT coincidentally, %rbx is 0). Let’s use pahole:
$ pahole -C 'block_device' vmlinux
struct block_device {
        dev_t                      bd_dev;               /*     0     4 */
        int                        bd_openers;           /*     4     4 */
        struct inode *             bd_inode;             /*     8     8 */
        struct super_block *       bd_super;             /*    16     8 */
        struct mutex               bd_mutex;             /*    24   128 */
        /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
        void *                     bd_claiming;          /*   152     8 */
        void *                     bd_holder;            /*   160     8 */
        int                        bd_holders;           /*   168     4 */
        bool                       bd_write_holder;      /*   172     1 */

        /* XXX 3 bytes hole, try to pack */

        struct list_head           bd_holder_disks;      /*   176    16 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        struct block_device *      bd_contains;          /*   192     8 */
        unsigned int               bd_block_size;        /*   200     4 */

        /* XXX 4 bytes hole, try to pack */

        struct hd_struct *         bd_part;              /*   208     8 */
        unsigned int               bd_part_count;        /*   216     4 */
        int                        bd_invalidated;       /*   220     4 */
        struct gendisk *           bd_disk;              /*   224     8 */
        struct request_queue *     bd_queue;             /*   232     8 */
        struct list_head           bd_list;              /*   240    16 */
        /* --- cacheline 4 boundary (256 bytes) --- */
        long unsigned int          bd_private;           /*   256     8 */
        int                        bd_fsfreeze_count;    /*   264     4 */

        /* XXX 4 bytes hole, try to pack */

        struct mutex               bd_fsfreeze_mutex;    /*   272   128 */
        /* --- cacheline 6 boundary (384 bytes) was 16 bytes ago --- */

        /* size: 400, cachelines: 7, members: 21 */
        /* sum members: 389, holes: 3, sum holes: 11 */
        /* last cacheline: 16 bytes */
0x4b8 is 1208 in decimal, which is way bigger than this struct. Let’s try the other one:
$ pahole -C 'gendisk' vmlinux
struct gendisk {
        int                        major;                /*     0     4 */
        int                        first_minor;          /*     4     4 */
        int                        minors;               /*     8     4 */
        char                       disk_name[32];        /*    12    32 */

        /* XXX 4 bytes hole, try to pack */

        char *                     (*devnode)(struct gendisk *, umode_t *); /*    48     8 */
        unsigned int               events;               /*    56     4 */
        unsigned int               async_events;         /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct disk_part_tbl *     part_tbl;             /*    64     8 */
        struct hd_struct           part0;                /*    72  1128 */
        /* --- cacheline 18 boundary (1152 bytes) was 48 bytes ago --- */
        const struct block_device_operations  * fops;    /*  1200     8 */
        struct request_queue *     queue;                /*  1208     8 */
        /* --- cacheline 19 boundary (1216 bytes) --- */
        void *                     private_data;         /*  1216     8 */
        int                        flags;                /*  1224     4 */

        /* XXX 4 bytes hole, try to pack */

        struct kobject *           slave_dir;            /*  1232     8 */
        struct timer_rand_state *  random;               /*  1240     8 */
        atomic_t                   sync_io;              /*  1248     4 */

        /* XXX 4 bytes hole, try to pack */

        struct disk_events *       ev;                   /*  1256     8 */
        struct kobject             integrity_kobj;       /*  1264    64 */
        /* --- cacheline 20 boundary (1280 bytes) was 48 bytes ago --- */
        int                        node_id;              /*  1328     4 */

        /* XXX 4 bytes hole, try to pack */

        struct badblocks *         bb;                   /*  1336     8 */
        /* --- cacheline 21 boundary (1344 bytes) --- */

        /* size: 1344, cachelines: 21, members: 20 */
        /* sum members: 1328, holes: 4, sum holes: 16 */
1208 is ->queue, so that fits well with what we’re seeing; therefore, bdev->bd_disk must be NULL.

At this point I would go up the stack of function to see if anything sticks out – although unlikely, it’s possible that it’s an “easy” bug where you can tell just from looking at the code in a single function that it sets the pointer to NULL just before calling the function that crashed or something like that.

Probably the most interesting function in the stack trace (at a glance) is iterate_bdevs() in fs/block_dev.c:
1880 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
1881 {
1882         struct inode *inode, *old_inode = NULL;
1884         spin_lock(&blockdev_superblock->s_inode_list_lock);
1885         list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
1886                 struct address_space *mapping = inode->i_mapping;
1888                 spin_lock(&inode->i_lock);
1889                 if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
1890                     mapping->nrpages == 0) {
1891                         spin_unlock(&inode->i_lock);
1892                         continue;
1893                 }
1894                 __iget(inode);
1895                 spin_unlock(&inode->i_lock);
1896                 spin_unlock(&blockdev_superblock->s_inode_list_lock);
1897                 /*
1898                  * We hold a reference to 'inode' so it couldn't have been
1899                  * removed from s_inodes list while we dropped the
1900                  * s_inode_list_lock  We cannot iput the inode now as we can
1901                  * be holding the last reference and we cannot iput it under
1902                  * s_inode_list_lock. So we keep the reference and iput it
1903                  * later.
1904                  */
1905                 iput(old_inode);
1906                 old_inode = inode;
1908                 func(I_BDEV(inode), arg);
1910                 spin_lock(&blockdev_superblock->s_inode_list_lock);
1911         }
1912         spin_unlock(&blockdev_superblock->s_inode_list_lock);
1913         iput(old_inode);
1914 }
I can’t quite put my finger on it, but it looks interesting because it has a bunch of locking in it and it seems to be what’s getting the block device from a given inode. I ran git blame on the file/function in question since that might point to a recent change there, but the most interesting thing is commit 74278da9f7 changing some locking logic. Maybe relevant, maybe not, but let’s keep it in mind.

Remember that bd->bd_disk is NULL. Let’s try to check if ->bd_disk is assigned NULL anywhere:
$ git grep -n '\->bd_disk.*=.*NULL'
block/blk-flush.c:470:  if (bdev->bd_disk == NULL)
drivers/block/xen-blkback/xenbus.c:466: if (vbd->bdev->bd_disk == NULL) {
fs/block_dev.c:1295:                                    bdev->bd_disk = NULL;
fs/block_dev.c:1375:    bdev->bd_disk = NULL;
fs/block_dev.c:1615:            bdev->bd_disk = NULL;
kernel/trace/blktrace.c:1624:   if (bdev->bd_disk == NULL)
This by no means necessarily includes the code that set ->bd_disk to NULL in our case (since there could be code that looks like x = NULL; bdev->bd_disk = x; which wouldn’t be found with the regex above), but this is a good start and I’ll look at the functions above just to see if it might be relevant. Actually, for this I’ll just add -W to the git grep above to quickly look at the functions.

The first two and last hits are comparisons so they are uninteresting. The third and fourth ones are part of error paths in __blkdev_get(). That might be interesting if the process that crashed somehow managed to get a reference to the block device just after the NULL assignment (if so, that would probably be a locking bug in either __blkdev_get() or one of the functions in the crash stack trace – OR it might be a bug where the struct block_device * is made visible/reachable before it’s ready). The fifth one is in __blkdev_put(). I’m going to read over __blkdev_get() and __blkdev_put() to figure out what they do and if there’s maybe something going on in either of those.

In all these cases, it seems to me that &bdev->bd_mutex is locked; that’s a good sign. That’s also maybe an indication that we should be taking &bdev->bd_mutex in the other code path, so let’s check if we are. There’s nothing that I can see in any of the functions from inode_to_bdi() and up. Although inode_to_bdi() itself looks interesting, because that’s where the block device pointer comes from; it calls I_BDEV(inode) which returns a struct block_device *. Although if we follow the stack even further up, we see that fdatawrite_one_bdev() in fs/sync.c also knows about a struct block_device *. This by the way appears to be what is called through the function pointer in iterate_bdevs():
1908                 func(I_BDEV(inode), arg);
This in turn is called from the sync() system call. In other words, I cannot see any caller that takes &bdev->bd_mutex. There may yet be another mechanism (maybe a lock) intended to prevent somebody from seeing bdev->bd_disk == NULL, but this seems like a strong indication of what the problem might be.

Let’s try to figure out more about ->bd_mutex, maybe there’s some documentation somewhere telling us what it’s supposed to protect. There is this:
include/linux/fs.h=454=struct block_device {
include/linux/fs.h-455- dev_t                   bd_dev;  /* not a kdev_t - it's a search key */
include/linux/fs.h-456- int                     bd_openers;
include/linux/fs.h-457- struct inode *          bd_inode;       /* will die */
include/linux/fs.h-458- struct super_block *    bd_super;
include/linux/fs.h:459: struct mutex            bd_mutex;       /* open/close mutex */
There is this:
include/linux/genhd.h-681- * Any access of part->nr_sects which is not protected by partition
include/linux/genhd.h:682: * bd_mutex or gendisk bdev bd_mutex, should be done using this
include/linux/genhd.h-683- * accessor function.
include/linux/genhd.h-684- *
include/linux/genhd.h-685- * Code written along the lines of i_size_read() and i_size_write().
include/linux/genhd.h-686- * CONFIG_PREEMPT case optimizes the case of UP kernel with preemption
include/linux/genhd.h-687- * on.
include/linux/genhd.h-688- */
include/linux/genhd.h=689=static inline sector_t part_nr_sects_read(struct hd_struct *part)
And there is this:
include/linux/genhd.h:712: * Should be called with mutex lock held (typically bd_mutex) of partition
include/linux/genhd.h-713- * to provide mutual exlusion among writers otherwise seqcount might be
include/linux/genhd.h-714- * left in wrong state leaving the readers spinning infinitely.
include/linux/genhd.h-715- */
include/linux/genhd.h-716-static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
Under Documentation/ there is also this:
--------------------------- block_device_operations -----------------------
locking rules:
open:                   yes
release:                yes
ioctl:                  no
compat_ioctl:           no
direct_access:          no
media_changed:          no
unlock_native_capacity: no
revalidate_disk:        no
getgeo:                 no
swap_slot_free_notify:  no      (see below)
Looking at __blkdev_get() again, there’s also one comment above it hinting at locking rules:
1233 /*                  
1234  * bd_mutex locking:    
1235  *                      
1236  *  mutex_lock(part->bd_mutex)
1237  *    mutex_lock_nested(whole->bd_mutex, 1)
1238  */             
1240 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
__blkdev_get() is called as part of blkdev_get(), which is what is called when you open a block device. In other words, it seems likely that we may have a race between opening/closing a block device and calling sync() – although for the sync() call to reach the block device, we should have some inode open on that block device (since we start out with an inode that is mapped to a block device with I_BDEV(inode)).

Looking at the syzkaller log file, there is a sync() call just before the crash, and I also see references to [sr0] unaligned transfer (and sr0 is a block device, so that seems slightly suspicious):
2016/08/25 05:45:02 executing program 0:
mmap(&(0x7f0000001000)=nil, (0x4000), 0x3, 0x31, 0xffffffffffffffff, 0x0)
mbind(&(0x7f0000004000)=nil, (0x1000), 0x8003, &(0x7f0000002000)=0x401, 0x9, 0x2)
shmat(0x0, &(0x7f0000001000)=nil, 0x4000)
dup2(0xffffffffffffffff, 0xffffffffffffff9c)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
clock_gettime(0x0, &(0x7f0000000000)={0x0, 0x0})
sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
kasan: CONFIG_KASAN_INLINE enabled
2016/08/25 05:45:03 result failed=false hanged=false:

2016/08/25 05:45:03 executing program 1:
mmap(&(0x7f0000002000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)
readahead(r0, 0xcb84, 0x10001)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
syz_open_dev$mixer(&(0x7f0000002000-0x8)="2f6465762f6d6978657200", 0x0, 0x86000)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x6, 0x12, r0, 0x0)
mount$fs(&(0x7f0000001000-0x6)="6d73646f7300", &(0x7f0000001000-0x6)="2e2f62757300", &(0x7f0000001000-0x6)="72616d667300", 0x880, &(0x7f0000000000)="1cc9417348")
kasan: GPF could be caused by NULL-ptr deref or user memory access
Here we see both the sync() call and the syz_open_dev$sr() call and we see that the GFP seems to happen some time shortly after opening sr0:
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)

>>> "2f6465762f73723000".decode('hex')
There’s also a mount$fs() call there that looks interesting. Its arguments are:
>>> "6d73646f7300".decode('hex')
>>> "2e2f62757300".decode('hex')
>>> "72616d667300".decode('hex')
However, I can’t see any references to any block devices in fs/ramfs, so I think this is unlikely to be it. I do still wonder how opening /dev/sr0 can do anything for us if it doesn’t have a filesystem or even a medium. [Note from the future: block devices are represented as inodes on the “bdev” pseudo-filesystem. Go figure!] Grepping for sr0 in the rest of the syzkaller log shows this bit, which seems to indicate we do in fact have inodes for sr0:
VFS: Dirty inode writeback failed for block device sr0 (err=-5).
Grepping for “Dirty inode writeback failed”, I find bdev_write_inode() in fs/block_dev.c, called only from… __blkdev_put(). It definitely feels like we’re on to something now – maybe a race between sync() and open()/close() for /dev/sr0.

syzkaller comes with some scripts to rerun the programs from a log file. I’m going to try that and see where it gets us – if we can reproduce the crash. I’ll first try to convert the two programs (the one with sync() and the one with the open(/dev/sr0)) to C and compile them. If that doesn’t work, syzkaller also has an option to auto-reproduce based on all the programs in the log file, but that’s likely slower and not always likely to succeed.

I use syz-prog2c and launch the two programs in parallel in a VM, but it doesn’t show anything at all. I switch to syz-repro to see if it can reproduce anything given the log file, but this fails too. I see that there are other sr0-related messages in the kernel log, so there must be a way to open the device without just getting ENOMEDIUM. I do a stat on /dev/sr0 to find the device numbers:
$ stat /dev/sr0 
  File: ‘/dev/sr0’
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 5h/5d   Inode: 7867        Links: 1     Device type: b,0
So the device major is 0xb (11 decimal). We can find this in include/uapi/linux/major.h and it gives us:
include/uapi/linux/major.h:#define SCSI_CDROM_MAJOR     11
We see that this is the driver responsible for /dev/sr0:
drivers/scsi/sr.c:      rc = register_blkdev(SCSI_CDROM_MAJOR, "sr");
(I could have guessed this as well, but there are so many systems and subsystems and drivers that I often double check just to make sure I’m in the right place.) I look for an open() function and I find two – sr_open() and sr_block_open(). sr_block_open() does cdrom_open() – from drivers/cdrom/cdrom.c – and this has an interesting line:
        /* if this was a O_NONBLOCK open and we should honor the flags,
         * do a quick open without drive/disc integrity checks. */
        if ((mode & FMODE_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
                ret = cdi->ops->open(cdi, 1);
So we need to pass O_NONBLOCK to get the device to open. When I add this to the test program from the syzkaller log and run sync() in parallel… ta-da!
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
CPU: 3 PID: 1333 Comm: sync1 Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff880114114080 task.stack: ffff880112bf0000
RIP: 0010:[<ffffffff8170654d>]  [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
RSP: 0018:ffff880112bf7ca0  EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff880112bf7d10 RCX: ffff8801141147d0
RDX: 0000000000000093 RSI: ffff8801170f8750 RDI: 0000000000000498
RBP: ffff880112bf7cd8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8801141147e8 R11: 0000000000000000 R12: ffff8801170f8750
R13: 0000000000000000 R14: ffff880112bf7d38 R15: ffff880112bf7d10
FS:  00007fd533aa2700(0000) GS:ffff88011ab80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000601028 CR3: 0000000112b04000 CR4: 00000000000006e0
 ffff8801170f8750 0000000000000000 1ffff1002257ef9e ffff8801170f8950
 ffff8801170f8750 0000000000000000 ffff880112bf7d10 ffff880112bf7db8
 ffffffff81508d70 0000000000000000 0000000041b58ab3 ffffffff844e89e1
Call Trace:
 [<ffffffff81508d70>] __filemap_fdatawrite_range+0x240/0x2e0
 [<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
 [<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
 [<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
 [<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
 [<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
 [<ffffffff817341b4>] iterate_bdevs+0x194/0x210
 [<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
 [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
 [<ffffffff817196be>] sys_sync+0xce/0x160
 [<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
 [<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
 [<ffffffff82001a47>] ? check_preemption_disabled+0x37/0x1e0
 [<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
 [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
 [<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 04 00 00 49 8d bd 98 04 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 63 30 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 83 04 00 00 4d 8b bd 98 04 00 00 48 b8 00 00
RIP  [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
 RSP <ffff880112bf7ca0>
---[ end trace 50fffb72f7adb3e5 ]---
This is not exactly the same oops that we saw before, but it’s close enough that it’s very likely to be a related crash. The reproducer is actually taking quite a while to trigger the issue, though. Even though I’ve reduced to two threads/processes executing just a handful of syscalls it still takes nearly half an hour to reproduce in a tight loop. I spend some time playing with the reproducer, trying out different things (read() instead of readahead(), just open()/close() with no reading at all, 2 threads doing sync(), etc.) to see if I can get it to trigger faster. In the end, I find that having many threads doing sync() in parallel seems to be the key to a quick reproducer, on the order of a couple of seconds.

Now that I have a fairly small reproducer it should be a lot easier to figure out the rest. I can add as many printk()s as I need to validate my theory that sync() should be taking the bd_mutex. For cases like this I set up a VM so that I can start the VM and run the reproducer by running a single command. I also actually like to use trace_printk() instead of plain printk() and boot with ftrace_dump_on_oops on the kernel command line – this way, the messages don’t get printed until the crash actually happens (and have a lower probability of interfering with the race itself; printk() goes directly to the console, which is usually pretty slow).

I apply this patch and recompile the kernel:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..fb9d5c5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1292,6 +1292,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
                                        bdev->bd_part = NULL;
+                                       trace_printk("%p->bd_disk = NULL\n", bdev);
                                        bdev->bd_disk = NULL;
                                        bdev->bd_queue = NULL;
@@ -1372,6 +1373,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
+       trace_printk("%p->bd_disk = NULL\n", bdev);
        bdev->bd_disk = NULL;
        bdev->bd_part = NULL;
        bdev->bd_queue = NULL;
@@ -1612,6 +1614,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
                bdev->bd_part = NULL;
+               trace_printk("%p->bd_disk = NULL\n", bdev);
                bdev->bd_disk = NULL;
                if (bdev != bdev->bd_contains)
                        victim = bdev->bd_contains;
@@ -1905,6 +1908,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
                old_inode = inode;
+               trace_printk("%p->bd_disk = %p\n", I_BDEV(inode), I_BDEV(inode)->bd_disk);
                func(I_BDEV(inode), arg);
With this patch applied, I get this output on a crash:
   sync1-1343    3.... 8303954us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
   sync1-1340    0.... 8303955us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
   sync1-1343    3.... 8303961us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
   sync1-1335    1.... 8304043us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
   sync2-1327    1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL
CPU: 2 PID: 1336 Comm: sync1 Not tainted 4.8.0-rc2+ #170
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff88011212d600 task.stack: ffff880112190000
RIP: 0010:[<ffffffff81f04c3a>]  [<ffffffff81f04c3a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880112197cd0  EFLAGS: 00010202
Since __blkdev_put() is the very last line of output before the crash (and I don’t see any other call setting ->bd_disk to NULL in the last few hundred lines or so), there is a very strong indication that this is the problematic assignment. Rerunning this a couple of times shows that it tends to crash with the same symptoms every time.

To get slightly more information about the context in which __blkdev_put() is called in, I apply this patch instead:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..298bf70 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,6 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
                bdev->bd_part = NULL;
+               trace_dump_stack(0);
                bdev->bd_disk = NULL;
                if (bdev != bdev->bd_contains)
                        victim = bdev->bd_contains;
With that, I get the following output:
   <...>-1328    0.... 9309173us : <stack trace>
 => blkdev_close
 => __fput
 => ____fput
 => task_work_run
 => exit_to_usermode_loop
 => do_syscall_64
 => return_from_SYSCALL_64
CPU: 3 PID: 1352 Comm: sync1 Not tainted 4.8.0-rc2+ #171
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
task: ffff88011248c080 task.stack: ffff880112568000
RIP: 0010:[<ffffffff81f04b7a>]  [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
One thing that’s a bit surprising to me is that this actually isn’t called directly from close(), but as a delayed work item on a workqueue. But in any case we can tell it comes from close() since fput() is called when closing a file descriptor.

Now that I have a fairly good idea of what’s going wrong, it’s time to focus on the fix. This is almost more difficult than what we’ve done so far because it’s such an open-ended problem. Of course I could add a brand new global spinlock to provide mutual exclusion between sync() and clone(), but that would be a bad solution and the wrong thing to do. Usually the author of the code in question had a specific locking scheme or design in mind and the bug is just due to a small flaw or omission somewhere. In other words, it’s usually not a bug in the general architecture of the code (which might require big changes to fix), but a small bug somewhere in the implementation, which would typically require just a few changed lines to fix. It’s fairly obvious that close() is trying to prevent somebody else from seeing bdev->bd_disk == NULL by wrapping most of the __blkdev_put() code in the ->bdev_mutex. This makes me think that it’s the sync() code path that is missing some locking.

Looking around __blkdev_put() and iterate_bdevs(), another thing that strikes me is that iterate_bdevs() is able to get a reference to a block device which is nevertheless in the process of being destroyed – maybe the real problem is that the block device is being destroyed too soon (while iterate_bdevs() is holding a reference to it). So it’s possible that iterate_bdevs() simply needs to formally take a reference to the block device by bumping its reference count while it does its work.

There is a function called bdgrab() which is supposed to take an extra reference to a block device – but only if you aready have one. Thus, using this would be just as racy, since we’re not already formally holding a reference to it. Another function, bd_acquire() seems to formally acquire a reference through a struct inode *. That seems quite promising. It is using the bdev_lock spinlock to prevent the block device from disappearing. I try this tentative patch:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..489473d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
        list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;
+               struct block_device *bdev;
                if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1905,7 +1906,11 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
                old_inode = inode;
-               func(I_BDEV(inode), arg);
+               bdev = bd_acquire(inode);
+               if (bdev) {
+                       func(bdev, arg);
+                       bdput(bdev);
+               }
My reasoning is that the call to bd_acquire() will prevent close() from actually reaching the bits in __blkdev_put() that do the final cleanup (i.e. setting bdev->bd_disk to NULL) and so prevent the crash from happening.

Unfortunately, running the reproducer again shows no change that I can see. It seems that I was wrong about this preventing __blkdev_put() from running: blkdev_close() calls blkdev_put() unconditionally, which calls __blkdev_put() unconditionally.

Another idea might be to remove the block device from the list that iterate_bdevs() is traversing before setting bdev->bd_disk to NULL. However, it seems that this is all handled by the VFS and we can’t really change it just for block devices.

Reading over most of fs/block_dev.c, I decide to fall back to my first (and more obvious) idea: take bd_mutex in iterate_bdevs(). This should be safe since both the s_inode_list_lock and inode->i_lock are dropped before calling the iterate_bdevs() callback function. However, I am still getting the same crash… On second thought, even taking bd_mutex is not enough because bdev->bd_disk will still be NULL when __blkdev_put() releases the mutex. Maybe there’s a condition we can test while holding the mutex that will tell us whether the block device is “useable” or not. We could test ->bd_disk directly, which is what we’re really interested in, but that seems like a derived property and not a real indication of whether the block device has been closed or not; ->bd_holders or ->bd_openers MAY be better candidates.

While digging around trying to figure out whether to check ->bd_disk, ->bd_holders, or ->bd_openers, I came across this comment in one of the functions in the crashing call chain:
 106 /**
 107  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
 108  * @bdev:       device
 109  *
 110  * Locates the passed device's request queue and returns the address of its
 111  * backing_dev_info.  This function can only be called if @bdev is opened
 112  * and the return value is never NULL.
 113  */
 114 struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
 115 {
 116         struct request_queue *q = bdev_get_queue(bdev);
 118         return &q->backing_dev_info;
 119 }
In particular, the “This function can only be called if @bdev is opened” requirement seems to be violated in our case.

Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a fairly reliable test of whether it’s safe to call filemap_fdatawrite() for the block device inode. The underlying problem here is that sync() is able to get a reference to a struct block_device without having it open as a file. Doing something like this does fix the bug:
diff --git a/fs/sync.c b/fs/sync.c
index 2a54c1f..9189eeb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -81,7 +81,10 @@ static void sync_fs_one_sb(struct super_block *sb, void *arg)
 static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
-       filemap_fdatawrite(bdev->bd_inode->i_mapping);
+       mutex_lock(&bdev->bd_mutex);
+       if (bdev->bd_disk)
+               filemap_fdatawrite(bdev->bd_inode->i_mapping);
+       mutex_unlock(&bdev->bd_mutex);
 static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
What I don’t like about this patch is that it simply skips block devices which we don’t have any open file descriptors for. That seems wrong to me because sync() should do writeback on (and wait for) all devices, not just the ones that we happen to have an open file descriptor for. Imagine if we opened a device, wrote a lot of data to it, closed it, called sync(), and sync() returns. Now we should be guaranteed the data was written, but I’m not sure we are in this case.

Another slightly ugly thing is that we’re now holding a new mutex over a potentially big chunk of code (everything that happens inside filemap_fdatawrite()).

I’m not sure I can do much better in terms of a small patch at the moment, so I will submit this to the linux-block mailing list with a few relevant people on Cc (Jens Axboe for being the block maintainer, Tejun Heo for having written a lot of the code involved according to git blame, Jan Kara for writing iterate_bdevs(), and Al Viro for probably knowing both the block layer and VFS quite well).

I submitted my patch here: thread

Rabin Vincent answered pretty quickly that he already sent a fix for the very same issue. Oh well, at least his patch is quite close to what I came up with and I learned quite a few new things about the kernel.

Tejun Heo also responded that a better fix would probably be to prevent the disk from going away by getting a reference to it. I tried a couple of different patches without much luck. The currently last patch from me in that thread seemed to prevent the crash, but as I only realised a few minutes after sending it: we’re decrementing the reference count without doing anything when it reaches 0! Of course we don’t get a NULL pointer dereference if we never do the cleanup/freeing in the first place…

If you liked this post and you enjoy fixing bugs like this one, you may enjoy working with us in the Ksplice group at Oracle. Ping me at my Oracle email address :-)