Having done quite a bit of kernel fuzzing and debugging lately I’ve decided to take one of the very latest crashes and write up the whole process from start to finish as I work through it. As you will see, I'm not very familiar with the site of this particular crash, the block layer. Being familiar with some existing kernel code helps, of course, since you recognise a lot of code patterns, but the kernel is so large that nobody can be familiar with everything and the crashes found by trinity and syzkaller can show up almost anywhere.
So I got this with syzkaller after running it for a few hours:
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff880110762cc0 task.stack: ffff880102290000
RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880102297cd0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90000bb4000
RDX: 0000000000000097 RSI: 0000000000000000 RDI: 00000000000004b8
RBP: ffff880102297cd8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88011a010a90
R13: ffff88011a594568 R14: ffff88011a010890 R15: 7fffffffffffffff
FS: 00007f2445174700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200047c8 CR3: 0000000107eb5000 CR4: 00000000000006f0
DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
1ffff10020452f9e ffff880102297db8 ffffffff81508daa 0000000000000000
0000000041b58ab3 ffffffff844e89e1 ffffffff81508b30 ffffed0020452001
7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff
Call Trace:
[<ffffffff81508daa>] __filemap_fdatawrite_range+0x27a/0x2e0
[<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
[<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
[<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
[<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
[<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
[<ffffffff817341b4>] iterate_bdevs+0x194/0x210
[<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff817196be>] sys_sync+0xce/0x160
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
[<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
[<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
RIP [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
RSP <ffff880102297cd0>
The very first thing to do is to look up the code in the backtrace:
$ addr2line -e vmlinux -i ffffffff81f04b7a ffffffff81508daa ffffffff81508e36 ffffffff817191b0 ffffffff817341b4 ffffffff817196be
./include/linux/blkdev.h:844
block/blk-core.c:116
./include/linux/backing-dev.h:186
./include/linux/backing-dev.h:229
mm/filemap.c:316
mm/filemap.c:334
fs/sync.c:85
./include/linux/spinlock.h:302
fs/block_dev.c:1910
fs/sync.c:116
The actual site of the crash is this:
842 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
843 {
844 return bdev->bd_disk->queue; /* this is never NULL */
845 }
Because we’re using KASAN we can’t look at CR2 to find the bad pointer because KASAN triggers before the page fault (or to be completely honest, KASAN tries to access the shadow memory for the bad pointer, which is itself a bad pointer and causes the GPF above).
Let’s look at the “Code:” line to try to find the exact dereference causing the error:
$ echo 'Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00 ' | scripts/decodecode
Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
All code
========
0: 89 fa mov %edi,%edx
2: 48 c1 ea 03 shr $0x3,%rdx
6: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
a: 75 35 jne 0x41
c: 48 8b 9b e0 00 00 00 mov 0xe0(%rbx),%rbx
13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1a: fc ff df
1d: 48 8d bb b8 04 00 00 lea 0x4b8(%rbx),%rdi
24: 48 89 fa mov %rdi,%rdx
27: 48 c1 ea 03 shr $0x3,%rdx
2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2f: 75 17 jne 0x48
31: 48 8b 83 b8 04 00 00 mov 0x4b8(%rbx),%rax
38: 5b pop %rbx
39: 5d pop %rbp
3a: 48 05 10 02 00 00 add $0x210,%rax
I’m using CONFIG_KASAN_INLINE=y, so most of the code above is actually generated by KASAN, which makes things a bit harder to read. The movabs with a weird 0xdffff… address is how it generates the address for the shadow memory bytemap, and the cmpb that crashed is where it tries to read the value of the shadow byte.
The address is %rdx + %rax and we know that %rax is 0xdffffc0000000000. Let’s look at %rdx in the crash above…
RDX: 0000000000000097
Yup, that’s a NULL pointer dereference all right.
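To double-check the arithmetic, KASAN’s inline shadow computation can be sketched in a few lines of Python (the shadow offset and scale below are the x86-64 values visible in the disassembly, not anything specific to this post):

```python
KASAN_SHADOW_OFFSET = 0xdffffc0000000000  # the movabs constant in the disassembly
KASAN_SHADOW_SCALE_SHIFT = 3              # one shadow byte covers 8 bytes

def shadow_byte_addr(addr):
    # Mirrors the generated code: shr $0x3, %rdx; cmpb $0x0,(%rdx,%rax,1)
    return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET

# %rbx (the pointer being dereferenced) was 0, so the access is to offset 0x4b8.
bad_ptr = 0x4b8
assert bad_ptr >> KASAN_SHADOW_SCALE_SHIFT == 0x97   # matches RDX in the oops
assert shadow_byte_addr(bad_ptr) == 0xdffffc0000000097
# 0xdffffc0000000097 is a non-canonical address on x86-64, which is why we get
# a general protection fault (rather than a page fault) and a useless CR2.
```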
But the line in question has two pointer dereferences, bdev->bd_disk and bd_disk->queue, and which one is the crash? The lea 0x4b8(%rbx), %rdi is what gives it away, since that gives us the offset into the structure that is being dereferenced (also, NOT coincidentally, %rbx is 0). Let’s use pahole:
$ pahole -C 'block_device' vmlinux
struct block_device {
dev_t bd_dev; /* 0 4 */
int bd_openers; /* 4 4 */
struct inode * bd_inode; /* 8 8 */
struct super_block * bd_super; /* 16 8 */
struct mutex bd_mutex; /* 24 128 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
void * bd_claiming; /* 152 8 */
void * bd_holder; /* 160 8 */
int bd_holders; /* 168 4 */
bool bd_write_holder; /* 172 1 */
/* XXX 3 bytes hole, try to pack */
struct list_head bd_holder_disks; /* 176 16 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct block_device * bd_contains; /* 192 8 */
unsigned int bd_block_size; /* 200 4 */
/* XXX 4 bytes hole, try to pack */
struct hd_struct * bd_part; /* 208 8 */
unsigned int bd_part_count; /* 216 4 */
int bd_invalidated; /* 220 4 */
struct gendisk * bd_disk; /* 224 8 */
struct request_queue * bd_queue; /* 232 8 */
struct list_head bd_list; /* 240 16 */
/* --- cacheline 4 boundary (256 bytes) --- */
long unsigned int bd_private; /* 256 8 */
int bd_fsfreeze_count; /* 264 4 */
/* XXX 4 bytes hole, try to pack */
struct mutex bd_fsfreeze_mutex; /* 272 128 */
/* --- cacheline 6 boundary (384 bytes) was 16 bytes ago --- */
/* size: 400, cachelines: 7, members: 21 */
/* sum members: 389, holes: 3, sum holes: 11 */
/* last cacheline: 16 bytes */
};
0x4b8 is 1208 in decimal, which is way bigger than this struct. Let’s try the other one:
$ pahole -C 'gendisk' vmlinux
struct gendisk {
int major; /* 0 4 */
int first_minor; /* 4 4 */
int minors; /* 8 4 */
char disk_name[32]; /* 12 32 */
/* XXX 4 bytes hole, try to pack */
char * (*devnode)(struct gendisk *, umode_t *); /* 48 8 */
unsigned int events; /* 56 4 */
unsigned int async_events; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct disk_part_tbl * part_tbl; /* 64 8 */
struct hd_struct part0; /* 72 1128 */
/* --- cacheline 18 boundary (1152 bytes) was 48 bytes ago --- */
const struct block_device_operations * fops; /* 1200 8 */
struct request_queue * queue; /* 1208 8 */
/* --- cacheline 19 boundary (1216 bytes) --- */
void * private_data; /* 1216 8 */
int flags; /* 1224 4 */
/* XXX 4 bytes hole, try to pack */
struct kobject * slave_dir; /* 1232 8 */
struct timer_rand_state * random; /* 1240 8 */
atomic_t sync_io; /* 1248 4 */
/* XXX 4 bytes hole, try to pack */
struct disk_events * ev; /* 1256 8 */
struct kobject integrity_kobj; /* 1264 64 */
/* --- cacheline 20 boundary (1280 bytes) was 48 bytes ago --- */
int node_id; /* 1328 4 */
/* XXX 4 bytes hole, try to pack */
struct badblocks * bb; /* 1336 8 */
/* --- cacheline 21 boundary (1344 bytes) --- */
/* size: 1344, cachelines: 21, members: 20 */
/* sum members: 1328, holes: 4, sum holes: 16 */
};
1208 is ->queue, so that fits well with what we’re seeing; therefore, bdev->bd_disk must be NULL.
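As a quick sanity check of the offset reasoning, using the sizes and offsets from the pahole output above:

```python
# Numbers taken from the pahole output above (specific to this kernel build).
SIZEOF_BLOCK_DEVICE = 400      # "size: 400" for struct block_device
OFFSETOF_GENDISK_QUEUE = 1208  # "struct request_queue * queue; /* 1208 8 */"

lea_offset = 0x4b8             # from "lea 0x4b8(%rbx),%rdi" in the disassembly
assert lea_offset == 1208
assert lea_offset > SIZEOF_BLOCK_DEVICE       # can't be a block_device field
assert lea_offset == OFFSETOF_GENDISK_QUEUE   # it is exactly gendisk->queue
```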
At this point I would go up the stack of functions to see if anything sticks out – although unlikely, it’s possible that it’s an “easy” bug where you can tell just from looking at the code in a single function that it sets the pointer to NULL just before calling the function that crashed, or something like that.
Probably the most interesting function in the stack trace (at a glance) is iterate_bdevs() in fs/block_dev.c:
1880 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
1881 {
1882 struct inode *inode, *old_inode = NULL;
1883
1884 spin_lock(&blockdev_superblock->s_inode_list_lock);
1885 list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
1886 struct address_space *mapping = inode->i_mapping;
1887
1888 spin_lock(&inode->i_lock);
1889 if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
1890 mapping->nrpages == 0) {
1891 spin_unlock(&inode->i_lock);
1892 continue;
1893 }
1894 __iget(inode);
1895 spin_unlock(&inode->i_lock);
1896 spin_unlock(&blockdev_superblock->s_inode_list_lock);
1897 /*
1898 * We hold a reference to 'inode' so it couldn't have been
1899 * removed from s_inodes list while we dropped the
1900 * s_inode_list_lock We cannot iput the inode now as we can
1901 * be holding the last reference and we cannot iput it under
1902 * s_inode_list_lock. So we keep the reference and iput it
1903 * later.
1904 */
1905 iput(old_inode);
1906 old_inode = inode;
1907
1908 func(I_BDEV(inode), arg);
1909
1910 spin_lock(&blockdev_superblock->s_inode_list_lock);
1911 }
1912 spin_unlock(&blockdev_superblock->s_inode_list_lock);
1913 iput(old_inode);
1914 }
I can’t quite put my finger on it, but it looks interesting because it has a bunch of locking in it and it seems to be what’s getting the block device from a given inode. I ran git blame on the file/function in question since that might point to a recent change there, but the most interesting thing is commit 74278da9f7 changing some locking logic. Maybe relevant, maybe not, but let’s keep it in mind.
Remember that bdev->bd_disk is NULL. Let’s try to check if ->bd_disk is assigned NULL anywhere:
$ git grep -n '\->bd_disk.*=.*NULL'
block/blk-flush.c:470: if (bdev->bd_disk == NULL)
drivers/block/xen-blkback/xenbus.c:466: if (vbd->bdev->bd_disk == NULL) {
fs/block_dev.c:1295: bdev->bd_disk = NULL;
fs/block_dev.c:1375: bdev->bd_disk = NULL;
fs/block_dev.c:1615: bdev->bd_disk = NULL;
kernel/trace/blktrace.c:1624: if (bdev->bd_disk == NULL)
This by no means necessarily includes the code that set ->bd_disk to NULL in our case (since there could be code that looks like x = NULL; bdev->bd_disk = x; which wouldn’t be found with the regex above), but this is a good start and I’ll look at the functions above just to see if they might be relevant. Actually, for this I’ll just add -W to the git grep above to quickly look at the functions.
The first two and last hits are comparisons, so they are uninteresting. The third and fourth ones are part of error paths in __blkdev_get(). That might be interesting if the process that crashed somehow managed to get a reference to the block device just after the NULL assignment (if so, that would probably be a locking bug in either __blkdev_get() or one of the functions in the crash stack trace – OR it might be a bug where the struct block_device * is made visible/reachable before it’s ready). The fifth one is in __blkdev_put(). I’m going to read over __blkdev_get() and __blkdev_put() to figure out what they do and if there’s maybe something going on in either of those.
In all these cases, it seems to me that &bdev->bd_mutex is locked; that’s a good sign. That’s also maybe an indication that we should be taking &bdev->bd_mutex in the other code path, so let’s check if we are. There’s nothing that I can see in any of the functions from inode_to_bdi() and up. Although inode_to_bdi() itself looks interesting, because that’s where the block device pointer comes from; it calls I_BDEV(inode), which returns a struct block_device *. If we follow the stack even further up, we see that fdatawrite_one_bdev() in fs/sync.c also knows about a struct block_device *. This, by the way, appears to be what is called through the function pointer in iterate_bdevs():
1908 func(I_BDEV(inode), arg);
This in turn is called from the sync() system call. In other words, I cannot see any caller that takes &bdev->bd_mutex. There may yet be another mechanism (maybe a lock) intended to prevent somebody from seeing bdev->bd_disk == NULL, but this seems like a strong indication of what the problem might be.
Let’s try to figure out more about ->bd_mutex; maybe there’s some documentation somewhere telling us what it’s supposed to protect. There is this:
include/linux/fs.h=454=struct block_device {
include/linux/fs.h-455- dev_t bd_dev; /* not a kdev_t - it's a search key */
include/linux/fs.h-456- int bd_openers;
include/linux/fs.h-457- struct inode * bd_inode; /* will die */
include/linux/fs.h-458- struct super_block * bd_super;
include/linux/fs.h:459: struct mutex bd_mutex; /* open/close mutex */
There is this:
include/linux/genhd.h-680-/*
include/linux/genhd.h-681- * Any access of part->nr_sects which is not protected by partition
include/linux/genhd.h:682: * bd_mutex or gendisk bdev bd_mutex, should be done using this
include/linux/genhd.h-683- * accessor function.
include/linux/genhd.h-684- *
include/linux/genhd.h-685- * Code written along the lines of i_size_read() and i_size_write().
include/linux/genhd.h-686- * CONFIG_PREEMPT case optimizes the case of UP kernel with preemption
include/linux/genhd.h-687- * on.
include/linux/genhd.h-688- */
include/linux/genhd.h=689=static inline sector_t part_nr_sects_read(struct hd_struct *part)
And there is this:
include/linux/genhd.h-711-/*
include/linux/genhd.h:712: * Should be called with mutex lock held (typically bd_mutex) of partition
include/linux/genhd.h-713- * to provide mutual exlusion among writers otherwise seqcount might be
include/linux/genhd.h-714- * left in wrong state leaving the readers spinning infinitely.
include/linux/genhd.h-715- */
include/linux/genhd.h-716-static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
Under Documentation/ there is also this:
--------------------------- block_device_operations -----------------------
[...]
locking rules:
bd_mutex
open: yes
release: yes
ioctl: no
compat_ioctl: no
direct_access: no
media_changed: no
unlock_native_capacity: no
revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
Looking at __blkdev_get() again, there’s also a comment above it hinting at the locking rules:
1233 /*
1234 * bd_mutex locking:
1235 *
1236 * mutex_lock(part->bd_mutex)
1237 * mutex_lock_nested(whole->bd_mutex, 1)
1238 */
1239
1240 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
__blkdev_get() is called as part of blkdev_get(), which is what is called when you open a block device. In other words, it seems likely that we may have a race between opening/closing a block device and calling sync() – although for the sync() call to reach the block device, we should have some inode open on that block device (since we start out with an inode that is mapped to a block device with I_BDEV(inode)).
Looking at the syzkaller log file, there is a sync() call just before the crash, and I also see references to [sr0] unaligned transfer (and sr0 is a block device, so that seems slightly suspicious):
2016/08/25 05:45:02 executing program 0:
mmap(&(0x7f0000001000)=nil, (0x4000), 0x3, 0x31, 0xffffffffffffffff, 0x0)
mbind(&(0x7f0000004000)=nil, (0x1000), 0x8003, &(0x7f0000002000)=0x401, 0x9, 0x2)
shmat(0x0, &(0x7f0000001000)=nil, 0x4000)
dup2(0xffffffffffffffff, 0xffffffffffffff9c)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
sync()
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
clock_gettime(0x0, &(0x7f0000000000)={0x0, 0x0})
sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
kasan: CONFIG_KASAN_INLINE enabled
2016/08/25 05:45:03 result failed=false hanged=false:
2016/08/25 05:45:03 executing program 1:
mmap(&(0x7f0000002000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)
readahead(r0, 0xcb84, 0x10001)
mmap(&(0x7f0000000000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
syz_open_dev$mixer(&(0x7f0000002000-0x8)="2f6465762f6d6978657200", 0x0, 0x86000)
mmap(&(0x7f0000001000)=nil, (0x1000), 0x6, 0x12, r0, 0x0)
mount$fs(&(0x7f0000001000-0x6)="6d73646f7300", &(0x7f0000001000-0x6)="2e2f62757300", &(0x7f0000001000-0x6)="72616d667300", 0x880, &(0x7f0000000000)="1cc9417348")
kasan: GPF could be caused by NULL-ptr deref or user memory access
Here we see both the sync() call and the syz_open_dev$sr() call, and we see that the GPF seems to happen some time shortly after opening sr0:
r0 = syz_open_dev$sr(&(0x7f0000002000)="2f6465762f73723000", 0x0, 0x4800)
>>> "2f6465762f73723000".decode('hex')
'/dev/sr0\x00'
There’s also a mount$fs() call there that looks interesting. Its arguments are:
>>> "6d73646f7300".decode('hex')
'msdos\x00'
>>> "2e2f62757300".decode('hex')
'./bus\x00'
>>> "72616d667300".decode('hex')
'ramfs\x00'
However, I can’t see any references to any block devices in fs/ramfs, so I think this is unlikely to be it. I do still wonder how opening /dev/sr0 can do anything for us if it doesn’t have a filesystem or even a medium. [Note from the future: block devices are represented as inodes on the “bdev” pseudo-filesystem. Go figure!] Grepping for sr0 in the rest of the syzkaller log shows this bit, which seems to indicate we do in fact have inodes for sr0:
VFS: Dirty inode writeback failed for block device sr0 (err=-5).
Grepping for “Dirty inode writeback failed”, I find bdev_write_inode() in fs/block_dev.c, called only from… __blkdev_put(). It definitely feels like we’re on to something now – maybe a race between sync() and open()/close() for /dev/sr0.
syzkaller comes with some scripts to rerun the programs from a log file. I’m going to try that and see where it gets us – if we can reproduce the crash. I’ll first try to convert the two programs (the one with sync() and the one with the open() of /dev/sr0) to C and compile them. If that doesn’t work, syzkaller also has an option to auto-reproduce based on all the programs in the log file, but that’s likely slower and not guaranteed to succeed.
I use syz-prog2c and launch the two programs in parallel in a VM, but it doesn’t show anything at all. I switch to syz-repro to see if it can reproduce anything given the log file, but this fails too. I see that there are other sr0-related messages in the kernel log, so there must be a way to open the device without just getting ENOMEDIUM. I do a stat on /dev/sr0 to find the device numbers:
$ stat /dev/sr0
File: ‘/dev/sr0’
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 7867 Links: 1 Device type: b,0
So the device major is 0xb (11 decimal). We can find this in include/uapi/linux/major.h and it gives us:
include/uapi/linux/major.h:#define SCSI_CDROM_MAJOR 11
We see that this is the driver responsible for /dev/sr0:
drivers/scsi/sr.c: rc = register_blkdev(SCSI_CDROM_MAJOR, "sr");
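The “b,0” in the stat output is the major and minor number in hex, and this packing can be sanity-checked from Python’s os module (a small check of my own, not from the kernel sources):

```python
import os

SCSI_CDROM_MAJOR = 11  # include/uapi/linux/major.h

# stat showed "Device type: b,0": major 0xb, minor 0.
dev = os.makedev(SCSI_CDROM_MAJOR, 0)
assert os.major(dev) == 0xb
assert os.minor(dev) == 0
# On a live system, the same numbers come straight from the device node:
#   os.major(os.stat("/dev/sr0").st_rdev)
```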
(I could have guessed this as well, but there are so many systems and subsystems and drivers that I often double check just to make sure I’m in the right place.) I look for an open() function and I find two – sr_open() and sr_block_open(). sr_block_open() does cdrom_open() – from drivers/cdrom/cdrom.c – and this has an interesting line:
/* if this was a O_NONBLOCK open and we should honor the flags,
* do a quick open without drive/disc integrity checks. */
cdi->use_count++;
if ((mode & FMODE_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
ret = cdi->ops->open(cdi, 1);
So we need to pass O_NONBLOCK to get the device to open. When I add this to the test program from the syzkaller log and run sync() in parallel… ta-da!
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 3 PID: 1333 Comm: sync1 Not tainted 4.8.0-rc2+ #169
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff880114114080 task.stack: ffff880112bf0000
RIP: 0010:[<ffffffff8170654d>] [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
RSP: 0018:ffff880112bf7ca0 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff880112bf7d10 RCX: ffff8801141147d0
RDX: 0000000000000093 RSI: ffff8801170f8750 RDI: 0000000000000498
RBP: ffff880112bf7cd8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8801141147e8 R11: 0000000000000000 R12: ffff8801170f8750
R13: 0000000000000000 R14: ffff880112bf7d38 R15: ffff880112bf7d10
FS: 00007fd533aa2700(0000) GS:ffff88011ab80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000601028 CR3: 0000000112b04000 CR4: 00000000000006e0
Stack:
ffff8801170f8750 0000000000000000 1ffff1002257ef9e ffff8801170f8950
ffff8801170f8750 0000000000000000 ffff880112bf7d10 ffff880112bf7db8
ffffffff81508d70 0000000000000000 0000000041b58ab3 ffffffff844e89e1
Call Trace:
[<ffffffff81508d70>] __filemap_fdatawrite_range+0x240/0x2e0
[<ffffffff81508b30>] ? filemap_check_errors+0xe0/0xe0
[<ffffffff83c24b47>] ? preempt_schedule+0x27/0x30
[<ffffffff810020ae>] ? ___preempt_schedule+0x16/0x18
[<ffffffff81508e36>] filemap_fdatawrite+0x26/0x30
[<ffffffff817191b0>] fdatawrite_one_bdev+0x50/0x70
[<ffffffff817341b4>] iterate_bdevs+0x194/0x210
[<ffffffff81719160>] ? fdatawait_one_bdev+0x70/0x70
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff817196be>] sys_sync+0xce/0x160
[<ffffffff817195f0>] ? sync_filesystem+0x240/0x240
[<ffffffff81002b60>] ? exit_to_usermode_loop+0x190/0x190
[<ffffffff82001a47>] ? check_preemption_disabled+0x37/0x1e0
[<ffffffff8150455a>] ? __context_tracking_exit.part.4+0x3a/0x1e0
[<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
[<ffffffff83c3276a>] entry_SYSCALL64_slow_path+0x25/0x25
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 b3 04 00 00 49 8d bd 98 04 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 63 30 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 83 04 00 00 4d 8b bd 98 04 00 00 48 b8 00 00
RIP [<ffffffff8170654d>] wbc_attach_and_unlock_inode+0x23d/0x760
RSP <ffff880112bf7ca0>
---[ end trace 50fffb72f7adb3e5 ]---
This is not exactly the same oops that we saw before, but it’s close enough that it’s very likely to be a related crash. The reproducer is actually taking quite a while to trigger the issue, though. Even though I’ve reduced it to two threads/processes executing just a handful of syscalls, it still takes nearly half an hour to reproduce in a tight loop. I spend some time playing with the reproducer, trying out different things (read() instead of readahead(), just open()/close() with no reading at all, 2 threads doing sync(), etc.) to see if I can get it to trigger faster. In the end, I find that having many threads doing sync() in parallel seems to be the key to a quick reproducer, on the order of a couple of seconds.
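The resulting reproducer boils down to something like the following sketch (Python standing in for the actual C program; the device path, iteration counts, and thread counts are all illustrative, and on an affected kernel this crashes the machine, so scratch VM only):

```python
import os
import threading

DEV = "/dev/sr0"  # illustrative: a block device with no medium present

def opener(iters):
    # O_NONBLOCK takes cdrom_open()'s quick path, so the open succeeds even
    # without a medium; each close() eventually reaches __blkdev_put().
    for _ in range(iters):
        try:
            fd = os.open(DEV, os.O_RDONLY | os.O_NONBLOCK)
            os.close(fd)
        except OSError:
            pass

def syncer(iters):
    # Many concurrent sync() callers make iterate_bdevs() hit the race window.
    for _ in range(iters):
        os.sync()

def run(iters=100_000, n_syncers=8):
    threads = [threading.Thread(target=opener, args=(iters,))]
    threads += [threading.Thread(target=syncer, args=(iters,))
                for _ in range(n_syncers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```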
Now that I have a fairly small reproducer it should be a lot easier to figure out the rest. I can add as many printk()s as I need to validate my theory that sync() should be taking the bd_mutex. For cases like this I set up a VM so that I can start the VM and run the reproducer with a single command. I also like to use trace_printk() instead of plain printk() and boot with ftrace_dump_on_oops on the kernel command line – this way, the messages don’t get printed until the crash actually happens (and have a lower probability of interfering with the race itself; printk() goes directly to the console, which is usually pretty slow).
I apply this patch and recompile the kernel:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..fb9d5c5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1292,6 +1292,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
*/
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
@@ -1372,6 +1373,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
out_clear:
disk_put_part(bdev->bd_part);
+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
@@ -1612,6 +1614,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
+ trace_printk("%p->bd_disk = NULL\n", bdev);
bdev->bd_disk = NULL;
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
@@ -1905,6 +1908,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
iput(old_inode);
old_inode = inode;
+ trace_printk("%p->bd_disk = %p\n", I_BDEV(inode), I_BDEV(inode)->bd_disk);
func(I_BDEV(inode), arg);
spin_lock(&blockdev_superblock->s_inode_list_lock);
With this patch applied, I get this output on a crash:
sync1-1343 3.... 8303954us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1340 0.... 8303955us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1343 3.... 8303961us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync1-1335 1.... 8304043us : iterate_bdevs: ffff88011a0105c0->bd_disk = ffff880114618880
sync2-1327 1.... 8304852us : __blkdev_put: ffff88011a0105c0->bd_disk = NULL
---------------------------------
CPU: 2 PID: 1336 Comm: sync1 Not tainted 4.8.0-rc2+ #170
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff88011212d600 task.stack: ffff880112190000
RIP: 0010:[<ffffffff81f04c3a>] [<ffffffff81f04c3a>] blk_get_backing_dev_info+0x4a/0x70
RSP: 0018:ffff880112197cd0 EFLAGS: 00010202
Since __blkdev_put() is the very last line of output before the crash (and I don’t see any other call setting ->bd_disk to NULL in the last few hundred lines or so), there is a very strong indication that this is the problematic assignment. Rerunning this a couple of times shows that it tends to crash with the same symptoms every time.
To get slightly more information about the context in which __blkdev_put() is called, I apply this patch instead:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..298bf70 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,6 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
+ trace_dump_stack(0);
bdev->bd_disk = NULL;
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
With that, I get the following output:
<...>-1328 0.... 9309173us : <stack trace>
=> blkdev_close
=> __fput
=> ____fput
=> task_work_run
=> exit_to_usermode_loop
=> do_syscall_64
=> return_from_SYSCALL_64
---------------------------------
CPU: 3 PID: 1352 Comm: sync1 Not tainted 4.8.0-rc2+ #171
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff88011248c080 task.stack: ffff880112568000
RIP: 0010:[<ffffffff81f04b7a>] [<ffffffff81f04b7a>] blk_get_backing_dev_info+0x4a/0x70
One thing that’s a bit surprising to me is that this actually isn’t called directly from close(), but as a delayed work item on a workqueue. But in any case, we can tell it comes from close() since fput() is called when closing a file descriptor.
Now that I have a fairly good idea of what’s going wrong, it’s time to focus on the fix. This is almost more difficult than what we’ve done so far because it’s such an open-ended problem. Of course I could add a brand new global spinlock to provide mutual exclusion between sync() and close(), but that would be a bad solution and the wrong thing to do. Usually the author of the code in question had a specific locking scheme or design in mind and the bug is just due to a small flaw or omission somewhere. In other words, it’s usually not a bug in the general architecture of the code (which might require big changes to fix), but a small bug somewhere in the implementation, which would typically require just a few changed lines to fix. It’s fairly obvious that close() is trying to prevent somebody else from seeing bdev->bd_disk == NULL by wrapping most of the __blkdev_put() code in ->bd_mutex. This makes me think that it’s the sync() code path that is missing some locking.
Looking around __blkdev_put() and iterate_bdevs(), another thing that strikes me is that iterate_bdevs() is able to get a reference to a block device which is nevertheless in the process of being destroyed – maybe the real problem is that the block device is being destroyed too soon (while iterate_bdevs() is holding a reference to it). So it’s possible that iterate_bdevs() simply needs to formally take a reference to the block device by bumping its reference count while it does its work.
There is a function called bdgrab() which is supposed to take an extra reference to a block device – but only if you already have one. Thus, using this would be just as racy, since we’re not already formally holding a reference to it. Another function, bd_acquire(), seems to formally acquire a reference through a struct inode *. That seems quite promising. It is using the bdev_lock spinlock to prevent the block device from disappearing. I try this tentative patch:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e17bdbd..489473d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
spin_lock(&blockdev_superblock->s_inode_list_lock);
list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
+ struct block_device *bdev;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1905,7 +1906,11 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
iput(old_inode);
old_inode = inode;
- func(I_BDEV(inode), arg);
+ bdev = bd_acquire(inode);
+ if (bdev) {
+ func(bdev, arg);
+ bdput(bdev);
+ }
spin_lock(&blockdev_superblock->s_inode_list_lock);
}
My reasoning is that the call to bd_acquire() will prevent close() from actually reaching the bits in __blkdev_put() that do the final cleanup (i.e. setting bdev->bd_disk to NULL) and so prevent the crash from happening.
Unfortunately, running the reproducer again shows no change that I can see. It seems that I was wrong about this preventing __blkdev_put() from running: blkdev_close() calls blkdev_put() unconditionally, which calls __blkdev_put() unconditionally.
Another idea might be to remove the block device from the list that iterate_bdevs() is traversing before setting bdev->bd_disk to NULL. However, it seems that this is all handled by the VFS and we can’t really change it just for block devices.
Reading over most of fs/block_dev.c, I decide to fall back to my first (and more obvious) idea: take bd_mutex in iterate_bdevs(). This should be safe since both the s_inode_list_lock and inode->i_lock are dropped before calling the iterate_bdevs() callback function. However, I am still getting the same crash… On second thought, even taking bd_mutex is not enough, because bdev->bd_disk will still be NULL when __blkdev_put() releases the mutex. Maybe there’s a condition we can test while holding the mutex that will tell us whether the block device is usable or not. We could test ->bd_disk directly, which is what we’re really interested in, but that seems like a derived property and not a real indication of whether the block device has been closed or not; ->bd_holders or ->bd_openers may be better candidates.
While digging around trying to figure out whether to check ->bd_disk, ->bd_holders, or ->bd_openers, I came across this comment in one of the functions in the crashing call chain:
/**
 * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
 * @bdev: device
 *
 * Locates the passed device's request queue and returns the address of its
 * backing_dev_info. This function can only be called if @bdev is opened
 * and the return value is never NULL.
 */
struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	return &q->backing_dev_info;
}
In particular, the “This function can only be called if @bdev is opened” requirement seems to be violated in our case.
Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a fairly reliable test of whether it's safe to call filemap_fdatawrite() for the block device inode. The underlying problem here is that sync() is able to get a reference to a struct block_device without having it open as a file. Doing something like this does fix the bug:
diff --git a/fs/sync.c b/fs/sync.c
index 2a54c1f..9189eeb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -81,7 +81,10 @@ static void sync_fs_one_sb(struct super_block *sb, void *arg)
static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
{
- filemap_fdatawrite(bdev->bd_inode->i_mapping);
+ mutex_lock(&bdev->bd_mutex);
+ if (bdev->bd_disk)
+ filemap_fdatawrite(bdev->bd_inode->i_mapping);
+ mutex_unlock(&bdev->bd_mutex);
}
static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
What I don't like about this patch is that it simply skips block devices which we don't have any open file descriptors for. That seems wrong to me because sync() should do writeback on (and wait for) all devices, not just the ones that we happen to have an open file descriptor for. Imagine if we opened a device, wrote a lot of data to it, closed it, called sync(), and sync() returned. Now we should be guaranteed the data was written, but I'm not sure we are in this case.
Another slightly ugly thing is that we're now holding a new mutex over a potentially big chunk of code (everything that happens inside filemap_fdatawrite()).
I'm not sure I can do much better in terms of a small patch at the moment, so I will submit this to the linux-block mailing list with a few relevant people on Cc (Jens Axboe for being the block maintainer, Tejun Heo for having written a lot of the code involved according to git blame, Jan Kara for writing iterate_bdevs(), and Al Viro for probably knowing both the block layer and VFS quite well).
I submitted my patch here: lkml.org thread
Rabin Vincent answered pretty quickly that he already sent a fix for the very same issue. Oh well, at least his patch is quite close to what I came up with and I learned quite a few new things about the kernel.
Tejun Heo also responded that a better fix would probably be to prevent the disk from going away by getting a reference to it. I tried a couple of different patches without much luck. The most recent patch from me in that thread seemed to prevent the crash, but as I only realised a few minutes after sending it: we're decrementing the reference count without doing anything when it reaches 0! Of course we don't get a NULL pointer dereference if we never do the cleanup/freeing in the first place…
If you liked this post and you enjoy fixing bugs like this one, you may enjoy working with us in the Ksplice group at Oracle. Ping me at my Oracle email address :-)