Friday, August 29, 2008

REP STOS -- part 2

I finally did it.

I fixed the problem with single-stepping REP STOS (and MOVS) instructions on the P4. (Look in the blog archive to find the original post.)

At first, I wanted to emulate the instructions completely. But it wasn't really that easy. My naïve implementation did support different register/data widths, but I soon hit some real show-stoppers. It turned out that the kernel will use REP STOS or MOVS on strings which may not be naturally aligned. This means that memory accesses may cross a page boundary, which is a big problem. Now we suddenly have to look for page boundaries, and in the case of REP MOVS, we must watch out for both the ESI and EDI registers.

The reason is of course that pages are tracked one by one, and if we cross over to a new page, we have to make sure that it is also marked present (P flag) in the page tables before touching it. Two consecutive pages are also not guaranteed to have consecutive shadow-memory pages (in fact, that's very unlikely), so we should also take care of changing shadow-memory pointers upon hitting a page boundary. But still, that is not so easy when we have a single write that can cross the boundary.

I thought about this for a long while and I came up with a scheme that would work: We can allow a REP MOVS/STOS run to cross at most one page boundary. Then we only need to keep track of at most two pages at the same time (or four pages for REP MOVS), and if we hit this limit, then we can return to the code and simply wait for a new page fault to restart where we left off.

But still it was not quite that easy. Emulating the memory accesses in effect means that we do them from the page fault handler. The kernel got through the boot sequence and a bit into userspace. But then it would BUG on a test that wanted irqs to be enabled. After a bit of debugging, I found (to my great horror) that copy_to_user() was using REP MOVS to move data from the kernel into userspace. This is of course not safe, when we consider that the userspace page might not even be present in the page tables. Because now we were making a write from the kernel (in fact, from the page fault handler itself) to userspace, causing a new page fault, and then calling into various memory subsystem functions to allocate a new page for userspace. All of this with interrupts disabled, which is forbidden. (I believe that this is also the reason why copy_to_user() cannot be used in atomic contexts. Who can confirm?)

We absolutely cannot emulate the instruction by doing the write from the page fault handler. So what can we do?

Well, all is not lost. I am rather proud of my little hack, too. This is what I wrote (in my great excitement) to Ingo Molnar:
Instead of emulating the _whole_ REP MOVS/STOS, we only emulate the REP part. That is, on #PF, we increment %eip by one, which means that when the #PF returns, it will execute just a normal MOVS/STOS instruction (and give us the #DB straight afterwards). Now, in the #DB, we check the flag that says "was this really a REP instruction?" and if it was, we start counting down %ecx and rewinding %eip each time until %ecx is 0. Each time we return to the original instruction and let the CPU execute it natively. When %ecx is 0, we turn off single-stepping and hide the pages again.
The great thing is that it actually works. With this (rather small) patch, I am able to boot my P4 and get exactly the same error reports that I get on my Pentium Dual-Core laptop.

This also prompted me to start fixing the numerous false positive warnings that occur because kmemcheck is reporting eagerly. Most prominent of these are the bitfield operations, which load a multiple of 8 bits at a time, even though just one of the bits is actually used. The solution is to explicitly initialize the whole bitfield at once just after the struct has been allocated. Yes, this means that we won't be able to detect errors in the use of uninitialized bitfields, but this has always been true and is a result of the combination of the x86 architecture and the way we detect memory accesses.

And this is the current status: 0 errors reported during kernel initialization, 3 errors reported during userspace initialization, 0 errors while transferring a 4 MiB bzImage over SSH. Compare that to the ~2400 errors the same workload would have given us a month ago. This time I believe that we are truly ready for mainline. It will be interesting to see if anybody will try it out or review the patches, however...