Thursday, November 17, 2011

Due diligence continues to pay off

At work, we recently fixed a number of instances of memcpy() that should have been memmove() because the objects involved overlapped (thus violating, e.g., POSIX and C99). One particular instance of memcpy() had been wrong since at least 1990, only to be exposed by relatively recent versions of glibc. The fact that wrong code went undetected for at least 21 years illustrates how easy it is for programs to merely appear to work and, thus, how incredibly important it is to always pay attention to the relevant specifications.
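
To illustrate the kind of problem involved (a made-up minimal example, not the code we actually fixed): shifting a buffer's contents in place makes the source and destination overlap, which memcpy() is not required to handle, while memmove() is.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "abcdef";

    /* Shift the tail of the buffer one position to the left. The source
       and destination regions overlap, so memcpy(buf, buf + 1, 5) would
       be undefined behavior here; memmove() is specified to handle it. */
    memmove(buf, buf + 1, 5);

    printf("%s\n", buf);   /* prints "bcdeff" */
    return 0;
}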

Speaking of paying attention, we also found that we sometimes use a Duff device instead of memcpy() / memmove() / memset() because of an interface impedance mismatch: the C library functions work with bytes, and we need to copy, move or set pointer-sized values. Alas, our Duff device is currently written in a way that requires far too many assembler instructions. So now I have a new Duff device prototype that does the same work in about half the instructions on x86, Power and SPARC. Preliminary tests show a measurable performance improvement, visible in both micro and macro benchmarks.
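
For readers who have not run into the construct, here is a minimal sketch of a Duff device that copies pointer-sized words. It only illustrates the technique, with a made-up function name; it is neither our existing implementation nor the new prototype.

#include <stddef.h>

/* Copy n pointer-sized words with an 8-way unrolled Duff device. */
static void word_copy(void **dst, void **src, size_t n)
{
    size_t rounds;

    if (n == 0)
        return;

    rounds = (n + 7) / 8;          /* passes through the unrolled body */

    switch (n % 8) {               /* jump into the middle of the loop */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}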

Moving along...

7 comments:

Martin McClure said...

Have you tested performance after removing the Duff's devices altogether? Modern CPUs have much better branch prediction than earlier ones did, which can make loop unrolling counterproductive. See, for instance, http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html

Andrés said...

I did. At least with some compilers, when the arguments to memcpy() / memmove() are not known at compile time, the compiler does not necessarily inline them, so you get an extra call to some RTL function.

Moreover, when we looked at the implementation of e.g. memcpy() on Linux, we saw that there are some checks at the beginning to determine how large the copy is, because if it is large then Linux tries to "copy" the data by remapping pages instead of actually copying it. But we know the vast majority of our Duff device uses involve much less than a page, so we think that test and jump are roughly equivalent to the test and jump at the head of the Duff device. After that, the Duff device effectively hardcodes the data movement instructions, so memcpy() is very unlikely to win on a per-instruction basis.

We should also note that we did remove a few uses of the Duff device in favor of memmove(). For our purposes, the Duff device works best when we know the copy is very likely to be small but variable in size, and when we can make some assumptions about the memory locations involved. Finally, we also noted that at least some implementations of memcpy() could not beat the new Duff device even when copying largish chunks of memory.

Andrés said...

Oh, regarding your email... yeah, they removed half a megabyte's worth of binary executable by getting rid of the Duff devices. That's a lot of executable!

In our case, removing them completely would save on the order of 5 KB. And, in fact, since the new one is about half the size, it's already cheaper than before. This also implies we don't use Duff devices very often to begin with.

Martin McClure said...

Interesting info, but I didn't actually mean "replace the Duff's devices with memcpy"; I meant "have you tried replacing the Duff's devices with a simple hand-coded non-unrolled loop?"

Andrés said...

Hmmm... for the purposes we're using Duff devices for, I think I'd shy away from non-unrolled loops because at one point we noticed a few percent penalty for using those instead of memcpy()...

It's been a while, so I'd have to look at it again. I'll see what I find after I am done with a rewritten scavenger I'm close to finishing off. Something I do remember is that it's a good idea to pay attention to the way you write the loop. Optimizing C compilers don't always realize what you're doing, so something like

for (i=1; i <= 8; i++) *d++ = *s++;

can be more onerous than

l = d + 8; while (d < l) *d++ = *s++;

because in the second loop you only need the comparison against l, while in the first one you also have to keep the running counter i. Details, details...
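
To make that concrete, full (made-up) versions of the two loop shapes, copying n pointer-sized words, would look roughly like this:

#include <stddef.h>

/* Counter-based loop: the compiler has to keep the counter i live in
   addition to the two pointers. */
static void copy_with_counter(void **d, void **s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        *d++ = *s++;
}

/* End-pointer loop: the condition is just a pointer comparison against
   the precomputed limit l, with no separate counter to maintain. */
static void copy_with_limit(void **d, void **s, size_t n)
{
    void **l = d + n;

    while (d < l)
        *d++ = *s++;
}

Whether a given compiler actually produces better code for the second form depends on the compiler and the target, of course.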

Martin McClure said...

Those details do indeed matter. I'll be interested to hear what you find when you try it -- I've seen explicit advice to "avoid loop unrolling", especially for Intel's Sandy Bridge microarchitecture, since it has a micro-op cache that you want to avoid thrashing. Even on older (but still modern) processors, the additional cost of the test and branch at each iteration can be very small, in some cases zero.

Andrés said...

Something that also matters, and that I'd like to point out, is that not everyone uses x86. I haven't had a chance to run those tests yet; I'll write another post when I do.