Friday, October 12, 2012

Duff's device implementation details

Typically one writes a Duff's device to copy data like it's shown in the Wikipedia, i.e. using *to++ = *from++ for each step.  Most likely a compiler dealing with *to++ = *from++ will emit 4 instructions: a load, a store, and two additions to pointers likely stored in registers.  But, for example, if you have an 8 case device, you can arrange things so that the pointers are incremented with the slack iterations before the switch and then do

  • *(to-8) = *(from-8);
  • *(to-7) = *(from-7);
  • ...
  • *(to-1) = *(from-1);
  • if not done, increment to and from by 8 and go to the top;
With the explicit offsets above, each step requires just a load and store with fixed offsets that a reasonable CPU with instructions such as mov rax, [rsi+offset] calculates on the fly.  The resulting Duff's device is now roughly half the size in assembler instructions.


haupz said...

... or use vector intrinsics:

Andrés said...

... the page begins with "on some targets"...