Tuesday, October 21, 2008

Another improvement for k-nucleotide

I had been thinking about k-nucleotide, one of the benchmarks in the Computer Language Shootout. Finally I couldn't stand it anymore and I had to see if I could make it go faster.

First I tried replacing the dictionary with a finite state model. Nice idea, and nice code too. But, alas, it was quite a bit slower, because now basically all the objects created survived scavenges and got dumped into old space. Before, a scavenge would simply discard the unneeded strings. Also, the state objects hold oop pointers while byte strings hold plain bytes, so strings pack more data into new space. Grrr...
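To make the comparison concrete, here is a minimal sketch of the finite-state (trie) idea in Python rather than the Smalltalk actually used; the `Node` layout and names are my own. Each state is a pointer-holding object, one per k-mer prefix, which is exactly the kind of allocation pattern that fills new space with objects that survive scavenges:

```python
class Node:
    """One state per k-mer prefix; children indexed by nucleotide.

    Every node is a pointer-heavy heap object, unlike the flat byte
    strings that the dictionary approach hashes directly.
    """
    __slots__ = ('children', 'count')

    def __init__(self):
        self.children = {}   # nucleotide -> Node
        self.count = 0

def fsm_count(seq, k):
    """Build the state machine by walking every length-k window of seq."""
    root = Node()
    for i in range(len(seq) - k + 1):
        node = root
        for ch in seq[i:i + k]:
            node = node.children.setdefault(ch, Node())
        node.count += 1
    return root

def lookup(root, kmer):
    """Follow the states for kmer and return its frequency."""
    node = root
    for ch in kmer:
        node = node.children[ch]
    return node.count
```

For example, `lookup(fsm_count("GGTATTTTAATT", 2), "TT")` returns 4, since "TT" occurs four times in that sequence.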

Then I thought perhaps I could cause less GC activity. I did that by shamelessly reusing a buffer for the frequency counter. Then I looked at the time profiler and squished all the hashed collection growth away by presizing the hashed collections properly.
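I don't have the Smalltalk code at hand, but the two tricks can be sketched together in Python. Here presizing is taken to its extreme: since a k-mer over {A, C, G, T} encodes as an integer below 4^k, the counter can be a flat table that never grows or rehashes, and a caller can hand back the same buffer to avoid reallocating it:

```python
def count_kmers(seq, k, counts=None):
    """Count k-mers in a table presized to 4**k slots.

    Encoding each k-mer as an integer in [0, 4**k) doubles as a perfect
    hash: no collisions, no growth, no rehashing. Passing `counts` back
    in reuses the same buffer across runs instead of allocating anew.
    """
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    if counts is None:
        counts = [0] * (4 ** k)          # presized: never resized
    else:
        for i in range(len(counts)):     # reuse the caller's buffer
            counts[i] = 0
    for i in range(len(seq) - k + 1):
        idx = 0
        for ch in seq[i:i + k]:
            idx = idx * 4 + code[ch]
        counts[idx] += 1
    return counts
```

For k = 2, "TT" encodes as 3*4 + 3 = 15, so `count_kmers("GGTATTTTAATT", 2)[15]` is 4.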

Finally, I also checked how VisualWorks' string hash function performs on DNA sequences. The behavior was very good indeed: bucketing the hash values mod p gave chi-squared values right where they should be, and they stayed there as the sample size increased.
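I don't know the exact procedure Andres used, but a chi-squared check of a string hash can be sketched like this in Python, with FNV-1a standing in for VisualWorks' hash function (the prime p, sample size, and k-mer length below are arbitrary choices of mine). For a well-behaved hash, the statistic over p buckets should hover near its expectation of p - 1:

```python
import random

def fnv1a(s):
    """FNV-1a 32-bit hash; a stand-in for the string hash under test."""
    h = 0x811c9dc5
    for byte in s.encode('ascii'):
        h ^= byte
        h = (h * 0x01000193) & 0xffffffff
    return h

def chi_squared_mod_p(samples, p, hash_fn=fnv1a):
    """Chi-squared statistic for hash values bucketed mod a prime p.

    Under the null hypothesis of a uniform hash, the statistic follows
    a chi-squared distribution with p - 1 degrees of freedom, so it
    should stay near p - 1 as the number of samples grows.
    """
    buckets = [0] * p
    for s in samples:
        buckets[hash_fn(s) % p] += 1
    expected = len(samples) / p
    return sum((b - expected) ** 2 / expected for b in buckets)

random.seed(42)  # fixed seed so the check is repeatable
kmers = [''.join(random.choice('ACGT') for _ in range(12))
         for _ in range(20000)]
stat = chi_squared_mod_p(kmers, 101)
```

With p = 101 buckets, a statistic wildly above ~100 would suggest the hash clusters on DNA-like inputs; values near 100 indicate good mixing.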

The bottom line is that the new k-nucleotide program runs 38% faster than the old one. Take that! :)

Update: alas, it only improved by 10% with the official load...


Anonymous said...

Andres, if the GC is increasing the time of any of the benchmarks, why not increase the size of new space? If the space is big enough, only a few objects will survive a scavenge.

Joachim Geidel said...

If early tenuring was the problem with the finite state machine approach: did you try enlarging Eden and SurvivorSpace to avoid it? I regularly set these to 20 times the default sizes without negative effects.

Andres said...

It would be great to do that, but I do not control the environment in which the benchmarks run, so I opted to decrease GC activity instead.