Wednesday, June 07, 2006

Hash vs Binary Search

Ah yes, I just couldn't help it. Take a look at this article. The last paragraph essentially states that the performance advantage of hashing over binary search disappears as the data set grows.

This, even though the complexity of hash lookup is O(1), while that of binary search is O(log n). The article finishes by going so far as to say that this performance loss "is a good thing"... ???...

I'm sorry, but there is no way that can possibly be true. So, to see what the deal is with that kind of assertion, I decided to run some tests myself.

I set up an array with integers from 1 to 10 million. I used binary search to look up the first million integers in this array. It took 22+ seconds*.

But of course, looking up 7 million integers via a hashed set holding those same 10 million integers took just 3+ seconds. In other words, per lookup, hashing was roughly 50 times faster.
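
For anyone who wants to try this at home, here is a quick sketch of that kind of measurement in Python. This is just an approximation of the setup described above, not the exact code I ran, and the sizes are scaled down tenfold so it fits in modest memory:

    import bisect
    import time

    N = 1_000_000                   # scaled down from the 10 million above
    data = list(range(1, N + 1))    # sorted array of consecutive integers
    hashed = set(data)              # built-in set, i.e. a hash table

    # Binary-search a tenth of the keys (mirrors 1M lookups out of 10M).
    start = time.perf_counter()
    for key in range(1, N // 10 + 1):
        assert data[bisect.bisect_left(data, key)] == key
    t_binary = time.perf_counter() - start

    # Hash-look-up seven tenths of the keys (mirrors 7M lookups out of 10M).
    start = time.perf_counter()
    for key in range(1, 7 * N // 10 + 1):
        assert key in hashed
    t_hash = time.perf_counter() - start

    print(f"binary search: {t_binary:.2f}s for {N // 10} lookups")
    print(f"hash lookup:   {t_hash:.2f}s for {7 * N // 10} lookups")
    print(f"per-lookup ratio: {t_binary / (N // 10) / (t_hash / (7 * N // 10)):.1f}x")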

If O(1) cannot beat O(log n), it is anything but a "good thing". In fact, chances are it is because something is busted in the hash table department!!!

* My computer is far from state of the art.

4 comments:

Anonymous said...

I don't think you understood that article.

Andrés said...

Anonymous, you have to state why you say I didn't understand the article...

Anonymous said...

Agreed, I feel you have an incorrect understanding of the article.
For a large dataset, a hashtable needs some way to handle collisions, and the common approach is chaining (although a tree could be used instead of a chain, making it uber).
So the hashtable would have to do a lot more comparisons before finding a result.
Binary search does relatively fewer comparisons for the same memory usage.
Assuming we give both approaches equal memory, binary search starts winning once the data grows past a certain size.
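
Roughly, chaining looks like this (a minimal sketch in Python; the bucket count and hash function are arbitrary choices):

    # A chained hash table: each bucket holds a list (chain) of keys, and a
    # lookup compares against the chain's entries until it finds a match.
    class ChainedSet:
        def __init__(self, num_buckets=1024):
            self.buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def add(self, key):
            chain = self._bucket(key)
            if key not in chain:        # linear scan of the chain
                chain.append(key)

        def contains(self, key):
            # Every key that hashed to this bucket may have to be compared.
            return key in self._bucket(key)

With a fixed number of buckets and millions of keys, every lookup ends up scanning a long chain, and that is where the extra comparisons come from.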

Andrés said...

Hello,

> For a large dataset, a hashtable needs some way to handle collisions, and the common approach is chaining (although a tree could be used instead of a chain, making it uber).

Well, right there the assumptions start creeping in. Chaining is just _one_ way of doing that. Much better is open addressing with linear probing. You just need a good hash function, even a semi-expensive one, because each cache line miss will cost you on the order of 200 cycles.

> So the hashtable would have to do a lot more comparisons before finding a result.

Yeah, with _that_ type of collision resolution, which isn't the only one.

> Assuming we give both approaches equal memory, binary search starts winning once the data grows past a certain size.

Just use another type of collision resolution mechanism, like bucketing or open addressing.
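
To make the open addressing option concrete, here is a minimal sketch of a linear probing set in Python. The initial capacity, power-of-two sizing and 1/2 load factor cap are arbitrary choices of mine, not anything from the article:

    # Open addressing with linear probing: a colliding key goes into the next
    # free slot, so a lookup scans consecutive slots. Consecutive slots tend
    # to share cache lines, which is the point about cache misses above.
    class LinearProbingSet:
        def __init__(self, capacity=16):
            self.slots = [None] * capacity    # capacity stays a power of two
            self.count = 0

        def _index(self, key):
            return hash(key) & (len(self.slots) - 1)

        def add(self, key):
            if (self.count + 1) * 2 > len(self.slots):
                self._grow()                  # keep the load factor <= 1/2
            i = self._index(key)
            while self.slots[i] is not None:
                if self.slots[i] == key:
                    return                    # already present
                i = (i + 1) % len(self.slots)
            self.slots[i] = key
            self.count += 1

        def contains(self, key):
            i = self._index(key)
            while self.slots[i] is not None:
                if self.slots[i] == key:
                    return True
                i = (i + 1) % len(self.slots)
            return False                      # hit an empty slot: not present

        def _grow(self):
            old = [k for k in self.slots if k is not None]
            self.slots = [None] * (len(self.slots) * 2)
            self.count = 0
            for k in old:
                self.add(k)

Note how the probe sequence walks consecutive slots, so even when there are collisions, the extra comparisons tend to stay within a few cache lines instead of chasing chain pointers all over memory.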