Sunday, July 31, 2005

On hash

I think I should write this piece at least once.

There is a notion of equality between two entities in many languages. In C, for instance, one compares integers using the == operator.

The degree of freedom to compare two things is much greater in Smalltalk, since at least in principle you can compare any two objects you want. The method #= serves this purpose. Sometimes you may need to make sure that two names reference exactly the same object, and then #== serves that purpose.
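For instance, here is a small workspace snippet (expected results shown as comments):

  | a b |
  a := 'hello' copy.
  b := 'hello' copy.
  a = b.      "true: the two strings have the same characters"
  a == b.     "false: they are two distinct objects in memory"
  a == a.     "true: both names reference exactly the same object"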

As far as #= goes, consider then what happens when you choose one object. By means of #=, you have implicitly chosen all objects equivalent to that object. In this way, the object you chose is the representative of all the objects which are equal to the one you chose according to #=.

Using basic algebra, you can see that #= is an equivalence relation, and that #= partitions the space of all objects into equivalence classes, each of them being a collection of objects that are equal to each other according to #=. All equivalence classes are disjoint, because as soon as two of them have a non-empty intersection, the transitivity of #= merges them completely.

Consider now a Set. Sets use the method #= to partition objects into equivalence classes. They will hold at most one representative object for each equivalence class. Hence their well known characteristic: they will hold only one occurrence of each object.

This is because you will have A = A. Suppose the set S already contains A. When you try to add A to S again, it will find that it already has a representative of the equivalence class of A. And it will know this because it will find, among its contents, one object equal to A --- in this case, A itself.
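A quick workspace check of this behavior:

  | s |
  s := Set new.
  s add: 'abc' copy.
  s add: 'abc' copy.    "equal to the first string, so nothing new is stored"
  s size.               "1: the set keeps a single representative"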

Think of the implementation details, though. If you have a large set, searching for representatives in a linear fashion would take a long time, and it would not take advantage of the fact that no two objects in the set are equal to each other.

Sets usually implement object storage in terms of an array. Let's say you had a way to give a more or less unique name to each equivalence class. You could easily map those names into integers, and then store each representative in the storage array at the position dictated by those integers.

Sets obtain such integers from objects by sending them #hash.
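As a very rough sketch of the idea (this is not the actual Set implementation, just an illustration of how a hash can be turned into a storage position):

  | storage object index |
  storage := Array new: 32.
  object := 'some key'.
  index := object hash \\ storage size + 1.    "Smalltalk arrays are 1-based"
  storage at: index put: object.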

Note that since Dictionaries keep an implicit set of their keys, they also depend on #hash.
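The same thing can be observed with equal keys in a Dictionary:

  | d |
  d := Dictionary new.
  d at: 'key' copy put: 1.
  d at: 'key' copy put: 2.    "an equal key: the existing entry is reused"
  d size.                     "1"
  d at: 'key'.                "2: the value was overwritten, not added"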

Let's examine the properties of this hash function that takes an equivalence class, gives it a name, and answers a more or less unique integer.

First of all, why "more or less" unique? Because even with simple objects, you can have a huge number of equivalence classes. This amount can be so large that it frequently makes the number of particles in the Universe, roughly 2^266, look insignificant.

Consider, for example, the set of all strings of 256 8-bit characters. There are 2^2048 equivalence classes because there are 2^2048 possible strings. Chances are you won't need (and certainly won't be able to store) all possible strings at the same time.

A compromise solution is usually taken: equivalence class names will be mapped to integers in the positive SmallInteger range. This provides room for something on the order of 2^30 possible distinct names. Another advantage is that this limit also keeps arithmetic in the SmallInteger range, where it is fastest.
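You can check the actual limit in a workspace; the exact value depends on the dialect and word size, but in a 32-bit Squeak image, for example:

  SmallInteger maxVal.     "1073741823"
  (2 raisedTo: 30) - 1.    "the same value"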

Let's examine a straightforward way to implement #hash given an implementation of #=. Typically, #= will compare the values of some instance variables. These are the ones that determine the equivalence class the object belongs to. Since most of the time we can delegate #hash to those instance variables, all we need to do is to combine their hashes.

At first we may be tempted to sum all the hashes. But that may overflow into LargeInteger arithmetic, and fixing that requires ifTrue:ifFalse: checks that are usually added after the damage is done, so addition is not a good idea. The logical operations #bitOr: and #bitAnd: tend to just turn all bits on or off, respectively, so neither is a good idea either.

The logical operation #bitXor:, however, does not have a tendency to either all ones or all zeros. Using it to mix hash values is fine, with an important assumption: the values hashed should already be in the SmallInteger range.
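A small example makes the difference visible:

  2r1010 bitOr: 2r0110.     "2r1110, that is 14: ones accumulate"
  2r1010 bitAnd: 2r0110.    "2r0010, that is 2: ones drain away"
  2r1010 bitXor: 2r0110.    "2r1100, that is 12: roughly half the bits stay on"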

The assumption about the SmallInteger range is needed because #bitXor: produces a result of the same size as its arguments. For example, if it were used to hash a byte array by combining the hash values of the bytes, the result would be an integer between 0 and 255 --- too constraining. These situations will be discussed in more detail in a separate post.

So in situations where you can obtain SmallInteger range hash values from the instance variables, #hash can be implemented by mixing those hashes with #bitXor:. For example, if you had 3 instance variables, you could let #hash be implemented in terms of ^(iVar1 hash bitXor: iVar2 hash) bitXor: iVar3 hash.
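For instance, a hypothetical class with those three instance variables (and accessors for them) could implement the pair as follows; this is a sketch only, assuming #= looks at exactly those variables:

  = anObject
      "Equal when the other object is of the same class and
      its instance variables are equal to ours."
      self class == anObject class ifFalse: [^false].
      ^iVar1 = anObject iVar1
          and: [iVar2 = anObject iVar2
          and: [iVar3 = anObject iVar3]]

  hash
      "Mix the hashes of exactly the same variables that #= compares."
      ^(iVar1 hash bitXor: iVar2 hash) bitXor: iVar3 hash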

The hash functions that map an equivalence class name to an integer must also comply with this requirement:

x = y => x hash = y hash

This is clear now, because if x = y that means they belong to the same equivalence class. This equivalence class can only have one name, and the hash function will obviously map it to a single, unique integer.
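For instance:

  | x y |
  x := 'hello' copy.
  y := 'hello' copy.
  x = y.              "true"
  x hash = y hash.    "must also be true, or Sets and Dictionaries will misbehave"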

Note that nothing can be said when x ~= y. In that case, you may even have x hash = y hash. This happens because of the size restriction applied to the integers given by #hash.

When x ~= y but x hash = y hash, the hash values of x and y are said to collide. As far as Sets are concerned, this may cause a slowdown because they will have to resort to storing representatives in array positions reserved for other representatives. This will still be faster than linear search in the vast majority of cases, but it should be avoided.

A hash function is good when it produces few collisions. Few collisions are good because they allow sets to be fast.

However, we will be constraining hashes to the SmallInteger range. So when #= partitions objects into a LargeInteger number of equivalence classes, we will have collisions for sure. How can we obtain a good hash function, then?

The key observation is that we will usually deal with far fewer objects than the maximum SmallInteger value. In practice, this means it will be pretty common to hash things that are similar in structure. Names of up to 32 characters, for example, are an extremely tiny fraction of all possible strings of up to 32 characters. So, in general, hash functions will have to be good even when hashing things that have a very constrained shape.

The arguments above lead to our first rule of thumb: #hash should produce different results when the objects hashed are similar.

This rule of thumb has a powerful implication. In the absence of any knowledge about the structure and characteristics of the objects that will be hashed in practice, there is no other choice than to hash everything that is referenced in the #= method.

More concretely, the hash of collections is sometimes implemented in terms of hashing a few selected elements, typically the first one, the last one, and a few from around the middle. This may seem clever; however, note that it violates our rule of thumb. And because of this, hash functions implemented in this manner will produce lots of collisions when the inputs tend to be similar at the indices used to calculate the hash.
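To make this concrete, here is a small workspace sketch (the cherry-picking block below is hypothetical, not an actual library method):

  | cherryHash a b |
  cherryHash := [:collection |
      (collection first hash
          bitXor: (collection at: collection size + 1 // 2) hash)
          bitXor: collection last hash].
  a := #(1 2 3 4 5).
  b := #(1 9 3 9 5).    "differs from a only at positions the hash never looks at"
  (cherryHash value: a) = (cherryHash value: b).    "true: a collision"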

Such collisions are more common than they may seem. For example, you could consider a LargeInteger as a collection of bytes. Thus, the hash of a LargeInteger could be the hash of a few selected bytes. This was the case in Squeak Smalltalk, and it was just fine for a long time. But when Magma, an object database written in Squeak, used LargeIntegers as pointers, it so happened that the pointers were usually equal at all the cherry-picked locations. Hence, the hash of most pointers was the same.

This problem could have been fixed by implementing #magmaHash and subclassing Set (or Dictionary, which keeps a set of keys using #hash). But then things get uncomfortable. Because if #magmaHash is better than #hash, then why shouldn't #hash be replaced? And if this is the case, then it would have been preferable to let #hash go through all the bytes to begin with.

This raises the issue of performance when hashing large collections. The solution is to implement #applicationHash using knowledge about what is being hashed, knowledge that is available in the context of the application in question. In fact, such a hash could be even more efficient than a generic, cherry-picking hash --- it could be the case that even less cherry-picking is enough.

Since #applicationHash would be written with specific knowledge of the application in question, this solution is significantly different from, and better in quality than, implementing a cherry-picking #hash and forcing everybody else to write a better, all-encompassing #hash when excessive collisions become evident within their application. In every case, #hash would produce quality results, keeping the performance of the collections that depend on it close to optimal.

In short, here is our second rule of thumb: hash the whole collection, unless doing so causes performance issues. Only when a profiler shows this to be an issue should you resort to application-specific knowledge to implement an #applicationHash in terms of fewer, significant elements of the collections in question. You should make sure that this new hash is still good in terms of collisions.
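For example, in a Magma-like situation, something along these lines could be written for the pointer objects; this is a hypothetical sketch, assuming an instance variable named bytes holds the pointer's bytes and that the first few bytes are the ones known to vary in the application:

  applicationHash
      "Mix only the few bytes that actually vary in this application."
      ^((bytes at: 1) hash bitXor: (bytes at: 2) hash)
          bitXor: (bytes at: 3) hash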

With this in mind, note that some Smalltalks already come with pluggable-hash collections, in which the implementation of #hash used by the collection can be provided by a block. Depending on the Smalltalk you are using, you may even be able to skip subclassing Set or Dictionary so they use the application specific hash method.
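In Squeak, for example, something along these lines should work; this is a sketch that assumes the PluggableSet class and its hashBlock:/equalBlock: setters, so check the exact protocol in your dialect:

  | set |
  set := PluggableSet new.
  set hashBlock: [:element | element applicationHash].    "use the application specific hash"
  set equalBlock: [:a :b | a = b].                         "keep the usual equality"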

This leaves the implementation of #hash for collections that contain objects whose hashes are heavily constrained. Examples of this are byte arrays, strings, and integers. I will address these in a separate post.
