Sunday, July 31, 2005

On hash

I think I should write this piece at least once.

There is a notion of equality between two entities in many languages. In C, for instance, one compares integers using the == operator.

The degree of freedom to compare two things is much greater in Smalltalk, since at least in principle you can compare any two objects you want. The method #= serves this purpose. Sometimes you may need to make sure that two names reference exactly the same object, and then #== serves that purpose.
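A quick workspace check (assuming ordinary String and SmallInteger behavior) shows the difference:

    'abc' copy = 'abc' copy.     "true: same contents, so #= answers true"
    'abc' copy == 'abc' copy.    "false: two distinct objects"
    7 = 7.                       "true"
    7 == 7.                      "true: equal SmallIntegers are also identical"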

As far as #= goes, consider what happens when you choose one object. By means of #=, you have implicitly chosen all objects equivalent to it. In this way, the object you chose becomes a representative of all the objects that are equal to it according to #=.

Using basic algebra, you can see that #= is an equivalence relation, and that it partitions the space of all objects into equivalence classes, each of them a collection of objects that are equal to each other according to #=. All equivalence classes are disjoint: if two of them had a non-empty intersection, the transitivity of #= would merge them into one.

Consider now a Set. Sets use #= to partition objects into equivalence classes, and they hold at most one representative object for each equivalence class. Hence their well-known characteristic: they hold only one occurrence of each object.

This is because A = A. Suppose the set S already contains A. When you try to add A to S again, S will find that it already has a representative of the equivalence class of A. It knows this because it finds, among its contents, one object equal to A --- in this case, A itself.
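A minimal workspace illustration, using plain Strings:

    | s |
    s := Set new.
    s add: 'abc' copy.
    s add: 'abc' copy.
    s size    "1: the second copy is #= to the first, so the set keeps just one representative"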

Think of the implementation details, though. In a large set, searching for representatives in a linear fashion would take a long time, and it would fail to take advantage of the fact that no two objects in the set are equal to each other.

Sets usually implement object storage in terms of an array. Let's say you had a way to give a more or less unique name to each equivalence class. You could easily map those names into integers, and then store each representative in the storage array at the position dictated by those integers.

Sets obtain such integers from objects by sending them #hash.
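As a rough sketch of the idea (not the actual implementation of any particular Set), the hash is folded into the bounds of the storage array to pick a slot:

    | storage slot |
    storage := Array new: 32.
    slot := ('hello' hash \\ storage size) + 1.
    storage at: slot put: 'hello'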

Note that since Dictionaries keep an implicit set of their keys, they also depend on #hash.

Let's examine the properties of this hash function that takes an equivalence class, gives it a name, and answers a more or less unique integer.

First of all, why "more or less" unique? Because even with simple objects, you can have a huge number of equivalence classes. This amount can be so large that it frequently makes the number of particles in the Universe, roughly 2^266, look insignificant.

Consider, for example, the set of all byte strings of length 256. There are 2^2048 possible such strings, hence 2^2048 equivalence classes. Chances are you won't need (and certainly won't be able to store) all possible strings at the same time.

A compromise solution is usually taken: equivalence class names will be mapped to integers in the positive SmallInteger range. This provides room for something on the order of 2^30 possible distinct names. Another advantage is that this limit also keeps arithmetic in the SmallInteger range, where it is fastest.
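For instance, in a typical 32-bit Squeak image:

    SmallInteger maxVal    "1073741823, that is, 2^30 - 1 (the exact value varies by dialect and word size)"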

Let's examine a straightforward way to implement #hash given an implementation of #=. Typically, #= will compare the values of some instance variables. These are the ones that determine the equivalence class the object belongs to. Since most of the time we can delegate #hash to those instance variables, all we need to do is to combine their hashes.

At first we may be tempted to sum all the hashes. But the sum may overflow into LargeInteger arithmetic, and since fixing that requires an ifTrue:ifFalse: check, usually added after the damage is done, addition is not a good idea. The logical operations #bitOr: and #bitAnd: tend to turn all bits on or off, respectively, so they are not good choices either.

The logical operation #bitXor:, however, does not have a tendency to either all ones or all zeros. Using it to mix hash values is fine, with an important assumption: the values hashed should already be in the SmallInteger range.

This is because #bitXor: outputs a result of the same size as its arguments. For example, if it were used to hash a byte array by combining the hash values of the bytes, the result would be an integer between 0 and 255 --- too constraining. These situations will be discussed in more detail in a separate post.

So in situations where you can obtain SmallInteger range hash values from the instance variables, #hash can be implemented by mixing those hashes with #bitXor:. For example, if you had 3 instance variables, you could let #hash be implemented in terms of ^(iVar1 hash bitXor: iVar2 hash) bitXor: iVar3 hash.
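As a hedged sketch (SomeClass, its instance variables and their accessors are hypothetical), the point is that #hash mixes exactly the instance variables that #= compares:

    SomeClass>>= anObject
      ^self class == anObject class
        and: [iVar1 = anObject iVar1
          and: [iVar2 = anObject iVar2
            and: [iVar3 = anObject iVar3]]]

    SomeClass>>hash
      ^(iVar1 hash bitXor: iVar2 hash) bitXor: iVar3 hash

Note that #hash should only look at what #= looks at: mixing in a variable that #= ignores could make two equal objects answer different hashes.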

The hash functions that map an equivalence class name to an integer must also comply with this requirement:

x = y => x hash = y hash

This is clear now, because if x = y that means they belong to the same equivalence class. This equivalence class can only have one name, and the hash function will obviously map it to a single, unique integer.

Note that nothing can be said when x ~= y. In that case, you may even have x hash = y hash. This happens because of the size restriction applied to the integers given by #hash.

When x ~= y but x hash = y hash, it is said that the hash values of x and y collide. As far as Sets are concerned, this may cause a slowdown because they will have to resort to storing representatives in array positions reserved for other representatives. It is still the case that this will be faster than linear search for the vast majority of cases, but it should be avoided.

A hash function is good when it produces few collisions. Few collisions are good because they allow sets to be fast.

However, we will be constraining hashes to the SmallInteger range. So when #= partitions objects into a LargeInteger number of equivalence classes, we will have collisions for sure. How can we obtain a good hash function, then?

The key observation is that we will usually deal with far fewer objects than the maximum SmallInteger value. At the same time, it will be pretty common to hash things that are similar in structure. The names used in practice, say identifiers of up to 32 characters, are an extremely tiny fraction of all possible strings of that length. So in general, hash functions will have to be good even when hashing things that have a very constrained shape.

The arguments above lead to our first rule of thumb: #hash should produce different results when the objects hashed are similar.

This rule of thumb has a powerful implication. In the absence of any knowledge about the structure and characteristics of the objects that will be hashed in practice, there is no other choice than to hash everything that is referenced in the #= method.

More concretely, sometimes the hash of a collection is implemented in terms of hashing a few selected elements, typically the first one, the last one, and a few from around the middle. This may seem clever; however, it violates our rule of thumb. Because of this, hash functions implemented in this manner will produce lots of collisions when the inputs tend to be equal at the indices used to calculate the hash.

This is more common than it may seem. For example, you could consider a LargeInteger as a collection of bytes. Thus, the hash of a LargeInteger could be the hash of a few selected bytes. This was the case in Squeak Smalltalk, and it was just fine for a long time. But when Magma, an object database written in Squeak, used LargeIntegers as pointers, it so happened that the pointers were usually equal at all the cherry-picked locations. Hence, the hash of most pointers was the same.

This problem could have been fixed by implementing #magmaHash and subclassing Set (or Dictionary, which keeps a set of keys using #hash). But then things get uncomfortable. Because if #magmaHash is better than #hash, then why shouldn't #hash be replaced? And if this is the case, then it would have been preferable to let #hash go through all the bytes to begin with.

This raises the issue of performance when hashing large collections. The solution is to implement an #applicationHash using knowledge about what is being hashed, knowledge that is available in the context of the application in question. In fact, such a hash can even be more efficient than a generic, cherry-picking hash --- it may turn out that less cherry-picking is enough.
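For instance, here is a hedged sketch (the class name, the selector and the pointer layout are all assumptions, not Magma's actual code): if the application knows its pointers only ever differ in their low-order bits, masking those bits is both cheap and collision-friendly.

    MagmaPointer>>applicationHash
      "Assume these pointers are allocated more or less sequentially, so only
       their low-order bits vary. Masking them is cheap, spreads well, and
       keeps the result in the SmallInteger range."
      ^self asInteger bitAnd: 16r3FFFFFFF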

Since #applicationHash is written with specific knowledge of the application in question, this solution is significantly different from, and better in quality than, implementing a cherry-picking #hash and forcing everybody else to write a better, all-encompassing #hash when excessive collisions become evident within their application. This way, #hash always produces quality results, keeping the performance of the collections that depend on it close to its optimum.

In short, here is our second rule of thumb: hash the whole collection, unless doing so causes performance issues. Only when a profiler shows it is an issue should you resort to application-specific knowledge and implement an #applicationHash in terms of fewer, significant elements of the collections in question. Make sure that this new hash is still good in terms of collisions.

With this in mind, note that some Smalltalks already come with pluggable-hash collections, in which the hash function used by the collection can be provided as a block. Depending on the Smalltalk you are using, you may even be able to skip subclassing Set or Dictionary to make them use the application-specific hash.
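For example, something along these lines is possible in dialects that provide a PluggableSet-like class (selector names vary by dialect, so treat this as a sketch of the idea rather than a definitive API):

    | pointers |
    pointers := PluggableSet new
      hashBlock: [:each | each bitAnd: 16r3FFFFFFF];
      equalBlock: [:a :b | a = b];
      yourself.
    pointers add: 1000000000000000001.
    pointers add: 1000000000000000001.
    pointers size    "1"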

This leaves addressing the implementation of #hash for collections that contain objects whose hashes are heavily constrained. Examples of this would be the hashes of byte arrays, strings, and integers. I will address these in a separate post.

Saturday, July 30, 2005

Brave new world

So did you read that book by Aldous Huxley? I just read the first chapter. I don't know if I will have the patience it will take to read the whole thing. Because what Huxley is describing is happening right now under our very noses: the conditioning, the dumbing down of our culture, of what we think is valuable in life... anything, whatever it takes, as long as we like being the slaves we are.

And what about the really important issues? Just a deafening silence.

About stuff for kids

Maria Elena Walsh is one of those people that change things. She made a lot of stuff for children, and I was one of those who read her books and listened to her songs. Here are a few things she said recently.

My home was an educated, middle-class environment. People sensitive to art, reading, music. That is a huge privilege for a kid. It is as if you inherited a fortune.

Kids do not irritate me, but parents that say "give a kiss to the lady" do. Kids don't like to give kisses. But if I were with 3 children, I'd be comfortable.

Her works are really good. They are about to be republished in Argentina. Unfortunately, you'd have to know Spanish to fully enjoy them.

Quadratic formula fun

Given

  • N = m^2 + r
  • 0 < r < 2m + 1
Then for what values of j does the equation below hold?
  • j^2 + 2jm - r = k^2

Further toy refinement

If N is an odd constant (odd, as in N = 2k+1), when does this happen?

(N - 1) / 2 = i (mod 2i + 1)

Sunday, July 24, 2005

Simpler toys

Given

  • N = m^2 + r
  • 0 < r < 2m + 1
When does this happen?
  • N + i^2 = 0 (mod 2i)
  • 0 < i <= m

Saturday, July 23, 2005

GCD grinding done

By the way, I finished the gcd grinder. The basic idea of how it works fits in a single, reasonably sized method. In the end, I chunked out all the scheduling mess into a class called SchedulingSetQueue. Fantastic! Now I can do the following:

    14175 factorization refineWithOtherFactor: 105
      < ctrl-p >
    14175 = [3] * [3] * [3] * [3] * [5] * [5] * [7]

All those small factors appeared without my even knowing they were there!

News suck, man

This is not democracy, it is not civilization --- it is utter bullshit. And we do nothing about it, which is the most astonishing part.

How about the execs of a French bus company who decide to sue their passengers when they choose to carpool instead of using their service? Twice?

That's news I do not need to read. Back home, remember a corporation called SCO which started a smear campaign against Linux some years ago? Apparently, they knew it was false all along. They do not seem to have cared one bit about it.

From Wikipedia: The Corporation, a documentary exploring the psyche of the corporation, came to the conclusion that if the corporation is assessed as a person (which is how it is treated under United States law), its personality would meet all of the DSM-IV criteria for a psychopath (such as conning others for profit and recklessness).

Windows Vista --- eat, drink, and be merry

The whole point of Vista is to keep up the sales. Complaining about it being bloated, bad, not innovative, and whatever else is useless --- that criticism does not address its motivations.

In the same way there is obesity, there is consumerism. Hence, the cycle of eat+useTheToilet applies to money as well: it's buy+throwInTheGarbage.

If we ever decided to create instead of consume, then the fears so well represented by all those hate articles on the Creative Commons license would come true. Because we might stop working like slaves to have money just so that we can give it to those who have more money than us, in return for having whatever is left of our lives occupied with things we don't really need.

As if we had sold our souls to some politically correct Satan.

And just a final word on the "remix culture" argument thrown against the Creative Commons license (essentially, if whatever you do can be used by others, there's no creativity because any "creation" is a remix of something else you didn't create)...

... please explain how movies like The War of the Worlds and Herbie - Fully Loaded are not examples of the remix culture. Please explain how any latest Disney / Pixar movie isn't a previous movie remixed with different characters. Please explain how record company style music is innovative in any sense. Please explain how 100 years of copyright protection is not supporting a remix culture. Please explain how absurd patents are not supporting a remix culture.

The whole point is that in this culture, my friend, our task as the little slaves we are is to work until we drop dead so we can, as much as possible, buy+throwInTheGarbage. A measure of the frenetic pace of this activity is otherwise known as the strength of the economy.

The end result is that just a select few have the chance to stop working like a slave. But as proclaimed and shown on television: economic freedom could happen to you in this land of opportunity! (this offer is provided to you on an "as is" basis, no warranties apply especially that of fitness for a purpose, your mileage may vary, taxes not included, no social security, no medicare, no health insurance, no retirement, working like a slave may be detrimental to your health, excess garbage generation may be detrimental to the environment, to your health and to that of others. For your entertainment only. The terms of this offer may not be the sound advice of anyone or anything. Odds of winning: less than 1 in 1 million. Play responsibly).

Toys

When is it that the following happens?

(h^2 + r) / (2 (m - h)) = integer

m, r constants
m + r = 1 (mod 2)
r = h (mod 2)
0 < h < m

Monday, July 18, 2005

On code written under stress

So in my previous post I described how the stress created by having too many names in the execution context is an invitation to constrain and lock things down just so that it's possible to cope with the situation, and how it would be better to create classes instead.

After some time, I realized something else.

If code is written under such conditions, then clearly it is written from a point of view where self is bound to the developer. Think of the implications for those reading the code later...

In any case, there won't be many occurrences of self being bound to the context where self itself has been written. Rather, self would mean something along the lines of:

  • goto: aNamedRoutine draggingTheseArgumentsAlong: aCollection
  • callFunction: aFunctionName withParenthesisThrownAroundAll: aCollection separatedBy: SyntaxConstants comma
How subtle yet profound is where self is bound when you implement things in a computer...

Saturday, July 16, 2005

On the cost of change and gcd grinding (final)

I find myself stuck with an algorithm that should be easy to implement, but that has denied me the opportunity to write it in beautiful terms.

Given a collection of integers k1...kn, each greater than 1 and pair-perpendicular (pairwise coprime) to each other, I want to break them into smaller pair-perpendicular factors if a little bird tells me that there is a certain integer m such that for at least one i, gcd(ki, m) > 1.

Well sure enough one could do the easy thing saying g = gcd(ki, m) and then obviously ki = g * ki', thus you'd replace ki with g and ki'. But, observation:

What if g and ki' are not perpendicular?

This happens in the presence of squareful integers (as opposed to square-free integers). For example, it occurs when trying to grind 14175 down using 105 as a factor. First, gcd(14175, 105) = 105, thus you find that 14175 = 105 * 135. You could stop there, but obviously 105 is not perpendicular to 135.

If you want to take things further, you could say well gcd(135, 105) = 15. Now you need to refine both 135 and 105, obtaining 135 = 15 * 9 and 105 = 15 * 7. So now you have obtained new grinders, namely 7 and 9. In short, 14175 = 135 * 105 = 15 * 15 * 9 * 7.

Of course you are still not done. Because if you consider gcd(15, 9) = 3, you find another grinder and after some work you obtain 14175 = 3 * 3 * 3 * 3 * 5 * 5 * 7. Finally, you are done because those factors as a set are pair-perpendicular.
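To make the arithmetic above concrete, here is a hedged, naive sketch of the grinding fixed point alone --- it brute-forces the pair search, ignores all the scheduling concerns discussed below, and is definitely not the refineWithOtherFactor: implementation. It just keeps splitting any two distinct factors that share a common divisor until the distinct factors are pairwise coprime:

    | factors pair |
    factors := OrderedCollection withAll: #(135 105).
    [pair := nil.
     1 to: factors size do: [:i |
       i + 1 to: factors size do: [:j |
         | x y |
         x := factors at: i.
         y := factors at: j.
         (pair isNil and: [x ~= y and: [(x gcd: y) > 1]])
           ifTrue: [pair := Array with: x with: y]]].
     pair notNil]
      whileTrue: [
        | x y g |
        x := pair first.
        y := pair last.
        g := x gcd: y.
        "Replace x and y with g, g, x/g and y/g, dropping factors of 1."
        factors remove: x; remove: y.
        factors add: g; add: g.
        x // g > 1 ifTrue: [factors add: x // g].
        y // g > 1 ifTrue: [factors add: y // g]].
    factors asSortedCollection asArray    "(3 3 3 3 5 5 7)"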

Implementing this is irritating because of the type of recursion necessary. The hackish approach would indicate heavy use of indices, temporary collections and other stuff. But going down that road thinking that refactoring will be able to clean it up later has failed for me. I can't even write a functioning algorithm in hackish terms I can stand.

Most of the problem for me revolves around the scheduling of grinders. In short, I need a queue that also acts like a set and that is thread-safe. I have implemented this kind of thing before in my reference finder. But since the reference finder had a clearer separation between the visitor and traversal processes, each execution context did not have a large number of names being referenced, so modifying the queue of objects scheduled to be visited while the queue was being depleted was neither confusing nor too problematic from a thread-safety point of view. In short, proper use of an ordered collection was good enough (see below for links).

However, this gcd grinding process does have a large number of names in each execution context already. Adding the following complexities on top of it makes it unmanageable:
  1. Thread-safe access to the grinder queue
  2. Indexed access to the grinder queue
  3. Avoiding adding repeated grinders to the queue
  4. Avoiding adding a repeated grinder that has already been taken from the queue
  5. Ability to skip a grinder being processed and jump to the next grinder from any step in the grinding process
Therefore I was stuck. Too many names in an execution context will bring progress down to a halt because your brain has to either a) swap the names in and out of your scratch ram all the time, or b) spend enough time to chunk it all.

But, observation: developers' abilities are usually measured by how much they can brute-force progress in spite of a) without resorting to b). While this rewards efficient thrashing, it also leads to the size of egos being directly proportional to those of scratch rams, both of these being directly proportional to the apparent short-term progress, to the degree to which produced code is an ugly mass of unmaintainable moldy rotten spaghetti, to the increasing disparity of perceived abilities between developers in the same team, to the resulting frictions between team members, to the increasing proportion of brute-forcers in the team as chunkers tend to leave, and to the overall cost of change measured over time.

In particular, the more this happens, the more the code reads like commandments given by a god-like figure. Evidently, the stress caused by the excess of referenced names in the execution context is an invitation to constraining and locking everything down --- as if all objects had to stand perfectly still unless ordered to do something, because otherwise chaos would take over and make brute-forcing impossible.

In my experience, the amount of stress is directly proportional to how much the code gives the impression of somebody shouting orders on top of increasingly louder white noise, while getting closer and closer to a major emotional breakdown.

Since we are not god-like figures with infinitely big scratch rams, chaos eventually takes over and after that any change made to the code leads to the introduction of more bugs than the ones being removed.

Hence, as a humble mere mortal I decided that I had to avoid all that by using classes to compartmentalize the names being referenced at each execution context --- chunking, in other words. The first thing I'm going to chunk out of the mess right now is that scheduling queue. Then things should be easier, and I should be able to write a working gcd grinder within the limits of everybody's scratch ram, namely: 7 ± 2 named chunks.
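As a hedged sketch of where that chunking is headed (this is an assumption about the eventual SchedulingSetQueue, not its actual code), the core could look like this, given instance variables pending (an OrderedCollection), seen (a Set) and mutex (a Semaphore forMutualExclusion) set up in #initialize:

    SchedulingSetQueue>>add: aGrinder
      "Schedule aGrinder unless it was ever scheduled before, which covers
       both the pending case and the already-dequeued case."
      mutex critical: [
        (seen includes: aGrinder)
          ifFalse: [seen add: aGrinder. pending addLast: aGrinder]]

    SchedulingSetQueue>>nextOrNil
      "Answer and remove the next scheduled grinder, or nil when drained."
      ^mutex critical: [
        pending isEmpty ifTrue: [nil] ifFalse: [pending removeFirst]]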

It is really unfortunate that chunking is usually seen as a sign of weakness (because of the assumed inability to make progress like others by e.g. brute-forcing the situation), and as a sign of time-wasting. Because in general terms, brute-forcing problems past the size of our scratch rams will get us nowhere.

Think of an ant colony: does the queen have to give out precise orders to all ants? Does she have to micromanage, providing control-freak-style orders to every ant while at the same time being aware of what every ant is doing?

Note that the job of the colony is achieved without needing a context in which all the aggregate knowledge of every ant can be referenced using some labeling convention. The colony's self-awareness does not have to be omnipresent. Same thing with your own body which is supporting your act of reading this.

There is not even one example of a complex system needing an omnipresent self-awareness context to evolve. Hence, there is neither need nor excuse for brute-forcing code into existence.

* Examine the reference finder's breadth-first traversal implementation by downloading the source code from here (and download RAR if you don't have it already), or by going to the Cincom public repository and loading the packages Distinctions and ReferenceFinder.

Thursday, July 14, 2005

Unexpected use of exceptions

Typically, when you use an algorithm to determine if a number has strict divisors, you use the algorithm to both factor into primes and to determine primality (no strict divisors found).

But how irritating it is to write an algorithm that can do both efficiently! Because if you are just looking to determine primality, any divisor found already gives you the answer --- so you'd need a check to see if you should stop. But when you want to find all factors, then you don't want to check because it's unnecessary.

And yes you can subclass, and you can do all sorts of complicated stuff, or you can raise a notification when you find a factor! Since notifications' default event handler proceeds, nothing bad happens if nobody is listening. Thus,

    Integer>>isPrime

      ^[self refinedFactorization isDefinitelyPrime]
        on: ComposedIntegerFoundEvent
        do: [:ex | ex return: false]

Wow... so much nicer!
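For completeness, here is a hedged sketch of the signaling side (the class IntegerFactorization, the selectors #noteStrictDivisor: and #recordFactor:, and the detail that ComposedIntegerFoundEvent is a Notification subclass are all assumptions): the factoring code signals the notification and simply keeps going whenever nobody handles it, because the default action of a Notification is to resume.

    IntegerFactorization>>noteStrictDivisor: aDivisor
      "Let any interested handler know a strict divisor turned up. When nobody
       is listening, the notification resumes and factorization continues.
       (#recordFactor: stands for whatever bookkeeping the factorization does.)"
      ComposedIntegerFoundEvent new signal.
      self recordFactor: aDivisor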

Observation: usage of when:send:to: within the same process amounts to raising a notification. Think about the implications...

Sunday, July 10, 2005

Regarding Dennis

As an expression of nature, hurricane-like storms are extraordinarily beautiful. When we see them on other planets, we can't help thinking they are gorgeous.

There's not much more to say about the damage hurricane-like storms cause on this planet's landmass, though. I just wish we didn't get in their way.

Did you know the point of any storm, as far as nature is concerned, is to cool down the surface?

Friday, July 01, 2005

Near-optimum Nysqually strategy

Do you know the game Nysqually? If you do, here's a near-optimum playing strategy.

  • "Reduce playing rectangle"
  • Get a histogram of all types of tiles.
  • Pick a side of the playing rectangle, count its length.
  • Pick a combination of groups of 3, 4 or 5 tiles such that the sum of their sizes is equal to the length of the side of the playing rectangle you chose. Make sure that you never choose in a way that leaves 1 or 2 tiles of the same type (since otherwise you won't be able to take them out).
  • Assuming it's possible to shuffle tiles around without making a mistake or inadvertent tile deletion, move them in such a way that the side of the playing rectangle you chose goes away.
Execute until the whole playing rectangle goes away. This works best by alternating horizontal and vertical sides in your order: that way, after every 2 sides you always end up with a square.

For easier playing, favor tile choices so that you tend to constrain tile counts to multiples of 3 when you are about to have a rectangle whose sides' lengths are multiples of 3. When you reach that situation, you can take out the whole thing in columns / rows of 3 and you win.

If you have to take all tiles out starting from a 12x12 playing rectangle, playing without a strategy will almost certainly make you lose. The strategy above, however, makes the game quite boring since it appears you cannot lose --- but I have not proven that, hence the near-optimum claim :).

Number 10 is mine, mine and only mine

So I come back from work and I read these two emails saying that some dude I know from Spain has a) tied my world champion record in Hat Trick Hero (56 goal difference over 7 games), and b) has posted a higher record in Euro Football Champ 92 (45 goal difference over 7 games vs 42).

I was furious. So I decided this situation wouldn't last 1 day. It didn't even last 10 hours.

Euro Football Champ 92 now has a record of 49 goal difference with zero against.

Football Champ now has a record of 58 goal difference with 1 against (59-1, improving my previous record of 53 goal difference, 56-3).

Hat Trick Hero now has a record of 66 goal difference with zero against.

Good-bye extraneous results!