During my tenure at Cincom, HPS (the Cincom Smalltalk VM) went from a peak of ~359.5k LOC, to ~233.5k LOC as of a couple days ago. This loss of ~126k LOC represents 35% of the source code base from seven years ago. As per my presentation at Smalltalks 2013, HPS has not been this small since the early 90s --- this, while being markedly faster and having significant new features.
Friday, March 07, 2014
Wednesday, February 26, 2014
I gave my two week notice at Cincom today. I'm grateful for the opportunities. I'm also glad I had the chance to meet some extraordinary people.
Posted by Andrés at 21:25
Wednesday, December 25, 2013
Regarding the 3x+1 fun I've been having, I have some updates...
First, you will recall that I had taken a C program used to do massive computations on 3x+1, and optimized it by a factor of about 3x (no pun intended). I'm almost done implementing the C program in Smalltalk as well. This work involves the industrialization / productization of various bits of workspace code I used while doing the C version. With this Smalltalk code, I plan to verify a few ideas I have to make the C program run even faster.
Second, as mentioned in my earlier posts the T^(x) function is x/2 for even x, and (3x+1)/2 for odd x. Let's take a look at the below T^4(x) trace matrix. Each row corresponds to each x mod 16 value, and columns 2 through 5 correspond to T^1(x), T^2(x), T^3(x) and T^4(x).
- 16k + 0 8k + 0 4k + 0 2k + 0 1k + 0
- 16k + 1 24k + 2 12k + 1 18k + 2 9k + 1
- 16k + 2 8k + 1 12k + 2 6k + 1 9k + 2
- 16k + 3 24k + 5 36k + 8 18k + 4 9k + 2
- 16k + 4 8k + 2 4k + 1 6k + 2 3k + 1
- 16k + 5 24k + 8 12k + 4 6k + 2 3k + 1
- 16k + 6 8k + 3 12k + 5 18k + 8 9k + 4
- 16k + 7 24k + 11 36k + 17 54k + 26 27k + 13
- 16k + 8 8k + 4 4k + 2 2k + 1 3k + 2
- 16k + 9 24k + 14 12k + 7 18k + 11 27k + 17
- 16k + 10 8k + 5 12k + 8 6k + 4 3k + 2
- 16k + 11 24k + 17 36k + 26 18k + 13 27k + 20
- 16k + 12 8k + 6 4k + 3 6k + 5 9k + 8
- 16k + 13 24k + 20 12k + 10 6k + 5 9k + 8
- 16k + 14 8k + 7 12k + 11 18k + 17 27k + 26
- 16k + 15 24k + 23 36k + 35 54k + 53 81k + 80
- 16k + 0 0 0 0 0
- 16k + 1 1 0 1 0
- 16k + 2 0 1 0 1
- 16k + 3 1 1 0 0
- 16k + 4 0 0 1 0
- 16k + 5 1 0 0 0
- 16k + 6 0 1 1 0
- 16k + 7 1 1 1 0
- 16k + 8 0 0 0 1
- 16k + 9 1 0 1 1
- 16k + 10 0 1 0 0
- 16k + 11 1 1 0 1
- 16k + 12 0 0 1 1
- 16k + 13 1 0 0 1
- 16k + 14 0 1 1 1
- 16k + 15 1 1 1 1
But why does this happen in the first place? That is, why do parity matrices have all possible parity vectors? I have the following proof by induction to offer, which also explains the structure of the matrix. If you look at the trace matrix above, you will note that the even cases resolve immediately to the mod 8 matrix. But what about the odd cases? Those can be remapped to the mod 8 matrix too:
- 24k+2 => 8 * 3k + 2
- 24k+5 => 8 * 3k + 5
- 24k+8 => 8 * (3k+1) + 0
- 24k+11 => 8 * (3k+1) + 3
- 24k+14 => 8 * (3k+1) + 6
- 24k+17 => 8 * (3k+2) + 1
- 24k+20 => 8 * (3k+2) + 4
- 24k+23 => 8 * (3k+2) + 7
We can subtract 2 from both sides, and then since 3 is inversible we can multiply by 3^-1, leaving that r1 must be equal to r2 mod 2^m. This equivalence implies equality because 0 <= r < 2^m. Consequently, there are no duplicates, and so there are as many different rows as there can possibly be.
These observations about the even and odd cases show that each 2^m matrix is embedding the 2^(m-1) matrices twice: once for the even case, and once for the odd case. In the case of the even case, it prefixes the 2^(m-1) matrix with a leading zero in the parity vectors. The odd case prefixes the 2^(m-1) matrix with a leading one in the parity vectors.
The above proves the induction step. So what about the basis, i.e. the 2^1 matrix? Well, that's simply
- 2k+0 => 1k+0
- 2k+1 => 3k+2
- 2k+0 0
- 2k+1 1
Side comment: it's really fun to write Smalltalk code to run through the proof for a certain matrix, and then put that in an SUnit test.
Third, I began looking at join sieves in Smalltalk, too. From my earlier posts, remember that a join sieve is a T^k(x) matrix that is used to skip some cases that aren't interesting. All even cases are skipped in favor of the resulting odd cases. Also, since some T(x) paths join after a few iterations, e.g. T^3(8k+4) = T^3(8k+5) = 3k+2, we can skip processing 8k+5. Consider then a join sieve mod 2^m, storing whether each case should be analyzed in a large bit vector. In fact, the bit vector only records the information for odd cases, since obviously any even case would have a zero in the bit vector. This means that a 2^m sieve needs 2^(m-1) bits of storage. If m=20, the sieve requires 512 kilobits, or 64 kilobytes.
So, exactly how many byte values appear in the 2^20 sieve's bit vector? Exactly 32. First, we have 8 multiples of 8, from 0 to 56. Then, we have those same multiples + 128. And then, we have those 16 multiples of 8, plus 1. In other words,
Posted by Andrés at 15:48
Thursday, December 12, 2013
Wednesday, November 20, 2013
Saturday, November 09, 2013
So, I am working on the next VisualWorks release, and yes we are deleting obsolete / unused / unnecessary code. Specifically, we have removed about
5200 5700 9500 LOC from the VM so far in this release cycle which has just begun.
Maybe deleting code does not sound glamorous. However, keep in mind we are in an unprecedented period of almost 7 years in which source code size has monotonically decreased with every release. In fact, the current HPS VM source code base is as large as it was roughly 20 years ago. Moreover, HPS is far more robust, is significantly faster regarding GC in particular, and has new non-trivial features.
The metrics I showed in my presentation at Smalltalks 2013 also illustrate a marked increase in closed VM tickets during these last 7 years. In other words, our clean up efforts allow us to work faster in the long run.
And we are not done deleting cruft yet.
Posted by Andrés at 23:40
Friday, October 25, 2013
Friday, September 06, 2013
Regarding the ongoing work on the 3x+1 program I talked about here, I set up a generational scavenger of sorts for the hashed table used during the T^k(r) sieve calculation. This strategy assumes the sieve gaps will be relatively small compared to the sieve size, so you only need a comparatively few recent j, T^k(r) pairs to detect all paths that join. With this new algorithm, I calculated the following 2^32 and 2^33 sieve approximations using no more than 1gb of RAM.
- 2^32 sieve: gap >= 11102 (e.g. 2661008385), skip rate 92.617%.
- 2^33 sieve: gap >= 11102 (e.g. 2661008385), skip rate 92.916%.
The implication of these results is that it's entirely feasible now to run calculations using the largest sieve size currently supported by the program, namely a 2^33 sieve, by default. This is a huge improvement: with the original setup, such an operation would have been from counterproductive to impossible in practice.
More to come later.
Posted by Andrés at 01:26
Saturday, August 31, 2013
From time to time I like spending time on some of my favorite problems. Recently it's been the turn of a good old friend, 3x+1. I have a few things to comment on. First, you should check out this book. There is also this awesome page with tons of information in it. Eric Roosendaal (quoted in the above mentioned book) passed me the source code for the program he uses to hunt for delay records. According to Eric, the program had been optimized quite a bit. I confirm: yes it was optimized quite a bit. But, you know... optimization and math together are such an inviting and exciting challenge for me that I couldn't help it. What follows is an experience report.
- T(n) = n/2, n even
- T(n) = (3n+1)/2, n odd
using the usual division algorithm, and requiring that the remainder r is always non-negative. Then,
where j is the number of times T^k(r) chose the odd branch. In addition,
Isn't that awesome? Now let's put that identity to good use. Consider T^3(8q+4):
- T(8q+4) = 4q+2
- T(4q+2) = 2q+1
- T(2q+1) = 3q+2
- T(8q+5) = 12q+8
- T(12q+8) = 6q+4
- T(6q+4) = 3q+2
But who said you can only do this with 2^3? To begin with, Eric's program calculates a 2^20 sieve based on the awesome identity. The process consists of finding all j, T^k(r) pairs for all 0 <= r < 2^20, and then keeping the lowest values of r for all pairs that repeat. Eric's observation for doing this was that, if you are going to get a duplicate pair for some r1, any value of r2 > r1 that results in the same j, T^k(r) pair is not far away from r1. For example, for the 2^20 sieve, r2 - r1 <= 359. This distance is called the sieve gap. Well, since the gap is small and can be calculated once by doing the linear search thing, in practice you just need a circular buffer and then you can recalculate the sieve at will. Now, obviously, calculating gaps by brute force is incredibly costly. Since you don't know what the gap is, you are forced to do linear search through all the results seen so far during sieve generation. This limitation basically forbids checking gaps for sieves much larger than 2^20. Larger sieves mean doing less work later though, so it's worth looking into this issue. So, that became my first goal: enable sieve analysis on much larger sieves, and try to make the sieve calculation faster if possible.
The first change was to replace the circular buffer linear search with a hashed table. The lookup key is the pair j, T^k(r), and the stored value is r (storing r allows to calculate the sieve gap on the fly, which although now unnecessary is still interesting on its own). I packed all that information in 16 bytes per entry. Slightly tighter packings are possible at the cost of additional complexity. In any case, this approach slashed the processing time for the 2^20 sieve from several minutes to 0.2 seconds. I approximated the final hashed collection size with something that looks like 1/x, so I stopped growths and rehashing for larger sieves while still being efficient space-wise. This strategy allowed this machine to calculate the full 2^31 sieve using just 6gb of memory and 8 minutes of 5-year old CPU time. Here are some interesting results (obviously the sieves skip all even numbers).
- 2^20 sieve: gap 359 (e.g. 271105), skip rate 87.371%.
- 2^21 sieve: gap 442 (e.g. 361473), skip rate 87.968%.
- 2^22 sieve: gap 878 (e.g. 774721), skip rate 88.528%.
- 2^23 sieve: gap 878 (e.g. 774721), skip rate 89.053%.
- 2^24 sieve: gap 1210 (e.g. 10905985), skip rate 89.546%.
- 2^25 sieve: gap 2360 (e.g. 21811969), skip rate 90.011%.
- 2^26 sieve: gap 2360 (e.g. 21811969), skip rate 90.449%.
- 2^27 sieve: gap 3146 (e.g. 29082625), skip rate 90.862%.
- 2^28 sieve: gap 3146 (e.g. 29082625), skip rate 91.252%.
- 2^29 sieve: gap 4195 (e.g. 38776833), skip rate 91.621%.
- 2^30 sieve: gap 8237 (e.g. 922014465), skip rate 91.971%.
- 2^31 sieve: gap 8237 (e.g. 922014465), skip rate 92.303%.
After the sieve is calculated in full, it can be represented as a bitfield that also skips all even numbers. So, a 2^31 sieve requires just 2^27 bytes of memory. But that's not all the sieving that can be done. We can also skip 3q+2, 9q+4 and 81q+10 numbers precisely because these values can be easily derived from others such as 8q+4. The original program issued, on average, 5/3 divisions per number checked to remove the first two cases. In other words, there was code along the lines of
Both programs use a 4 way switch to determine which computation to use.
- n = 4q+0: shift 2 bits to the right.
- n = 4q+1: calculate 3q+1 by doing n/2 + n/4 + 1, in integers.
- n = 4q+2: calculate 3q+2 by doing n/2 + n/4 + 1, in integers.
- n = 4q+3: calculate 9q+8 by doing 2n + n/4 + 2, in integers (and check against potential overflow and switch to multiprecision arithmetic if needed).
Disassembly of the GCC emitted code does not necessarily suggest that rewriting the calculation code directly in e.g. 64 bit Intel assembler will result in very significant speedups at this time because the parts that might benefit from e.g. adc instructions aren't that long. Moreover, using assembler would allow using 64 bit digits even during multiprecision calculations. However, the resulting code would need numerous e.g. adc instructions, whereas now the code just deals with numerous such instructions with just two instructions: and, and shl/shr. In addition, the current C code is portable, while the assembly code is not.
- First version of the new program, single threaded, 2^20 sieve size, 60*10^12 numbers: about 11 days' CPU time.
- Estimated original program, single threaded, 2^20 sieve size, 20*10^12 numbers: about 7 days' CPU time.
Posted by Andrés at 22:38
Saturday, August 10, 2013
It wouldn't be too hard to assume a general dislike of negative feedback. Moreover, it's rather easy to see how we're constantly bombarded with (potentially) positive feedback in the form of ads: do this, and good stuff happens to you. Add to this mix the seemingly reasonable maxim "do unto others as you'd like done unto you". When put together, do these imply we're not exercising our capability to judge on merits alone? Are we censoring our public opinions so that we only mention positives, ignore the negatives, and counterbalance the negatives with positives as well?
Well, apparently, yes.
Consider now what happens when that mentality clashes with a rational experience of reality. Is that programming language good? Is that math stuff correct? Does that critical bit of software work? Let's hear the comments of the team on that, right? Hmmm...
Posted by Andrés at 17:41