Wednesday, December 25, 2013

3x+1, the update

Regarding the 3x+1 fun I've been having, I have some updates...

First, you will recall that I had taken a C program used to do massive computations on 3x+1, and optimized it by a factor of about 3x (no pun intended).  I'm almost done implementing the C program in Smalltalk as well.  This work involves the industrialization / productization of various bits of workspace code I used while doing the C version.  With this Smalltalk code, I plan to verify a few ideas I have to make the C program run even faster.

Second, as mentioned in my earlier posts, the T(x) function is x/2 for even x, and (3x+1)/2 for odd x.  Let's take a look at the T^4(x) trace matrix below.  Each row corresponds to one x mod 16 value, and columns 2 through 5 correspond to T^1(x), T^2(x), T^3(x) and T^4(x).

  • 16k + 0    8k + 0    4k + 0    2k + 0    1k + 0
  • 16k + 1    24k + 2    12k + 1    18k + 2    9k + 1
  • 16k + 2    8k + 1    12k + 2    6k + 1    9k + 2
  • 16k + 3    24k + 5    36k + 8    18k + 4    9k + 2
  • 16k + 4    8k + 2    4k + 1    6k + 2    3k + 1
  • 16k + 5    24k + 8    12k + 4    6k + 2    3k + 1
  • 16k + 6    8k + 3    12k + 5    18k + 8    9k + 4
  • 16k + 7    24k + 11    36k + 17    54k + 26    27k + 13
  • 16k + 8    8k + 4    4k + 2    2k + 1    3k + 2
  • 16k + 9    24k + 14    12k + 7    18k + 11    27k + 17
  • 16k + 10    8k + 5    12k + 8    6k + 4    3k + 2
  • 16k + 11    24k + 17    36k + 26    18k + 13    27k + 20
  • 16k + 12    8k + 6    4k + 3    6k + 5    9k + 8
  • 16k + 13    24k + 20    12k + 10    6k + 5    9k + 8
  • 16k + 14    8k + 7    12k + 11    18k + 17    27k + 26
  • 16k + 15    24k + 23    36k + 35    54k + 53    81k + 80
Because of the properties of T(x), the last iteration always ends in 3^n k + something.  Specifically, 3^n appears when T(x) takes the odd branch n times.  So, how often does each power 3^n come up?  Surprise: the binomial coefficient (4 choose n).  But why?  Let's look at the associated parity matrix, which has a 1 when T(x) takes the odd branch, and a 0 when T(x) takes the even branch.
  • 16k + 0        0        0        0        0
  • 16k + 1        1        0        1        0
  • 16k + 2        0        1        0        1
  • 16k + 3        1        1        0        0
  • 16k + 4        0        0        1        0
  • 16k + 5        1        0        0        0
  • 16k + 6        0        1        1        0
  • 16k + 7        1        1        1        0
  • 16k + 8        0        0        0        1
  • 16k + 9        1        0        1        1
  • 16k + 10        0        1        0        0
  • 16k + 11        1        1        0        1
  • 16k + 12        0        0        1        1
  • 16k + 13        1        0        0        1
  • 16k + 14        0        1        1        1
  • 16k + 15        1        1        1        1
All possible bit vectors appear in the parity matrix.  This means that, for example, a parity vector of size 4 having 3 bits turned on will appear (4 choose 3) times.  Consequently, 3^3 will appear (4 choose 3) times too.
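As a quick sanity check of that counting argument, here is a small standalone C sketch (my own illustration, unrelated to the programs discussed below) that tallies how many residues mod 2^M take the odd branch j times during the first M applications of T.  Using the k = 0 representative of each class is enough because, for the first M steps, the k coefficient is still even, so the branch taken only depends on the constant term.  The tallies come out as the binomial coefficients (M choose j).

#include <stdio.h>

int main(void)
{
    enum { M = 4 };                         /* residues mod 2^M, M applications of T */
    unsigned long counts[M + 1] = { 0 };

    for (unsigned long r = 0; r < (1ul << M); r++) {
        unsigned long x = r;                /* the k = 0 representative of 2^M k + r */
        int j = 0;
        for (int i = 0; i < M; i++) {
            if (x & 1) { x = (3 * x + 1) / 2; j++; }   /* odd branch  */
            else       { x = x / 2; }                  /* even branch */
        }
        counts[j]++;                        /* this residue's row ends in 3^j k + T^M(r) */
    }
    for (int j = 0; j <= M; j++)
        printf("3^%d appears %lu times\n", j, counts[j]);   /* 1, 4, 6, 4, 1 for M = 4 */
    return 0;
}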

But why does this happen in the first place?  That is, why do parity matrices have all possible parity vectors?  I have the following proof by induction to offer, which also explains the structure of the matrix.  If you look at the trace matrix above, you will note that the even cases resolve immediately to the mod 8 matrix.  But what about the odd cases?  Those can be remapped to the mod 8 matrix too:
  • 24k+2 => 8 * 3k + 2
  • 24k+5 => 8 * 3k + 5
  • 24k+8 => 8 * (3k+1) + 0
  • 24k+11 => 8 * (3k+1) + 3
  • 24k+14 => 8 * (3k+1) + 6
  • 24k+17 => 8 * (3k+2) + 1
  • 24k+20 => 8 * (3k+2) + 4
  • 24k+23 => 8 * (3k+2) + 7
And sure enough, those rows can be resorted into a nice 8k+0, ..., 8k+7 sequence.  This is possible because those expressions have the general form 8 * something + ((2 + 3r) mod 8), with r in [0, 7].  That's why we see the remainders in the order 2, 5, 0, 3, 6, 1, 4, 7.  But why do those remainders always cover all possible values?  Well, consider the congruence

2 + 3r1 ≡ 2 + 3r2 (mod 2^m)

We can subtract 2 from both sides, and then, since 3 is invertible mod 2^m (for example, 3 * 3 = 9 ≡ 1 (mod 8), so mod 8 the inverse of 3 is 3 itself), we can multiply by 3^-1, leaving r1 ≡ r2 (mod 2^m).  This congruence implies equality because 0 <= r1, r2 < 2^m.  Consequently, there are no duplicates, and so there are as many different rows as there can possibly be.

These observations about the even and odd cases show that each 2^m matrix embeds the 2^(m-1) matrix twice: once for the even cases, and once for the odd cases.  The even cases prefix the 2^(m-1) parity vectors with a leading zero, and the odd cases prefix them with a leading one.

The above proves the induction step.  So what about the basis, i.e. the 2^1 matrix?  Well, that's simply
  • 2k+0 => 1k+0
  • 2k+1 => 3k+2
and consequently the parity matrix looks like
  • 2k+0    0
  • 2k+1    1
which obviously has all possible parity vectors.  QED.

Side comment: it's really fun to write Smalltalk code to run through the proof for a certain matrix, and then put that in an SUnit test.

Third, I began looking at join sieves in Smalltalk, too.  From my earlier posts, remember that a join sieve is a T^k(x) matrix that is used to skip cases that aren't interesting.  All even cases are skipped in favor of the resulting odd cases.  Also, since some T(x) paths join after a few iterations, e.g. T^3(8k+4) = T^3(8k+5) = 3k+2, we can skip processing 8k+5.  Consider then a join sieve mod 2^m that stores, in a large bit vector, whether each case should be analyzed.  In fact, the bit vector only records the information for the odd cases, since obviously any even case would have a zero in the bit vector.  This means that a 2^m sieve needs 2^(m-1) bits of storage.  If m = 20, the sieve requires 512 kilobits, or 64 kilobytes.
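For concreteness, here is a minimal sketch of how such an odd-only bit vector can be indexed (an assumed layout for illustration, not necessarily what my Smalltalk or C code does): bit i stands for the odd residue 2i + 1, so a 2^m sieve fits in 2^(m-1) bits.

#include <stdint.h>

#define M 20                                    /* sieve modulus 2^M            */
static uint64_t sieve[(1u << (M - 1)) / 64];    /* 2^(M-1) bits = 64 kilobytes  */

static inline void mark_for_analysis(uint32_t r)    /* r odd, 0 <= r < 2^M      */
{
    uint32_t i = r >> 1;                        /* odd r maps to bit i = r / 2  */
    sieve[i >> 6] |= 1ull << (i & 63);
}

static inline int should_analyze(uint32_t r)
{
    if ((r & 1) == 0) return 0;                 /* even cases are never stored  */
    uint32_t i = r >> 1;
    return (int)((sieve[i >> 6] >> (i & 63)) & 1);
}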

So, exactly how many distinct byte values appear in the 2^20 sieve's bit vector?  Exactly 32.  First, we have the 8 multiples of 8 from 0 to 56.  Then, we have those same values plus 128.  And finally, we have each of those 16 values plus 1.  In other words,

0, 1, 8, 9, 16, 17, 24, 25, 32, 33, 40, 41, 48, 49, 56, 57,
128, 129, 136, 137, 144, 145, 152, 153, 160, 161, 168, 169, 176, 177, 184, 185

And how many byte digrams appear?  Just 432 out of a possible 65536, following a very similar pattern.  Similarly, there are only 4914 byte tetragrams out of a possible 2^32.  Pragmatically, this clearly shows that the sieve can be compacted to save memory.  It would be even nicer to understand why these specific values come up.

More to come...

Thursday, December 12, 2013

Math exercises...

Ahh, vacation, a time to do nothing... riiiight...

Here's a fun problem I worked on a moment ago: sum all integers in the interval [x, y] with no divisions and only one multiplication.
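In case you want to compare notes, here is one sketch that seems to fit the constraints (my take, and possibly not the intended solution, since it treats shifting an even number right by one as not being a division): use the closed form (x + y) * (y - x + 1) / 2 and halve whichever factor is even, and one of them always is because the two factors add up to the odd number 2y + 1.

#include <stdio.h>

/* Sum of all integers in [x, y], assuming x <= y. */
static long long sum_interval(long long x, long long y)
{
    long long a = x + y;        /* sum of the endpoints                    */
    long long b = y - x + 1;    /* number of terms                         */
    if ((a & 1) == 0)
        a >>= 1;                /* halve the even factor with a shift      */
    else
        b >>= 1;
    return a * b;               /* the single multiplication               */
}

int main(void)
{
    printf("%lld\n", sum_interval(1, 100));   /* prints 5050 */
    return 0;
}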

Wednesday, November 20, 2013

Smalltalks 2013 video now available

Check out the playlist here.  Don't dare miss Smalltalks 2014!

Saturday, November 09, 2013

HPS code size so far after 7.10

So, I am working on the next VisualWorks release, and yes, we are deleting obsolete / unused / unnecessary code.  Specifically, we have removed about 9500 LOC from the VM so far in this release cycle, which has just begun (earlier running counts of about 5200 and then 5700 have since been superseded).

Maybe deleting code does not sound glamorous.  However, keep in mind we are in an unprecedented period of almost 7 years in which source code size has monotonically decreased with every release.  In fact, the current HPS VM source code base is as large as it was roughly 20 years ago.  Moreover, HPS is far more robust, is significantly faster (at GC in particular), and has new non-trivial features.

The metrics I showed in my presentation at Smalltalks 2013 also illustrate a marked increase in closed VM tickets during these last 7 years.  In other words, our clean up efforts allow us to work faster in the long run.

And we are not done deleting cruft yet.

Friday, October 25, 2013

Smalltalks 2013 schedule

Smalltalks 2013 starts next week, and I can't wait!  In fact, I'm sure the 300 registrants to our first Rosario edition can't wait either.  Did you check the conference schedule?

Friday, September 06, 2013

About 3x+1, hash table scavenging

Regarding the ongoing work on the 3x+1 program I talked about here, I set up a generational scavenger of sorts for the hashed table used during the T^k(r) sieve calculation.  This strategy assumes the sieve gaps will be relatively small compared to the sieve size, so only a comparatively small number of recent j, T^k(r) pairs is needed to detect all paths that join.  With this new algorithm, I calculated the following 2^32 and 2^33 sieve approximations using no more than 1gb of RAM.

  • 2^32 sieve: gap >= 11102 (e.g. 2661008385), skip rate 92.617%.
  • 2^33 sieve: gap >= 11102 (e.g. 2661008385), skip rate 92.916%.
In addition, since the program uses far less memory (with cache misses mostly in sequential as opposed to random order), the approximations take somewhat less CPU time to produce than the full results.
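To illustrate the flavor of the idea, here is a hedged sketch of such a recency-bounded table (my own simplification with made-up parameters, not the program's actual data structure): any entry whose r has fallen too far behind is treated as a free slot.  Since a join is only reported on an exact key match, scavenging an entry can at worst make the approximation slightly more conservative; it can never report a join that isn't there.

#include <stdint.h>

#define TABLE_BITS 20
#define WINDOW (1ull << 16)     /* assumed to comfortably exceed the sieve gap */

typedef struct { uint64_t key; uint64_t r; } entry;   /* key packs (j, T^k(r)); 0 = empty (a simplification) */
static entry table[1u << TABLE_BITS];

/* Returns the earlier r that produced the same (j, T^k(r)) pair if it is still
   within the recency window; otherwise records r and returns UINT64_MAX. */
static uint64_t probe(uint64_t key, uint64_t r)
{
    uint64_t mask = (1u << TABLE_BITS) - 1;
    uint64_t i = (key * 0x9e3779b97f4a7c15ull) >> (64 - TABLE_BITS);
    for (;;) {
        entry *e = &table[i];
        if (e->key == 0 || r - e->r > WINDOW) {   /* empty, or aged out: scavenge */
            e->key = key;
            e->r = r;
            return UINT64_MAX;
        }
        if (e->key == key)                        /* join detected                */
            return e->r;
        i = (i + 1) & mask;                       /* linear probing               */
    }
}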

The implication of these results is that it's entirely feasible now to run calculations using the largest sieve size currently supported by the program, namely a 2^33 sieve, by default.  This is a huge improvement: with the original setup, such an operation would have been anywhere from counterproductive to outright impossible in practice.

More to come later.

Saturday, August 31, 2013

About 3x+1

From time to time I like spending time on some of my favorite problems.  Recently it's been the turn of a good old friend, 3x+1.  I have a few things to comment on.  First, you should check out this book.  There is also this awesome page with tons of information in it.  Eric Roosendaal (quoted in the above mentioned book) passed me the source code for the program he uses to hunt for delay records.  According to Eric, the program had been optimized quite a bit.  I confirm: yes it was optimized quite a bit.  But, you know... optimization and math together are such an inviting and exciting challenge for me that I couldn't help it.  What follows is an experience report.

First, some definitions.  The 3x+1 problem tends to be better studied using the T(x) form as follows:
  • T(n) = n/2, n even
  • T(n) = (3n+1)/2, n odd
The above definition allows the following beautiful identity.  First, break n by dividing it by a power of two 2^k

n = q2^k + r

using the usual division algorithm, and requiring that the remainder r is always non-negative.  Then,

T^k(n) = T^k(q2^k + r) = q3^j + T^k(r)

where j is the number of times T^k(r) chose the odd branch.  In addition,

T^k(2^k - 1) = 3^k - 1

Isn't that awesome?  Now let's put that identity to good use.  Consider T^3(8q+4):
  1. T(8q+4) = 4q+2
  2. T(4q+2) = 2q+1
  3. T(2q+1) = 3q+2
Now look at T^3(8q+5):
  1. T(8q+5) = 12q+8
  2. T(12q+8) = 6q+4
  3. T(6q+4) = 3q+2
So T^3(8q+4) = T^3(8q+5).  If you're checking for T(n) records, that is, the least values of n such that something interesting happens, then you'd be well served by ignoring all 5 mod 8 numbers because checking 4 mod 8 already takes care of 5 mod 8.
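Here is a small standalone C check of both the join above and the identity itself for small k (my own snippet, not Eric's code):

#include <assert.h>
#include <stdio.h>

static unsigned long long T(unsigned long long n)
{
    return (n & 1) ? (3 * n + 1) / 2 : n / 2;
}

/* Apply T k times; when j is non-null, count how many odd branches were taken. */
static unsigned long long Tk(unsigned long long n, int k, int *j)
{
    for (int i = 0; i < k; i++) {
        if (j && (n & 1)) (*j)++;
        n = T(n);
    }
    return n;
}

static unsigned long long pow3(int j)
{
    unsigned long long p = 1;
    while (j-- > 0) p *= 3;
    return p;
}

int main(void)
{
    /* The join: T^3(8q+4) = T^3(8q+5) = 3q+2. */
    for (unsigned long long q = 0; q < 1000; q++) {
        assert(Tk(8 * q + 4, 3, 0) == 3 * q + 2);
        assert(Tk(8 * q + 5, 3, 0) == 3 * q + 2);
    }

    /* The identity: T^k(q*2^k + r) = q*3^j + T^k(r), where j is the number of
       odd branches T takes on r.  One arbitrary q is enough for a smoke test. */
    for (int k = 1; k <= 10; k++)
        for (unsigned long long r = 0; r < (1ull << k); r++) {
            int j = 0;
            unsigned long long tkr = Tk(r, k, &j);
            unsigned long long q = 12345;
            assert(Tk(q * (1ull << k) + r, k, 0) == q * pow3(j) + tkr);
        }

    printf("all checks passed\n");
    return 0;
}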

But who said you can only do this with 2^3?  To begin with,  Eric's program calculates a 2^20 sieve based on the awesome identity.  The process consists of finding all j, T^k(r) pairs for all 0 <= r < 2^20, and then keeping the lowest values of r for all pairs that repeat.  Eric's observation for doing this was that, if you are going to get a duplicate pair for some r1, any value of r2 > r1 that results in the same j, T^k(r) pair is not far away from r1.  For example, for the 2^20 sieve, r2 - r1 <= 359.  This distance is called the sieve gap.  Well, since the gap is small and can be calculated once by doing the linear search thing, in practice you just need a circular buffer and then you can recalculate the sieve at will.  Now, obviously, calculating gaps by brute force is incredibly costly.  Since you don't know what the gap is, you are forced to do linear search through all the results seen so far during sieve generation.  This limitation basically forbids checking gaps for sieves much larger than 2^20.  Larger sieves mean doing less work later though, so it's worth looking into this issue.  So, that became my first goal: enable sieve analysis on much larger sieves, and try to make the sieve calculation faster if possible.

The first change was to replace the circular buffer linear search with a hashed table.  The lookup key is the pair j, T^k(r), and the stored value is r (storing r allows calculating the sieve gap on the fly, which, although no longer necessary, is still interesting on its own).  I packed all that information into 16 bytes per entry; slightly tighter packings are possible at the cost of additional complexity (a rough sketch of one possible layout appears after the results below).  In any case, this approach slashed the processing time for the 2^20 sieve from several minutes to 0.2 seconds.  I approximated the final hashed collection size with something that looks like 1/x, so larger sieves avoid growing and rehashing the table while still being space efficient.  This strategy allowed this machine to calculate the full 2^31 sieve using just 6gb of memory and 8 minutes of 5-year old CPU time.  Here are some interesting results (obviously the sieves skip all even numbers).
  • 2^20 sieve: gap 359 (e.g. 271105), skip rate 87.371%.
  • 2^21 sieve: gap 442 (e.g. 361473), skip rate 87.968%.
  • 2^22 sieve: gap 878 (e.g. 774721), skip rate 88.528%.
  • 2^23 sieve: gap 878 (e.g. 774721), skip rate 89.053%.
  • 2^24 sieve: gap 1210 (e.g. 10905985), skip rate 89.546%.
  • 2^25 sieve: gap 2360 (e.g. 21811969), skip rate 90.011%.
  • 2^26 sieve: gap 2360 (e.g. 21811969), skip rate 90.449%.
  • 2^27 sieve: gap 3146 (e.g. 29082625), skip rate 90.862%.
  • 2^28 sieve: gap 3146 (e.g. 29082625), skip rate 91.252%.
  • 2^29 sieve: gap 4195 (e.g. 38776833), skip rate 91.621%.
  • 2^30 sieve: gap 8237 (e.g. 922014465), skip rate 91.971%.
  • 2^31 sieve: gap 8237 (e.g. 922014465), skip rate 92.303%.
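As for the 16 bytes per entry mentioned earlier, here is one plausible layout (an illustration only; the actual program may well arrange things differently):

#include <stdint.h>

/* The key is the pair (j, T^k(r)) and the value is the least r seen so far,
   so the gap r2 - r1 falls out for free whenever a duplicate key shows up. */
typedef struct {
    uint64_t t_k_of_r;   /* T^k(r); fits, since T^k(r) < 3^k for r < 2^k (cf. T^k(2^k - 1) = 3^k - 1) */
    uint32_t j;          /* number of odd branches taken                    */
    uint32_t r;          /* least r producing this pair; enough up to 2^32  */
} sieve_entry;

_Static_assert(sizeof(sieve_entry) == 16, "16 bytes per entry, no padding");
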
So, this new code enables on the fly calculation of sieves much larger than 2^20.  The best part is that, as you can see, going from a 2^20 sieve to a 2^31 sieve makes you work about 7.7% of the time, instead of 12.6% of the time.  In other words, that's about 40% less work!  The extra sieve setup overhead does pay off, because even 1% less work is very significant.  As you can see in Eric's page here, work is distributed in chunks of 20*10^12 numbers.  Each chunk takes on the order of a week with the original program.  1% of a week is much more than the 8 minutes it takes to figure out the 2^31 sieve, so you win.

After the sieve is calculated in full, it can be represented as a bitfield that also skips all even numbers.  So, a 2^31 sieve requires just 2^27 bytes of memory.  But that's not all the sieving that can be done.  We can also skip 3q+2, 9q+4 and 81q+10 numbers precisely because these values can be easily derived from others such as 8q+4.  The original program issued, on average, 5/3 divisions per number checked to remove the first two cases.  In other words, there was code along the lines of

if (x % 3 != 2 && x % 9 != 4)

Integer multiplication has become very fast in modern processors, but division is still comparatively slow.  This is particularly so when operating on largish values.  I removed these divisions, and added the x % 81 != 10 case as well, by using the Chinese Remainder Theorem.  First, I rewrote the program's iteration so that, instead of scanning a block of numbers in sequential fashion, the program scans the block using a set of linear combs.  In other words, instead of scanning e.g. [32, 47] like this:

32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47

the program breaks that into equivalence classes.  If the program were using a 2^2 sieve, it would scan like this:

32, 36, 40, 44
33, 37, 41, 45
34, 38, 42, 46
35, 39, 43, 47

The linear comb traversal reduces accesses to the sieve bit field tremendously.  Imagine scanning a block of 20*10^12 integers by checking every (odd) number against the sieve: that's 10*10^12 sieve bit field operations.  With the linear comb approach, a sieve of 2^k requires exactly 2^(k-1) accesses, regardless of how many numbers are being checked.  Since generally k is much less than 40, these computation savings are significant.  On top of this, I also used a streaming approach for the sieve, so the bit field is read one 64 bit word at a time rather than one bit at a time.
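Here is a hedged sketch of what such a comb traversal can look like (my illustration; check_number is a hypothetical stand-in for the real per-number work, and the streaming and other details of the actual program are elided):

#include <stdint.h>

static void check_number(uint64_t n) { (void)n; }   /* stand-in for the real check */

/* Scan the block [lo, lo + count) with a 2^k odd-only sieve: each surviving
   residue class costs one sieve access and is then walked with stride 2^k. */
static void scan_block(const uint64_t *sieve_bits, int k, uint64_t lo, uint64_t count)
{
    uint64_t step = 1ull << k;
    for (uint64_t r = 1; r < step; r += 2) {               /* odd residues only          */
        uint64_t i = r >> 1;                               /* bit index, odd-only field  */
        if (!((sieve_bits[i >> 6] >> (i & 63)) & 1))
            continue;                                      /* class is sieved out        */
        uint64_t first = lo + ((r - lo) & (step - 1));     /* first n >= lo with n = r (mod 2^k) */
        for (uint64_t n = first; n < lo + count; n += step)
            check_number(n);
    }
}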

The next step is to use the Chinese Remainder Theorem.  If we make 9 copies of the 2^k sieve, then every single combination of 2^k, 3 and 9 moduli will appear in the sieve.  Similarly, if we make 81 copies, the sieve will cover every combination of 2^k, 3, 9 and 81 moduli.  Then, instead of making 81 copies, we just keep a running set of mod 3, 9 and 81 counters (which are incremented with an addition and a compare to reset their tiny values as needed, instead of requiring a division of a number that these days has at least 50 bits) and cycle through the sieve 81 times.  Rough speed measurements and estimates suggest this change alone shaved a significant double digit percentage off the program's execution time.
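For illustration, the add-and-compare counters can look something like this (a sketch under my own assumptions, not the program's code; the deltas are the scan stride reduced mod 3, 9 and 81 once, up front):

#include <stdint.h>

typedef struct { uint32_t m3, m9, m81; } residues;

static residues residues_init(uint64_t n)          /* the only divisions, done once */
{
    residues r = { (uint32_t)(n % 3), (uint32_t)(n % 9), (uint32_t)(n % 81) };
    return r;
}

static inline void residues_step(residues *r, uint32_t d3, uint32_t d9, uint32_t d81)
{
    r->m3  += d3;  if (r->m3  >= 3)  r->m3  -= 3;   /* d3  = stride % 3   */
    r->m9  += d9;  if (r->m9  >= 9)  r->m9  -= 9;   /* d9  = stride % 9   */
    r->m81 += d81; if (r->m81 >= 81) r->m81 -= 81;  /* d81 = stride % 81  */
}

/* 3q+2, 9q+4 and 81q+10 numbers are covered by other cases, so skip them. */
static inline int sieved_out_by_small_moduli(const residues *r)
{
    return r->m3 == 2 || r->m9 == 4 || r->m81 == 10;
}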

The original program used a mix of 64 and 32 bit counters.  I rewrote all that using 64 bit arithmetic only.  When multiprecision numbers are needed (T(n) can grow as large as about n^2), the new program uses 60 bit digits instead (except the top digit, which is used as fully as possible).  This allows up to 3 overflow bits (see below) to accumulate safely without needing to continuously check for overflow, and also lets a simplified base 10 printing routine work by using the 4 bit overflow area.

Both programs use a 4 way switch to determine which computation to use.
  • n = 4q+0: shift 2 bits to the right.
  • n = 4q+1: calculate 3q+1 by doing n/2 + n/4 + 1, in integers.
  • n = 4q+2: calculate 3q+2 by doing n/2 + n/4 + 1, in integers.
  • n = 4q+3: calculate 9q+8 by doing 2n + n/4 + 2, in integers (and check against potential overflow and switch to multiprecision arithmetic if needed).
That is, the program calculates more than one iteration of T(n) in each step.  As you can see, cases 4q+1 and 4q+2 are handled the same way, so the C switch has 3 cases: 3, 2 and 1, and 0.  Alas, the original program did not have default cases in its switches, and GCC 4.2 wasn't smart enough to realize the switches already handled every possible value of x & 3 (maybe some spec detail prevents GCC from applying such smarts, or in practice it's handled as a case of garbage in, garbage out).  The result is that, per C semantics, the compiler generated code for the possibility that x & 3 matched none of the cases.  After adding a default case to the switches, the compiler removed a bunch of cmp, jmp, ja etc instructions from many places.  A 1060 byte function lost about 50 bytes' worth of those unnecessary instructions, and the program became about 7% faster too.
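As a hedged sketch (overflow checks and the switch to multiprecision arithmetic elided, and the real program surely differs in detail), the dispatch with an explicit default can look like this:

#include <stdint.h>

static uint64_t step_twice(uint64_t n)
{
    switch (n & 3) {
    case 0:                               /* n = 4q:   T^2(n) = q     */
        return n >> 2;
    case 1:                               /* n = 4q+1: T^2(n) = 3q+1  */
    case 2:                               /* n = 4q+2: T^2(n) = 3q+2  */
        return (n >> 1) + (n >> 2) + 1;
    default:                              /* n = 4q+3: T^2(n) = 9q+8  */
        return 2 * n + (n >> 2) + 2;      /* overflow check elided    */
    }
}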

Disassembly of the GCC emitted code does not necessarily suggest that rewriting the calculation code directly in, say, 64 bit Intel assembler would result in very significant speedups at this time, because the parts that might benefit from instructions like adc aren't that long.  Granted, using assembler would allow using 64 bit digits even during multiprecision calculations.  However, the resulting code would need numerous adc instructions to propagate carries, whereas the current 60 bit digit code handles those carries with just two instructions: an and, plus a shl/shr.  In addition, the current C code is portable, while the assembly code would not be.

Speaking of the and instructions, I also wrote the code carefully so their arguments always fit within a byte.  Consequently, the compiler can use the compact Intel encodings.  I did the same with the sieve hashed table packing by choosing where to put what chunks of information, as well as with the streaming sieve accesses (both reads during use and writes during creation).  I also ensured the single and multiprecision computation order allowed the compiler to use the lea instruction when possible.  Using lea also eliminates the need for some add, shl, and mov instructions.

I automated the GCC builds with makefiles.  In fact, I also automated the process of running a first compile of the program with instrumented branches, then recompiling the program using the resulting profile information (that is, profile guided optimization).  Doing so gained another 7% or so.

In addition, as of a while ago, I also multithreaded the main calculation section of the program using POSIX threads.  Using 4 threads and a 2^30 sieve, I estimate the program will now be able to crunch through 60*10^12 numbers in a bit less than 2 days.  Compare that to the following:
  • First version of the new program, single threaded, 2^20 sieve size, 60*10^12 numbers: about 11 days' CPU time.
  • Estimated original program, single threaded, 2^20 sieve size, 20*10^12 numbers: about 7 days' CPU time.
Bang!

So what's next?  The program still needs a rigorous shakeout and a significant amount of auditing to ensure it is actually correct in every case, despite the fact that it can already reproduce previously verified results.  We'll see what comes next after that work is complete.  More to come...

Saturday, August 10, 2013

Excess of poor quality feedback

It wouldn't be too hard to assume a general dislike of negative feedback.  Moreover, it's rather easy to see how we're constantly bombarded with (potentially) positive feedback in the form of ads: do this, and good stuff happens to you.  Add to this mix the seemingly reasonable maxim "do unto others as you'd like done unto you".  When put together, do these imply we're not exercising our capability to judge on merits alone?  Are we censoring our public opinions so that we only mention positives, ignore the negatives, and counterbalance the negatives with positives as well?

Well, apparently, yes.

Consider now what happens when that mentality clashes with a rational experience of reality.  Is that programming language good?  Is that math stuff correct?  Does that critical bit of software work?  Let's hear the comments of the team on that, right?  Hmmm...

Friday, August 09, 2013

Smalltalks 2013 invitation

The Fundación Argentina de Smalltalk (FAST, http://www.fast.org.ar) invites you to the 7th International Conference on Smalltalk Technologies (Smalltalks), to be held from October 30th through November 1st at the Universidad Tecnológica Nacional, Facultad Regional Rosario, located in Rosario, Argentina.

Everyone, including teachers, students, researchers, developers and entrepreneurs, is welcome as a speaker or attendee. Registration is free and now open at http://www.fast.org.ar/smalltalks2013.

This year we are accepting donations from participants to help fund the conference's costs. Please see the Donate section at FAST's website, http://www.fast.org.ar/smalltalks2013/donate. Contributions are greatly appreciated, and can be received both in pesos for local attendees, as well as via Paypal for those coming from abroad. Donors will receive a set of thank you gifts as well as the conference's shirt.

The goal of the Conference is to strengthen the bonds between the Argentine and the international Smalltalk community through the exchange of works, experiences and anecdotes connected with this technology and related matters.

As in the previous editions, renowned members of the international Smalltalk community will visit us, and like last year we will have a Research Session with publications reviewed by an international committee.

Also, the traditional Industry Track will be available for those who prefer a more practical environment. You can submit papers or talks now through our website at

http://www.fast.org.ar/smalltalks2013/technical-session

The Industry Track's submission deadline is October 7th.

For more information about related events, such as a Pharo Sprint or talks and presentations around the dates of the conference, please visit

http://www.fast.org.ar/smalltalks2013/events

We will update related event information as we get closer to the conference, so please check for updates.

As in previous years, we will have a social dinner during the conference. If you are planning to attend, please reply to this email with a subject of "[Social Dinner]" and indicate how many people will be coming in your party. We will have a variety of dishes available for everyone. If you have special needs other than vegetarian dishes, please let us know.

For any additional questions please contact us at our info email address at fast.org.ar.

See you there!

Monday, August 05, 2013

Skype hangs due to removable media

So... on OS X, you can get Skype 6.6.0.467 (as well as the latest version 6.7.0.343) to hang all the time by doing the following.

  1. Mount a removable file system, such as a disk image.
  2. Send a file from that removable file system to someone else.
  3. Close Skype --- this is just so that you can...
  4. Eject the removable file system.
  5. Reopen Skype, and look at the chat history for the contact you sent the file to.
Skype then hangs.  However, if at this point you remount the removable file system and reopen Skype, then the chat history for your contact works and Skype doesn't hang.

It looks like the complete path of such files is stored in main.db under ~/Library/Application Support/Skype.  That file looks like sqlite, so no you don't get to deal with a text file you can edit with vi and at least work around the problem.

So how do you work around the issue?  Do you, like, delete all history for all contacts?  Or delete all history for that contact?  And you have to do this simply because you sent a contact a file from a removable device that has since been removed?  And you have to keep in mind that you can only remove the removable device after closing Skype because otherwise you can't remove the removable device?  Nobody thought those file paths might no longer be accessible at some point in a (say) 5 year chat history?

Seriously?

Saturday, June 15, 2013

Internet "speed tests" generally unhelpful

So let's say you are having internet issues that look like traffic drops.  The ISP's technician runs a speed test and shows you that your connection speed is ok.  But you know from your previous runs that the results are unstable.  One reason is that several popular speed test sites measure speed by receiving files (download speed) and sending files (upload speed).  Some examples include http://www.speedtest.net/, http://www.speakeasy.net/speedtest/, http://speedtest.comcast.net/, http://www.bandwidthplace.com/, and http://whatismyipaddress.com/speed-test.  The issue with this procedure is that such tests also end up measuring file I/O: in particular, how your browser does file I/O, and how your operating system caches file I/O.

Yes yes I know your SSD drive is fast, and yes yes your hardware RAID card is awesome.  But why should there be file[system] I/O at all when doing a network test?  Here's the net effect I have observed on OS X with Firefox, on a fast machine with plenty of RAM to cache disk reads.

First, there is the issue of where Firefox puts the files, which is the browser's cache.  If your cache is full, then Firefox will have to make space in the browser's cache as it's writing the file during the speed test.  The cleanup operation will cause random file I/O in the cache's hash bucketed directory structure to find the files, and then more random file I/O to delete the files.  OS X doesn't like caching a lot of disk writes while deleting files, so this means file I/O checkpoints.  If your file system is POSIX, that also means a number of atime updates as well.  Finally, I assume something will have to be done in the database that's holding which URLs correspond to what files.  And what if the cache is full of tiny (less than 4kb) files, as it usually is?  Then all this file I/O will happen every 4kb or less during your speed test measured in megabytes per second.  Oh, and did I mention there has to be file I/O to write the file to the browser's cache too?

Well ok you could at least clear the cache before beginning.  Try deleting 250k files out of a 1gb browser cache on OS X.  With an SSD drive, that can take a significant fraction of a minute (top speed on the order of 2k IOPS, usually less than 1k IOPS).  With a regular hard drive, that will cost you half an hour or more (top speed on the order of 300 IOPS, and a lot of audible seeking --- as if the OS was flushing disk buffers continuously).

But even if you flush the cache so the browser can write without deleting anything, that's not a guarantee of a proper test because OS X also buffers writes until the buffer fills, and then further I/O is effectively blocked until the buffers are flushed to disk.  You can see this effect by using the activity monitor app during the test.  The writes will flatline at zero most of the time, and at some point the buffers will be flushed (at hopefully much higher speeds than the speed test itself is receiving data).  While the flushing happens, most speed tests seem to get stuck and their fancy graphics don't update, etc.

I've yet to find a browser based speed test that simply allocates memory to do these things.  I'm not saying none exists.  But speed tests like the ones above are not very useful because of the problems just described.

Generally, the issue with the Firefox cache induced file I/O can be improved dramatically by creating an ExFAT disk image and mounting it where Firefox expects the hash bucketed directory structure to be.  Clearing the browser's cache is vastly improved with this approach.  Whereas it can take half an hour or more with a regular drive to delete 250k files and 1gb of cached data, the disk image approach can do the same thing in a few seconds.  Effectively, you get SSD performance out of your mechanical drive simply because you coax software to do a better job of file I/O.  The only hiccup will occur when OS X decides to flush disk buffers, at which point the machine might seem stuck for a couple seconds.

Can we please fix these underlying issues so that we can get proper performance out of SSD drives too?

Update: a Comcast technician suggests http://testmy.net, which doesn't look like it's writing stuff to the browser's cache.  Nice!

Saturday, April 20, 2013

Excess of poor quality communication

Check out this article.  The short version of it is: undertaking complicated tasks in social networks frequently gets out of hand and produces spectacularly bad results because the urge to have specific social interactions overrides the goal of doing a good job.  The article uses the example of the Boston bombing, and how various groups of people came up with an assortment of pretty bad ideas while recklessly framing innocent people along the way.  Here are some of the article's key phrases, summed up in a paragraph:

This is one of the most alarming social media events of our time.  We're really good at uploading images and unleashing amateurs, but we're not good with the social norms that would protect the innocent.  People in the moment want to participate.  They want to be a part of what's going on.  But beyond the photos they upload, their speculation and theorizing don't necessarily lead to a more efficient resolution.  There is just a lot of meaningless noise out there.  People see trends and patterns that aren't really trends and patterns.  People love to speculate and some people love to make the Web equivalent of crank calls.  The instinct is to satisfy our voyeuristic urges.  That's when we see the arrogance of the crowd take over.

Compare and contrast with the opinion of an unnamed Reddit user:

I feel like we've reached a certain threshold here — the Internet is finally outstripping cable news completely [...] In fact, I wonder if we're inadvertently doing their work for them.

Also check out what happened in the Kasparov vs The World game, in which a crowd of chess players exchanged ideas in an open forum and voted (by majority) on moves to play against Kasparov.  Kasparov was reading the thread, so perhaps the result of the game isn't entirely fair; even so, time and time again the crowd failed to consider and evaluate significantly better moves under social stress.

In this state of affairs, consider then the majority of posts you see in programming forums.  Are the opinions truly knowledgeable, or are the entries a shot from the hip coming from a programmer "rock star"?  In my experience, 99% of such entries are in error simply because the stated opinion does not match the relevant manual or source code.  Moreover, in most cases a cursory examination of the relevant authoritative sources is enough to find errors.  I could understand the occasional slip in terms of the human condition.  I really suspect the significant issue is the absence of an honest effort.

It is really hard to produce properly thought out material worth reading.  Considering the sheer volume of technical communication made possible by the internet and social networks, it's clearly impossible for a significant fraction of that communication to be correct or worth reading.  It just can't be: if it were, the fraction of truly diligent experts (like Knuth) would have to be much higher, and then how come there are so many bugs in the programs we use every day?

So please, if you are not really sure of what you want to say, then do one of two things:
  • Mark it clearly with "I haven't checked this", or "I am not sure", or "I don't really know".
  • Given that the above indicates the communication is already kind of worthless, wait until an opportunity to offer better advice comes by.
And then, what do you do with the time you now have available?  As a suggestion, use that time to do something productive for yourself.  Or make something out there more correct, smaller, or easier to understand. But please, avoid littering the material others might want to use in the pursuit of their goals. It is even harder to do something worthwhile when finding relevant, good quality information requires sorting out gold needles from a haystack of spam. Specifically, the programming craft would be better off without the kinds of "social" efforts described above.

Saturday, March 23, 2013

HPS source size over time

When I joined Cincom in 2007, source code cleanup for our VM was one of the first priorities for engineers because cruft was getting too much in our way.  This cleanup might sound easy to do, but it takes a large amount of effort precisely because so much cruft had accumulated.  What tends to happen is that one thing leads to another and all of a sudden what was meant as a removal of an obsolete feature involves investigating how every compiler in use reacts to various code constructs.  And then you also find bugs that had been masked by the code you're trying to remove, and these require even more time to sort out.

From a features point of view, it's unrewarding work because at the end of the day you have what you had before.  The key difference, which becomes observable over time, is that when you put in this kind of work then support call volume starts going down, random crashes stop happening, and the code cruft doesn't get in your way so you don't even have to research it.  As a result, the fraction of time you can spend adding new features goes up.

We've been at it for over 6 years now, and we can clearly see the difference in our everyday work.  Take a look:

  • VW 7.4 (2005): 337776 LOC.
  • VW 7.4 (2005): 338139 LOC.
  • VW 7.4a (2006): 358636 LOC.
  • VW 7.4b (2006): 358957 LOC.
  • VW 7.4c (2006): 359419 LOC.
  • VW 7.4d (2006): 358782 LOC.
  • VW 7.5 (2007): 358921 LOC.
  • VW 7.6 (2007): 357264 LOC.
  • VW 7.6a (2008): 350093 LOC.
  • VW 7.7 (2009): 345618 LOC.
  • VW 7.7a (2010): 270093 LOC.
  • VW 7.7.1 (2010): 270124 LOC.
  • VW 7.7.1a (2010): 270119 LOC.
  • VW 7.8 (2011): 261580 LOC.
  • VW 7.8a (2011): 261611 LOC.
  • VW 7.8b (2011): 261739 LOC.
  • VW 7.8.1a (2011): 261748 LOC.
  • VW 7.9 (2012): 252309 LOC.
  • VW 7.10 (2013): 240880 LOC (March).
As you can see, from the high water mark of 359419 LOC to today's 240880 LOC, we have removed 33% of the VM's source code.  Think of it: wouldn't it be nice to drop a third of your source code while killing a ton of bugs and adding new features?  Also, using the standard measuring stick of "400 page book" = "20k LOC", we can see HPS went from requiring 18 to 12 books.

We still have more code deletions queued up for 7.10, which are associated with various optimizations and bug fixes.  With a bit of luck, we'll reach 12k LOC (that is, about 240 printed pages) deleted in this release cycle.

Update: we went into code freeze in preparation for 7.10.  Here's an update on the code deletion.
  • VW 7.10 (2013): 240368 LOC (April code freeze).
Another 500 LOC bit the dust since March, and the VM executable became a couple kilobytes smaller too.  Finally, we're at basically 12k LOC deleted for the whole release.

Update 2: we gained about 500 LOC for 7.10, but only in exchange for IPv6 functionality.  I'll update the LOC count later since we know there are a few more fixes that will go in.