Saturday, June 15, 2013

Internet "speed tests" generally unhelpful

So let's say you are having internet issues that look like traffic drops.  The ISP's technician runs a speed test and shows you that your connection speed is ok.  But you know from your previous runs that the results are unstable.  One reason is that several popular speed test sites measure speed by receiving files (download speed) and sending files (upload speed).  Some examples include http://www.speedtest.net/, http://www.speakeasy.net/speedtest/, http://speedtest.comcast.net/, http://www.bandwidthplace.com/, and http://whatismyipaddress.com/speed-test.  The issue with this procedure is that it also ends up measuring file I/O and, in particular, how your browser does file I/O and how your operating system caches file I/O.

Yes yes I know your SSD drive is fast, and yes yes your hardware RAID card is awesome.  But why should there be file[system] I/O at all when doing a network test?  Here's the net effect I have observed on OS X with Firefox, on a fast machine with plenty of RAM to cache disk reads.

First, there is the issue of where Firefox puts the files, which is the browser's cache.  If your cache is full, then Firefox will have to make space in the browser's cache as it's writing the file during the speed test.  The cleanup operation will cause random file I/O in the cache's hash bucketed directory structure to find the files, and then more random file I/O to delete the files.  OS X doesn't like caching a lot of disk writes while deleting files, so this means file I/O checkpoints.  If your file system is POSIX, that also means a number of atime updates.  Finally, I assume something will have to be done in the database that tracks which URLs correspond to which files.  And what if the cache is full of tiny (less than 4 KB) files, as it usually is?  Then all this file I/O will happen every 4 KB or less during a speed test measured in megabytes per second.  Oh, and did I mention there has to be file I/O to write the file to the browser's cache too?

Well ok, you could at least clear the cache before beginning.  Try deleting 250k files out of a 1 GB browser cache on OS X.  With an SSD drive, that can take a significant fraction of a minute (top speed on the order of 2k IOPS, usually less than 1k IOPS).  With a regular hard drive, that will cost you half an hour or more (top speed on the order of 300 IOPS, and a lot of audible seeking --- as if the OS were flushing disk buffers continuously).

But even if you flush the cache so the browser can write without deleting anything, that's not a guarantee of a proper test because OS X also buffers writes until the buffer fills, and then further I/O is effectively blocked until the buffers are flushed to disk.  You can see this effect by using the activity monitor app during the test.  The writes will flatline at zero most of the time, and at some point the buffers will be flushed (at hopefully much higher speeds than the speed test itself is receiving data).  While the flushing happens, most speed tests seem to get stuck and their fancy graphics don't update, etc.

I've yet to find a browser based speed test that simply allocates memory to do these things.  I'm not saying none exists.  But speed tests such as the ones listed above are not useful, because their results mix network throughput with file I/O behavior.
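As a rough illustration of what "simply allocates memory" could mean outside the browser, here is a minimal C sketch.  It assumes a POSIX system and a file descriptor (for example a connected socket) that is already open; the names drain_to_memory and megabytes_per_second are made up for this example.  The received data goes into one reused memory buffer and is thrown away, so no file I/O is involved in the measurement.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Read from fd until end of stream, reusing a single in-memory buffer and
   discarding the data.  Returns the number of bytes received, or -1 on error.
   Hypothetical helper: fd is assumed to be an already connected socket or pipe. */
int64_t drain_to_memory(int fd)
{
    static unsigned char buffer[64 * 1024];
    int64_t total = 0;

    for (;;) {
        ssize_t got = read(fd, buffer, sizeof buffer);
        if (got < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal, try again */
            return -1;
        }
        if (got == 0)
            return total;      /* end of stream */
        total += got;          /* data is counted and then overwritten, never stored */
    }
}

/* Convert a byte count and an elapsed time into megabytes per second. */
double megabytes_per_second(int64_t bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes / 1e6 / seconds : 0.0;
}
```

The caller would take a timestamp before and after drain_to_memory (for instance with clock_gettime) and pass the difference to megabytes_per_second.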

Generally, the issue with the Firefox cache induced file I/O can be improved dramatically by creating an ExFAT disk image and mounting it where Firefox expects the hash bucketed directory structure to be.  Clearing the browser's cache is vastly improved with this approach.  Whereas it can take half an hour or more with a regular drive to delete 250k files and 1 GB of cached data, the disk image approach can do the same thing in a few seconds.  Effectively, you get SSD performance out of your mechanical drive simply because you coax software to do a better job of file I/O.  The only hiccup will occur when OS X decides to flush disk buffers, at which point the machine might seem stuck for a couple of seconds.

Can we please fix these underlying issues so that we can get proper performance out of SSD drives too?

Update: a Comcast technician suggests http://testmy.net, which doesn't look like it's writing stuff to the browser's cache.  Nice!

Saturday, April 20, 2013

Excess of poor quality communication

Check out this article.  The short version of it is: undertaking complicated tasks in social networks frequently gets out of hand and produces spectacularly bad results because the urge to have specific social interactions overrides the goal of doing a good job.  The article uses the example of the Boston bombing, and how various groups of people came up with an assortment of pretty bad ideas while recklessly framing innocent people along the way.  Here are some of the article's key phrases, summed up in a paragraph:

This is one of the most alarming social media events of our time.  We're really good at uploading images and unleashing amateurs, but we're not good with the social norms that would protect the innocent. People in the moment want to participate.  They want to be a part of what's going on.  But beyond the photos they upload, their speculation and theorizing don't necessarily lead to a more efficient resolution.  There is just a lot of meaningless noise out there.  People see trends and patterns that aren't really trends and patterns.  People love to speculate and some people love to make the Web equivalent of crank calls.  The instinct is to satisfy our voyeuristic urges.  That's when we see the arrogance of the crowd take over.

Compare and contrast with the opinion of an unnamed Reddit user:

I feel like we've reached a certain threshold here — the Internet is finally outstripping cable news completely [...] In fact, I wonder if we're inadvertently doing their work for them.

Also check out what happened in the Kasparov vs The World game, in which a crowd of chess players exchanged ideas in an open forum and voted (by majority) on moves to play against Kasparov.  Despite the fact that Kasparov was reading the thread, so perhaps the result of the game isn't entirely fair, time and time again crowd sourcing failed to consider and evaluate significantly better moves under social stress.

In this state of affairs, consider then the majority of posts you see in programming forums.  Are the opinions truly knowledgeable, or are the entries a shot from the hip coming from a programmer "rock star"?  In my experience, 99% of such entries are in error simply because the stated opinion does not match the relevant manual or source code.  Moreover, in most cases a cursory examination of the relevant authoritative sources is enough to find errors.  I could understand the occasional slip in terms of the human condition.  I really suspect the significant issue is the absence of an honest effort.

It is really hard to produce properly thought out material worth reading.  Considering the sheer volume of technical communication made possible by the internet and social networks, it's clearly impossible that a significant fraction of such technical communication can be correct or worth reading.  It just can't be, because otherwise the fraction of true diligent experts (like Knuth) would be much higher, so then how come there are so many bugs in the programs we use every day?

So please, if you are not really sure of what you want to say, then do one of two things:
  • Mark it clearly with "I haven't checked this", or "I am not sure", or "I don't really know".
  • Given that the above indicates the communication is already kind of worthless, wait until an opportunity to offer better advice comes by.
And then, what do you do with the time you now have available?  As a suggestion, use that time to do something productive for yourself.  Or make something out there more correct, smaller, or easier to understand. But please, avoid littering the material others might want to use in the pursuit of their goals. It is even harder to do something worthwhile when finding relevant, good quality information requires sorting out gold needles from a haystack of spam. Specifically, the programming craft would be better off without the kinds of "social" efforts described above.

Saturday, March 23, 2013

HPS source size over time

When I joined Cincom in 2007, source code cleanup for our VM was one of the first priorities for engineers because cruft was getting too much in our way.  This cleanup might sound easy to do, but it takes a large amount of effort precisely because so much cruft had accumulated.  What tends to happen is that one thing leads to another and all of a sudden what was meant as a removal of an obsolete feature involves investigating how every compiler in use reacts to various code constructs.  And then you also find bugs that had been masked by the code you're trying to remove, and these require even more time to sort out.

From a features point of view, it's unrewarding work because at the end of the day you have what you had before.  The key difference, which becomes observable over time, is that when you put in this kind of work then support call volume starts going down, random crashes stop happening, and the code cruft doesn't get in your way so you don't even have to research it.  As a result, the fraction of time you can spend adding new features goes up.

We've been at it for over 6 years now, and we can clearly see the difference in our everyday work.  Take a look:

  • VW 7.4 (2005): 337776 LOC.
  • VW 7.4 (2005): 338139 LOC.
  • VW 7.4a (2006): 358636 LOC.
  • VW 7.4b (2006): 358957 LOC.
  • VW 7.4c (2006): 359419 LOC.
  • VW 7.4d (2006): 358782 LOC.
  • VW 7.5 (2007): 358921 LOC.
  • VW 7.6 (2007): 357264 LOC.
  • VW 7.6a (2008): 350093 LOC.
  • VW 7.7 (2009): 345618 LOC.
  • VW 7.7a (2010): 270093 LOC.
  • VW 7.7.1 (2010): 270124 LOC.
  • VW 7.7.1a (2010): 270119 LOC.
  • VW 7.8 (2011): 261580 LOC.
  • VW 7.8a (2011): 261611 LOC.
  • VW 7.8b (2011): 261739 LOC.
  • VW 7.8.1a (2011): 261748 LOC.
  • VW 7.9 (2012): 252309 LOC.
  • VW 7.10 (2013): 240880 LOC (March).
As you can see, from the high water mark of 359419 LOC to today's 240880 LOC, we have removed 33% of the VM's source code.  Think of it: wouldn't it be nice to drop a third of your source code while killing a ton of bugs and adding new features?  Also, using the standard measuring stick of "400 page book" = "20k LOC", we can see HPS went from requiring 18 to 12 books.

We still have more code deletions queued up for 7.10, which are associated with various optimizations and bug fixes.  With a bit of luck, we'll reach 12k LOC (that is, about 240 printed pages) deleted in this release cycle.

Update: we went into code freeze in preparation for 7.10.  Here's an update on the code deletion.
  • VW 7.10 (2013): 240368 LOC (April code freeze).
Another 500 LOC bit the dust since March, and the VM executable became a couple kilobytes smaller too.  Finally, we're at basically 12k LOC deleted for the whole release.

Update 2: we gained about 500 LOC for 7.10, but only in exchange for IPv6 functionality.  I'll update the LOC count later since we know there are a few more fixes that will go in.

Monday, December 24, 2012

Update on Fundamentals volume 2

I've had some free time lately, so I went back to the multithreading chapter of Fundamentals volume 2.  I wrote several pages, and made good progress towards finishing the chapter.  The draft is now 198 pages.

Update on memory manager work

A while ago we were going over the memory manager changes I've been working on lately.  Among other things, I rewrote and optimized the OT and data compactors.  I knew the new code had to be significantly faster just from an algorithm analysis point of view.  But we hadn't measured the actual impact yet, so we just did.  Below is the run time, in seconds, for one of our memory policy stress tests.

  • Old VM: 550 seconds.
  • New VM: 452 seconds.
Going by these figures, the new code runs through the test about 21.7% faster (550 / 452 ≈ 1.217), which is about 17.8% less run time.  Note this is just a preliminary result for code that has not been fully reviewed, much less integrated, at this time (and your mileage may vary, etc).  But still, that's yet another significant performance increase for the HPS memory manager on top of everything else...

Large integer primitive improvements

Recently we had to go through the large integer primitives because the C type "long" does not mean the same thing on all our 64 bit platforms.  This type was used by the GCD primitives, so we had to audit that code.  Right, that code...
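To illustrate the "long" problem: on LP64 platforms (most 64 bit Unix systems) long is 64 bits wide, while on LLP64 platforms (64 bit Windows) it is 32 bits wide.  The sketch below is not HPS code; mp_digit, mp_double_digit and mul_digit are hypothetical names showing how multiprecision arithmetic can use fixed width types from <stdint.h> so that it does not depend on the width of long.

```c
#include <limits.h>
#include <stdint.h>

/* Width of `long` in bits on whatever platform this is compiled for.
   Typically 64 on LP64 systems and 32 on LLP64 systems. */
static unsigned long_width_bits(void)
{
    return (unsigned)(sizeof(long) * CHAR_BIT);
}

/* Hypothetical digit types: exactly 32 and 64 bits on every platform. */
typedef uint32_t mp_digit;
typedef uint64_t mp_double_digit;

/* Multiply two digits and add a carry.  The full result always fits in
   64 bits, since (2^32 - 1)^2 + (2^32 - 1) = 2^64 - 2^32.
   Returns the low digit and stores the high digit in *carry_out. */
static mp_digit mul_digit(mp_digit a, mp_digit b, mp_digit carry_in,
                          mp_digit *carry_out)
{
    mp_double_digit product = (mp_double_digit)a * b + carry_in;
    *carry_out = (mp_digit)(product >> 32);
    return (mp_digit)product;
}
```

With types like these, the same source behaves identically no matter what long_width_bits() reports.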

  • The code uses a hybrid of two different multiprecision GCD algorithms from Knuth.
  • Although it seems to work, the code comments do not have a proof for the correctness of the code.
That was an interesting bit of code audit.  The implementation uses Algorithm L, except that the multiprecision division is replaced with a customized implementation of Algorithm B.  It's quite complex because Knuth glosses over what happens when you use signed types to implement these algorithms.  So... why does all this stuff work, exactly?...
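For reference, and assuming "Algorithm B" here means Knuth's binary GCD algorithm (TAOCP section 4.5.2), the single word case can be sketched as follows.  This is not the HPS implementation, which is multiprecision and customized; the sketch uses unsigned arithmetic throughout precisely to sidestep the signed type questions mentioned above, and binary_gcd is a made-up name.

```c
#include <stdint.h>

/* Binary GCD on unsigned 64 bit values: only shifts, comparisons and
   subtractions, no division. */
uint64_t binary_gcd(uint64_t u, uint64_t v)
{
    if (u == 0) return v;
    if (v == 0) return u;

    /* Factor out the powers of two common to both values. */
    unsigned shift = 0;
    while (((u | v) & 1) == 0) {
        u >>= 1;
        v >>= 1;
        shift++;
    }

    /* Make u odd; any remaining factors of two in u are not common. */
    while ((u & 1) == 0)
        u >>= 1;

    do {
        /* Make v odd as well, then subtract the smaller from the larger. */
        while ((v & 1) == 0)
            v >>= 1;
        if (u > v) {
            uint64_t t = u;
            u = v;
            v = t;
        }
        v -= u;   /* both odd, so the difference is even (or zero) */
    } while (v != 0);

    /* Restore the common powers of two. */
    return u << shift;
}
```

Because u never exceeds v at the subtraction, the difference cannot underflow, which is the kind of invariant a correctness proof has to pin down.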

In any case, we deleted the usual few hundred LOC, wrote a new ultra paranoid set of tests, and we also produced a proof that says the code should work.  We didn't stop there either: we also threw out a bunch of big endian related code, and we also improved the performance for some of our big endian platforms (up to 25-30% speedup, depending on the use).

Moving along in this department, too...

Update: now up to 37% faster.

Sunday, December 23, 2012

More memory management work

Work on the HPS memory manager doesn't stop, or so it seems.  Earlier today I finished rewriting the object table compactor code.  The result is about 200 LOC gone, and a few kilobytes less executable code.  And since the code is written more clearly, it's far easier to produce a proof that says the code actually works --- and if the proof is wrong then it should be much easier to figure out why.  Other bits of work include refactoring the remember table implementation (more deleted code), and a fix / optimization for 64 bits.

By the way, we also had a cleanup pending for the large integer primitives.  We took advantage, improved big endian platform performance by double digit percentages, and deleted another few hundred LOC.

We also have a few hundred new tests for all this stuff.  Moving along...

Saturday, November 24, 2012

Smalltalks 2012 videos

See here for Smalltalks 2012's videos... they will be posted to the playlist as they become available.  Enjoy!

Wednesday, November 21, 2012

Smalltalks 2012 photos

Hello!... here are some photo sets from Smalltalks 2012.

In chronological order, first we have our usual visit to Tigre.  Then, we had a pre-Smalltalks event at Trelew.  After that, the conference proper on November 7, November 8, and November 9.  After the conference, some of us went on a day trip to explore a bit of Patagonia.  Finally, the day after, we had a small break before flying back to Buenos Aires.

Videos will be coming shortly...

Friday, October 12, 2012

Duff's device implementation details

Typically one writes a Duff's device to copy data as shown on Wikipedia, i.e. using *to++ = *from++ for each step.  Most likely a compiler dealing with *to++ = *from++ will emit 4 instructions: a load, a store, and two additions to pointers likely stored in registers.  But, for example, if you have an 8 case device, you can arrange things so that the pointers are incremented by the slack iterations before the switch and then do

  • *(to-8) = *(from-8);
  • *(to-7) = *(from-7);
  • ...
  • *(to-1) = *(from-1);
  • if not done, increment to and from by 8 and go to the top;
With the explicit offsets above, each step requires just a load and store with fixed offsets that a reasonable CPU with instructions such as mov rax, [esi+offset] calculates on the fly.  The resulting Duff's device is now roughly half the size in assembler instructions.
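Here is one way the steps above could look in C.  This is a sketch, not the code from any particular VM: the function name and the choice of int elements are assumptions made for the example, and it handles a count of zero explicitly since the device itself assumes at least one element.

```c
#include <stddef.h>

/* Copy count elements using an 8 case Duff's device whose body uses fixed
   negative offsets instead of *to++ = *from++. */
void duff_copy_fixed_offsets(int *to, const int *from, size_t count)
{
    if (count == 0)
        return;

    size_t slack = count % 8;                  /* size of the partial first pass */
    size_t passes = count / 8 + (slack != 0);  /* trips through the unrolled body */
    size_t first = slack ? slack : 8;          /* a multiple of 8 starts with a full pass */

    /* Advance both pointers past the first pass up front, so the elements
       of that pass sit at offsets -first .. -1. */
    to += first;
    from += first;

    switch (slack) {
    case 0: do { to[-8] = from[-8];
    case 7:      to[-7] = from[-7];
    case 6:      to[-6] = from[-6];
    case 5:      to[-5] = from[-5];
    case 4:      to[-4] = from[-4];
    case 3:      to[-3] = from[-3];
    case 2:      to[-2] = from[-2];
    case 1:      to[-1] = from[-1];
                 if (--passes == 0)
                     break;     /* exits the do loop; the switch ends right after */
                 to += 8;       /* one pointer bump per 8 copies */
                 from += 8;
            } while (1);
    }
}
```

Since the pointers are only bumped when another pass follows, they never move further than one element past the end of the arrays.  Whether this actually beats the plain version depends on the compiler and CPU, so it is worth checking the generated assembler.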