Monday, December 01, 2008

On implicit self, v2.1 --- size of keywords

Vassili commented that most of the time, things like keywords are small enough that the $: (or '::') fits within the bounds of our fovea, so we recognize words at a glance rather than scanning them.

A word is taken in at once in one saccade, and a terminating colon is perceived immediately. You don't have to "scan to the end" to see if a word is an identifier or a keyword, you just see it. Even if you don't believe it. The exceptions are words too long to fit in the fovea (more than 10 characters or so under the usual reading conditions), but even so a colon falls in the right parafoveal area and is still recognizable thanks to its distinctive shape (some typefaces can make it easier or harder than others).

So I decided to run an experiment. Evaluating the following code (sorry for the nasty variable names)

(ByteSymbol allInstances collect: [:x | x keywords]) inject: Bag new into: [:t :x | x do: [:y | t add: y size]. t yourself]


and then digging through the bag's dictionary reveals the following.

  • There were 73408 keywords in 62107 symbols.
  • Sorting the keyword sizes by frequency (to see which sizes occur most often) shows that the 11 most common sizes (11, 10, 12, 14, 9, 13, 15, 16, 17, 5 and 8, in that order) account for 36789 keywords, about 50% of the total. Those keywords have an arithmetic mean size of 11.85 characters. The corresponding harmonic mean keyword size is 10.63.
  • The arithmetic mean keyword size for all keywords is 16.94 characters. The corresponding harmonic mean keyword size is 10.79 characters.
So it seems that at least half of the time we're pushing keyword sizes past 10 characters. Even the harmonic mean, which damps the influence of very long outliers rather than removing them, results in 10.79 characters per keyword*.
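
For anyone who wants to reproduce the two means without digging through the bag's internals, here is a minimal sketch. It assumes the same dialect as the expression above (ByteSymbol is specific to it; other dialects may need Symbol allInstances instead), and the guard against empty keywords is my own addition, not part of the original experiment.

```smalltalk
| sizes sum reciprocalSum |
"Collect one entry per keyword size, exactly as the experiment does."
sizes := Bag new.
ByteSymbol allInstances do: [:symbol |
    symbol keywords do: [:keyword |
        "Assumption: skip empty keywords so that 1 / size never divides by zero."
        keyword isEmpty ifFalse: [sizes add: keyword size]]].

"A Bag enumerates each element once per occurrence, so these are sums over all keywords."
sum := sizes inject: 0 into: [:acc :each | acc + each].
reciprocalSum := sizes inject: 0 into: [:acc :each | acc + (1 / each)].

Transcript
    show: 'arithmetic mean: ', (sum / sizes size) asFloat printString; cr;
    show: 'harmonic mean: ', (sizes size / reciprocalSum) asFloat printString; cr.
```

The sums are kept as exact integers and fractions and only converted to floats for printing. The numbers will of course depend on the image in which this is evaluated.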

It seems reasonable to suspect that most of the words we use when programming do not fit in the fovea under the usual reading conditions**. What are the implications for reading the code in which these words appear? What should be concluded here?

* In the sampled image. Your mileage may vary. I wish I could run the stats on large-scale applications to see what numbers occur in the field. From my direct experience, I would not be surprised to see harmonic mean values of 15 or larger.

** Full data in array format. An integer j at index k says keyword size k occurred j times. All sizes from 1 to 78 occurred at least once. Consequently, the array has 78 integers. #(261 388 2112 1389 2992 2210 2536 2944 3471 3561 3756 3536 3410 3510 3372 3185 3052 2740 2436 2423 2016 1733 1652 1319 1154 1050 855 842 721 631 607 576 562 547 468 445 466 439 382 381 352 323 334 266 277 238 210 181 182 136 122 112 84 65 39 54 35 32 33 35 27 16 16 15 9 16 15 12 7 6 11 6 2 4 2 1 2 1)
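
The array is enough to recompute the summary figures without access to the sampled image. The following is a rough sketch that should work in any workspace. The variable names are mine, and the share of keywords longer than 10 characters is an extra figure derived from the same data.

```smalltalk
| counts total weightedSum reciprocalSum longer |
"counts at: k is the number of keywords of size k, as published above."
counts := #(261 388 2112 1389 2992 2210 2536 2944 3471 3561 3756 3536 3410 3510 3372 3185 3052 2740 2436 2423 2016 1733 1652 1319 1154 1050 855 842 721 631 607 576 562 547 468 445 466 439 382 381 352 323 334 266 277 238 210 181 182 136 122 112 84 65 39 54 35 32 33 35 27 16 16 15 9 16 15 12 7 6 11 6 2 4 2 1 2 1).
total := 0.
weightedSum := 0.
reciprocalSum := 0.
longer := 0.
counts keysAndValuesDo: [:length :n |
    total := total + n.
    weightedSum := weightedSum + (length * n).
    reciprocalSum := reciprocalSum + (n / length).
    "Count the keywords that exceed the 10 character mark."
    length > 10 ifTrue: [longer := longer + n]].

Transcript
    show: 'keywords: ', total printString; cr;
    show: 'arithmetic mean: ', (weightedSum / total) asFloat printString; cr;
    show: 'harmonic mean: ', (total / reciprocalSum) asFloat printString; cr;
    show: 'share longer than 10: ', (longer / total) asFloat printString; cr.
```

Working it through by hand, this gives 73408 keywords, an arithmetic mean of about 16.94 and a harmonic mean of about 10.79, in agreement with the figures quoted in the post. By the same count, roughly 70% of the keywords (51544 of 73408) are longer than 10 characters.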

5 comments:

Vassili said...

"So it seems that at least half of the time we're pushing keyword sizes past 10 characters" -- so is the glass half-full or half-empty? :)

Andres said...

Vassili,

Well, I am not sure. The average keyword size for all keywords was still over 16. In addition, I think a rather large project I worked on would score much higher harmonic mean numbers if measured. Alas, I do not have access to it so I cannot offer concrete evidence for my claim :(.

One way or the other, it's interesting to be more aware of what is going on when we choose words.

Andres.

Jecel said...

I see two problems with your numbers.

The first is that you are counting #at:put: as a 7 character symbol even though you have to only read at most three characters before getting to a colon.

The second is that though most symbols are large, these are not used as frequently as small ones. If you take the average size of words in an English dictionary you will also think they are reasonably large, but any typical text has so many "the" and "of" that the average size for that text will be much smaller.

The second problem can be addressed by going through all CompiledMethods and looking at symbols among their literals.

Andres said...

Jecel,

No, at:put: is being separated into at: and put: first, and so it contributes a keyword of length 3 and another keyword of length 4.

> The second is that though most symbols are large, these are not used as frequently as small ones. If you take the average size of words in an English dictionary you will also think they are reasonably large, but any typical text has so many "the" and "of" that the average size for that text will be much smaller.

Sure, however who calls their variables and receivers "the" or "or"? Also, your suggestion...

> The second problem can be addressed by going through all CompiledMethods and looking at symbols among their literals.

does not apply in this case. This is because the presence of $: (or '::') is used to determine whether an identifier is a receiver or not, and the original context of this discussion was implicit receivers. So really, things like class names or names of "globals" should be included in the statistics.

Andres.

Jecel said...

Andres,

sorry, I did miss the "x keywords" detail in your code.

About the second issue, I was trying to get a picture of symbols used instead of symbols defined. You are right that symbols representing globals should also be counted, but these can also be found indirectly in the method literals with a little extra effort.

The whole point is to count #at: 2000 times but #initializeBeforeChecking: only 3 times when calculating the average size.
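
To make the difference between symbols defined and symbols used concrete, here is a toy sketch using only the two selectors and the hypothetical usage counts from the comment above. It does not scan any CompiledMethods, it only shows how weighting by use changes the mean.

```smalltalk
| selectors uses plainSum weightedSum totalUses |
selectors := #(#at: #initializeBeforeChecking:).
"Hypothetical usage counts, taken from the example in the comment."
uses := #(2000 3).
plainSum := 0.
weightedSum := 0.
totalUses := 0.
selectors with: uses do: [:selector :count |
    "Each selector here is a single keyword, so its size includes the colon."
    plainSum := plainSum + selector size.
    weightedSum := weightedSum + (selector size * count).
    totalUses := totalUses + count].

Transcript
    show: 'mean over symbols defined: ', (plainSum / selectors size) printString; cr;
    show: 'mean over symbols used: ', (weightedSum / totalUses) asFloat printString; cr.
```

With sizes of 3 and 25 characters, the mean over the symbols defined is 14, while the mean weighted by use is 6075 / 2003, or about 3.03.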