Tuesday, December 02, 2008

Keyword size in a large-scale application

Ah, here we go. Here are the results for a large-scale financial application. No, not the one you are thinking about. This one has 12.5k classes and 750k LOC. The results are as follows*.

  1. 191466 keywords in total. Arithmetic mean keyword size 18.08, harmonic mean 12.05.
  2. 96490 keywords, about 50%, have one of the 12 most frequent sizes (in descending order of frequency): 11, 9, 13, 14, 12, 15, 16, 17, 10, 18, 5 and 19. For this subset, arithmetic mean keyword size 13.11, harmonic mean 11.64.
The glass seems half full on the side of more than 10 characters...

* Full data in array format. The array has 133 entries. The first one corresponds to size zero. #(1 45 349 4881 3125 7181 5133 5519 6140 9286 7472 9678 8331 8616 8406 8086 7807 7652 7255 6720 6290 5910 5397 4966 4517 3985 3806 3304 3075 2690 2401 2199 2073 1761 1660 1559 1297 1310 1144 978 945 883 782 705 652 614 514 523 414 379 330 341 283 259 206 200 181 158 135 135 118 99 92 83 77 42 43 37 29 31 15 20 15 21 14 13 7 8 8 5 4 6 3 3 6 3 2 0 3 1 1 1 0 0 1 0 0 0 2 0 2 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 2)
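For anyone who wants to recheck the figures, here is a minimal sketch (in Python rather than Smalltalk, purely for illustration) of how the means could be recomputed from the array above. It assumes the single size-zero entry is left out, since the harmonic mean is undefined for it; the function and variable names are made up for this example.

```python
# Histogram copied from the array above: COUNTS[size] = number of keywords of that size.
COUNTS = [
    1, 45, 349, 4881, 3125, 7181, 5133, 5519, 6140, 9286,
    7472, 9678, 8331, 8616, 8406, 8086, 7807, 7652, 7255, 6720,
    6290, 5910, 5397, 4966, 4517, 3985, 3806, 3304, 3075, 2690,
    2401, 2199, 2073, 1761, 1660, 1559, 1297, 1310, 1144, 978,
    945, 883, 782, 705, 652, 614, 514, 523, 414, 379,
    330, 341, 283, 259, 206, 200, 181, 158, 135, 135,
    118, 99, 92, 83, 77, 42, 43, 37, 29, 31,
    15, 20, 15, 21, 14, 13, 7, 8, 8, 5,
    4, 6, 3, 3, 6, 3, 2, 0, 3, 1,
    1, 1, 0, 0, 1, 0, 0, 0, 2, 0,
    2, 1, 0, 0, 0, 1, 1, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
    0, 0, 2,
]


def arithmetic_mean(hist):
    """Arithmetic mean of the sizes in a {size: count} histogram."""
    total = sum(hist.values())
    return sum(size * n for size, n in hist.items()) / total


def harmonic_mean(hist):
    """Harmonic mean of the sizes in a {size: count} histogram (sizes must be > 0)."""
    total = sum(hist.values())
    return total / sum(n / size for size, n in hist.items())


# Assumption: drop size zero (1/0 is undefined) and sizes that never occur.
hist = {size: n for size, n in enumerate(COUNTS) if size > 0 and n > 0}

# The 12 most frequent sizes, in descending order of frequency.
top12 = dict(sorted(hist.items(), key=lambda kv: kv[1], reverse=True)[:12])

print("total keywords:", sum(COUNTS))
print("all sizes: arithmetic %.2f, harmonic %.2f"
      % (arithmetic_mean(hist), harmonic_mean(hist)))
print("top 12 sizes:", list(top12), "covering", sum(top12.values()), "keywords")
print("top 12: arithmetic %.2f, harmonic %.2f"
      % (arithmetic_mean(top12), harmonic_mean(top12)))
```

Working from the histogram rather than from the individual keywords keeps this short: each size contributes its count as a weight to both means.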

4 comments:

Martin McClure said...

If I'm reading correctly, this is a static analysis of all Symbols that exist in the system. Most likely, some of these Symbols are used as selectors in code in many places, and others may not appear as selectors in code at all.

To make your point, it might be more interesting to see the frequency with which each keyword length appears in the code of the system. Possibly short keywords are used more frequently, possibly the longer ones are. I admit that this is more work to figure out.

Andres said...

Martin,

Something that also needs to be considered is that some keywords, even though they are not part of a message, can be identifiers and thus fall in the scope of the original discussion (needing to "scan" until $: or '::' to determine whether the leading element of a sentence is a receiver or not). For example,

OrderedCollection new: 10

In this case, the symbol #OrderedCollection is clearly not a keyword --- but its length matters because it pushes the place where $: or '::' would appear in a syntax that allows implicit receivers.

Andres.

Peter William Lount said...

It would be interesting to also know the distribution of keywords within each method selector.

For example: "x:", "x:y:", "x:y:z:".

Further, it's of interest to know the length of each keyword in multi-keyword selectors. Is the first keyword longer than the others on average? How about the second one?

One could also sort methods by the length (or number) of keyword selectors.

Do people generally find longer selectors more useful than shorter ones? I know that I'll type variable names out in full rather than using shortcuts so that the resulting code is more "literate" for later reading.

Andres said...

Peter,

Personally I prefer fully spelled out words rather than abbreviations. Also, there are only so many short words in the first place, and so I think it's natural for more specific domains to induce longer terms as the compact ones run out.

Andres.