Monday, November 28, 2011

The wonderful world of e^x cores and threads

Let's compare the performance of several SPARC machines at running some integer workloads. Fortunately the SPEC folks already compiled the relevant (and voluminous) results. In particular, we will look at the SPARC T2, T2+ and T3 CPUs against the Fujitsu SPARC64-VII and SPARC64-VII+ CPUs.

Unfortunately there are a multitude of numbers to look at. How many CPUs does the machine have? How many cores per CPU? How many threads per core? What's the speed of the CPU in question? How many concurrent test instances are running? And, finally, how long did it take for the machine to finish the workload? To simplify these matters, first we will look at three figures:

  1. Throughput, defined as concurrent test instances divided by the time taken.
  2. Throughput per core per GHz * 1000. This number tells us how much performance each core is driving.
  3. Throughput per thread per GHz * 1000. This number tells us how much performance each theoretically executable thread is driving.

For all three figures, higher is better. To keep the notation compact, we will note the results as follows; a small sketch of the arithmetic appears right after the first batch of results.

  • Chip name (chips x cores x threads @ GHz): #1, #2, #3.

First, let's look at some GCC batch runs.

  • SPARC T2 (1x8x8 @ 1.582): 0.008, 0.646, 0.081.
  • SPARC T2+ (2x8x8 @ 1.582): 0.014, 0.555, 0.069.
  • SPARC T2+ (4x8x8 @ 1.596): 0.030, 0.589, 0.074.
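
To make the three figures concrete, here is a minimal sketch of the arithmetic in C, using the 1-chip T2 line above. The only inputs are the published (rounded) throughput, the core and thread counts, and the clock, so the outputs only approximate the published 0.646 and 0.081.

    #include <stdio.h>

    /* Recompute figures #2 and #3 for the SPARC T2 (1x8x8 @ 1.582) line above,
       starting from its rounded throughput figure of 0.008.  Because of that
       rounding, the results only approximate the published 0.646 and 0.081. */
    int main(void)
    {
        double throughput = 0.008;  /* figure #1: concurrent instances / seconds */
        int chips = 1, cores_per_chip = 8, threads_per_core = 8;
        double ghz = 1.582;

        double per_core_ghz   = throughput / (chips * cores_per_chip) / ghz * 1000.0;
        double per_thread_ghz = per_core_ghz / threads_per_core;

        printf("per core-GHz:   %.3f\n", per_core_ghz);    /* ~0.632 */
        printf("per thread-GHz: %.3f\n", per_thread_ghz);  /* ~0.079 */
        return 0;
    }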

We can see a number of things in these results. For example, looking at raw throughput is misleading in a number of ways. First, it hides the number of threads the CPU is capable of running, so it doesn't give us a good idea of how efficient the CPU is. Second, compare the first two lines: we added another CPU, but we did not get 2x the throughput (0.008 to 0.014). In other words, adding CPUs does not necessarily make your box more efficient. We should also look at the throughput per core and per thread. Between the first two lines, we added twice as many cores and yet we lost a good chunk of throughput per core-GHz (0.646 down to 0.555). The last result for the T2+ shows some improvement, although its base clock is slightly higher (1.596 vs. 1.582 GHz)...

But there is more: we can also look at some T3 SPARCs.

  • SPARC T3 (1x16x8 @ 1.649): 0.012, 0.472, 0.059.
  • SPARC T3 (2x16x8 @ 1.649): 0.025, 0.465, 0.058.
  • SPARC T3 (4x16x8 @ 1.649): 0.049, 0.464, 0.058.

Nice: the T3s have twice as many cores per chip as the T2s, and yet as we add chips the performance does not degrade as much as with the T2s. However, we should also note the T2s had better throughput per core-GHz and per thread-GHz. So, again, we add more computing resources but the effective efficiency goes down.

There are basically two vendors of SPARC CPUs that I could find. One is Sun Oracle, which sells the SPARC T CPUs. The other is Fujitsu, which sells SPARC64 CPUs. Through several iterations, Fujitsu has taken an older SPARC64-V all the way to a SPARC64-IX. This last CPU was recently used in a top 500 supercomputer entry. How does it stack up against Oracle's offerings? Well, at first it's hard to tell because SPARC64s run much faster than T2, T2+ and T3 CPUs, yet they have fewer cores and their cores run fewer concurrent threads. What a mess! But our 3 figures are discriminating enough. Let's take a look.

  • SPARC64-VII (4x4x2 @ 2.53): 0.012, 0.304, 0.152.
  • SPARC64-VII (8x4x2 @ 2.53): 0.024, 0.298, 0.149.

At first glance, it would seem as if the T2+ and T3 configurations above achieve more throughput than the SPARC64-VII. In fact, the SPARC64-VII scores less throughput per core-GHz. But a closer look reveals that the SPARC64-VII pushes between 2 and 3 times more work per thread-GHz. And that's after taking out the effect of the clock speed difference, which is not even 2x in favor of the SPARC64. In other words, the SPARC64-VII's thread execution is simply more efficient than that of Sun Oracle's SPARC offerings. And that's for a somewhat slow SPARC64-VII; here are a couple more numbers.

  • SPARC64-VII (16x4x2 @ 2.88): 0.066, 0.360, 0.180.
  • SPARC64-VII (32x4x2 @ 2.88): 0.126, 0.340, 0.170.

Here we see the effect of the clock speed increase on the throughput per GHz figures. And again we see that with more execution capability there is less effective throughput per unit. Fortunately, both the T3 and the SPARC64-VII chips scale well as CPUs are added. Nevertheless, the net efficiency of the SPARC64-VII is higher, and I'm guessing that's because with less execution machinery the CPU can spend more energy just plowing forward.

Unfortunately there are no results for SPARC T4 CPUs yet. However, we know they will run at 2.8 GHz or higher, and have 8 cores per CPU, each of which can run 8 threads. Plugging in the numbers for the T2s (which have the same number of threads per CPU), we can see the individual cores would have to be about 2x faster than those in the T2s to match a SPARC64-VII on throughput per thread. Moreover, SPARC64-VII+ chips are faster and seem a bit more efficient as well. Unfortunately, I could not find results for SPARC64-VIII or SPARC64-IX chips either. In the meantime, though... Sun Oracle SPARC chips don't necessarily look all that great :(...
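
As a postscript, here is a quick sanity check of the "about 2x" estimate, using nothing but the per thread-GHz figures already listed in this post:

    #include <stdio.h>

    /* Ratio between the per thread-GHz figure of the slower SPARC64-VII runs
       (0.149-0.152) and that of the 1-chip SPARC T2 (0.081). */
    int main(void)
    {
        double sparc64_vii_per_thread_ghz = 0.152;  /* SPARC64-VII, 4x4x2 @ 2.53 */
        double t2_per_thread_ghz          = 0.081;  /* SPARC T2, 1x8x8 @ 1.582   */

        printf("required per-thread speedup: %.2f\n",
               sparc64_vii_per_thread_ghz / t2_per_thread_ghz);  /* ~1.88, i.e. about 2x */
        return 0;
    }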

Thursday, November 17, 2011

Assessments 1.53

Changes in this version: improved the RB status bar to report skipped checks, and deleted some dead code from it.

Enjoy!

Due diligence continues to pay off

At work, we recently fixed a number of instances of memcpy() that should have been memmove() because the objects involved overlapped (thus violating e.g.: POSIX and C99). One particular instance of memcpy() had been wrong since at least 1990, only to be exposed by relatively recent versions of glibc. The fact that wrong code has been undetected for at least 21 years illustrates that it is very easy for programs to merely appear to work and, thus, that it is incredibly important to always pay attention to the relevant specifications.
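
For illustration only (this is not one of the actual cases we fixed), here is a minimal example of an overlapping copy where memcpy() would be undefined behavior and memmove() is required:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "abcdefgh";

        /* Shift the first five characters one position to the right.  The
           source (buf) and the destination (buf + 1) overlap, so memcpy()
           would be undefined behavior here; memmove() is specified to
           handle overlap. */
        memmove(buf + 1, buf, 5);
        printf("%s\n", buf);  /* prints "aabcdegh" */
        return 0;
    }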

Speaking of paying attention, we also found that sometimes we use a Duff device instead of memcpy() / memmove() / memset() because of an interface impedance mismatch: the C library functions work with bytes, while we need to copy, move or set pointer-size values. Alas, our Duff device is currently written in a way that requires far too many assembler instructions. So now I have a new Duff device prototype that does the same work in about half the instructions on x86, Power and SPARC. Preliminary tests show a measurable performance improvement, visible on both micro and macro benchmarks.
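
For those unfamiliar with the pattern, here is a minimal sketch of a word-wise Duff device in C. It is not the actual prototype mentioned above (that one is tuned per architecture); it only illustrates the unrolled-copy idea for pointer-size values:

    #include <stddef.h>

    /* Copy 'count' pointer-sized values, unrolling eight assignments per pass.
       The switch jumps into the middle of the do/while loop to handle the
       remainder, which is the defining trick of a Duff device. */
    static void copy_words(void **dst, void **src, size_t count)
    {
        if (count == 0)
            return;

        size_t passes = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *dst++ = *src++;
        case 7:      *dst++ = *src++;
        case 6:      *dst++ = *src++;
        case 5:      *dst++ = *src++;
        case 4:      *dst++ = *src++;
        case 3:      *dst++ = *src++;
        case 2:      *dst++ = *src++;
        case 1:      *dst++ = *src++;
                } while (--passes > 0);
        }
    }

A real memcpy()-class routine also has to worry about alignment and overlap; the point here is only the jump-into-the-loop remainder handling.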

Moving along...

Doctoral Symposium CIBSE 2012

DOCTORAL SYMPOSIUM CIBSE 2012 (Buenos Aires, Argentina)

CALL FOR SUBMISSIONS

The Organizing Committee of the XIV Ibero-American Conference on Software Engineering (CIbSE 2012) is pleased to invite PhD students in Software Engineering and related areas to participate actively in the CIbSE 2012 Doctoral Symposium by submitting papers describing their doctoral work.

The CIbSE 2012 Doctoral Symposium is an international forum for PhD students to discuss their research goals, methodology, and early results in a critical but supportive and constructive environment. It will be held as a one-day session. Selected students will present their work and receive constructive feedback both from a panel of experts and from other Doctoral Symposium students. The students will also have the opportunity to seek advice on various aspects of completing a PhD and performing research in Software Engineering.

Format of the Submissions

PhD students interested in participating in the Doctoral Symposium should take the following items into account when submitting their work:
  • Provide a clear description of the research problem being addressed.
  • Motivate the proposed research (i.e. state why the research work is being conducted, and what benefits the research will bring).
  • Outline the current knowledge of the problem domain, briefly describe what existing work the research builds upon (citing key papers), and also briefly describe any existing solutions that have been developed or are currently being developed (citing key papers).
  • Clearly present preliminary results from the research work, and propose a plan of research for completing the PhD.
  • Point out the contributions of the applicant to the solution of the problem, and state in what aspects the suggested solution is different, new or better as compared to existing approaches to the problem.

Submitted papers must not exceed 6 pages in Springer LNCS format (http://www.springer.de/comp/lncs/authors.html). Papers may be written in English, Spanish or Portuguese.

What and how to Submit

Papers must be sent in PDF format to Gabriela Arévalo (gabriela (dot) b (dot) arevalo (at) gmail.com) by December 20th, 2011. In addition, you must attach an “expectation and benefits” statement (1 page maximum, in PDF format) describing the kind of advice you would like to receive and how this would help you in your research. You should seek your supervisor’s guidance when preparing this statement. In short, each submission must consist of the following files:
  • An attached PDF with your research paper submission (maximum 6 pages).
  • An attached PDF with your expectation and benefits statement (maximum 1 page).

Please use “[CIbSE 2012] Doctoral Symposium - Submission” as the subject of the e-mail. The body of the e-mail should also contain: title, abstract (200 words), keywords, and student and advisor details (name, e-mail address, affiliation and postal address).

Evaluation of Submissions

Submitted papers will be subject to a review process by an international Program Committee. The selected submissions will be published as part of the proceedings of the CIbSE 2012 Conference.

Program Committee

Alejandra Garrido (Universidad Nacional de La Plata, Argentina)
Alexandre Bergel (Universidad de Chile, Chile)
Catalina Mostaccio (Universidad Nacional de La Plata, Argentina)
Eduardo Bonelli (Universidad Nacional de Quilmes, Argentina)
Gabriela Robiolo (Universidad Austral, Argentina)
Hernán Astudillo (Universidad Técnica Federico Santa María, Chile)

Important Dates

Submission Deadline: December 20th, 2011
Notification: February 20th, 2012
Camera Ready version: February 28th, 2012

Tuesday, November 08, 2011

Smalltalks 2011 (short) report

Hello, this is the Smalltalks 2011 report.

The conference was quite good, and we had a good time. I did not attend all the talks due to hosting duties, but the talks were fantastic. We had at least 310 registrations. By the second day we had around 210 unique attendees, which is awesome; we're still working on the final tally, so it's probably a bit higher than that. Here are some random comments.

Ian Piumarta's To Trap A Better Mouse was awesome. He examined how languages are divided into an open class of words (nouns, verbs, etc.) which changes all the time and which describes the entities we want to talk about, and a closed class of words (prepositions, articles, pronouns) which hardly ever changes. The closed class effectively dictates what thoughts can be expressed in the language, because its words encode the possible relational patterns between the words in sentences. With that in mind, he took a look at open / closed word classes in *computer* languages and the effect those have on how easy it is to express a certain program in a given language. He emphasized that text substitution with ad-hoc parsers can have tremendous power, because then we can easily change from a bad (limiting) representation to a better (enabling) representation. He suggested looking at Earley parsers instead of PEGs, LL, LR and similar parsers, because Earley parsers can handle both left- and right-recursive grammars, and indeed all context-free grammars (see here).

Gerardo Richarte and Javier Burroni presented updates on their Smalltalk-based GC implementation. They now have limited forms of multithreaded GC. Interestingly, the multithreaded versions tend to go slower. We speculated that cache poisoning is the culprit, because GC algorithms usually end up looking like "a couple of instructions, then an uncached memory fetch" cycles. If you cannot easily partition the spaces so that different CPUs go after different memory areas, things look like they will be more painful.

On Saturday, I really liked Ian Piumarta and Kim Rose's talk on what happens when technological advances are promised to revolutionize education. Almost immediately, they are trivialized and dumbed down so that all that kids (and adults) have to do is mash up pre-existing stuff (think of clip art collages) instead of doing anything creative. This is a problem because it builds up the inertia that lets training pass for an actual education. There were plenty of examples, and plenty of evidence. The situation is somewhat depressing, really, because with things like Facebook, Twitter, and SMS, all we do is emphasize immediate gratification and zero effort as a successful or productive expenditure of time. Without the time (and, thanks to dumbed-down technology, without the incentive) to concentrate on anything, we cannot really hope for much. This issue really resonated with Alan Kay's observation that when technology is too easy, there's no effort to actually do something good, so the vast majority of the results are trivial.

You can see the rest of the schedule at FAST's website. Some talks have been given at other conferences, and videos of those talks are either available now or should become available soon at e.g.: ESUG's YouTube video channel. I apologize in advance for not writing up reports on every talk I attended, but we will also post the videos from Smalltalks 2011 on our website soon, and I don't want to give out (too many) spoilers :).

From all of us at FAST, we thank you for coming and making this conference a success. See you at Smalltalks 2012!

Wednesday, November 02, 2011

Smalltalks 2011 streamed live starting tomorrow

Smalltalks 2011 starts tomorrow, and will be shown live here. Enjoy, and see you in the morning!