Monday, November 28, 2011

The wonderful world of e^x cores and threads

Let's compare the performance of several SPARC machines at running some integer workloads. Fortunately the SPEC folks already compiled the relevant (and voluminous) results. In particular, we will look at the SPARC T2, T2+ and T3 CPUs against the Fujitsu SPARC64-VII and SPARC64-VII+ CPUs.

Unfortunately there is a multitude of numbers to look at. How many CPUs does the machine have? How many cores per CPU? How many threads per core? What's the clock speed of the CPU in question? How many concurrent test instances are running? And, finally, how long did it take for the machine to finish the workload? To simplify matters, we will first look at three figures:

  1. Throughput, defined as concurrent test instances over time taken.
  2. Throughput per core per GHz * 1000. This number gives us how much performance is being driven per core.
  3. Throughput per thread per GHz * 1000. This number gives us how much performance is being driven per theoretically executable thread.
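
The three figures above can be sketched in a few lines of Python. This is only a minimal sketch of the definitions as stated in the list: the `Run` class, its field names and the `figures` function are made up for illustration and are not part of any SPEC tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark result. All field names are illustrative."""
    chips: int
    cores_per_chip: int
    threads_per_core: int
    ghz: float
    copies: int      # concurrent test instances
    seconds: float   # time taken to finish the workload


def figures(run: Run) -> tuple:
    """Return (throughput, per core-GHz x 1000, per thread-GHz x 1000)."""
    throughput = run.copies / run.seconds
    cores = run.chips * run.cores_per_chip
    threads = cores * run.threads_per_core
    per_core_ghz = throughput / (cores * run.ghz) * 1000
    per_thread_ghz = throughput / (threads * run.ghz) * 1000
    return throughput, per_core_ghz, per_thread_ghz
```

For example, a made-up machine (not a SPEC result) with 1 chip, 2 cores, 4 threads per core at 2.0 GHz that finishes 8 copies in 1000 seconds, i.e. `figures(Run(1, 2, 4, 2.0, 8, 1000.0))`, comes out at roughly 0.008, 2.0 and 0.5.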

For all three figures, higher is better. To keep things compact, we will note the results as follows.

  • Chip name (chips x cores x threads @ GHz): #1, #2, #3.

First, let's look at some GCC batch runs.

  • SPARC T2 (1x8x8 @ 1.582): 0.008, 0.646, 0.081.
  • SPARC T2+ (2x8x8 @ 1.582): 0.014, 0.555, 0.069.
  • SPARC T2+ (4x8x8 @ 1.596): 0.030, 0.589, 0.074.
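
Since every result in this post uses the same one-line notation, a small parser makes it easy to sanity-check the figures. The sketch below is an assumption-laden helper (the regular expression and the names `parse_result` and `implied_threads_per_core` are invented here). The check relies on the fact that figure #2 divided by figure #3 should land close to the number of threads per core, because both are the same throughput divided by the same clock.

```python
import re

NUM = r"\d+(?:\.\d+)?"

# Matches lines such as "SPARC T2 (1x8x8 @ 1.582): 0.008, 0.646, 0.081."
LINE = re.compile(
    rf"^\s*(?:•\s*)?(?P<name>.+?)\s*"
    rf"\((?P<chips>\d+)x(?P<cores>\d+)x(?P<threads>\d+)\s*@\s*(?P<ghz>{NUM})\):\s*"
    rf"(?P<f1>{NUM}),\s*(?P<f2>{NUM}),\s*(?P<f3>{NUM})\.?\s*$"
)


def parse_result(line):
    """Parse one result line into a dict; raise ValueError if it does not fit."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    return {
        "name": m["name"],
        "chips": int(m["chips"]),
        "cores": int(m["cores"]),      # cores per chip
        "threads": int(m["threads"]),  # threads per core
        "ghz": float(m["ghz"]),
        "throughput": float(m["f1"]),
        "per_core_ghz": float(m["f2"]),
        "per_thread_ghz": float(m["f3"]),
    }


def implied_threads_per_core(result):
    """Figure #2 over figure #3; should be near the threads-per-core count."""
    return result["per_core_ghz"] / result["per_thread_ghz"]
```

For the single-chip T2 line, the implied value is 0.646 / 0.081, or just under 8, which agrees with the 8 threads per core in the notation once rounding to three decimals is taken into account.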

We can see a number of things here. For example, raw throughput alone is misleading in a couple of ways. First, it hides the number of threads the CPU is capable of running, so it doesn't tell us how efficient the CPU is. Second, compare the first two lines: we added another CPU, but we did not get 2x the throughput (0.008 to 0.014). In other words, adding CPUs does not necessarily make your box more efficient. We should also look at the throughput per core and per thread. Between the first two lines, we doubled the number of cores and yet lost a good chunk of per-core throughput (0.646 to 0.555). The last T2+ result shows some improvement, although its clock is also slightly higher, by roughly 1% (1.596 vs. 1.582 GHz).
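
To put a number on "we did not get 2x the throughput", here is a rough sketch that compares the observed speedup against the ideal one. The function name is made up, and because the throughputs in this post are rounded to three decimals the result is only approximate.

```python
def scaling_efficiency(base_throughput, base_chips, new_throughput, new_chips):
    """Observed speedup divided by the ideal speedup from adding chips.

    1.0 means perfect scaling; lower values mean throughput was lost.
    """
    speedup = new_throughput / base_throughput
    ideal = new_chips / base_chips
    return speedup / ideal


# One T2 chip (0.008) versus two T2+ chips (0.014), both at 1.582 GHz.
print(f"{scaling_efficiency(0.008, 1, 0.014, 2):.1%}")
```

With the rounded figures above this lands at about 87.5%, in the same ballpark as the drop in throughput per core-GHz between those two lines.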

But there is more: we can look at some SPARC T3s.

  • SPARC T3 (1x16x8 @ 1.649): 0.012, 0.472, 0.059.
  • SPARC T3 (2x16x8 @ 1.649): 0.025, 0.465, 0.058.
  • SPARC T3 (4x16x8 @ 1.649): 0.049, 0.464, 0.058.

Nice. The T3s have twice as many cores per chip as the T2s, and yet as we add chips the per-core performance does not suffer as much as it did with the T2s. However, we should also note that the T2s had better throughput per core-GHz and per thread-GHz. So, again, we added more computing resources but the effective efficiency went down.

There are basically two vendors of SPARC CPUs that I could find. One is Sun Oracle, which sells the SPARC T CPUs. The other is Fujitsu, which sells SPARC64 CPUs. Through several iterations, they've taken an older SPARC64-V all the way to a SPARC64-IX. This last CPU was recently used in a Top 500 supercomputer entry. How does it stack up to Oracle's offerings? Well, at first it's hard to tell because SPARC64s are clocked much higher than T2, T2+ and T3 CPUs, yet they have fewer cores and their cores run fewer concurrent threads. What a mess! But our 3 figures are discriminating enough. Let's take a look.

  • SPARC64-VII (4x4x2 @ 2.53): 0.012, 0.304, 0.152.
  • SPARC64-VII (8x4x2 @ 2.53): 0.024, 0.298, 0.149.

At first glance, it would seem as if the T2+ and T3 CPU configurations above achieve more throughput than the SPARC64-VII. In fact, the SPARC64-VII scores less throughput per core-GHz. But a closer look reveals that the SPARC64-VII above pushes roughly 1.8 to 2.6 times more work per thread-GHz. And that's after taking out the effect of the GHz difference, which is not even 2x in favor of the SPARC64. In other words, the SPARC64-VII's thread execution is just more efficient than that of Sun Oracle's SPARC offerings. And that's for a somewhat slow SPARC64-VII. Here are a couple more numbers.
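
The per thread-GHz comparison can be checked directly from the figures listed so far. The sketch below simply divides every SPARC64-VII value by every T-series value and reports the extremes; the dictionary labels and the `ratio_range` helper are made up for illustration.

```python
# Throughput per thread-GHz x 1000, copied from the result lines in this post.
T_SERIES = {
    "T2 1x8x8": 0.081,
    "T2+ 2x8x8": 0.069,
    "T2+ 4x8x8": 0.074,
    "T3 1x16x8": 0.059,
    "T3 2x16x8": 0.058,
    "T3 4x16x8": 0.058,
}
SPARC64_VII = {
    "4x4x2 @ 2.53": 0.152,
    "8x4x2 @ 2.53": 0.149,
}


def ratio_range(numerators, denominators):
    """Smallest and largest ratio over all numerator/denominator pairs."""
    ratios = [n / d for n in numerators for d in denominators]
    return min(ratios), max(ratios)


low, high = ratio_range(SPARC64_VII.values(), T_SERIES.values())
print(f"per thread-GHz advantage: {low:.2f}x to {high:.2f}x")
print(f"clock advantage over the T2: {2.53 / 1.582:.2f}x")
```

The low end comes from the 8-chip SPARC64-VII against the single T2, the high end from the 4-chip SPARC64-VII against the multi-chip T3s. The clock ratio is printed only as a reminder that the raw GHz gap is about 1.6x.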

  • SPARC64-VII (16x4x2 @ 2.88): 0.066, 0.360, 0.180.
  • SPARC64-VII (32x4x2 @ 2.88): 0.126, 0.340, 0.170.

Here we see the effect of the speed increase on the throughput per GHz figures. And again we see that with more execution capabilities, there is less effective throughput. Fortunately the T3 and the SPARC64-VII chips scale well as CPUs are added. Nevertheless, the net per-thread efficiency of the SPARC64-VII is higher, and I'm guessing it's because with less execution machinery the CPU can spend more energy just plowing forward.
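
How well each family holds up as chips are added can be read off the per core-GHz column. As a rough sketch (the `retention` helper and the labels are invented here), divide the figure for the largest configuration by the figure for the smallest one at the same clock:

```python
# Per core-GHz x 1000 figures from this post, ordered by increasing chip count.
PER_CORE_GHZ = {
    "SPARC T3 @ 1.649 (1, 2, 4 chips)": [0.472, 0.465, 0.464],
    "SPARC64-VII @ 2.53 (4, 8 chips)": [0.304, 0.298],
    "SPARC64-VII @ 2.88 (16, 32 chips)": [0.360, 0.340],
}


def retention(series):
    """Share of the smallest configuration's per core-GHz figure kept by the largest."""
    return series[-1] / series[0]


for label, series in PER_CORE_GHZ.items():
    print(f"{label}: {retention(series):.1%}")
```

This gives about 98% for the T3, about 98% for the 2.53 GHz SPARC64-VII and about 94% for the 2.88 GHz SPARC64-VII going from 16 to 32 chips. The T2 and T2+ lines are left out because they mix two different chips and two different clocks.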

Unfortunately there are no results for SPARC T4 CPUs. However, we know they will run at 2.8GHz or higher, and have 8 cores per CPU, each of which can run 8 threads. Plugging in the numbers for the T2s (which have the same number of threads per CPU), we can see the individual cores would need to be about 2x faster than those in the T2s to match a SPARC64-VII on throughput per thread. Moreover, SPARC64-VII+ chips are faster and seem a bit more efficient as well. Unfortunately I could not find results for SPARC64-VIII or SPARC64-IX chips. In the meantime, though... Sun Oracle SPARC chips don't necessarily look all that great :(...
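
The "about 2x" estimate can be reproduced from the per thread-GHz figures. The sketch below assumes, as the paragraph does, that a T4 core starts from the single-chip T2's per thread-GHz efficiency; that is a guess for the sake of the arithmetic, not a measured T4 number. The `required_gain` name is made up.

```python
T2_PER_THREAD_GHZ = 0.081  # SPARC T2, 1x8x8 @ 1.582

SPARC64_VII_PER_THREAD_GHZ = {
    "4x4x2 @ 2.53": 0.152,
    "16x4x2 @ 2.88": 0.180,
}


def required_gain(baseline, target):
    """Factor by which the baseline must improve to reach the target."""
    return target / baseline


for label, target in SPARC64_VII_PER_THREAD_GHZ.items():
    gain = required_gain(T2_PER_THREAD_GHZ, target)
    print(f"to match SPARC64-VII {label}: {gain:.2f}x per thread-GHz")
```

The two results, roughly 1.9x and 2.2x, bracket the 2x figure. Because both sides are normalized per GHz, any clock advantage a T4 has over the T2 does not reduce this gap.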
