Original Link: https://www.anandtech.com/show/2550



Introduction

With AMD holding up to 56% of the market in the US and 40% worldwide, the quad-socket market is AMD's last stronghold. It is a small stronghold: for every 4S server sold, there are about 17 dual-socket and four single-socket servers sold. However, since each 4S server contains four CPUs, the 4S server market accounts for about 10% of the server CPUs sold. More importantly, the margins are quite a bit higher than in the popular 2S market, and as a result those 10% of server CPU shipments are good for 20% of the revenue. And it gets even better.
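The 10% figure follows from the sales ratio quoted above. A minimal Python sketch of that arithmetic, assuming every server ships with all of its sockets populated (our simplification; the names are illustrative only):

```python
# Servers sold per single 4S server, as quoted in the text:
# 17 dual-socket and 4 single-socket systems.
# Assumption: every socket is populated.
servers_per_4s = {4: 1, 2: 17, 1: 4}  # socket count -> number of servers


def cpu_share(sockets, mix):
    """Fraction of all CPUs that ship in servers with the given socket count."""
    total_cpus = sum(s * n for s, n in mix.items())
    return sockets * mix[sockets] / total_cpus


share_4s = cpu_share(4, servers_per_4s)
print(f"4S share of server CPUs: {share_4s:.1%}")  # prints 9.5%
```

Four CPUs out of 42 is roughly one in ten, which matches the "about 10%" in the text.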

IDC expects that in 2008 the 4S market will grow by up to 14%, more than twice the growth of the dual-socket server market[1]. GCC reports that about 57% of 297 enterprises said they are going to buy quad-socket servers, and no less than 31% indicated that they were going to buy more quad-socket systems than in the past. Only 15.5% said they would buy fewer of these systems[2]. This trend is relatively easy to explain: IDC studies indicate that the number of x86 servers bought for the purpose of virtualization will show a Compound Annual Growth Rate of 40-45% until 2010[3]. Although IDC's studies have been overoptimistic before, we think it's safe to conclude that the 4S x86 market has a bright future ahead.

It is no surprise that AMD and Intel have started a bitter fight for this profitable market. The hostilities started to heat up in early Q3 of 2007. AMD introduced a 3GHz (95W) Opteron 8222 and 3.2GHz (120W) Opteron 8224 that outpaced Intel's Xeon MP "Tulsa" by a wide margin. Just a few weeks later, the most impressive Xeon MP launch we have seen in years became a reality: the Caneland platform with the Intel Xeon "Tigerton MP" was born. The new Intel 7300 chipset offered no less than four independent FSBs, twice as many as the previous quad-socket platform, but the most important improvement was the CPU: the old Xeon 71xx CPU had two NetBurst-based cores; the new Xeon 73xx was a quad-core CPU, based on Intel's 65nm Core architecture.

At the end of our review of the quad-core Xeon DP "Clovertown", we called attention to an Intel CPU labeled as "Clovertown MP". That is exactly what the current quad-core Intel is: a Xeon MP version of the Xeon "Clovertown" 53xx DP. Originally, the "Clovertown MP" was probably a backup plan for Intel's Whitefield processor, a quad-core CPU with a massive shared L2 cache. As Whitefield failed to materialize, plan B went into action. The Clovertown MP was renamed "Tigerton": a 65nm quad-core Xeon MP that is based on two dual-core "Core" chips.

Intel conquered the dual and single socket server markets as you can see in the graphic below. Its "Core" based Xeons have been gaining ground quickly for almost two years now. The sharp increase in market share since the introduction of the dual-core "Core" architecture ("Woodcrest", May 2006) and quad-core (November 2006) is remarkable. A more recent report from Mercury Research claims that Intel's server CPU unit share is still about 87%, which is a crushing superiority in market share.


Since the introduction of the Core architecture in Q2 2006, Intel has gained ground quickly.

The introduction of the quad-core Xeon MP should not have been a big problem for AMD, as it launched the "native" quad-core Opteron 83xx series a few weeks after the appearance of Intel's Xeon MP. Unfortunately for AMD, the TLB bug was the fly in the soup. While this bug is irrelevant for the desktop market, erratum 298 is not acceptable in the quad-socket market. On a virtualized server there is a small chance - compared to the infinitesimally small chance in desktop use - that the bug could rear its ugly head.

That allowed Intel's Tigerton to besiege the last AMD stronghold for more than 7 months. The only thing AMD could do was lower the prices of its server CPUs. AMD managed to keep the market share loss to a minimum, but the average selling price of an AMD server chip fell from a little less than $400 (Q2 2007) to less than $300 at the end of 2007. Something had to be done, or the Intel engineers would get bored without a decent challenge….



AMD: back in the quad-socket race

On April 9, 2008, AMD had an answer to the quad-socket, quad-core Intel platform. To quickly get an idea of what AMD and Intel are offering, let us take a look at the 1K pricing:

Server CPU Pricing
AMD CPU | Price | Intel CPU | Price
Opteron 8360 SE 2.5GHz (125W) | $2,149 | Xeon X7350 2.93GHz (130W, 2x4MB L2) | $2,301
Opteron 8358 SE 2.4GHz (125W) | $1,865 | Xeon E7340 2.4GHz (80W, 2x4MB L2) | $1,980
Opteron 8356 2.3GHz (95W) | $1,514 | Xeon E7330 2.4GHz (80W, 2x3MB L2) | $1,391
Opteron 8354 2.2GHz (95W) | $1,165 | Xeon E7320 2.13GHz (80W, 2x2MB L2) | $1,177
Opteron 8350 2.0GHz (95W) | $873 | Xeon E7310 1.6GHz (80W, 2x2MB L2) | $856
Opteron 8347 HE 1.9GHz (68W) | $873 | Xeon L7345 1.86GHz (50W, 2x4MB L2) | $2,301

The worst offerings are the Opteron 8358 SE and the Xeon MP L7345. The Opteron 8358 SE will clearly lose the performance per Watt battle. We seriously doubt there is any reason for the Opteron 8358 SE to exist, as it is hardly faster than the 8356 and consumes much more energy. The OEMs agree: the Opteron 8358 SE cannot be found anywhere.

The Xeon MP L7345 price is baffling. Buy an HP ProLiant BL680c G5 and you'll pay about $22,000 for a blade with four Xeon L7345 and 32GB of RAM. Compare that with the same blade equipped with four E7330, which costs $16,700. HP does not offer (yet?) the low-power Opterons, which is a shame, considering that a 1.9GHz 8347 performs competitively with the Intel options (which we'll see in a moment). An HP ProLiant BL685c G5 with Opteron 8347 HE would cost about $15,000. A saving of $7,000 per blade server is no small change, so we hope that HP will offer this soon. This would also put pressure on Intel to lower the price of the L7345.

The Opteron 8354 and 8350 are probably the most competitive offerings from AMD. The E7330 is most likely the best price/performance Intel chip, while the X7350 will be the performance chip to beat.

Beauty contest

One IT journalist described a CPU as "a slice of silicon with needles under it". We cannot help it; we still get a kick out of holding a massive quad-core CPU between our forefinger and thumb. On the outside, the quad-core Xeon "Tigerton" looks a lot like the dual-core Xeon "Tulsa". It uses the same packaging…


The impressive Intel Xeon Tigerton

… And the same 604-pin socket. The old ("Tulsa") and the new Xeon are not electrically compatible, but the 45nm successor ("Dunnington") will be completely compatible with the Xeon "Tigerton" 73xx motherboards. Or put another way, both "Tigerton" and "Dunnington" will use the same "Caneland" platform based on the 7300 chipset.


The good old ZIF socket: not as fragile as the modern LGA socket

The newest third generation Opteron uses the flip chip Land Grid Array (LGA) with 1207 pins on the motherboard. That makes it a bit scary to change CPUs as these LGA pins are very fragile. We prefer the good old ZIF socket.


The new Opteron 8356 is a worthy opponent for Intel's Xeon MP.

Let us take a look at the platform.



Platform Comparison

As most of you might know, Intel and AMD have made fundamentally different choices when it comes to the platform, chipset, and memory subsystems. Intel chose to create a massive chipset that has four independent front side buses.


Intel Xeon MP Caneland platform.

In theory, this platform should deliver up to 8.5GB/s to each CPU, but the bottleneck is of course the connection to the memory. Four channels of FB-DIMMs are capable of delivering 21GB/s at most or about 5GB/s per CPU.
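The peak figures above can be reproduced from bus width and transfer rate. The sketch below is only illustrative: it assumes a 1066 MT/s front side bus with a 64-bit data path (not stated in the text, but consistent with the 8.5GB/s figure) and DDR2-667 FB-DIMM channels moving 8 bytes per transfer, as in the benchmark configuration listed later.

```python
def peak_gb_per_s(mega_transfers_per_s, bytes_per_transfer):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Assumption: 1066 MT/s FSB, 8 bytes wide.
fsb = peak_gb_per_s(1066, 8)

# Assumption: four DDR2-667 FB-DIMM channels, 8 bytes per transfer each.
memory = 4 * peak_gb_per_s(667, 8)

# The four sockets share the memory channels behind the chipset.
per_cpu = memory / 4

print(f"FSB peak per CPU:     {fsb:.1f} GB/s")
print(f"Memory peak, total:   {memory:.1f} GB/s")
print(f"Memory peak, per CPU: {per_cpu:.1f} GB/s")
```

Under these assumptions each front side bus can carry more than the memory subsystem can supply to its socket, which is the bottleneck the text describes.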

AMD promised us very low latency HyperTransport 3 connections in the quad-socket space…


Single hop HT3 connections were promised for the quad-core Opteron.

…but decided to give the current platform a longer life:


The current platform for the brand-new quad-core Opteron uses the old 1GHz HyperTransport connections.

So at the moment AMD's platform still works with a 1GHz DDR, 16-bit HyperTransport connection between the different CPUs. The 4GB/s bandwidth (full duplex) seems a little low when you consider that the dual-channel DDR2 DIMMs can deliver about 10.6GB/s in theory and 5.2GB/s in reality. This means that whenever a CPU has to get data from a remote node, bandwidth is limited by the HT connection. Latency also increases in the cases where the remote data has to travel over two hops.

Although in most applications this is not a show stopper, AMD has some headroom when it introduces the new 45nm quad-core Opteron with HT3 connections. Each HT3 link can run at up to 2.6GHz, which allows it to deliver up to 10.4GB/s in full duplex. Intel will deliver a similar NUMA platform for its Nehalem CPU with 12.8GB/s QPI (CSI) full duplex links at the end of this year.
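The HyperTransport numbers above follow from clock, data rate, and link width. A small sketch, assuming that the quoted figures are per direction, that "DDR" means two transfers per clock, and that local memory is dual-channel DDR2-667 (the helper name is ours):

```python
def ht_link_gb_per_s(clock_ghz, width_bits=16):
    """Per-direction bandwidth of a DDR HyperTransport link in GB/s."""
    transfers_per_clock = 2  # double data rate
    return clock_ghz * transfers_per_clock * width_bits / 8


ht_current = ht_link_gb_per_s(1.0)  # today's 1GHz links
ht3 = ht_link_gb_per_s(2.6)         # HT3 at 2.6GHz

# Assumption: dual-channel DDR2-667, 8 bytes per transfer per channel.
local_memory = 2 * 667e6 * 8 / 1e9

print(f"Current HT link: {ht_current:.1f} GB/s")
print(f"HT3 link:        {ht3:.1f} GB/s")
print(f"Local DDR2 peak: {local_memory:.1f} GB/s")
```

With the current links, a remote access sees well under half of the theoretical local memory bandwidth; with HT3 the link roughly matches it.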



Words of thanks

Quite a few people gave us assistance with this project, and we would of course like to thank them.

Trevor Lawless, Intel US
Sanjay Sharma, Intel US
Matty Bakkeren, Intel Netherlands
(www.intel.com)

Damon Muzny, AMD US
Brent Kerby, AMD US
(www.amd.com)

Benchmark configuration

Here is the list of the different configurations. All servers have been flashed to the latest BIOS, and unless we add a "BIOS comment", the BIOS was set to default settings.

Xeon Server 1: Supermicro SC818TQ-1000 1U Chassis

2x - 4x Intel Xeon E7330 at 2.4GHz
Supermicro X7QCE
32GB (16x2GB) ATP Registered FBDIMM DDR2-667 CL 5 ECC
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1000W w/PFC (Model PWS-1K01-1R)

Xeon Server 2: Intel "Stoakley platform" server

Supermicro X7DWE+/X7DWN+ Bios rev 1.0
2x Xeon E5472 at 3GHz
16GB (8x2GB) ATP Registered FBDIMM DDR2-667 CL 5 ECC
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 700W

Opteron Server 1: Supermicro SC818TQ-1000 1U Chassis

2x - 4x AMD Opteron 8356 series at 2.3GHz
System board Supermicro H8QMi-2+
32GB (16x2GB) ATP Registered DDR2-667 CL 5 ECC
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1000W w/PFC (Model PWS-1K01-1R)

Client Configuration: Intel Core 2 Quad Q6600

Foxconn P35AX-S
4GB (2x 2GB) Kingston 667MHz DDR2
NIC: Intel Pro/1000

Software

SUSE Linux Enterprise Server (SLES) 10 SP1 (2.6.16.54-0.2.5-smp)
Windows 2003 x64 R2
SPECjbb2005 1.07
3DSMax 2008
CINEBENCH 10



Finally, an EPIC battle!

Intel has executed well over the past 2-3 years - Swiss clockwork well. The 45nm family has lowered power consumption significantly and raised performance by about 10 to 20%. That allows Intel to win almost every benchmark in the desktop and workstation market. But don't worry; things are a lot more interesting in the server market.

You might remember from our in-depth analysis that the 45nm Intel CPUs are at least as good as AMD's latest in raw FP performance on a clock-for-clock basis. When it comes to pure integer power, the quad-core Opteron does not have a chance against the 45nm Intel CPUs: the latter are clock-for-clock an impressive 40-45% faster. Add to that the fact that the fastest Intel CPU runs at 3.2GHz while AMD is stuck at 2.5GHz for the moment, and it is clear that the AMD chips do not have a chance in single-threaded integer workloads. HP posted the SPEC CPU 2006 scores of two very similar servers:

SPEC2006 Performance Comparison
CPU | Tested Server | SPECint2006 (base - peak) | SPECfp2006 (base - peak)
Opteron 2356 2.3GHz | ProLiant BL465c G5 | 13.2 - 14.8 | 16.2 - 17.8
Xeon L5410 2.33GHz | ProLiant BL460c | 18.8 - 21.6 | 16.8 - 19.8

The latest Opteron is left far behind in the integer benchmark, but is competitive in floating point when you compare clock-for-clock. It is not hard to see why Intel's 45nm CPUs are superior in single-threaded workloads. Luckily (for AMD), Intel took its time to introduce its impressive 45nm technology in the quad-socket market, and AMD only faces Intel's 65nm family for now.

Extended SPEC2006 Performance Comparison
CPU | Tested Server | SPECint2006 (base - peak) | SPECfp2006 (base - peak) | SPECint2006 rate (base - peak) | SPECfp2006 rate (base - peak)
Opteron 8356 2.3GHz | ProLiant BL685c G5 | 12.2 - 13.8 | 15.1 - 17.2 | 160 - 184 | 143 - 157
Xeon 7340 2.4GHz | ProLiant BL680c G5 | 18 - 20.4 | 15.9 - 18.3 | 157 - 188 | 100 - 108
Xeon 7330 2.4GHz | PRIMERGY RX600 S4 | n/a | n/a | 151 - 177 | 97.6 - 104

While AMD's flagship processor is still no match in single-threaded integer code, it matches the slightly higher clocked Intel Xeon 7340 in multi-threaded integer performance. Single-threaded floating point performance is essentially the same (clock-for-clock), and when it comes to multi-threaded floating point performance on huge datasets, there is no stopping the best AMD chip: it is up to 50% faster than its competitor. It is interesting to note that the Xeon X7350 at 2.93GHz is not faster than its slower brother at 2.4GHz in SPECfp2006, clearly indicating a bottleneck.
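To make the "up to 50%" claim concrete, the rate scores from the table can be turned into relative advantages. A short sketch using only the published numbers above (the dictionary layout and function name are ours):

```python
# (base, peak) SPECfp2006 rate and SPECint2006 rate scores from the table above.
specfp_rate = {
    "Opteron 8356": (143, 157),
    "Xeon 7340": (100, 108),
    "Xeon 7330": (97.6, 104),
}
specint_rate = {
    "Opteron 8356": (160, 184),
    "Xeon 7340": (157, 188),
}


def advantage(a, b):
    """Relative advantage of score a over score b (0.43 means 43% faster)."""
    return a / b - 1


amd_base, amd_peak = specfp_rate["Opteron 8356"]
for xeon in ("Xeon 7340", "Xeon 7330"):
    base, peak = specfp_rate[xeon]
    print(f"SPECfp rate, Opteron 8356 vs {xeon}: "
          f"base {advantage(amd_base, base):+.0%}, peak {advantage(amd_peak, peak):+.0%}")

int_base = advantage(specint_rate["Opteron 8356"][0], specint_rate["Xeon 7340"][0])
print(f"SPECint rate (base), Opteron 8356 vs Xeon 7340: {int_base:+.0%}")
```

The floating point advantage lands between roughly 43% and 51% depending on which Xeon and which score is used, while the integer rate scores are within a few percent of each other.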

You probably guessed what the bottleneck in the Xeon system is. We used our multi-threaded, 64-bit Linux Stream binary (courtesy of Alf Birger Rustad) based on v2.4 of PathScale's C compiler, compiled with the following switches:

-Ofast -lm -static -mp

We tested with 16 threads.

Memory Performance Comparison (MB/s)
System | Copy | Scale | Add | Triad | Average
Quad Opteron 8356 | 20867 | 20860 | 20892 | 20945 | 20891
Quad Xeon 7330 | 9778 | 8973 | 9008 | 9008 | 9192

No matter which Xeon 73xx you use, the best each core can hope for is less than 600MB/s of memory bandwidth. That is slightly better than a PIII 1GHz with 133MHz SDRAM! Considering that a current Xeon core at 2.4GHz is at least four times faster than a PIII at 1GHz, it is clear that this is a severe bottleneck that won't be solved until a Xeon "Nehalem" MP with CSI is available. Until then, Intel's Xeon MP faces a very capable competitor.
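The "less than 600MB/s" figure is the Stream average divided over the 16 cores of a quad-socket system. A minimal sketch of that calculation, assuming the table values are in MB/s and that each of the 16 threads ran on its own core:

```python
# Stream results (Copy, Scale, Add, Triad) from the table above, assumed to be MB/s.
stream_mb_per_s = {
    "Quad Opteron 8356": [20867, 20860, 20892, 20945],
    "Quad Xeon 7330": [9778, 8973, 9008, 9008],
}
CORES = 16  # four quad-core CPUs, one Stream thread per core


def average(values):
    """Arithmetic mean of the four Stream kernels."""
    return sum(values) / len(values)


per_core = {name: average(v) / CORES for name, v in stream_mb_per_s.items()}

for name, mb in per_core.items():
    print(f"{name}: {average(stream_mb_per_s[name]):.0f} MB/s total, {mb:.0f} MB/s per core")
```

Per core, the Opteron system gets more than twice the bandwidth of the Xeon system in this test.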



SAP SD

The SAP SD (sales and distribution, two-tier internet configuration) benchmark is an extremely interesting benchmark as it is a real-world client/server application. So we decided to take a look at SAP's benchmark database. The results below are two-tier benchmarks, so the database and the underlying OS can make a big difference. Unless we keep those parameters the same, we cannot compare the results. The results below were all obtained on Windows 2003 Enterprise Edition and MS SQL Server 2005 (both 64-bit). All Xeon 73xx and Opteron 83xx systems were equipped with 64GB of RAM; a few older systems (Opteron 22xx) had only 32GB, but the impact is not significant. All these benchmarks were run on the SAP "ERP release 2005" two-tier Sales and Distribution benchmark.

SAP Sales and Distribution Two-Tier benchmark

The graph above contains a flood of numbers, but it definitely deserves a deeper analysis. The Quad Opteron 2.3GHz manages to outperform a 2.4GHz Xeon 7340, and is less than 6% slower than an X7350, which has a 27% higher clock. This excellent performance of the AMD chip is not only the result of the ample bandwidth that the AMD cores have at their disposal. The Xeon E5345 "Clovertown" and Xeon "Tigerton" E7340 are the same CPUs, running on very similar platforms. The former has about 6GB/s for 8 cores, the latter 9GB/s for 16 cores. The SAP performance still scales almost as perfectly as you can expect from such a complex piece of software: we are seeing up to 80% performance gain from doubling the number of cores.

If bandwidth really were the bottleneck, the quad Opteron 8222SE (3GHz, up to 12GB/s for 8 cores) should be able to outperform the Xeon 5365 (3GHz, 6GB/s for 8 cores), and the Dual Intel Xeon 5365 at 3GHz (quad-core) would not be able to scale so well (+86%!) compared to the Dual Xeon 5160 (3GHz, dual-core), as both systems have the same bandwidth.

It would require lengthy profiling of the SAP application and the underlying database to fully understand the results, but there are a few hints available. First of all, analyses by Intel and Sun, for example, show that the underlying code of SAP SD is rather branch intensive, latency sensitive, and runs at low IPC. Now look at the dual (8220 extrapolated to 8222), quad, and octal-socket Opteron 8222. We know that the software scales very well. From dual to quad we again get an estimated 80-85% scaling, but from quad to octal only 41%. There is another clue: the problem of the octal-socket Opteron 8222 is that some nodes are three hops away from each other (from CPU0 to CPU15 for example) and that the synchronization latency between those CPUs can be quite high. So let us sum this all up in a rough profile of SAP SD.

  • Very parallel; excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Not really limited by memory bandwidth
  • Likes large caches
  • Sensitive to sync latency; the octal-socket Opteron scales rather badly

Add to the above clues that eight cores of 2.3GHz "Barcelona" are a bit faster than eight cores of the previous Opteron generation at 3GHz. The small improvements in integer IPC cannot explain this, as SAP SD is hard to speed up through IPC improvements. We think it is safe to conclude that the SAP SD benchmark is one of the examples where being a "native quad-core" pays off. The cache coherency traffic of the third generation of Opterons scales with the number of sockets and not the number of caches. The four L2 caches might be small, but their latency is good, and they make sure that the four different threads running on a CPU do not interfere too much with each other. In the case of SAP, this pays off in excellent performance.

The 45nm Xeons are about 10-15% faster than their older siblings. If AMD can finally produce Barcelona chips at higher clock speeds, the latest quad-core Opteron should be able to keep up with the fastest 45nm CPUs… until Intel's next generation "Nehalem" comes out, at least, but we'll save that analysis for a future article.



Java Servers SPECjbb2005

SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server-side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of testing with a possible disk intensive database system, SPECjbb uses tables of objects, implemented by Java Collections, rather than a separate database. A longer description can be found here.

Again, it is not our objective to show the best possible scores. Very few people will take the time to fully tune the JVM and take the risk that some of the ultra aggressive optimizations backfire. So we tested with some decent but rather generic tuning that we could use on all systems. We used almost the exact same setup as we described here in great detail. The only changes were:

  • We use the BEA JRockit p27.5.0-5 Linux x86_64, which has optimizations for both the "Barcelona" and the "Tigerton" CPUs.
  • We have enabled large pages of 2MB each. This delivers a big performance jump.
  • We do not use the Sun JVM as Sun is about to release a new JVM version which is significantly faster than the current one. This new JVM (1.6.0_06-prelease) allows Sun to claim the crown in SPECjbb2005 as you can see here.
  • We changed one of our java parameters: -Xms1800m -Xmx1800m -Xns1300m -XXaggressive -XXlazyunlocking -Xgc:genpar -XXtlasize:min=4k,preferred=512k -XXcallprofiling
  • We use Xns = 1.3GB instead of 1.5GB (used in previous reviews) as this leaves more room for the old space (Xms minus Xns), which avoids excessive garbage collection.
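The effect of the smaller nursery on the old space is simple arithmetic. The sketch below uses the 1800MB heap from the parameters above; that the previous reviews used the same 1800MB heap with a 1500MB nursery is our assumption, made only for the sake of comparison.

```python
def old_space_mb(heap_mb, nursery_mb):
    """Old-generation space left once the nursery (-Xns) is carved out of the heap (-Xms)."""
    return heap_mb - nursery_mb


HEAP_MB = 1800  # -Xms1800m -Xmx1800m, from the parameter list above

current = old_space_mb(HEAP_MB, 1300)   # -Xns1300m
# Assumption: earlier reviews used the same heap size with a 1.5GB nursery.
previous = old_space_mb(HEAP_MB, 1500)

print(f"Old space with Xns=1300m: {current} MB")
print(f"Old space with Xns=1500m: {previous} MB")
```

Under that assumption the old space grows from 300MB to 500MB, which is the extra room the last bullet refers to.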

Below you can find the final score that SPECjbb2005 reports, which is an average of the last four runs.

SPECjbb2005, 64-bit on SUSE Linux 10 SP1

The scores on spec.org do not agree with our scores. Dell, for example, reports that the Quad Xeon 7350 should be able to reach up to 446209 BOPS. HP reports that the Quad Opteron can reach 368534 BOPS. Who is right? We're probably both right, as we use very different settings. It shows how confusing a benchmark can become when software optimization is pushed to insane heights just to get the best score.

We feel that it is likely that our results are closer to the real world. We keep our optimizations consistent over the different architectures, and it is a bit weird to see that the current scores of the Xeon 7350 are so high when it only has 9GB/s of bandwidth available. Considering that SPECjbb2005 uses between 8 and 32GB of RAM very actively, it is no surprise that memory bandwidth makes a big difference in performance. In fact, if you do not use Large Pages of 2MB, the benchmark is severely bottlenecked by the TLBs - another indication of how this benchmark uses memory.

We welcome feedback and our testing methods and results can be reproduced.



HPC: LINPACK on SUSE Linux 64-bit

LINPACK, a benchmark application based on the LINPACK TPP code, has become the industry standard benchmark for HPC. It solves large systems of linear equations by using a high performance matrix kernel. We used a workload of a 40,000 square matrix. We ran eight or sixteen threads. As the system was equipped with 16 to 32GB of RAM, the large matrices all ran in memory. LINPACK results are expressed in GFLOPs (Giga/Billions of Floating Point Operations Per Second). We used two versions of LINPACK:

  • Intel's version of LINPACK, which uses the highly optimized Intel Math Kernel Library (MKL), version 9.1
  • A "K10 only" version, which is fully optimized for AMD's quad-core.

The "K10 only" version uses the ACML version 4.0.0 and is compiled using the PGI 7.0.7. The following flags were used:

pgcc -O3 -fast -tp=barcelona-64

So basically we used the same binaries as in our previous dual-socket article. The results are comparable, but we didn't test with matrix sizes of 40,000 last time as we had less memory. We measured about 3-4% better performance with a matrix size of 40k instead of 30k on the quad-socket systems. On dual-socket systems, this was about 1-3%.
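To see why the 40,000 matrix only became possible with more memory, it helps to estimate the footprint. The sketch below assumes double-precision (8-byte) elements and uses the standard LINPACK operation count of 2/3 n^3 + 2 n^2; the function names are ours.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision (64-bit) matrix elements


def matrix_gib(n):
    """Memory needed for one dense n x n matrix, in GiB."""
    return n * n * BYTES_PER_DOUBLE / 2**30


def lu_gflop(n):
    """Approximate floating point work of the solve (2/3 n^3 + 2 n^2), in GFLOP."""
    return (2 * n**3 / 3 + 2 * n**2) / 1e9


for n in (30_000, 40_000):
    print(f"n = {n}: {matrix_gib(n):.1f} GiB, about {lu_gflop(n):,.0f} GFLOP of work")
```

A 40,000 matrix needs just under 12 GiB, so it fits in a 16GB system but leaves little headroom, while the 30,000 matrix needs under 7 GiB.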

LINPACK, matrix size = 40000 on 64-bit SUSE Linux 10 SP1

All our compiled binaries are based on Math Kernel Libraries that were available at the end of 2007. So the graph above gives you an idea how HPC code compiled for the 65nm generation of Intel will perform. With the newest Intel Math Kernels (10.0), the dual Xeon 5472 can achieve 77 GFLOPS according to Intel. AMD has released ACML 4.1.0 now. We'll update our LINPACK binaries in a later article.

To reduce the enormous amount of benchmarking time, we reduced the LINPACK testing to only 3 platforms. You can see that the Intel and AMD platforms are very close on a clock-for-clock basis. In this kind of parallel workload, quad-socket really pays off; you get a 70% performance premium over the best dual-socket platform.



Render Servers

CINEBENCH 10, which is based on Maxon's Cinema 4D rendering engine, is our next test. This test is easy to repeat at home and one of the few benchmarks that can use up to 16 cores. This benchmark was performed on Windows 2003 x64 R2.

CINEBENCH 10 64-bit

Clock-for-clock, the Intel CPUs are about 15% faster. The top Intel CPU beats the best AMD chip by a margin of 40%. A dual Xeon 5472 scores in the 22k range, making it a much cheaper and lower power option for those of you looking for a faster render server than the AMD 8356. In this case, you might even ask yourself if a cluster of dual-socket 5472 servers is not a much better solution than our quad monsters.

We also tried 64-bit versus 32-bit:

CINEBENCH 32-bit vs. 64-bit
CPU config | CINEBENCH 10 32-bit | CINEBENCH 10 64-bit | 64-bit vs 32-bit
Quad Opteron 8356 | 18823 | 21796 | 16%
Quad Xeon 7330 | 22477 | 25770 | 15%
Quad Xeon 7350 | 27366 | 30943 | 13%

It is a very small difference, but the Opteron seems to gain just a tiny bit more out of 64-bit than Intel's Xeon. Of course we should try to see how the platforms scale:

CINEBENCH 1CPU vs. XCPU
CPU config | CINEBENCH 10 64-bit (1 thread) | CINEBENCH 10 64-bit | Scaling
Quad Opteron 8356 2.3GHz | 2196 | 21796 | 9.9
Quad Xeon 7330 2.4GHz | 2748 | 25770 | 9.4
Quad Xeon 7350 2.93GHz | 3346 | 30943 | 9.2
Dual Xeon 7350 2.93GHz | 3346 | 20326 | 6.1
Dual Opteron 8356 2.3GHz | 2196 | 14487 | 6.6
Dual Xeon 7330 2.4GHz | 2748 | 16424 | 6.0

The scaling numbers are consistent: the Opteron scales just a little bit better. That is no surprise, but do understand that CINEBENCH runs almost perfectly in the L2 cache.
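The scaling column is simply the multi-threaded score divided by the single-threaded score. The sketch below recomputes it and adds a parallel efficiency figure; the core counts assume four cores per socket, and the efficiency metric is our addition, not something the benchmark reports.

```python
# (single-thread score, all-thread score, core count) from the CINEBENCH tables above.
# Core counts assume four cores per socket.
results = {
    "Quad Opteron 8356": (2196, 21796, 16),
    "Quad Xeon 7330": (2748, 25770, 16),
    "Quad Xeon 7350": (3346, 30943, 16),
    "Dual Xeon 7350": (3346, 20326, 8),
    "Dual Opteron 8356": (2196, 14487, 8),
    "Dual Xeon 7330": (2748, 16424, 8),
}


def scaling(single, multi):
    """Speedup of the multi-threaded run over the single-threaded run."""
    return multi / single


def efficiency(single, multi, cores):
    """Fraction of ideal linear scaling that was achieved."""
    return scaling(single, multi) / cores


for name, (single, multi, cores) in results.items():
    print(f"{name}: {scaling(single, multi):.1f}x on {cores} cores "
          f"({efficiency(single, multi, cores):.0%} of linear)")
```

Going from two to four sockets costs every platform a noticeable share of its efficiency, and the Opteron loses slightly less than the Xeons.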

3DS Max 2008

We used the "architecture" scene which is included in the SPEC APC 3dsmax test. All tests were done with 3dsmax's default scanline renderer, SSE enabled, and we rendered at 720p (1280x720) resolution. We measured the time it takes to render 10 frames from 20 to 29. All results are reported as rendered images per hour; higher is thus better. We used the 32-bit version of 3dsmax 2008 on Windows 2003 x64 R2 - for some weird reason the 64-bit version is a bit slower (especially when you use the scanline renderer).



3DSMax 2008 32-bit - architecture scene

Although the quad Opteron 8356 does a little better here, the advantage over the dual-socket Intel 5472 is too small. From a TCO point of view, dual-socket 5472 render servers are the way to go. The quad Xeon at 2.4GHz is also not capable of leaving its 45nm brother far behind. A quad 2.93GHz server system is not only a lot more expensive, it also consumes a lot more power, as we will see next.



Power

Both our AMD and Intel systems found a home in exactly the same chassis, powered by exactly the same power supply (Supermicro 1000W w/PFC Model PWS-1K01-1R). The Xeon 5472 results are not directly comparable as it uses a different PSU.

System Power Requirements (W)
CPU configuration | CINEBENCH 10 64-bit | Idle | Idle + PowerNow! / SpeedStep | SPECjbb2005, CPU load at 100%
Quad Opteron 8356 | 537 | 303 | 289 | 530
Dual Opteron 8356 (16GB) | 340 | 215 | 208 | 337
Quad Xeon 7350 | 702 | 462 | 381 | 701
Quad Xeon 7330 | 536 | 375 | 355 | 536
Dual Xeon 7350 | 500 | 380 | 339 | 499
Dual Xeon 7330 | 416 | 359 | 335 | 410
Dual Xeon 5472* | 349* | n/a | 251* | 345*
* Other PSU and only 16GB of DIMMs.

When it comes to full load, the Intel Xeon 7330 and AMD 8356 based servers consume about the same amount of power. The AMD server consumes about 66W less when it is running idle. As always, the FB-DIMMs are the problem. We could not test the dual Opteron with 16 DIMMs (it has only eight slots available with two CPUs), but the difference between a quad and a dual Opteron 8356 is almost 200W. Assume that the eight DDR2 DIMMs are good for about 50W. In that case, two Opteron 8356 are good for 150W of extra power. In the case of the Xeon MP 7330, we see that taking away two Xeons lowers the power consumption by about 120W, or 60W per CPU. If you consider the power usage of the memory controller, the AMD and Intel CPUs are in the same ballpark.
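The per-CPU estimate above can be written out explicitly. The sketch uses the CINEBENCH full-load column of the power table and the article's own assumption of about 50W for the eight DDR2 DIMMs that disappear together with two Opterons; the function name is ours.

```python
def watts_per_cpu(quad_w, dual_w, removed_dimm_w=0, cpus_removed=2):
    """Estimated power per CPU from the quad vs. dual system difference."""
    return (quad_w - dual_w - removed_dimm_w) / cpus_removed


# Full load (CINEBENCH 10 64-bit) figures from the power table above.
# The 50W for eight DDR2 DIMMs is the estimate made in the text.
opteron_8356 = watts_per_cpu(537, 340, removed_dimm_w=50)

# The Xeon system keeps all of its FB-DIMMs when two CPUs are removed.
xeon_7330 = watts_per_cpu(536, 416)

print(f"Opteron 8356: about {opteron_8356:.0f}W per CPU (includes the memory controller)")
print(f"Xeon 7330:    about {xeon_7330:.0f}W per CPU")
```

With the exact table values the Opteron comes out slightly below the rounded 75W implied by the text, and the gap to the Xeon is small enough to be explained by the on-die memory controller.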

We were very surprised to see so little difference between the idle power of the Opteron 8356 at 2.3GHz and the power consumed when PowerNow! throttles the CPU down to half its speed. AMD implemented advanced clock gating in the newest Opteron: big blocks like the FPU can be turned off, but it is possible to turn off many small sections too. Even the memory controller can turn off both the read and write logic. The result is that PowerNow! makes only a 3.5W difference per CPU in both the dual- and quad-socket configurations. According to AMD, PowerNow! provides at most between 1% and 80% benefit at load (yes, that's a big range). The other reason might be that the voltage drop from 2.3GHz to 1.15GHz is rather modest: from 1.15V at 2.3GHz to 1.05V at 1.15GHz. The Intel Xeon MP 7330 reduces its voltage from 1.26V at 2.4GHz to 1.13V at 1.6GHz.


This might also explain the small difference we have seen. In the table below we compare 100% CPU load (full load) with "lowest". With lowest we refer to the power consumption numbers with SpeedStep/PowerNow! enabled.

System Power Savings
CPU configuration | Full load vs lowest (W) | Percentage saving | Idle vs lowest (W) | Percentage saving
Quad Opteron 8356 | 241 | 45% | 14 | 5%
Dual Opteron 8356 (16GB) | 129 | 38% | 7 | 3%
Quad Xeon 7350 | 320 | 46% | 81 | 18%
Quad Xeon 7330 | 181 | 34% | 20 | 5%
Dual Xeon 7350 | 160 | 32% | 41 | 11%
Dual Xeon 7330 | 75 | 18% | 24 | 7%

As the CPUs in the AMD system are the ones which consume the most, more power is saved between full load and idle. It also seems that AMD's clock gating is a bit more advanced than that of Intel's 65nm CPU family.
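The savings table can be derived from the power measurements. A short sketch, assuming "full load" is the SPECjbb2005 column and "lowest" is the idle figure with PowerNow!/SpeedStep enabled (the data layout and function name are ours):

```python
# (full load, idle, lowest) in watts from the System Power Requirements table.
# "Full load" is the SPECjbb2005 column; "lowest" is idle with PowerNow!/SpeedStep.
power_w = {
    "Quad Opteron 8356": (530, 303, 289),
    "Dual Opteron 8356": (337, 215, 208),
    "Quad Xeon 7350": (701, 462, 381),
    "Quad Xeon 7330": (536, 375, 355),
    "Dual Xeon 7350": (499, 380, 339),
    "Dual Xeon 7330": (410, 359, 335),
}


def saving(high_w, low_w):
    """Return (watts saved, fraction of the higher figure saved)."""
    saved = high_w - low_w
    return saved, saved / high_w


for name, (full, idle, lowest) in power_w.items():
    load_w, load_pct = saving(full, lowest)
    idle_w, idle_pct = saving(idle, lowest)
    print(f"{name}: full load vs lowest {load_w}W ({load_pct:.0%}), "
          f"idle vs lowest {idle_w}W ({idle_pct:.0%})")
```

Percentages are taken relative to the higher of the two figures, which reproduces the values in the table.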



Conclusion

The first conclusion is that we should have included more applications for benchmarking. There are two reasons for this. The first one is that a multitude of little problems made us delay our VMware/Xen/Hyper-V benchmarking efforts to a later article. The second one is that many of our standard benchmarks simply cannot scale to 16 cores: our chess, WinRAR, zVisuel and MySQL benchmarks are limited to 8 cores. This clearly illustrates how few applications can benefit from even higher numbers of cores on a CPU when running in a non-virtualized environment.

The Intel Xeons reign supreme when it comes to rendering but there is a catch. The catch is that the superb 45nm Xeon 5472 makes many of the quad-socket platforms of both AMD and Intel pretty obsolete. When it comes to rendering, the quad-socket Xeon servers at less than 2.4GHz simply do not make sense when compared to a Xeon 5472 server: more expensive, more power, and slower performance.

The SPECjbb and SAP tests show that AMD's quad-core Opteron, even at 2.3GHz, is a very potent server CPU. In fact, if the current Barcelona chips had not been stuck at these rather disappointing clock speeds, AMD would have given Intel's engineers a really tough challenge. Now it is a very close call, just like in the LINPACK benchmark.

At the end of the day, that does not really matter. What matters is that the enterprise that wants to run a Java or an ERP application can run it on a server with an excellent performance/Watt ratio. The AMD and Intel platforms are very close in this respect, but AMD pulls slightly ahead thanks to the lower power consumption when running at low load. In many cases, ERP and Java applications run at low load during some parts of the day. Many of the HPC benchmarks (Fluent, LSDyna) also give the AMD CPU an advantage.

So for a few months, AMD might have a slight edge over the Intel Armada, but not for long. Looking at how well the Xeon "Harpertown" 5472 performs, and knowing that the next Xeon MP "Dunnington" is basically three of those "Harpertown" Xeons with a large L3 cache, it is clear Intel is going to assault AMD's last stronghold in the near future.

References

[1] IDC, Kenneth Cayton and Jed Scaramella, "IBM System X4: Delivering High Value Through Scale Up", Sponsored by: IBM, January 2008

[2] Gabriel Consulting Group, "X86 buying trends: big is in", October 2007

[3] IDC, John Humphreys and Tim Grieser, "Mainstreaming Server Virtualization: The Intel Approach", Sponsored by: Intel, June 2006

[4] AMD, Phil Hester CTO, "2006 Technology Analyst Day"
