IBM POWER4+
In the systems that make up IBM's supercomputer line, the p690 series, the
nodes still contain the POWER4+ chip as the computational engine, although its
successor, the POWER5 chip, is already available in other IBM server lines.
Undoubtedly the POWER5 will eventually appear in the p690-type systems, but
presently this is not yet the case. So, we discuss both chips: the POWER4+ and
the POWER5 as the successor that will replace it.
At the time of writing, the clock frequency of the POWER4+ is 1.5 to 1.9 GHz. It
is a boost of the original POWER4 chip and has no new features other than the
higher clock frequency. The chip size has become so large (or rather the
feature size has become so small) that IBM now places two processor cores on
one chip as shown in Figure 10a. The chip also
harbours 1.5 MB of secondary cache divided over three modules of 0.5 MB each.
Figure 10a: Diagram of the IBM POWER4+ chip layout.
The L2 cache modules are connected to the processors by the Core
Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40
B/cycle per port. This makes it possible to ship 32 B to either the L1
instruction cache or the data cache of each of the processors and to
store 8 B values at the same time. Also, for each processor there is a
Non-cacheable Unit that interfaces with the Fabric Controller and that
takes care of non-cacheable operations. The Fabric Controller is
responsible for the communication with three other chips that are
embedded in the same Multi Chip Module (MCM), to L3 cache, and to other
MCMs. The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively. The
chip also contains a variety of other devices: the L3 cache
directory and the L3 and Memory Controller, which should bring down the
off-chip latency considerably, and the GX Controller, which is responsible for
the traffic on the GX bus. This bus transports data to/from the system
and in practice is used for I/O. The maximum size of the L3 cache is huge:
32 MB. Some of the integrated devices, like the Performance Monitor and the
logic for error detection and logging, are not shown in Figure 10a.
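The quoted figures can be related to each other with a little arithmetic. The sketch below, whose helper names are purely illustrative, converts the 40 B/cycle CIU port width into GB/s at the 1.7 GHz clock used above and, going the other way, expresses the three Fabric Controller bandwidths in bytes per processor cycle. It assumes that 1 GB/s means 10^9 B/s and that the listed bandwidths follow the order in which the destinations are named.

```python
# Illustrative helpers (not part of any IBM tool): relate B/cycle to GB/s.

def bandwidth_gb_per_s(bytes_per_cycle: float, clock_ghz: float) -> float:
    """GB/s delivered by a link that moves bytes_per_cycle at clock_ghz."""
    return bytes_per_cycle * clock_ghz


def bytes_per_cycle(bandwidth_gb_s: float, clock_ghz: float) -> float:
    """Bytes moved per processor cycle for a given bandwidth in GB/s."""
    return bandwidth_gb_s / clock_ghz


CLOCK_GHZ = 1.7  # the clock frequency at which the text quotes bandwidths

# One CIU crossbar port: 32 B load + 8 B store = 40 B/cycle.
ciu_port_gb_s = bandwidth_gb_per_s(40, CLOCK_GHZ)

# Fabric Controller links, in the order given in the text (assumed mapping).
fabric_gb_s = {
    "other chips in the MCM": 13.6,
    "L3 cache": 9.0,
    "other MCMs": 6.8,
}
fabric_bytes_per_cycle = {
    name: bytes_per_cycle(bw, CLOCK_GHZ) for name, bw in fabric_gb_s.items()
}

if __name__ == "__main__":
    print(f"CIU port: {ciu_port_gb_s:.1f} GB/s")
    for name, b in fabric_bytes_per_cycle.items():
        print(f"{name}: {b:.2f} B/cycle")
```

Under these assumptions a single CIU port corresponds to about 68 GB/s, while the links to the other chips in the MCM and to other MCMs work out to about 8 and 4 B per processor cycle.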
A block diagram of the processor core is shown in Figure 10b.
Figure 10b: Block diagram of the POWER4+ processor core.
In many ways the POWER4+ processor core is similar to the former POWER3
processor, with a few differences: there are 2 integer functional units
(called Fixed Point Units by IBM) instead of 3 and, instead of a fused
Branch/Dispatch Unit, the POWER4 core has separate Branch and Conditional
Register Units, 8 execution units in all. Oddly, the instruction cache is
two times larger than the data cache (64 KB direct-mapped vs. 32 KB two-way
set associative, respectively). All execution units have instruction
queues associated with them that enable the out-of-order processing of
up to 200 instructions in various stages. Having so many instructions
simultaneously in flight calls for very sophisticated branch prediction
facilities. Instructions are fetched from the Instruction Cache under
control of the Instruction Fetch Address Register, which in turn is
influenced by the branch prediction logic. This consists of a local and a
global Branch History Table (BHT), each with 16 K entries and a
so-called selector table which keeps track of which of the BHTs has
functioned best in a particular case in order to select the prediction
priority of the BHTs for similar cases coming up.
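The interplay of the two BHTs and the selector table can be illustrated with a toy model. The sketch below is not IBM's implementation: only the table size of 16 K entries comes from the text, while the 1-bit entries, the 11-bit global history, and the index functions are assumptions made for illustration.

```python
class TournamentPredictor:
    """Toy local/global branch predictor with a selector table.

    Assumptions (not from the text): 1-bit table entries, an 11-bit
    global history, and the index functions in _indices().
    """

    def __init__(self, entries=16 * 1024, history_bits=11):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.local = [False] * entries       # last outcome per branch address
        self.global_ = [False] * entries     # last outcome per address/history
        self.use_global = [False] * entries  # selector: trust the global BHT?
        self.history = 0                     # recent outcomes, newest in bit 0

    def _indices(self, address):
        local_idx = address % self.entries
        global_idx = (address ^ self.history) % self.entries
        return local_idx, global_idx

    def predict(self, address):
        li, gi = self._indices(address)
        return self.global_[gi] if self.use_global[li] else self.local[li]

    def update(self, address, taken):
        li, gi = self._indices(address)
        local_ok = self.local[li] == taken
        global_ok = self.global_[gi] == taken
        # The selector only moves when exactly one of the tables was right.
        if local_ok != global_ok:
            self.use_global[li] = global_ok
        self.local[li] = taken
        self.global_[gi] = taken
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = TournamentPredictor()
BRANCH_ADDRESS = 0x40  # hypothetical address of one conditional branch
late_misses = 0
for i in range(200):
    taken = (i % 2 == 0)  # strictly alternating taken / not taken
    if i >= 100 and predictor.predict(BRANCH_ADDRESS) != taken:
        late_misses += 1
    predictor.update(BRANCH_ADDRESS, taken)
print("mispredictions in the last 100 branches:", late_misses)
```

For the alternating branch used here, a 1-bit local table on its own would always repeat the previous outcome and so mispredict every time. In this toy model the selector therefore settles on the global table, which distinguishes the two cases through the branch history.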
Unlike in the POWER3, the fixed point units perform integer arithmetic
operations that can complete in one cycle as well as
multi-cycle operations like integer multiply and divide. There are no
separate floating-point units for operations that require many cycles
like divisions and square roots. All floating-point operations are
taken care of in the FP units and, as in the HP PA-8800, there is an
instruction to accommodate the axpy operation, called Fused
Multiply Add (FMA) by IBM, which can deliver 2 floating-point
results every cycle in each FP unit. This brings the theoretical peak
performance to 7.6 Gflop/s at the highest current clock frequency of 1.9 GHz.
As in the HP processor, the composition of the floating-point operations
should be such that the units indeed have enough FMAs to perform; otherwise
the performance drops by a factor of 2.
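The peak figure can be reproduced with a short calculation. The sketch below assumes two FP units per core, each completing one FMA (a multiply and an add, so 2 floating-point results) per cycle, and the top clock frequency of 1.9 GHz; the unit count is an assumption chosen to be consistent with the quoted 7.6 Gflop/s. The function names are illustrative only.

```python
def peak_gflops(clock_ghz, fp_units=2, flops_per_unit_per_cycle=2):
    """Theoretical peak in Gflop/s for one core.

    fp_units=2 is an assumption; flops_per_unit_per_cycle is 2 when every
    instruction is an FMA and 1 when only separate adds or multiplies issue.
    """
    return clock_ghz * fp_units * flops_per_unit_per_cycle


def axpy(a, x, y):
    """The axpy operation y <- a*x + y: one multiply and one add per element,
    which is exactly the pattern a fused multiply-add covers in one step."""
    return [a * xi + yi for xi, yi in zip(x, y)]


peak_with_fma = peak_gflops(1.9)
peak_without_fma = peak_gflops(1.9, flops_per_unit_per_cycle=1)

if __name__ == "__main__":
    print(f"all FMAs: {peak_with_fma:.1f} Gflop/s")
    print(f"no FMAs:  {peak_without_fma:.1f} Gflop/s")
    print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
```

With no FMAs in the instruction mix each unit completes only one floating-point result per cycle, which is the factor of 2 mentioned above.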
Although the dual core version of the chip described here
is positioned for general processing, a single core version is also
marketed that is recommended for HPC use. The reason is that in this
case the bandwidth from the L2 cache does not have to be shared between
the CPUs and a contention-free transfer of up to 108.8 GB/s can be
achieved, while in the dual core version a peak bandwidth of 163.2 GB/s
has to be shared between both CPUs.
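To make the comparison concrete, the following sketch works out the per-core share of the L2 bandwidth. It assumes, purely for illustration, that both cores of the dual core chip are active and split the peak bandwidth evenly; real sharing depends on the access pattern.

```python
SINGLE_CORE_L2_GB_S = 108.8  # contention-free, from the text
DUAL_CORE_L2_GB_S = 163.2    # peak, shared by both cores, from the text

# Assumption: an even split between two fully active cores.
per_core_dual_gb_s = DUAL_CORE_L2_GB_S / 2
single_core_advantage = SINGLE_CORE_L2_GB_S / per_core_dual_gb_s

if __name__ == "__main__":
    print(f"dual core, per core: {per_core_dual_gb_s:.1f} GB/s")
    print(f"single core advantage: {single_core_advantage:.2f}x")
```

Under this assumption each core of the dual core version sees about 81.6 GB/s, so the single core version offers roughly one third more L2 bandwidth per processor.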
It is interesting to see that presently three vendors (AMD, Compaq, and
IBM) have facilities that enable glueless coupling of processors
although the packaging and implementation are somewhat different. All
implementations allow for low-latency SMP nodes with a considerable
number of processors, stimulating the trend to build parallel systems
based on SMP nodes.
Aad van der Steen
Thu Oct 7 17:30:43 CEST 2004