| Machine type | RISC-based distributed-memory multi-processor |
| Models | IBM BlueGene/L. |
| Operating system | Linux |
| Connection structure | 3-D Torus, Tree network |
| Compilers | XL Fortran (Fortran 90), XL C, C++ |
| Vendors information Web page | www-1.ibm.com/servers/deepcomputing/ |
| Year of introduction | 2004. |
System parameters:
| Model | BlueGene/L |
| Clock cycle | 700 MHz |
| Theor. peak performance | |
| Per Proc. (64-bits) | 2.8 Gflop/s |
| Maximal | 367/183.5 Tflop/s |
| Main memory | |
| Memory/card | <= 512 MB |
| Memory/maximal | <= 16 TB |
| No. of processors | 2×65,536 |
| Communication bandwidth | |
| Point-to-point (3-D Torus) | 175 MB/s |
| Point-to-point (Tree network) | 350 MB/s |
Remarks:
The BlueGene/L is the first in a new generation of systems made by IBM for very
massively parallel computing. The individual speed of the processor has
therefore been traded in favour of very dense packaging and a low power
consumption per processor. The basic processor in the system is a modified
PowerPC 440 at 700 MHz. Two of these processors reside on a chip together with
4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The
processors have two load ports and one store port from/to the L2 caches at 8
bytes/cycle. This is half of the bandwidth required by the two floating-point
units (FPUs) and as such quite high. Each CPU has 32 KB of instruction cache
and 32 KB of data cache on board. In favourable circumstances a CPU can deliver a
peak speed of 2.8 Gflop/s because the two FPUs can perform fused multiply-add
operations. Note that the L2 cache is smaller than the L1 cache, which is quite
unusual but allows it to be fast.
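The per-CPU peak quoted above follows from the clock frequency and the FPU count. The following is a minimal sketch of that arithmetic, assuming that each FPU completes one fused multiply-add per cycle and that a fused multiply-add counts as two floating-point operations; the constant and function names are illustrative only.

```python
# Peak floating-point rate of one BlueGene/L CPU, from the figures in the text.
CLOCK_HZ = 700e6      # 700 MHz clock
FPUS_PER_CPU = 2      # two floating-point units per CPU
FLOPS_PER_FMA = 2     # assumption: one fused multiply-add = one multiply + one add


def peak_flops_per_cpu(clock_hz=CLOCK_HZ, fpus=FPUS_PER_CPU, flops_per_op=FLOPS_PER_FMA):
    """Theoretical peak in flop/s, assuming every FPU issues one FMA per cycle."""
    return clock_hz * fpus * flops_per_op


peak = peak_flops_per_cpu()
print(f"Peak per CPU: {peak / 1e9:.1f} Gflop/s")
```

Code that cannot keep both FPUs busy with fused multiply-adds in every cycle will see a correspondingly lower rate, which is why the text speaks of favourable circumstances.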
The packaging in the system is as follows: two chips fit on a compute card with
512 MB of memory. Sixteen of these compute cards are placed on a node board of
which in turn 32 go into one cabinet. So, one cabinet contains 1024 chips,
i.e., 2048 CPUs. For a maximal configuration 64 cabinets are coupled to form
one system with 65,536 chips/131,072 CPUs. In normal operation mode one of the
CPUs on a chip is used for computation while the other takes care of
communication tasks. In this mode the Theoretical Peak Performance of the
system is 183.5 Tflop/s. It is, however, possible to use both CPUs for
computation when the communication requirements are very low, doubling the peak
speed; hence the double entries in the System Parameters table above. The
doubled figure, quoted as 360 Tflop/s, is also the speed that IBM uses in its
marketing material.
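The packaging hierarchy and the two peak figures can be checked with a few multiplications. The sketch below only restates the numbers given in the text and the System Parameters table; the variable names are illustrative, and the memory total assumes every compute card carries the full 512 MB.

```python
# Packaging hierarchy of a maximal BlueGene/L configuration, as described in the text.
CHIPS_PER_CARD = 2
CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32
CABINETS = 64
CPUS_PER_CHIP = 2
MEMORY_PER_CARD_MB = 512      # assumption: every card is fully populated
PEAK_PER_CPU_GFLOPS = 2.8

chips_per_cabinet = CHIPS_PER_CARD * CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_CABINET
total_chips = chips_per_cabinet * CABINETS
total_cpus = total_chips * CPUS_PER_CHIP
total_cards = total_chips // CHIPS_PER_CARD
total_memory_tb = total_cards * MEMORY_PER_CARD_MB / 1024**2

# Normal mode: one CPU per chip computes, the other handles communication.
peak_normal_tflops = total_chips * PEAK_PER_CPU_GFLOPS / 1000
# Both CPUs computing: only viable when communication requirements are very low.
peak_both_tflops = total_cpus * PEAK_PER_CPU_GFLOPS / 1000

print(f"chips per cabinet : {chips_per_cabinet}")
print(f"chips / CPUs      : {total_chips} / {total_cpus}")
print(f"memory            : {total_memory_tb:.0f} TB")
print(f"peak (normal/both): {peak_normal_tflops:.1f} / {peak_both_tflops:.1f} Tflop/s")
```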
The BlueGene/L possesses no fewer than 5 networks, 2 of which are of interest for
inter-processor communication: a 3-D torus network and a tree network. The torus
network is used for most general communication patterns. The tree network is
used for frequently occurring collective communication patterns such as
broadcasts and reduction operations. The hardware bandwidth of the tree network
is twice that of the torus: 350 MB/s against 175 MB/s per link.
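To illustrate what a 3-D torus topology means for point-to-point traffic, the sketch below computes the six nearest neighbours of a node and the minimal hop count between two nodes, with wraparound in every dimension. The torus shape `DIMS` is a hypothetical example, not the shape of any actual installation, and the transfer-time helper is a crude estimate that ignores latency, protocol overhead and contention.

```python
TORUS_LINK_MB_S = 175.0   # per-link torus bandwidth quoted in the text
TREE_LINK_MB_S = 350.0    # per-link tree bandwidth quoted in the text


def torus_neighbours(coord, dims):
    """Return the six nearest neighbours of a node in a 3-D torus (with wraparound)."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimal number of links between nodes a and b, going either way round each ring."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


def transfer_seconds(n_bytes, link_mb_s):
    """Bandwidth-only estimate of the time to push n_bytes over a single link."""
    return n_bytes / (link_mb_s * 1e6)


DIMS = (8, 8, 16)  # hypothetical torus shape, chosen only for illustration
print(torus_neighbours((0, 0, 0), DIMS))
print(torus_hops((0, 0, 0), (7, 4, 15), DIMS))
print(transfer_seconds(1_000_000, TORUS_LINK_MB_S), transfer_seconds(1_000_000, TREE_LINK_MB_S))
```

The wraparound links are what distinguish a torus from a plain mesh: in the example, node (7, 4, 15) is only one hop away from the origin in the first and third dimensions, not seven and fifteen.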
At the time of writing this report no fully configured system exists yet. One
such system should be delivered to Lawrence Livermore National Laboratory by the
end of this year. A smaller system of around 34 Tflop/s peak will be delivered to
ASTRON, an astronomical research organisation in the Netherlands, for the
synthesis of radio-astronomical images.
Measured Performances:
IBM recently reported a speed of 70.7 Tflop/s on the HPC Linpack benchmark,
attained in solving a linear system of size N = 933,887,
see [44].
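To put the reported figure in perspective, the following sketch estimates the work and run time implied by that problem size. It assumes the conventional Linpack operation count of 2/3 N^3 + 2 N^2 for solving a dense system; the second term is negligible at this size. The elapsed time is derived from the two quoted numbers and is not a figure reported by IBM.

```python
def linpack_flops(n):
    """Conventional operation count for solving a dense n x n linear system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


N = 933_887            # problem size quoted in the text
RATE_FLOPS = 70.7e12   # reported Linpack speed, in flop/s

total_ops = linpack_flops(N)
elapsed_s = total_ops / RATE_FLOPS

print(f"operations : {total_ops:.3e}")
print(f"elapsed    : {elapsed_s:.0f} s (about {elapsed_s / 3600:.1f} h)")
```

Under these assumptions the run takes roughly two hours of wall-clock time.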
Aad van der Steen
Tue Mar 8 14:01:55 CET 2005