| Machine type | Distributed-memory multi-vector processor |
| Models | SX-8B, SX-8A, SX-8xMy |
| Operating system | Super-UX (Unix variant based on BSD V.4.3 Unix) |
| Connection structure | Multi-stage crossbar (see Remarks) |
| Compilers | Fortran 90, HPF, ANSI C, C++ |
| Vendor's information Web page | http://www.hpce.nec.com/572.0.html |
| Year of introduction | 2004 |
System parameters:
| Model | SX-8B | SX-8A | SX-8xMy |
| Clock cycle | 2 GHz | 2 GHz | 2 GHz |
| Theor. peak performance, per proc. (64 bits) | 16 Gflop/s | 16 Gflop/s | 16 Gflop/s |
| Theor. peak performance, maximal, single frame | 64 Gflop/s | 128 Gflop/s | --- |
| Theor. peak performance, maximal, multi frame | --- | --- | 90.1 Tflop/s |
| Main memory, DDR-SDRAM | 32–64 GB | 32–128 GB | ≤ 16 TB |
| Main memory, FCRAM | 16–32 GB | 32–64 GB | ≤ 8 TB |
| No. of processors | 1–4 | 4–8 | 8–4096 |
Remarks:
The SX-8 series is offered in numerous models, but most of these are just frames
that house a smaller number of the same processors. We only discuss the
essentially different models here. All models are based on the same processor,
a 4-way replicated vector processor where each set of vector pipes contains a
logical, mask, add/shift, multiply, and division pipe (see section SM-SIMD systems for an explanation of these
components). As multiplication and addition can be chained (but not division),
a pipe set can deliver two floating-point results per cycle, so its peak
performance at 2 GHz is 4 Gflop/s. Because of the 4-way
replication, a single CPU can deliver a peak performance of 16 Gflop/s. The
official NEC documentation quotes higher peak performances because the peak
performance of the scalar processor (rated at 4 Gflop/s, see below) is added to
the peak performance of the vector processor to which it belongs. We do not
follow this practice, as full utilisation of the scalar processor along with
the vector processor will in reality be next to non-existent. The scalar
processor is 2-way superscalar and, at 2 GHz, has a theoretical peak of 4
Gflop/s. The peak bandwidth per CPU is 64 B/cycle. This is sufficient to ship
eight 8-byte operands back or forth per cycle and just enough to feed one
operand to each of the replicated pipe sets.
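The peak figures above follow from simple arithmetic, which can be sketched as below. This is only an illustrative calculation using the numbers quoted in the text and the system parameters table; the constant and function names are made up for this sketch and are not part of any NEC tool.

```python
# Illustrative arithmetic for the SX-8 peak figures quoted in the text.
# All names below are hypothetical; only the numeric values come from the text.

CLOCK_GHZ = 2.0              # clock frequency (system parameters table)
FLOPS_PER_CYCLE_PER_SET = 2  # one chained multiply and one add per cycle
PIPE_SETS_PER_CPU = 4        # the "4-way replication" mentioned in the text
BYTES_PER_CYCLE = 64         # peak bandwidth per CPU as quoted in the text
OPERAND_BYTES = 8            # a 64-bit operand


def peak_gflops(n_cpus: int = 1) -> float:
    """Theoretical vector peak in Gflop/s, excluding the scalar processor."""
    per_pipe_set = CLOCK_GHZ * FLOPS_PER_CYCLE_PER_SET  # 4 Gflop/s
    return per_pipe_set * PIPE_SETS_PER_CPU * n_cpus


def operands_per_cycle() -> int:
    """Number of 8-byte operands that fit in the quoted 64 B/cycle."""
    return BYTES_PER_CYCLE // OPERAND_BYTES


if __name__ == "__main__":
    print("per CPU        :", peak_gflops(1), "Gflop/s")
    print("SX-8B, 4 CPUs  :", peak_gflops(4), "Gflop/s")
    print("SX-8A, 8 CPUs  :", peak_gflops(8), "Gflop/s")
    print("operands/cycle :", operands_per_cycle())
```

The scalar processor's 4 Gflop/s is deliberately left out, in line with the convention used in this overview.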
Contrary to what one would expect from the naming, the SX-8B is the simpler
configuration of the two single-frame systems: it can be had with 1–4
processors but is in virtually all other respects equal to the larger SX-8A
that can house 4–8 processors. There is one difference connected to the
maximal amount of memory per frame: NEC now offers the interesting choice
between the usual DDR2-SDRAM and FCRAM (Fast Cycle RAM). The latter type of
memory can be a factor of 2–3 faster than the former.
However, because of the more complex structure of the memory, the density is
about two times lower. Hence, in the system parameters table, the entries
for FCRAM are about two times lower than for SDRAM. The lower bound for SDRAM
is the same in the SX-8A and SX-8B systems: 32 GB. For many of the very
memory-hungry applications that are usually run on vector-type systems, the
availability of FCRAM can be beneficial.
A single frame of the SX-8A models holds up to 8 CPUs. Internally the CPUs in
the frame are connected by a 1-stage crossbar with the same bandwidth as that
of a single-CPU system: 64 GB/s/port. The fully configured frame can
therefore attain a peak speed of 128 Gflop/s.
In addition, there are multi-frame models (SX-8xMy) where
x = 8,...,4096 is the total number of CPUs and
y = 2,...,512 is the number of frames coupling the single-frame
systems into a larger system. There are two ways to couple the SX-8
frames in a multi-frame configuration. NEC provides a full crossbar,
the so-called IXS crossbar, to connect the various frames together at a
speed of 8 GB/s for point-to-point unidirectional out-of-frame
communication (1024 GB/s bisection bandwidth for a maximum
configuration). A HiPPI interface is also available for inter-frame
communication at lower cost and speed. When the IXS
crossbar solution is chosen, the total multi-frame system is globally
addressable, turning the system into a NUMA system. However, for
performance reasons it is advised to use the system in distributed-memory
mode with MPI.
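The ranges for x and y in the SX-8xMy designation can be checked with a few lines of code. The sketch below is only an illustration: the function name is hypothetical, and the per-frame limits (at least one and at most 8 CPUs in every frame) are assumptions derived from the single-frame description, not a rule stated by NEC.

```python
# Hypothetical sanity check for an SX-8xMy configuration.
# x = total number of CPUs, y = number of frames (ranges taken from the text).

MAX_CPUS_PER_FRAME = 8  # assumption: a frame holds at most 8 CPUs, as in the SX-8A


def is_plausible_multiframe(x: int, y: int) -> bool:
    """Return True if x CPUs spread over y frames fits the quoted ranges."""
    if not (8 <= x <= 4096):
        return False
    if not (2 <= y <= 512):
        return False
    # Assumption: every frame holds at least 1 and at most 8 CPUs.
    return y <= x <= MAX_CPUS_PER_FRAME * y


if __name__ == "__main__":
    print(is_plausible_multiframe(4096, 512))  # maximum configuration
    print(is_plausible_multiframe(4096, 256))  # too many CPUs for 256 frames
```

Under these assumptions the maximum configuration of 4096 CPUs requires all 512 frames to be fully populated.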
For distributed computing there is an HPF compiler, and for message
passing an optimised MPI (MPI/SX) is available. In addition, OpenMP is
available for shared-memory parallelism.
Measured Performances:
The NEC SX-8 was announced in November 2004. Some very early results
for the HPC Challenge benchmarks can be found in [20] for a 6-processor
SX-8A located at the Institute of Laser Engineering in Osaka, Japan.
Aad van der Steen
Tue Mar 8 11:33:08 CET 2005