The adoption of clusters, collections of workstations/PCs connected by
a local network, has virtually exploded since the introduction of the
first Beowulf cluster in 1994. The attraction lies in the (potentially)
low cost of both hardware and software and in the control that
builders/users have over their system. The interest in
clusters can be seen, for instance, from the active IEEE Task Force on
Cluster Computing (TFCC), which regularly reviews
the current status of cluster computing
[45].
Books on how to build and maintain clusters have also greatly added to their
popularity (see, e.g., [41] and
[35]). As the cluster scene has become
relatively mature and an attractive market, large HPC vendors as well as
many start-up companies have entered the field and offer more or less
ready, out-of-the-box cluster solutions for those groups that do not want
to build their cluster from scratch.
The number of vendors that sell cluster configurations has become so
large that it is not sensible to include all these products in this
report. In addition, there is generally a large difference in the usage
of clusters and their more integrated counterparts that we discuss in
the following sections: clusters are mostly used for capability
computing, while the integrated machines are primarily used for
capacity computing. In the first mode of usage, the system is employed
for one or a few programs for which no alternative is readily available
in terms of computational capabilities. In the second mode, the system
is employed to the full, with most of its available cycles used by many,
often very demanding, applications and users. Traditionally, vendors of
large supercomputer systems have learned to provide for this last mode of
operation, as the precious resources of their systems had to be used as
effectively as possible. By contrast, Beowulf clusters are mostly
operated under the Linux operating system (a small minority use
Microsoft Windows), and these operating systems either lack the tools
needed to use a cluster well for capacity computing or offer tools that
are still relatively immature. However, as clusters on average become
both larger and more stable, there is a trend to use them also as
computational capacity servers. In [41], some of the aspects that are
necessary conditions for this kind of use, such as available cluster
management tools and batch systems, are examined.
The same study also assessed the performance on an application workload,
both on a RISC (Compaq Alpha) based configuration and on Intel Pentium III
based systems. An important, but not very surprising, conclusion was that
the speed of the network is very important in all but the most
compute-bound applications. Another notable observation was that using
compute nodes with more than one CPU may be attractive from the point of
view of compactness and (possibly) energy and cooling, but that
performance can be severely degraded because the CPUs have to draw on a
common node memory. In this case the memory bandwidth of the node does
not meet the demands of memory-intensive applications.
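The memory-contention effect can be made concrete with a simple bandwidth-sharing model. The sketch below is only an illustration under stated assumptions: a purely memory-bound kernel (for example a large triad loop), a node memory bandwidth that is shared evenly among its CPUs, and a per-CPU bandwidth ceiling. The names and the figures (3.2 GB/s per node, 2.4 GB/s per CPU) are hypothetical and are not taken from the study in [41].

```python
def per_cpu_bandwidth(node_bw_gbs: float, cpu_peak_bw_gbs: float, cpus: int) -> float:
    """Bandwidth (GB/s) one CPU actually gets when `cpus` CPUs share node memory.

    Hypothetical model: each CPU can draw at most cpu_peak_bw_gbs on its own,
    and all CPUs together cannot exceed node_bw_gbs.
    """
    return min(cpu_peak_bw_gbs, node_bw_gbs / cpus)


def kernel_time_s(bytes_per_cpu: float, node_bw_gbs: float,
                  cpu_peak_bw_gbs: float, cpus: int) -> float:
    """Time for each CPU to stream bytes_per_cpu bytes, assuming a purely memory-bound kernel."""
    bw_bytes_per_s = per_cpu_bandwidth(node_bw_gbs, cpu_peak_bw_gbs, cpus) * 1e9
    return bytes_per_cpu / bw_bytes_per_s


if __name__ == "__main__":
    # Hypothetical node: 3.2 GB/s shared memory path; one CPU alone could use 2.4 GB/s.
    NODE_BW, CPU_BW = 3.2, 2.4
    work = 1.2e9  # bytes streamed by each CPU
    t1 = kernel_time_s(work, NODE_BW, CPU_BW, 1)
    for p in (1, 2, 4):
        t = kernel_time_s(work, NODE_BW, CPU_BW, p)
        # p CPUs each finish `work` in time t; compare node throughput with one CPU.
        speedup = p * t1 / t
        print(f"{p} CPU(s): {t:.3f} s per CPU, node speedup {speedup:.2f}x")
```

With these assumed numbers, adding CPUs to the node never raises its throughput beyond 3.2/2.4 ≈ 1.33 times that of a single CPU, which is the kind of saturation the observation above describes for memory-intensive codes.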
Fortunately, there is nowadays a fair choice of communication networks
available for clusters. Of course, 100 Mb/s Ethernet or Gigabit Ethernet is
always possible, which is attractive for economic reasons, but has the drawback
of a high latency (≅ 100 µs). Alternatively, there are, for instance,
networks that operate from user space, like Myrinet
[26,27], Infiniband
[34] and SCI
[22]. The first two have maximum
bandwidths in the order of 200 MB/s and 850 MB/s, respectively, and a latency in
the range of 7–9 µs. SCI has a theoretical bandwidth of 400–500 MB/s
and a latency under 3 µs. SCI is more costly but is nevertheless employed
in some cluster configurations. The network speeds shown by Myrinet and,
certainly, QsNET and SCI are more or less on par with those of some
integrated parallel systems discussed later. So, apart possibly from
processor speed and the software provided by the vendors of DM-MIMD
supercomputers, the distinction between clusters and this class of machines
is becoming rather small and will undoubtedly decrease further in the
coming years.
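A common way to compare such networks is the linear latency-bandwidth model associated with Hockney, T(n) = t0 + n/r_inf, where t0 is the latency and r_inf the asymptotic bandwidth; the half-performance length n_1/2 = t0 · r_inf is the message size at which half of r_inf is reached. The sketch below plugs in the figures quoted above, taking midpoints where a range is given (8 µs for Myrinet and Infiniband, 450 MB/s and 3 µs for SCI). The 125 MB/s for Gigabit Ethernet (its nominal 1 Gb/s) is an assumption, as is the model itself; sustained values on real systems will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str
    latency_us: float     # t0: per-message start-up time in microseconds
    bandwidth_mbs: float  # r_inf: asymptotic bandwidth in MB/s (1 MB = 10**6 bytes)

    def transfer_time_us(self, nbytes: float) -> float:
        """Linear model T(n) = t0 + n / r_inf.

        Bytes divided by MB/s conveniently yields microseconds.
        """
        return self.latency_us + nbytes / self.bandwidth_mbs

    def effective_bandwidth_mbs(self, nbytes: float) -> float:
        """Bandwidth seen for a message of nbytes (bytes per microsecond == MB/s)."""
        return nbytes / self.transfer_time_us(nbytes)

    def half_performance_length(self) -> float:
        """n_1/2 = t0 * r_inf: message size in bytes that reaches half of r_inf."""
        return self.latency_us * self.bandwidth_mbs


# Figures from the text, midpoints where a range is quoted.
# The Gigabit Ethernet bandwidth (nominal 1 Gb/s = 125 MB/s) is an assumption.
NETWORKS = [
    Network("Gigabit Ethernet", 100.0, 125.0),
    Network("Myrinet", 8.0, 200.0),
    Network("Infiniband", 8.0, 850.0),
    Network("SCI", 3.0, 450.0),
]

if __name__ == "__main__":
    for net in NETWORKS:
        small = net.effective_bandwidth_mbs(1_000)      # 1 kB message
        large = net.effective_bandwidth_mbs(1_000_000)  # 1 MB message
        print(f"{net.name:17s} n_1/2 = {net.half_performance_length():7.0f} B  "
              f"1 kB: {small:6.1f} MB/s  1 MB: {large:6.1f} MB/s")
```

With these inputs a 1 kB message reaches under 10 MB/s on Gigabit Ethernet but roughly 190 MB/s on SCI, which illustrates why latency rather than peak bandwidth dominates for short messages.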
The best starting point for the state of the art in cluster computing
is the TFCC White Paper
[45] mentioned above. It gives
pointers to available products, both hardware and software, to open
questions, and to the focus of present research regarding these
questions.
Aad van der Steen
Tue Mar 8 10:28:22 CET 2005