Machines of this type are sometimes also known as
processor-array machines [18].
Because the processors of these machines operate in lock-step, i.e.,
all processors execute the same instruction at the same time (but on
different data items), no synchronisation between processors is
required. This greatly simplifies the design of such systems.
A control processor
issues the instructions that are to be executed by the processors in
the processor array. All currently available DM-SIMD machines use a
front-end processor that is connected to the control processor by a
data path. Operations that cannot be executed by the processor
array or by the control processor are off-loaded to the front-end
system. I/O, for instance, may be handled by the front-end system, by
the processor-array machine itself, or by both.
Figure 3 shows a generic model of a DM-SIMD machine from
which actual models will deviate to some degree.
Figure 3: A generic block diagram of a distributed memory SIMD
machine.
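The relationship between the control processor and the processor array
can be illustrated with a minimal Python sketch. It is a toy model only:
the names ProcessingElement, run_lockstep, double_a and add_one are
hypothetical, and the inner loop merely stands in for what the hardware
does simultaneously on all processors.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]
Instruction = Callable[[Memory], None]


class ProcessingElement:
    """Hypothetical processing element with its own private local memory."""

    def __init__(self, value: int) -> None:
        self.mem: Memory = {"a": value, "b": 0}


def run_lockstep(pes: List[ProcessingElement],
                 program: List[Instruction]) -> None:
    """Toy control processor: issue each instruction once to the whole array.

    Every element finishes the current instruction before the next one is
    issued, so no synchronisation between elements is needed.
    """
    for instruction in program:
        for pe in pes:  # conceptually simultaneous on real hardware
            instruction(pe.mem)


def double_a(mem: Memory) -> None:
    mem["b"] = mem["a"] * 2


def add_one(mem: Memory) -> None:
    mem["b"] = mem["b"] + 1


pes = [ProcessingElement(v) for v in range(4)]
run_lockstep(pes, [double_a, add_one])
result = [pe.mem["b"] for pe in pes]  # same instructions, different data
```

Each element ends up with a different value in "b" although all of them
executed exactly the same two instructions.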
Figure 3 might suggest that all processors in
such systems are connected in a 2-D grid and indeed, the
interconnection topology of this type of machine always includes the
2-D grid. As the opposing ends of each grid line are also always
connected, the topology is rather that of a torus. For several machines
this is not the only interconnection scheme: they might also be
connected in 3-D, diagonally, or in more complex structures.
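The wrap-around connections of the torus can be made concrete with a
short Python sketch. The function names and the (row, column) numbering
of the processors are assumptions made for illustration; the hop count
assumes messages travel only along grid lines, which need not match the
routing of any actual machine.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]


def torus_neighbours(row: int, col: int, rows: int, cols: int) -> Dict[str, Coord]:
    """Four nearest neighbours of a processor in a rows x cols torus.

    The modulo arithmetic connects the opposing ends of each grid line.
    """
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


def torus_hops(src: Coord, dst: Coord, rows: int, cols: int) -> int:
    """Minimum number of nearest-neighbour hops between two processors."""
    dr = abs(src[0] - dst[0])
    dc = abs(src[1] - dst[1])
    # In each dimension, take the shorter way around the ring.
    return min(dr, rows - dr) + min(dc, cols - dc)


corner = torus_neighbours(0, 0, 4, 4)
hops = torus_hops((0, 0), (3, 3), 4, 4)
```

In a 4 x 4 torus the corner processor (0, 0) has (3, 0) as its northern
neighbour, and the opposite corner (3, 3) is only 2 hops away instead of
the 6 hops needed in a plain grid without wrap-around.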
It is possible to exclude processors in the array from executing an
instruction on certain logical conditions, but this means that for the
time of this instruction these processors are idle (a direct
consequence of the SIMD type operation) which immediately lowers the
performance. Another factor that may adversely affect the speed occurs
when data required by processor i resides in the memory of
processor j (in fact, as this occurs for all processors at the
same time this effectively means that data will have to be permuted
across the processors). To access the data in processor j, the
data will have to be fetched by this processor and then sent through
the routing network to processor i. This may be fairly time
consuming. For both reasons mentioned, DM-SIMD machines are rather
specialised in their use when one wants to employ their full
parallelism. Generally, they perform excellently on digital signal and
image processing and on certain types of Monte Carlo simulations where
virtually no data exchange between processors is required and exactly
the same type of operations is done on massive datasets with a size
that can be made to fit comfortably in these machines.
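Both effects can be sketched in a few lines of Python. This is a toy
model under stated assumptions: masked_step and cyclic_shift are
hypothetical helpers, one list entry stands for the data of one
processor, and the two-way branch is just an example of a logical
condition.

```python
from typing import Callable, List


def masked_step(data: List[int], mask: List[bool],
                op: Callable[[int], int]) -> List[int]:
    """Apply op where the mask is set; masked-out processors sit idle."""
    return [op(x) if active else x for x, active in zip(data, mask)]


def cyclic_shift(data: List[int], k: int) -> List[int]:
    """Processor i receives the item held by processor (i + k) mod n.

    Because all processors fetch at the same time, this is a permutation
    of the data across the array.
    """
    n = len(data)
    return [data[(i + k) % n] for i in range(n)]


data = [-3, 5, -1, 8]
mask = [x < 0 for x in data]

# "if x < 0: x = -x, else: x = 2 * x" costs two instruction steps,
# one per branch, with complementary masks.
step1 = masked_step(data, mask, lambda x: -x)
step2 = masked_step(step1, [not m for m in mask], lambda x: 2 * x)

# Fraction of processor-steps that did useful work over the two steps.
busy = sum(mask) + sum(not m for m in mask)
utilisation = busy / (2 * len(data))

shifted = cyclic_shift(step2, 1)
```

With complementary masks every processor is active in exactly one of the
two steps, so the utilisation of the array over this branch is one half,
whatever the data are.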
The control processor as depicted in Figure 3 may
be more or less intelligent. It issues the instruction sequence that
will be executed by the processor array. In the worst case (that is,
with a less autonomous control processor), an instruction that is not
fit for execution on the processor array (e.g., a simple print
instruction) might be off-loaded to the front-end processor, which may
be much slower than execution on the control processor. With a more
autonomous control processor this can be avoided, thus saving
processing interrupts on both the front-end and the control processor.
Most DM-SIMD systems have the possibility to handle I/O independently
from the front-end processor. This is favourable not only because the
communication between the front-end and back-end systems is avoided:
the (specialised) I/O devices for the processor-array system are also
generally much more efficient in providing the necessary data directly
to the memory of the processor array. Especially for very
data-intensive applications like radar and image processing such I/O
systems are very important.
A feature that is peculiar to this type of machine is that the
processors sometimes are of a very simple bit-serial type, i.e., the
processors operate on the data items bitwise, irrespective of their
type. So, e.g., operations on integers are produced by software
routines on these simple bit-serial processors, which take at least as
many cycles as the operands are long. So, a 32-bit integer result will
be produced twice as fast as a 64-bit result. For floating-point
operations a similar situation holds, except that the number of cycles
required is a multiple of that needed for an integer operation. As the
number of processors in this type of system is mostly large (1024 or
more; the Quadrics Apemille was a notable exception, however), the
slower operation on floating-point numbers can often be compensated for
by their number, while the cost per processor is quite low as compared
to full floating-point processors. In some cases, however,
floating-point coprocessors are added to the processor array. Their
number is 8 to 16 times lower than that of the bit-serial processors
because of the cost argument. An advantage of bit-serial processors is
that they may operate on operands of any length. This is particularly
advantageous for random number generation (which often boils down to
logical manipulation of bits) and for signal processing because in both
cases operands of only 1 to 8 bits are abundant. As the execution time
for bit-serial machines is proportional to the length of the operands,
this may result in significant speedups.
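The proportionality between operand length and execution time can be
illustrated with a small Python sketch of bit-serial addition. The
function bit_serial_add is hypothetical and assumes one cycle per bit
position with a simple ripple carry; real bit-serial processors may need
more cycles per bit.

```python
from typing import Tuple


def bit_serial_add(a: int, b: int, width: int) -> Tuple[int, int]:
    """Add two unsigned width-bit integers one bit position per cycle.

    Returns the sum modulo 2**width and the number of cycles used.
    """
    result = 0
    carry = 0
    cycles = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry
        result |= (total & 1) << i  # sum bit for position i
        carry = total >> 1          # carry into position i + 1
        cycles += 1
    return result, cycles


sum8, cycles8 = bit_serial_add(5, 3, 8)
_, cycles32 = bit_serial_add(123456, 654321, 32)
_, cycles64 = bit_serial_add(123456, 654321, 64)
```

In this model a 64-bit addition takes exactly twice as many cycles as a
32-bit one, and an 8-bit operand needs only a quarter of the cycles of a
32-bit operand, which is where the speedup for short operands comes
from.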
Aad van der Steen
Tue Mar 8 10:08:27 CET 2005