This section contains the explanation of some often-used terms that
either are not explained in the text or, by contrast, are described
extensively and for which a short description may be convenient.
Architecture: The internal structure of a computer system or
a chip that determines its operational functionality and
performance.
Architectural class: Classification of computer systems according
to their architecture: e.g., distributed memory MIMD computer,
symmetric multi processor (SMP), etc. See this glossary and the
section on architecture for descriptions of the various classes.
ASCI: Accelerated Strategic Computing Initiative. A massive funding
project in the USA concerning research and production of
high-performance systems. The main motivation is said to be the
management of the USA nuclear stockpile by computational modeling
instead of actual testing. ASCI has greatly influenced the
development of high-performance systems in a single direction:
clusters of SMP systems.
Bank cycle time: The time needed by a (cache-)memory bank to
recover from a data access request to that bank. Within the bank
cycle time no other requests can be accepted.
Beowulf cluster: Cluster of PCs or workstations with a
private network to connect them. Initially the name was used for
do-it-yourself collections of PCs, mostly connected by Ethernet
and running Linux, as a cheap alternative to "integrated"
parallel machines. Presently, the definition is wider and includes
high-speed switched networks, fast RISC-based processors and
complete vendor-preconfigured rack-mounted systems with either
Linux or Windows as an operating system.
Bit-serial: The operation on data on a bit-by-bit basis
rather than on byte or 4/8-byte data entities in parallel.
Bit-serial operation is done in processor array machines, where
this mode is advantageous for signal and image processing.
Cache --- data, instruction: Small, fast memory close to the
CPU that can hold a part of the data or instructions to be
processed. The primary or level 1 caches are virtually always
located on the same chip as the CPU and are divided into a cache
for instructions and one for data. A secondary or level 2 cache
is mostly located off-chip and holds both data and instructions.
Caches are put into the system to hide the large latency that
occurs when data have to be fetched from memory. By loading data
and/or instructions that are likely to be needed into the caches,
this latency can be significantly reduced.
Capability computing: A type of large-scale computing in
which one wants to accommodate very large and time-consuming
computing tasks. This requires that parallel machines or clusters
are managed with the highest priority for this type of computing,
possibly with the consequence that the computing resources in the
system are not always used with the greatest efficiency.
Capacity computing: A type of large-scale computing in which
one wants to use the system (cluster) with the highest possible
throughput capacity, using the machine resources as efficiently as
possible. This may have adverse effects on the performance of
individual computing tasks while optimising the overall usage of
the system.
ccNUMA: Cache Coherent Non-Uniform Memory Access. Machines
that support this type of memory access have a physically
distributed memory but logically it is shared. Because of the
physical difference of the location of the data items, a data
request may take a varying amount of time depending on the
location of the data. As both the memory parts and the caches in
such systems are distributed, a mechanism is necessary to keep the
data consistent system-wide. There are various techniques to
enforce this (directory memory, snoopy bus protocol). When one of
these techniques is implemented the system is said to be cache
coherent.
Clock cycle: Fundamental time unit of a computer. Every
operation executed by the computer takes at least one and
possibly multiple cycles. Typically, the clock cycle is now
in the order of one to a few nanoseconds.
Clock frequency: Reciprocal of the clock cycle: the number
of cycles per second expressed in Hertz (Hz). Typical clock
frequencies nowadays are 400 MHz to 1 GHz.
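The reciprocal relation between clock cycle and clock frequency can be
checked with a few lines of Python. This is only an illustrative sketch;
the function name cycle_time_ns is a hypothetical helper, not something
defined elsewhere in this report.

```python
def cycle_time_ns(frequency_hz: float) -> float:
    """Return the clock cycle time in nanoseconds for a frequency in Hz."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    # One second is 1e9 ns, so cycle time (ns) = 1e9 / frequency (Hz).
    return 1e9 / frequency_hz


# The two frequencies quoted in the entry above: 400 MHz and 1 GHz.
for freq in (400e6, 1e9):
    print(f"{freq / 1e6:.0f} MHz -> {cycle_time_ns(freq):.2f} ns per cycle")
```

Both values fall in the range of one to a few nanoseconds quoted under
"Clock cycle".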
Clos network: A logarithmic network in which the nodes are
attached to switches that form a spine that ultimately
connects all nodes.
Communication latency: Time overhead occurring when a message is sent
over a communication network from one processor to another. Typically the
latencies are in the order of a few µs for specially designed
networks, like Infiniband or Myrinet, to about 100 µs for (Gbit)
Ethernet.
Control processor: The processor in a processor array machine
that issues the instructions to be executed by all the processors
in the processor array. Alternatively, the control processor may
perform tasks in which the processors in the array are not
involved, e.g., I/O operations or serial operations.
CRC: Cyclic Redundancy Check. A type of error detection/correction
method based on treating a data item as a large binary number. This
number is divided by another fixed binary number and the remainder is
regarded as a checksum from which the correctness and sometimes the
(type of) error can be recovered. CRC error detection is, for instance,
used in SCI networks.
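The division-and-remainder idea can be sketched in Python as follows.
The division is carried out modulo 2 (subtraction is XOR), as is usual
for CRCs. The 8-bit divisor 0x07 is an assumption chosen only to keep
the example small; it is not claimed to be the one used in SCI
networks, and crc_remainder is a hypothetical helper name.

```python
def crc_remainder(data: bytes, poly: int, width: int) -> int:
    """Remainder of `data` (followed by `width` zero bits) divided, modulo 2,
    by the generator `poly`. `width` must be a multiple of 8 in this sketch."""
    top_bit = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        reg ^= byte << (width - 8)      # bring the next 8 message bits in
        for _ in range(8):
            if reg & top_bit:           # leading bit set: subtract (XOR) divisor
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    return reg


GENERATOR = 0x07   # assumed divisor x^8 + x^2 + x + 1 (leading bit implicit)
WIDTH = 8

message = b"supercomputer"
checksum = crc_remainder(message, GENERATOR, WIDTH)
frame = message + bytes([checksum])     # sender appends the checksum

# Receiver side: an intact frame leaves remainder 0, a flipped bit does not.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
print("intact frame ok:   ", crc_remainder(frame, GENERATOR, WIDTH) == 0)
print("corruption detected:", crc_remainder(corrupted, GENERATOR, WIDTH) != 0)
```

Production CRCs often add an initial register value, bit reflection and
a final XOR; these are left out here to keep the link with plain
division visible.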
Crossbar (multistage): A network in which all input ports are
directly connected to all output ports without interference from
messages from other ports. In a one-stage crossbar this has the
effect that for instance all memory modules in a computer system
are directly coupled to all CPUs. This is often the case in
multi-CPU vector systems. In multistage crossbar networks the
output ports of one crossbar module are coupled with the input
ports of other crossbar modules. In this way one is able to build
networks that grow with logarithmic complexity, thus reducing the
cost of a large network.
Distributed Memory (DM): Architectural class of machines in
which the memory of the system is distributed over the nodes in
the system. Access to the data in the system has to be done via
an interconnection network that connects the nodes and may be
either explicit via message passing or implicit (either using HPF
or automatically in a ccNUMA system).
Dual core chip: A chip that contains two CPUs and (possibly
common) caches. Due to the progression of the integration level
more devices can be fitted on a chip. In fact, IBM already makes
a dual core chip, the POWER4, and other vendors may follow in the
near future.
EPIC: Explicitly Parallel Instruction Computing. This term was
coined by Intel for its IA-64 chips and the Instruction Set that
is defined for them. EPIC can be seen as Very Large Instruction
Word computing with a few enhancements. The gist of it is that no
dynamic instruction scheduling is performed as is done in RISC
processors but rather that instruction scheduling and speculative
execution of code are determined beforehand in the compilation
stage of a program. This simplifies the chip design while
potentially many instructions can be executed in parallel.
Fat tree: A network that has the structure of a binary (or quad)
tree but that is modified such that near the root the available
bandwidth is higher than near the leaves. This stems from the fact
that often a root processor has to gather or broadcast data to
all other processors and without this modification contention
would occur near the root.
FPGA: Field Programmable Gate Array. This
is an array of logic gates that can be hardware-programmed to
fulfill user-specified tasks. In this way one can devise special
purpose functional units that may be very efficient for this
limited task. As FPGAs can be reconfigured dynamically, albeit only
100 to 1,000 times per second, it is theoretically possible to
optimise them for more complex special tasks at speeds that
are higher than what can be achieved with general purpose
processors.
Functional unit: Unit in a CPU that is responsible for the
execution of a predefined function, e.g., the loading of data in
the primary cache or executing a floating-point addition.
Grid --- 2-D, 3-D: A network structure where the nodes are
connected in a 2-D or 3-D grid layout. In virtually all cases the
end points of the grid are again connected to the starting points
thus forming a 2-D or 3-D torus.
HBA: Host Bus Adaptor. The part of an
external network that constitutes the interface between the network
itself and the PCI bus of the compute node. HBAs usually carry
a good amount of processing intelligence themselves for initiating
communication, buffering, checking for correctness, etc. HBAs tend
to have different names in different networks: HCA or TCA for Infiniband,
LANai for Myrinet, ELAN for QsNet, etc.
HPF: High Performance Fortran. A compiler and run time system
that enable Fortran programs to run on a distributed memory
system as if on a shared memory system. Data partitioning, processor
layout, etc. are specified as comment directives, which makes
it possible to run the program serially as well. The HPF
implementations currently available commercially allow only for
simple partitioning schemes and for all processors executing exactly
the same code at the same time (on different data, the so-called
Single Program Multiple Data (SPMD) mode).
Hypercube: A network with logarithmic complexity which has
the structure of a generalised cube: to obtain a hypercube of the
next dimension one duplicates the structure and connects each
vertex of the copy with the corresponding vertex of the original.
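One common way to express the doubling step is to copy the
d-dimensional cube and link every vertex to its counterpart in the
copy. The Python sketch below builds the edge set this way; the
function name and the numbering of the nodes (0 .. 2^d - 1) are
assumptions made for illustration.

```python
def hypercube_edges(dimension: int) -> set:
    """Edge set of a hypercube, built one dimension at a time."""
    edges = set()                      # a 0-dimensional cube: one node, no edges
    for d in range(dimension):
        offset = 1 << d                # number of nodes in the d-dimensional cube
        # Copy the existing cube, renumbering its nodes by adding `offset`.
        copy = {(a + offset, b + offset) for a, b in edges}
        # Link every original vertex to its counterpart in the copy.
        links = {(v, v + offset) for v in range(offset)}
        edges = edges | copy | links
    return edges


cube = hypercube_edges(3)
print(len(cube), "edges in a 3-D hypercube")
# With this numbering, neighbouring nodes differ in exactly one bit.
print(all(bin(a ^ b).count("1") == 1 for a, b in cube))
```

Each node has as many links as the cube has dimensions, and any two of
the 2^d nodes are at most d links apart; this is the sense in which
the network grows with logarithmic complexity.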
Instruction Set Architecture: The set of instructions that
a CPU is designed to execute. The Instruction Set Architecture
(ISA) represents the repertoire of instructions that the
designers determined to be adequate for a certain CPU. Note that
CPUs of different make may have the same ISA. For instance, the
AMD processors (purposely) implement the Intel IA-32 ISA on a
processor with a different structure.
Memory bank: Part of (cache) memory that is addressed
consecutively in the total set of memory banks, i.e., when data
item a(n) is stored in bank b, data item
a(n+1) is stored in bank b+1. (Cache) memory is
divided into banks to avoid the effects of the bank cycle time (see
above). When data is stored or retrieved consecutively, each bank
has enough time to recover before the next request for that bank
arrives.
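The effect of banking on consecutive and strided access can be
illustrated with a small Python model. It assumes that item a(n) lives
in bank n modulo the number of banks, that at most one request is
issued per cycle, and that a busy bank stalls the request stream; the
bank count and bank cycle time used are arbitrary illustrative values.

```python
def total_cycles(n_accesses: int, stride: int, n_banks: int,
                 bank_cycle_time: int) -> int:
    """Cycles needed to issue `n_accesses` requests with the given stride."""
    ready_at = [0] * n_banks           # first cycle each bank accepts a request
    clock = 0
    for i in range(n_accesses):
        bank = (i * stride) % n_banks  # bank holding the requested item
        clock = max(clock, ready_at[bank])        # stall while bank recovers
        ready_at[bank] = clock + bank_cycle_time  # bank busy for its cycle time
        clock += 1                     # next request one cycle later at best
    return clock


# 64 accesses, 8 banks, bank cycle time of 4 cycles (assumed values).
print("stride 1:", total_cycles(64, 1, 8, 4), "cycles")
print("stride 8:", total_cycles(64, 8, 8, 4), "cycles")
```

With stride 1 every bank has recovered before it is addressed again, so
no stalls occur. With a stride equal to the number of banks every
request hits the same bank and each one waits out the full bank cycle
time.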
Message passing: Style of parallel programming for distributed
memory systems in which non-local data that is required must be
transported explicitly to the processor(s) that need(s)
it by appropriate send and receive messages.
MPI: A message passing library, Message Passing Interface,
that implements the message passing style of programming.
Presently MPI is the de facto standard for this kind of
programming.
OpenMP: A parallel programming model for shared memory
systems and SMPs. The
parallelisation is controlled by comment directives (in Fortran)
or pragmas (in C and C++), so that the same programs can also be
run unmodified on serial machines.
PCI bus: Bus on a PC node, typically used for I/O, but also
to connect nodes with a communication network. The bandwidth
varies with the type from 110 to 480 MB/s. Newer upgraded versions,
PCI-X and PCI Express, are (becoming) available presently.
Pipelining: Segmenting a functional unit such that it can
accept new operands every cycle while the total execution of the
instruction may take many cycles. The pipeline construction works
like a conveyor belt, accepting units until the pipeline is filled
and then producing results every cycle.
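The gain from pipelining can be estimated with a simple cycle count. The
sketch below assumes an idealised pipeline in which every stage takes
exactly one cycle and no stalls occur; the function names and the
5-stage depth are illustrative assumptions.

```python
def pipelined_cycles(n_operations: int, stages: int) -> int:
    """Cycles for a filled-then-streaming pipeline: the first result appears
    after `stages` cycles, every further result one cycle later."""
    if n_operations == 0:
        return 0
    return stages + n_operations - 1


def unpipelined_cycles(n_operations: int, stages: int) -> int:
    """Cycles when each operation must finish before the next one starts."""
    return n_operations * stages


n, depth = 100, 5   # assumed: 100 operations on a 5-stage functional unit
print("pipelined:  ", pipelined_cycles(n, depth), "cycles")
print("unpipelined:", unpipelined_cycles(n, depth), "cycles")
```

For long operand streams the fill time becomes negligible and the
speedup approaches the number of stages.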
Processor array: System in which an array (mostly a 2-D grid)
of simple processors executes its program instructions in
lock-step under the control of a Control Processor.
PVM: Another message passing library that has been widely
used. It was originally developed to run on collections of
workstations and it can dynamically spawn or delete processes
running a task. PVM has now largely been replaced by MPI.
Register file: The set of registers in a CPU that are
independent targets for the code to be executed possibly
complemented with registers that hold constants like 0/1,
registers for renaming intermediary results, and in some cases a
separate register stack to hold function arguments and routine
return addresses.
RISC: Reduced Instruction Set Computer. A CPU with an
instruction set that is simpler than that of the earlier Complex
Instruction Set Computers (CISCs). The instruction set was reduced
to simple instructions that ideally should execute in one cycle.
Shared Memory (SM): Memory configuration of a computer in
which all processors have direct access to all the memory in the
system. Because of technological limitations on shared bandwidth
generally not more than about 16 processors share a common
memory.
SMP: Symmetric Multi-Processing. This term is often used for
compute nodes with shared memory that are part of a larger system
and where this collection of nodes forms the total system. The
nodes may be organised as a ccNUMA system or as a distributed
memory system of which the nodes can be programmed using OpenMP
while inter-node communication should be done by message passing.
TLB: Translation Look-aside Buffer. A specialised cache that
holds a table of physical addresses as generated from the virtual
addresses used in the program code.
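The role of a TLB as a small cache in front of the full page table can
be sketched in Python. The page size, the least-recently-used
replacement policy and the class and attribute names are all
assumptions made for this illustration; real TLBs are hardware
structures with their own sizes and policies.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Tiny software model of a TLB with least-recently-used replacement."""

    def __init__(self, capacity: int, page_table: dict):
        self.capacity = capacity
        self.page_table = page_table     # full map: virtual page -> physical frame
        self.entries = OrderedDict()     # cached subset of the page table
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)        # mark as most recently used
        else:
            self.misses += 1                      # slow path: consult page table
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[page] = self.page_table[page]
        return self.entries[page] * PAGE_SIZE + offset


tlb = TLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
for address in (10, PAGE_SIZE + 5, 20):
    print(address, "->", tlb.translate(address))
print("hits:", tlb.hits, "misses:", tlb.misses)
```

The third access falls in a page that was translated before, so it is
served from the cached entry without consulting the page table.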
Torus: Structure that results when the end points of a grid
are wrapped around to connect to the starting points of that
grid. This configuration is often used in the interconnection
networks of parallel machines either with a 2-D grid or with 3-D
grid.
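The wrap-around of a torus amounts to computing neighbour coordinates
modulo the grid size in each dimension. A minimal Python sketch, with a
hypothetical function name and an arbitrary 4 x 4 example grid:

```python
def torus_neighbours(coord: tuple, shape: tuple) -> list:
    """Neighbours of `coord` in a torus with `shape` nodes per dimension."""
    neighbours = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            node = list(coord)
            # The modulo wraps the end points around to the starting points.
            node[axis] = (coord[axis] + step) % size
            neighbours.append(tuple(node))
    return neighbours


# Corner node of a 4 x 4 torus: it still has four neighbours.
print(torus_neighbours((0, 0), (4, 4)))
# A node in a 4 x 4 x 4 torus has six neighbours.
print(len(torus_neighbours((3, 3, 3), (4, 4, 4))))
```

Without the modulo operation the same code would describe a plain grid,
in which nodes on the boundary have fewer neighbours.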
Vector unit (pipe): A pipelined functional unit that is
fed with operands from a vector register and will produce a
result every cycle (after filling the pipeline) for the complete
contents of the vector register.
VLIW processing: Very Large Instruction Word processing. The
use of large instruction words to keep many functional units busy
in parallel. The scheduling of instructions is done statically by
the compiler and, as such, requires high quality code generation
by that compiler. VLIW processing has been revived in the IA-64
chip architecture, there called EPIC (see above).
Aad van der Steen
Tue Nov 4 11:52:15 CET 2003