Parallel Computer Architectures
OVERVIEW OF PARALLEL COMPUTER ARCHITECTURES
Published in CORE, Jan/Feb, 1994
Developments in computer technology have been very rapid since the beginning of the electronic era. The computer industry has experienced four generations of development, physically marked by rapidly changing building blocks: from relays and vacuum tubes to discrete diodes and transistors, to small- and medium-scale integrated (SSI/MSI) circuits, and on to large- and very large-scale integrated (LSI/VLSI) devices. Improvements in circuit technology, communication methods, and their cost have led to the introduction of powerful computers which use parallelism at various stages. Low-level parallelism has become a standard feature of most modern computers. Hence a precise definition is needed so that the term parallel computer architecture includes only modern parallel architectures and not all computers.
The 1980s heralded the introduction of a wide variety of parallel architectures that complement and extend the major approaches developed in the 1960s and 1970s. In order to place parallel architectures in a coherent framework and to comprehend their relationships with one another, an orderly schema for defining them is necessary. This article is primarily concerned with architecture rather than with specific parallel machines.
TERMINOLOGIES IN PARALLEL ARCHITECTURES
Most modern computers involve some degree of parallelism; some, for example, allow parallel execution of CPU and I/O instructions. Because of their almost universal use in modern computers, such forms of parallelism are of limited value in specifying a parallel architecture taxonomy. A definition, and a consequent taxonomy, of modern parallel architecture must therefore satisfy the following criteria.
Without a strict standard for parallel computers, virtually all modern computers would be counted as parallel processors. The architectural features described below are not, on their own, enough to define a machine as a parallel architecture.
Inclusion of the above features enhances the performance of a computer system significantly; nevertheless, they do not make an architecture parallel.
Flynn's classification of parallel processors is based on the number of simultaneous instruction and data streams seen by the processor during program execution. If mI and mD are the minimum numbers of instruction and data streams, respectively, being processed with the processor operating at its maximum capacity, then the multiplicities of mI and mD yield the following four categories:
SISD (single instruction, single data stream): mI = mD = 1. This category defines conventional serial computers. Instructions are executed sequentially but may be overlapped (pipelined) in their execution stages. An SISD computer may have more than one functional unit, with all of the functional units under the supervision of one control unit.
MISD (multiple instruction, single data stream): mI > 1, mD = 1. This category involves multiple processors applying different instructions to a single datum. Few parallel processors fit into this category, which has been challenged as impractical; no real MISD computer exists.
SIMD (single instruction, multiple data stream): mI = 1, mD > 1. This category includes machines with a single control unit and multiple execution units, such as array processors: multiple processing units are supervised by the same control unit.
MIMD (multiple instruction, multiple data stream): mI > 1, mD > 1. This category covers most multiprocessor systems and multiple-computer systems. An intrinsic MIMD computer implies interactions among its n processors because all memory streams are derived from the same data space shared by all processors.
Flynn's taxonomy alone is not sufficient for classifying all modern parallel processors. Pipelined vector processors are an example of a class of parallel processors that does not fit well into any of the above categories.
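The four categories above reduce to a simple mapping from the stream multiplicities mI and mD to Flynn's labels; the following sketch (a hypothetical helper, not from the article) makes that mapping explicit:

```python
# Hypothetical helper mapping Flynn's stream multiplicities to his
# four categories: mI instruction streams, mD data streams.

def flynn_category(mI, mD):
    if mI == 1 and mD == 1:
        return "SISD"   # conventional serial computers
    if mI == 1 and mD > 1:
        return "SIMD"   # e.g. array processors
    if mI > 1 and mD == 1:
        return "MISD"   # challenged as impractical; no real machine
    return "MIMD"       # multiprocessors and multiple-computer systems
```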
DEFINITION AND TAXONOMY
A satisfactory taxonomy for parallel architecture requires a definition of parallel architecture. The definition should include the computers that are not covered by Flynn's scheme and exclude computers employing low-level parallelism.
Based upon the definition proposed above, an informal taxonomy using high-level categories to delineate parallel architecture principles can be derived. This taxonomy can be illustrated as follows:
Synchronous architectures are those in which concurrent operations are performed in lockstep through central control units, global clocks, or vector unit controllers.
Pipelined Vector Processors
Vector computations often involve processing large arrays of data. The main feature of this type of architecture is that a vector is fed into a "pipeline" of processors, so that different elements of the vector are processed by different processors at the same time. The CRAY-1, FUJITSU VP-200, and CDC-205 are a few pipelined vector processors.
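To make the overlap concrete, here is a small sketch (a hypothetical model, not a description of any of the machines above) of a k-stage arithmetic pipeline: once the pipeline is full, one result emerges every cycle, so n elements take n + k - 1 cycles instead of n * k.

```python
# Hypothetical model of a k-stage pipeline processing a vector:
# element i enters at cycle i + 1 and emerges k cycles later,
# so the total time is n + k - 1 cycles rather than n * k.

def pipeline_cycles(n_elements, n_stages):
    """Cycles needed to push n_elements through an n_stages pipeline."""
    return n_elements + n_stages - 1

def pipelined_add(xs, ys, n_stages=3):
    """Element-wise vector add, tagged with the cycle each result emerges."""
    return [(i + n_stages, x + y) for i, (x, y) in enumerate(zip(xs, ys))]
```

For a 64-element vector and a 3-stage adder, `pipeline_cycles(64, 3)` gives 66 cycles, against 192 for the unpipelined case.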
SIMD architectures are characterized by a central control unit, multiple processors, and an interconnection network (IN) for either processor-to-processor or processor-to-memory communication. The control unit broadcasts an instruction, and all the processors act on the instruction using locally available data. Processors send the results of their calculations, which are used as operands in other processors, via the interconnection network.
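The lockstep behaviour can be sketched in a few lines (a toy model, not from the article): one stream of instructions is broadcast, and every processing element applies each instruction to its own local datum.

```python
# Toy model of SIMD execution: a single control unit broadcasts each
# instruction in order, and every processing element (one list slot per
# PE) applies it in lockstep to its locally held data.

def simd_run(program, lanes):
    """`program`: instructions as one-argument functions; `lanes`: per-PE data."""
    for instruction in program:                    # broadcast by control unit
        lanes = [instruction(x) for x in lanes]    # all PEs act in lockstep
    return lanes
```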
Processor array architecture
The main feature of processor array architectures is that they can accommodate large word-size operands, which makes them suitable for scientific applications. Operands are usually floating-point values ranging from 32 to 64 bits. The interconnection network varies from one machine to another, but the most popular ones are the mesh and the crossbar. The University of Illinois's ILLIAC-IV is based on processor array architecture.
Associative memory processor architecture
Array processors designed with associative memories are called associative memory processors. An associative machine has the following capabilities:
i) Stored data items are content-addressable, and
Bell Laboratories developed PEPE based on this architecture.
Systolic architecture was proposed by H.T. Kung of Carnegie-Mellon University to address the problems of special-purpose systems that must balance intensive computation with demanding I/O bandwidths. Systolic architectures are in fact pipelined processors. They are typically formed by interconnecting a set of identical cells (processing elements) in a highly regular, array-like manner. Data flows synchronously from cell to cell, with each cell performing a small step in the overall operation of the array. The data is not fully processed until it emerges from the boundary cells of the array.
One of the main advantages of this type of architecture is that a higher degree of parallelism is obtained while reducing demands on the memory/I/O interface.
These systolic organizations are best suited to algorithm-specific architectures, but programmable systolic architectures are also widely used (the Carnegie Mellon Warp and Saxpy's Matrix-1).
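As an illustration of the systolic principle, here is a sketch (a hypothetical cycle-by-cycle simulation, not a description of any machine named above) of an output-stationary systolic array multiplying two n x n matrices: operands enter at the boundary cells in skewed order, move one cell per cycle, and each cell performs one multiply-add per cycle on whatever passes through it.

```python
# Hypothetical cycle-by-cycle simulation of an n x n output-stationary
# systolic array: cell (i, j) accumulates C[i][j]; row i of A enters
# from the left delayed by i cycles, column j of B enters from the top
# delayed by j cycles, and every value moves one cell per cycle.

def systolic_matmul(A, B):
    n = len(A)                            # assume square n x n matrices
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # A value held in each cell
    b_reg = [[0] * n for _ in range(n)]   # B value held in each cell
    for t in range(3 * n - 2):            # cycles until the array drains
        # scan cells bottom-right to top-left so neighbour registers
        # still hold last cycle's values when we read them
        for i in range(n - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                a_in = a_reg[i][j - 1] if j > 0 else \
                    (A[i][t - i] if 0 <= t - i < n else 0)
                b_in = b_reg[i - 1][j] if i > 0 else \
                    (B[t - j][j] if 0 <= t - j < n else 0)
                C[i][j] += a_in * b_in    # one multiply-add per cell per cycle
                a_reg[i][j] = a_in        # pass A rightward next cycle
                b_reg[i][j] = b_in        # pass B downward next cycle
    return C
```

Note that no cell ever touches memory during the computation: operands enter only at the boundary, which is exactly the reduced memory/I/O traffic described above.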
Most multiprocessor systems and multiple-computer systems can be classified in this category. An intrinsic MIMD computer implies interactions among its n processors because all memory streams are derived from the same data space shared by all processors. Each process has its own locus of control, and hence MIMD architectures are asynchronous computers, characterized by decentralized hardware control. Processes executing on different processors achieve synchronization by means of shared variables or message passing.
The two main reasons for having multiprocessor systems are i) the higher level of parallelism supported and ii) the cost-effectiveness of an n-processor system over n single-processor systems. The high-level parallelism supported can be exploited using an important class of algorithms called divide-and-conquer algorithms.
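The shape of such an algorithm can be sketched as follows (a hypothetical illustration, not from the article; Python threads stand in for the processors, purely to show the structure of splitting, solving concurrently, and combining):

```python
# Hypothetical divide-and-conquer sum: split the array in half, sum the
# halves concurrently (threads standing in for processors), combine.
import threading

def parallel_sum(data, depth=2):
    """Sum `data`, splitting the work across threads `depth` levels deep."""
    if depth == 0 or len(data) < 2:
        return sum(data)                 # small enough: solve sequentially
    mid = len(data) // 2
    results = [0, 0]
    def work(idx, part):
        results[idx] = parallel_sum(part, depth - 1)
    left = threading.Thread(target=work, args=(0, data[:mid]))
    left.start()                         # one "processor" takes the left half
    results[1] = parallel_sum(data[mid:], depth - 1)  # this one takes the right
    left.join()
    return results[0] + results[1]       # combine the partial results
```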
Two different sets of architectural models for a multiprocessor are described below.
Distributed Memory Architectures:
In this type of architecture, processing nodes, each comprising an autonomous processor and its local memory, are connected by an interconnection network. Data sharing and process synchronization are done explicitly by message passing.
Ring Topology Architectures:
Ring topologies are most appropriate for a small number of processors executing algorithms not dominated by data communication. A ring provides two distinct paths between every pair of nodes, so the failure of any link in the main communication loop can be tolerated.
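The two distinct paths are simply the two directions around the loop; a small sketch (hypothetical, not from the article) computes both hop counts:

```python
# Hypothetical sketch of the two distinct paths an n-node ring provides
# between any pair of nodes: clockwise and counter-clockwise, so a
# single failed link can always be routed around.

def ring_paths(src, dst, n):
    """Hop counts of the two directions around an n-node ring."""
    clockwise = (dst - src) % n
    counter_clockwise = (src - dst) % n
    return clockwise, counter_clockwise
```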
Mesh Topology Architectures:
The topological correspondence between meshes and matrix-oriented algorithms encourages mesh-based architectures; the systolic multiplier is one example.
Tree Topology Architectures:
Tree topologies are suitable for divide-and-conquer algorithms. They are characterized by the interconnection of two or more 'children' PEs (processing elements) to a higher 'parent' PE. Neighboring processors can communicate fairly rapidly, but communication between non-neighboring processors is slower, requiring intermediate processors to act as store-and-forward message transfer stations. To reduce the maximum distance between two nodes, additional links may be used to connect all nodes at the same level.
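The store-and-forward cost can be sketched as follows (a hypothetical illustration, not from the article): with PEs numbered level by level from the root, a message climbs toward the nearest common ancestor and descends again, one hop per intermediate PE.

```python
# Hypothetical sketch of communication distance in a binary tree
# topology: PEs numbered 0 (root), 1, 2, ... level by level; a message
# between non-neighbouring PEs is forwarded hop by hop via the nearest
# common ancestor.

def parent(i):
    return (i - 1) // 2

def tree_distance(a, b):
    hops = 0
    while a != b:           # walk the deeper-numbered node up one level
        if a > b:
            a = parent(a)
        else:
            b = parent(b)
        hops += 1
    return hops
```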
Hypercube Topology Architectures:
A hypercube topology uses N = 2^n processors arranged in an n-dimensional cube, where each node has n links to adjacent nodes. Hypercubes have been designed in an effort to support the requirements of three-dimensional scientific applications. The Ametek Series 2010 and the Intel Personal SuperComputer are examples of this architecture.
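What makes the hypercube attractive can be sketched in a few lines (a hypothetical illustration, not from the article): node labels are n-bit numbers, each node's n neighbours are reached by flipping one bit, and the routing distance between two nodes is the number of bits in which their labels differ.

```python
# Hypothetical sketch of hypercube addressing: nodes carry n-bit labels,
# neighbours differ in exactly one bit, and the minimum hop count
# between two nodes is the Hamming distance of their labels.

def neighbours(node, n):
    """The n nodes adjacent to `node` in an n-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(n)]

def distance(a, b):
    """Minimum number of hops between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")
```

So even with N = 2^n nodes, no message ever needs more than n hops.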
Reconfigurable Topology Architectures:
These architectures use programmable switches to provide a dynamic interconnection network reconfigurable under program control. A single architecture can in this way serve as many special-purpose architectures that efficiently support the communication pattern requirements of an algorithm.
Shared Memory Architectures:
Shared memory architectures provide a global shared memory addressable by each of the processors. Each processor's local memory serves as a cache, so copies of shared data may become inconsistent; this cache incoherence requires various hardware or software solutions. The processors may be connected to the shared memory in various ways.
Time-shared Bus:
Time-shared buses offer a simple way to access the shared memory. A single bus can serve a moderate number of processors.
Crossbar Interconnection:
This type of interconnection uses n x m switches to interconnect n processors and m memory modules. Hardware complexity (wiring) quickly becomes prohibitive as m and n increase, thus limiting the number of processors. The ALLIANT FX/8 uses a crossbar interconnection.
Multistage Interconnection Networks:
Multiple stages or banks of switches can be used to provide dynamic interconnections, striking a trade-off between cost and performance.
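One classic example of such a network is the omega network; the following sketch (hypothetical, not from the article) traces a message through its log N shuffle-exchange stages, each 2x2 switch steered by one bit of the destination address:

```python
# Hypothetical sketch of routing through an omega (multistage
# shuffle-exchange) network with 2^n_bits inputs: each stage performs a
# perfect shuffle (rotate the line label left by one bit), then a 2x2
# switch sets the label's low bit to the next destination-address bit.

def omega_path(src, dst, n_bits):
    """Sequence of line labels a message visits from src to dst."""
    mask = (1 << n_bits) - 1
    label, path = src, [src]
    for stage in range(n_bits):
        label = ((label << 1) | (label >> (n_bits - 1))) & mask  # shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1          # next dst bit
        label = (label & ~1) | bit                       # switch exchange
        path.append(label)
    return path
```

After n_bits stages every source bit has been rotated out and replaced by a destination bit, so the message always arrives at `dst` in log N switch delays.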
MIMD BASED ARCHITECTURAL PARADIGMS
Most of the architectures supporting parallelism share the common MIMD principles of asynchronous operation and concurrent manipulation of multiple instruction and data streams.
According to Flynn, one of the four classifications of parallel processing is the multiple instruction, multiple data stream class, known as MIMD. With the advent of more sophisticated architectures, however, a certain class has emerged which cannot strictly be termed MIMD although its basic structures are. This fact can be highlighted by taking four such cases and putting them in a category broadly defined as MIMD-based architectural paradigms.
MIMD/SIMD hybrid architectures:
These architectures are distinguished by the use of SIMD (single instruction, multiple data stream) control over certain portions of an MIMD-based architecture. SIMD is a distinct classification in itself according to Flynn; hence its use within another architecture, MIMD, results in a hybrid important enough to be called a paradigm. DADO and TRAC can be classified as hybrid parallel computers.
Dataflow architectures:
The principal characteristic of this type of architecture is that an instruction is enabled for execution only when all of its operands are available. Thus the sequence of executed instructions is based on data dependencies, in contrast to the "normal" style of execution, where an instruction is first decoded and its required operands are fetched as they are needed. Hence the operands, which are after all data, play the predominant role in this architecture. One practically implemented computer based on this architecture is the Manchester Dataflow computer.
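The firing rule can be sketched as a small interpreter (hypothetical, not a model of the Manchester machine): instructions fire as soon as their operands exist, so execution order is dictated purely by data dependencies.

```python
# Hypothetical sketch of the dataflow firing rule: an instruction fires
# as soon as all of its operands are available, so execution order is
# driven by data dependencies rather than by program order.
from operator import add, sub, mul

def run_dataflow(instructions, inputs):
    """`instructions`: {name: (op, operand_names)}; `inputs`: initial values.
    Repeatedly fires every instruction whose operands are all available."""
    values = dict(inputs)
    pending = dict(instructions)
    fired = []
    while pending:
        ready = [name for name, (op, args) in pending.items()
                 if all(a in values for a in args)]
        if not ready:
            raise RuntimeError("deadlock: no instruction can fire")
        for name in ready:                      # all ready nodes fire together
            op, args = pending.pop(name)
            values[name] = op(*(values[a] for a in args))
            fired.append(name)
    return values, fired
```

For example, computing (a + b) * (a - b) lets the sum and difference fire concurrently in the first step, with the product firing once both results exist.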
"Demand" is the key word in these types of architectures. An instruction is executed only if the results computed by it are needed by an instruction that has already been enabled to be executed and is in the execution phase.
The difficulties involved in the practical implementation of this type of architecture are synchronizing demands for the results of an instruction and maintaining copies of evaluated results. The latter stems from the fact that a calculated result might be "eaten up" by a subsequent reduction, rendering it unavailable for any later demands.
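Demand-driven evaluation can be sketched with "thunks" (a hypothetical illustration, not from the article): each expression is computed only when its result is demanded, and the value is cached so later demands reuse the copy rather than losing it to the reduction.

```python
# Hypothetical sketch of demand-driven evaluation: each expression is a
# "thunk" that is reduced only when its result is demanded, and the
# computed value is cached so later demands can reuse the copy.

def thunk(fn, *arg_thunks):
    cache = []
    def demand():
        if not cache:                    # reduce only on the first demand
            cache.append(fn(*(arg() for arg in arg_thunks)))
        return cache[0]                  # later demands reuse the saved copy
    return demand

a = thunk(lambda: 5)
b = thunk(lambda: 3)
s = thunk(lambda x, y: x + y, a, b)      # nothing computed yet
unused = thunk(lambda x: x * 1000, s)    # never demanded, so never computed
```

Only calling `s()` triggers the additions; `unused` is never reduced because nothing ever demands it.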
Wavefront array architectures:
This paradigm combines systolic data pipelining with the asynchronous dataflow execution paradigm. These architectures are characterized by modular processors and regular, local interconnection networks.
The processors communicate by sending ready-to-receive and ready-to-transmit signals to each other. Thus at each stage a "wavefront" propagates smoothly, with each processor acting as a wave-propagating medium.
There have been many approaches to architecture design aimed at exploiting maximum parallelism, which is why it has been very difficult to classify them within fixed boundaries. We have presented a brief overview of the broadening field of parallel computer architectures. The objective of the article is "...to explain how the principal types of parallel architectures work". In essence, the main characteristic of a parallel architecture is a fundamental organizational principle for concurrent execution, not an indiscriminate collection of hardware and software.
We have tried to introduce the various kinds of parallel computer architectures in this brief article. Our aim is to introduce this vast area to the enthusiastic readers of CORE, and in the process we have tried not to get too involved in technical details. Interested readers can find elaborate discussions in Hayes, John P., Computer Architecture and Organization (McGraw-Hill); Hwang, K. & Briggs, F.A., Computer Architecture and Parallel Processing (McGraw-Hill); and Duncan, Ralph, "A Survey of Parallel Computer Architectures", IEEE Computer, Feb. 1990.
This article is a condensed and simplified version of the assignment submitted by Shailesh, Rabindra & Yogendra in their first term for the course "Computer Architecture and Organization" at AIT, Bangkok. All three are doing their MS at the Asian Institute of Technology.