Parallel Processors

Current technology trends will be able to support Moore's Law for the next 10 to 20 years, after which we begin to reach the fundamental limits of the speed of light and quantum mechanics.  Parallel and distributed computing has been offered as a way of advancing computing power beyond these limits. In order to realize the potential benefits of parallelism, we will need compilers and software support that can effectively utilize the hardware.

Speedup

Given a parallel algorithm that uses p processors (p can be a function of the size n of the problem) and terminates in time Tp(n), we will define T*(n) as the optimal serial time to solve the same problem.  The ratio of the optimal serial time to solve this problem divided by the time required using p processors is called the speedup of the algorithm.

The efficiency of the algorithm measures the fraction of the time that a typcial processor is usefully employed.

At best the efficiency is equal to 1.  The value of these expressions are limited by the fact that we usually do not know the optimal serial time T*(n). In practice we can use alternative definitions for this parameter such as "the time required by the best known serial algorithm", or "the time required for a single processor to execute the parallel algorithm being analyzed".  The second option can be applied by allowing a single processor to execute the program segment allocated to each processor and computing the time required.  In this case we would not include any idle time expended by a processor waiting for results from other processors.  If each processor was utilized with 100% efficiency then we would obtain Ep(n)=1, otherwise we would obtain Ep(n)<1.

When a large number of processors is available the parallelizable portions of an algorithm can be executed quickly, but the inherently sequential portions create delays.  Amdahl's law quantifies this effect in the following way:  If a program consists of two section, one that is inherently sequential and one that is fully parallelizatble, and if the inherently sequential seciton consumes a fraction f of the total computation then the speedup is limited by,

When the application involves repeated inherently sequential computations on a large data set, we can use pipelining to reduce the effect of Amdahl's law. For example if we are required to apply a k step algorithm on n data sets we can give each step of the computation to a different processor.  At each time step a processor executes its associated step of the algorithm on a different data set and passes its results along to the next processor.  The entire data set can be processed in 2k + n steps (this includes k steps to load the pipeline in the beginning and k steps to flush it out at the end).  During initialization and finalization some processeors will be idle.
NIST Dictionary of Algorithms, Data Structures and Problems.

PRAM Model

The Parallel Random Access Machine or PRAM is a theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step. Processors may execute different instruction streams, but work synchronously. The three most important variations of the PRAM are: EREW - Exclusive read, exclusive write; any memory location may only be accessed once in any one step. CREW - Concurrent read, exclusive write; any memory location may be read any number of times during a single step, but only written to once, with the write taking place after the reads. CRCW - Concurrent read, concurrent write; any memory location may be written to or read from any number of times during a single step. A CRCW PRAM model must define some rule for resolving multiple writes, such as giving priority to the lowest-numbered processor or choosing amongst processors randomly. The PRAM is popular because it is theoretically tractable and because it gives algorithm designers a common target. However, PRAMs cannot be emulated optimally on all architectures. National HPCC (High Performance Computing and Communications) Software Exchange - Glossary
 
EREW - Exclusive Read Exclusive Write
CREW - Concurrent Read Exclusive Write
CRCW - Concurrent Read Concurrent Write

Flynn's Taxonomy

Another method of classifyng parallel computer systems is provided by Flynn's Taxonomy. This is a classification system for architectures that has two axes: the number of instructions streams executing concurrently, and the number of data sets to which those instructions are being applied. The scheme was proposed by Flynn in 1966.

SISD - Single Instruction Single Data - This is the standard single processor computer system.

SIMD - Single Instruction Multiple Data - Examples of this type of system include array processors in which many processors are used to perform the same operations elements of  a large data set.
 
 

MISD - Multiple Instruction Single Data - While some references suggest that this classification is limited to redundant systems it more accurately refers to applications in which a data set is being processed to extract multiple features are characteristics.  For example consider a speaker idenfication system in which dozens of features of a persons voice are being extracted.  Each processor can be used to search for and report the presence or absence of a particular feature.  Other processors can analyze the collection and arrangement of these features to identify a particular speaker.  The same data set (single data) is being provided to many processors each performing a different operation (multiple instruction).

MIMD - Multiple Instruction Multiple Data.  This is the classification for the general purpose multiprocessor.  This category includes loosely coupled multi-processor topologies such as the mesh and the hypercube as well as distributed processing systems such as the Beowulf Cluster.

Network Topologies

Introduction

A key design issue in SIMD, MISD and MIMD systems is the interconnection network.  Which processors and/or local memories communicate directly with each other and what is the overall structure of the system network?  There are a number of ways the elements of a parallel computer system communicate depending on the perpose of the communication.

Some Terminology

nodes - Nodes are the junction points of an interconnection network.  Depending on the application a node can be a processor, a memory bank, an I/O processor, an interconnection link, a hub or a router.

circuit switching - Analogous to a private telephone connection for which there is a dedicated physical connection between the communication nodes.

packet switching - Analogous to a party (shared) telephone connection in which multiple nodes all share the same physical communication line.  A local area network or a domain on the Internet uses a type of packet switching (asynchronous transfer mode).  In ATM a message is broken down into small segments with each one placed inside a bit string (packet) that includes a header and destination information.  The receiver collects the packets, strips off the container info and pieces the original message back together.  In ATM communication there is no guarantee that the packets will arrive in any particular order of even that they will arrive at all.

multistage interconnection networks (MINs) - Have a relatively large number of simple nodes (e.g. store-and-forward nodes) but require longer times to transfer information.  The Internet is the largest of all the MINs.

centralized vs. distributed control - Networks are categorized by the type of control they use.  Circuit switched networks use a centralized control which is a dedicated processors that manages the interconnection links needed for all node-to-node communications. Circuit switching networks typically communication synchronously.  Distributed control is used to manage packet switching networks.  Packet switching or message passing networks usually communicate asynchronously.

communication diameter - This is also called the communication distance and is the number of nodes that a message must pass through between the two most distance nodes.

expandability - The ease at which nodes (usually processing nodes) may be added to an existing network.

redundancy - This is the level to which a network can tolerate (i.e. continue to function at some level) loss of nodes and other elements of the network.

bandwidth - The amount of data that can be sent over the network per unit time.  This is a measure of the rate of information transfer only and DOES NOT include the parity, control or routing ID bits of the packet.

network latency - This is the time required for a message to pass between two points in the network.  In packet switching the latency is not fixed so the maximum, minimum or average latency is specified.

connection degree - The number of communication pathways leaving a node.

Interconnection Topology


Single Bus
 
 
 


Multiple Bus
 
 
 

Crossbar Switching Network
 


Fully Connected Network
 
 
 
 


Ring Network
 
 
 
 
 

Single Bus Tree
 
 
 
 
 

Multiple Bus Tree
 
 
 
 


2D Array Processor
 
 
 
 


2x2 Cross Bar
 
 
 
 
 
 


HyperCube




The hypercube topology can be defined geometrically as having N nodes on the corner of a "cube" in n-space where n-log2N.  A 1-cube is a single processor, a 2-cube is a 4-processor ring topology, a 3-cube has 8 processors, etc...

We can assign each processor a unique ID as a binary value as shown for the 4-cube below:

SIMD Machines

An SIMD (single instruction multiple data) machine has a single master processor that controls many computational processing elements (PE's).  The master processor allocates resources, assigns tasks and collects results from the PEs.  The master processor can broadcast to all PE's simultaneously or it can communicate with individual PEs an necessary.  Typically SIMD systems are designed to solve a specific class of problems and are not considered general-purpose parallel computers.

Processor arrays operate on two-dimensional arrays of data and are desgined to perform either bit-serial operations (see MPP and CM-1) or word-parallel operations (see ILLIAC IV).  The bit-serial processor arrays operate on bit planes which are 2-D arrays of binary values.  The PEs communicate through a 2D interconnection network with nearest-neighbor connections only.

The standard PE to memory structure shown above has a number of variations.


Operations on SIMD Machines

associative search - an associated-memory operation implement on an SIMD machine. This is the process of locating data based on its value.

vector computations - using pipelining and/or ILP to parallelize arithemetic computations

image processing - such as image blurring and edge detection can be performed using templates in parallel with each PE performing the processing for one pixel of the image.

matrix multiplication - using a CREW (concurrent read exclusive write) PRAM model, we can reduce matrix multiplication from an n3 process to a linear process using n2 PEs.
 

more to come....