Current technology trends will be able to support Moore's Law for the next 10 to 20 years, after which we begin to reach the fundamental limits of the speed of light and quantum mechanics. Parallel and distributed computing has been offered as a way of advancing computing power beyond these limits. In order to realize the potential benefits of parallelism, we will need compilers and software support that can effectively utilize the hardware.
Speedup
Given a parallel algorithm that uses p processors (p can be a function of the size n of the problem) and terminates in time Tp(n), we will define T*(n) as the optimal serial time to solve the same problem. The ratio of the optimal serial time to solve this problem divided by the time required using p processors is called the speedup of the algorithm.
![]()
The efficiency of the algorithm measures the fraction of the time that a typcial processor is usefully employed.
![]()
At best the efficiency is equal to 1. The value of these expressions are limited by the fact that we usually do not know the optimal serial time T*(n). In practice we can use alternative definitions for this parameter such as "the time required by the best known serial algorithm", or "the time required for a single processor to execute the parallel algorithm being analyzed". The second option can be applied by allowing a single processor to execute the program segment allocated to each processor and computing the time required. In this case we would not include any idle time expended by a processor waiting for results from other processors. If each processor was utilized with 100% efficiency then we would obtain Ep(n)=1, otherwise we would obtain Ep(n)<1.
When a large number of processors is available the parallelizable portions of an algorithm can be executed quickly, but the inherently sequential portions create delays. Amdahl's law quantifies this effect in the following way: If a program consists of two section, one that is inherently sequential and one that is fully parallelizatble, and if the inherently sequential seciton consumes a fraction f of the total computation then the speedup is limited by,


PRAM Model
| EREW - Exclusive Read Exclusive Write
CREW - Concurrent Read Exclusive Write CRCW - Concurrent Read Concurrent Write |
Flynn's Taxonomy
Another method of classifyng parallel computer systems is provided by Flynn's Taxonomy. This is a classification system for architectures that has two axes: the number of instructions streams executing concurrently, and the number of data sets to which those instructions are being applied. The scheme was proposed by Flynn in 1966.
SISD - Single Instruction Single Data - This is the standard single processor computer system.
SIMD - Single Instruction Multiple Data - Examples of this type
of system include array processors in which many processors are used to
perform the same operations elements of a large data set.
MISD - Multiple Instruction Single Data - While some references suggest that this classification is limited to redundant systems it more accurately refers to applications in which a data set is being processed to extract multiple features are characteristics. For example consider a speaker idenfication system in which dozens of features of a persons voice are being extracted. Each processor can be used to search for and report the presence or absence of a particular feature. Other processors can analyze the collection and arrangement of these features to identify a particular speaker. The same data set (single data) is being provided to many processors each performing a different operation (multiple instruction).

Network Topologies
Introduction
A key design issue in SIMD, MISD and MIMD systems is the interconnection network. Which processors and/or local memories communicate directly with each other and what is the overall structure of the system network? There are a number of ways the elements of a parallel computer system communicate depending on the perpose of the communication.
nodes - Nodes are the junction points of an interconnection network. Depending on the application a node can be a processor, a memory bank, an I/O processor, an interconnection link, a hub or a router.
circuit switching - Analogous to a private telephone connection for which there is a dedicated physical connection between the communication nodes.
packet switching - Analogous to a party (shared) telephone connection in which multiple nodes all share the same physical communication line. A local area network or a domain on the Internet uses a type of packet switching (asynchronous transfer mode). In ATM a message is broken down into small segments with each one placed inside a bit string (packet) that includes a header and destination information. The receiver collects the packets, strips off the container info and pieces the original message back together. In ATM communication there is no guarantee that the packets will arrive in any particular order of even that they will arrive at all.
multistage interconnection networks (MINs) - Have a relatively large number of simple nodes (e.g. store-and-forward nodes) but require longer times to transfer information. The Internet is the largest of all the MINs.
centralized vs. distributed control - Networks are categorized by the type of control they use. Circuit switched networks use a centralized control which is a dedicated processors that manages the interconnection links needed for all node-to-node communications. Circuit switching networks typically communication synchronously. Distributed control is used to manage packet switching networks. Packet switching or message passing networks usually communicate asynchronously.
communication diameter - This is also called the communication distance and is the number of nodes that a message must pass through between the two most distance nodes.
expandability - The ease at which nodes (usually processing nodes) may be added to an existing network.
redundancy - This is the level to which a network can tolerate (i.e. continue to function at some level) loss of nodes and other elements of the network.
bandwidth - The amount of data that can be sent over the network per unit time. This is a measure of the rate of information transfer only and DOES NOT include the parity, control or routing ID bits of the packet.
network latency - This is the time required for a message to pass between two points in the network. In packet switching the latency is not fixed so the maximum, minimum or average latency is specified.
connection degree - The number of communication pathways leaving a node.
Interconnection Topology
Single Bus
Multiple Bus
Crossbar Switching Network
Fully Connected Network
Ring Network
Single Bus Tree
Multiple Bus Tree
2D Array Processor
2x2 Cross Bar
HyperCube
The hypercube topology can be defined geometrically as having N nodes on the corner of a "cube" in n-space where n-log2N. A 1-cube is a single processor, a 2-cube is a 4-processor ring topology, a 3-cube has 8 processors, etc...
We can assign each processor a unique ID as a binary value as shown for the 4-cube below:
SIMD Machines
An SIMD (single instruction multiple data) machine has a single master processor that controls many computational processing elements (PE's). The master processor allocates resources, assigns tasks and collects results from the PEs. The master processor can broadcast to all PE's simultaneously or it can communicate with individual PEs an necessary. Typically SIMD systems are designed to solve a specific class of problems and are not considered general-purpose parallel computers.

The standard PE to memory structure shown above has a number of variations.
Operations on SIMD Machines
associative search - an associated-memory operation implement on an SIMD machine. This is the process of locating data based on its value.
vector computations - using pipelining and/or ILP to parallelize arithemetic computations
image processing - such as image blurring and edge detection can be performed using templates in parallel with each PE performing the processing for one pixel of the image.
matrix multiplication - using a CREW (concurrent read exclusive
write) PRAM model, we can reduce matrix multiplication from an n3
process to a linear process using n2 PEs.
more to come....