Parallel Program Design with example (2024)


Chapter: An Introduction to Parallel Programming : Parallel Hardware and Parallel Software


PARALLEL PROGRAM DESIGN

So we’ve got a serial program. How do we parallelize it? We know that in general we need to divide the work among the processes/threads so that each process gets roughly the same amount of work and communication is minimized. In most cases, we also need to arrange for the processes/threads to synchronize and communicate.

Unfortunately, there isn’t some mechanical process we can follow; if there were, we could write a program that would convert any serial program into a parallel program, but, as we noted in Chapter 1, in spite of a tremendous amount of work and some progress, this seems to be a problem that has no universal solution.

However, Ian Foster provides an outline of steps in his online book Designing and Building Parallel Programs [19]:

1. Partitioning. Divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.

2. Communication. Determine what communication needs to be carried out among the tasks identified in the previous step.

3. Agglomeration or aggregation. Combine tasks and communications identified in the first two steps into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.

4. Mapping. Assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work.

This is sometimes called Foster’s methodology.

An example

Let’s look at a small example. Suppose we have a program that generates large quantities of floating point data that it stores in an array. In order to get some feel for the distribution of the data, we can make a histogram of the data. Recall that to make a histogram, we simply divide the range of the data up into equal-sized subintervals, or bins, determine the number of measurements in each bin, and plot a bar graph showing the relative sizes of the bins. As a very small example, suppose our data are

1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9.

Then the data lie in the range 0–5, and if we choose to have five bins, the histogram might look something like Figure 2.20.

A serial program

It’s pretty straightforward to write a serial program that generates a histogram. We need to decide what the bins are, determine the number of measurements in each bin, and print the bars of the histogram. Since we’re not focusing on I/O, we’ll limit ourselves to just the first two steps, so the input will be

  • the number of measurements, data_count;
  • an array of data_count floats, data;
  • the minimum value for the bin containing the smallest values, min_meas;
  • the maximum value for the bin containing the largest values, max_meas;
  • the number of bins, bin_count.


The output will be an array containing the number of elements of data that lie in each bin. To make things precise, we’ll make use of the following data structures:

  • bin_maxes. An array of bin_count floats
  • bin_counts. An array of bin_count ints

The array bin_maxes will store the upper bound for each bin, and bin_counts will store the number of data elements in each bin. To be explicit, we can define

bin_width = (max_meas - min_meas)/bin_count

Then bin_maxes will be initialized by

for (b = 0; b < bin_count; b++)
   bin_maxes[b] = min_meas + bin_width*(b+1);
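Wrapped in a function for checking, the same initialization might look like this (the name Set_bin_maxes is ours, not the text’s):

```c
/* Compute the upper bound of each bin, exactly as in the loop above.
   Set_bin_maxes is an illustrative name, not one from the text. */
void Set_bin_maxes(float min_meas, float max_meas, int bin_count,
                   float bin_maxes[]) {
    float bin_width = (max_meas - min_meas) / bin_count;
    for (int b = 0; b < bin_count; b++)
        bin_maxes[b] = min_meas + bin_width * (b + 1);
}
```

For the sample data, min_meas = 0, max_meas = 5, and bin_count = 5 give bin_maxes = 1, 2, 3, 4, 5.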

We’ll adopt the convention that bin b will be all the measurements in the range

bin_maxes[b-1] <= measurement < bin_maxes[b]

Of course, this doesn’t make sense if b == 0, and in this case we’ll use the rule that bin 0 will be the measurements in the range

min_meas <= measurement < bin_maxes[0]

This means we always need to treat bin 0 as a special case, but this isn’t too onerous. Once we’ve initialized bin_maxes and assigned 0 to all the elements of bin_counts, we can get the counts by using the following pseudocode:

for (i = 0; i < data_count; i++) {
   bin = Find_bin(data[i], bin_maxes, bin_count, min_meas);
   bin_counts[bin]++;
}

The Find_bin function returns the bin that data[i] belongs to. This could be a simple linear search function: search through bin_maxes until you find a bin b that satisfies

bin_maxes[b-1] <= data[i] < bin_maxes[b]

(Here we’re thinking of bin_maxes[-1] as min_meas.) This will be fine if there aren’t very many bins, but if there are a lot of bins, binary search will be much better.
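Putting the pieces together, the serial program might be sketched as follows. The names follow the text; the binary-search Find_bin shown here is one possible implementation, with bin_maxes[-1] understood to mean min_meas:

```c
/* Return the bin b satisfying bin_maxes[b-1] <= value < bin_maxes[b],
   treating bin_maxes[-1] as min_meas.  Binary search for the smallest
   b with value < bin_maxes[b]. */
int Find_bin(float value, const float bin_maxes[], int bin_count,
             float min_meas) {
    (void) min_meas;  /* unused by the binary search; kept for interface
                         parity with the call in the text */
    int low = 0, high = bin_count - 1;
    while (low < high) {
        int mid = (low + high) / 2;
        if (value < bin_maxes[mid])
            high = mid;       /* answer is mid or to its left  */
        else
            low = mid + 1;    /* answer is to the right of mid */
    }
    return low;
}

/* Zero bin_counts, then run the counting loop from the text. */
void Histogram(const float data[], int data_count,
               const float bin_maxes[], int bin_count,
               float min_meas, int bin_counts[]) {
    for (int b = 0; b < bin_count; b++)
        bin_counts[b] = 0;
    for (int i = 0; i < data_count; i++)
        bin_counts[Find_bin(data[i], bin_maxes, bin_count, min_meas)]++;
}
```

With the 20 sample values above, min_meas = 0, max_meas = 5, and bin_count = 5, this produces the counts 6, 3, 2, 3, 6 for bins 0 through 4.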

Parallelizing the serial program

If we assume that data_count is much larger than bin_count, then even if we use binary search in the Find_bin function, the vast majority of the work in this code will be in the loop that determines the values in bin_counts. The focus of our parallelization should therefore be on this loop, and we’ll apply Foster’s methodology to it. The first thing to note is that the outcomes of the steps in Foster’s methodology are by no means uniquely determined, so you shouldn’t be surprised if at any stage you come up with something different.

For the first step we might identify two types of tasks: finding the bin to which an element of data belongs and incrementing the appropriate entry in bin_counts.

For the second step, there must be a communication between the computation of the appropriate bin and the incrementing of an element of bin_counts. If we represent our tasks with ovals and communications with arrows, we’ll get a diagram that looks something like that shown in Figure 2.21. Here, the task labelled with “data[i]” is determining which bin the value data[i] belongs to, and the task labelled with “bin_counts[b]++” is incrementing bin_counts[b].

For any fixed element of data, the tasks “find the bin b for the element of data” and “increment bin_counts[b]” can be aggregated, since the second can only happen once the first has been completed.

However, when we proceed to the final, or mapping, step, we see that if two processes or threads are assigned elements of data that belong to the same bin b, they’ll both result in execution of the statement bin_counts[b]++. If bin_counts[b] is shared (e.g., the array bin_counts is stored in shared memory), then this will result in a race condition. If bin_counts has been partitioned among the processes/threads, then updates to its elements will require communication. An alternative is to store multiple “local” copies of bin_counts and add the values in the local copies after all the calls to Find_bin.

If the number of bins, bin_count, isn’t absolutely gigantic, there shouldn’t be a problem with this. So let’s pursue this alternative, since it is suitable for use on both shared- and distributed-memory systems.

In this setting, we need to update our diagram so that the second collection of tasks increments loc_bin_cts[b]. We also need to add a third collection of tasks, adding the various loc_bin_cts[b] to get bin_counts[b]. See Figure 2.22. Now we see that if we create an array loc_bin_cts for each process/thread, then we can map the tasks in the first two groups as follows:

  • Elements of data are assigned to the processes/threads so that each process/thread gets roughly the same number of elements.
  • Each process/thread is responsible for updating its loc_bin_cts array on the basis of its assigned elements.
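In the simplest case, with all of the additions assigned to a single process/thread, the third group of tasks reduces to the loop below. This is a serial sketch; the one-row-per-thread layout of loc_bin_cts and the name Sum_local_counts are our assumptions, not the text’s:

```c
/* Sum the per-thread local counts into the shared bin_counts.
   loc_bin_cts[t][b] holds thread t's count for bin b. */
void Sum_local_counts(int thread_count, int bin_count,
                      int loc_bin_cts[thread_count][bin_count],
                      int bin_counts[bin_count]) {
    for (int b = 0; b < bin_count; b++)
        bin_counts[b] = 0;
    for (int t = 0; t < thread_count; t++)
        for (int b = 0; b < bin_count; b++)
            bin_counts[b] += loc_bin_cts[t][b];
}
```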

To finish up, we need to add the elements loc_bin_cts[b] into bin_counts[b]. If both the number of processes/threads and the number of bins are small, all of the additions can be assigned to a single process/thread. If the number of bins is much larger than the number of processes/threads, we can divide the bins among the processes/threads in much the same way that we divided the elements of data. If the number of processes/threads is large, we can use a tree-structured global sum similar to the one we discussed in Chapter 1. The only difference is that now the sending processes/threads are sending an array, and the receiving processes/threads are receiving and adding an array. Figure 2.23 shows an example with eight processes/threads. Each circle in the top row corresponds to a process/thread. Between the first and the second rows, the odd-numbered processes/threads make their loc_bin_cts available to the even-numbered processes/threads. Then in the second row, the even-numbered processes/threads add the new counts to their existing counts. Between the second and third rows the process is repeated with the processes/threads whose ranks aren’t divisible by four sending to those whose ranks are. This process is repeated until process/thread 0 has computed bin_counts.
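The tree-structured sum can be sketched as a serial simulation of the stages just described: at stage step, each rank r that is a multiple of 2*step adds rank r+step’s array into its own. As before, the one-row-per-thread layout of loc_bin_cts is an illustrative assumption:

```c
/* Tree-structured sum of the local count arrays.  Stage 1 pairs
   (0,1), (2,3), ...; stage 2 pairs (0,2), (4,6); and so on, matching
   Figure 2.23.  After the final stage, row 0 holds bin_counts. */
void Tree_sum(int thread_count, int bin_count,
              int loc_bin_cts[thread_count][bin_count]) {
    for (int step = 1; step < thread_count; step *= 2)
        for (int r = 0; r + step < thread_count; r += 2 * step)
            for (int b = 0; b < bin_count; b++)
                loc_bin_cts[r][b] += loc_bin_cts[r + step][b];
}
```

In a real shared- or distributed-memory program the inner addition would be a synchronized copy or a message; here the loop over r simply plays each receiving rank’s role in turn.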
