Parallel Program Design with example (2024)


Chapter: An Introduction to Parallel Programming : Parallel Hardware and Parallel Software


PARALLEL PROGRAM DESIGN

So we’ve got a serial program. How do we parallelize it? We know that in general we need to divide the work among the processes/threads so that each process gets roughly the same amount of work and communication is minimized. In most cases, we also need to arrange for the processes/threads to synchronize and communicate.

Unfortunately, there isn’t some mechanical process we can follow; if there were, we could write a program that would convert any serial program into a parallel program, but, as we noted in Chapter 1, in spite of a tremendous amount of work and some progress, this seems to be a problem that has no universal solution.

However, Ian Foster provides an outline of steps in his online book Designing and Building Parallel Programs [19]:

1. Partitioning. Divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.

2. Communication. Determine what communication needs to be carried out among the tasks identified in the previous step.

3. Agglomeration or aggregation. Combine tasks and communications identified in the first two steps into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.

4. Mapping. Assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work.

This is sometimes called Foster’s methodology.

An example

Let’s look at a small example. Suppose we have a program that generates large quantities of floating point data that it stores in an array. In order to get some feel for the distribution of the data, we can make a histogram of the data. Recall that to make a histogram, we simply divide the range of the data up into equal-sized subintervals, or bins, determine the number of measurements in each bin, and plot a bar graph showing the relative sizes of the bins. As a very small example, suppose our data are

1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9.

Then the data lie in the range 0–5, and if we choose to have five bins, the histogram might look something like Figure 2.20.

A serial program

It’s pretty straightforward to write a serial program that generates a histogram. We need to decide what the bins are, determine the number of measurements in each bin, and print the bars of the histogram. Since we’re not focusing on I/O, we’ll limit ourselves to just the first two steps, so the input will be

  • the number of measurements, data_count;
  • an array of data_count floats, data;
  • the minimum value for the bin containing the smallest values, min_meas;
  • the maximum value for the bin containing the largest values, max_meas;
  • the number of bins, bin_count.


The output will be an array containing the number of elements of data that lie in each bin. To make things precise, we’ll make use of the following data structures:

  • bin_maxes. An array of bin_count floats
  • bin_counts. An array of bin_count ints

The array bin_maxes will store the upper bound for each bin, and bin_counts will store the number of data elements in each bin. To be explicit, we can define

bin_width = (max_meas - min_meas)/bin_count

Then bin_maxes will be initialized by

for (b = 0; b < bin_count; b++)
   bin_maxes[b] = min_meas + bin_width*(b+1);
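Wrapped in a function for checking, the same initialization might look like this (the name Set_bin_maxes is ours, not the text’s):

```c
/* Compute the upper bound of each bin, exactly as in the loop above.
   Set_bin_maxes is an illustrative name, not one from the text. */
void Set_bin_maxes(float min_meas, float max_meas, int bin_count,
                   float bin_maxes[]) {
    float bin_width = (max_meas - min_meas) / bin_count;
    for (int b = 0; b < bin_count; b++)
        bin_maxes[b] = min_meas + bin_width * (b + 1);
}
```

For the sample data, min_meas = 0, max_meas = 5, and bin_count = 5 give bin_maxes = 1, 2, 3, 4, 5.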

We’ll adopt the convention that bin b will be all the measurements in the range

bin_maxes[b-1] <= measurement < bin_maxes[b]

Of course, this doesn’t make sense if b == 0, and in this case we’ll use the rule that bin 0 will be the measurements in the range

min_meas <= measurement < bin_maxes[0]

This means we always need to treat bin 0 as a special case, but this isn’t too onerous. Once we’ve initialized bin_maxes and assigned 0 to all the elements of bin_counts, we can get the counts by using the following pseudocode:

for (i = 0; i < data_count; i++) {
   bin = Find_bin(data[i], bin_maxes, bin_count, min_meas);
   bin_counts[bin]++;
}

The Find_bin function returns the bin that data[i] belongs to. This could be a simple linear search function: search through bin_maxes until you find a bin b that satisfies

bin_maxes[b-1] <= data[i] < bin_maxes[b]

(Here we’re thinking of bin_maxes[-1] as min_meas.) This will be fine if there aren’t very many bins, but if there are a lot of bins, binary search will be much better.
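Putting the pieces together, the serial program might be sketched as follows. The names follow the text; the binary-search Find_bin shown here is one possible implementation, with bin_maxes[-1] understood to mean min_meas:

```c
/* Return the bin b satisfying bin_maxes[b-1] <= value < bin_maxes[b],
   treating bin_maxes[-1] as min_meas.  Binary search for the smallest
   b with value < bin_maxes[b]. */
int Find_bin(float value, const float bin_maxes[], int bin_count,
             float min_meas) {
    (void) min_meas;  /* unused by the binary search; kept for interface
                         parity with the call in the text */
    int low = 0, high = bin_count - 1;
    while (low < high) {
        int mid = (low + high) / 2;
        if (value < bin_maxes[mid])
            high = mid;       /* answer is mid or to its left  */
        else
            low = mid + 1;    /* answer is to the right of mid */
    }
    return low;
}

/* Zero bin_counts, then run the counting loop from the text. */
void Histogram(const float data[], int data_count,
               const float bin_maxes[], int bin_count,
               float min_meas, int bin_counts[]) {
    for (int b = 0; b < bin_count; b++)
        bin_counts[b] = 0;
    for (int i = 0; i < data_count; i++)
        bin_counts[Find_bin(data[i], bin_maxes, bin_count, min_meas)]++;
}
```

With the 20 sample values above, min_meas = 0, max_meas = 5, and bin_count = 5, this produces the counts 6, 3, 2, 3, 6 for bins 0 through 4.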

Parallelizing the serial program

If we assume that data_count is much larger than bin_count, then even if we use binary search in the Find_bin function, the vast majority of the work in this code will be in the loop that determines the values in bin_counts. The focus of our parallelization should therefore be on this loop, and we’ll apply Foster’s methodology to it. The first thing to note is that the outcomes of the steps in Foster’s methodology are by no means uniquely determined, so you shouldn’t be surprised if at any stage you come up with something different.

For the first step we might identify two types of tasks: finding the bin to which an element of data belongs and incrementing the appropriate entry in bin_counts.

For the second step, there must be a communication between the computation of the appropriate bin and the incrementing of an element of bin_counts. If we represent our tasks with ovals and communications with arrows, we’ll get a diagram that looks something like that shown in Figure 2.21. Here, the task labelled with “data[i]” is determining which bin the value data[i] belongs to, and the task labelled with “bin_counts[b]++” is incrementing bin_counts[b].

For any fixed element of data, the tasks “find the bin b for the element of data” and “increment bin_counts[b]” can be aggregated, since the second can only happen once the first has been completed.

However, when we proceed to the final, or mapping, step, we see that if two processes or threads are assigned elements of data that belong to the same bin b, they’ll both result in execution of the statement bin_counts[b]++. If bin_counts[b] is shared (e.g., the array bin_counts is stored in shared memory), then this will result in a race condition. If bin_counts has been partitioned among the processes/threads, then updates to its elements will require communication. An alternative is to store multiple “local” copies of bin_counts and add the values in the local copies after all the calls to Find_bin.

If the number of bins, bin_count, isn’t absolutely gigantic, there shouldn’t be a problem with this. So let’s pursue this alternative, since it is suitable for use on both shared- and distributed-memory systems.

In this setting, we need to update our diagram so that the second collection of tasks increments loc_bin_cts[b]. We also need to add a third collection of tasks, adding the various loc_bin_cts[b] to get bin_counts[b]. See Figure 2.22. Now we see that if we create an array loc_bin_cts for each process/thread, then we can map the tasks in the first two groups as follows:

  • Elements of data are assigned to the processes/threads so that each process/thread gets roughly the same number of elements.
  • Each process/thread is responsible for updating its loc_bin_cts array on the basis of its assigned elements.
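In the simplest case, with all of the additions assigned to a single process/thread, the third group of tasks reduces to the loop below. This is a serial sketch; the one-row-per-thread layout of loc_bin_cts and the name Sum_local_counts are our assumptions, not the text’s:

```c
/* Sum the per-thread local counts into the shared bin_counts.
   loc_bin_cts[t][b] holds thread t's count for bin b. */
void Sum_local_counts(int thread_count, int bin_count,
                      int loc_bin_cts[thread_count][bin_count],
                      int bin_counts[bin_count]) {
    for (int b = 0; b < bin_count; b++)
        bin_counts[b] = 0;
    for (int t = 0; t < thread_count; t++)
        for (int b = 0; b < bin_count; b++)
            bin_counts[b] += loc_bin_cts[t][b];
}
```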

To finish up, we need to add the elements loc_bin_cts[b] into bin_counts[b]. If both the number of processes/threads and the number of bins are small, all of the additions can be assigned to a single process/thread. If the number of bins is much larger than the number of processes/threads, we can divide the bins among the processes/threads in much the same way that we divided the elements of data. If the number of processes/threads is large, we can use a tree-structured global sum similar to the one we discussed in Chapter 1. The only difference is that now the sending processes/threads are sending an array, and the receiving processes/threads are receiving and adding an array. Figure 2.23 shows an example with eight processes/threads. Each circle in the top row corresponds to a process/thread. Between the first and the second rows, the odd-numbered processes/threads make their loc_bin_cts available to the even-numbered processes/threads. Then in the second row, the even-numbered processes/threads add the new counts to their existing counts. Between the second and third rows the process is repeated with the processes/threads whose ranks aren’t divisible by four sending to those whose ranks are. This process is repeated until process/thread 0 has computed bin_counts.
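The tree-structured sum can be sketched as a serial simulation of the stages just described: at stage step, each rank r that is a multiple of 2*step adds rank r+step’s array into its own. As before, the one-row-per-thread layout of loc_bin_cts is an illustrative assumption:

```c
/* Tree-structured sum of the local count arrays.  Stage 1 pairs
   (0,1), (2,3), ...; stage 2 pairs (0,2), (4,6); and so on, matching
   Figure 2.23.  After the final stage, row 0 holds bin_counts. */
void Tree_sum(int thread_count, int bin_count,
              int loc_bin_cts[thread_count][bin_count]) {
    for (int step = 1; step < thread_count; step *= 2)
        for (int r = 0; r + step < thread_count; r += 2 * step)
            for (int b = 0; b < bin_count; b++)
                loc_bin_cts[r][b] += loc_bin_cts[r + step][b];
}
```

In a real shared- or distributed-memory program the inner addition would be a synchronized copy or a message; here the loop over r simply plays each receiving rank’s role in turn.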
