<![CDATA[The MulticoreBSP Forums / Parallelising existing code: parallel for]]> 2012-09-12T20:45:53Z FluxBB http://www.multicorebsp.com/forum/viewtopic.php?id=17 <![CDATA[Parallelising existing code: parallel for]]> When interested in transforming only parts of an existing codebase into BSP, one of the common patterns is to parallelise a single for-loop. In this example we parallelise the numerical integration of 4*sqrt(1-x^2), from 0 to 1, using the repeated trapezoidal rule.

The sequential operation looks as follows:

``````#include <math.h>

unsigned int precision = 100000000;

double f( const double x ) {
return 4 * sqrt( 1 - x * x );
}

double sequential() {
const double h = 1.0 / ( (double)precision );
double I = f( 0 ) + f( 1 );
for( unsigned int i = 1; i < precision - 1; ++i )
I += 2 * f( i * h );
return I / (double)(2*precision);
}``````

An easy parallelisation using MulticoreBSP for C, is by cutting up the for-loop by using the unique thread identification numbers (bsp_pid()), and the total number of threads used in computation (bsp_nprocs()), as follows:

``````void parallel() {
bsp_begin( bsp_nprocs() );``````

This signals the start of a Single Program, Multiple Data (SPMD) section. Multiple threads are started which each execute this function. The number of threads started is given by bsp_nprocs(), which, outside of an SPMD context, returns the total number of available cores on the current system.

``````        //perform the local part of the loop
const double h      = 1.0 / ( (double)precision );
double partial_work = 0.0;
bsp_push_reg( &partial_work, sizeof( double ) );``````

We need to register the memory area the partial_work variable corresponds to for communication, first.

``````        unsigned int start = (unsigned int)  (  bsp_pid()  * precision / (double)bsp_nprocs());
unsigned int end   = (unsigned int) ((bsp_pid()+1) * precision / (double)bsp_nprocs());``````

This evenly distributes the loop and assigns this thread with its own unique piece of the loop.

``````        if( bsp_pid() == 0 ) {
partial_work += f( 0 );
start = 1;
}
if( bsp_pid() == bsp_nprocs() - 1 ) {
partial_work += f( 1 );
end = precision - 1;
}``````

This takes care of the special cases of the repeated trapezoidal rule. We can now start the actual loop:

``````        for( unsigned int i = start; i < end; ++i )
I += 2 * f( i * h );
partial_work /= (double)(2*precision);``````

Now each thread hols a partial result. We need to combine these to obtain the final result. The required all-to-one communication is known as a gather operation:

``````        bsp_sync();
if( bsp_pid() == 0 ) {
double integral = partial_work;
for( unsigned int s = 1; s < bsp_nprocs(); ++s ) {
bsp_direct_get( s, &partial_work, 0, &partial_work, sizeof( double ) );
integral += partial_work;
}``````

The initial synchronisation (bsp_sync) is necessary to ensure each thread was ready with calculating its partial result, before continuing. Note this implementation makes use of the new MulticoreBSP direct_get() primitive; alternatively, each thread could have bsp_put its local contributions in an array local to thread 0, which then could have been read-out after a synchronisation:

``````        double *buffer = NULL;
if( bsp_pid() == 0 ) {
buffer = (double*) malloc( bsp_nprocs() * sizeof( double ) );
bsp_push_reg( &buffer, bsp_nprocs() * sizeof( double ) );
} else
bsp_push_reg( &buffer, 0 );
bsp_put( 0, &partial_work, &buffer, bsp_pid() * sizeof( double ), sizeof( double ) );
bsp_sync();
if( bsp_pid() == 0 ) {
double integral = partial_work;
for( unsigned int s = 0; s < bsp_nprocs(); ++s ) {
integral += buffer[ s ];
}``````

This variant is compatible with the BSPlib variant and thus also runs on distributed-memory systems. In any case, the computation is now finished:

``````                printf( "Integral is %.14lf, time taken for parallel calculation using %d threads: %f\n", integral, bsp_nprocs(), bsp_time() );
}
bsp_end();
}``````

Instead of an all-to-one communication, an all-to-all would enable all threads to know the exact integral in the second superstep (the code region after bsp_sync). It is good to realise the cost of such a communication is the same as the all-to-one, in the BSP model.

]]>
http://www.multicorebsp.com/forum/profile.php?id=3 2012-09-12T20:45:53Z http://www.multicorebsp.com/forum/viewtopic.php?pid=29#p29