Parallelising existing code: parallel for

2012-09-12T20:45:53Z

When interested in transforming only parts of an existing codebase into BSP, one of the common patterns is to parallelise a single for-loop. In this example we parallelise the numerical integration of 4*sqrt(1-x^2), from 0 to 1, using the repeated trapezoidal rule.

The sequential operation looks as follows:

#include 

unsigned int precision = 100000000;

double f( const double x ) {
        return 4 * sqrt( 1 - x * x );
}

double sequential() {
        const double h = 1.0 / ( (double)precision );
        double I = f( 0 ) + f( 1 );
        for( unsigned int i = 1; i < precision - 1; ++i )
                I += 2 * f( i * h );
        return I / (double)(2*precision);
}

An easy parallelisation using MulticoreBSP for C, is by cutting up the for-loop by using the unique thread identification numbers (bsp_pid()), and the total number of threads used in computation (bsp_nprocs()), as follows:

void parallel() {
        bsp_begin( bsp_nprocs() );

This signals the start of a Single Program, Multiple Data (SPMD) section. Multiple threads are started which each execute this function. The number of threads started is given by bsp_nprocs(), which, outside of an SPMD context, returns the total number of available cores on the current system.

        //perform the local part of the loop
        const double h      = 1.0 / ( (double)precision );
        double partial_work = 0.0;
        bsp_push_reg( &partial_work, sizeof( double ) );

We need to register the memory area the partial_work variable corresponds to for communication, first.

        unsigned int start = (unsigned int)  (  bsp_pid()  * precision / (double)bsp_nprocs());
        unsigned int end   = (unsigned int) ((bsp_pid()+1) * precision / (double)bsp_nprocs());

This evenly distributes the loop and assigns this thread with its own unique piece of the loop.

        if( bsp_pid() == 0 ) {
                partial_work += f( 0 );
                start = 1;
        }
        if( bsp_pid() == bsp_nprocs() - 1 ) {
                partial_work += f( 1 );
                end = precision - 1;
        }

This takes care of the special cases of the repeated trapezoidal rule. We can now start the actual loop:

        for( unsigned int i = start; i < end; ++i )
                I += 2 * f( i * h );
        partial_work /= (double)(2*precision);

Now each thread hols a partial result. We need to combine these to obtain the final result. The required all-to-one communication is known as a gather operation:

        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = partial_work;
                for( unsigned int s = 1; s < bsp_nprocs(); ++s ) {
                        bsp_direct_get( s, &partial_work, 0, &partial_work, sizeof( double ) );
                        integral += partial_work;
                }

The initial synchronisation (bsp_sync) is necessary to ensure each thread was ready with calculating its partial result, before continuing. Note this implementation makes use of the new MulticoreBSP direct_get() primitive; alternatively, each thread could have bsp_put its local contributions in an array local to thread 0, which then could have been read-out after a synchronisation:

        double *buffer = NULL;
        if( bsp_pid() == 0 ) {
                buffer = (double*) malloc( bsp_nprocs() * sizeof( double ) );
                bsp_push_reg( &buffer, bsp_nprocs() * sizeof( double ) );
        } else
                bsp_push_reg( &buffer, 0 );
        bsp_put( 0, &partial_work, &buffer, bsp_pid() * sizeof( double ), sizeof( double ) );
        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = partial_work;
                for( unsigned int s = 0; s < bsp_nprocs(); ++s ) {
                        integral += buffer[ s ];
                }

This variant is compatible with the BSPlib variant and thus also runs on distributed-memory systems. In any case, the computation is now finished:

                printf( "Integral is %.14lf, time taken for parallel calculation using %d threads: %f\n", integral, bsp_nprocs(), bsp_time() );
        }
        bsp_end();
}

Instead of an all-to-one communication, an all-to-all would enable all threads to know the exact integral in the second superstep (the code region after bsp_sync). It is good to realise the cost of such a communication is the same as the all-to-one, in the BSP model.

The MulticoreBSP Forums / Parallelising existing code: parallel for

Parallelising existing code: parallel for