The MulticoreBSP Forums

A place to discuss the MulticoreBSP library and its applications, and for discussing the use of the Bulk Synchronous Parallel model on shared-memory architectures, or hybrid distributed/shared architectures.

You are not logged in.

#1 2012-09-12 20:45:53

Albert-Jan Yzelman
Moderator
Registered: 2011-08-04
Posts: 29

Parallelising existing code: parallel for

When interested in transforming only parts of an existing codebase into BSP, one of the common patterns is to parallelise a single for-loop. In this example we parallelise the numerical integration of 4*sqrt(1-x^2), from 0 to 1, using the repeated trapezoidal rule.

The sequential operation looks as follows:

#include <math.h>

unsigned int precision = 100000000;

double f( const double x ) {
        return 4 * sqrt( 1 - x * x );
}

double sequential() {
        const double h = 1.0 / ( (double)precision );
        double I = f( 0 ) + f( 1 );
        for( unsigned int i = 1; i < precision - 1; ++i )
                I += 2 * f( i * h );
        return I / (double)(2*precision);
}

An easy parallelisation using MulticoreBSP for C, is by cutting up the for-loop by using the unique thread identification numbers (bsp_pid()), and the total number of threads used in computation (bsp_nprocs()), as follows:

void parallel() {
        bsp_begin( bsp_nprocs() );

This signals the start of a Single Program, Multiple Data (SPMD) section. Multiple threads are started which each execute this function. The number of threads started is given by bsp_nprocs(), which, outside of an SPMD context, returns the total number of available cores on the current system.

        //perform the local part of the loop
        const double h      = 1.0 / ( (double)precision );
        double partial_work = 0.0;
        bsp_push_reg( &partial_work, sizeof( double ) );

We need to register the memory area the partial_work variable corresponds to for communication, first.

        unsigned int start = (unsigned int)  (  bsp_pid()  * precision / (double)bsp_nprocs());
        unsigned int end   = (unsigned int) ((bsp_pid()+1) * precision / (double)bsp_nprocs());

This evenly distributes the loop and assigns this thread with its own unique piece of the loop.

        if( bsp_pid() == 0 ) {
                partial_work += f( 0 );
                start = 1;
        }
        if( bsp_pid() == bsp_nprocs() - 1 ) {
                partial_work += f( 1 );
                end = precision - 1;
        }

This takes care of the special cases of the repeated trapezoidal rule. We can now start the actual loop:

        for( unsigned int i = start; i < end; ++i )
                I += 2 * f( i * h );
        partial_work /= (double)(2*precision);

Now each thread hols a partial result. We need to combine these to obtain the final result. The required all-to-one communication is known as a gather operation:

        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = partial_work;
                for( unsigned int s = 1; s < bsp_nprocs(); ++s ) {
                        bsp_direct_get( s, &partial_work, 0, &partial_work, sizeof( double ) );
                        integral += partial_work;
                }

The initial synchronisation (bsp_sync) is necessary to ensure each thread was ready with calculating its partial result, before continuing. Note this implementation makes use of the new MulticoreBSP direct_get() primitive; alternatively, each thread could have bsp_put its local contributions in an array local to thread 0, which then could have been read-out after a synchronisation:

        double *buffer = NULL;
        if( bsp_pid() == 0 ) {
                buffer = (double*) malloc( bsp_nprocs() * sizeof( double ) );
                bsp_push_reg( &buffer, bsp_nprocs() * sizeof( double ) );
        } else
                bsp_push_reg( &buffer, 0 );
        bsp_put( 0, &partial_work, &buffer, bsp_pid() * sizeof( double ), sizeof( double ) );
        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = partial_work;
                for( unsigned int s = 0; s < bsp_nprocs(); ++s ) {
                        integral += buffer[ s ];
                }

This variant is compatible with the BSPlib variant and thus also runs on distributed-memory systems. In any case, the computation is now finished:

                printf( "Integral is %.14lf, time taken for parallel calculation using %d threads: %f\n", integral, bsp_nprocs(), bsp_time() );
        }
        bsp_end();
}

Instead of an all-to-one communication, an all-to-all would enable all threads to know the exact integral in the second superstep (the code region after bsp_sync). It is good to realise the cost of such a communication is the same as the all-to-one, in the BSP model.

Last edited by Albert-Jan Yzelman (2015-03-13 10:53:54)

Offline

#2 2018-04-27 17:49:40

Shacruilk
Member
From: Madagascar
Registered: 2018-04-27
Posts: 8
Website

Re: Parallelising existing code: parallel for

Buy Soloxine No Rx Needed Cheap Valtrex Generic Organic Viagra  generic cialis canada Tadalafil 40mg Levitra Mujer Levitra 10 Mg Caratteristiche


Cialis Si Puo Acquistare In Farmacia Priligy Dapoxetine Dans Le U.S  [url=http://cialibuy.com]cialis[/url] Secure Dutasteride In Internet Best Website Price On Line

Offline

Board footer

Powered by FluxBB