The sequential operation looks as follows:

```
#include <math.h>
unsigned int precision = 100000000;
double f( const double x ) {
return 4 * sqrt( 1 - x * x );
}
double sequential() {
const double h = 1.0 / ( (double)precision );
double I = f( 0 ) + f( 1 );
for( unsigned int i = 1; i < precision - 1; ++i )
I += 2 * f( i * h );
return I / (double)(2*precision);
}
```

An easy parallelisation using MulticoreBSP for C, is by cutting up the for-loop by using the unique thread identification numbers (bsp_pid()), and the total number of threads used in computation (bsp_nprocs()), as follows:

```
void parallel() {
bsp_begin( bsp_nprocs() );
```

This signals the start of a Single Program, Multiple Data (SPMD) section. Multiple threads are started which each execute this function. The number of threads started is given by bsp_nprocs(), which, outside of an SPMD context, returns the total number of available cores on the current system.

```
//perform the local part of the loop
const double h = 1.0 / ( (double)precision );
double partial_work = 0.0;
bsp_push_reg( &partial_work, sizeof( double ) );
```

We need to register the memory area the partial_work variable corresponds to for communication, first.

```
unsigned int start = (unsigned int) ( bsp_pid() * precision / (double)bsp_nprocs());
unsigned int end = (unsigned int) ((bsp_pid()+1) * precision / (double)bsp_nprocs());
```

This evenly distributes the loop and assigns this thread with its own unique piece of the loop.

```
if( bsp_pid() == 0 ) {
partial_work += f( 0 );
start = 1;
}
if( bsp_pid() == bsp_nprocs() - 1 ) {
partial_work += f( 1 );
end = precision - 1;
}
```

This takes care of the special cases of the repeated trapezoidal rule. We can now start the actual loop:

```
for( unsigned int i = start; i < end; ++i )
I += 2 * f( i * h );
partial_work /= (double)(2*precision);
```

Now each thread hols a partial result. We need to combine these to obtain the final result. The required all-to-one communication is known as a gather operation:

```
bsp_sync();
if( bsp_pid() == 0 ) {
double integral = partial_work;
for( unsigned int s = 1; s < bsp_nprocs(); ++s ) {
bsp_direct_get( s, &partial_work, 0, &partial_work, sizeof( double ) );
integral += partial_work;
}
```

The initial synchronisation (bsp_sync) is necessary to ensure each thread was ready with calculating its partial result, before continuing. Note this implementation makes use of the new MulticoreBSP direct_get() primitive; alternatively, each thread could have bsp_put its local contributions in an array local to thread 0, which then could have been read-out after a synchronisation:

```
double *buffer = NULL;
if( bsp_pid() == 0 ) {
buffer = (double*) malloc( bsp_nprocs() * sizeof( double ) );
bsp_push_reg( &buffer, bsp_nprocs() * sizeof( double ) );
} else
bsp_push_reg( &buffer, 0 );
bsp_put( 0, &partial_work, &buffer, bsp_pid() * sizeof( double ), sizeof( double ) );
bsp_sync();
if( bsp_pid() == 0 ) {
double integral = partial_work;
for( unsigned int s = 0; s < bsp_nprocs(); ++s ) {
integral += buffer[ s ];
}
```

This variant is compatible with the BSPlib variant and thus also runs on distributed-memory systems. In any case, the computation is now finished:

```
printf( "Integral is %.14lf, time taken for parallel calculation using %d threads: %f\n", integral, bsp_nprocs(), bsp_time() );
}
bsp_end();
}
```

Instead of an all-to-one communication, an all-to-all would enable all threads to know the exact integral in the second superstep (the code region after bsp_sync). It is good to realise the cost of such a communication is the same as the all-to-one, in the BSP model.

]]>