<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<atom:link href="http://www.multicorebsp.com/forum/extern.php?action=feed&amp;fid=10&amp;type=rss" rel="self" type="application/rss+xml" />
		<title><![CDATA[The MulticoreBSP Forums / General]]></title>
		<link>http://www.multicorebsp.com/forum/index.php</link>
		<description><![CDATA[The most recent topics at The MulticoreBSP Forums.]]></description>
		<lastBuildDate>Fri, 13 Sep 2019 20:22:12 +0000</lastBuildDate>
		<generator>FluxBB</generator>
		<item>
			<title><![CDATA[MulticoreBSP on Windows using Code::Blocks]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=27&amp;action=new</link>
			<description><![CDATA[<p>A few years later, there are many more MinGW compiler versions out there that can cause incompatibilities. If the above guide nets you missing symbol errors of the form mcbsp_perf_*, __impl_*PThread*, or __MinGW_*, you&#039;re likely in such a situation. The cleanest solution is to build a MulticoreBSP library from source under your current Windows installation, but this is not supported and thus requires some manual tinkering.</p><p>I&#039;ve hence made sure at least one combination using a ready-to-go Code::Blocks version works as intended:<br />* MinGW + Code::Blocks version 17.12 [1,2]<br />* PThreads-win32 v2.9.1 [3]<br />* MulticoreBSP for C v2.0.4 pre-built 32-bit static library and header file [4,5]</p><p>You should use the x86 GC2 versions of the PThreads pre-built binaries, and you need both the DLL and the .a. To use MulticoreBSP for C in your Windows Code::Blocks project, simply add the bsp.h header [5] to your project and start coding. Before building, make sure to add libmcbsp2.0.4.a [4] and libpthreadGC2.a (the x86 version!) [3] to your linker configuration as shown in [6]:</p><p><span class="postimg"><img src="http://www.multicorebsp.com/images/mcbsp-win32-1.png" alt="mcbsp-win32-1.png" /></span><br /><strong>The order of the static libraries matters</strong>: the MulticoreBSP for C library should be linked first, because it has dependencies on the PThreads library.</p><p>Before running, make sure to put pthreadGC2.dll (the x86 version!) [3] *in the same folder as your executable* (e.g., /path/to/project/bin/Release/ or /path/to/project/bin/Debug). Alternatively, the DLL could be added to your system&#039;s path. The final result should be as in [7]:</p><p><span class="postimg"><img src="http://www.multicorebsp.com/images/mcbsp-win32-2.png" alt="mcbsp-win32-2.png" /></span></p><p>The pre-built Win32 library [4] should work with all current and future MinGW v5.x.x and PThreads-win32 v2.9.x versions. 
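As an aside, the linker configuration shown in [6] corresponds to roughly the following command line (the source and output file names are placeholders; the two .a files are the downloads from [3] and [4]):

```shell
# MulticoreBSP must precede PThreads on the link line,
# since libmcbsp2.0.4.a depends on PThreads symbols:
gcc -o example.exe example.c libmcbsp2.0.4.a libpthreadGC2.a
```
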
The win64 pre-built binary (at <a href="http://www.multicorebsp.com/downloads/c/2.0.4/win64" rel="nofollow">http://www.multicorebsp.com/downloads/c/2.0.4/win64</a>) is only compatible with the older MinGW v4.x.x.</p><p>[1] <a href="http://www.codeblocks.org/downloads/binaries#windows" rel="nofollow">http://www.codeblocks.org/downloads/binaries#windows</a><br />[2] <a href="http://sourceforge.net/projects/codeblocks/files/Binaries/17.12/Windows/codeblocks-17.12mingw-setup.exe" rel="nofollow">http://sourceforge.net/projects/codeblo … -setup.exe</a><br />[3] <a href="ftp://sourceware.org/pub/pthreads-win32/prebuilt-dll-2-9-1-release/" rel="nofollow">ftp://sourceware.org/pub/pthreads-win32 … 1-release/</a><br />[4] <a href="http://www.multicorebsp.com/downloads/c/2.0.4/win32/libmcbsp2.0.4.a" rel="nofollow">http://www.multicorebsp.com/downloads/c … bsp2.0.4.a</a><br />[5] <a href="http://www.multicorebsp.com/downloads/c/2.0.4/bsp.h" rel="nofollow">http://www.multicorebsp.com/downloads/c/2.0.4/bsp.h</a><br />[6] <a href="http://www.multicorebsp.com/images/mcbsp-win32-1.png" rel="nofollow">http://www.multicorebsp.com/images/mcbsp-win32-1.png</a><br />[7] <a href="http://www.multicorebsp.com/images/mcbsp-win32-2.png" rel="nofollow">http://www.multicorebsp.com/images/mcbsp-win32-2.png</a></p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Fri, 13 Sep 2019 20:22:12 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=27&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Multi-BSP: nested SPMD regions for hierarchical MulticoreBSP use]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=28&amp;action=new</link>
			<description><![CDATA[<p>BSPlib and MulticoreBSP are first and foremost designed for the original flat BSP model as proposed by Leslie Valiant in 1990. In 2008, Valiant proposed a model where a BSP computer does not consist of <em>p</em> processors, but of <em>p</em> other BSP computers <strong>or</strong> <em>p</em> processors. This model is called Multi-BSP. MulticoreBSP for C supports this style of programming by supporting nested BSP runs.</p><p><strong>Basics</strong></p><p>What this means is best illustrated by an example. Consider the HP DL980 architecture; this machine consists of two separate chipsets connected by a custom HP interconnect. Each chipset connects four sockets (and four memory controllers). Let us assume each socket contains an 8-core Intel Xeon processor. We observe three important layers in the hierarchy: first, the custom interconnect that basically binds together two mainboards so as to create a single shared-memory computer; second, the four sockets connected by the Intel QuickPath Interconnect; and third, a single processor consisting of 8 cores. This Multi-BSP description can be exploited in MulticoreBSP by using nested SPMD runs as follows:</p><div class="codebox"><pre class="vscroll"><code>#include &quot;mcbsp.h&quot;

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

//forward declarations of the nested SPMD functions defined below
void spmd2();
void spmd3();

void spmd1() {
    bsp_begin( 2 ); //HP custom interconnect
    bsp_init( &amp;spmd2, 0, NULL );
    spmd2();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( &quot;level-1 master thread closes nested BSP run...\n&quot; );
    bsp_end();
}

void spmd2() {
    bsp_begin( 4 ); //four sockets connected via Intel QPI
    bsp_init( &amp;spmd3, 0, NULL );
    spmd3();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( &quot;level-2 master thread closes nested BSP run...\n&quot; );
    bsp_end();
}

void spmd3() {
    bsp_begin( 8 ); //eight cores per processor
    //do useful work here
    bsp_end();
}

int main() {
    printf( &quot;Sequential part 1\n&quot; );

    bsp_init( &amp;spmd1, 0, NULL );
    spmd1();

    printf( &quot;Sequential part 2\n&quot; );
    return EXIT_SUCCESS;    
}</code></pre></div><p>To illustrate that MulticoreBSP for C indeed handles this correctly, we compile the library with the MCBSP_SHOW_PINNING flag enabled, and run it on the HP DL980 machine described above (located at the Flanders ExaScience Labs):</p><div class="codebox"><pre><code>Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2</code></pre></div><p>Two runs need not be identical, of course, as per the usual behaviour of concurrent threads. A second test run, for example, gives us:</p><div class="codebox"><pre><code>Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2</code></pre></div><p>This post describes but a toy example; the hierarchical.c example in the ./examples/ directory of the MulticoreBSP for C distribution provides a Multi-BSP program that actually performs a distributed computation in a hierarchical fashion. Please refer there for further illustration of how to use this facility in practice.</p><p><strong>Thread pinning</strong></p><p>We proceed to describe how the pinning is computed in the hierarchical setting. If we have a machine consisting of 64 threads, and the thread numbering is consecutive with no reserved cores, then requesting <em>p</em> threads actually partitions the entire machine (1,2,...,64) into <em>p</em> parts (submachines), each consisting of consecutive threads. For example, if <em>p</em>=8, then submachine 1 is (1,2,...,8), submachine 2 is (9,10,...,16), ..., and submachine <em>p</em> is (57,58,...,64). Each of the <em>p</em> threads spawned will pin to the first thread number available in its assigned submachine. Starting a nested BSP run applies the same logic, but only to the assigned submachine of the current thread; e.g., calling <em>bsp_begin(2)</em> within submachine 2 in the previous example will result in two threads which are assigned the submachines (9,10,11,12) and (13,14,15,16) and which are pinned to threads 9 and 13, respectively.</p><p>For different thread numberings (e.g., a wrapped numbering, as commonly used on hyperthreading machines), the MulticoreBSP runtime will compute the necessary transformations to automatically yield the pinnings described above. The same holds when reserved cores occur. 
If evenly distributed submachines are not possible or not intended, the user can manually adapt the submachine settings at run time, from within an SPMD section, using the interface exposed in <a href="http://multicorebsp.com/doxygen/d0/dd3/mcbsp-affinity_8h.html" rel="nofollow">mcbsp-affinity.h</a>.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Mon, 22 Jul 2013 22:58:07 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=28&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[bsp_sync()'s between bsp_push_reg()'s?]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=23&amp;action=new</link>
			<description><![CDATA[<p>I checked the logs a while back: the issue I referred to above occurred when a process registered a new variable that incidentally had the same address as a previously registered variable, but not on all threads. This became evident in an application where mallocs, push_regs, pop_regs and frees were woven into subroutines in an incorrect fashion. It did lead to the detection of a bug related to multiple push_regs on the same pointer, which was already fixed before version 1.0.0, not in 1.0.1.</p><p>In any case, synching once should be all right, and the only thing I can think of is that some of the push_regs try to register the same address on some of your threads, as described above. In that case the last registration will prevail (as per the standard). Note that in code like</p><div class="codebox"><pre><code>x=malloc(30*sizeof(double));
free(x);
y=malloc(2*sizeof(char));
assert( x != y );</code></pre></div><p>the assertion may fail, but does not have to; moreover, in a multithreaded setting it may fail on some threads while succeeding on others. This was the root cause of the bug I had.</p><p>Hope this helps.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Wed, 28 Nov 2012 14:46:59 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=23&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP in C++: basic usage]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=22&amp;action=new</link>
			<description><![CDATA[<p>Since MulticoreBSP is now available in C, you can use it from within C++ programs by simply including the mcbsp.h header file in <a href="http://www.parashift.com/c++-faq/include-c-hdrs-nonsystem.html" rel="nofollow">the usual way</a>.</p><p>This requires you to call C-style functions from object-oriented code, and this isn&#039;t always elegant. In particular, having to start an SPMD section by pointing to a global function completely breaks the object-oriented style. To cope with this, MulticoreBSP for C (from version 1.0.1 on) includes a C++-wrapper that objectifies SPMD programs. This class is defined in the <em>mcbsp.hpp</em> header file; including this file also defines <a href="http://www.multicorebsp.com/doxygen/d4/df8/mcbsp_8h.html" rel="nofollow">all regular BSP primitives</a> (that is, there is no need to include the C-header mcbsp.h).</p><p>The C++ header <em>mcbsp.hpp</em> defines the <em>mcbsp</em> namespace, which contains the abstract <em>BSP_program</em> class. An SPMD section making use of the MulticoreBSP library may simply extend this class, and must then implement <em>BSP_program</em>&#039;s pure virtual methods:</p><ul><li><p>virtual void spmd() = 0;</p></li><li><p>virtual BSP_program* newInstance() = 0;</p></li></ul><p>The first method is the entry point of the SPMD code. Each thread involved in the execution of this SPMD code has its own instance of the final <em>BSP_program</em>. The second method ensures that the C++-wrapper can create new instances of the user-defined class when it wants to supply a new thread with its own instance of that class. Finally, <em>BSP_program</em> defines only one other public function:</p><ul><li><p>void begin( unsigned int P = bsp_nprocs() );</p></li></ul><p>Calling this function will start parallel execution of the SPMD code over a user-supplied number of <em>P</em> processors. 
If <em>P</em> is not supplied, the maximum available number of processors is used (see the description of <a href="http://multicorebsp.com/doxygen/d4/df8/mcbsp_8h.html#a0ab6568273b5988ca7fe1478a48d94ef" rel="nofollow">bsp_nprocs()</a>). The use of this class deprecates the use of bsp_end() and bsp_init( ... ); these are implied and handled by the C++-wrapper. The use of bsp_begin( unsigned int P ) is replaced by the use of <em>BSP_program</em>::begin( unsigned int P ).</p><p>An object-oriented parallel `Hello world&#039;-example now looks as follows:</p><div class="codebox"><pre class="vscroll"><code>#include &quot;mcbsp.hpp&quot;

#include &lt;cstdlib&gt;
#include &lt;iostream&gt;

using namespace mcbsp;

class Hello_World: public BSP_program {

        protected:

                virtual void spmd() {
                        std::cout &lt;&lt; &quot;Hello world from thread &quot; &lt;&lt; bsp_pid() &lt;&lt; std::endl;
                }

                virtual BSP_program * newInstance() {
                        return new Hello_World();
                }

        public:

                Hello_World() {}
};

int main() {
        BSP_program *p = new Hello_World();
        p-&gt;begin( 2 );
        p-&gt;begin();
        delete p;
        return EXIT_SUCCESS;
}</code></pre></div><p>The example demonstrates calling <em>BSP_program</em>::begin( ... ) with and without parameters. When run on a quad-core computer, example output is:</p><div class="codebox"><pre><code>Hello world from thread Hello world from thread 01

Hello world from thread 3
Hello world from thread 2
Hello world from thread 0
Hello world from thread 1</code></pre></div><p>The order of the Hello-world printouts might change between runs, and `fusion&#039; of output streams (like with the first run on 2 threads) may or may not occur in repeated runs.</p><p>Any communication between threads still follows the usual C-style functions on contiguous in-memory bytes; the C++-wrapper does not supply object-oriented communication constructs (at present).</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sat, 17 Nov 2012 19:48:31 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=22&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Parallelising existing code: parallel for]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=17&amp;action=new</link>
			<description><![CDATA[<p>When interested in transforming only parts of an existing codebase into BSP, one of the common patterns is to parallelise a single for-loop. In this example we parallelise the numerical integration of 4*sqrt(1-x^2) from 0 to 1 (which equals pi), using the repeated trapezoidal rule.</p><p>The sequential operation looks as follows:</p><div class="codebox"><pre><code>#include &lt;math.h&gt;

unsigned int precision = 100000000;

double f( const double x ) {
        return 4 * sqrt( 1 - x * x );
}

double sequential() {
        const double h = 1.0 / ( (double)precision );
        double I = f( 0 ) + f( 1 );
        for( unsigned int i = 1; i &lt; precision; ++i )
                I += 2 * f( i * h );
        return I / (double)(2*precision);
}</code></pre></div><p>An easy parallelisation using MulticoreBSP for C is to cut up the for-loop using the unique thread identification numbers (bsp_pid()) and the total number of threads used in the computation (bsp_nprocs()), as follows:</p><div class="codebox"><pre><code>void parallel() {
        bsp_begin( bsp_nprocs() );</code></pre></div><p>This signals the start of a Single Program, Multiple Data (SPMD) section. Multiple threads are started, each of which executes this function. The number of threads started is given by bsp_nprocs(), which, outside of an SPMD context, returns the total number of available cores on the current system.</p><div class="codebox"><pre><code>        //perform the local part of the loop
        const double h      = 1.0 / ( (double)precision );
        double partial_work = 0.0;
        bsp_push_reg( &amp;partial_work, sizeof( double ) );</code></pre></div><p>We first need to register the memory area corresponding to the partial_work variable for communication.</p><div class="codebox"><pre><code>        unsigned int start = (unsigned int)  (  bsp_pid()  * precision / (double)bsp_nprocs());
        unsigned int end   = (unsigned int) ((bsp_pid()+1) * precision / (double)bsp_nprocs());</code></pre></div><p>This evenly distributes the loop and assigns this thread its own unique piece of the loop.</p><div class="codebox"><pre><code>        if( bsp_pid() == 0 ) {
                partial_work += f( 0 );
                start = 1;
        }
        if( bsp_pid() == bsp_nprocs() - 1 ) {
                partial_work += f( 1 );
                end = precision;
        }</code></pre></div><p>This takes care of the special cases of the repeated trapezoidal rule. We can now start the actual loop:</p><div class="codebox"><pre><code>        for( unsigned int i = start; i &lt; end; ++i )
                partial_work += 2 * f( i * h );
        partial_work /= (double)(2*precision);</code></pre></div><p>Now each thread holds a partial result. We need to combine these to obtain the final result. The required all-to-one communication is known as a gather operation:</p><div class="codebox"><pre><code>        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = partial_work;
                for( unsigned int s = 1; s &lt; bsp_nprocs(); ++s ) {
                        bsp_direct_get( s, &amp;partial_work, 0, &amp;partial_work, sizeof( double ) );
                        integral += partial_work;
                }</code></pre></div><p>The initial synchronisation (bsp_sync) is necessary to ensure that each thread has finished calculating its partial result before continuing. Note this implementation makes use of the new MulticoreBSP direct_get() primitive; alternatively, each thread could have bsp_put its local contribution into an array local to thread 0, which could then be read out after a synchronisation:</p><div class="codebox"><pre><code>        double *buffer = NULL;
        if( bsp_pid() == 0 ) {
                buffer = (double*) malloc( bsp_nprocs() * sizeof( double ) );
                bsp_push_reg( buffer, bsp_nprocs() * sizeof( double ) );
        } else
                bsp_push_reg( buffer, 0 );
        bsp_sync(); //the registration takes effect at the next synchronisation
        bsp_put( 0, &amp;partial_work, buffer, bsp_pid() * sizeof( double ), sizeof( double ) );
        bsp_sync();
        if( bsp_pid() == 0 ) {
                double integral = 0.0; //buffer[ 0 ] already holds the contribution of thread 0
                for( unsigned int s = 0; s &lt; bsp_nprocs(); ++s ) {
                        integral += buffer[ s ];
                }</code></pre></div><p>This variant uses only standard BSPlib primitives and thus also runs on distributed-memory systems. In any case, the computation is now finished:</p><div class="codebox"><pre><code>                printf( &quot;Integral is %.14lf, time taken for parallel calculation using %d threads: %f\n&quot;, integral, bsp_nprocs(), bsp_time() );
        }
        bsp_end();
}</code></pre></div><p>Instead of an all-to-one communication, an all-to-all would enable all threads to know the exact integral in the second superstep (the code region after bsp_sync). It is good to realise that, in the BSP model, the cost of such an all-to-all is the same as that of the all-to-one: each amounts to an <em>h</em>-relation with <em>h</em> = <em>p</em> - 1, since in both cases the maximum number of words any thread receives is <em>p</em> - 1.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Wed, 12 Sep 2012 20:45:53 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=17&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Parallelising existing code: multiple SPMD areas]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=16&amp;action=new</link>
			<description><![CDATA[<p>Sometimes only specific parts of an application may be worthwhile to parallelise, and it is not cost-effective to re-write the entire application as a single BSP program (although that is in general the <a href="http://en.wikipedia.org/wiki/Amdahl%27s_law" rel="nofollow">right thing to do</a>). MulticoreBSP for C does support multiple SPMD regions in a single code, thus making it possible to write BSP versions of only the compute-intensive, highly parallelisable parts of your application. It works by repeated application of bsp_init and bsp_begin; for example, the following code</p><div class="codebox"><pre class="vscroll"><code>#include &quot;mcbsp.h&quot;

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

void spmd1() {
        bsp_begin( 2 );
        printf( &quot;Hello world from thread %d!\n&quot;, bsp_pid() );
        bsp_end();
}

void spmd2() {
        bsp_begin( bsp_nprocs() );
        printf( &quot;Hello world from thread %d!\n&quot;, bsp_pid() );
        bsp_end();
}

int main() {
        printf( &quot;Sequential part 1\n&quot; );

        bsp_init( &amp;spmd1, 0, NULL );
        spmd1();

        printf( &quot;Sequential part 2\n&quot; );

        bsp_init( &amp;spmd2, 0, NULL );
        spmd2();

        printf( &quot;Sequential part 3\n&quot; );

        return EXIT_SUCCESS;
}</code></pre></div><p>produces a variant of</p><div class="quotebox"><blockquote><div><p>Sequential part 1<br />Hello world from thread 1!<br />Hello world from thread 0!<br />Sequential part 2<br />Hello world from thread 3!<br />Hello world from thread 2!<br />Hello world from thread 0!<br />Hello world from thread 1!<br />Sequential part 3</p></div></blockquote></div><p>(on a quadcore machine). Note the order of the `Hello world&#039; lines may differ. This may also serve as a way to gradually transform a large C codebase into BSP form. MulticoreBSP does incur an overhead of thread creation and initialisation every time bsp_begin is encountered, so be sure each SPMD area indeed constitutes enough work to amortise this cost.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sat, 01 Sep 2012 19:07:03 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=16&amp;action=new</guid>
		</item>
	</channel>
</rss>
