<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<atom:link href="http://www.multicorebsp.com/forum/extern.php?action=feed&amp;type=rss" rel="self" type="application/rss+xml" />
		<title><![CDATA[The MulticoreBSP Forums]]></title>
		<link>http://www.multicorebsp.com/forum/index.php</link>
		<description><![CDATA[The most recent topics at The MulticoreBSP Forums.]]></description>
		<lastBuildDate>Mon, 24 Jan 2022 15:36:36 +0000</lastBuildDate>
		<generator>FluxBB</generator>
		<item>
			<title><![CDATA[Leaderboard update]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=99839&amp;action=new</link>
			<description><![CDATA[<p>Dear Archimedes,</p><p>you are welcome and encouraged to submit-- we will update the leaderboard roughly on a weekly basis. It&#039;s not frozen :)</p><p>All the best, and looking forward to your updated entry!<br />Albert-Jan</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Mon, 24 Jan 2022 15:36:36 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=99839&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[leaderboard visibility]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=99838&amp;action=new</link>
			<description><![CDATA[<p>Hi,</p><p>a nice observation from the mistake we made in maximizing instead of minimizing: given the quality score definition, maximizing mainly aims to diversify the colors of the vertices of a hyperedge as much as possible. The strategy we found therefore scores on average worse than your random partitioner, which means we were able to diversify even more than the random partitioner, thus achieving our &quot;wrong&quot; objective <img src="http://www.multicorebsp.com/forum/img/smilies/smile.png" width="15" height="15" alt="smile" /> ... now let&#039;s try to address the real optimization goal of the challenge <img src="http://www.multicorebsp.com/forum/img/smilies/smile.png" width="15" height="15" alt="smile" /></p>]]></description>
			<author><![CDATA[dummy@example.com (Archimedes)]]></author>
			<pubDate>Sun, 23 Jan 2022 10:35:19 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=99838&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP on Windows using Code::Blocks]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=27&amp;action=new</link>
			<description><![CDATA[<p>A few years later, there are many more MinGW compiler versions out there that can cause incompatibilities. If the above guide nets you missing symbol errors of the form mcbsp_perf_*, __impl_*PThread*, or __MinGW_*, you&#039;re likely in such a situation. The cleanest solution is to build a MulticoreBSP library from source under your current Windows installation, but this is not supported and thus requires some manual tinkering.</p><p>I&#039;ve hence made sure at least one combination using a ready-to-go Code::Blocks version works as intended:<br />* MinGW + CodeBlocks version 17.12 [1,2]<br />* PThreads-win32 v2.9.1 [3]<br />* MulticoreBSP for C v2.0.4 pre-built 32-bit static library and header file [4,5]</p><p>You should use the x86 GC2 versions of the PThreads pre-built binaries, and you need both the DLL and the .a. To use MulticoreBSP for C in your Windows Code::Blocks project, simply add the bsp.h header [5] to your project and start coding. Before building, make sure to add the libmcbsp2.0.4.a [4] and libpthreadGC2.a (the x86 version!) [3] to your linker configuration as shown in [6]:</p><p><span class="postimg"><img src="http://www.multicorebsp.com/images/mcbsp-win32-1.png" alt="mcbsp-win32-1.png" /></span><br /><strong>The order of the static libraries matters</strong>-- the MulticoreBSP for C library should be linked first, because it has dependencies on the PThreads library.</p><p>Before running, make sure to put the pthreadGC2.dll (the x86 version!) [3] *in the same folder as your executable* (e.g., /path/to/project/bin/Release/ or /path/to/project/bin/Debug). Alternatively, the DLL could be added to your system&#039;s path. The final result should be as in [7]:</p><p><span class="postimg"><img src="http://www.multicorebsp.com/images/mcbsp-win32-2.png" alt="mcbsp-win32-2.png" /></span></p><p>The pre-built Win32 library [4] should work with all current and future MinGW v5.x.x and PThreads-win32 v2.9.x versions. 
The win64 pre-built binary (at <a href="http://www.multicorebsp.com/downloads/c/2.0.4/win64" rel="nofollow">http://www.multicorebsp.com/downloads/c/2.0.4/win64</a>) is only compatible with the older MinGW v4.x.x.</p><p>[1] <a href="http://www.codeblocks.org/downloads/binaries#windows" rel="nofollow">http://www.codeblocks.org/downloads/binaries#windows</a><br />[2] <a href="http://sourceforge.net/projects/codeblocks/files/Binaries/17.12/Windows/codeblocks-17.12mingw-setup.exe" rel="nofollow">http://sourceforge.net/projects/codeblo … -setup.exe</a><br />[3] <a href="ftp://sourceware.org/pub/pthreads-win32/prebuilt-dll-2-9-1-release/" rel="nofollow">ftp://sourceware.org/pub/pthreads-win32 … 1-release/</a><br />[4] <a href="http://www.multicorebsp.com/downloads/c/2.0.4/win32/libmcbsp2.0.4.a" rel="nofollow">http://www.multicorebsp.com/downloads/c … bsp2.0.4.a</a><br />[5] <a href="http://www.multicorebsp.com/downloads/c/2.0.4/bsp.h" rel="nofollow">http://www.multicorebsp.com/downloads/c/2.0.4/bsp.h</a><br />[6] <a href="http://www.multicorebsp.com/images/mcbsp-win32-1.png" rel="nofollow">http://www.multicorebsp.com/images/mcbsp-win32-1.png</a><br />[7] <a href="http://www.multicorebsp.com/images/mcbsp-win32-2.png" rel="nofollow">http://www.multicorebsp.com/images/mcbsp-win32-2.png</a></p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Fri, 13 Sep 2019 20:22:12 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=27&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Version 2.0.4 released]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=99837&amp;action=new</link>
			<description><![CDATA[<p>This new version contains many bugfixes over 2.0.3; many thanks to Frédéric Dabrowski, Arvid Jokabsson, and Rob Bisseling for reporting some of these. Initial support for Android was added, while the build tools bspcc and bspcxx were improved. The profile mode now also prints the BSP signature, the ratio of useful work versus total run-time.</p><p>Get your copy: <a href="http://multicorebsp.com/downloads/c/2.0.4/MulticoreBSP-for-C.tar.xz" rel="nofollow">http://multicorebsp.com/downloads/c/2.0 … r-C.tar.xz</a></p><p>The quick-start guide has had an overhaul to correspond to the new v2 series of MulticoreBSP. Check it out: <a href="http://multicorebsp.com/documentation/quickC/" rel="nofollow">http://multicorebsp.com/documentation/quickC/</a></p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sun, 31 Mar 2019 15:05:24 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=99837&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP 2.0.3 released!]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=53&amp;action=new</link>
			<description><![CDATA[<p>Time to update this project has become quite sparse. Nevertheless, I&#039;m happy to have recently been able to bring version 2.0.3 into the open!</p><p>New features:</p><ul><li><p>checkpointing support,</p></li><li><p>initial support for accelerators,</p></li><li><p>improved pinning support for hyperthreaded machines,</p></li><li><p>speed improvements,</p></li><li><p>machine benchmarking suite and more example codes,</p></li><li><p>flexible APIs (no more need to recompile for compatibility mode),</p></li><li><p>new compilation modes: debug and profiling.</p></li></ul><p>Objects compiled in various modes can be mixed freely-- that is, if you suspect a bug in one part of a large project, only the suspect part can be compiled in debug mode. That way, full error checking (and its corresponding overhead) is incurred only on the suspect code. The same is true for profiling or compatibility codes, although the former has minor caveats (see the changelog).</p><p>No distributed-memory support was added. Multi-BSP programming is available separately. See <a href="http://multicorebsp.com/download/c/" rel="nofollow">http://multicorebsp.com/download/c/</a> for full details.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sun, 20 May 2018 15:23:46 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=53&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Version 1.2, and a roadmap for future releases]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=42&amp;action=new</link>
			<description><![CDATA[<h5>MulticoreBSP for C version 1.2</h5><p>It is my pleasure to announce the third update to the MulticoreBSP for C library. Version 1.2 brings improved pinning support for nested BSP runs, which benefits code explicitly following the <a href="http://dl.acm.org/citation.cfm?id=1889509" rel="nofollow">Multi-BSP model</a> (as opposed to flat BSP). C++ support has been extended by the addition of <a href="http://multicorebsp.com/doxygen/d4/d6c/mcbsp-templates_8hpp.html" rel="nofollow">templated BSPlib primitives</a>, which provide an escape from having to deal explicitly with byte sizes and byte offsets.<br />Smaller improvements concern documentation, internal data structures, and compilation support; the latter now includes release testing using the Clang LLVM compiler, next to GCC and the Intel C++ Compiler. Several bugs from version 1.1.0 have been resolved as well; I would like to thank Jing Fan and Joshua Moerman for reporting some of these. As always, the <a href="http://multicorebsp.com/downloads/c/changelog.txt" rel="nofollow">changelog</a> contains more details.</p><p>Version 1.2.0 can be downloaded from the following URL: <a href="http://multicorebsp.com/download/c/" rel="nofollow">http://multicorebsp.com/download/c/</a></p><h5>Roadmap</h5><p><strong>Version 1.3</strong>:<br />Explicitly writing parallel programs using the Multi-BSP model can be done using nested SPMD sections in MulticoreBSP for C. This is not a straightforward effort, and can easily lead to codes written for specific architectures. This hurts portability, reduces productivity, and negatively affects the ease of use that BSP libraries are otherwise commonly known for. 
It is also at odds with the intention of (Multi-)BSP as an abstract bridging model.<br />To enable the implementation of Multi-BSP codes such that the produced codes are (1) valid for all Multi-BSP computers, (2) clearly and transparently structured, and (3) compatible with existing BSPlib code fragments, <strong>version 1.3</strong> will support C++ extensions allowing for explicit Multi-BSP programming.</p><p><strong>Version 2.0</strong>:<br />Many have requested that MulticoreBSP for C be deployable over distributed-memory architectures. The original plan was to handle this problem simultaneously with the addition of automatic global barrier avoidance, pipelined communication, and fault-tolerance. It remains, however, unclear in what time-frame I will be able to address these important issues. From <strong>version 2.0</strong> on, I will hence first make MulticoreBSP for C a fully hybrid system, such that any BSP program automatically uses MPI for inter-node process coordination, while using PThreads for intra-node threading. These extensions will continue to adhere to the <a href="http://dx.doi.org/10.1007/s10766-013-0262-9" rel="nofollow">updated BSPlib standard</a> as published with the introduction of MulticoreBSP for C; in particular, nested BSP runs will remain possible, which in turn allows the Multi-BSP C++ extensions that will be introduced in Version 1.3 to be deployable over any MPI-supporting cluster or supercomputer.<br />A secondary target is to not interfere with existing threading and parallel programming interfaces; advanced users of version 2.0 of MulticoreBSP for C will be able to mix their (Multi-)BSP codes with any existing MPI, OpenMP, or Cilk Plus codes that they may already have.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Wed, 14 May 2014 13:19:43 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=42&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Bug Trying To Implement MulticoreBSP with JNI on W7 64 bits]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=41&amp;action=new</link>
			<description><![CDATA[<p>As a note: this question is about calling MulticoreBSP <strong>for C</strong> from Java. Note that there&#039;s also a native Java version available.</p><p>As for the question, with part one, you&#039;re using a 64-bit compiler but linking against a 32-bit library; hence the BSP functions are not recognised. For part two, the BSP functions are recognised by the linker, but the MulticoreBSP for C library depends on POSIX Threads, which is not found by the linker. Windows does not have native PThreads support. To add PThreads, you can link against libraries from the following project:</p><p><a href="http://www.sourceware.org/pthreads-win32/" rel="nofollow">http://www.sourceware.org/pthreads-win32/</a></p><p>For a walkthrough on how to get it working in Code::Blocks, please consult this thread:</p><p><a href="http://multicorebsp.com/forum/viewtopic.php?id=27" rel="nofollow">http://multicorebsp.com/forum/viewtopic.php?id=27</a></p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sat, 18 Jan 2014 18:13:34 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=41&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Multi-BSP: nested SPMD regions for hierarchical MulticoreBSP use]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=28&amp;action=new</link>
			<description><![CDATA[<p>BSPlib and MulticoreBSP are first and foremost designed for the original flat BSP model as proposed by Leslie Valiant in 1990. In 2008, Valiant proposed a model where a BSP computer does not consist of <em>p</em> processors, but of <em>p</em> other BSP computers <strong>or</strong> <em>p</em> processors. This model is called Multi-BSP. MulticoreBSP for C supports this style of programming by supporting nested BSP runs.</p><p><strong>Basics</strong></p><p>What this means is best illustrated by an example. Consider the HP DL980 architecture; this machine consists of two separate chipsets connected by a custom HP interconnect. Each chipset connects four sockets (and four memory controllers). Let us assume each socket contains an 8-core Intel Xeon processor. We observe three important layers in the hierarchy: first, the custom interconnect that basically binds together two mainboards so as to create a single shared-memory computer; second, the four sockets connected by Intel Quick-Path Interconnect; and third, a single processor consisting of 8 cores. This Multi-BSP description can be exploited in MulticoreBSP by using nested SPMD runs as follows:</p><div class="codebox"><pre class="vscroll"><code>#include &quot;mcbsp.h&quot;

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

void spmd1() {
    bsp_begin( 2 ); //HP custom interconnect
    bsp_init( &amp;spmd2, 0, NULL );
    spmd2();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( &quot;level-1 master thread closes nested BSP run...\n&quot; );
    bsp_end();
}

void spmd2() {
    bsp_begin( 4 ); //four sockets per Intel QPI
    bsp_init( &amp;spmd3, 0, NULL );
    spmd3();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( &quot;level-2 master thread closes nested BSP run...\n&quot; );
    bsp_end();
}

void spmd3() {
    bsp_begin( 8 ); //eight cores per processor
    //do useful work here
    bsp_end();
}

int main() {
    printf( &quot;Sequential part 1\n&quot; );

    bsp_init( &amp;spmd1, 0, NULL );
    spmd1();

    printf( &quot;Sequential part 2\n&quot; );
    return EXIT_SUCCESS;    
}</code></pre></div><p>To illustrate that MulticoreBSP for C indeed handles this correctly, we compile the library with the MCBSP_SHOW_PINNING flag enabled, and run it on the HP DL980 machine (located at the Flanders ExaScience Labs) described above:</p><div class="codebox"><pre><code>Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2</code></pre></div><p>Two runs need not be identical, of course, as per the usual behaviour of concurrent threads. A second test run, for example, gives us:</p><div class="codebox"><pre><code>Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2</code></pre></div><p>This post describes but a toy example; in the ./examples/ directory in the MulticoreBSP for C distribution, the hierarchical.c example provides a Multi-BSP program that actually does a distributed computation in a hierarchical fashion. Please refer there for further illustration on how to practically use this facility.</p><p><strong>Thread pinning</strong></p><p>We proceed to describe how the pinning is computed in the hierarchical setting. If we have a machine consisting of 64 threads, and the thread numbering is consecutive with no reserved cores, then requesting <em>p</em> threads actually partitions the entire machine (1,2,...,64) into <em>p</em> parts (submachines), each consisting of consecutive threads. For example, if <em>p</em>=8, then submachine 1 is (1,2,...,8), submachine 2 is (9,10,...,16), ..., and submachine <em>p</em> is (57,58,...,64). Each of the <em>p</em> threads spawned will pin to the first thread number available in its assigned submachine. Starting a nested BSP run will apply the same logic, but only on the assigned submachine of the current thread; e.g., calling <em>bsp_begin(2)</em> within submachine 2 in the previous example will result in two threads which are assigned the submachines (9,10,11,12) and (13,14,15,16), respectively, and which are pinned to threads 9 and 13, respectively.</p><p>For different thread numberings (e.g., a wrapped numbering, as for instance commonly used on hyperthreading machines), the MulticoreBSP runtime will compute the necessary transformations to automatically yield the pinnings as described above. The same is true when reserved cores do occur. 
If evenly distributed submachines are not possible or not intended, the user can manually adapt the submachine settings using the interface exposed in <a href="http://multicorebsp.com/doxygen/d0/dd3/mcbsp-affinity_8h.html" rel="nofollow">mcbsp_affinity.h</a> at runtime, while already in an SPMD section.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Mon, 22 Jul 2013 22:58:07 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=28&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[multicoreBSP Java version]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=24&amp;action=new</link>
			<description><![CDATA[<p>Hi sikanderhayat,</p><p>I think the confusion here is that there cannot be just two vectors in a (BSP) SPMD run; each single program has its own two local vectors. With p processors, there will thus be 2p separate vectors.</p><p>Suppose x and y are the *global* vectors, and x_s and y_s the *local* vectors at process s. Note again that x and y do not exist anywhere. Then x should be distributed so that the union of the p subvectors x_s yields x; and similarly for y_s and y. In other words, we need a <strong>partitioning</strong> of the global vectors x and y. Again, this still happens at the design stage, not at the implementation level. We now first design the algorithm.</p><p>To compute an inner product we can first compute the local inner product, then communicate the partial results, and then combine those partial results. Following the BSP paradigm, we want no communication during the computation of the local inner product; hence, the distributions (the way of partitioning) of x and y must be equal. The communication step is an all-to-all communication of a single element, and the last computation step is indeed also completely local as it just accumulates all received elements into one.<br />The cost of the first computation step is given by the local lengths of x_s and y_s (which are equal, since the distribution is equal). For load-balance, setting this length to n/p for all local vectors, with n the size of the global x and y, is optimal. Note that the actual partitioning scheme thus does not matter; as long as x is distributed as y is, and as long as all local vectors are of equal size. This concludes the design. We can now go on with the implementation.</p><p>Each SPMD section must <strong>initialise</strong> local vectors of size n/p. This means allocating room for x_s and y_s, and initialising the values of the elements therein according to the global distribution that was chosen. 
Again, there are no global vectors x and y; we only initialise local vectors, from scratch. We then implement the algorithm as described above, and we are done. (It is quite natural for BSP implementations to just be literal translations of a BSP design.)</p><p>If you had the global vectors x and y available, then some part of the code is not in SPMD style. You can still force MulticoreBSP to run with a block-distribution of those vectors, but then you have to override the SPMD paradigm by using (or abusing) the available shared memory. You will then be suboptimal in terms of data locality, which will lead to big issues on architectures where memory access is non-uniform (NUMA architectures).</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sat, 04 May 2013 17:01:37 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=24&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP for C, version 1.1 released]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=25&amp;action=new</link>
			<description><![CDATA[<p>We are happy to announce an update to the MulticoreBSP for C software. Highlights are:</p><ul><li><p>A new BSP primitive: the <a href="http://www.multicorebsp.com/doxygen/d4/df8/mcbsp_8h.html#a13ef87e8b0304f484efb1d8cf6e670b2" rel="nofollow">bsp_hpsend</a>,</p></li><li><p>Improved communication speed,</p></li><li><p>Improved synchronisation speed (on machines with a large number of cores),</p></li><li><p>New possibilities for advanced control of thread affinities as required on NUMA architectures,</p></li><li><p>Compilation support for Windows;</p></li><li><p>... and see the <a href="http://www.multicorebsp.com/downloads/c/changelog.txt" rel="nofollow">changelog</a> for other new additions and bugfixes!</p></li></ul><p>This release accompanies a new introductory paper to MulticoreBSP for C, which describes the BSP model, defines the updated BSPlib interface, and presents two BSP applications with performance evaluations on machines with highly non-uniform memory access (NUMA):</p><p>A. N. Yzelman, R. H. Bisseling, D. Roose, and K. Meerbergen, <a href="http://www.cs.kuleuven.be/publicaties/rapporten/tw/TW624.abs.html" rel="nofollow">MulticoreBSP for C: a high-performance library for shared-memory parallel programming</a>, technical report TW624, KU Leuven, 2013 (submitted for publication).</p><p>Version 1.1 of MulticoreBSP for C is <a href="http://www.multicorebsp.com/?page=download&amp;section=c" rel="nofollow">ready for download</a>, and the corresponding <a href="http://www.multicorebsp.com/doxygen/" rel="nofollow">documentation</a> has been updated. As always, we welcome your <a href="http://www.multicorebsp.com/?page=contact" rel="nofollow">feedback</a>, and wish you many pleasant BSP programming sessions!</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Tue, 05 Mar 2013 17:42:49 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=25&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[bsp_sync()'s between bsp_push_reg()'s?]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=23&amp;action=new</link>
			<description><![CDATA[<p>I checked the logs a while back: the issue I referred to above occurred when a process registers a new variable which incidentally had the same address as a previously registered variable, but not on all processors. This became evident in an application where mallocs, push_regs, pop_regs and frees were woven into subroutines in an incorrect fashion. It did lead to the detection of a bug related to multiple push_regs on the same pointer, which was already fixed before version 1.0.0, not in 1.0.1.</p><p>In any case, synching once should be all right, and the only thing I can think of is that some of the push_regs try to register the same address on some of your threads, like the above. In that case the last register will prevail (as per the standard). Note that in code like</p><div class="codebox"><pre><code>x=malloc(30*sizeof(double));
free(x);
y=malloc(2*sizeof(char));
assert( x != y);</code></pre></div><p>the assertion may fail, but doesn&#039;t have to, and in multithreading some threads may fail while others would not; this was the root cause in the bug I had.</p><p>Hope this helps.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Wed, 28 Nov 2012 14:46:59 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=23&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP Configuration]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=20&amp;action=new</link>
			<description><![CDATA[<p>Thanks for the reply. I configured the MulticoreBSP library in Eclipse and through the terminal as per your post. Thanks again!</p>]]></description>
			<author><![CDATA[dummy@example.com (sikanderhayat)]]></author>
			<pubDate>Mon, 19 Nov 2012 12:29:29 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=20&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP in C++: basic usage]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=22&amp;action=new</link>
			<description><![CDATA[<p>Since MulticoreBSP is now available in C, you can use it from within C++ programs by simply including the mcbsp.h header file in <a href="http://www.parashift.com/c++-faq/include-c-hdrs-nonsystem.html" rel="nofollow">the usual way</a>.</p><p>This requires you to call C-style functions from object-oriented code, and this isn&#039;t always elegant. In particular, having to start an SPMD section by pointing to a global function completely breaks the object-oriented style. To cope with this, MulticoreBSP for C (from version 1.0.1 on) includes a C++-wrapper that objectifies SPMD programs. This class is defined in the <em>mcbsp.hpp</em> header file; including this file also defines <a href="http://www.multicorebsp.com/doxygen/d4/df8/mcbsp_8h.html" rel="nofollow">all regular BSP primitives</a> (that is, there is no need to include the C-header mcbsp.h).</p><p>The C++ header <em>mcbsp.hpp</em> defines the <em>mcbsp</em> namespace which contains the abstract <em>BSP_program</em> class. An SPMD section making use of the MulticoreBSP library may simply extend this class, and then must implement <em>BSP_program</em>&#039;s purely virtual methods:</p><ul><li><p>virtual void spmd() = 0;</p></li><li><p>virtual BSP_program* newInstance() = 0;</p></li></ul><p>The first method is the entry point of the SPMD code. Each thread involved in execution of this SPMD code has its own instance of the final <em>BSP_program</em>. The second method ensures that the C++-wrapper can create new instances of the user-defined class when it wants to supply a new thread with its own instance of that class. Finally, <em>BSP_program</em> defines only one other public function:</p><ul><li><p>void begin( unsigned int P = bsp_nprocs() );</p></li></ul><p>Calling this function will start parallel execution of the SPMD code over a user-supplied number of <em>P</em> processors. 
If <em>P</em> is not supplied, the maximum available number of processors is used (see the description of <a href="http://multicorebsp.com/doxygen/d4/df8/mcbsp_8h.html#a0ab6568273b5988ca7fe1478a48d94ef" rel="nofollow">bsp_nprocs()</a>). The use of this class deprecates the use of bsp_end(), and bsp_init( ... ); these are implied and handled by the C++-wrapper. The use of bsp_begin( unsigned int P ) is replaced by the use of <em>BSP_program</em>::begin( unsigned int P ).</p><p>An object-oriented parallel `Hello world&#039;-example now looks as follows:</p><div class="codebox"><pre class="vscroll"><code>#include &quot;mcbsp.hpp&quot;

#include &lt;cstdlib&gt;
#include &lt;iostream&gt;

using namespace mcbsp;

class Hello_World: public BSP_program {

        protected:

                virtual void spmd() {
                        std::cout &lt;&lt; &quot;Hello world from thread &quot; &lt;&lt; bsp_pid() &lt;&lt; std::endl;
                }

                virtual BSP_program * newInstance() {
                        return new Hello_World();
                }

        public:

                Hello_World() {}
};

int main() {
        BSP_program *p = new Hello_World();
        p-&gt;begin( 2 );
        p-&gt;begin();
        delete p;
        return EXIT_SUCCESS;
}</code></pre></div><p>The example demonstrates calling <em>BSP_program</em>::begin( ... ) with and without parameters. Running on a quad-core computer, example output is:</p><div class="codebox"><pre><code>Hello world from thread Hello world from thread 01

Hello world from thread 3
Hello world from thread 2
Hello world from thread 0
Hello world from thread 1</code></pre></div><p>The order of the Hello-world printouts might change between runs, and `fusion&#039; of output streams (like with the first run on 2 threads) may or may not occur in repeated runs.</p><p>Any communication between threads still follows the usual C-style functions on contiguous in-memory bytes; the C++-wrapper does not supply object-oriented communication constructs (at present).</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sat, 17 Nov 2012 19:48:31 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=22&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[Bug? Global memory corrupted after bsp_sync() [not-a-bug]]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=21&amp;action=new</link>
			<description><![CDATA[<p>If it&#039;s searching for a position only (no changes to the tree), then passing the global variables as const parameters should already stop the compiler from pushing them onto the stack after each recursive call <img src="http://www.multicorebsp.com/forum/img/smilies/wink.png" width="15" height="15" alt="wink" /></p><p>It generally also helps performance to store tree-like structures in flat arrays, and transform recursive algorithms into loop-based ones (to help data locality and prevent pushing return addresses at each recursion, respectively); if your application allows such changes, of course.</p><p>In any case, I hope the changes are not hard to incorporate,<br />and don&#039;t hesitate to post again if BSP issues seem to pop up! <img src="http://www.multicorebsp.com/forum/img/smilies/smile.png" width="15" height="15" alt="smile" /></p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Sun, 11 Nov 2012 11:36:30 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=21&amp;action=new</guid>
		</item>
		<item>
			<title><![CDATA[MulticoreBSP for C update released: version 1.0.1 now available.]]></title>
			<link>http://www.multicorebsp.com/forum/viewtopic.php?id=19&amp;action=new</link>
			<description><![CDATA[<p>This release brings MulticoreBSP for C to Mac OS X users, and provides a wrapper for all those programming in C++, thus enabling full BSP programming on Mac computers and for C++ programmers!</p><p>See the <a href="http://www.multicorebsp.com/downloads/c/changelog.txt" rel="nofollow">changelog </a> for more details (and on other changes), and pick up the new version <a href="http://www.multicorebsp.com/?page=download&amp;section=c" rel="nofollow">here</a>.</p>]]></description>
			<author><![CDATA[dummy@example.com (Albert-Jan Yzelman)]]></author>
			<pubDate>Wed, 17 Oct 2012 15:12:39 +0000</pubDate>
			<guid>http://www.multicorebsp.com/forum/viewtopic.php?id=19&amp;action=new</guid>
		</item>
	</channel>
</rss>
