
MulticoreBSP for C: a quick-start guide

For the purpose of this guide, we assume the following:

  • you have downloaded MulticoreBSP for C (MulticoreBSP-for-C.tar.xz);
  • you have downloaded BSPedupack (BSPedupack2.0.tar.gz).

The above corresponds to versions 2.0.4 and 2.0 of MulticoreBSP for C and BSPedupack, respectively. These might not be the latest versions at the time you read this; check the respective websites for the latest software updates (MulticoreBSP, BSPedupack).

Compiling the library

First extract all files with

  • tar xvfJ MulticoreBSP-for-C.tar.xz

and the library will appear in the new `./MulticoreBSP-for-C' directory. We proceed with compilation:

  • cd MulticoreBSP-for-C;
  • make

This will create three new subdirectories: `include', `lib', and `tools'.

Warning for OS X and Windows users

Some popular OSes do not support the POSIX standards we use here; these include Mac OS X (no POSIX realtime) and Microsoft Windows (no POSIX threads and no POSIX realtime). On OS X, after extraction but before issuing `make', please execute the following steps:

  • make include.mk
  • Uncomment the SHARED_LINKER_FLAGS and C_STANDARD_FLAGS appropriate for OS X systems in include.mk

Also note that modern OS X systems masquerade the LLVM clang compiler as a gcc compiler; it is therefore advisable to also enable the LLVM/clang compiler by uncommenting the respective lines in include.mk. After these steps, this guide also applies to OS X.

On Microsoft Windows, make use of the PThreads-win32 project and see this forum post for details. Please note that only basic support for Windows is maintained, through cross-compilation; see include.mk.

Optional: testing the library

MulticoreBSP for C comes with a basic testing suite. To check if your version of the library compiled correctly, you may want to run these tests with

  • make tests

This should output one SUCCESS message for each test. Some tests may require manual checking of output. Each test verifies functions in a separate internal file of the library, or checks the behaviour of MulticoreBSP against its specification. If a failure occurs, it usually specifies the origin at the function level; please report it should this happen.

Finally, to reset the testing environment, issue

  • rm machine.info

Writing, compiling, and running your own BSP application

Rather than relying on existing sources, we give a very short `Hello world!' example. We can compile a MulticoreBSP program using the bspcc and bsprun scripts that are available in the ./tools directory. To simplify their use, we add them to our path:

  • export BSPPATH=`pwd`/tools
  • export PATH=${BSPPATH}:${PATH}

To make this setting persist across sessions, add the above two lines to your startup script; for the Bourne-again shell (bash), this is ~/.bashrc.

A MulticoreBSP for C application starts as any other C program. Let us write example.c:

#include <stdlib.h>

int main( int argc, char ** argv ) {
    return EXIT_SUCCESS;
}

To use our library to spawn three threads which each print `Hello world!', we include the MulticoreBSP for C header file, bsp.h, write parallel code in a Single Program Multiple Data (SPMD) function, and have the main function call it:

#include <bsp.h>
#include <stdio.h>
#include <stdlib.h>

void spmd() {
    bsp_begin( 3 );
    printf( "Hello world!\n" );
    bsp_end();
}

int main( int argc, char ** argv ) {
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}

We can now compile and run example.c:

  • bspcc -o example example.c
  • ./example

The output should consist of the following line, printed once by each of the three threads:

Hello world!

Adding the use of the bsp_nprocs and bsp_pid primitives enables us to start a variable number of threads, and to differentiate between threads within an SPMD program. We modify example.c to illustrate:

#include <bsp.h>
#include <stdlib.h>
#include <stdio.h>

static unsigned int P;

void spmd() {
    bsp_begin( P );
    printf( "Hello world from thread %u out of %u!\n", bsp_pid(), bsp_nprocs() );
    bsp_end();
}

int main( int argc, char ** argv ) {
    printf( "How many threads do you want started? There are %u cores available.\n", bsp_nprocs() );
    fflush( stdout );
    scanf( "%u", &P );
    if( P == 0 || P > bsp_nprocs() ) {
        fprintf( stderr, "Cannot start %u threads.\n", P );
        return EXIT_FAILURE;
    }
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}

Starting from the main function, we first inform the user how many BSP processes the system supports by means of the bsp_nprocs primitive. We then ask the user how many processes she would like to start by means of scanf. We store the result in a global static variable so that the spmd function, called later, can read it.

After checking the user input for possible errors, we register the spmd function for parallel execution, and call it. This is similar to the earlier Hello World example. The spmd function calls bsp_begin using the user-defined number of processes, P. Each parallel process then again prints a Hello World line, but this time with the local process ID and the total number of BSP processes in the parallel SPMD run. This information is retrieved via the bsp_pid and bsp_nprocs primitives. Note that only bsp_nprocs can be called both outside and within parallel SPMD code -- when called outside an SPMD region it returns the number of available processes, while when called from within an SPMD region it returns the number of processes participating in the parallel run.

We compile and run this extended example:

  • bspcc -o example example.c
  • ./example

Output may, for example, look like the following:

How many threads do you want started? There are 3 cores available.
3
Hello world from thread 2 out of 3!
Hello world from thread 0 out of 3!
Hello world from thread 1 out of 3!

This example writes output in arbitrary order. For meaningful parallel programming, communication between threads is necessary. Sixteen of the remaining seventeen MulticoreBSP primitives provide various means of doing this; please refer to the documentation for a full description. The introductory text on the BSP model may help new BSP programmers as well.
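
For illustration, the following variant of the above example uses three of these primitives -- bsp_push_reg, bsp_put, and bsp_sync -- to have each thread send its process ID to its right neighbour. This is a minimal sketch only; please refer to the documentation for the precise semantics of each primitive:

#include <bsp.h>
#include <stdio.h>
#include <stdlib.h>

void spmd() {
    bsp_begin( 3 );
    unsigned int in = 0;
    const unsigned int out = bsp_pid();
    /* register `in' as a communication destination; this takes effect at the next sync */
    bsp_push_reg( &in, sizeof(unsigned int) );
    bsp_sync();
    /* write our process ID into the `in' variable of our right neighbour */
    bsp_put( (bsp_pid() + 1) % bsp_nprocs(), &out, &in, 0, sizeof(unsigned int) );
    bsp_sync();
    printf( "Thread %u received the ID of thread %u.\n", bsp_pid(), in );
    bsp_end();
}

int main( int argc, char ** argv ) {
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}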

Compiling in debug or profile mode

Sometimes your parallel program will not behave as expected. To help diagnose any issues due to bugs, one can compile in debug mode:

  • bspcc --debug -o example example.c

The compilation will emit a warning that you are compiling in debug mode -- the resulting program still runs the same, but extra checks are performed on every call to a BSP primitive, which slows down overall execution. Run the example program again and note that extra information is printed. Always make sure to compile in the default performance mode for production or benchmarking codes!

The profiling mode allows characterising the performance of the compiled code:

  • bspcc --profile -o example example.c

Again, a warning will be emitted since profiling brings with it a performance overhead. Running a program in profiling mode will show, for each superstep, statistics such as the time spent in computation phases, time spent buffering communication, and time spent communicating:

  • ./example

How many threads do you want started? There are 6 cores available.
6
Hello world from thread 1 out of 6!
Hello world from thread 5 out of 6!
Hello world from thread 0 out of 6!
Hello world from thread 2 out of 6!
Hello world from thread 4 out of 6!
Hello world from thread 3 out of 6!

=====================
SPMD profiling report
=====================

Number of supersteps: 1.

*** Last computation phase ***
maximum time spent in computation phase = 0.000365 seconds
maximum time spent in buffering = 0.000000 seconds
maximum time spent in pure computations = 0.000365 seconds
maximum time spent in communication phase = 0.000000 seconds
maximum time spent in BSP calls = 0.000000 seconds
...

Programs compiled in profile mode also report the number of bytes sent and the h-relation achieved, the number of calls made to BSP primitives, and an overall BSP signature -- the ratio of useful (computational) work versus the total run-time:

...
h-relation (in bytes, with metadata) = 0
maximum imbalance (in bytes) = 0
true h-relation (in bytes) = 0

total number of bytes sent = 0
total number of bytes received = 0
total number of meta-data sent = 0
total number of meta-data received = 0
total number of bytes buffered = 0
total number of bytes in direct_get = 0

total number of DRMA calls:
- bsp_put: 0
- bsp_get: 0
- bsp_hpput: 0
- bsp_hpget: 0
- bsp_direct_get: 0

total number of BSMP calls:
- bsp_send: 0
- bsp_hpsend: 0
- bsp_move: 0
- bsp_hpmove: 0

total number of memory registration calls (bsp_push_reg, bsp_pop_reg, bsp_set_tagsize): 0
total number of calls to non-communicating BSP primitives (bsp_pid, bsp_nprocs, bsp_time, bsp_qsize, bsp_get_tag): 12

BSP signature (time spent in pure computations divided by the total runtime): 0.000365 / 0.000365 = 1.000000.

End profile. This report was generated in 0.000184 seconds.

Since the example program used thus far does not perform communication, most fields in the above profile read zero. While profiling more complicated programs is certainly more interesting, the profiles printed at the end of execution may quickly become overwhelming if there are many supersteps. In such cases, please see the mcbsp_profiling_set_description primitive. (Be sure to include the mcbsp-profiling.h header to enable the use of this extension.)

Compiling and running the BSPedupack programs

First, make sure that the PATH environment variable contains <path to the McBSP installation directory>/tools/

  • echo ${PATH}

If it does not, please see the start of this guide for how to set it, and how to make the setting permanent.

Let us now go out of the MulticoreBSP installation directory and extract the BSPedupack code:

  • cd ..
  • tar xvfz BSPedupack2.0.tar.gz
  • cd BSPedupack2.0

BSPedupack compiles using a simple `make':

  • make

The six BSPedupack applications compiled in the previous step can be run as you would any application:

  1. ./bench, a BSP benchmarking utility
  2. ./fft, a parallel Fast Fourier Transform for complex-valued vectors (note that both the vector length as well as the number of threads requested must be powers of 2)
  3. ./inprod, a parallel inner-product calculation
  4. ./lu, a parallel dense LU factorisation (this program uses a 2D process grid of M times N threads)
  5. ./match, a parallel greedy weighted graph matching (BSPedupack includes a test graph for 4 threads).
  6. ./matvec, a parallel sparse matrix–vector multiplication benchmark (BSPedupack includes a test matrix for 2 threads. Use the Mondriaan software to try other matrices).

Benchmarking g and l
  • ./bench

How many processors do you want to use?
6
...
Time of 2046-relation = 77.40 microsec = 175904 flops
Time of 2047-relation = 79.37 microsec = 180383 flops
Time of 2048-relation = 77.81 microsec = 176847 flops
size double = 8 bytes
Range h=0 to p : g= 1027.3, l= 12730.9
Range h=p to 2048: g= 83.9, l= 14809.6
The bottom line for this BSP computer is:
p= 6, r= 2272.799 Mflop/s, g= 83.9, l= 14809.6

Immortal Fast Fourier Transformation
  • ./fft

How many processors do you want to use?
4
Please enter length n:
512
FFT of vector of length 512 using 4 processors
...
Computing rate of FFT = 2456.964326 Mflop/s
Absolute error= 1.030139e-11
Relative error= 2.011990e-14

This FFT algorithm is called immortal because it achieves the lower bound on parallel communication costs required for the FFT in the BSP model: there are no alternative BSP algorithms that would incur a lower communication cost.

See also the HPBSP package for an optimised version that makes use of the new hp primitives in MulticoreBSP, and of optimised sequential FFT kernels as generated by Spiral.

Parallel inner product computation
  • ./inprod

How many processors do you want to use?
2
Please enter n:
250000
Proc 1: sum of squares up to 250000*250000 is 5208364583375000
Proc 0: sum of squares up to 250000*250000 is 5208364583375000
This took only 0.003139 seconds.
n(n+1)(2n+1)/6 = 5208364583375000

Parallel LU decomposition
  • ./lu

Please enter number of processor rows M:
3
Please enter number of processor columns N:
2
Please enter matrix size n:
18
LU decomposition of 18 by 18 matrix
using the 3 by 2 cyclic distribution
Start of LU decomposition
...
End of LU decomposition
This took only 0.000310 seconds.
...

Greedy parallel weighted graph matching
  • ./match

How many processors do you want to use?
4
Please enter the filename of the matrix distribution
chain.mtx-P4
Please enter the maximum number of operations per superstep
(0 if no maximum)
0
...
Matching took only 0.000018 seconds.
...

This algorithm is called greedy because it iteratively matches vertices along the heaviest edges: it makes locally optimal choices in the hope of reaching a globally optimal solution. While greedy algorithms commonly do not lead to optimal solutions, an approximation bound may be provable. In the case of this particular greedy matching algorithm, the resulting matching cannot be more than 50% off the optimum.

Sparse matrix–vector multiplication
  • ./matvec

How many processors do you want to use?
2
Please enter the filename of the matrix distribution
test.mtx-P2
Please enter the filename of the v-vector distribution
test.mtx-v2
Please enter the filename of the u-vector distribution
test.mtx-u2
...
Initialization took only 0.000026 seconds.
Each matvec took only 0.000004 seconds.
...

See also the HPBSP package for an optimised version that makes use of the new hp primitives in MulticoreBSP.

Manual compilation, compiling C++ code, and system-wide installation

Compiling C++ code can be done using the bspcxx tool instead of bspcc. To manually compile your codes, use the --show flag to inspect all arguments bspcc and bspcxx pass through to the regular compiler.

In summary, when in C mode, bspcc falls back to ANSI C99, passes ./include to the -I flag, statically links against the library in ./lib, and links to POSIX Threads and POSIX Realtime. When compiling using -c, the static and dynamic linkage flags are omitted. When compiling in C++ mode using bspcxx, nothing changes except for the use of a C++ compiler and the use of the ANSI C++98 standard (if no others were manually defined). On OS X, neither tool will link against POSIX Realtime.
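
For illustration, a manual compilation of example.c from within the MulticoreBSP-for-C directory might then look as follows. Note that the name of the library archive is an assumption here (libmcbsp2.0.4.a, matching the version used in this guide); use bspcc --show to retrieve the exact arguments that apply to your installation:

  • gcc -std=c99 -I./include -o example example.c ./lib/libmcbsp2.0.4.a -lpthread -lrt

On OS X, the -lrt flag is omitted since there is no POSIX Realtime library to link against.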

Both bspcc and bspcxx store the full paths to the include and lib directories; i.e., the path where MulticoreBSP was built is also its install directory. If you want to separate the two, please move the public headers and compiled libraries to your preferred installation path manually, and edit the paths in bspcc and bspcxx accordingly before installing them in your preferred path. For users without super-user privileges, adding the local install directory to your path as described above suffices and is recommended.

Running BSP programs

While a bsprun tool is provided, it need not be used when the recommended static linkage is employed (as in all the above examples): the final linked program can simply be executed from the command line, e.g., via ./example in all of the above. Object (.o) files that are linked together into one executable can each be compiled in different modes: performance, profile, and debug mode can be mixed in one executable. There is only one special rule: if at least one unit is compiled in profile mode, then the unit which contains the corresponding bsp_end must be compiled in profile mode also.
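
For example, assuming hypothetical source files a.c and b.c where a.c contains the call to bsp_end, the following compiles a.c in profile mode and b.c in the default performance mode before linking the two:

  • bspcc --profile -c -o a.o a.c
  • bspcc -c -o b.o b.c
  • bspcc --profile -o program a.o b.o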

To hyperthread or not to hyperthread: how to pin threads

Modern Intel processors include hyperthreading. Programs that do not incur high memory latencies typically do not benefit from hyperthreads, while the use of hyperthreads typically also leads to higher performance variabilities. A user may thus wish to control, on an application-by-application basis, whether hyperthreads are to be used.

MulticoreBSP allows controlling this via a file called machine.info in the working directory of each application. To disable hyperthreads on a processor with two hardware threads per core, the contents of such a file should read

threads_per_core 2
thread_numbering wrapped
unused_threads_per_core 1

The machine.info file controls affinity and thread pinning. To interactively create one appropriate for your specific use case, issue `make machine.info' from the MulticoreBSP-for-C root directory (you may need to remove the existing machine.info first). This tool allows optimising for bandwidth-bound versus compute-bound applications, and can also create pinning strategies suitable for the Intel Xeon Phi.

One can specify common pinning strategies in machine.info using `affinity compact', `affinity scatter', or `affinity manual'. However, if you have Intel hyperthreads active, MulticoreBSP needs to be made aware of the fact that multiple hardware threads share a single core.

For example, a compact strategy on a hyperthread-enabled processor is enabled via:

threads_per_core 2
thread_numbering wrapped
affinity compact

A manual pinning can be hardcoded as follows, assuming, for example, a processor with four hardware threads:

affinity manual
pinning 0 1 0 1

Manual thread pinning is unaffected by the presence of hyperthreads, though the user should of course be aware which threads map to the same core. Manual pinning overrides internal checks and will not result in warnings if, for example, multiple BSP processes are pinned to the same hardware thread (as in the above).

More examples and benchmarking your current machine

For more examples of BSP programs, please see the MulticoreBSP-for-C/examples directory. A simple `make' in that directory builds several example applications, ranging from a simple `hello world' in both C and C++ to a numerical integration in the parallel_loop example.

To gauge the speed of your computer, for example to verify that MulticoreBSP performs as expected, please navigate to the MulticoreBSP-for-C/benchmarks directory. When your system has an MPI implementation available, simply issue `make' to build a set of benchmark programs using both MulticoreBSP and your MPI installation. These are then automatically run; the results can be inspected by running the plot_results.m script in MATLAB (or Octave).

If you do not have MPI installed, issue `make nompi' in the benchmarks directory instead, which builds the non-MPI benchmarks only. These can be run manually; no automated runs nor plotting tools are available in this mode. The tools correspond to several common collective communication patterns, each implemented using one of the BSP communication primitives, and the benchmark programs are named accordingly: put, get, hpput, and hpget. Each of these runs on various sizes of input data so as to measure behaviour at the various levels of cache.

The barrier program times the bsp_sync primitive, while the stream-memcpy, stream-sw-triad, and stream-hw-triad each measure local memory speeds. The memcpy uses the system standard memory copy routine available on your system, while the other two benchmarks measure bandwidth using variants of the STREAM benchmark. The latter program omits a barrier after each collective which the other two tools do perform; the use of this benchmark relies on an appropriate and successful thread pinning (and should thus not be used on OS X).