The Bulk Synchronous Parallel library
We introduce BSP libraries by describing the object-oriented MulticoreBSP for Java library. It must be noted that it relies heavily on previous work on (distributed-memory) BSP libraries, specifically the Oxford BSP library. There is also an implementation of MulticoreBSP for C available. For other libraries, please refer to the related papers section.
Parallel algorithms within BSP are constructed according to the Single Program, Multiple Data (SPMD) paradigm; each process executes the same program, but performs its operations on different data. These programs are sequential programs which may contain directives to queue communication to any of the other processes.
BSP libraries typically have a very small amount of primitives: MulticoreBSP for Java consists of no more than 15 primitive functions, 4 classes, and one interface; this is everything required for efficient parallel programming. To compare, the Oxford BSP library supplies 21 primitives, and the MulticoreBSP for C library has 22 functions. This very low number of primitives is one of the strongest aspects of BSP programming.
We sketch the idea of BSP libraries in the object-oriented setting of MulticoreBSP for Java. There, a BSP algorithm is an abstract class BSP_PROGRAM. If a programmer wants to write a parallel algorithm, she extends the BSP_PROGRAM class, and provides an implementation for the following two virtual functions:
- virtual protected void main_part()
- virtual protected void parallel_part()
The first function contains sequential code preparing the parallel run, and starts up this run by executing the bsp_begin( P ) function. This concurrently starts P threads, each executing the code in parallel_part(). All threads within the SPMD part can call the following BSP functions:
- protected int bsp_pid(), which returns the unique thread ID;
- protected int bsp_nprocs(), which returns the number of threads used;
- protected void bsp_sync(), which synchronises all threads;
- protected void bsp_abort(), which aborts the whole parallel job.
Note that these functions are defined in the BSP_PROGRAM class. An instance of a BSP_PROGRAM can be started by calling its public start() method, similar to how a normal Java thread instance is started.
In MulticoreBSP, communication only works with variables implementing the generic BSP_COMM<T> interface; that is, a shared variable in MulticoreBSP is any class extended from BSP_COMM<T>, and stores a value of type T. The type T must be a Java class, and furthermore must implement a cloning function, so that the library can create independent clones, as required during communication. We refer to the documentation for further details.
As this is a SPMD model, when a thread defines a shared variable, this shared variable is defined by all threads in the same job. A shared variable has no meaning when not linked to a BSP_PROGRAM thread, thus shared variable constructors must always take a BSP_PROGRAM as parameter in its constructor.
The MulticoreBSP communication primitives are defined as functions of the shared variables. When called, it is called by a specific processor on the specific local instance of this shared variable, and this shared variable is always considered to be the destination variable. BSP_COMM gives access to the following four modes of communication:
void bsp_put( T source, int destination_pid )
void bsp_put( BSP_COMM<T> source, int destination_pid )
This function buffers the data as found at the source variable, to send it to the destination thread upon synchronisation. The source variable is free to be changed after this function is called; the value of source found at the time of the function call will be the value transmitted. The transmitted value will be available at the shared variable at the destination processor at the beginning of the next superstep.
void bsp_get( BSP_COMM<T> source, int source_pid )
This is the reverse of a put operation; it gets a value from a remote processor and puts it in the shared variable at the current processor. Again, this happens at synchronisation. The value transmitted is the value of source at the source processor at its end of the current superstep, and this value is available at the local processor at the start of the next superstep.
void bsp_direct_get( BSP_COMM<T> source, int source_pid )
Functionality of this communication primitive is exactly that of the normal get operation, except communication is executed immediately and not at synchronisation. This is a major deviation of the BSP paradigm, but a necessary one if advantage is to be taken from shared-memory architectures. Using a direct get instead of a normal get can reduce the number of supersteps required by a parallel algorithm, thus obtaining a higher efficiency; we refer to the documentation on how to do this systematically.
- void bsp_send( T source, int destination_pid )
void bsp_send( BSP_COMM<T> source, int destination_pid )
All previous functions provided one-to-one communication; if multiple gets or puts are performed on the same destination, only one of those will be successful. The send, in contrast, is a Bulk Synchronous Message Passing (BSMP) directive: each send results in a value being added to the queue of the receiving variable, and all sent variables will become available at their destination, during the next superstep. The following functions are available to read out the message queue:
The first function moves the first message from the queue into the current variable, the second function returns the number of messages still in queue. If the move was called while the queue is empty, the value of the current variable is set to NULL.
A last function is the void bsp_unregister() function, which can be called to destroy a shared variable. Calling this function at any thread invalidates the variable at all threads; this should not be called within a superstep where the variable is still used. Note that these communication functions are all defined within the BSP_COMM interface.
Communication classes in MulticoreBSP
MulticoreBSP provides a stock implementation of the BSP_COMM<T> interface in the form of the BSP_REGISTER<T> class. This class implements read and write methods to access the local value it stores, and correctly implements the above communication primitives. Note that the correctness depends largely on the requirement that the type T is cloneable.
When use of the BSP_REGISTER class is inefficient, whether due to performance issues or verbosity in coding, the BSP_COMM interface has to be implemented alternatively. Examples are the BSP_INT_ARRAY and the BSP_DOUBLE_ARRAY. These implement BSP_COMM for primitive arrays of integers and doubles; also, the functions defined in BSP_COMM are extended to better handle communicating subsequent parts of shared arrays.
For more details, please see the documentation, or the MulticoreBSP paper.