The MulticoreBSP Forums

A place to discuss the MulticoreBSP library and its applications, and for discussing the use of the Bulk Synchronous Parallel model on shared-memory architectures, or hybrid distributed/shared architectures.


#1 2013-07-22 22:58:07

Albert-Jan Yzelman
Moderator
Registered: 2011-08-04
Posts: 32

Multi-BSP: nested SPMD regions for hierarchical MulticoreBSP use

BSPlib and MulticoreBSP are first and foremost designed for the original flat BSP model as proposed by Leslie Valiant in 1990. In 2008, Valiant proposed a model in which a BSP computer consists not of p processors, but of p smaller BSP computers (or, at the lowest level, p processors). This model is called Multi-BSP. MulticoreBSP for C supports this style of programming by allowing nested BSP runs.

Basics

What this means is best illustrated by an example. Consider the HP DL980 architecture; this machine consists of two separate chipsets connected by a custom HP interconnect. Each chipset connects four sockets (and four memory controllers). Let us assume each socket contains an 8-core Intel Xeon processor. We observe three important layers in the hierarchy: first, the custom interconnect that essentially binds together two mainboards so as to create a single shared-memory computer; second, the four sockets connected by the Intel QuickPath Interconnect (QPI); and third, a single processor consisting of 8 cores. This Multi-BSP description can be exploited in MulticoreBSP by using nested SPMD runs as follows:

#include "mcbsp.h"

#include <stdio.h>
#include <stdlib.h>

//forward declarations: spmd2 and spmd3 are passed to bsp_init
//before they are defined below
void spmd2( void );
void spmd3( void );

void spmd1() {
    bsp_begin( 2 ); //HP custom interconnect
    bsp_init( &spmd2, 0, NULL );
    spmd2();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( "level-1 master thread closes nested BSP run...\n" );
    bsp_end();
}

void spmd2() {
    bsp_begin( 4 ); //four sockets per chipset, connected by Intel QPI
    bsp_init( &spmd3, 0, NULL );
    spmd3();
    bsp_sync();
    if( bsp_pid() == 0 )
        printf( "level-2 master thread closes nested BSP run...\n" );
    bsp_end();
}

void spmd3() {
    bsp_begin( 8 ); //eight cores per processor
    //do useful work here
    bsp_end();
}

int main() {
    printf( "Sequential part 1\n" );

    bsp_init( &spmd1, 0, NULL );
    spmd1();

    printf( "Sequential part 2\n" );
    return EXIT_SUCCESS;    
}

To illustrate that MulticoreBSP for C indeed handles this correctly, we compile the library with the MCBSP_SHOW_PINNING flag enabled and run the program on the HP DL980 machine described above (located at the Flanders ExaScience Labs):

Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2

Two runs need not be identical, of course, as per the usual behaviour of concurrent threads. A second test run, for example, gives us:

Sequential part 1
Info: pinning used is 0 32
Info: pinning used is 0 8 16 24
Info: pinning used is 0 1 2 3 4 5 6 7
Info: pinning used is 32 40 48 56
Info: pinning used is 16 17 18 19 20 21 22 23
Info: pinning used is 24 25 26 27 28 29 30 31
Info: pinning used is 8 9 10 11 12 13 14 15
Info: pinning used is 32 33 34 35 36 37 38 39
Info: pinning used is 40 41 42 43 44 45 46 47
Info: pinning used is 56 57 58 59 60 61 62 63
Info: pinning used is 48 49 50 51 52 53 54 55
level-2 master thread closes nested BSP run...
level-2 master thread closes nested BSP run...
level-1 master thread closes nested BSP run...
Sequential part 2

This post describes but a toy example; the hierarchical.c example in the ./examples/ directory of the MulticoreBSP for C distribution provides a Multi-BSP program that performs an actual distributed computation in hierarchical fashion. Please refer to that example for further illustration of how to use this facility in practice.

Thread pinning

We proceed to describe how the pinning is computed in the hierarchical setting. Suppose the machine exposes 64 hardware threads, the thread numbering is consecutive, and no cores are reserved. Requesting p threads then partitions the entire machine (1,2,...,64) into p parts (submachines), each consisting of consecutive threads. For example, if p=8, then submachine one is (1,2,...,8), submachine two is (9,10,...,16), ..., and submachine p is (57,58,...,64). Each of the p threads spawned will pin to the first thread number available in its assigned submachine. Starting a nested BSP run applies the same logic, but only within the assigned submachine of the current thread; e.g., calling bsp_begin(2) from within submachine two of the previous example results in two threads that are assigned the submachines (9,10,11,12) and (13,14,15,16), respectively, and that are pinned to threads 9 and 13, respectively.
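The partitioning logic itself is simple. The following stand-alone sketch is illustrative only: it is not the library's actual implementation, it uses the zero-based thread numbers that appear in the pinning output above, and the helper name submachine_of is hypothetical. It computes which submachine a given thread receives and which hardware thread it pins to:

#include <stdio.h>

/* Illustrative sketch only: given a submachine spanning hardware threads
   [offset, offset + size), split it into p consecutive parts and report
   the part assigned to local thread s; that thread pins to the first
   hardware thread of its part. Not part of the MulticoreBSP API. */
void submachine_of( size_t offset, size_t size, size_t p, size_t s,
                    size_t * sub_offset, size_t * sub_size ) {
    const size_t part = size / p; //assume size divides evenly by p
    *sub_offset = offset + s * part;
    *sub_size   = part;
}

int main( void ) {
    size_t off, len;
    //top level: 64 hardware threads, bsp_begin( 2 ), local thread 1
    submachine_of( 0, 64, 2, 1, &off, &len );
    printf( "level-1 thread 1: submachine [%zu,%zu), pins to %zu\n",
        off, off + len, off );
    //nested level: within that submachine, bsp_begin( 4 ), local thread 0
    submachine_of( off, len, 4, 0, &off, &len );
    printf( "level-2 thread 0: submachine [%zu,%zu), pins to %zu\n",
        off, off + len, off );
    return 0;
}

Running this reproduces the behaviour visible in the logs above: the second level-1 thread pins to hardware thread 32, and the first thread of its nested run pins to 32 as well, with its siblings at 40, 48 and 56.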

For different thread numberings (e.g., a wrapped numbering, as commonly used on machines with hyper-threading), the MulticoreBSP runtime computes the necessary transformations to automatically yield the pinnings described above. The same holds when reserved cores do occur. If evenly distributed submachines are not possible or not intended, the user can manually adapt the submachine settings at runtime, while already inside an SPMD section, using the interface exposed in mcbsp_affinity.h.
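As an aside on wrapped numberings: on a machine where hardware thread h resides on core (h mod C), with C the number of cores, a consecutive logical index can be translated back to a hardware thread number as sketched below. This is only an illustration of the kind of transformation involved, under the stated assumption about how the wrapped numbering is laid out; it is not the MulticoreBSP runtime's actual code, and the helper name is hypothetical.

#include <stdio.h>

/* Illustrative sketch only: map a consecutive logical thread index to a
   hardware thread number under a wrapped numbering, assuming hardware
   thread h resides on core (h % cores). Not part of the MulticoreBSP API. */
size_t wrapped_from_consecutive( size_t logical,
    size_t cores, size_t threads_per_core ) {
    const size_t core        = logical / threads_per_core;
    const size_t hyperthread = logical % threads_per_core;
    return hyperthread * cores + core;
}

int main( void ) {
    //example: 32 cores with 2 hardware threads each;
    //logical threads 0 and 1 then share core 0 (hardware threads 0 and 32)
    for( size_t i = 0; i < 4; ++i )
        printf( "logical %zu -> hardware thread %zu\n",
            i, wrapped_from_consecutive( i, 32, 2 ) );
    return 0;
}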

