I think the confusion here is there cannot be just two vectors in a (BSP) SPMD run; each single program has its own two local vectors. With p processors, there will thus be 2p separate vectors.

Suppose x and y are the *global* vectors, and x_s and y_s the *local* vectors at process s. Note again that x and y do not exist anywhere. Then x should be distributed so that the union of the p subvectors x_s should yield x; and similarly for y_s and y. In other words, we need a **partitioning** of the global vectors x and y. Again, this still happens on the design table, not on the implementation-level. We now first design the algorithm.

To compute an inner product we can first compute the local inner product, then communicate the partial results, and then combine those partial results. Following the BSP paradigm, we want no communication during the computation of the local inner product; hence, the distributions (the way of partitioning) of x and y must be equal. The communication step is an all-to-all communication of a single element, and the last computation step is indeed also completely local as it just accumulates all received elements into one.

The cost of the first computation step is given by the local lengths of x_s and y_s (which are equal, since the distribution is equal). For load-balance, setting this length to n/p for all local vectors, with n the size of the global x and y, is optimal. Note that the actual partitioning scheme thus actually does not matter; as long as x is distributed as y is, and as long as all local vectors are of equal size. This concludes the design. We now can go on with implementation.

Each SPMD section must **initialise** local vectors of size n/p. This means allocating room for x_s and y_s, and initialising the values of the elements therein according to the global distribution that was chosen. Again, there are no global vectors x and y; we only initialise local vectors, from scratch. We then implement the algorithm as described above, and we are done. (It is quite natural for BSP implementations to just be literal translations of a BSP design.)

If you had the global vectors x and y available, then some part of the code is not in SPMD style. You can still force MulticoreBSP to run with a block-distribution of those vectors, but then you have to override the SPMD paradigm by using (or abusing) the available shared-memory. In terms of performance you will then become suboptimal in terms of data-locality, which will lead to big issues on architectures where memory access is non-uniform (NUMA architectures).

]]>