Most current and emerging high-performance systems are built from large numbers of 'fat' shared-memory nodes, each supporting tens of threads. There are good reasons to adopt a hybrid MPI-OpenMP programming model for large-scale applications on such architectures, but it adds complexity to the parallel program and demands scalability at two levels: MPI across nodes and OpenMP within each node.
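As an illustration of the two-level model, the sketch below shows a minimal hybrid MPI-OpenMP program: MPI is initialised with MPI_Init_thread requesting MPI_THREAD_FUNNELED (only the master thread makes MPI calls), and an OpenMP parallel region provides the intra-node threading. This is a generic example, not code from any of the applications discussed here.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Request a threading level suitable for OpenMP regions between
           MPI calls; FUNNELED means only the master thread calls MPI. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* OpenMP parallelism within the node: typically one MPI rank per
           node (or per NUMA domain) and several threads per rank. */
        #pragma omp parallel
        {
            printf("rank %d of %d, thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }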

Other techniques for exploiting the memory locality available on shared-memory nodes include combining MPI with direct use of shared memory within the node.
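One common realisation of this idea, used here purely as an illustration since the text does not name a specific mechanism, is MPI-3 shared-memory windows: ranks on the same node are grouped with MPI_Comm_split_type and allocate a jointly addressable segment with MPI_Win_allocate_shared.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Split MPI_COMM_WORLD into communicators whose ranks share a node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Each rank contributes a slice to one shared segment per node,
           but can address the whole segment through ordinary loads/stores. */
        const MPI_Aint count = 1024;
        double *local;
        MPI_Win win;
        MPI_Win_allocate_shared(count * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node_comm, &local, &win);

        /* Query the base address of rank 0's slice so every rank on the
           node can read it directly. */
        MPI_Aint size;
        int disp_unit;
        double *base;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);

        MPI_Win_fence(0, win);
        if (node_rank == 0)
            base[0] = 42.0;             /* write through shared memory */
        MPI_Win_fence(0, win);          /* simple synchronisation       */
        printf("node rank %d sees %.1f\n", node_rank, base[0]);

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }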

In practice, the details of data layout, synchronisation and communication need to be considered carefully in order to extract the best performance from the hardware. Load imbalance and system noise can make parallel execution inefficient. Purely static scheduling cannot absorb system noise, and the load can be difficult to balance in all cases; purely dynamic scheduling, on the other hand, suffers from scheduling overhead and poor locality. Mixing static and dynamic scheduling has been shown to give performance benefits, as sketched below.
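The sketch below illustrates one way of mixing the two schemes in OpenMP: the bulk of the iteration space is scheduled statically for locality and low overhead, while a remaining 'slack' region is scheduled dynamically so that threads delayed by noise or imbalance can be helped by the others. The 80/20 split, chunk size and work routine are illustrative choices, not values taken from the work described here.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    /* Hypothetical work routine standing in for a real loop body. */
    static double work(int i) { return (double)i * 0.5; }

    int main(void)
    {
        static double result[N];

        /* Fraction of the iteration space assigned statically; the rest is
           a dynamically scheduled slack region that absorbs imbalance. */
        const int n_static = (int)(0.8 * N);

        #pragma omp parallel
        {
            /* Static part: contiguous blocks, good locality, no scheduling
               overhead; nowait lets early finishers move on immediately. */
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < n_static; ++i)
                result[i] = work(i);

            /* Dynamic part: threads that finish early pick up leftover
               iterations in small chunks. */
            #pragma omp for schedule(dynamic, 64)
            for (int i = n_static; i < N; ++i)
                result[i] = work(i);
        }

        printf("result[0] = %f, result[N-1] = %f\n", result[0], result[N - 1]);
        return 0;
    }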

We have carried out specific work in support of the Code_Saturne finite-volume CFD code, using a blocked sparse matrix-vector product parallel algorithm to improve OpenMP scalability. Results are presented in
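For illustration only, the following is a generic sketch of a row-blocked CSR sparse matrix-vector product parallelised with OpenMP; it is not Code_Saturne's implementation, and the data structure and function names are hypothetical.

    #include <omp.h>

    /* Generic CSR sparse matrix; field names are illustrative and unrelated
       to Code_Saturne's internal data structures. */
    typedef struct {
        int           n_rows;
        const int    *row_ptr;   /* size n_rows + 1 */
        const int    *col_idx;   /* size nnz        */
        const double *val;       /* size nnz        */
    } csr_matrix;

    /* Row-blocked sparse matrix-vector product y = A*x: rows are split into
       fixed-size blocks distributed over threads, so each thread traverses
       contiguous rows of A and writes a contiguous slice of y. */
    void spmv_blocked(const csr_matrix *A, const double *x, double *y,
                      int block_size)
    {
        const int n_blocks = (A->n_rows + block_size - 1) / block_size;

        #pragma omp parallel for schedule(static)
        for (int b = 0; b < n_blocks; ++b) {
            const int row_start = b * block_size;
            const int row_end   = row_start + block_size < A->n_rows
                                ? row_start + block_size : A->n_rows;
            for (int i = row_start; i < row_end; ++i) {
                double sum = 0.0;
                for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
                    sum += A->val[k] * x[A->col_idx[k]];
                y[i] = sum;
            }
        }
    }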

We have also published a technical report describing the performance of four applications (Fluidity-ICOM, NEMO, PRMAT and a 3D Red-Black Smoother) using the hybrid MPI-OpenMP programming model.