Other techniques for exploiting the memory locality available on shared-memory nodes include combining MPI with direct use of shared memory within a node.
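As a hedged illustration of one such technique, the sketch below uses MPI-3 shared-memory windows (`MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` and `MPI_Win_allocate_shared`) so that ranks on the same node read and write a single allocation directly instead of exchanging messages. The array size and the stored value are arbitrary illustrative choices, and error handling is omitted.

```c
/* Minimal sketch of MPI-3 shared-memory windows: ranks on the same
 * node share one allocation instead of exchanging messages.
 * Illustrative only; error handling omitted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that can physically share memory (one node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 on the node allocates; the others attach with size 0. */
    const int n = 1024;                       /* illustrative size */
    MPI_Aint size = (node_rank == 0) ? n * sizeof(double) : 0;
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                            node_comm, &base, &win);

    /* Every rank queries a direct pointer to rank 0's segment. */
    MPI_Aint qsize;
    int disp_unit;
    double *shared;
    MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &shared);

    /* Loads and stores to 'shared' touch the same node-local memory. */
    MPI_Win_fence(0, win);
    if (node_rank == 0)
        shared[0] = 42.0;                     /* illustrative value */
    MPI_Win_fence(0, win);
    printf("node rank %d sees %.1f\n", node_rank, shared[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```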
In practice, the details of the data layout, synchronisation and communication need to be considered carefully in order to extract the best performance from the hardware. Load imbalance and system noise can make parallel execution inefficient: purely static scheduling cannot absorb system noise, and the load can be difficult to balance in all cases, while purely dynamic scheduling suffers from scheduling overhead and poor data locality. Mixing static and dynamic scheduling has been shown to give performance benefits; a sketch of the idea is given below.
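The following is a minimal sketch of the mixed approach with OpenMP, assuming a simple loop over array elements: most iterations are scheduled statically to preserve locality and avoid scheduling overhead, while a tail fraction is scheduled dynamically so that threads finishing their static share early can absorb any remaining imbalance. The 80/20 split and the chunk size of 32 are illustrative, not tuned values.

```c
/* Sketch of mixed static/dynamic loop scheduling with OpenMP:
 * most iterations are scheduled statically (good locality, no
 * scheduling overhead); a tail fraction is scheduled dynamically
 * to absorb load imbalance and system noise. */
void mixed_schedule(double *restrict y, const double *restrict x, int n)
{
    const int n_static = (int)(0.8 * n);   /* illustrative split */

    #pragma omp parallel
    {
        /* Phase 1: contiguous static blocks, one per thread.
         * 'nowait' lets fast threads move straight to phase 2. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n_static; i++)
            y[i] = 2.0 * x[i];

        /* Phase 2: dynamic chunks soak up the remaining imbalance. */
        #pragma omp for schedule(dynamic, 32)
        for (int i = n_static; i < n; i++)
            y[i] = 2.0 * x[i];
    }
}
```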
We have carried out specific work in support of the Code_Saturne finite-volume CFD code, using a blocked sparse matrix-vector product algorithm to improve its OpenMP scalability; a sketch of the general approach is given after the reference below. Results are presented in
- Parallel Sparse Matrix Vector Product with OpenMP for SMPs in Code Saturne, V. Szeremi, L. Anton, C. Evangelinos, C. Moulinec and Y. Fournier, Proceedings of the Fourth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, Dubrovnik, Croatia, March 2015
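To make the general approach concrete, here is a hedged sketch of a blocked sparse matrix-vector product over block-CSR storage, parallelised over block rows with OpenMP. This is not the Code_Saturne implementation; the 3x3 block size, the `bsr_matrix` layout and all names are illustrative assumptions. Partitioning over block rows gives each thread exclusive ownership of its rows of `y`, so no synchronisation is needed on the output.

```c
/* Sketch of a blocked sparse matrix-vector product (block-CSR
 * storage, 3x3 blocks) parallelised over block rows with OpenMP. */
#include <stddef.h>

#define B 3  /* block size (illustrative) */

typedef struct {
    int           n_brows;   /* number of block rows                */
    const int    *row_ptr;   /* start of each block row in col_idx  */
    const int    *col_idx;   /* block-column index of each block    */
    const double *val;       /* blocks, each B*B values, row-major  */
} bsr_matrix;

void bsr_spmv(const bsr_matrix *a, const double *restrict x,
              double *restrict y)
{
    /* One block row per iteration: each row of y is written by
     * exactly one thread, so no synchronisation is needed on y. */
    #pragma omp parallel for schedule(static)
    for (int br = 0; br < a->n_brows; br++) {
        double acc[B] = {0.0};
        for (int k = a->row_ptr[br]; k < a->row_ptr[br + 1]; k++) {
            const double *blk = &a->val[(size_t)k * B * B];
            const double *xb  = &x[a->col_idx[k] * B];
            for (int i = 0; i < B; i++)
                for (int j = 0; j < B; j++)
                    acc[i] += blk[i * B + j] * xb[j];
        }
        for (int i = 0; i < B; i++)
            y[br * B + i] = acc[i];
    }
}
```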
We have also published a technical report describing the performance of four applications (Fluidity-ICOM, NEMO, PRMAT and a 3D Red-Black Smoother) using the hybrid MPI/OpenMP programming model; a minimal skeleton of this model is sketched after the reference below.
- Exploiting Multi-core Processors for Scientific Applications Using Hybrid MPI/OpenMP, L. Anton, M. Ashworth, X. Guo, S.M. Pickles, A.R. Porter and A.G. Sunderland, DL Technical Report DL-TR-2015-002, 2015
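As a hedged sketch of the hybrid model, assuming the common configuration in which only the master thread of each rank performs communication (hence `MPI_THREAD_FUNNELED`), the skeleton below alternates a threaded OpenMP compute phase with an MPI communication phase. The loop body is a placeholder kernel, not taken from any of the applications above.

```c
/* Minimal skeleton of the hybrid MPI/OpenMP model: one MPI rank per
 * node (or NUMA domain), OpenMP threads inside each rank. Only the
 * master thread communicates, so MPI_THREAD_FUNNELED is sufficient. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;

    /* Threaded compute phase: no MPI calls inside the region. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);   /* placeholder kernel */

    /* Communication phase: funnelled through the master thread. */
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```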