

# Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda

Department of Computer Science and Engineering
The Ohio State University



# Adaptive and Dynamic Design for MPI Tag Matching

M. Bayatpour, H. Subramoni, S. Chakraborty and D. K. Panda

Department of Computer Science and Engineering
The Ohio State University

### **Current Trends in HPC**



#### Supercomputing systems scaling rapidly

- Multi- and Many-core architectures
- High-performance Interconnects



#### InfiniBand and Omni-Path are popular HPC Interconnects

- Low-latency and High-bandwidth
- 192 systems (39%) in Jun'17 Top500 use IB



### MPI used by vast majority of HPC applications

- Helping applications scale to thousands of cores
- Large systems exposing new scalability issues

# **Components of an MPI Library**



## **MPI Tag Matching 101**

- On the receiver side, each incoming message must be matched against the receives posted by the receiver
- Three parameters must match
  - Context id, Source Rank, Tag
  - Wildcards (MPI\_ANY\_SOURCE, MPI\_ANY\_TAG) introduce additional complexity
- Two kinds of queues are involved on the receiver side
  - Posted queue
  - Unexpected queue
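
The matching rule and the two queues can be sketched in a few lines of Python. This is a minimal illustration, not code from any MPI library: `Envelope`, `Matcher`, and the `ANY_SOURCE`/`ANY_TAG` stand-ins are hypothetical names, and the queues are plain in-order lists searched linearly.

```python
from collections import deque
from dataclasses import dataclass

ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE
ANY_TAG = -1     # hypothetical stand-in for MPI_ANY_TAG


@dataclass(frozen=True)
class Envelope:
    context_id: int
    source: int
    tag: int


def matches(posted: Envelope, incoming: Envelope) -> bool:
    """All three fields must agree; the posted receive may use wildcards."""
    return (
        posted.context_id == incoming.context_id
        and posted.source in (ANY_SOURCE, incoming.source)
        and posted.tag in (ANY_TAG, incoming.tag)
    )


class Matcher:
    def __init__(self):
        self.posted = deque()      # receives still waiting for a message
        self.unexpected = deque()  # messages that arrived before their receive

    def post_receive(self, recv: Envelope):
        """Search the unexpected queue in order; enqueue the receive on a miss."""
        for msg in self.unexpected:
            if matches(recv, msg):
                self.unexpected.remove(msg)
                return msg
        self.posted.append(recv)
        return None

    def message_arrived(self, msg: Envelope):
        """Search the posted queue in order; enqueue the message on a miss."""
        for recv in self.posted:
            if matches(recv, msg):
                self.posted.remove(recv)
                return recv
        self.unexpected.append(msg)
        return None
```

A message that arrives before its receive waits in the unexpected queue; a receive posted before its message waits in the posted queue. Every miss costs a full traversal of the queue being searched.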

## Search Time Analysis of the Default Doubly Linked List Design

- Most MPI libraries use doubly linked lists for the unexpected and posted queues
- The message to be removed could be at any position in the queue
  - Removal time is O(1) in the best case and linear, O(N), in the average case
- Tag matching is on the critical path of point-to-point operations
- The number of processes in a job is increasing
  - Future extreme-scale systems are expected to have millions of cores\*
  - Multithreaded programming models
- All of these can push the search functions deeper into the lists
  - This imposes significant overhead on performance

<sup>\*</sup> Thakur R, Balaji P, Buntinas D, Goodell D, Gropp W, Hoefler T, Kumar S, Lusk E, Träff JL. MPI at Exascale. Proceedings of SciDAC. 2010 Jul;2:14-35.
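
The linear average case can be made concrete under one simplifying assumption that is not stated on the slide: the matching entry is equally likely to sit at any of the N positions of the queue. The expected number of entries inspected per search is then:

```latex
\mathbb{E}[\text{attempts}]
  \;=\; \frac{1}{N}\sum_{k=1}^{N} k
  \;=\; \frac{N+1}{2}
  \;=\; O(N)
```

With N = 1000 queued entries this is about 500 comparisons per match, while the best case (a match at the head of the list) needs only one.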

## **Proposed Adaptive Design**

- Combines the bin-based scheme with the default simple doubly linked list scheme
- Three phases
  - Starts with the default design
  - Observes the communication pattern of each process at runtime
  - If all the conditions hold, it begins converting the default scheme to the bin-based scheme
- Each process can have its own scheme
  - Some may stay with the default scheme, while others may need to convert to the bin-based scheme

## Proposed Adaptive Design (Cont'd)

- For each of the posted and unexpected queues, we consider the following thresholds
  - Number of calls to the tag matching functions in the library (CALLS\_NUM)
  - The average number of queue look-up attempts per CALLS\_NUM (MATCH\_ATTMPS)
- Each process maintains both at runtime
- If both thresholds are crossed
  - The adaptive design changes from the doubly linked list scheme to the bin-based scheme
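
The threshold check can be sketched as follows. This is an illustrative model under stated assumptions, not the library's implementation: the class name `AdaptiveQueue`, the default threshold values, the running-average bookkeeping, and binning by source rank are all hypothetical, and wildcard receives (which a real bin-based queue must handle separately) are left out.

```python
class AdaptiveQueue:
    """Starts as one linear list; converts once to bins when both thresholds are crossed."""

    def __init__(self, num_bins, calls_num=100, match_attmps=8.0):
        if num_bins < 2:
            raise ValueError("the bin-based scheme needs at least two bins")
        self.num_bins = num_bins
        self.calls_num = calls_num        # hypothetical CALLS_NUM threshold
        self.match_attmps = match_attmps  # hypothetical MATCH_ATTMPS threshold
        self.bins = [[]]                  # a single bin models the default list
        self.calls = 0                    # searches performed so far
        self.attempts = 0                 # entries inspected so far

    @property
    def bin_based(self):
        return len(self.bins) > 1

    def _bin(self, source):
        return self.bins[source % len(self.bins)]

    def enqueue(self, source, tag):
        self._bin(source).append((source, tag))

    def dequeue(self, source, tag):
        """Remove and return the first entry equal to (source, tag), or None."""
        self.calls += 1
        found = None
        bucket = self._bin(source)
        for i, entry in enumerate(bucket):
            self.attempts += 1
            if entry == (source, tag):
                found = bucket.pop(i)
                break
        self._maybe_convert()
        return found

    def _maybe_convert(self):
        # One-way conversion: never convert twice, never convert back.
        if self.bin_based or self.calls < self.calls_num:
            return
        if self.attempts / self.calls < self.match_attmps:
            return
        old = self.bins[0]
        self.bins = [[] for _ in range(self.num_bins)]
        for source, tag in old:  # arrival order is preserved inside each bin
            self.enqueue(source, tag)
```

A process whose searches stay shallow never pays for the bins, while a process with deep searches converts once and afterwards scans only the entries that share its bin.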

## **Proposed Adaptive Design (Cont'd)**

- Currently, conversion is one-way, from the default to the bin-based scheme, and may occur only once during the entire runtime
- The thresholds are fixed for the entire runtime and are configurable
  - We have tuned them based on empirical analysis using the OSU Micro-Benchmarks
- We consider two possible sizes for NUM\_BINS
  - ¼ JOB\_SIZE and ½ JOB\_SIZE
  - Based on MATCH\_ATTMPS, we decide which one to choose
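
The slide does not say how MATCH\_ATTMPS maps to the two sizes. One plausible reading, shown here purely as an assumption, is that deeper average searches justify the larger bin count. The function name and the `attempts_cutoff` parameter are hypothetical.

```python
def choose_num_bins(job_size, avg_match_attempts, attempts_cutoff=32.0):
    """Pick 1/2 JOB_SIZE bins for deep searches, 1/4 JOB_SIZE otherwise.

    The direction of the choice and the cutoff value are assumptions.
    """
    divisor = 2 if avg_match_attempts >= attempts_cutoff else 4
    return max(1, job_size // divisor)
```

For a 512-process job this yields 256 bins when searches are deep and 128 bins otherwise.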

## **Summary of Tag Matching Performance**



(b) Total Tag Matching Time, Normalized to Default (Lower is Better)

- Comparison of different designs/benchmarks at 512 processes on RI
- Adaptive design shows the best performance

# **Summary of Memory Consumed for Tag Matching**



- Comparison of different designs/benchmarks at 512 processes on RI with default design
- Adaptive design shows minimal memory overhead



# Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda

Department of Computer Science and Engineering
The Ohio State University
Presented at Supercomputing 2017

## **MPI Reduction Collectives 101**

- Convenient abstraction to implement group communication operations
- Widely used across various scientific domains
  - Owing to their ease of use and performance portability
- One of the most popular collective operations: MPI\_Allreduce
  - 37% of communication time
- MPI\_Allreduce reduces values from all processes and distributes the result back to all processes
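
The semantics of the operation can be shown without any MPI runtime. The sketch below only simulates the result each rank ends up with; `allreduce` is an illustrative helper, not an MPI binding.

```python
from functools import reduce
import operator


def allreduce(per_rank_values, op=operator.add):
    """Reduce element-wise across ranks; every rank receives the same result."""
    reduced = [reduce(op, column) for column in zip(*per_rank_values)]
    return [list(reduced) for _ in per_rank_values]
```

With three ranks holding `[1, 2]`, `[3, 4]`, and `[5, 6]`, a sum reduction leaves `[9, 12]` on every rank.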

## **Existing Designs for MPI\_Allreduce**

- Hierarchical strategy
  - Intra-node reduction by root + inter-node Allreduce
    - Inter-node Allreduce performed by the root process of each node
- Recursive doubling
  - High parallelism for computation
    - All the processes are involved in computation
  - Pair distance doubles after each step
  - Log(P) steps
- In-network aggregation (SHArP)\*

<sup>\*</sup> Bloch et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
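
The "pair distance doubles after each step" and "Log(P) steps" properties can be checked with a small simulation. This is a sketch of the textbook recursive doubling exchange for a sum reduction over scalar values; the function name is illustrative and the power-of-two restriction is a simplification made here.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce over len(values) ranks; returns (results, steps)."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    current = list(values)
    steps = 0
    distance = 1
    while distance < p:
        # Rank r exchanges with rank r XOR distance, and both keep the sum.
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2  # the partner distance doubles after each step
        steps += 1
    return current, steps
```

With 8 ranks the loop runs 3 times (log2 of 8) and every rank holds the full sum, which is why all processes take part in the computation.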

## **Relative Throughput of Different Architectures**

- Using the OSU Micro-Benchmarks suite\*
- "Multiple Bandwidth Test"
  - Back-to-back messages
    - Sent to the paired process before waiting for a reply from the receiver
- Evaluates the aggregate unidirectional bandwidth between multiple pairs of processes
- Platforms: 1) Xeon + IB, 2) Xeon + Omni-Path, and 3) KNL + Omni-Path

<sup>\*</sup> http://mvapich.cse.ohio-state.edu/benchmarks/
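
"Relative throughput" is read here as the multi-pair aggregate bandwidth normalized by the one-pair bandwidth at the same message size. The symbols below are introduced for illustration and do not appear on the slides:

```latex
R_N(m) \;=\; \frac{B_N(m)}{B_1(m)}
```

where B_N(m) is the aggregate unidirectional bandwidth measured with N communicating pairs at message size m. A value of R_N(m) close to N means the pairs communicate concurrently without slowing each other down; a value close to 1 means a single pair already saturates the path.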

# Communication Characteristics of Modern Architectures: Intra-node Communication

**Shared Memory (KNL)** 



Multiple pair test vs. one pair test

- The relative throughput is very close to the number of pairs
- Supports many concurrent intra-node communications

# Communication Characteristics of Modern Architectures: InfiniBand Interconnect

**Xeon (Haswell) + IB (EDR - 100Gbps)**

Figure: relative throughput (higher is better) versus message size in bytes for the 2-, 4-, 8-, and 16-pair tests

Multiple pair test vs. one pair test

- The relative throughput is close to the number of communicating processes per node
- Supports many concurrent inter-node communications

# Communication Characteristics of Modern Architectures: Omni-Path Interconnect





Multiple pair test vs. one pair test

- The relative throughput stays close to one for large messages
- Supports many concurrent communications in the small and medium message range
- Similar behavior is observed for Xeon + Omni-Path

# Performance limitations of Existing Designs for MPI\_Allreduce

- Does not take advantage of the large number of cores and the high concurrency in communication
- Does not take advantage of shared memory collectives
  - Needs kernel support for zero-copy communication for large messages on the same node
- Too many inter-node communications for large PPNs
- Limited performance due to extra QPI transfers
- Limited computing power of switches limits performance in the medium and large message ranges

## **Design Outline**
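
The outline figure is not reproduced in this text. As a rough illustration of what a data partitioning-based multi-leader reduction could look like, the following Python sketch simulates a sum reduction in which each node has several leaders and each leader owns one contiguous slice of the vector. The function name, the three phases, and the contiguous partitioning are assumptions made for this sketch, not details taken from the slides.

```python
def multi_leader_allreduce(nodes, leaders_per_node):
    """nodes[n][p] is the vector held by process p on node n (sum reduction)."""
    length = len(nodes[0][0])
    # Split the vector indices into one contiguous partition per leader.
    bounds = [length * k // leaders_per_node for k in range(leaders_per_node + 1)]
    result = [0] * length
    for k in range(leaders_per_node):
        lo, hi = bounds[k], bounds[k + 1]
        # Phase 1: leader k on each node reduces its partition over local processes.
        node_partials = [
            [sum(proc[i] for proc in node) for i in range(lo, hi)]
            for node in nodes
        ]
        # Phase 2: leaders with the same index combine their partials across nodes.
        for j, i in enumerate(range(lo, hi)):
            result[i] = sum(partial[j] for partial in node_partials)
    # Phase 3: every process on every node receives the full reduced vector.
    return [[list(result) for _ in node] for node in nodes]
```

In this model the per-leader loop iterations are independent of each other, which is where concurrency across cores and network links would come from in a real implementation.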



## Performance of MPI\_Allreduce On Omni-Path



- DPML outperforms MVAPICH2 across the entire medium and large message range
- DPML outperforms Intel MPI (IMPI) in the medium message range
- The high parallelism of DPML benefits KNL more than Xeon

<sup>\*</sup>Processes Per Node

## Performance of MPI\_Allreduce On InfiniBand



- DPML outperforms MVAPICH2 for most of the medium and large message range
  - At 512K bytes, DPML shows a 3X improvement
- The benefits of DPML grow as the message size increases

## **Performance Benefits for MiniAMR Application**



- For the MiniAMR application with 4096 processes, DPML can reduce the latency by 2.4X on the KNL + Omni-Path cluster
- On Xeon + Omni-Path, with 1792 processes, DPML can reduce the latency by 1.5X



# SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives

M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and D. K. Panda

{bayatpour.1, hashmi.29, chakraborty.52, subramoni.1, kousha.2, panda.2}@osu.edu

Department of Computer Science and Engineering
The Ohio State University
Presented at IEEE Cluster 2018

## Deep Learning (DL) Frameworks and Trends

- Renewed interest in DL
  - Deep Neural Networks (DNNs)
- Tensorflow, CNTK and many more
- Excellent accuracy for deep/convolutional neural networks
- Diverse applications: image recognition, cancer detection, self-driving cars, speech processing, etc.



https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/

### **MPI Allreduce Collective**

MPI\_Allreduce – Walkthrough Example



# Performance limitations of Existing Designs for MPI\_Allreduce

1. Load-balancing the computation and network resources
2. Overlap of communication and computation
3. Avoiding data copies and data staging
4. Avoiding the unnecessary synchronization overheads
5. Heuristic-based adaptive design

Features used by each design (column numbers refer to the list above; ✓ = used, × = not used):

| State-of-the-art Allreduce Designs         | 1 | 2 | 3 | 4 | 5 |
|--------------------------------------------|---|---|---|---|---|
| Baidu-Allreduce [a]                        | ✓ | ✓ | × | × | × |
| Linear Pipelining [b]                      | ✓ | ✓ | × | × | × |
| Reduce-scatter followed by Allgather [c,d] | ✓ | × | × | × | × |
| Segmented Ring [e]                         | ✓ | ✓ | × | × | × |
| XPMEM-based Reduction [f]                  | × | × | ✓ | × | × |
| Proposed "SALaR"                           | ✓ | ✓ | ✓ | ✓ | ✓ |

#### **Research Contribution**

- Designing high-performance Allreduce
  - Pipelined design for efficient overlap of computation and communication
  - Truly zero-copy intra-node reduction based on process shared address space
  - One-sided inter-node communication to reduce synchronizations
  - Efficient load-balanced inter-node communication
  - Heuristic based adaptive design
- Modeling the proposed design
- Improved the AlexNet training time on CNTK by up to 46%
- Reduced the latency of osu\_allreduce by up to 5X at scale

## **Outline**

- Introduction
- Motivation
- Contributions
- Proposed Designs
  - Design Optimizations
  - Modeling
- Experimental Results
- Conclusions & Future Work

# **Summary of Proposed SALaR Designs**

#### SALaR-XPMEM

- Efficient pipeline of inter-node Allreduce with intra-node Reduce
- Uses XPMEM as the intra-node zero-copy mechanism

#### SALaR-SHMEM

- When the XPMEM module is not available, shared memory is used as the intra-node mechanism



## **Impact of Chunk Size on Allreduce Performance**

8MB is optimal among



Latency of MPI\_Allreduce on 224 processes and 28 processes per node on Cluster A

- Selecting the proper chunk size can have a big impact on performance
- A different chunk size is optimal for each message range
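
Why the chunk size matters can be illustrated with a generic pipeline cost model. This is a textbook latency-bandwidth sketch, not the model from the SALaR paper: `alpha` (fixed per-chunk cost), `beta` (cost per byte), and `stages` (pipeline depth) are assumed parameters, and the last chunk is charged as a full chunk for simplicity.

```python
import math


def pipelined_time(message_bytes, chunk_bytes, stages, alpha, beta):
    """Time for a pipeline of depth `stages` with per-chunk cost alpha + bytes * beta."""
    n_chunks = math.ceil(message_bytes / chunk_bytes)
    per_chunk = alpha + chunk_bytes * beta
    # The first chunk needs `stages` slots; each later chunk adds one more slot.
    return (n_chunks + stages - 1) * per_chunk


def best_chunk(message_bytes, candidates, stages, alpha, beta):
    """Return the candidate chunk size with the lowest modeled time."""
    return min(
        candidates,
        key=lambda c: pipelined_time(message_bytes, c, stages, alpha, beta),
    )
```

Very small chunks pay the fixed cost `alpha` many times, while very large chunks leave the pipeline mostly empty, so the modeled optimum sits in between and shifts with the message size.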

# Impact of Heuristic based Design on Allreduce Performance

- The adaptive design performs close to the static version and, in some cases, even better
- Effectively removes the hassle of static tuning



SALaR-SHMEM design on 896 processes on Cluster A



SALaR-XPMEM design on 896 processes on Cluster A


# **Experimental Setup**

| Hardware: Cluster A (RI2) | Hardware: Cluster B (Comet) | Software: MPI Benchmark | Software: DL Frameworks |
|---------------------------|-----------------------------|-------------------------|-------------------------|
| 40 dual-socket Intel Xeon series CPUs, 14-core Broadwell processors at 2.40 GHz | 1944 Dell PowerEdge C6320 two-socket servers with 12-core Intel Xeon processors at 2.50 GHz | OSU Microbenchmarks v5.4.1 | Microsoft Computational Network Toolkit (CNTK) v2.3.1 |
| Mellanox MT4115 EDR ConnectX-4 HCAs | Mellanox MT4099 FDR ConnectX-3 HCAs | | Horovod: Uber implementation of Tensorflow v0.12.1 |

# Performance Comparison of MPI\_Allreduce

- Using the osu\_allreduce benchmark from the OSU Microbenchmarks on Cluster A with 28 processes per node
- SALaR outperforms Open MPI and MVAPICH2 by up to 2X and 4X, respectively
- In the latest release of MVAPICH2, we have incorporated some ideas similar to SALaR and enhanced the performance







756 Processes (Latest Numbers)

## Performance Comparison of MPI\_Allreduce (cont'd)

- Using osu\_allreduce benchmark from OSU Microbenchmarks on Cluster B with 24 processes per node
- SALaR outperforms Open MPI v3.1.2 and MVAPICH2 v2.3rc2 by up to 40% and 5X, respectively







1536 Processes on Cluster B

## **Impact of SALaR Designs on CNTK**

- CPU-based training of the AlexNet neural network on the ILSVRC2012 dataset from ImageNet
- SALaR designs perform up to 46% better than the MVAPICH2 library at 896 processes
- As the scale increases, the benefits of the proposed designs also increase



CNTK Samples per Second on Cluster A (higher is better)

## Impact of SALaR Designs on TensorFlow

- CPU-based tf\_cnn\_benchmarks for distributed tests from TensorFlow Benchmarks (TF)
  - Training the AlexNet neural network on synthetic datasets
- 15% and 35% improvements in the number of images per second for 448- and 896-process jobs, respectively
- As the job size increases, the benefits of SALaR over MVAPICH2 keep increasing



TensorFlow Images per Second (higher is better)

#### **Conclusions & Future Work**

- Designed multi-leader based collective operations
  - Capable of taking advantage of high-end features offered by modern network interconnects
- Modeled and analyzed proposed design theoretically
- The benefits were evaluated on different architectures
- The DPML design is released as a part of MVAPICH2-X 2.3b! Check out:
  - http://mvapich.cse.ohio-state.edu/overview/#mv2X
- Studied the interplay between communication pattern of applications and different tag matching schemes
- Proposed, designed, and implemented a dynamic and adaptive tag matching scheme capable of adapting to the communication characteristics of applications
- The adaptive approach opens up a new direction to design tag matching schemes for next-generation exascale systems

## **Conclusion and Future Work (cont'd)**

- Proposed scalable and adaptive Allreduce design
  - Capable of taking advantage of high-end features offered by modern network interconnects and increased parallelism of Multi-/Many-core architectures
- Modeled and analyzed proposed design theoretically
- The benefits were evaluated on different architectures and Deep Learning frameworks
- Improved the AlexNet training time on CNTK by up to 46%
- Reduced the latency of osu\_allreduce by up to 5X at scale
- In the future:
  - Exploring the SALaR for other collective operations
- The SALaR design will be released as a part of MVAPICH2! Check out:
  - http://mvapich.cse.ohio-state.edu/

### References

- [a] Baidu Allreduce Design: https://github.com/baidu-research/baidu-allreduce
- [b] Efficient communications in training large scale neural networks, Zhao et al., Thematic Workshops, ACM MM 2017
- [c] MVAPICH2 2.3rc2
- [d] Bandwidth optimal all-reduce algorithms for clusters of workstations, Patarasuk et al., Journal of Parallel and Distributed Computing '09
- [e] Open MPI 1.8.5 and later
- [f] Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, Hashmi et al., IPDPS '17

**Thank you! Questions?**