



# Designing HPC, Big Data and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities

Talk at HPC Connection Workshop (SC '16)

by

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: panda@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

#### **Drivers of Modern HPC Cluster Architectures**



**Multi-core Processors** 





Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip



SSD, NVMe-SSD, NVRAM

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)









### **Three Major Computing Categories**

- Scientific Computing
  - Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant
     Programming Model
  - Many discussions towards Partitioned Global Address Space (PGAS)
    - UPC, OpenSHMEM, CAF, UPC++ etc.
  - Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
- Deep Learning
  - Caffe, CNTK, TensorFlow, and many more
- Big Data/Enterprise/Commercial Computing
  - Focuses on large data and data analysis
  - Spark and Hadoop (HDFS, HBase, MapReduce)
  - Memcached is also used for Web 2.0

# Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges

#### **Application Kernels/Applications**

#### **Middleware**

#### **Programming Models**

MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.

#### **Communication Library or Runtime for Programming Models**

Point-to-point Communication

Collective Communication

Energy-Awareness Synchronization and Locks

I/O and File Systems

Fault Tolerance

**Networking Technologies** 

(InfiniBand, 40/100GigE, Aries, and OmniPath)

Multi/Many-core Architectures Accelerators (NVIDIA and MIC)

Co-Design
Opportunities
and
Challenges
across Various
Layers

Performance Scalability

Fault-Resilience

# Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable Collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated Support for GPGPUs and Accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, ...)
- Virtualization
- Energy-Awareness

#### **Additional Challenges for Designing Exascale Software Libraries**

#### Extreme Low Memory Footprint

Memory per core continues to decrease

#### D-L-A Framework

- Discover
  - Overall network topology (fat-tree, 3D, ...), Network topology for processes for a given job
  - Node architecture, Health of network and node
- Learn
  - Impact on performance and scalability
  - Potential for failure
- Adapt
  - Internal protocols and algorithms
  - Process mapping
  - Fault-tolerance solutions
- Low overhead techniques while delivering performance, scalability and fault-tolerance

#### **Overview of the MVAPICH2 Project**

- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
  - MVAPICH2-X (MPI + PGAS), Available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  - Support for Virtualization (MVAPICH2-Virt), Available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  - Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  - Used by more than 2,690 organizations in 83 countries
  - More than 402,000 (> 0.4 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '16 ranking)
    - 1st ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    - 13<sup>th</sup> ranked 241,108-core cluster (Pleiades) at NASA
    - 17<sup>th</sup> ranked 519,640-core cluster (Stampede) at TACC
    - 40<sup>th</sup> ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
  - Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3<sup>rd</sup> in Nov 2003, 2,200 processors, 12.25 TFlops) ->
     Sunway TaihuLight at NSC, Wuxi, China (1<sup>st</sup> in Nov'16, 10,649,640 cores, 93 PFlops)



# **MVAPICH/MVAPICH2** Release Timeline and Downloads



# **Architecture of MVAPICH2 Software Family**

High Performance Parallel Programming Models

Message Passing Interface
(MPI)

PGAS
(UPC, OpenSHMEM, CAF, UPC++)

Hybrid --- MPI + X
(MPI + PGAS + OpenMP/Cilk)



<sup>\*</sup> Upcoming

# **MVAPICH2 Software Family**

| High-Performance Parallel Programming Libraries |                                                                                                                                            |
|-------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| MVAPICH2                                        | Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE                                                                                |
| MVAPICH2-X                                      | Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime |
| MVAPICH2-GDR                                    | Optimized MPI for clusters with NVIDIA GPUs                                                                                                |
| MVAPICH2-Virt                                   | High-performance and scalable MPI for hypervisor and container based HPC cloud                                                             |
| MVAPICH2-EA                                     | Energy aware and High-performance MPI                                                                                                      |
| MVAPICH2-MIC                                    | Optimized MPI for clusters with Intel KNC                                                                                                  |
| Microbenchmarks                                 |                                                                                                                                            |
| ОМВ                                             | Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs                                     |
| Tools                                           |                                                                                                                                            |
| OSU INAM                                        | Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration                                                |
| OEMT                                            | Utility to measure the energy consumption of MPI applications                                                                              |

# Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication
  - Dynamic and Adaptive Tag Matching
- Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
- Integrated Support for GPGPUs
- Accelerating Graph Processing (Mizan) with MPI-3 RMA
- Optimized MVAPICH2 for KNL and Omni-Path

### One-way Latency: MPI over IB with MVAPICH2





Message Size (bytes)

Message Size (bytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

#### **Bandwidth: MPI over IB with MVAPICH2**





Message Size (hytes)

Message Size (hytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 IB switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

#### **MVAPICH2** Two-Sided Intra-Node Performance

(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))







#### **On-Demand Paging (ODP)**

- Applications no longer need to pin down underlying physical pages
- Memory Region (MR) are **NEVER** pinned by the OS
  - Paged in by the HCA when needed
  - Paged out by the OS when reclaimed
- ODP can be divided into two classes
  - **Explicit ODP** 
    - Applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin-down the pages
  - Implicit ODP
    - Applications are provided with a special memory key that represents their complete address space, does not need to register any virtual address range
- Advantages
  - Simplifies programming
  - Unlimited MR sizes
  - Physical memory optimization

Pin-down

1000



**Applications (64 Processes)** 

- M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, "Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits", SC 2016.
- Wednesday 11/16/2016 @ 11:00 11:30 AM in Room 355-D

# **Dynamic and Adaptive Tag Matching**

Challenge

Tag matching is a significant overhead for receivers

**Existing Solutions are** 

- Static and do not adapt dynamically to communication pattern
- Do not consider memory overhead

Solution

A new tag matching design

- Dynamically adapt to communication patterns
- Use different strategies for different ranks
- Decisions are based on the number of request object that must be traversed before hitting on the required one

Results

Better performance than other state-of-the art tagmatching schemes

Minimum memory consumption

Will be available in future MVAPICH2 releases



Normalized Total Tag Matching Time at 512 Processes Normalized to Default (Lower is Better)



Normalized Memory Overhead per Process at 512 Processes
Compared to Default (Lower is Better)

# Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
- Integrated Support for GPGPUs
- Accelerating Graph Processing (Mizan) with MPI-3 RMA
- Optimized MVAPICH2 for KNL and Omni-Path

# **MVAPICH2-X** for Hybrid MPI + PGAS Applications



- Current Model Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  - Possible deadlock if both runtimes are not progressed
  - Consumes more network resource
- Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  - Available with since 2012 (starting with MVAPICH2-X 1.9)
  - http://mvapich.cse.ohio-state.edu

### **UPC++ Support in MVAPICH2-X**



- Full and native support for hybrid MPI + UPC++ applications
- Better performance compared to IBV and MPI conduits
- OSU Micro-benchmarks (OMB) support for UPC++
- Available since MVAPICH2-X (2.2rc1)

### **Application Level Performance with Graph500 and Sort**





- Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design
  - 8,192 processes
    - 2.4X improvement over MPI-CSR
    - 7.6X improvement over MPI-Simple
  - 16,384 processes
    - 1.5X improvement over MPI-CSR
    - 13X improvement over MPI-Simple

- Performance of Hybrid (MPI+OpenSHMEM) Sort Application
  - 4,096 processes, 4 TB Input Size
    - MPI 2408 sec; 0.16 TB/min
    - Hybrid 1172 sec; 0.36 TB/min
    - 51% improvement over MPI-design

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

# Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
- Integrated Support for GPGPUs
  - CUDA-aware MPI
  - GPUDirect RDMA (GDR) Support
  - Control Flow Decoupling through GPUDirect Async
- Accelerating Graph Processing (Mizan) with MPI-3 RMA
- Optimized MVAPICH2 for KNL and Omni-Path

# **GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU**

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers

#### At Sender:

MPI\_Send(s\_devbuf, size, ...);

#### **At Receiver:**

MPI\_Recv(r\_devbuf, size, ...);

**High Performance and High Productivity** 



#### **CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases**

- Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
- Unified memory

#### Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)





**GPU-GPU Internode Bandwidth** 



MVAPICH2-GDR-2.2
Intel Ivy Bridge (E5-2680 v2) node - 20 cores
NVIDIA Tesla K40c GPU
Mellanox Connect-X4 EDR HCA
CUDA 8.0
Mellanox OFED 3.0 with GPU-Direct-RDMA

Latency (us)

### **Application-Level Evaluation (HOOMD-blue)**



- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
  - GDRCOPY enabled: MV2\_USE\_CUDA=1 MV2\_IBA\_HCA=mlx5\_0 MV2\_IBA\_EAGER\_THRESHOLD=32768 MV2\_VBUF\_TOTAL\_SIZE=32768 MV2\_USE\_GPUDIRECT\_LOOPBACK\_LIMIT=32768 MV2\_USE\_GPUDIRECT\_GDRCOPY=1 MV2\_USE\_GPUDIRECT\_GDRCOPY\_LIMIT=16384

#### Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland





- 2X improvement on 32 GPUs nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: http://www2.cosmo-model.org/content

/tasks/operational/meteoSwiss/

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

#### **MVAPICH2-GDS: Preliminary Results**



- Latency Oriented: Able to hide the kernel launch overhead
  - 25% improvement at 256 Bytes compared to default behavior
- Throughput Oriented: Asynchronously to offload queue the Communication and computation tasks
  - 14% improvement at 1KB message size

Intel Sandy Bridge, NVIDIA K20 and Mellanox FDR HCA

Will be available in a public release soon

# Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...)
- Integrated Support for GPGPUs
- Accelerating Graph Processing (Mizan) with MPI-3 RMA
- Optimized MVAPICH2 for KNL and Omni-Path

#### **Mizan-RMA: Accelerating Graph Processing Framework**

#### PageRank with Arabic Dataset



- Mizan framework is available from:
  - https://thegraphsblog.wordpress. com/the-graph-blog/mizan/
- Accelerate communication with MPI one-sided programming model (RMA)?
- Overlap communication with computation
- 2.5X improvement on 40 nodes
- Will be released soon

M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, Int'l Conference on High Performance Computing, HiPC'16, To be presented

#### **MVAPICH2** for Omni-Path and KNL

- MVAPICH2 has been supporting QLogic/PSM for many years with all different compute platforms
- Latest version (MVAPICH2 2.2 GA) supports
  - Omni-Path (derivative of QLogic IB)
    - Xeon family with Omni-Path
    - KNL with Omni-Path

# **Enhanced Designs for KNL: MVAPICH2 Approach**

- New designs
  - Take advantage of the idle cores
  - Dynamically configurable
  - Take advantage of highly multithreaded cores
  - Take advantage of MCDRAM of KNL processors
- Applicable to other programming models such as PGAS, Task-based, etc.
- Provides portability, performance, and applicability to runtime as well as applications in a transparent manner

# **Performance Benefits of the Enhanced Designs**



- New designs to exploit high concurrency and MCDRAM of KNL
- Significant improvements for large message sizes
- Benefits seen in varying message size as well as varying MPI processes

# **Performance Benefits of the Enhanced Designs**



 Benefits observed on training time of Multi-level Perceptron (MLP) model on MNIST dataset using CNTK Deep Learning Framework

Enhanced Designs will be available in upcoming MVAPICH2 releases

#### **MVAPICH2 – Plans for Exascale**

- Performance and Memory scalability toward 1M cores
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + UPC++, MPI + CAF ...)
  - MPI + Task\*
- Enhanced Optimization for GPU Support and Accelerators
- Taking advantage of advanced features of Mellanox InfiniBand
  - Switch-IB2 SHArP\*
  - GID-based support\*
- Enhanced communication schemes for upcoming architectures
  - Knights Landing with MCDRAM\*
  - NVLINK\*
  - CAPI\*
- Extended topology-aware collectives
- Extended Energy-aware designs and Virtualization Support
- Extended Support for MPI Tools Interface (as in MPI 3.1)
- Extended Checkpoint-Restart and migration support with SCR
- Support for \* features will be available in future MVAPICH2 Releases

### **Three Major Computing Categories**

- Scientific Computing
  - Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model
  - Many discussions towards Partitioned Global Address Space (PGAS)
    - UPC, OpenSHMEM, CAF, UPC++ etc.
  - Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC, UPC++)
- Deep Learning
  - Caffe, CNTK, TensorFlow, and many more
- Big Data/Enterprise/Commercial Computing
  - Focuses on large data and data analysis
  - Spark and Hadoop (HDFS, HBase, MapReduce)
  - Memcached is also used for Web 2.0

# **Deep Learning: New Challenges for MPI Runtimes**

- Deep Learning frameworks are a different game altogether
  - Unusually large message sizes (order of megabytes)
  - Most communication based on GPU buffers
- How to address these newer requirements?
  - GPU-specific Communication Libraries (NCCL)
    - NVidia's NCCL library provides inter-GPU communication
  - CUDA-Aware MPI (MVAPICH2-GDR)
    - Provides support for GPU-based communication
- Can we exploit CUDA-Aware MPI and NCCL to support Deep Learning applications?



Hierarchical Communication (Knomial + NCCL ring)

### Efficient Broadcast: MVAPICH2-GDR and NCCL

- NCCL has some limitations
  - Only works for a single node, thus, no scale-out on multiple nodes
  - Degradation across IOH (socket) for scale-up (within a node)
- We propose optimized MPI\_Bcast
  - Communication of very large GPU buffers (order of megabytes)
  - Scale-out on large number of dense multi-GPU nodes
- Hierarchical Communication that efficiently exploits:
  - CUDA-Aware MPI\_Bcast in MV2-GDR
  - NCCL Broadcast primitive

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning,
A. Awan , K. Hamidouche , A. Venkatesh , and D. K. Panda,
The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]



**Performance Benefits: OSU Micro-benchmarks** 



Performance Benefits: Microsoft CNTK DL framework
(25% avg. improvement)

## **Efficient Reduce: MVAPICH2-GDR**

- Can we optimize MVAPICH2-GDR to efficiently support DL frameworks?
  - We need to design large-scale reductions using CUDA-Awareness
  - GPU performs reduction using kernels
  - Overlap of computation and communication
  - Hierarchical Designs
- Proposed designs achieve 2.5x
   speedup over MVAPICH2-GDR and
   133x over OpenMPI

## Optimized Large-Size Reduction



# Large Message Optimized Collectives for Deep Learning

- MV2-GDR provides
   optimized collectives for

   large message sizes
- Optimized Reduce,
   Allreduce, and Bcast
- Good scaling with large number of GPUs
- Available in MVAPICH2-GDR 2.2GA



# **OSU-Caffe: Scalable Deep Learning**

- Caffe: A flexible and layered Deep Learning framework.
- Benefits and Weaknesses
  - Multi-GPU Training within a single node
  - Performance degradation for GPUs across different sockets
  - Limited Scale-out
- OSU-Caffe: MPI-based Parallel Training
  - Enable Scale-up (within a node) and Scale-out (across multi-GPU nodes)
  - Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset
  - Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset

## OSU-Caffe publicly available from

http://hidl.cse.ohio-state.edu/

## GoogLeNet (ImageNet) on 128 GPUs



## **Three Major Computing Categories**

- Scientific Computing
  - Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model
  - Many discussions towards Partitioned Global Address Space (PGAS)
    - UPC, OpenSHMEM, CAF, etc.
  - Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC, UPC++)
- Deep Learning
  - Caffe, CNTK, TensorFlow, and many more
- Big Data/Enterprise/Commercial Computing
  - Focuses on large data and data analysis
  - Spark and Hadoop (HDFS, HBase, MapReduce)
  - Memcached is also used for Web 2.0

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?

What are the major
bottlenecks in current Big
Data processing
middleware (e.g. Hadoop,
Spark, and Memcached)?

Can RDMA-enabled high-performance interconnects benefit Big Data processing?

Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?

How much performance benefits can be achieved through enhanced designs?

How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?



Bring HPC and Big Data processing into a "convergent trajectory"!

# Designing Communication and I/O Libraries for Big Data Systems: Challenges



# The High-Performance Big Data (HiBD) Project

- RDMA for Apache Spark
- RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  - Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
- RDMA for Apache HBase
- RDMA for Memcached (RDMA-Memcached)
- RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
- OSU HiBD-Benchmarks (OHB)
  - HDFS, Memcached, and HBase Micro-benchmarks
- http://hibd.cse.ohio-state.edu
- Users Base: 195 organizations from 27 countries
- More than 18,600 downloads from the project site
- RDMA for Impala (upcoming)







## **Available for InfiniBand and RoCE**

## **Different Modes of RDMA for Apache Hadoop 2.x**



- HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by **default** in the package.
- HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
- HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
- HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
- MapReduce over Lustre, with/without local disks: Besides, HDFS based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
- Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

## **Acceleration Case Studies and Performance Evaluation**

- RDMA-based Designs and Performance Evaluation
  - HDFS
  - MapReduce
  - Spark

# **Enhanced HDFS with In-Memory and Heterogeneous Storage**



- Design Features
  - Three modes
    - Default (HHH)
    - In-Memory (HHH-M)
    - Lustre-Integrated (HHH-L)
  - Policies to efficiently utilize the heterogeneous storage devices
    - RAM, SSD, HDD, Lustre
  - Eviction/Promotion based on data usage pattern
  - Hybrid Replication
  - Lustre-Integrated mode:
    - Lustre-based fault-tolerance

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015

# **Design Overview of MapReduce with RDMA**



- Design Features
  - RDMA-based shuffle
  - Prefetching and caching map output
  - Efficient Shuffle Algorithms
  - In-memory merge
  - On-demand Shuffle Adjustment
  - Advanced overlapping
    - map, shuffle, and merge
    - shuffle, merge, and reduce
  - On-demand connection setup
  - InfiniBand/RoCE support
- Enables high performance RDMA communication, while supporting traditional socket interface
- JNI Layer bridges Java based MapReduce with communication library written in native code

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

## Performance Benefits – RandomWriter & TeraGen in TACC-Stampede



- RandomWriter
  - 3-4x improvement over IPolB for 80-120 GB file size

- TeraGen
  - 4-5x improvement over IPolB for 80-120 GB file size

## **Performance Benefits – Sort & TeraSort in TACC-Stampede**



Cluster with 32 Nodes with a total of 128 maps and 57 reduces

- Sort with single HDD per node
  - 40-52% improvement over IPolB
     for 80-120 GB data



Cluster with 32 Nodes with a total of 128 maps and 64 reduces

- TeraSort with single HDD per node
  - 42-44% improvement over IPolB
     for 80-120 GB data

## **Acceleration Case Studies and Performance Evaluation**

- RDMA-based Designs and Performance Evaluation
  - HDFS
  - MapReduce
  - Spark

## **Design Overview of Spark with RDMA**



## Design Features

- RDMA based shuffle plugin
- SEDA-based architecture
- Dynamic connection management and sharing
- Non-blocking data transfer
- Off-JVM-heap buffer management
- InfiniBand/RoCE support

- Enables high performance RDMA communication, while supporting traditional socket interface
- JNI Layer bridges Scala based Spark with communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (Hotl'14), August 2014

X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.

## Performance Evaluation on SDSC Comet – SortBy/GroupBy





64 Worker Nodes, 1536 cores, SortByTest Total Time

64 Worker Nodes, 1536 cores, GroupByTest Total Time

- InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
- RDMA-based design for Spark 1.5.1
- RDMA vs. IPolB with 1536 concurrent tasks, single SSD per node.
  - SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
  - GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)

## **Performance Evaluation on SDSC Comet – HiBench PageRank**





32 Worker Nodes, 768 cores, PageRank Total Time

64 Worker Nodes, 1536 cores, PageRank Total Time

- InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
- RDMA-based design for Spark 1.5.1
- RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
  - 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
  - 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)

## **Performance Evaluation on SDSC Comet: Astronomy Application**

- Kira Toolkit¹: Distributed astronomy image processing toolkit implemented using Apache Spark.
- Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files.
- Compare RDMA Spark performance with the standard apache implementation using IPoIB.

1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. *CoRR*, *vol: abs/1507.03325*, Aug 2015.



Execution times (sec) for Kira SE benchmark using 65 GB dataset, 48 cores.

M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE'16, July 2016

## **Looking into the Future ....**

- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data movement cost
  - Faults
- Programming Models, Runtimes and Middleware need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- Highlighted some of the issues and challenges
- Need continuous innovation on all these fronts

# **Funding Acknowledgments**

## **Funding Support by**























## **Equipment Support by**



















# **Personnel Acknowledgments**

#### **Current Students**

A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

S. Chakraborthy (Ph.D.)

- C.-H. Chu (Ph.D.)

S. Guganani (Ph.D.)

N. Islam (Ph.D.)

J. Hashmi (Ph.D.)

M. Rahman (Ph.D.)
D. Shankar (Ph.D.)

A. Venkatesh (Ph.D.)

M. Li (Ph.D.) – J. Zhang (Ph.D.)

#### **Current Research Scientists**

K. Hamidouche

X. Lu

H. Subramoni

R. Rajachandrasekar (Ph.D.)

G. Santhanaraman (Ph.D.)

A. Singh (Ph.D.)

J. Sridhar (M.S.)

H. Subramoni (Ph.D.)

A. Vishnu (Ph.D.)

J. Wu (Ph.D.)

W. Yu (Ph.D.)

K. Vaidyanathan (Ph.D.)

S. Sur (Ph.D.)

#### **Current Research Specialist**

J. Smith

#### **Past Students**

- A. Augustine (M.S.)

P. Balaji (Ph.D.)

S. Bhagvat (M.S.)

A. Bhat (M.S.)

D. Buntinas (Ph.D.)

L. Chai (Ph.D.)

B. Chandrasekharan (M.S.)

N. Dandapanthula (M.S.)

V. Dhanraj (M.S.)

T. Gangadharappa (M.S.)

K. Gopalakrishnan (M.S.)

W. Huang (Ph.D.)

W. Jiang (M.S.)

J. Jose (Ph.D.)

S. Kini (M.S.)

M. Koop (Ph.D.)K. Kulkarni (M.S.)

- R. Kumar (M.S.)

S. Krishnamoorthy (M.S.)

K. Kandalla (Ph.D.)

P. Lai (M.S.)

– J. Liu (Ph.D.)

M. Luo (Ph.D.)

A. Mamidala (Ph.D.)

G. Marsh (M.S.)

V. Meshram (M.S.)

- A. Moody (M.S.)

S. Naravula (Ph.D.)
R. Noronha (Ph.D.)

X. Ouyang (Ph.D.)

S. Pai (M.S.)

S. Potluri (Ph.D.)

#### **Past Research Scientist**

S. Sur

#### **Past Programmers**

D. Bureddy

M. Arnold

J. Perkins

#### Past Post-Docs

- D. Banerjee

X. Besseron

– H.-W. Jin

– J. Lin

M. Luo

E. Mancini

S. Marcarelli

- J. Vienne

H. Wang

## **Thank You!**

panda@cse.ohio-state.edu



Network-Based Computing Laboratory <a href="http://nowlab.cse.ohio-state.edu/">http://nowlab.cse.ohio-state.edu/</a>



The MVAPICH2 Project <a href="http://mvapich.cse.ohio-state.edu/">http://mvapich.cse.ohio-state.edu/</a>



The High-Performance Big Data Project <a href="http://hibd.cse.ohio-state.edu/">http://hibd.cse.ohio-state.edu/</a>