# The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies

Talk at OSC/OH-TECH Booth (SC '15)

by

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: panda@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

#### High-End Computing (HEC): PetaFlop to ExaFlop



# Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)



Timeline

# **Drivers of Modern HPC Cluster Architectures**





**Multi-core Processors** 

**High Performance Interconnects** - InfiniBand (<1 µs latency, 100 Gbps bandwidth)



**Accelerators / Coprocessors** - high compute density, high performance/watt, >1 TFlop DP on a chip



SSD, NVMe-SSD, NVRAM

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)



Stampede

Tianhe – 1A

# Large-scale InfiniBand Installations

- 235 IB Clusters (47%) in the Nov '15 Top500 list (<u>http://www.top500.org</u>)
- Installations in the Top 50 (21 systems):
  - 462,462 cores (Stampede) at TACC (10<sup>th</sup>)
  - 185,344 cores (Pleiades) at NASA/Ames (13<sup>th</sup>)
  - 72,800 cores Cray CS-Storm in US (15<sup>th</sup>)
  - 72,800 cores Cray CS-Storm in US (16<sup>th</sup>)
  - 265,440 cores SGI ICE at Tulip Trading Australia (17<sup>th</sup>)
  - 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18<sup>th</sup>)
  - 72,000 cores (HPC2) in Italy (19<sup>th</sup>)
  - 152,692 cores (Thunder) at AFRL/USA (21<sup>st</sup>)
  - 147,456 cores (SuperMUC) in Germany (22<sup>nd</sup>)
  - 86,016 cores (SuperMUC Phase 2) in Germany (24<sup>th</sup>)
  - 76,032 cores (Tsubame 2.5) at Japan/GSIC (25<sup>th</sup>)
  - 194,616 cores (Cascade) at PNNL (27<sup>th</sup>)
  - 76,032 cores (Makman-2) at Saudi Aramco (32<sup>nd</sup>)
  - 110,400 cores (Pangea) in France (33<sup>rd</sup>)
  - 37,120 cores (Lomonosov-2) at Russia/MSU (35<sup>th</sup>)
  - 57,600 cores (SwiftLucy) in US (37<sup>th</sup>)
  - 55,728 cores (Prometheus) at Poland/Cyfronet (38<sup>th</sup>)
  - 50,544 cores (Occigen) at France/GENCI-CINES (43<sup>rd</sup>)
  - 76,896 cores (Salomon) SGI ICE in Czech Republic (47<sup>th</sup>)
  - and many more!

#### **Designing High-Performance Middleware for HPC: Challenges**



### **Broad Challenges in Designing Communication Libraries at Exascale**

- Scalable Job Startup
- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
- Scalable Collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated Support for GPGPUs and Accelerators
  - Support for GPGPUs
  - Support for MICs
- QoS support for communication and I/O

# **MVAPICH2** Software

- High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
  - MVAPICH2-X (MPI + PGAS), Available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  - Support for Virtualization (MVAPICH2-Virt), Available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  - Used by more than 2,475 organizations in 76 countries
  - More than 307,000 downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '15 ranking)
    - 10<sup>th</sup> ranked 519,640-core cluster (Stampede) at TACC
    - 13<sup>th</sup> ranked 185,344-core cluster (Pleiades) at NASA
    - 25<sup>th</sup> ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
  - Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
  - <u>http://mvapich.cse.ohio-state.edu</u>
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3<sup>rd</sup> in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  - Stampede at TACC (10<sup>th</sup> in Nov '15, 519,640 cores, 5.168 PFlops)

### **Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale**

- Scalable Job Startup
- Scalability for million to billion processors
  - Support for highly-efficient Inter-node communication
  - Support for highly-efficient Intra-node communication
  - Support for highly-efficient One-sided / RMA communication
- Scalable Collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Support for GPGPUs
- Support for MICs
- QoS support for communication and I/O

# **Job-Launchers supported by MVAPICH2**



#### **MPI\_Init Performance on TACC Stampede**



- Near-constant MPI\_Init performance
- 59 times improvement at 8,192 processes (512 nodes)
- New designs show good scaling with 16K processes and above

"Non-blocking PMI Extensions for Fast MPI Startup"

S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins and D. K. Panda, Int'l Symposium on Cluster, Cloud, and Grid Computing (CCGrid '15)

#### **MPI Hello World Performance on TACC Stampede**



- PMI Exchange costs overlapped with application computation
- 5.7 times improvement at 8,192 processes (512 nodes)
- New designs to be available as part of upcoming releases

"Non-blocking PMI Extensions for Fast MPI Startup"

S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins and D. K. Panda, Int'l Symposium on Cluster, Cloud, and Grid Computing (CCGrid '15)


#### Latency & Bandwidth: MPI over IB with MVAPICH2



- TrueScale-QDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
- ConnectX-3-FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
- ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
- ConnectX-4-EDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3, back-to-back

# MVAPICH2 Two-Sided Intra-Node Performance

(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))





#### Memory Utilization using Shared Receive Queues, UD



SRQ reduces the memory used by 1/6<sup>th</sup> at 64,000 processes

S. Sur, L. Chai, H.-W. Jin and D. K. Panda, "Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters", IPDPS 2006

Memory utilization by transport, RC and UD (both MVAPICH2 1.8):

| Number of Processes | RC Conn. | RC Buffers | RC Struct | RC Total | UD Buffers | UD Struct | UD Total |
|---------------------|----------|------------|-----------|----------|------------|-----------|----------|
| 512                 | 22.9     | 24         | 0.3       | 47.2     | 24         | 0.2       | 24.2     |
| 1024                | 29.5     | 24         | 0.6       | 54.1     | 24         | 0.4       | 24.4     |
| 2048                | 42.4     | 24         | 1.2       | 67.6     | 24         | 0.9       | 24.9     |



• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters," ICS '07
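The RC connection column in the table above grows with the number of processes, while the UD columns stay nearly flat. A minimal sketch of that arithmetic follows, using only values from the table and staying in the table's own units. The helper names `conn_growth_per_process` and `row_total` are made up for illustration and are not part of MVAPICH2.

```c
#include <assert.h>

/* Hypothetical helper: average growth of RC connection memory per added
 * process between two rows of the table (same units as the table). */
double conn_growth_per_process(int procs_lo, double conn_lo,
                               int procs_hi, double conn_hi)
{
    assert(procs_hi > procs_lo);
    return (conn_hi - conn_lo) / (double)(procs_hi - procs_lo);
}

/* Per-process total for one row: connections + buffers + structures.
 * For UD, pass conn = 0.0 because the table lists no connection memory. */
double row_total(double conn, double buffers, double structs)
{
    return conn + buffers + structs;
}
```

With the 512- and 2048-process rows, `conn_growth_per_process(512, 22.9, 2048, 42.4)` evaluates 19.5 / 1536, roughly 0.0127 table units per additional process, and `row_total(22.9, 24.0, 0.3)` reproduces the 47.2 total of the first RC row.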

#### eXtended Reliable Connection (XRC) and Hybrid Mode



• Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections)

• NAMD performance improves when there is frequent communication to many peers



- Both UD and RC/XRC have benefits
  - Hybrid for the best of both
- Available since MVAPICH2 1.7 as an integrated interface
- Runtime parameters (RC is the default):
  - UD: MV2\_USE\_ONLY\_UD=1
  - Hybrid: MV2\_HYBRID\_ENABLE\_THRESHOLD=1

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
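To see why connection memory matters at this scale, it helps to count connections per process. The sketch below is a back-of-the-envelope model, not MVAPICH2 code. It assumes a fully connected job in which RC needs one connection per remote process and XRC needs one per remote node, and it assumes every node runs the same number of processes. The function names are hypothetical.

```c
#include <assert.h>

/* Assumption: fully connected RC needs one connection per remote process. */
long rc_connections(long nprocs)
{
    assert(nprocs > 0);
    return nprocs - 1;
}

/* Assumption: XRC needs one connection per remote node, so the count
 * depends on the number of nodes rather than the number of processes. */
long xrc_connections(long nprocs, long procs_per_node)
{
    assert(nprocs > 0 && procs_per_node > 0);
    assert(nprocs % procs_per_node == 0);   /* evenly filled nodes only */
    return nprocs / procs_per_node - 1;
}
```

For the 32K-process, 8-cores-per-node case mentioned above (32,768 processes on 4,096 nodes), this model gives 32,767 connections per process for RC and 4,095 for XRC.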

# **Dynamic Connected (DC) Transport in MVAPICH2**



- Constant connection cost (One QP for any peer)
- Full Feature Set (RDMA, Atomics etc)
- Separate objects for send (DC Initiator) and receive (DC Target)
  - DC Target identified by "DCT Number"
  - Messages routed with (DCT Number, LID)
  - Requires same "DC Key" to enable communication
- Initial study done in MVAPICH2
- DCT support available in Mellanox OFED 2.2.0.1

#### **Memory Footprint for Alltoall**





### **User-mode Memory Registration (UMR)**

- Introduced by Mellanox to support direct local and remote noncontiguous memory access
- Avoid packing at sender and unpacking at receiver
- Available in MVAPICH2-X 2.2b



Large Message Latency

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, "High Performance MPI Datatype Support with Usermode Memory Registration: Challenges, Designs and Benefits", CLUSTER, 2015

#### **MPI-3 RMA Model: Performance**

• RDMA-based and truly one-sided implementation of MPI-3 RMA in progress



- MVAPICH2-2.1 and OSU micro-benchmarks (OMB v4.1)
- Better MPI\_Compare\_and\_swap, MPI\_Fetch\_and\_op, and MPI\_Get performance with the RDMA-based design

#### **MPI-3 RMA Model: Overlap**



- Process 0 is busy in computation, Process 1 performs atomic operations at P0
- These benchmarks show the latency of atomic operations. For RDMA based design, the atomic latency at P1 remains consistent even as the busy time at P0 increases


#### **Shared-memory Aware Collectives**

- MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB), plotted against message size in bytes
  - Runtime parameter: MV2\_USE\_SHMEM\_ALLREDUCE=0/1
- MVAPICH2 Barrier with 1K Intel Westmere cores, QDR IB
  - Runtime parameter: MV2\_USE\_SHMEM\_BARRIER=0/1
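One common way to make a collective shared-memory aware is to combine values inside each node first and let a single leader per node take part in the inter-node step. The sketch below simulates that two-level sum in plain C without MPI. It assumes ranks are laid out node by node with the same number of ranks on every node, and it only illustrates the general idea, not how MVAPICH2 implements it. The function name `two_level_sum` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Two-level sum over `nprocs` values, with `ppn` consecutive ranks per node.
 * Stage 1 adds up each node's values (the intra-node, shared-memory step);
 * stage 2 adds the per-node partial sums (the inter-node step among leaders).
 * `*leaders` receives the number of node leaders involved in stage 2. */
long two_level_sum(const long *vals, size_t nprocs, size_t ppn,
                   size_t *leaders)
{
    assert(ppn > 0 && nprocs % ppn == 0);
    size_t nnodes = nprocs / ppn;
    long total = 0;

    for (size_t node = 0; node < nnodes; node++) {
        long partial = 0;                       /* this node's partial sum */
        for (size_t local = 0; local < ppn; local++)
            partial += vals[node * ppn + local];
        total += partial;                       /* leaders combine partials */
    }
    *leaders = nnodes;
    return total;
}
```

With 8 ranks holding the values 1 through 8 and 4 ranks per node, the result is 36 and only 2 leaders take part in the inter-node step, instead of all 8 ranks.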

#### Hardware Multicast-aware MPI\_Bcast on Stampede



# Application benefits with Non-Blocking Collectives based on CX-2 Collective Offload



Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes)



Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than default version

Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes), shown across HPL problem sizes (N) given as % of total memory

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12

### **Network-Topology-Aware Placement of Processes**

Can we design a highly scalable network topology detection service for IB?

How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?

What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?



**Overall performance and Split up of physical communication for MILC on Ranger** 

- Reduce network topology discovery time from O(N<sup>2</sup><sub>hosts</sub>) to O(N<sub>hosts</sub>)
- 15% improvement in MILC execution time @ 2048 cores
- 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper **Finalist**


### **MPI + CUDA - Naive**

• Data movement in applications with standard MPI and CUDA interfaces

#### At Sender:

cudaMemcpy(s\_hostbuf, s\_devbuf, . . .); MPI\_Send(s\_hostbuf, size, . . .);

#### At Receiver:

MPI\_Recv(r\_hostbuf, size, . . .); cudaMemcpy(r\_devbuf, r\_hostbuf, . . .);

#### High Productivity and Low Performance



### **MPI + CUDA - Advanced**

• Pipelining at user level with non-blocking MPI and CUDA interfaces

```
/* At sender */
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);
for (j = 0; j < pipeline_len; j++) {
    result = cudaErrorNotReady;          /* poll the copy of chunk j afresh */
    while (result != cudaSuccess) {
        result = cudaStreamQuery(...);
        if (j > 0) MPI_Test(...);        /* progress earlier sends meanwhile */
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
}
MPI_Waitall(...);
/* Similar at receiver */
```
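The loop above assumes the message divides evenly into `pipeline_len` chunks of `blksz` bytes. A small sketch of the chunk bookkeeping for the general case, where the last chunk may be shorter, is given below. `chunk_count` and `chunk_len` are hypothetical helpers written for this illustration; they are not part of MPI or CUDA.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pipeline stages needed to cover `total` bytes with chunks of
 * at most `blksz` bytes (ceiling division). */
size_t chunk_count(size_t total, size_t blksz)
{
    assert(blksz > 0);
    return (total + blksz - 1) / blksz;
}

/* Length in bytes of chunk `j`; every chunk is `blksz` long except
 * possibly the last one. */
size_t chunk_len(size_t total, size_t blksz, size_t j)
{
    size_t offset = j * blksz;
    assert(blksz > 0 && offset < total);
    return (total - offset < blksz) ? total - offset : blksz;
}
```

For a 10,000-byte message and 4,096-byte blocks, `chunk_count` returns 3 and the chunk lengths are 4,096, 4,096, and 1,808 bytes. In the pipelined loop, `chunk_count(...)` would take the place of `pipeline_len` and `chunk_len(...)` the place of the fixed `blksz` in the copy and send calls.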



#### Low Productivity and High Performance

#### **GPU-Aware MPI Library: MVAPICH2-GPU**

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers



# **GPU-Direct RDMA (GDR) with CUDA**

- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPU-Direct RDMA
    - GPUDirect RDMA and Host-based pipelining
    - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters



| Platform      | P2P write | P2P read   |
|---------------|-----------|------------|
| SNB E5-2670   | 5.2 GB/s  | < 1.0 GB/s |
| IVB E5-2680V2 | 6.4 GB/s  | 3.5 GB/s   |

#### Performance of MVAPICH2-GDR with GPU-Direct-RDMA



# **Application-Level Evaluation (HOOMD-blue)**



- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
  - GDRCOPY enabled: MV2\_USE\_CUDA=1 MV2\_IBA\_HCA=mlx5\_0 MV2\_IBA\_EAGER\_THRESHOLD=32768 MV2\_VBUF\_TOTAL\_SIZE=32768 MV2\_USE\_GPUDIRECT\_LOOPBACK\_LIMIT=32768 MV2\_USE\_GPUDIRECT\_GDRCOPY=1 MV2\_USE\_GPUDIRECT\_GDRCOPY\_LIMIT=16384

# Non-Blocking Collectives (NBC) from GPU Buffers using Offload Mechanism (CORE-Direct)

- New designs are proposed to support Non-Blocking Collectives from GPU buffers to provide good overlap/latency
- Available in MVAPICH2-GDR 2.2a



Connect-X2, 2.6 GHz 12-core (IvyBridge E5-2630), dual NVIDIA K20c GPUs, Intel PCI Gen3 with MLX IB FDR switch

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, "Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters", HiPC, 2015


# **MPI Applications on MIC Clusters**

• Flexibility in launching MPI jobs on clusters with Xeon Phi



# MVAPICH2-MIC 2.0: High-Performance MPI Design for Clusters with IB and MIC

- Offload Mode
- Intranode Communication
  - Coprocessor-only and Symmetric Mode
- Internode Communication
  - Coprocessor-only and Symmetric Mode
- Multi-MIC Node Configurations
- Running on three major systems
  - Stampede, Blueridge (Virginia Tech) and Beacon (UTK)



#### **MIC-Remote-MIC P2P Communication with Proxy-based Communication**



#### **Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)**



A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS'14, May 2014


# **Exploiting QoS Support in MVAPICH2**



- IB is capable of providing network-level differentiated service (QoS)
- Uses Service Levels (SL) and Virtual Lanes (VL) to classify traffic
- Enabled at configure time using CFLAG ENABLE\_QOS\_SUPPORT
- Check with the system administrator before enabling
  - Can affect performance of other jobs in the system

# Minimizing Network Contention w/ QoS-Aware Data-Staging

- Asynchronous I/O introduces contention for network-resources
- How should data be orchestrated in a data-staging architecture to eliminate such contention?
- Can the QoS capabilities provided by cutting-edge interconnect technologies be leveraged by parallel filesystems to minimize network contention?



• Reduces runtime overhead from 17.9% to 8% and from 32.8% to 9.31%, in case of AWP and NAS-CG applications respectively

R. Rajachandrasekar, J. Jaswani, H. Subramoni and D. K. Panda, Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework, IEEE Cluster, Sept. 2012

# **MVAPICH2 – Plans for Exascale**

- Performance and Memory scalability toward 500K-1M cores
  - Dynamically Connected Transport (DCT) service with Connect-IB
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF ...)
  - Support for UPC++
- Enhanced Optimization for GPU Support and Accelerators
- Taking advantage of advanced features
  - User Mode Memory Registration (UMR)
  - On-demand Paging
- Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Energy-aware point-to-point (one-sided and two-sided) and collectives
- Extended Support for MPI Tools Interface (as in MPI 3.0)
- Extended Checkpoint-Restart and migration support with SCR
- Energy Awareness

# **Two Additional Talks**

- Wednesday (1:30-2:00pm)
  - The MVAPICH2 Project: Heading Towards New Horizons in Energy-Awareness, Virtualization and Network/Job-Level Introspection
- Thursday (10:00-10:30am)
  - How to Exploit MPI, PGAS and Hybrid MPI+PGAS Programming through MVAPICH2-X?

# **Funding Acknowledgments**

#### Funding Support by
















Equipment Support by



## **Personnel Acknowledgments**

#### **Current Students**

- A. Augustine (M.S.)
- A. Awan (Ph.D.)
- S. Chakraborty (Ph.D.)
- C.-H. Chu (Ph.D.)
- N. Islam (Ph.D.)
- M. Li (Ph.D.)
- K. Kulkarni (M.S.)
- M. Rahman (Ph.D.)
- D. Shankar (Ph.D.)
- A. Venkatesh (Ph.D.)
- J. Zhang (Ph.D.)

#### **Current Research Scientists**

- H. Subramoni
- X. Lu

#### **Current Senior Research Associate**

- K. Hamidouche

#### **Current Post-Doc**

- J. Lin
- D. Banerjee

#### **Current Programmer**

- J. Perkins

#### **Current Research Specialist**

- M. Arnold

#### **Past Students**

- P. Balaji (Ph.D.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- R. Kumar (M.S.)
- S. Krishnamoorthy (M.S.)
- K. Kandalla (Ph.D.)
- P. Lai (M.S.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- S. Sur (Ph.D.)
- H. Subramoni (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)

#### **Past Research Scientist**

- S. Sur

#### **Past Programmers**

- D. Bureddy

#### **Past Post-Docs**

- H. Wang
- X. Besseron
- H.-W. Jin
- M. Luo
- E. Mancini
- S. Marcarelli
- J. Vienne

#### **Web Pointers**

http://www.cse.ohio-state.edu/~panda

http://www.cse.ohio-state.edu/~subramon

http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu

subramon@cse.ohio-state.edu