This benchmark creates a cuPy array and distributes its chunks across Dask workers. It then performs a slicing operation with an interval of 3, forcing the GPU data to move between workers over the network. We compare the performance of the MVAPICH2-GDR-based communication device against the existing communication devices in the Dask Distributed library.
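
The following is a minimal sketch of what such a run looks like, not the exact benchmark script: it builds a cuPy-backed Dask array with square chunks, slices it with an interval of 3, and times the operation. It assumes a Dask cluster with GPU workers is already running with the desired communication device; the scheduler file name, array dimension, and chunk size are placeholders.

    import time

    import cupy as cp
    import dask.array as da
    from dask.distributed import Client, wait

    # Assumes a Dask cluster with GPU workers is already up; the scheduler
    # file name is a placeholder for whatever the deployment uses.
    client = Client(scheduler_file="scheduler.json")

    n, chunk = 16_000, 4_000  # placeholder dimension and chunk size (see testbed notes)

    # Random square Dask array backed by cuPy chunks living on the GPU workers.
    rs = da.random.RandomState(RandomState=cp.random.RandomState)
    x = rs.random_sample((n, n), chunks=(chunk, chunk)).persist()
    wait(x)  # make sure the chunks are materialized before timing

    start = time.time()
    y = x[::3, ::3].persist()  # slicing with an interval of 3 reshuffles chunks across workers
    wait(y)
    print(f"slicing took {time.time() - start:.3f} s")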

[Figure: cuPy Array Slicing Execution Time and Aggregate Worker Throughput on Wilkes-3]

Experimental Testbed: Each node in the Cambridge Wilkes-3 system has two 32-core AMD EPYC 7763 processors and 1000 GB of main memory. Each node also has four NVIDIA A100 SXM4 GPUs with 80 GB of memory each. The nodes are equipped with dual-rail Mellanox HDR200 InfiniBand.

The cuPy array dimensions are 48E3x48E3 with a chunk size of 4E3 for 2 and 4 workers, and 96E3x96E3 with a chunk size of 4E3 for 8, 16, and 32 workers.
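
As a quick restatement of these problem sizes, the hypothetical helper below (the function name is invented and is not part of the benchmark) maps each Wilkes-3 worker count to the array dimension and chunk size used in the runs.

    # Hypothetical helper, only to restate the Wilkes-3 problem sizes above:
    # the array dimension grows with the worker count while the chunk size stays fixed.
    def wilkes3_problem_size(num_workers: int) -> tuple[int, int]:
        """Return (array dimension, chunk size) for the given worker count."""
        chunk = 4_000
        dim = 48_000 if num_workers <= 4 else 96_000
        return dim, chunk

    for workers in (2, 4, 8, 16, 32):
        dim, chunk = wilkes3_problem_size(workers)
        print(f"{workers:2d} workers -> {dim} x {dim} array, {chunk} x {chunk} chunks")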

[Figure: cuPy Array Slicing Execution Time and Aggregate Worker Throughput on Frontera]

Experimental Testbed: Each node in the TACC Frontera system has two 28-core Intel Xeon Platinum 8280 processors and 192 GB of main memory. The nodes are equipped with dual-rail Mellanox HDR200 InfiniBand.

The cuPy array dimensions are 16E3x16E3 and the chunk size is 4E3. The benchmark presents strong scaling results.
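
To relate the two plotted metrics, the snippet below gives a rough back-of-the-envelope throughput estimate from the size of the sliced result and a measured execution time. The published figures may derive aggregate worker throughput differently (for example, from Dask's own communication counters), and the 0.5 s runtime used here is purely hypothetical.

    import numpy as np

    def estimated_throughput_gbs(dim: int, interval: int, elapsed_s: float,
                                 dtype=np.float64) -> float:
        """Rough GB/s estimate: size of the sliced result divided by the elapsed time."""
        sliced_dim = (dim + interval - 1) // interval  # ceil(dim / interval)
        bytes_out = sliced_dim * sliced_dim * np.dtype(dtype).itemsize
        return bytes_out / elapsed_s / 1e9

    # Example: 16E3 x 16E3 array sliced with an interval of 3, hypothetical 0.5 s runtime.
    print(f"{estimated_throughput_gbs(16_000, 3, 0.5):.2f} GB/s")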

[Figure: cuPy Array Slicing Execution Time and Aggregate Worker Throughput on RI2]

Experimental Testbed: Each node in the RI2 system has two 28-core Intel Xeon E5-2680 processors and 512 GB of main memory. The nodes are equipped with Mellanox EDR InfiniBand.

The cuPy array dimensions are 16E3x16E3 and the chunk size is 4E3. The benchmark presents strong scaling results.