cuDF DataFrames are table-like data-structure that are stored in the GPU memory. As part of this application, merge operation is carried out for multiple cuDF data frames. We compare the performance of MVAPICH2-GDR based communication device in the Dask Distributed library, MPI4Dask here, with UCX and TCP (using IPoIB) communication devices on an in-house cluster called Cambridge Wilkes-3.

cuDF Merge Execution Time and Aggregate Worker Throughput

Execution Time

cupy ri2 time

Aggregate Worker Throughput

cupy ri2 thruput

Experimental Testbed: Each node in Cambridge Wilkes-3 System has two 32-core AMD EPYC 7763 processors with 1000 GB main memory. Each node has 4 NVIDIA A100 SXM4 GPUs with 80GB memory each. The nodes are equipped with Dual-rail Mellanox HDR200 InfiniBand.

The cuDF merge length is 1E8 for each worker. The benchmark presents weak scaling results.