cuDF DataFrames are table-like data-structure that are stored in the GPU memory. As part of this application, merge operation is carried out for multiple cuDF data frames. We compare the performance of MVAPICH2-GDR based communication device in the Dask Distributed library, MPI4Dask here, with UCX and TCP (using IPoIB) communication devices on an in-house cluster called Cambridge Wilkes-3.
cuDF Merge Execution Time and Aggregate Worker Throughput
Execution Time
Aggregate Worker Throughput
Experimental Testbed: Each node in Cambridge Wilkes-3 System has two 32-core AMD EPYC 7763 processors with 1000 GB main memory. Each node has 4 NVIDIA A100 SXM4 GPUs with 80GB memory each. The nodes are equipped with Dual-rail Mellanox HDR200 InfiniBand.
The cuDF merge length is 1E8 for each worker. The benchmark presents weak scaling results.