This page presents a performance evaluation of the "Sum of NumPy Array and its Transpose" benchmark. We compare MPI4Dask, the MVAPICH2/MVAPICH2-X based communication device in the Dask Distributed library, with the UCX and TCP (using IPoIB) communication devices on an in-house cluster called RI2.
Sum of NumPy Array and its Transpose Benchmark

Experimental Testbed: Each node in OSU-RI2 has two fourteen-core Xeon E5-2680v4 processors at 2.4 GHz and 512 GB of main memory. The nodes support 16x PCI Express Gen3 interfaces and are equipped with Mellanox ConnectX-4 EDR HCAs with PCI Express Gen3 interfaces. The operating system is CentOS 7.
This benchmark creates a NumPy array and distributes its chunks across Dask workers, then adds these distributed chunks to their transpose. The NumPy array dimensions are 3E4 x 3E4 and the chunk size is 5E3 (6 chunks per dimension), for a total of 36 partitions. The benchmark presents strong scaling results: the problem size is held fixed while resources are added. Each Dask worker starts 28 threads.
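The partition count and the communication pattern implied by the description can be checked with a little arithmetic. The sketch below is pure Python and does not run the benchmark itself. It only reproduces the chunk grid from the stated sizes and shows which chunks must be paired when an array is added to its transpose. The helper names `chunk_grid` and `transpose_partner` are hypothetical, and the 8-byte element size (float64) is an assumption, since the page does not state the dtype.

```python
# Chunk-grid arithmetic for the benchmark's stated sizes (pure Python, no Dask required).
N = 30_000        # array is N x N (3E4 x 3E4, from the text)
CHUNK = 5_000     # chunks are CHUNK x CHUNK (5E3, from the text)
ITEM_BYTES = 8    # ASSUMPTION: float64 elements; the dtype is not stated on the page


def chunk_grid(n, chunk):
    """Return the number of chunks along one dimension (ceiling division)."""
    return -(-n // chunk)


def transpose_partner(i, j):
    """Chunk (i, j) of x + x.T equals x[i, j] + x[j, i].T, so it needs chunk (j, i)."""
    return (j, i)


per_dim = chunk_grid(N, CHUNK)
blocks = [(i, j) for i in range(per_dim) for j in range(per_dim)]
partitions = len(blocks)

# Diagonal chunks pair with themselves; off-diagonal chunks need a different chunk,
# which may be held by another worker.
off_diagonal = [b for b in blocks if transpose_partner(*b) != b]

chunk_mb = CHUNK * CHUNK * ITEM_BYTES / 1e6
array_gb = N * N * ITEM_BYTES / 1e9

print(f"chunks per dimension: {per_dim}")
print(f"partitions: {partitions}")
print(f"chunks needing a different chunk: {len(off_diagonal)}")
print(f"chunk size: {chunk_mb:.0f} MB, full array: {array_gb:.1f} GB (float64 assumed)")
```

Under these assumptions, 30 of the 36 output chunks depend on a chunk other than themselves. Whether that chunk is on a different worker depends on how the scheduler places chunks, which is not described here, but this dependency is why the choice of communication device can matter for this benchmark.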