[Figure: Stampede throughput (TPS), message size 4 bytes]

[Figure: Stampede throughput (TPS), message size 4096 bytes]

Experimental Testbed (TACC - Stampede): Each node of our testbed is a dual-socket system with two octa-core Intel Sandy Bridge (E5-2680) processors running at 2.70 GHz. Each node has 32GB of main memory, an SE10P (B0-KNC) co-processor, and a Mellanox IB FDR MT4099 HCA. The host processors run CentOS release 6.3 (Final).

These experiments are performed with 4 Memcached servers and up to 128 client nodes, each running 8-10 Memcached client instances. Since the nodes on this cluster are not equipped with SSDs, we run RDMA-enhanced Memcached in in-memory mode only.

This multi-client test is an extension of the OHB Get Micro-benchmark. All clients simultaneously access randomly chosen keys from the Memcached servers, and performance is measured as the total number of transactions executed per second; the per-client measurement loop is sketched below. This scalability test simulates the case where multiple clients access the Memcached servers concurrently. With a value size of 4 bytes, the RDMA-enhanced design achieves an improvement of 7X over IPoIB (FDR), and an improvement of 10X with a value size of 4 KB.
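For reference, the sketch below shows what one client instance's measurement loop could look like when written against libmemcached's standard C API. The server host names (server1 through server4), key count, and run length are illustrative assumptions, and the actual OHB micro-benchmark code may be structured differently.

/*
 * Minimal sketch of a per-client loop for the multi-client Get test,
 * assuming libmemcached's standard C API. Host names, key count, and
 * measurement window are placeholders, not values from the benchmark.
 */
#include <libmemcached/memcached.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NUM_KEYS   100000   /* keys pre-loaded in a setup phase (assumed) */
#define DURATION_S 30       /* measurement window in seconds (assumed) */

int main(void)
{
    memcached_st *memc = memcached_create(NULL);

    /* Each client connects to all 4 Memcached servers; keys are spread
     * across them by libmemcached's key-distribution hashing. */
    memcached_server_add(memc, "server1", 11211);
    memcached_server_add(memc, "server2", 11211);
    memcached_server_add(memc, "server3", 11211);
    memcached_server_add(memc, "server4", 11211);

    srand((unsigned)time(NULL));

    long   ops   = 0;
    time_t start = time(NULL);

    /* Issue Gets on randomly chosen keys until the window expires. */
    while (time(NULL) - start < DURATION_S) {
        char key[32];
        snprintf(key, sizeof(key), "key-%d", rand() % NUM_KEYS);

        size_t value_len;
        uint32_t flags;
        memcached_return_t rc;
        char *value = memcached_get(memc, key, strlen(key),
                                    &value_len, &flags, &rc);
        if (value != NULL)
            free(value);
        ops++;
    }

    /* Per-client TPS; the benchmark aggregates this over all clients. */
    printf("client TPS: %.0f\n", (double)ops / DURATION_S);

    memcached_free(memc);
    return 0;
}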


[Figure: Throughput (TPS), message size 4 bytes]

[Figure: Throughput (TPS), message size 4096 bytes]

Experimental Testbed (SDSC - Comet): Each compute node in this cluster has two twelve-core Intel Xeon E5-2680 v3 (Haswell) processors, 128GB DDR4 DRAM, and 320GB of local SATA SSD storage, running the CentOS operating system. The cluster interconnect is 56Gbps FDR InfiniBand with rack-level full bisection bandwidth and 4:1 over-subscription for cross-rack bandwidth.

These experiments are performed with 4 Memcached servers and up to 50 client nodes, each running 10-20 Memcached client instances.

This multi-client test is an extension of the OHB Get Micro-benchmark. All clients simultaneously access randomly chosen keys from the Memcached servers, and performance is measured as the total number of transactions executed per second. This scalability test simulates the case where multiple clients access the Memcached servers concurrently. The RDMA-enhanced design achieves an improvement of 4X over IPoIB (QDR) with a value size of 4 bytes, and an improvement of 7-10X with a value size of 4 KB. These results also illustrate that hybrid mode adds no overhead when all data fits in memory.


[Figure: RI2 non-blocking I/O writes, total write time per client, 512 MB chunks]

[Figure: RI2 non-blocking I/O reads, total read time per client, 512 MB chunks]

Experimental Testbed (OSU - RI2 Cluster): Each storage node is provisioned with two fourteen-core Intel Broadwell (E5-2680 v4) processors, 512 GB of memory, and a Mellanox IB EDR HCA. These experiments are performed with 4 Memcached servers and up to 16 client instances.

This multi-client test is an extension of the I/O-pattern example with non-blocking APIs that is provided with the libmemcached library within the RDMA for Memcached package, version 0.9.5. All clients start by setting keys that represent data blocks on the Memcached servers, and then access these data blocks in order; a blocking baseline of this pattern is sketched below. This scalability test simulates the case where multiple clients access the Memcached servers simultaneously, with the total data stored and accessed varying from 5GB to 20GB. We use key/value pair sizes of 512 MB. The non-blocking API extensions with the RDMA-enhanced design achieve an improvement of 1.25-2.25X over blocking APIs for the write phase, and an improvement of 2.51-4.97X over blocking APIs with RDMA-based designs for the read phase.
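The sketch below illustrates the blocking baseline of this I/O pattern using only the standard libmemcached C API: a write phase that stores fixed-size blocks under sequential keys, followed by a read phase that fetches them back in order. The server host names, block count, and single-connection structure are illustrative assumptions; the non-blocking API extensions shipped with the RDMA for Memcached package replace these calls with asynchronous issue/wait pairs so that communication can be overlapped, but their exact function names and signatures are specific to that package and are not reproduced here.

/*
 * Minimal sketch of the blocking I/O pattern (write phase then read
 * phase) using the standard libmemcached API. Host names and block
 * count are placeholders; large 512 MB values assume the RDMA-enhanced
 * Memcached servers are configured to accept items of this size.
 */
#include <libmemcached/memcached.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (512UL * 1024 * 1024)  /* 512 MB per key/value pair */
#define NUM_BLOCKS 10                     /* ~5 GB per client (assumed) */

int main(void)
{
    memcached_st *memc = memcached_create(NULL);

    /* Connect to the 4 Memcached servers. */
    memcached_server_add(memc, "server1", 11211);
    memcached_server_add(memc, "server2", 11211);
    memcached_server_add(memc, "server3", 11211);
    memcached_server_add(memc, "server4", 11211);

    char *block = malloc(BLOCK_SIZE);
    memset(block, 'x', BLOCK_SIZE);

    /* Write phase: store data blocks under sequential keys. */
    for (int i = 0; i < NUM_BLOCKS; i++) {
        char key[32];
        snprintf(key, sizeof(key), "block-%d", i);
        memcached_set(memc, key, strlen(key), block, BLOCK_SIZE,
                      (time_t)0, (uint32_t)0);
    }

    /* Read phase: fetch the same data blocks back in order. */
    for (int i = 0; i < NUM_BLOCKS; i++) {
        char key[32];
        snprintf(key, sizeof(key), "block-%d", i);

        size_t value_len;
        uint32_t flags;
        memcached_return_t rc;
        char *value = memcached_get(memc, key, strlen(key),
                                    &value_len, &flags, &rc);
        if (value != NULL)
            free(value);
    }

    free(block);
    memcached_free(memc);
    return 0;
}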