Noah Watkins

Remote RAM disk with RDMA

In this post I’ll show you how to use iSER, iSCSI, and LIO to set up a remote RAM disk. This is useful if you need high IOPS but don’t have access to a bunch of SSDs or NVRAM. Note that the performance achieved in this post is quite low compared to what you should be able to achieve with different hardware: the arm64 machines we are using currently aren’t delivering the expected performance, and tuning is ongoing. However, the steps described here are relevant for other installations. Once you create several remote RAM disks, you can tie them together with RAID-0 or dm-linear.
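As a rough illustration of the dm-linear option, the sketch below builds a device-mapper table that concatenates several block devices end to end. Each table line has the form `<start> <length> linear <device> <offset>`, measured in 512-byte sectors. The device paths and sizes are hypothetical placeholders (two 50 GiB disks), not values taken from a finished setup.

```python
def dm_linear_table(devices):
    """Build a device-mapper table that concatenates block devices.

    devices: list of (path, size_in_512_byte_sectors) tuples.
    Each output line is "<start> <length> linear <device> <offset>".
    """
    lines = []
    start = 0  # first sector of the next segment in the combined device
    for path, sectors in devices:
        lines.append(f"{start} {sectors} linear {path} 0")
        start += sectors
    return "\n".join(lines)


# Hypothetical example: two 50 GiB remote RAM disks (104857600 sectors each).
table = dm_linear_table([("/dev/sdb", 104857600), ("/dev/sdc", 104857600)])
print(table)
```

The resulting text could be fed to `dmsetup create` on standard input, and the per-device sector counts would normally come from something like `blockdev --getsz`.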

We’ll use the following hardware provided by CloudLab.

HP Moonshot m400
Eight 64-bit ARMv8 (Atlas/A57) cores at 2.4 GHz (APM X-GENE)
64GB ECC Memory (8x 8 GB DDR3-1600 SO-DIMMs)
120 GB of flash (SATA3 / M.2, Micron M500)
Dual-port Mellanox ConnectX-3 10 Gb NIC (PCIe v3.0, 8 lanes)

Next I’ll show you the basic server and client setup, and then demonstrate usage with some basic benchmarks.

Target (Server) Setup

Make the RAMDisk backing store:

/> /backstores/rd_mcp create name=rd1 size=50G
Generating a wwn serial.
Created rd_mcp ramdisk rd1 with size 50G.

Make the iSCSI target:

/> /iscsi create
Created target iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040.
Selected TPG Tag 1.
Successfully created TPG 1.

Create a LUN backed by the RAMDisk:

/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/luns create storage_object=/backstores/rd_mcp/rd1 
Selected LUN 0.
Successfully created LUN 0.

Create a portal for the iSCSI target:

/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals create 10.10.1.3
Using default IP port 3260
Successfully created network portal 10.10.1.3:3260.

Enable iSER on the portal:

/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals/10.10.1.3:3260 iser_enable
iser operation has been enabled

Here is the final configuration:

/> ls
o- / ..................................................................... [...]
  o- backstores .......................................................... [...]
  | o- fileio ............................................... [0 Storage Object]
  | o- iblock ............................................... [0 Storage Object]
  | o- pscsi ................................................ [0 Storage Object]
  | o- rd_dr ................................................ [0 Storage Object]
  | o- rd_mcp ............................................... [1 Storage Object]
  |   o- rd1 ............................................... [ramdisk activated]
  o- ib_srpt ....................................................... [0 Targets]
  o- iscsi .......................................................... [1 Target]
  | o- iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040 ...... [1 TPG]
  |   o- tpgt1 ....................................................... [enabled]
  |     o- acls ....................................................... [0 ACLs]
  |     o- luns ........................................................ [1 LUN]
  |     | o- lun0 ....................................... [rd_mcp/rd1 (ramdisk)]
  |     o- portals .................................................. [1 Portal]
  |       o- 10.10.1.3:3260 ................................. [OK, iser enabled]
  o- loopback ...................................................... [0 Targets]
  o- qla2xxx ....................................................... [0 Targets]
  o- tcm_fc ........................................................ [0 Targets]

Finally, disable authentication and ACL checks so that any initiator can log in and write to the LUN:

/> /iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
Parameter authentication is now '0'.
Parameter demo_mode_write_protect is now '0'.
Parameter generate_node_acls is now '1'.
Parameter cache_dynamic_acls is now '1'.
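When several RAM disks are needed, typing these commands repeatedly gets tedious. The sketch below only generates the same command strings used above for a given IQN and portal address; it does not run targetcli. The function name and its defaults are made up for illustration, and the `rd_mcp`, `tpgt1`, and `iser_enable` spellings are copied from the session above and may differ in other targetcli versions.

```python
def targetcli_commands(iqn, portal_ip, disk_name="rd1", size="50G", port=3260):
    """Return the targetcli commands from this walkthrough as strings."""
    tpg = f"/iscsi/{iqn}/tpgt1"
    return [
        # RAM disk backing store
        f"/backstores/rd_mcp create name={disk_name} size={size}",
        # iSCSI target with an explicit IQN instead of a generated one
        f"/iscsi create {iqn}",
        # LUN backed by the RAM disk
        f"{tpg}/luns create storage_object=/backstores/rd_mcp/{disk_name}",
        # Portal, then iSER on that portal
        f"{tpg}/portals create {portal_ip}",
        f"{tpg}/portals/{portal_ip}:{port} iser_enable",
        # Demo mode: no authentication, no ACLs, writable LUNs
        f"{tpg}/ set attribute authentication=0 demo_mode_write_protect=0 "
        "generate_node_acls=1 cache_dynamic_acls=1",
    ]


commands = targetcli_commands(
    "iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040", "10.10.1.3"
)
for command in commands:
    print(command)
```

Each printed line can be pasted at the `/>` prompt. Passing a line as arguments to a single `targetcli` invocation should also work, but that is untested against the version used here.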

Initiator (Client) Setup

From the client, also called the initiator, we can use the iscsiadm tool to look for the targets we have created. In this case we’ve set up one iSCSI target on the node with address 10.10.1.3:

$ sudo iscsiadm -m discovery -t sendtargets -p 10.10.1.3
10.10.1.3:3260,1 iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde

To access the iSCSI targets as local devices we need to log in to the targets. We can log in to all of the targets that have been discovered with the following command:

$ sudo iscsiadm -m node -L all
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.

Now that we have logged in to the target, we should be able to access the LUN we set up as a local device. We can see the device has been attached by examining dmesg:

[ 1653.685547] scsi2 : iSCSI Initiator over TCP/IP
[ 1653.941126] scsi 2:0:0:0: Direct-Access     LIO-ORG  RAMDISK-MCP      4.0  PQ: 0 ANSI: 5
[ 1653.941314] sd 2:0:0:0: Attached scsi generic sg1 type 0
[ 1653.942324] sd 2:0:0:0: [sdb] 104857600 512-byte logical blocks: (53.6 GB/50.0 GiB)
[ 1653.942717] sd 2:0:0:0: [sdb] Write Protect is off
[ 1653.942721] sd 2:0:0:0: [sdb] Mode Sense: 43 00 00 08
[ 1653.942880] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 1653.944324]  sdb: unknown partition table
[ 1653.945174] sd 2:0:0:0: [sdb] Attached SCSI disk
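As a quick sanity check, the block count in the dmesg output can be converted back to a size. The short calculation below uses only the numbers printed above (104857600 logical blocks of 512 bytes) and confirms that the device is exactly the 50 GiB we asked for.

```python
SECTOR_BYTES = 512
sectors = 104857600  # from the "[sdb] ... 512-byte logical blocks" line

size_bytes = sectors * SECTOR_BYTES
size_gb = size_bytes / 10**9   # decimal gigabytes
size_gib = size_bytes / 2**30  # binary gibibytes

print(f"{size_bytes} bytes = {size_gb:.2f} GB = {size_gib:.1f} GiB")
```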

The iscsiadm tool also lets us look at a lot of information about the targets. Using the following command we can see some of the networking configuration for the targets:

$ sudo iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873
Target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
        Current Portal: 10.10.1.3:3260,1
        Persistent Portal: 10.10.1.3:3260,1
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.1993-08.org.debian:01:a41f7afa2fc8
                Iface IPaddress: 10.10.1.1
                ...

Notice that the Iface Transport option is set to tcp. In order to get maximum performance using RDMA, we want to use the iSER transport instead. To set this we first need to log out of the targets:

$ sudo iscsiadm -m node -U all
Logging out of session [sid: 1, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260]
Logout of [sid: 1, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.

Next, set the iface.transport_name option to iser for our target:

$ sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde -o update -n iface.transport_name -v iser

Now we can log back in to the target and check that the transport has been set to iSER. Log in:

$ sudo iscsiadm -m node -L all
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.

Check transport:

$ sudo iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873
Target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
        Current Portal: 10.10.1.3:3260,1
        Persistent Portal: 10.10.1.3:3260,1
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: iser
                Iface Initiatorname: iqn.1993-08.org.debian:01:a41f7afa2fc8
                ...

Success!
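With more than one target, checking the transport by eye gets error-prone. The following sketch scans the text printed by `iscsiadm -m session -P 3` and reports the `Iface Transport` value for each target. It assumes the output layout shown above; the function name is made up for illustration, and the sample text is an abbreviated copy of the session output.

```python
def iface_transports(session_output):
    """Map each target IQN to its 'Iface Transport' value.

    If a target has several sessions, the last one listed wins.
    """
    transports = {}
    target = None
    for raw_line in session_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Target:"):
            # Keep only the IQN, dropping anything printed after it.
            target = line.split(":", 1)[1].split()[0]
        elif line.startswith("Iface Transport:") and target is not None:
            transports[target] = line.split(":", 1)[1].strip()
    return transports


SAMPLE = """\
Target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
        Current Portal: 10.10.1.3:3260,1
        Persistent Portal: 10.10.1.3:3260,1
                Iface Name: default
                Iface Transport: iser
"""

for iqn, transport in iface_transports(SAMPLE).items():
    print(f"{iqn}: {transport}")
```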

Benchmarks

We are going to use the fio tool to do 512-byte direct I/O random reads and writes to the remote RAM disk devices. With a single device we are able to get around 70,000 random read and write IOPS (68.1K for writes and 70.1K for reads in the runs below). Here is the write workload:

$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb
asdf: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 1 process
^Cbs: 1 (f=1): [w] [0.0% done] [0KB/34454KB/0KB /s] [0/68.1K/0 iops] [eta 115d:17h:45m:24s]

And the read workload:

$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb
asdf: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 1 process
^Cobs: 1 (f=1): [r] [0.0% done] [35481KB/0KB/0KB /s] [70.1K/0/0 iops] [eta 115d:17h:45m:53s]s]

I would expect the performance to be better.
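The IOPS figures quoted here are read off fio's live status line. To collect them without retyping, the bracketed IOPS field can be pulled out with a regular expression, as in the sketch below. It assumes the three slash-separated values are read/write/trim, as in the fio 2.1.3 output above, and that the `K` and `M` suffixes mean multiples of 1,000; the function name is hypothetical.

```python
import re

_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}
_IOPS_FIELD = re.compile(
    r"\[([\d.]+)([KM]?)/([\d.]+)([KM]?)/([\d.]+)([KM]?)\s+iops\]"
)


def parse_fio_iops(status_line):
    """Extract read/write/trim IOPS from a fio status line."""
    match = _IOPS_FIELD.search(status_line)
    if match is None:
        raise ValueError("no '[r/w/t iops]' field found")
    values = [
        int(round(float(match.group(i)) * _SUFFIX[match.group(i + 1)]))
        for i in (1, 3, 5)
    ]
    return dict(zip(("read", "write", "trim"), values))


line = ("^Cbs: 1 (f=1): [w] [0.0% done] [0KB/34454KB/0KB /s] "
        "[0/68.1K/0 iops] [eta 115d:17h:45m:24s]")
print(parse_fio_iops(line))
```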

One iSCSI Target with 2 LUNs

Next we try again with one iSCSI target hosting two LUNs, and instruct fio to send I/Os to both devices. For writes we get noticeably less than a 2x speed-up, at 113,000 IOPS (about 1.7x the single-device rate):

$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [ww] [0.0% done] [0KB/56665KB/0KB /s] [0/113K/0 iops] [eta 115d:17h:46m:05s]

And a similar speed-up for reads, about 1.7x at 121,000 IOPS:

$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [rr] [0.0% done] [60537KB/0KB/0KB /s] [121K/0/0 iops] [eta 115d:17h:46m:25s]

This still seems really slow.
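To make the scaling concrete, the snippet below divides the two-LUN results by the single-device results. The IOPS values are the instantaneous figures from the fio status lines above, so the ratios are only approximate.

```python
# IOPS read from the fio status lines in this post.
single_device = {"write": 68_100, "read": 70_100}
two_luns = {"write": 113_000, "read": 121_000}

speedups = {
    op: two_luns[op] / single_device[op]
    for op in ("write", "read")
}

for op, speedup in speedups.items():
    print(f"{op}: {speedup:.2f}x with two LUNs (ideal would be 2.00x)")
```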

Two iSCSI Portals with 2 LUNs

Next we try creating separate portals, associating one LUN with each portal. Writes now perform a bit better, reaching 131K IOPS, but this is probably a peak we caught. For the most part it’s about 2x the single-device rate:

$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [ww] [0.0% done] [0KB/65281KB/0KB /s] [0/131K/0 iops] [eta 115d:17h:45m:36s] 

And reads are a bit better too at 126K IOPS:

$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [rr] [0.0% done] [63220KB/0KB/0KB /s] [126K/0/0 iops] [eta 115d:17h:44m:26s]

Still not that great.
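The fio command lines grow with every added device. fio also accepts INI-style job files, so one alternative is to generate a job file with the shared options in a `[global]` section and one section per device. The sketch below is an assumed equivalent of the options used in this post, not the exact invocation that produced the numbers; the function name and the file name mentioned afterwards are hypothetical.

```python
# Options shared by every run in this post, in fio job-file syntax.
GLOBAL_OPTIONS = [
    "bs=512",
    "numjobs=1",
    "iodepth=128",
    "runtime=9999999",
    "time_based",
    "loops=1",
    "ioengine=libaio",
    "direct=1",
    "invalidate=1",
    "fsync_on_close=1",
    "norandommap",
    "exitall",
]


def fio_job_file(rw, devices):
    """Return job-file text with one job section per block device."""
    lines = ["[global]", f"rw={rw}"] + GLOBAL_OPTIONS
    for device in devices:
        job_name = device.rsplit("/", 1)[-1]  # e.g. "sdb" for /dev/sdb
        lines += ["", f"[{job_name}]", f"filename={device}"]
    return "\n".join(lines) + "\n"


print(fio_job_file("randread", ["/dev/sdb", "/dev/sdc"]))
```

Saving the output as, say, `ramdisk.fio` and running `sudo fio ramdisk.fio` should then start one job per device.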

Four Targets with Four LUNs

The next experiment is four targets each with a separate LUN. Now we get roughly 200K IOPS for read and write workloads.

Write workload:

$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc --filename=/dev/sdd --name=sdd --filename=/dev/sde --name=sde
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdd: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sde: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 4 processes
^Cbs: 4 (f=7): [wwww] [0.0% done] [0KB/98105KB/0KB /s] [0/196K/0 iops] [eta 115d:17h:44m:53s]

And the read workload:

$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc --filename=/dev/sdd --name=sdd --filename=/dev/sde --name=sde
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdd: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sde: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 4 processes
^Cbs: 4 (f=7): [rrrr] [0.0% done] [98.11MB/0KB/0KB /s] [201K/0/0 iops] [eta 115d:17h:46m:13s]

I really expect much higher IOPS.
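One way to see how far these numbers are from the hardware limit is to convert IOPS into payload bandwidth and compare it with the network link. The calculation below assumes a single 10 Gb/s port and ignores iSCSI and RDMA protocol overhead, so it is only a rough estimate.

```python
BLOCK_BYTES = 512
LINK_BITS_PER_SECOND = 10 * 10**9  # assumption: one 10 Gb/s port


def link_utilization(iops):
    """Fraction of the link consumed by 512-byte payloads at this IOPS rate."""
    return iops * BLOCK_BYTES * 8 / LINK_BITS_PER_SECOND


# IOPS read from the fio status lines in this post.
results = {
    "1 LUN, write": 68_100,
    "1 LUN, read": 70_100,
    "4 targets, write": 196_000,
    "4 targets, read": 201_000,
}

for label, iops in results.items():
    megabytes = iops * BLOCK_BYTES / 1e6
    print(f"{label:18s} {megabytes:6.1f} MB/s  {link_utilization(iops):5.1%} of link")
```

Even the best case moves only about 100 MB/s of payload, under a tenth of the assumed link capacity, which suggests the limit is per-request overhead rather than raw bandwidth.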

Optimizations

This page (https://vanity-mellanoxexternal.jiveon.com/docs/DOC-1483) lists many optimizations that can help squeeze out more performance. However, after applying most of them, the performance didn’t really improve for me. The setup on that page differs from ours (it isn’t using RoCE, and it runs on x86 rather than arm64), but it reaches almost 2 million IOPS. I’m hoping I can figure out how to get more IOPS out of our setup.
