UFS: Pinpointing storage bottlenecks when using iSCSI vs DAS

Hi all,
For various reasons (that shall remain unrevealed) I am considering moving my storage from my HP Microserver N40L to iSCSI. This will involve moving the HBA from the N40L to another server. The new server will be the iSCSI target and the N40L the initiator.
Before I do this, I am running some tests on a parallel setup. The two servers are connected back-to-back over 10Gb Ethernet with a 9000-byte MTU.
I have run two of the three most obvious tests. The third will happen soon (after some hardware swapping).
I was obviously hoping for better results, and although I'm yet to run test #3, I'm hoping the community might join in my troubleshooting.
All tests are simple writes using dd with /dev/zero as the source and a block size of 8kbytes. The drive has a single partition formatted with UFS (with default settings). Both servers are running 13.1-RELEASE freshly installed.
The tests are:
#1: on the new server, local write to a 7K SAS drive
#2: on the N40L, iSCSI write to the same drive
#3: on the N40L, local write to a 7K SAS drive (not yet run as I need to swap out the HBA from the new server).
I have confirmed with iperf3 that both servers can achieve close to 10Gb/s with a 9000-byte MTU:

Code:
New server -> N40L (TCP)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.5 GBytes 9.89 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 11.5 GBytes 9.88 Gbits/sec receiver

N40L -> New server (TCP)
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.4 GBytes 9.81 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 11.4 GBytes 9.82 Gbits/sec receiver
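For reference, output like the above comes from the standard iperf3 server/client pair; the peer address below is just a placeholder:
Code:
# on one host
iperf3 -s
# on the other host (10.0.0.2 stands in for the peer's address)
iperf3 -c 10.0.0.2 -t 10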

Here is test #1 (local write on new server):
Code:
# dd if=/dev/zero of=/mnt/a bs=8192 count=1310720
1310720+0 records in
1310720+0 records out
10737418240 bytes transferred in 59.083653 secs (181732472 bytes/sec)

Test #2: iSCSI write (N40L with new server's drive mounted via iSCSI):
Code:
# dd if=/dev/zero of=/mnt/a bs=8192 count=1310720
1310720+0 records in
1310720+0 records out
10737418240 bytes transferred in 108.958877 secs (98545603 bytes/sec)

This is roughly half the speed.

Now, I know the new server can write to the local drive at 181MB/s, and the network can support ~1.1GB/s.
Can anyone please suggest the next place to look?

Thanks,
Scott
 
Try the same thing with much larger I/Os in dd, for example 8 MiB instead of 8 KiB. And try reads instead of writes.

My theory is that the iSCSI latency overhead per IO causes the next sequential write to miss one rotation occasionally.
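One way to check that theory, as a sketch (the device name and interval are placeholders), is to watch the per-I/O service times on the target while the dd runs:
Code:
# on the iSCSI target, in another terminal during the write test;
# ms/w is the average service time per write. A 7200 RPM drive takes
# ~8.3 ms per rotation, so ms/w climbing toward that suggests writes
# are occasionally waiting out a full rotation rather than streaming.
gstat -f 'da0' -I 1s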
 
A couple of observations from my days benchmarking disks, iSCSI(4), and lagg(4):
  • jumbo frames might work well on some specific workloads, but I could never get them to deliver a good outcome, and I did try hard; and
  • a synthetic benchmark like benchmarks/bonnie++ will give you a better sense of what's possible (just skip the character-at-a-time I/O tests; a minimal invocation is sketched below).
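Something along these lines, assuming the filesystem under test is mounted at /mnt and the box has no more than 16GB of RAM (the file size should be at least twice RAM so the cache can't flatter the numbers):
Code:
# -d: directory on the filesystem under test
# -s: total file size in MiB (32 GiB here; use >= 2x RAM)
# -f: fast mode, skips the per-character I/O tests
# -u: user to run as when started as root
bonnie++ -d /mnt -s 32768 -f -u root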
 
Try the same thing with much larger I/Os in dd, for example 8 MiB instead of 8 KiB. And try reads instead of writes.

My theory is that the iSCSI latency overhead per IO causes the next sequential write to miss one rotation occasionally.
Here are some local reads with dd using 8MB blocks:
After a few hours without testing:
Code:
# dd if=/mnt/a of=/dev/null bs=8388608
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 56.079198 secs (191468826 bytes/sec)

A few seconds later:
Code:
# dd if=/mnt/a of=/dev/null bs=8388608
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 2.390279 secs (4492119888 bytes/sec)

That's some serious caching!

Just to satisfy my curiosity, let's try again with a source file containing random data rather than \0s:
Make the file:
Code:
dd if=/dev/random of=/mnt/100G-random bs=8388608 count=1280
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 60.652449 secs (177031901 bytes/sec)

Now local reads of the random data:
Code:
root@freebsd-test-01:~ # dd if=/mnt/a-random of=/dev/null bs=8388608 count=1280
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 56.573594 secs (189795582 bytes/sec)

Some seconds later:
Code:
root@freebsd-test-01:~ # dd if=/mnt/a-random of=/dev/null bs=8388608 count=1280
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 2.288979 secs (4690920060 bytes/sec)

OK, ignoring that distraction, back to the action. We've established that local reads are slow (~1.5Gb/s) after some time passes and very quick (~36Gb/s) after a recent read, regardless of the content.

Back to iSCSI reads:

Code:
dd if=/mnt/a of=/dev/null bs=8388608
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 67.147063 secs (159908979 bytes/sec)
That's about 1.3 Gb/s.

And some seconds later:
Code:
dd if=/mnt/a of=/dev/null bs=8388608
1280+0 records in
1280+0 records out
10737418240 bytes transferred in 25.983411 secs (413241291 bytes/sec)
About 3.3Gb/s.

Same result for the random file.

So we know that iSCSI reads can be done across the 10Gb link at 3Gb/s or better.

I'm not sure if any of this data is helpful.
What I found confusing were the tcpdump captures I ran on the target for the fresh and subsequent reads:
For the initial read (after a period of quiescence) the capture file was 4GB; for subsequent reads it was only 1.5GB. That means less traffic was pulled across the Ethernet on the later reads, so the speedup can't simply be the target serving from its own cache. Any ideas on this one?
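One thing that might make the fresh-vs-cached comparison more repeatable: unmounting and remounting the filesystem on the initiator invalidates its cached pages for that mount, so every run has to pull the full 10GB across the wire again. A sketch (device and mount point are placeholders):
Code:
# on the initiator, between test runs
umount /mnt
mount /dev/da0p1 /mnt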

More soon. Thanks.
 
It might help if you mention the network cards you are experimenting with.
Chelsio has some driver sysctl tweaks that enable iSCSI TCP offload (TOE).
 
It might help if you mention the network cards you are experimenting with.
Chelsio has some driver sysctl tweaks that enable iSCSI TCP offload (TOE).

Hi,

The initiator has an Intel 10Gb card:
Code:
ix0: <Intel(R) X520 82599ES (SFI/SFP+)> port 0xe800-0xe81f mem 0xfe880000-0xfe8fffff,0xfe87c000-0xfe87ffff irq 18 at device 0.0 on pci2
ix0: Using 2048 TX descriptors and 2048 RX descriptors
ix0: Using 2 RX queues 2 TX queues
ix0: Using MSI-X interrupts with 3 vectors
ix0: allocated for 2 queues
ix0: allocated for 2 rx queues
ix0: Ethernet address: 00:1b:21:bc:4c:3e
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: eTrack 0x00012b2c PHY FW V65535
ix0: netmap queues/slots: TX 2/2048, RX 2/2048

The target is a VM on ESXi 6.7, connected to a standard vSwitch with a Cisco VNIC uplink:

Code:
Cisco Systems Inc Cisco VIC Ethernet NIC
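Since the target sits behind a vSwitch, one thing worth confirming is that the 9000-byte MTU actually survives the whole path end to end. A don't-fragment ping sized just under the MTU will show that (the address is a placeholder):
Code:
# 8972 data bytes + 8 (ICMP header) + 20 (IP header) = 9000; -D sets don't-fragment
ping -D -s 8972 10.0.0.2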
 
One oddity(?) I've noticed on several occasions is that UFS uses _a lot_ more I/O than ZFS, especially when performing write tasks, and it scales poorly. A very simple example is to create an md device, format it with UFS, and try, let's say, cloning the ports repo or just copying /usr/src. Look at gstat and go "uhm...." ;-)
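A minimal sketch of that experiment (size, device unit, and mount point are just examples):
Code:
# create a 4GB swap-backed memory disk, put UFS on it, and mount it
mdconfig -a -t swap -s 4g        # prints the unit it created, e.g. md0
newfs /dev/md0
mkdir -p /mnt/md-test
mount /dev/md0 /mnt/md-test
# copy a big tree onto it and watch the I/O pattern in another terminal
cp -a /usr/src /mnt/md-test/
gstat -f 'md0'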
 