My NVMe experience

A: Flash drives can have enormously high power consumption, and therefore produce a lot of heat. I was just looking at some NVMe drive specs, and they can dissipate 17 W in a 2.5" SFF form factor. Good cooling (airflow) is mandatory at these power levels; otherwise the drive will throttle itself and performance will fluctuate.

B: To measure the maximum performance of a flash drive, you need to drive a workload with a high queue depth. I have no idea what diskinfo does internally. Here would be my proposal: find a good disk benchmarking tool and set it for a queue depth of 32 or 64, or run that many copies of dd in parallel. In Python it's actually pretty easy to write a script that creates a very large file (a few GB) and then does random reads and writes of that file using the read() and write() system calls; then run a few dozen copies of that program in parallel. The advantage of doing this in Python is that you can run the same test on different OSes.
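
A minimal sketch of that idea (every constant below is a made-up example, and without O_DIRECT or fsync the page cache will absorb part of the load, so treat the output as relative numbers rather than an absolute benchmark):
Code:
# Sketch of the parallel random-I/O test described above. Every constant
# below (path, sizes, worker count, duration) is a made-up example.
import os
import random
import time
from multiprocessing import Process

PATH = "/mnt/nvme/testfile"   # hypothetical location on the drive under test
FILE_SIZE = 4 * 1024**3       # 4 GiB test file
BLOCK = 4096                  # 4 KiB per I/O
WORKERS = 32                  # stands in for queue depth
DURATION = 30                 # seconds each worker runs

def prepare():
    # Create the large test file once, sequentially.
    chunk = os.urandom(1024 * 1024)
    with open(PATH, "wb") as f:
        for _ in range(FILE_SIZE // len(chunk)):
            f.write(chunk)

def worker(seed):
    # Each worker does random 4 KiB reads and writes until the deadline.
    rng = random.Random(seed)
    fd = os.open(PATH, os.O_RDWR)
    buf = os.urandom(BLOCK)
    ops = 0
    deadline = time.time() + DURATION
    while time.time() < deadline:
        offset = rng.randrange(0, FILE_SIZE - BLOCK, BLOCK)
        os.lseek(fd, offset, os.SEEK_SET)
        if rng.random() < 0.5:
            os.read(fd, BLOCK)
        else:
            os.write(fd, buf)
        ops += 1
    os.close(fd)
    print(f"worker {seed}: {ops / DURATION:.0f} IOPS")

if __name__ == "__main__":
    if not os.path.exists(PATH):
        prepare()
    procs = [Process(target=worker, args=(i,)) for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()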

C: In the end, the only thing that matters is performance for your workload, and your personal cost/benefit analysis. Is this thing cheap and fast enough for what you want to accomplish?
 
It's more about feeling robbed. The manufacturers are only giving 2 PCIe lanes to the M.2 slot when it needs 4 lanes.

SuperMicro also seems to use x2 lanes even when they have a chart that says 'NVMe support'.
https://www.supermicro.com/products/nfo/M.2.cfm

I guess to them 2 lanes means NVMe support. (Most models on their list only do x2, except the socket 1151 Skylake boards.)
They also like to quote the 'x2 - 10Gb' figure in the manuals, with x2 meaning 2 lanes. Note that 10Gb is not 10GB:
10 Gb/s is only 1250 megabytes/sec, while the good NVMe drives do 2500 megabytes/sec in a PCIe slot. Same hardware.

Hence my feeling of being robbed. The loss is linear. At half the lanes, half the throughput.
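
For anyone who wants to sanity-check those numbers, here is a small back-of-the-envelope calculator for the raw PCIe link bandwidth (the generations and lane counts below are just examples; real drives lose a bit more to protocol overhead):
Code:
# Rough PCIe link-bandwidth calculator: transfer rate per lane times the
# line-code efficiency of that generation. Protocol overhead is ignored.
GT_PER_LANE = {2: 5.0, 3: 8.0, 4: 16.0}             # GT/s per lane
ENCODING = {2: 8 / 10, 3: 128 / 130, 4: 128 / 130}  # encoding efficiency

def link_mb_per_s(gen, lanes):
    # payload Gb/s = GT/s * efficiency; /8 for GB/s, *1000 for MB/s
    return GT_PER_LANE[gen] * ENCODING[gen] * lanes / 8 * 1000

for gen, lanes in [(3, 2), (3, 4)]:
    print(f"PCIe {gen}.0 x{lanes}: ~{link_mb_per_s(gen, lanes):.0f} MB/s")
# PCIe 3.0 x2: ~1969 MB/s
# PCIe 3.0 x4: ~3938 MB/s  -> plenty for a ~2500 MB/s drive; x2 is not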

I am using mine for Poudriere and Crochet building.

It is really ironic to me that a Supermicro LGA2011-v3 board that takes a CPU with 40 lanes only gives 2 lanes to the NVMe slot, yet the SuperMicro socket 1151 Skylake boards (*with only a 16-lane CPU*) give 4 lanes to the NVMe slot, giving it full speed. How bizarre is that?

I do notice that most of the boards based on the X99 chipset run the NVMe slot at x4.
 
I see ... the reality didn't match your expectations. Obviously, manufacturers' specifications are written so obtusely that one really has to go over them with a fine-toothed comb. With dumb things like "confusing" Gb and GB, it's easy to suspect that the manufacturers are dishonest, but I'll assume incompetence rather than malice.

The thing to remember is this: to the manufacturer of the motherboard/computer (whether it's Dell or SuperMicro), you are small fry: you buy one computer, or 10. Large fry are people like Facebook, Citibank, or the No Such Agency, who buy tens of thousands. I'm sure that if this class of customers cared about the NVMe bandwidth into the M.2 socket, then (a) their purchasing specialists would have discussed the question ahead of time with the vendor, and (b) the vendor would take their wishes into account. But neither of us is in that category.

The basic ill here is that the Intel CPUs and chipsets have relatively few PCIe lanes, and fixing that with PCIe multiplexers costs real money. The motherboard manufacturers know where most customers want the PCIe lanes (GPUs, and slots for IO like IB, SAS, Ethernet and FC cards); that immediately implies that the other uses get shortchanged. I see the same problem with motherboards that have one slot with lots of PCIe lanes (intended for GPUs) and then under-provision the other slots; for building IO servers, this is nasty. We also have to remember that for most customers the M.2 slot is not intended for performance; it's for a convenient small boot disk, while the "real" IO happens via adapter cards that sit in the "real" slots.

That brings me to a productive suggestion: you can buy PCIe adapter cards that go into slots and have M.2 sockets. I've used a card that allowed putting four M.2 modules into a single x16 slot (for performance testing, but that card was pretty special and probably expensive), and I know that cheap cards for a single M.2 module exist. That would get you your x4 connection back without having to swap motherboards. Just found one on the web: the Asus Hyper M.2 X4, which costs $15.

Here is an interesting performance engineering question: have you considered getting a handful of PCIe-slot-based NVMe modules and using those for your file system instead? Given your workload (I presume you're running your builds in parallel), striping over multiple slot cards might give you more speed per dollar than a single M.2 card. With parallel makes and a good file system, you can probably generate enough queue depth on the back end to keep multiple cards busy. On the other hand, PCIe-based NVMe cards tend to be more expensive than M.2 cards, so this is probably not a good proposal from a financial viewpoint.
 
It's more about feeling robbed. The manufacturers are only giving 2 PCIe lanes to the M.2 slot when it needs 4 lanes.

Intel has traditionally been pretty stingy with the number of PCIe lanes on their CPUs (this is changing on at least some of the newer models). Manufacturers like Supermicro probably assume that more customers are going to want those lanes in PCIe slots for things like NICs (servers) or SLI / Crossfire graphics (enthusiasts). And some of the people who might use the NVMe slots to save on motherboard slots (as opposed to speed) will probably prefer SATA DOM anyway. As NVMe becomes more popular, we'll probably see more lanes made available and possibly more on-board sockets. Until then, if you need more M.2 NVMe performance than you get with the on-board socket(s), you can use a PCIe to M.2 adapter card in one of the slots. Just be aware of potential PCIe slot bifurcation issues when using cards with more than one module.
 
Terry: Completely agree. Also, I think in the high-end IO server market, we'll start seeing more M.2 slots (with full performance, and perhaps soon multiple sockets per motherboard). That's because those slots make a convenient place for ultra-fast cache storage, while the "mainstream" PCIe slots are filled with network and "disk" IO cards (when I say disk, it may very well mean flash storage, and disks can be connected via SATA, SAS, NVMe, PCIe fabric, and who knows what else). This will become particularly urgent once M.2 NVMe cards with large and persistent DRAM caches become mainstream, because those will be very useful for building fast IO servers.

In theory, such an IO-server motherboard would make a great platform for IO-intensive application servers (and Phishfry's compile server is a prime example of such a workload). But I fear that these specialized motherboards may remain quite expensive and niche, because of a simple economic argument: anyone who builds a system with 100 or 300 disks (which cost tens of thousands of dollars) or between a dozen and hundreds of SSDs (even more expensive) won't care whether their motherboard costs $90 or $900.
 
The new Xeon D boards give 4 lanes to the M.2 slot. OTOH there are fewer PCIe slots, at least on the uATX boards, so there are enough lanes available anyway. The mini PCIe slot is only PCIe 2.0, but I've not seen it used on many systems. If it is used, it's mostly for mSATA drives, which are slower than the interface anyway, so it makes sense not to waste PCIe 3.0 lanes there.

Most M.2 drives I have played around with suffered from excessive heat and bad airflow at their location on the board, to the extent that they throttled heavily after only a few minutes of operation under high I/O. So for high-performance applications you'd have to go for full-sized (and properly cooled) PCIe SSDs anyway, or at least place the M.2 drives on a PCIe adapter card that properly exposes them to the chassis airflow.

As for huge, fast storage servers: it seems the agreed-upon way to go during the next few years will be NVMe (over Fibre). Caching on such systems doesn't make much sense with conventional PCIe-/NVMe-attached flash anymore, so that's where technologies like Intel Optane will have to fill the speed gap to RAM.
M.2 PCIe/NVMe might therefore already be seen as a dead end except for small boot devices and for getting rid of SATA DOMs or USB drives, and there speed isn't that crucial. So this might be another reason why on most boards the M.2 slot doesn't have a high priority when it comes to allocating the available PCIe lanes.
 
It should not be a problem any more with those new AMD X399 and EPYC motherboards; however, it seems Supermicro still gives just two lanes to the M.2 slot. o_O
 
Phishfry

I just remembered: I was told some time ago by a "datacenter" guy who often deals with SM that good SM resellers can often arrange custom SM boards, even in quantities of one. I mean, if you want board XYZ but with/without feature ABC, they supposedly can arrange it.

It will be more expensive, but in theory it would be possible to get the board you need with 4 lanes.
 
The X10SDV boards give 4 lanes to the M.2 slot:
https://www.supermicro.com/products/motherboard/Xeon3000/#1667

I don't think they will use fewer lanes on EPYC boards, given there are a lot more lanes available from the CPU than with Xeon...

The AMD site says EPYC has 128 lanes per CPU, but unfortunately Supermicro is still giving just 2 lanes to the M.2 slot on the board they have already launched, and that one has 2 sockets: the H11DSiNT.

EDIT: unless the info in the Supermicro site is wrong.
 
There are also 2 dedicated NVMe ports available that are connected via 4 lanes each.
I think the M.2 port is intended mainly for use with SATA M.2 SSDs, which are so slow anyway that 2 lanes are sufficient. All the other M.2 expansion cards available are various WiFi/radio/NFC/"anything wireless" controllers, which aren't really relevant for servers and have such low data rates that more than 2 lanes aren't necessary.
 
(Most (Supermicro) models on their list only do x2, except the socket 1151 Skylake boards.)
This was an overly broad statement; you are correct. The X10SDV does in fact offer x4 PCIe lanes.
I will only buy a socketed board, as I refuse to spend $600 on a motherboard where I can't change the CPU.
To me they are charging way too much for the Xeon D SoC. If I can't upgrade, you need to knock some money off the price.
 
The SoC is not just about upgrades; the board is usually (one of) the first thing(s) to fail. So you can end up with a failed board containing a perfectly fine CPU that you can't use anywhere...
 
Have you seen the price of the D1541 boards? An 8-core/16-thread CPU plus board is $830.
I agree that the board will probably go bad before the CPU.

Regardless, if the C27XX experience wasn't enough to make my point then nothing is.
https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/

Most LGA2011-v3 socket boards will take a v4 CPU with a firmware flash.
That is what I call a value-added benefit of a socketed CPU platform. You may not be guaranteed an afterlife, but it is possible.
 
Before you get upset about a particular vendor's pricing, lack of sockets, and incompatibility, consider this: the vast majority of motherboards and systems made by SuperMicro (in particular!) and the other major vendors are sold to commercial users, in particular server farms. The number of people who still have real motherboards at home is small; the number of people who would know how to choose motherboard, CPU, memory, and IO separately is infinitesimally tiny. And in corporate use, the idea of upgrading the CPU while keeping the motherboard (or the other way around) doesn't fly: way too much work, way too much risk.

As a matter of fact, a trend that is getting very common is "complete system swap" and then "fail in place": if something (anything) breaks on a server, either just power it down and put a new one in (without doing much diagnosing of which internals are at fault), or even just power it down and leave it in place. That's because the manpower to diagnose, repair and replace stuff is more expensive than new hardware. This is particularly true with the cluster-based cloud usage models: whether Amazon runs your virtual instance on machine A or machine B makes no difference to you; whether Google or Facebook or Bank of America uses 1562 or 1561 nodes to do the analytics that keeps their business going also makes only an irrelevant difference.
 
I think I haven't upgraded a CPU in a server in the last ~10 years. Adding RAM or changing HBAs or other controllers, yes - but the CPUs were mostly planned for the task of the system with some headroom, and by the time they were no longer up to their task, upgrading the CPU wouldn't have been economical anyway... Re-purposing the system for another 1-2 years and just buying a new(er) one for the task that grew beyond the old system's capacity was always the better way.

I can somewhat understand these reservations regarding the Atom platform - these things were/are hopelessly underpowered, so the urge to upgrade is there from the beginning (I also still have to mess with 2 Atom-powered NAS systems :rolleyes:).
A few weeks ago I deployed a Xeon D as a small gateway and server for basic network services and file serving in a small branch office, and even with 4 cores / 8 threads these things are quite beefy and WAY more powerful than any Atom-based system. In fact they play in the league of the smaller E3 systems, and getting everything on the board/SoC greatly reduces the time for physical deployment: add some RAM and disks, put it in the rack and you're done.

Also compare the power consumption of 2011-v3/v4 systems and the Xeon D platforms; now add up this difference over 3-5 years of 24/7 runtime. Add the cost of the bigger UPS needed and new batteries for it every ~2 years. Add the higher cooling requirements for all this... Yes, I also hate bean counting, but even for small businesses it mostly makes more sense to just buy a new server with lower power requirements and higher performance. (Plus: if you just jump on that train, you regularly get some new, shiny servers to play with ;) )
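
To put rough numbers on that (every figure below is a made-up placeholder, not a measurement):
Code:
# Back-of-the-envelope power-cost comparison; all numbers are hypothetical.
WATT_DELTA = 100        # assumed extra draw of the 2011-v3/v4 box, in watts
HOURS = 24 * 365 * 4    # 4 years of 24/7 runtime
PRICE_KWH = 0.15        # assumed electricity price in $/kWh
print(f"~${WATT_DELTA / 1000 * HOURS * PRICE_KWH:,.0f} extra over 4 years")
# -> ~$526, before the UPS, batteries and cooling overhead mentioned above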

BTW: You can get XeonD boards a lot cheaper from ASRock (if you want to deal with desktop-grade hardware and crappy/buggy firmware in your servers)
 
As much moaning as I did about the speed of my NVMe, I am really enjoying the speed on my NanoBSD build server. Builds are 3 times quicker than on a SATA3 SSD. When it comes to compiling, I have found disk speed to be the key factor, along with more cores.
 
Hello from 2020. I've bought an SSD sold under the AMD Radeon brand. nvmecontrol identify shows these lines:
Code:
Controller Capabilities/Features
================================
Vendor ID:                   126f
Subsystem Vendor ID:         126f
Serial Number:               E201810230020034
Model Number:                R5MP120G8
Firmware Version:            R1015A0

The drive is almost useless. Reads from the drive are OK, but writes cause many of these errors in the kernel console:
Code:
nvme0: Resetting controller due to a timeout.
nvme0: aborting outstanding i/o

Writing a single 100 MB file can take a minute, and there is a chance that the file will end up corrupted. If you look at the bug tracker you will see many still-unresolved reports like mine. So you can conclude that FreeBSD lacks NVMe support unless you are lucky enough to buy some special drive that will actually work.

Version of FreeBSD is 12.1-RELEASE-p2
 
So you can conclude that FreeBSD lacks NVMe support unless you are lucky enough to buy some special drive that will actually work.
Put another way:
So you can conclude that FreeBSD has NVMe support unless you are unlucky enough to buy some special drive that does not work correctly.

I know it can be frustrating when hardware does not work. But you have bought a module I have never heard of.
Stick to Samsung, Toshiba or Western Digital. These are all brands that are in the storage business. AMD is not.
 
I was finally able to find a few NVMe controller cards that worked.
1.) This card uses PCIe bifurcation, so your board must support splitting its PCIe slot lanes:
https://www.ebay.com/itm/193343663275
http://www.ioi.com.tw/products/proddetail.aspx?CatID=106&DeviceID=3050&HostID=2073&ProdID=1060223
This card presents its drives as nvd/nda to FreeBSD. The PM983 drives perform slightly better on this controller:
Code:
nvme0: <Generic NVMe Device> mem 0xfb510000-0xfb513fff irq 26 at device 0.0 on pci2
nda0 at nvme0 bus 0 scbus10 target 0 lun 1
nda0: nvme version 1.2 x4 (max x4) lanes PCIe Gen3 (max Gen3) link
Code:
diskinfo -t /dev/nda0
/dev/nda0
    512             # sectorsize
    960197124096    # mediasize in bytes (894G)
    1875385008      # mediasize in sectors
    512             # stripesize
    0               # stripeoffset
    SAMSUNG MZQLB960HAJR-000AZ                 # Disk descr.
    S3VKNE0KA05944         # Disk ident.
    Yes             # TRIM/UNMAP support
    0               # Rotation rate in RPM

Transfer rates:
    outside:       102400 kbytes in   0.052742 sec =  1941527 kbytes/sec
    middle:        102400 kbytes in   0.052228 sec =  1960634 kbytes/sec
    inside:        102400 kbytes in   0.052517 sec =  1949845 kbytes/sec


2.) The LSI/Broadcom/Avago SAS94xx series supports NVMe through their Tri-Mode feature.
The kicker here is you need a special $120 U.2 cable for this to work.
Broadcom cable #05-50065-00 (0.5M U.2 Enabler Cable HD SFF-8643 To two SFF-8482 connectors)
This supports two U.2 NVMe drives. I have found a generic version on eBay for $75:
https://www.ebay.com/itm/302909883622
The SAS94xx boards present their drives as da drives.
Code:
mrsas0: FW now in Ready state
mrsas0: Using MSI-X with 20 number of vectors
mrsas0: FW supports <128> MSIX vector,Online CPU 20 Current MSIX <20>
mrsas0: max sge: 0x46, max chain frame size: 0x400, max fw cmd: 0x5ec
mrsas0: Issuing IOC INIT command to FW.
mrsas0: IOC INIT response received from FW.
mrsas0: NVME page size  : (4096)
mrsas0: FW supports SED
mrsas0: FW supports JBOD Map
mrsas0: FW supports JBOD Map Ext
mrsas0: Jbod map is supported
mrsas0: System PD created target ID: 0x1
mrsas0: max_fw_cmds: 1516  max_scsi_cmds: 1500
mrsas0: MSI-x interrupts setup success
mrsas0: mrsas_ocr_thread

da0 at mrsas0 bus 1 scbus1 target 1 lun 0
da0: <NVMe SAMSUNG MZQLB960 3W0Q> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 5162_A04B_304B_5633_0100_0000_4538_2500.
da0: 150.000MB/s transfers
da0: 915715MB (1875385008 512 byte sectors)
Code:
diskinfo -t /dev/da0
/dev/da0
    512             # sectorsize
    960197124096    # mediasize in bytes (894G)
    1875385008      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    116737          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    NVMe SAMSUNG MZQLB960    # Disk descr.
    5162_A04B_304B_5633_0100_0000_4538_2500.    # Disk ident.
    Yes             # TRIM/UNMAP support
    0               # Rotation rate in RPM
    Not_Zoned       # Zone Mode

Transfer rates:
    outside:       102400 kbytes in   0.055840 sec =  1833811 kbytes/sec
    middle:        102400 kbytes in   0.055777 sec =  1835882 kbytes/sec
    inside:        102400 kbytes in   0.055299 sec =  1851751 kbytes/sec
 
I have the 9400-16i Tri-Mode Storage Adapter and will soon need one of those cables. I'm thinking about adding a couple of U.2 drives.
 
For anybody having NVMe speed issues, I'd also check your drive temperature. I was seeing horrible performance on my Intel drives and realized that they were being throttled due to overheating. After adding $5 eBay heatsinks they performed as expected.
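
If you want to keep an eye on that, something like this can poll the drive's SMART/health log; it is only a sketch that assumes FreeBSD's nvmecontrol(8) prints a "Temperature" line in log page 2, and the device name is just an example:
Code:
# Crude temperature poll via nvmecontrol's health log page (page 2).
import re
import subprocess
import time

DEV = "nvme0"  # example device name; adjust for your system

while True:
    out = subprocess.run(["nvmecontrol", "logpage", "-p", "2", DEV],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if re.search(r"temperature", line, re.IGNORECASE):
            print(line.strip())      # print any line mentioning temperature
    time.sleep(5)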

Also, for anybody wanting to push performance... I ended up testing 3 configurations for mirrored drives (a rough script for reproducing the sequential read numbers is sketched after the list).

a) 2x NVMe drives in a ZFS mirror
- this gave the performance of a single drive, at about 1 GB/s
- my testing always suggests that ZFS mirrors don't increase performance (unless across spanned mirrors)

b) 2x NVMe drives in gmirror with ZFS (non-mirrored) on top
- this was a bit more reasonable at around 2 GB/s
- my CPU is sluggish, so I think it's just the extra overhead slowing it down

c) 2x NVMe drives in gmirror using UFS instead
- this was by far the best performing, with 2.6 GB/s+ sequential scanning
- I see real-world PostgreSQL table/index scans of above 800Mb/s (the DB is the bottleneck here)
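
A rough way to reproduce that kind of sequential read figure (the device path is a hypothetical example; use a file or device much larger than RAM to avoid cache effects):
Code:
# Crude sequential-read throughput check: read 4 GiB in 1 MiB chunks.
import os
import time

PATH = "/dev/mirror/gm0"     # example gmirror device; needs root to read
CHUNK = 1024 * 1024          # 1 MiB per read
TOTAL = 4 * 1024**3          # read 4 GiB in total

fd = os.open(PATH, os.O_RDONLY)
start = time.time()
done = 0
while done < TOTAL:
    buf = os.read(fd, CHUNK)
    if not buf:              # reached end of file/device early
        break
    done += len(buf)
os.close(fd)

elapsed = time.time() - start
print(f"{done / elapsed / 1e6:.0f} MB/s sequential read")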
 
So I have invested in a bunch of Samsung PM983 drives, some in M.2 format and some in U.2 format as 2.5" drives.
I just bought a 4-bay Icydock MB699VP for the 1.92TB PM983 U.2 drives.

Now I see they have changed the connector for 2.5" NVMe PCIe 4.0 drives and now call the form factor U.3.
There seem to be some connectors that offer backwards compatibility with U.2, and some are U.3-only.
The Samsung PM1733 and PM1735 lines are U.3-only.
So new cabling and backplanes are required.

Has anyone tested a PCIe 4.0 drive in a PCIe 3.0 slot? I am aching to try that. I wanna see where PCIe 3.0 x4 maxes out.
PCIe 4.0 drives are compatible with PCIe 3.0, just with reduced bandwidth.
From looking at some reviews, the PCIe 4.0 drives are giving ~7,000 MB/s reads, while writes are ~2000-3000 MB/s.
 