Desperate with 870 QVO and ZFS

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.
Good thought. However, some caution may be indicated. It's a mail server, so depending on what's in the datasets, atime might be important.
 
The Samsung QVO disks are the slowest class of SSD; they are a quite good, cheap alternative for storage, but not for I/O-intensive use. They have a quite LARGE CACHE DROP EFFECT, which means that even with continuous sequential I/O they will quite quickly drop from ~500 MB/s to ~160 MB/s, as shown in the image below.


[Attached image: sequential write throughput dropping from ~500 MB/s to ~160 MB/s once the QVO's cache is exhausted]


You can read more about that here:
 
Buffer size is not the only issue that can choke the write speed.

But, you can't manage what you don't measure.

Testing an SSD suited to sustained writing is one very quick way to eliminate all maladies relating to QLC.
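For instance, a quick sustained-write test might look something like this (the target path and size are only placeholders, and compression should be off on the target dataset or the numbers will be meaningless):
Code:
# Write ~20 GB of incompressible data sequentially and watch whether the
# rate collapses once the drive's SLC cache fills up.
dd if=/dev/random of=/pool/scratch/testfile bs=1m count=20480 status=progress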

Otherwise, start taking observations with the actions suggested by SirDice and Eric A. Borisch above, and discuss the results.
Hi Gpw928!

Totally agree man :) :) Thank you!!

I'm trying to summarize all the ideas you gave me :) so I can take action on them :) :)

Many many thanks for all your help!!
 
I could saturate the I/O of my IBM datacenter SSDs with a handful of client connections ... just saying.
I see, although... they could perhaps manage the load better than the QVOs do....
Hm, really? I mean, if the primary gets a mail delivered, it should copy it to the replica, right? So we have at least twice the I/O compared to just having one primary server.
Yes, this is true. But later, when an issue happens on the master server, you live very relaxed because everything is identical on the master and the slave server....

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.
Hi Rootbert!!

Thank you for your comments :) :) and your help!! :) :)
 
Hi!! Thanks for your answer, rootbert!!

I could saturate the I/O of my IBM datacenter SSDs with a handful of client connections ... just saying.

You are evil :p :p

Hm, really? I mean, if the primary gets a mail delivered, it should copy it to the replica, right? So we have at least twice the I/O compared to just having one primary server.

Well, yes, you duplicate the content, but.... you live far more relaxed when the master machine crashes :) :) because it takes less than 10 minutes to promote the slave to master....

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.

Mmm... we have it at the default.... so I assume the access time flag is enabled.... Do you think this flag is still so noticeable nowadays? Even with SSDs?
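For reference, checking and turning off atime on a dataset would look roughly like this (the dataset name is just an example):
Code:
# show the current setting (ZFS defaults to atime=on)
zfs get atime zroot/var/mail
# stop updating access times on that dataset
zfs set atime=off zroot/var/mail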
 
Hi Vermaden!!

Thank you for posting and helping me :) :)


The Samsung QVO disks are the slowest class of SSD; they are a quite good, cheap alternative for storage, but not for I/O-intensive use. They have a quite LARGE CACHE DROP EFFECT, which means that even with continuous sequential I/O they will quite quickly drop from ~500 MB/s to ~160 MB/s, as shown in the image below.
We knew they used a buffer to compensate for the QLC slowness... but we also thought we would never end up filling that whole speed buffer... yet that's what happened...
I'll take a look at the provided URL.

Thanks a lot!!!
Cheers!
 
Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail. The iostat command and "zpool iostat" give you a part of it, but they don't show queue depth and latency. You can get an upper limit of latency from Little's law (inverse of IO rate), but for a partially idle disk that is not useful.

This will help you debug whether the bottom IO layer is really the bottleneck. And if it is, it will allow you to measure how the system's behavior changes as you adjust parameters.
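As a starting point, something like this gives at least part of that picture (the pool name is just an example; depending on the OpenZFS version, zpool iostat -l and -q may also add latency and queue columns):
Code:
# extended per-device statistics, refreshed every second
iostat -x -w 1
# per-vdev throughput for the pool, refreshed every second
zpool iostat -v tank 1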
 
Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail.
I think I see all of this in gstat. Anyway, I see lots of SSD misbehaviour there - and I've seen so much of it by now that I don't want to see any more. I used to think the only problem with these pieces is that they die, sooner or later (and usually at the most unpleasant time). That is not the case; there are lots of other problems. E.g. they are utterly unsuitable for raid5 or similar, because read and write speeds change erratically all the time.
If I had an application similar to this one here, I would try with solid spinning disks and pimp that up with a properly sized L2ARC plus a write intent log (if appropriate) plus a special vdev; these on some small and solidly performing SSDs. (If one could get some; they seem to be mostly out of stock currently, only that Q stuff is readily available.)
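Purely as an illustration of that layout (pool name and device names are placeholders, not a recommendation for this particular system):
Code:
# add a mirrored SLOG (write intent log) on two SSD partitions
zpool add tank log mirror ada7p1 ada8p1
# add an L2ARC read cache device
zpool add tank cache ada7p2
# add a mirrored special vdev for metadata and small blocks
zpool add tank special mirror ada9 ada10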

Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     10     10    632    0.6      0      0    0.0    0.5| ada0
    0    119     21     83   10.4     98   1055    0.8    4.0| ada1
    0      3      3    190    0.4      0      0    0.0    0.1| ada2
   20    116     55   1838   67.7     60   1498   64.4   98.9| ada3
    0      7      7    443    0.4      0      0    0.0    0.3| ada4
    0      0      0      0    0.0      0      0    0.0    0.0| ada5
    0    152     18     71    0.4    134   1051    0.1    1.0| ada6

Here we see the queue depth on the left, some peak ms per request, when a device gets saturated (like ada3 here), and the throughput. (There are options to also show delete (-d) and other (-o) commands.) And, as usual with performance matters, this does not show ready-made what is wrong, but it is something to watch to get a feeling for what normal performance looks like and what, then, is not normal performance.
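For reference, output like the above comes from an invocation along these lines (interval and flags to taste):
Code:
# physical providers only, refreshed every second; add -d / -o for delete/other ops
gstat -p -I 1s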
 
You have not checked the most fundamental thing of all, which is the health of the disk.
Do what SirDice suggested, and show us the output of (assuming that your SSD is ada0):
Code:
sudo smartctl -a /dev/ada0
Then, as Eric A. Borisch has already suggested, show us the output of gstat.
There is no point in tuning your system until you understand what it is doing NOW (behaving and misbehaving).
 
Anyway, I see lots of SSD misbehaviour there ...
E.g. they are utterly unsuitable for raid5 or similar, because read and write speeds are erratically changing all the time. ...
Sadly, I agree. SSDs are a bizarre marketplace. The thing to remember is that SSDs are internally incredibly complicated beasts, with millions of lines of internal source code for their firmware. They typically contain a miniature file system, redundant storage layers, extensive error checking and health management, and wear leveling and tracking. You can completely change their behavior (latency, throughput, good IO patterns, durability, reliability) by changing and adjusting the firmware. If you go to academic/research conferences on storage, you can always hear many talks about FTL (flash translation layers, the firmware in SSDs). Personally, I always fall asleep in these talks.

The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows. Now you take the FTL that's (well and carefully adjusted) for that IO pattern, and put it behind a RAID system and a modern complex file system ... and things go sideways. Performance drops, because the SSD gets a write workload that is completely unlike the one it was optimized for.

What's the fix? At the consumer level, I don't know. I use SSDs as a boot disk in my server (where the workload is super light, I'd be surprised if it gets to 100KB/s more often than a few seconds per day), and in laptops and small desktops (all macOS in my household), where performance is irrelevant. Another member of our household uses NVMe SSDs in their ... Windows gaming computer, and they work great, as long as you install all the required heatsinks. For a consumer storage server, I don't know what to do. In the big computer industry, the answers include: (a) build your own SSDs: just buy flash chips from Micron, Toshiba or Samsung, and do all the rest yourself. (b) Buy raw SSDs, but then write all the firmware yourself. (c) Work with the SSD vendor to carefully tune the FTL to the workload. None of this is viable for small systems.
 
The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows.
This is the next thing I see coming: SSD that just will not work with UFS or ZFS, and nobody will tell you.
We have this already with the sticks. My older sticks can be formatted to anything, and they just work. The newer ones don't work with UFS; they produce timeouts (5 seconds per single write), I/O errors, or just die. Simple explanation: in UFS or ZFS, the frequently written areas are not where the stick expects them. Format it back to msdos and it works as it should.
Embedded intelligence has become too cheap, and it is toxic.
 
They were interesting because they are the only SSD disks with 8 TB of disk space...
Really?
I don't know how good these are, but they are marketed for professional workloads.
 
PMc ohh sorry it was TLDR.

For SATA SSDs, write IOPS goes up to about 60K. When you need more, you switch to SAS or PCIe. Here's a list for comparison. The usual selection criteria are IOPS as high as possible and DWPD. (The PM897 is also 3 DWPD/5y.)

Part Number | Model | Interface | F/F | Capacity | Sequential Read | Sequential Write | Random Read | Random Write | DWPD / Status
MZ7L31T9HBLT-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 1920 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3240HCHQ-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 550 MB/s | 380 MB/s | 98K IOPS | 15K IOPS | Mass Production
MZ7L33T8HBLT-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 3840 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3480HCHQ-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 550 MB/s | 520 MB/s | 98K IOPS | 29K IOPS | Mass Production
MZ7L37T6HBLA-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 7680 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3960HCJR-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L31T9HBNA-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 1,920 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L33T8HBNA-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 3,840 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L3480HBLT-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L3960HBLT-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7LH1T9HMLT | PM883 | SATA 6.0 Gbps | 2.5 inch | 1.92 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH240HAHQ | PM883 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 550 MB/s | 320 MB/s | 98K IOPS | 14K IOPS | 1.3 (3yrs)
MZ7LH3T8HMLT | PM883 | SATA 6.0 Gbps | 2.5 inch | 3.84 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH480HAHQ | PM883 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 550 MB/s | 520 MB/s | 98K IOPS | 25K IOPS | 1.3 (3yrs)
MZ7LH7T6HMLA | PM883 | SATA 6.0 Gbps | 2.5 inch | 7.68 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH960HAJR | PM883 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 550 MB/s | 520 MB/s | 98K IOPS | 28K IOPS | 1.3 (3yrs)
MZ7KH1T9HAJR | SM883 | SATA 6.0 Gbps | 2.5 inch | 1.92 TB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)
MZ7KH240HAHQ | SM883 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 540 MB/s | 480 MB/s | 97K IOPS | 22K IOPS | 3.0 (5yrs)
MZ7KH3T8HALS | SM883 | SATA 6.0 Gbps | 2.5 inch | 3.84 TB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)
MZ7KH480HAHQ | SM883 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 540 MB/s | 520 MB/s | 97K IOPS | 27K IOPS | 3.0 (5yrs)
MZ7KH960HAJR | SM883 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)

Note:
I personally prefer Intel's SSDs but I don't like the price...
 
This is the next thing I see coming: SSD that just will not work with UFS or ZFS, and nobody will tell you.
We have this already with the sticks. My older sticks can be formatted to anything, and they just work. The newer ones don't work with UFS; they produce timeouts (5 seconds per single write), I/O errors, or just die. Simple explanation: in UFS or ZFS, the frequently written areas are not where the stick expects them. Format it back to msdos and it works as it should.
Embedded intelligence has become too cheap, and it is toxic.
If this is true (I don't doubt your word), it's an absolute disgrace!!!
 
Thank you so much to all really....


Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail. The iostat command and "zpool iostat" give you a part of it, but they don't show queue depth and latency. You can get an upper limit of latency from Little's law (inverse of IO rate), but for a partially idle disk that is not useful.

This will help you debug whether the bottom IO layer is really the bottleneck. And if it is, it will allow you to measure how the system's behavior changes as you adjust parameters.

Yes, we are trying to do something like this... but it's complicated... as we don't really know when it fails...

Really?
I don't know how good these are, but they are marketed for professional workloads.

Good discovery!!!!
 
Sadly, I agree. SSDs are a bizarre marketplace. The thing to remember is that SSDs are internally incredibly complicated beasts, with millions of lines of internal source code for their firmware. They typically contain a miniature file system, redundant storage layers, extensive error checking and health management, and wear leveling and tracking. You can completely change their behavior (latency, throughput, good IO patterns, durability, reliability) by changing and adjusting the firmware. If you go to academic/research conferences on storage, you can always hear many talks about FTL (flash translation layers, the firmware in SSDs). Personally, I always fall asleep in these talks.

The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows. Now you take the FTL that's (well and carefully adjusted) for that IO pattern, and put it behind a RAID system and a modern complex file system ... and things go sideways. Performance drops, because the SSD gets a write workload that is completely unlike the one it was optimized for.

What's the fix? At the consumer level, I don't know. I use SSDs as a boot disk in my server (where the workload is super light, I'd be surprised if it gets to 100KB/s more often than a few seconds per day), and in laptops and small desktops (all macOS in my household), where performance is irrelevant. Another member of our household uses NVMe SSDs in their ... Windows gaming computer, and they work great, as long as you install all the required heatsinks. For a consumer storage server, I don't know what to do. In the big computer industry, the answers include: (a) build your own SSDs: just buy flash chips from Micron, Toshiba or Samsung, and do all the rest yourself. (b) Buy raw SSDs, but then write all the firmware yourself. (c) Work with the SSD vendor to carefully tune the FTL to the workload. None of this is viable for small systems.
Complicated world... only by paying in abundance do you get guarantees....
 
You can also limit TRIM operations to '1' instead of the default '64'.

Put vfs.zfs.vdev.trim_max_active=1 into the /etc/sysctl.conf file and set it as usual with the sysctl(8) command.
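A minimal sketch of both steps, assuming the stock file locations:
Code:
# check the current value (default is 64)
sysctl vfs.zfs.vdev.trim_max_active
# apply it immediately
sysctl vfs.zfs.vdev.trim_max_active=1
# make it persistent across reboots
echo 'vfs.zfs.vdev.trim_max_active=1' >> /etc/sysctl.conf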
 