Desperate with 870 QVO and ZFS

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.
Good thought. However, some caution may be indicated. It's a mail server, so depending on what's in the datasets, atime might be important.
 
The Samsung QVO disks are the slowest class of SSD; they are a quite good, cheap alternative for storage, but not for I/O-intensive use. They have a quite LARGE CACHE DROP EFFECT, which means that even with continuous sequential I/O they will quite quickly drop from ~500 MB/s to ~160 MB/s, as shown in the image below.


[Attached image: sequential write throughput dropping from ~500 MB/s to ~160 MB/s once the QVO's cache is exhausted]


You can read more about that here:
 
Buffer size is not the only issue that can choke the write speed.

But, you can't manage what you don't measure.

Testing an SSD suited to sustained writing is one very quick way to eliminate all maladies relating to QLC.
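For instance, a quick sustained-write test might look something like this (the target path and size are only placeholders, and compression should be off on the target dataset or the numbers will be meaningless):
Code:
# Write ~20 GB of incompressible data sequentially and watch whether the
# rate collapses once the drive's SLC cache fills up.
dd if=/dev/random of=/pool/scratch/testfile bs=1m count=20480 status=progress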

Otherwise, start taking observations with the actions suggested by SirDice and Eric A. Borisch above, and discuss the results.
Hi Gpw928!

Totally agree man :) :) Thank you!!

I'm trying to summarize all the ideas you gave me :) so I can take action on them :) :)

Many many thanks for all your help!!
 
I could saturate the I/O of my IBM datacenter SSDs with a handful of client connections ... just saying.
I see, although... they could perhaps manage the load better than the QVOs do....
Hm, really? I mean, if the primary gets a mail delivered, it should copy it to the replica, right? So we have at least twice the I/O compared to just having one primary server.
Yes, this is true. But later, when an issue happens on the master server, you live very relaxed because everything is identical on the master and the slave server....

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.
Hi Rootbert!!

Thank you for your comments :) :) and your help!! :) :)
 
Hi!! Thanks for your answer, rootbert!!

I could saturate the I/O of my IBM datacenter SSDs with a handful of client connections ... just saying.

You are evil :p :p

Hm, really? I mean, if the primary gets a mail delivered, it should copy it to the replica, right? So we have at least twice the I/O compared to just having one primary server.

Well, yes, you duplicate the content, but.... you live far more relaxed when the master machine crashes :) :) because it takes less than 10 minutes to promote the slave to master....

In addition: have you switched off atime on the zfs dataset? - should massively improve performance on io-bound servers.

Mmm... we have it at the default.... so I assume the access time flag is enabled.... Do you think this flag is still so noticeable nowadays? Even with SSDs?
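For reference, checking and turning off atime on a dataset would look roughly like this (the dataset name is just an example):
Code:
# show the current setting (ZFS defaults to atime=on)
zfs get atime zroot/var/mail
# stop updating access times on that dataset
zfs set atime=off zroot/var/mail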
 
Hi Vermaden!!

Thank you for posting and helping me :) :)


The Samsung QVO disks are the slowest class of SSD; they are a quite good, cheap alternative for storage, but not for I/O-intensive use. They have a quite LARGE CACHE DROP EFFECT, which means that even with continuous sequential I/O they will quite quickly drop from ~500 MB/s to ~160 MB/s, as shown in the image below.
We knew they used a buffer to compensate for the QLC slowness... but we also thought we would never end up filling that whole speed buffer... yet that's what happened...
I'll take a look at the provided URL.

Thanks a lot!!!
Cheers!
 
Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail. The iostat command and "zpool iostat" give you a part of it, but they don't show queue depth and latency. You can get an upper limit of latency from Little's law (inverse of IO rate), but for a partially idle disk that is not useful.

This will help you debug whether the bottom IO layer is really the bottleneck. And if it is, it will allow you to measure how the system's behavior changes as you adjust parameters.
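As a starting point, something like this gives at least part of that picture (the pool name is just an example; depending on the OpenZFS version, zpool iostat -l and -q may also add latency and queue columns):
Code:
# extended per-device statistics, refreshed every second
iostat -x -w 1
# per-vdev throughput for the pool, refreshed every second
zpool iostat -v tank 1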
 
Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail.
I think I see all of this in gstat. Anyway, I see lots of SSD misbehaviour there - and I've seen so much of it by now that I don't want to see any more. I used to think the only problem with these pieces is that they die, sooner or later (and usually at the most unpleasant time). That is not the case; there are lots of other problems. E.g. they are utterly unsuitable for raid5 or similar, because read and write speeds change erratically all the time.
If I had an application similar to this one here, I would try with solid spinning disks and pimp that up with a properly sized L2ARC plus a write intent log (if appropriate) plus a special vdev; these on some small and solidly performing SSDs. (If one could get some; they seem to be mostly out of stock currently, only that Q stuff is readily available.)
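Purely as an illustration of that layout (pool name and device names are placeholders, not a recommendation for this particular system):
Code:
# add a mirrored SLOG (write intent log) on two SSD partitions
zpool add tank log mirror ada7p1 ada8p1
# add an L2ARC read cache device
zpool add tank cache ada7p2
# add a mirrored special vdev for metadata and small blocks
zpool add tank special mirror ada9 ada10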

Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     10     10    632    0.6      0      0    0.0    0.5| ada0
    0    119     21     83   10.4     98   1055    0.8    4.0| ada1
    0      3      3    190    0.4      0      0    0.0    0.1| ada2
   20    116     55   1838   67.7     60   1498   64.4   98.9| ada3
    0      7      7    443    0.4      0      0    0.0    0.3| ada4
    0      0      0      0    0.0      0      0    0.0    0.0| ada5
    0    152     18     71    0.4    134   1051    0.1    1.0| ada6

Here we see the queue depth on the left, some peak ms per request, when a device gets saturated (like ada3 here), and the throughput. (There are options to also show delete (-d) and other (-o) commands.) And, as usual with performance matters, this does not show ready-made what is wrong, but it is something to watch to get a feeling for what normal performance looks like and what, then, is not normal performance.
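For reference, output like the above comes from an invocation along these lines (interval and flags to taste):
Code:
# physical providers only, refreshed every second; add -d / -o for delete/other ops
gstat -p -I 1s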
 
You have not checked the most fundamental thing of all, which is the health of the disk.
Do what SirDice suggested, and show us the output of (assuming that your SSD is ada0):
Code:
sudo smartctl -a /dev/ada0
Then, as Eric A. Borisch has already suggested, show us the output of gstat.
There is no point in tuning your system until you understand what it is doing NOW (behaving and misbehaving).
 
Anyway, I see lots of SSD misbehaviour there ...
E.g. they are utterly unsuitable for raid5 or similar, because read and write speeds are erratically changing all the time. ...
Sadly, I agree. SSDs are a bizarre marketplace. The thing to remember is that SSDs are internally incredibly complicated beasts, with millions of lines of internal source code for their firmware. They typically contain a miniature file system, redundant storage layers, extensive error checking and health management, and wear leveling and tracking. You can completely change their behavior (latency, throughput, good IO patterns, durability, reliability) by changing and adjusting the firmware. If you go to academic/research conferences on storage, you can always hear many talks about FTL (flash translation layers, the firmware in SSDs). Personally, I always fall asleep in these talks.

The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows. Now you take the FTL that's (well and carefully adjusted) for that IO pattern, and put it behind a RAID system and a modern complex file system ... and things go sideways. Performance drops, because the SSD gets a write workload that is completely unlike the one it was optimized for.

What's the fix? At the consumer level, I don't know. I use SSDs as a boot disk in my server (where the workload is super light, I'd be surprised if it gets to 100KB/s more often than a few seconds per day), and in laptops and small desktops (all macOS in my household), where performance is irrelevant. Another member of our household uses NVMe SSDs in their ... Windows gaming computer, and they work great, as long as you install all the required heatsinks. For a consumer storage server, I don't know what to do. In the big computer industry, the answers include: (a) build your own SSDs: just buy flash chips from Micron, Toshiba or Samsung, and do all the rest yourself. (b) Buy raw SSDs, but then write all the firmware yourself. (c) Work with the SSD vendor to carefully tune the FTL to the workload. None of this is viable for small systems.
 
The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows.
This is the next thing I see coming: SSD that just will not work with UFS or ZFS, and nobody will tell you.
We have this already with the sticks. My older sticks can be formatted to anything, and they just work. The newer ones don't work with UFS; they produce timeouts (5 seconds per single write), I/O errors, or just die. Simple explanation: in UFS or ZFS, the frequently written areas are not where the stick expects them. Format it back to msdos and it works as it should.
Embedded intelligence has become too cheap, and it is toxic.
 
They were interesting because they are the only SSD disks with 8 TB of disk space...
Really?
I don't know how good these are, but they are marketed for professional workloads.
 
PMc ohh sorry it was TLDR.

For SATA SSDs, write IOPS goes up to about 60K. When you need more, you switch to SAS or PCIe. Here's a list for comparison. The usual selection criteria are IOPS as high as possible and DWPD. (The PM897 is also 3 DWPD/5y.)

Part Number | Model | Interface | F/F | Capacity | Sequential Read | Sequential Write | Random Read | Random Write | DWPD / Status
MZ7L31T9HBLT-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 1920 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3240HCHQ-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 550 MB/s | 380 MB/s | 98K IOPS | 15K IOPS | Mass Production
MZ7L33T8HBLT-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 3840 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3480HCHQ-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 550 MB/s | 520 MB/s | 98K IOPS | 29K IOPS | Mass Production
MZ7L37T6HBLA-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 7680 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L3960HCJR-00B7C | PM893 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | Mass Production
MZ7L31T9HBNA-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 1,920 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L33T8HBNA-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 3,840 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L3480HBLT-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7L3960HBLT-00B7C | PM897 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 560 MB/s | 530 MB/s | 97K IOPS | 60K IOPS | Mass Production
MZ7LH1T9HMLT | PM883 | SATA 6.0 Gbps | 2.5 inch | 1.92 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH240HAHQ | PM883 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 550 MB/s | 320 MB/s | 98K IOPS | 14K IOPS | 1.3 (3yrs)
MZ7LH3T8HMLT | PM883 | SATA 6.0 Gbps | 2.5 inch | 3.84 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH480HAHQ | PM883 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 550 MB/s | 520 MB/s | 98K IOPS | 25K IOPS | 1.3 (3yrs)
MZ7LH7T6HMLA | PM883 | SATA 6.0 Gbps | 2.5 inch | 7.68 TB | 550 MB/s | 520 MB/s | 98K IOPS | 30K IOPS | 1.3 (3yrs)
MZ7LH960HAJR | PM883 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 550 MB/s | 520 MB/s | 98K IOPS | 28K IOPS | 1.3 (3yrs)
MZ7KH1T9HAJR | SM883 | SATA 6.0 Gbps | 2.5 inch | 1.92 TB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)
MZ7KH240HAHQ | SM883 | SATA 6.0 Gbps | 2.5 inch | 240 GB | 540 MB/s | 480 MB/s | 97K IOPS | 22K IOPS | 3.0 (5yrs)
MZ7KH3T8HALS | SM883 | SATA 6.0 Gbps | 2.5 inch | 3.84 TB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)
MZ7KH480HAHQ | SM883 | SATA 6.0 Gbps | 2.5 inch | 480 GB | 540 MB/s | 520 MB/s | 97K IOPS | 27K IOPS | 3.0 (5yrs)
MZ7KH960HAJR | SM883 | SATA 6.0 Gbps | 2.5 inch | 960 GB | 540 MB/s | 520 MB/s | 97K IOPS | 29K IOPS | 3.0 (5yrs)

Note:
I personally prefer Intel's SSDs but I don't like the price...
 
This is the next thing I see coming: SSD that just will not work with UFS or ZFS, and nobody will tell you.
We have this already with the sticks. My older sticks can be formatted to anything, and they just work. The newer ones don't work with UFS; they produce timeouts (5 seconds per single write), I/O errors, or just die. Simple explanation: in UFS or ZFS, the frequently written areas are not where the stick expects them. Format it back to msdos and it works as it should.
Embedded intelligence has become too cheap, and it is toxic.
If this is true (I don't doubt your word), it's an absolute disgrace!!!
 
Thank you so much to all really....


Here's what I would do, if I had time: Watch the IO rates, latencies and queue depth on all the physical disks, separately for reads and writes. Now, I don't know how to do that in detail. The iostat command and "zpool iostat" give you a part of it, but they don't show queue depth and latency. You can get an upper limit of latency from Little's law (inverse of IO rate), but for a partially idle disk that is not useful.

This will help you debug whether the bottom IO layer is really the bottleneck. And if it is, it will allow you to measure how the system's behavior changes as you adjust parameters.

Yes, we are trying to do something like this... but it's complicated... as we don't really know when it fails...

Really?
I don't know how good these are, but they are marketed for professional workloads.

Good discovery!!!!
 
Sadly, I agree. SSDs are a bizarre marketplace. The thing to remember is that SSDs are internally incredibly complicated beasts, with millions of lines of internal source code for their firmware. They typically contain a miniature file system, redundant storage layers, extensive error checking and health management, and wear leveling and tracking. You can completely change their behavior (latency, throughput, good IO patterns, durability, reliability) by changing and adjusting the firmware. If you go to academic/research conferences on storage, you can always hear many talks about FTL (flash translation layers, the firmware in SSDs). Personally, I always fall asleep in these talks.

The versions sold to consumers (purchasers of individual units) are usually built in firmware to maximize customer satisfaction in the most common use case; for most people that means gaming PCs with Windows. Now you take the FTL that's (well and carefully adjusted) for that IO pattern, and put it behind a RAID system and a modern complex file system ... and things go sideways. Performance drops, because the SSD gets a write workload that is completely unlike the one it was optimized for.

What's the fix? At the consumer level, I don't know. I use SSDs as a boot disk in my server (where the workload is super light, I'd be surprised if it gets to 100KB/s more often than a few seconds per day), and in laptops and small desktops (all macOS in my household), where performance is irrelevant. Another member of our household uses NVMe SSDs in their ... Windows gaming computer, and they work great, as long as you install all the required heatsinks. For a consumer storage server, I don't know what to do. In the big computer industry, the answers include: (a) build your own SSDs: just buy flash chips from Micron, Toshiba or Samsung, and do all the rest yourself. (b) Buy raw SSDs, but then write all the firmware yourself. (c) Work with the SSD vendor to carefully tune the FTL to the workload. None of this is viable for small systems.
Complicated world... only by paying in abundance do you get guarantees....
 
You can also limit TRIM operations to '1' instead of the default '64'.

Put vfs.zfs.vdev.trim_max_active=1 into the /etc/sysctl.conf file and set it as usual with the sysctl(8) command.
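A minimal sketch of both steps, assuming the stock file locations:
Code:
# check the current value (default is 64)
sysctl vfs.zfs.vdev.trim_max_active
# apply it immediately
sysctl vfs.zfs.vdev.trim_max_active=1
# make it persistent across reboots
echo 'vfs.zfs.vdev.trim_max_active=1' >> /etc/sysctl.conf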
 