Desperate with 870 QVO and ZFS

Good morning,

I'm writing this post in the hope that someone can help me :)

I am running some mail servers on FreeBSD with ZFS. They use Samsung 870 QVO disks (not EVO or other Samsung SSD models) as storage. They can easily have 1500 to 2000 concurrent connections. The machines have 128 GB of RAM and the CPU is almost completely idle. Disk IO is normally at 30 or 40% at most.

The problem I'm facing is that they can be running just fine and then suddenly, at some peak hour, the IO goes to 60 or 70% and the machine becomes extremely slow. ZFS is all at defaults, except for the sync property, which is set to disabled. Apart from that, the ARC is limited to 64 GB. But even this is extremely odd: the ARC actually used is near 20 GB. I have seen that the metadata cache in the ARC is very close to the limit that FreeBSD sets automatically based on the ARC size you configure. It seems that almost all of the ARC is used by the metadata cache. I have seen this effect on all my mail servers with this hardware and software configuration.
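For reference, this is roughly how that configuration is expressed (zroot/mail is just a placeholder name, not my real dataset):

# ZFS property: writes are acknowledged before reaching stable storage
zfs set sync=disabled zroot/mail
zfs get sync zroot/mail

# /boot/loader.conf: cap the ARC at 64 GB (68719476736 bytes)
vfs.zfs.arc_max="68719476736"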

I am attaching a zfs-stats output, although it is from now, when the servers are not as loaded as described. Let me explain. I run a couple of Cyrus instances on these servers: one as master and one as slave on each server. The situation described above happens when both Cyrus instances become master, that is, when two Cyrus instances are serving traffic on the same machine. To avoid problems, we have now rebalanced so that each server has one master and one slave. As you know, a slave instance generates almost no IO and has only a single connection for replication. So the zfs-stats output reflects the current state, with each server carrying roughly half of the load, because each has one master and one slave instance.

As said before, when I place two masters on the same server, things may work fine all day, and then at 11:00 am (for example) the IO goes to 60% (it doesn't go higher), but it seems as if the IO can no longer be served beyond a certain limit, a concrete IO ceiling of around 60%.

I don't really know whether the QVO technology could be the culprit here, because they are said to be desktop-class disks. On the other hand, I get nice performance when copying mailboxes between servers five at a time: I can flood a gigabit interface doing that, so they do seem to perform.

Could anyone please shed some light on this issue? I don't really know what to think.

Best regards,
 


Hi SirDice,

Thank you so much for your time really.

I'm pretty sure that's not the problem, because it has happened to me on several servers and the disks are brand new.

We have suffered from that too, but it's not the kind of slowness you are describing. It's as if the server had a buffer that is nearly full; a small fraction gets drained, but because it's so full, it can only handle a very few IOs each time you need them.

It's not the sensation you get when a disk is failing and the whole pool performs badly until you take it offline; it's not that. In fact, it resolves itself: more or less 30 minutes later, everything works fine again, without modifying any pool parameters.

Cheers,
 
Oh, forgot to ask, what version of FreeBSD? FreeBSD 12 and 13 have different ZFS implementations, so it's important.
 
Yes, it's FreeBSD 12.2. There is more detailed info in the attached zfs-stats.txt, in case it helps.

Thank you!
 
I had a bad SSD with no hard/diagnosable problems which slowed writes down to floppy-disk-like performance, but it would pass any test I tried.
Not every write would cause the symptom, just some of them.
It was a pain in the butt to fix (the SSD was part of a RAID10).
Avoid consumer hardware in production servers whenever possible.
That was an HP SSD.
I also had one (off-brand) without detectable errors, but with read performance around 30 MB/s over 70% of the disk surface;
the remaining 30% was OK.
That one was in a Win10 box, also a pain in the ass to fix, because it seems everything can be slow on Win10 for some software reason:
slow right-click, slow to show the privilege escalation box, etc., etc.
 
Hi Covacat,

Thank you so much too for your answer....

They were interesting because they are the only SSDs with 8TB of capacity...

Cheers,
 
The strangest thing is that when the machine boots, the ARC is at around 40 GB used (for instance), but it later decreases to 20 GB (and that figure is exact, not an example) on all my servers. It's as if the ARC metadata, which is more or less 17 GB, were limiting the whole ARC.

With the traffic these machines handle, I would expect the ARC to be larger than it is, and the ARC limit in loader.conf is set to 64 GB (half of the RAM these machines have).
 
I'd check the ashift value for the VDEV. If it's not 12 or 13, I'd investigate further. However...

QLC NAND is really slow, and the 870 QVO relies on massive caches to achieve the performance numbers in the brochures. Once the caches are saturated, performance falls off a cliff. Firstpost has a review.
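If you want to double-check the ashift yourself, something along these lines should work (the pool name zroot is only a placeholder):

# "zroot" is a placeholder pool name; substitute your own
zdb -C zroot | grep -i ashift      # ashift recorded per vdev in the cached pool config
sysctl vfs.zfs.min_auto_ashift     # minimum ashift ZFS will use for newly added vdevs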
 
Hi Gpw928,

Thank you so much for your answer. The ashift is 12, which I assume is fine for these disks.

We know they are slow, but they supposedly compensate for the slowness with the buffer these disks have, and you should only see the slowness once you exceed that buffer, am I wrong?

Cheers!
 
Hi!

Thank you so much Covacat :) :)

Yes, I have read the ZFS Mastery book by Michael W. Lucas and have learnt how those parameters work: vfs.zfs.dirty_data_max, vfs.zfs.txg.timeout, vfs.zfs.vdev.async_write_active_min_dirty_percent and vfs.zfs.vdev.async_write_active_max_dirty_percent. I have learned a lot of concepts from that book, but the doubt that comes to mind is: "if these values normally don't need to be adjusted, and FreeBSD even has its own mechanisms for auto-tuning them, perhaps they are already nearly right, and if I modify them, could I cause the system to collapse?"

Apart from that, and speaking about that tunable: if the dirty data does not reach vfs.zfs.dirty_data_max within the number of seconds set in vfs.zfs.txg.timeout, ZFS commits the transaction group after that many seconds anyway. Obviously, a larger timeout means there is probably more data to write when the txg is committed than with a lower timeout. It would interrupt the storage less often, but each commit would take longer; and when it writes, ZFS tries to flush the accumulated requests as fast as it can, even if that causes a barely noticeable pause in reads. That is what increasing txg.timeout does. Decreasing it makes ZFS write more often and could keep the disks busy with writes for more of the time than with the default 5 seconds.
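Just to be clear about what I am looking at, these are the tunables in question and how I inspect them (read-only, nothing is being changed here):

sysctl vfs.zfs.dirty_data_max                              # max dirty data before a txg is forced out
sysctl vfs.zfs.txg.timeout                                 # seconds between txg commits (default 5)
sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent   # dirty % at which async writes start ramping up
sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent   # dirty % at which async writes run at full speed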

It's really difficult to decide which direction is correct (increase or decrease). Of course you can try, but experimenting with this in a production environment is not something I'm keen to do.

Best regards,
 
I have learned a lot of concepts from that book, but the doubt that comes to mind is: "if these values normally don't need to be adjusted, and FreeBSD even has its own mechanisms for auto-tuning them, perhaps they are already nearly right, and if I modify them, could I cause the system to collapse?"
FreeBSD "auto-tunes" a lot, but that's all based on "general" usage. For specific use-cases you have to help it a bit because it can make the wrong choices for your particular situation.
 
It's possible that Covacat is more or less right, because at that moment, in top, almost all processes are in the txg state (or txg->), which makes it tempting to try adjusting the values I mentioned in the previous comment, the same way Covacat suggested.
 
Hi Sirdice,

Thanks mate!!

That's right, but if you don't adjust them properly you could end up collapsing the server; that's the reason we usually try to let the auto-tuning do its work.
 
That said, if someone knows or is fairly sure about specific tunable settings, please advise me and we will check them, or read up on them to understand them before possibly applying them. But you know, these machines are in production, so we need to be extremely careful with them.

Cheers,
 
I have been thinking, and I have the following tunables now:

vfs.zfs.arc_meta_strategy: 0
vfs.zfs.arc_meta_limit: 17179869184
kstat.zfs.misc.arcstats.arc_meta_min: 4294967296
kstat.zfs.misc.arcstats.arc_meta_max: 19386809344
kstat.zfs.misc.arcstats.arc_meta_limit: 17179869184
kstat.zfs.misc.arcstats.arc_meta_used: 16870668480
vfs.zfs.arc_max: 68719476736

and top says:

ARC: 19G Total, 1505M MFU, 12G MRU, 6519K Anon, 175M Header, 5687M Other



Even when vfs.zfs.arc_max was set to 128 GB (instead of the 64 GB I have set now), the ARC never got anywhere near its maximum usable size. Could that have something to do with the fact that the ARC meta values are almost at the configured limit? Perhaps increasing vfs.zfs.arc_meta_limit (rather than kstat.zfs.misc.arcstats.arc_meta_limit; I suppose the first one is the one to raise) would give better performance and make better use of the 64 GB ARC maximum? I say this because right now it doesn't use more than 19 GB of ARC in total.
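If the metadata limit really is what caps the ARC, I understand raising it would look something like this (the 32 GiB value is only an example I am considering, not a recommendation):

# /boot/loader.conf -- 32 GiB (34359738368 bytes) is just an example value
vfs.zfs.arc_meta_limit="34359738368"
# or, if the sysctl is writable at runtime on this ZFS version:
sysctl vfs.zfs.arc_meta_limit=34359738368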



As always, any opinion or idea would be highly appreciated.



Cheers,
 
The ashift is 12, which I assume is fine for these disks.
Samsung don't declare the underlying sector size, so I'm not sure that you can know without benchmarking. But my guess is that 12 should be OK, or at least not wildly bad (simply because Samsung know that there's a broad market expectation of 4K sectors with SSDs).
We know they are slow, but they supposedly compensate for the slowness with the buffer these disks have, and you should only see the slowness once you exceed that buffer, am I wrong?
Your original post indicated a continuous I/O load:
Disk IO is normally at 30 or 40% at most ... and then suddenly, at some peak hour, the IO goes to 60 or 70% and the machine becomes extremely slow.
This is exactly what I would expect with buffer exhaustion. The word is that the throughput drops by as much as 90% for QLC NAND when the on-board buffers are full.

The symptoms might also indicate a tuning issue, so it makes sense to continue looking there.

But if it were my system, I'd be looking to test a better SSD (not QLC), rated for continuous load bearing.
 
A few thoughts:

Perversely, you may try re-enabling sync on the ZFS filesystem. By having it disabled, you are getting data queued up into bursts to write to the drive (either after 5s, or if enough is dirty that it decides it is time to write out; I can’t recall the exact heuristic.) You’re also not giving any feedback to the programs that resources are being exhausted, so they are free to submit more and more work (IOs) to be done until they hit a point where the system is truly saturated for some time. This may compound with the performance issues as described by others with this technology, and lead to the temporary freezes rather than a more graceful slowing down with increasing load.
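A minimal sketch of that change and its rollback, assuming a placeholder dataset name of zroot/mail:

zfs set sync=standard zroot/mail   # honor synchronous writes again
zfs get sync zroot/mail            # verify the current value
zfs set sync=disabled zroot/mail   # revert if the experiment doesn't help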

You can also look at the output of gstat -pdo (gstat(8)) to see where the drives are spending time, including delete (trim) and flush (sync) calls.
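For example:

# -p: physical providers only; -d: show delete (BIO_DELETE) stats; -o: show other (BIO_FLUSH) stats
gstat -pdo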

Note that a write-heavy workload does not benefit significantly from ARC, other than the metadata side of things.

My guess, however, is that if you’re pushing the performance of these drives, you’ll likely be much happier with drives with better write performance.

One final item: if you have periodic snapshot creation / destruction, that will also cause periodic spikes in activity, especially with a high number of ZFS file systems.
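A quick way to confirm whether such snapshots exist (the pool name zroot is a placeholder):

# list snapshots, oldest first; an empty list rules this cause out
zfs list -t snapshot -o name,creation -s creation -r zroot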
 
Good morning :)

Thank you so much for all your help; it is really appreciated. gpw928, Eric: I'm very thankful, really.

I'll answer each post in a separate reply :)
 
Samsung don't declare the underlying sector size, so I'm not sure that you can know without benchmarking. But my guess is that 12 should be OK, or at least not wildly bad (simply because Samsung know that there's a broad market expectation of 4K sectors with SSDs).

My guess was based on a workmate saying it was OK after a zdb | grep -i ashift :) :). By the way, I have also read some comments saying 12 should be fine for these drives.

Your original post indicated a continuous I/O load:

This is exactly what I would expect with buffer exhaustion. The word is that the throughput drops by as much as 90% for QLC NAND when the on-board buffers are full.

Yes, we have a continuous load, that's true. But they have 72 GB of "intelligent turbo write" cache (as I have read for the 8TB version of these disks). We move a continuous load, but of very small changes; I find it hard to believe we really reach 72 GB of uncommitted data. We move small files of a few KB each (the emails themselves). Do you think that could still be the cause?

The symptoms might also indicate a tuning issue, so it makes sense to continue looking there.

But if it were my system, I'd be looking to test a better SSD (not QLC), rated for continuous load bearing.

I see, gpw928. Yes, that's right, but I already have several machines with these disks, so I need to try to improve the setup as much as possible, and obviously without causing too much disruption to the systems. You know, it's difficult.
 
A few thoughts:

Perversely, you may try re-enabling sync on the ZFS filesystem. By having it disabled, you are getting data queued up into bursts to write to the drive (either after 5s, or if enough is dirty that it decides it is time to write out; I can’t recall the exact heuristic.)

I could try it, because rolling back that change doesn't leave data behind that was written differently from earlier or later data (as, for instance, changing compression or recordsize would).

You’re also not giving any feedback to the programs that resources are being exhausted, so they are free to submit more and more work (IOs) to be done until they hit a point where the system is truly saturated for some time. This may compound with the performance issues as described by others with this technology, and lead to the temporary freezes rather than a more graceful slowing down with increasing load.

Honestly, in the mail world that is difficult to achieve. Imagine a mailbox being accessed by 5 IMAP clients all making changes: the IO can't really "wait". You could, I don't know, perhaps use some buffer to write to before committing to disk, but I assume that is an operating system job, or at least a file system job, really.

I think it is far more reasonable for this to be handled by the filesystem itself.

You can also look at the output of gstat -pdo (gstat(8)) to see where the drives are spending time, including delete (trim) and flush (sync) calls.

Note that a write-heavy workload does not benefit significantly from ARC, other than the metadata side of things.

So perhaps increasing the ARC metadata limit would be useful? I think it should be; here are some example values I have right now:

vfs.zfs.arc_meta_strategy: 0
vfs.zfs.arc_meta_limit: 17179869184
kstat.zfs.misc.arcstats.arc_meta_min: 4294967296
kstat.zfs.misc.arcstats.arc_meta_max: 19386809344
kstat.zfs.misc.arcstats.arc_meta_limit: 17179869184
kstat.zfs.misc.arcstats.arc_meta_used: 17014650624



My guess, however, is that if you’re pushing the performance of these drives, you’ll likely be much happier with drives with better write performance.

One final item: if you have periodic snapshot creation / destruction, that will also cause periodic spikes in activity, especially with a high number of ZFS file systems.

Well, yes. I basically have a couple of datasets on these disks, and when I have issues, no snapshots exist in the datasets. I very rarely take a snapshot, and when I do, I normally delete it after a few minutes have passed; those few minutes are 3-5, not more.
 
Buffer size is not the only issue that can choke the write speed.

But, you can't manage what you don't measure.

Testing an SSD suited to sustained writing is one very quick way to eliminate all maladies relating to QLC.

Otherwise, start taking observations with the actions suggested by SirDice and Eric A. Borisch above, and discuss the results.
 
I am running some mail servers on FreeBSD with ZFS. They use Samsung 870 QVO disks (not EVO or other Samsung SSD models) as storage. They can easily have 1500 to 2000 concurrent connections. The machines have 128 GB of RAM and the CPU is almost completely idle. Disk IO is normally at 30 or 40% at most.

I could saturate the IO of my IBM datacenter SSDs with a handful of client connections... just saying.

I run a couple of Cyrus instances on these servers: one as master and one as slave on each server. The situation described above happens when both Cyrus instances become master, that is, when two Cyrus instances are serving traffic on the same machine. To avoid problems, we have now rebalanced so that each server has one master and one slave. As you know, a slave instance generates almost no IO and has only a single connection for replication.

Hm, really? I mean, if the primary gets a mail delivered, it should copy it to the replica, right? So we have at least twice the IO compared to having just one primary server.

In addition: have you switched off atime on the ZFS dataset? That should massively improve performance on IO-bound servers.
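Checking and changing it is straightforward (zroot/mail is a placeholder dataset name):

zfs get atime zroot/mail      # check the current setting
zfs set atime=off zroot/mail  # stop writing an access-time update on every read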
 