ZFS: What happened to vfs.zfs.vdev.max_pending?

OK, so I'm using SATA drives in a SAS array with expanders. I know (well, I do now) that's not a good idea. See this for background on why: http://garrett.damore.org/2010/08/why-sas-sata-is-not-such-great-idea.html

But I had been running this drive configuration for 2 years without any problems at all; no drives went down either. I use SATA drives in 3-way mirror vdevs to offset the price, performance, redundancy, and failure-risk trade-offs versus SAS. If I were using SAS, I certainly would not be using 3-way mirrors, due to the cost.

Anyway, it turns out that the reason I was getting away with it was the following settings in /boot/loader.conf:
Code:
# Change I/O queue settings to play nice with SATA NCQ and
# other storage controller features.
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"
After upgrading to FreeBSD 10.1 I started having major problems on my pools: SCSI/CAM errors and drives being REMOVED by themselves. After almost a week of debugging, I discovered that the OID (sysctl variable) vfs.zfs.vdev.max_pending is no longer used in FreeBSD 10.1-RELEASE. I believe it's due to a change in OpenZFS.

So please, please, can somebody tell me what the equivalent of vfs.zfs.vdev.min_pending is in the latest ZFS version? I haven't slept properly in over a week. Nor have I been able to do a backup, since the high load is causing drives to temporarily die. Amazingly, it can survive a full working day in production... just.

It's a bit too late to go back to FreeBSD 8.4 now. I could, but who would want to do that? And I would have to move 4 TB to new pools created with an older zpool version, which takes about 2 days to send/recv.
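For anyone hitting the same wall: a quick way to confirm that an OID has really disappeared on a given release is to query it directly with sysctl(8). On 10.1 the old name should simply come back as an unknown OID (a minimal check, nothing more):
Code:
# Prints the value on releases that still have it; reports an unknown OID on 10.1.
sysctl vfs.zfs.vdev.max_pending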
 
vfs.zfs.vdev.max_pending is gone. It was removed with this commit.

The change also introduced the following new sysctls:
Code:
vfs.zfs.vdev.max_active
vfs.zfs.vdev.sync_read_min_active
vfs.zfs.vdev.sync_read_max_active
vfs.zfs.vdev.sync_write_min_active
vfs.zfs.vdev.sync_write_max_active
vfs.zfs.vdev.async_read_min_active
vfs.zfs.vdev.async_read_max_active
vfs.zfs.vdev.async_write_min_active
vfs.zfs.vdev.async_write_max_active
Maybe one of these will help you? The commit I linked to also contains a big comment block which might be worth reading before you modify any of these sysctls.
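To see what these are currently set to on your system, querying the vfs.zfs.vdev subtree should work (a minimal sketch):
Code:
# List the current vdev queue depth sysctls and their values.
sysctl vfs.zfs.vdev | grep -E '(min|max)_active'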
 
Thanks for the fast reply, tobik. I already tried most of those OIDs. Below are the sysctls I tried, with no success:
Code:
# Change I/O queue settings to play nice with SATA NCQ and
# other storage controller features.
# In newer ZFS implementations, the following OIDs have been replaced ...
#vfs.zfs.vdev.min_pending="1"
#vfs.zfs.vdev.max_pending="1"
# ... by these
vfs.zfs.txg.timeout=30
vfs.zfs.vdev.sync_read_min_active=1
vfs.zfs.vdev.sync_read_max_active=1
vfs.zfs.vdev.sync_write_min_active=1
vfs.zfs.vdev.sync_write_max_active=1
vfs.zfs.vdev.async_read_min_active=1
vfs.zfs.vdev.async_read_max_active=1
vfs.zfs.vdev.async_write_min_active=1
vfs.zfs.vdev.async_write_max_active=1
vfs.zfs.vdev.scrub_min_active=1
vfs.zfs.vdev.scrub_max_active=1
But I haven't tried vfs.zfs.vdev.max_active yet. I'll give it a try. Any chance there's an explanation of these tunables somewhere?
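For the record, this is roughly what I intend to add to /boot/loader.conf to test it; the value is just for this experiment, not a recommendation:
Code:
# Experimental: cap the total number of I/Os in flight per vdev (test value only).
vfs.zfs.vdev.max_active="1"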
 
Also, how can I find out what the default for vfs.zfs.txg.timeout was on 8.3-RELEASE? Without installing it in a VM, of course, which I will do if I have to.
 
vfs.zfs.txg.timeout was set to 5 in 8.3 (and still is in 10.1). See https://github.com/freebsd/freebsd/...ntrib/opensolaris/uts/common/fs/zfs/txg.c#L41

As for explanations, I don't know of any, but if you follow the link I posted above there are comments written above the variables backing the sysctls. That seems to be the only documentation... For vfs.zfs.vdev.max_active you want to look at zfs_vdev_max_active. I'm pasting the relevant comments here:
Code:
/*
* The maximum number of I/Os active to each device.  Ideally, this will be >=
* the sum of each queue's max_active.  It must be at least the sum of each
* queue's min_active.
*/
uint32_t zfs_vdev_max_active = 1000;

/*
* Per-queue limits on the number of I/Os active to each device.  If the
* sum of the queue's max_active is < zfs_vdev_max_active, then the
* min_active comes into play.  We will send min_active from each queue,
* and then select from queues in the order defined by zio_priority_t.
*
* In general, smaller max_active's will lead to lower latency of synchronous
* operations.  Larger max_active's may lead to higher overall throughput,
* depending on underlying storage.
*
* The ratio of the queues' max_actives determines the balance of performance
* between reads, writes, and scrubs.  E.g., increasing
* zfs_vdev_scrub_max_active will cause the scrub or resilver to complete
* more quickly, but reads and writes to have higher latency and lower
* throughput.
*/
uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;
uint32_t zfs_vdev_trim_min_active = 1;
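For what it's worth, with those defaults the per-queue max_active values sum to 10 + 10 + 3 + 10 + 2 = 35, far below zfs_vdev_max_active = 1000, so it's normally the per-queue caps that bite; and if you force every queue's max down to 1 the sum is only 5, in which case the global max_active cap should never be reached at all.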
And no kiss please... :)
 
I tried setting all the vdev_*_active sysctls to "1", trying to achieve the same effect as vfs.zfs.vdev.max_pending="1" had in earlier ZFS implementations. I didn't get far before the whole server locked up. I couldn't even log in locally and had to power cycle it.

However, with all of them set to "1" except vfs.zfs.vdev.max_active, it did not lock up, but I still had the CAM/SCSI errors. And having read the explanation of what these values mean, I think vfs.zfs.vdev.max_active is irrelevant if the other *_active values are set to "1".

So these new sysctl(8) tunables don't seem to help. It seems the new ZFS implementation, with regard to queueing, does not allow me to work around the SATA-on-SAS-expander toxicity issue. It was likely just by chance that vfs.zfs.vdev.max_pending="1" worked for me before.

Really I should just buy SAS drives. It's not going to be cheap, but I can't afford to mess around any longer.

There is one more thing I'm trying: disabling NCQ (queueing) per drive via camcontrol(8). I wrote a small one-liner script to do it on all my WD drives, so as to exclude my SAS drives. Which reminds me: I'm mixing SAS and SATA drives in one of my arrays. That is probably not a good idea, so I temporarily popped the SATA drives out of the array with the SAS drives. I now have two arrays, one with only SATA and one with only SAS. I'm doing a big send/recv operation and scrubbing one of the pools at the same time, to see if it triggers the CAM/SCSI errors. The problem is, it can take a while, and it only shows symptoms under heavy load.

The NCQ disable script:
Code:
#!/bin/sh
# Drop the NCQ tag count to 1 on every WD SATA drive (the grep pattern
# deliberately skips the SAS drives, which are not WDC ATA devices).
for i in $(camcontrol devlist | grep "ATA WDC WD" | cut -d"," -f2 | cut -d")" -f1); do
    camcontrol tags "$i" -N 1
done
This simply turns the number of queue tags down to 1 via the command camcontrol tags daX -N 1, where X is the drive ID.

Another way to disable NCQ is to turn tags off altogether: camcontrol negotiate daX -T disable. I'll try that next perhaps.
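If I do, it would be a small variation of the script above (untested sketch; per camcontrol(8), the -a flag asks it to apply the new negotiation settings immediately):
Code:
#!/bin/sh
# Untested variant: disable tagged queueing entirely on the WD SATA drives,
# rather than just capping the tag count at 1.
for i in $(camcontrol devlist | grep "ATA WDC WD" | cut -d"," -f2 | cut -d")" -f1); do
    camcontrol negotiate "$i" -T disable -a
done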

The frustrating thing now is that it takes so long to know whether any of these changes are making a difference, since the problem only shows up occasionally, under heavy load. I have to wait 1-2 days before I get feedback.
 
I forgot to mention: setting the queue size to 1 did not seem to affect send/recv performance. It shows exactly the same GB/s transfer rate as before, measured over a 2-minute period.

Also, you can check the queue size via: camcontrol tags daX -v
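If you want to dump that for every drive at once, a small loop along these lines should do it (untested sketch):
Code:
#!/bin/sh
# Untested sketch: show the current tag/queue settings for every da(4) device.
for i in $(camcontrol devlist | grep -o 'da[0-9][0-9]*' | sort -u); do
    echo "== $i =="
    camcontrol tags "$i" -v
done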
 
Quick update: I just did another zpool list and I'm already up to 571 GB on the new pool (the one I'm zfs sending to). That does seem much faster than before. Previously it was taking almost 2 days to send/recv 3.6 TB. Now it has done 0.5 TB in about an hour or so, while also scrubbing the sending pool. Even allowing for the daytime workload slowing down the previous send/recv, it doesn't even come close. I am tired and I wonder if I'm imagining this, but turning my CAM devices' queueing down to 1 seems to have really boosted the performance. I wonder if the current transfer speed is normal and the previous speed was slow because of the CAM/SCSI issues.

It's as if SATA NCQ is not compatible with the SAS controller/bus equipment, and a performance-enhancing feature is actually causing severe performance degradation, not to mention timeouts and eventually dropping the drives from the CAM bus.

It's now at 716 GB and I'm really starting to get excited. I think I've fixed it. No errors in dmesg so far, and all drives are looking healthy. But I've been here before; it's maybe too early to speculate.
 
There is another possible explanation. I removed the SATA drives from the enclosure that houses the SAS drives, so I will need to push those SATA drives back in to confirm whether the performance increase and apparent lack of CAM errors are related to drive-type mixing or to the queue settings.
 
Everything is looking good now. The send/recv transfer is still running at good performance. I even pushed the remaining SATA drives back in and then ran my NCQ-disable script again. No errors in dmesg, the transfer is almost 50% complete, and I'm scrubbing another pool at the same time.

Something else to note: I now have a problem when plugging a SATA drive in right next to a SAS drive. There seem to be physical partitions in the 24-drive enclosure, 8 drives per partition. The SATA drive locks up when it tries to come online and the blue LED stays lit, yet the same drive spins up fine in any other slot. So I think this is normal: you can't mix SAS and SATA on the same enclosure partition (I bet there's a better name for this block of ports). I think I never noticed this before because that slot only held a spare; now that I'm using it for a zpool, the drive has a problem being in that slot/port. And it makes me wonder if this was part of my problem all along.

For now I will assume that it's turning the queue size down to 1 via camcontrol that is making everything work well.
 
Nope, I still get CAM/SCSI errors. I think the only thing I can do is change my hardware configuration: either spend S$30,000 on SAS drives, buy some SATA-to-SAS interposers, or fit my server with large 4TB 3.5" SATA drives. I have just enough SATA ports and 3.5" bays to do it, although I'll have to find power connectors and space to retrofit my SSDs, which are currently in the 3.5" bays.

This is not a good situation to be in. I still haven't had a backup in over a week, and my data pool has no redundancy. I'm hanging by a thread.

It's going to take at least 2 weeks to get the SAS drives (here in Singapore). So I'm seriously considering putting ten 3.5" drives in the server itself, effectively ditching the 2.5" SAS enclosure and its drives. At least I would be able to get a redundant data pool and a full backup.

What's happening at the moment is that whenever I try to move data, either to a new data pool or to a backup pool, or to attach a drive and resilver, I get the CAM/SCSI errors before the operation completes. There is 3.6 TB of data involved, and the errors lead to permanent errors and never-ending resilvers. Actually, the resilvers complete, but after an export/import it all starts over again. I have to do these operations at night, because during the day they cause even more problems.

NEVER USE SATA DRIVES ON A SAS BUS/ENCLOSURE!!!
 
OK, another status report. It looks like maybe my problems mostly stem from the fact that I was mixing SAS and SATA drives in the same enclosure. I have since removed the SATA drives (the backup pool) from the enclosure housing the SAS drives. I have managed to do a full send/recv during a working day (so the load was reasonably high), followed by a scrub, followed by an incremental send/recv, and now I'm doing another scrub. I never got anywhere near this far before without a SCSI/CAM error in my dmesg. I did get some strange bus messages (below), but no read, write, or checksum errors, and certainly no drives popping out due to excessive timeouts. So all is looking good so far. But I've said that before and the errors came back.

Strange dmesg output (not directly related to my main issue):

Code:
ses2: da7,pass16: Element descriptor: '000'
ses2: da7,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc7
ses2: da6,pass15: Element descriptor: '001'
ses2: da6,pass15: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc6
ses2: da5,pass14: Element descriptor: '002'
ses2: da5,pass14: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc5
ses2: da4,pass13: Element descriptor: '003'
ses2: da4,pass13: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc4
ses2: da3,pass12: Element descriptor: '004'
ses2: da3,pass12: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc3
ses2: da2,pass11: Element descriptor: '005'
ses2: da2,pass11: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc2
ses2: da1,pass10: Element descriptor: '006'
ses2: da1,pass10: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc1
ses2: da0,pass9: Element descriptor: '007'
ses2: da0,pass9: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bc0
ses2: da13,pass22: Element descriptor: '008'
ses2: da13,pass22: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bdb
ses2: da12,pass21: Element descriptor: '009'
ses2: da12,pass21: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
ses2:  phy 0: SATA device
ses2:  phy 0: parent 5003048000b22bff addr 5003048000b22bda

So the next question is: do the CAM queue settings I'm using have a part to play? Remember, I have a script that turns the NCQ (tag) queueing down to 1. I guess I'll have to try to find out, but not until I can stabilize my data pool and do a full backup.

If the SATA/SAS mixing is the main cause, I will need to get the 6 SAS drives out of the enclosure. I have an idea about that: I'll replace them with 2 internal SSDs and use the SAS drives elsewhere.
 
Just to follow up on this issue: I was exhausted after I resolved it, so I never gave it proper closure.

It turned out that even with only SATA drives in any of the enclosures, the highly toxic SCSI timeouts persisted. I only just managed to get the ~3.6 TB of data onto a new, healthy pool that is not on the SAS bus. In fact I lost about two files to permanent errors. It was so incredibly stressful that only copious amounts of beer got me to sleep at night for a while. Seriously, do not put SATA drives on a SAS bus, no matter what the product specs tell you.

So I eventually decided to build pools from large 3.5" 4/6TB drives: 6 x 4TB WD Reds for the data pool (2 vdevs, each a 3-way mirror) and 4 x 6TB Reds for the backup pool (2 vdevs, each a 2-way mirror). This was mainly because every supplier I went to here in Singapore quoted at least 2 weeks' lead time for 2.5" SAS drives. I was dealing with a ticking time bomb, so 2 weeks was out of the question. It was also about 8-10 times cheaper to go with consumer-grade SATA drives than 2.5" SAS drives. The performance may be lower, but with 3-way mirrors, 2 vdevs, and 2 large SSD cache drives, the difference is somewhat offset. Fortunately I had space in the large 4U self-built rack server chassis for ten 3.5" drives. I used 2 of these 4-port PCIe SATA controller cards: http://www.iocrest.com/en/product_details304.html. I can confirm that they work very well in FreeBSD 10.0 or above.
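For what it's worth, the resulting layout looks roughly like this (device names are purely illustrative; yours will depend on the controller):
Code:
# Illustrative only - adjust device names to match your system.
# Data pool: 2 vdevs, each a 3-way mirror of 4TB WD Reds.
zpool create data mirror ada0 ada1 ada2 mirror ada3 ada4 ada5
# Backup pool: 2 vdevs, each a 2-way mirror of 6TB Reds.
zpool create backup mirror ada6 ada7 mirror ada8 ada9
# Plus the two SSD cache devices mentioned above.
zpool add data cache ada10 ada11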
 
Thanks to you and all the others in the forums for testing and reporting this issue.
That pushed me to switch to a direct-attached disk setup for a new server.
I selected a 2U open-bay chassis with four open 5.25" bays and two 3.5" bays, the RM21600 from Chenbro.
Around twenty 2.5" drives should fit in that case, with internal HDD racks installed for the 2.5" drives.

It is more complex to source, build, and manage, and more expensive to buy, than just getting a case with a SAS backplane.
 
What an interesting old thread to have stumbled upon... I use SATA disks in 5-bay port multipliers, with both a Sil3124 and a Sil3132 controller. (I have 10 drives connected to my Sil3132, and 5 new drives on the Sil3124.)

I was working on migrating from a 6-disk zpool that had been created on pre-AF drives, so ashift=9, to a 5-disk zpool. (I have a smaller 4-disk zpool with one drive nearing imminent failure and I'm out of 1.5TB spares... so the plan is to reuse 5 disks from the 6-disk pool, redone with ashift=12 to cope with future replacements, similar to what I did when my mirrored 1.5TB zpool for /home failed a disk. Somehow I had the foresight to create that pool with ashift=12, so upgrading it to 2-3TB drives wasn't that painful. The new server I had been wanting to build has been abandoned, as I keep stealing the drives set aside for it for other needs, plus I found the dual-CPU mobo and heatsinks wouldn't fit in the case I had.)

Anyway, I want to move the data on the older 6-disk pool (7 TiB of data, cramming the zpool to 97%) into a new 5-disk pool, where the same data would use ~72% of the zpool. (I'm also opting for 5 disks on a single channel, since the hold-up time between arrays differs during a UPS transfer, and ZFS doesn't like it when half the drives in the pool disappear; it should handle losing 2 drives, but doesn't like losing 3.) I had debated getting an online UPS, but that's not likely to happen. (It was hard enough replacing a true-sine UPS... just as I finally got around to hooking it up for SNMP, it got spiked through the network interface. The manual did say to fully power down the UPS before connecting it; my luck ran out while cleaning up my cabling. Though now I'm not sure I still need true-sine...)

At first it ran for almost 2 days with less than 1 TB transferred, when the system crashed, and it took forever for the recovery to undo what it had done. Then I found a blog post about using misc/mbuffer; with that it was doing about 1 TB per day, until another crash and recovery undid all the progress. (I was sure that I had stopped my daily ports update builds, but something happened and they slipped back in a couple of days later. I now suspect it was the periodic cleanup of /tmp that removed the file that inhibits the builds, something I had quickly added to the scripts for this migration work.)

But then I read in this thread about camcontrol(8) and lowering tags. My reading of the man page seems to say the min/max range defines the minimum I can change it to. So I changed the tags for the 5 new disks to 2 and immediately saw a performance improvement. I debated disabling queueing entirely, but things were well underway, so I opted not to change anything else.

It took just over 2.5 days to send/receive 7 TiB. Now I wonder how camcontrol affects general performance (many mixed reads/writes). I checked some zfs properties while the send/receive was taking place, and it took some time to get them. I also had 'sync=always' set; I had read somewhere that it could help keep ZFS performance up on zpools that are over 80% full. It seems zfs receive works differently, since I saw no activity on its SLOG. Now to work on migrating the many filesystems on the 4-disk pool (raidz - 1.5TBx3) to its new 5-disk pool (raidz2 - 2TBx3).
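(For reference, that property is set per pool or dataset; a minimal example with a hypothetical pool name:)
Code:
# Treat all writes on this pool as synchronous ("mypool" is hypothetical).
zfs set sync=always mypool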

The 7 TiB is my BackupPC pool of the many things I've backed up over the years, including several computers that are no longer around... and 'things' includes FTP backups (now rsync, since upgrading to a plan with shell access) of my shared web-hosting accounts.

The Dreamer

Forgot to add: I've been working on upgrading this server's hardware... though now I wonder whether using a SAS controller to drive these SATA drives was the right way to go.
 