ZFS problems still on 8-STABLE

OK, I have a server I manage which has had ongoing issues with ZFS.

They include:

Processes locking up unkillable in a zfs wait state; since this was usually apache or ftp, I think this was down to the known sendfile bug.
Slow performance compared to UFS, e.g. 2 MB/s causing 100% disk utilisation on ZFS whilst the same load is just 4% on UFS.
Excessive RAM usage when UFS is used at the same time.

Three days ago I updated from 8.2-RC to 8-STABLE for the security updates and to get the ZFS patches that came in right after the 8.2 release.

What has happened since: I have had complaints of laggy performance, which I haven't pinned down to anything yet, but the only thing done in the last 3 days is the OS update. More importantly, dovecot processes all locked up unkillable; I couldn't get more detail other than that they were in D state, which I believe is I/O wait, so this would point to ZFS again. In addition there were more issues with the backups last night, which I had had under control for the past few months.

ZFS on servers which are only for testing or light load is fine, but this server is very heavy on I/O and ZFS has been less than stable over that time.

So what I am asking is:

Confirmation that D state is I/O related (see the sketch after this list).
How to look for a certain PID in top.
The best ZFS settings people have found for MySQL when using MyISAM (there are guides for InnoDB but not MyISAM) and for lots of small files; I have both on different filesets.
Tips on how to debug ZFS lockups.
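
For what it's worth, this is roughly as far as I got poking at the stuck processes; the MWCHAN column from ps and the kernel stack from procstat should show what a D-state process is waiting on (1234 is a placeholder PID):

Code:
# MWCHAN column shows what the process is sleeping on
ps -l -p 1234
# dump the kernel stack of the stuck process
procstat -kk 1234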

The server is probably very close to being wiped and reinstalled with UFS, as the owners, as you may guess, are not too happy. It's currently set up as a 2-drive ZFS mirror, prefetch disabled, 12 GB of RAM, dual-core Intel CPU.
 
Here are the questions I have in mind after reading Your post:

1. Is this amd64 or i386?
2. Have You upgraded the ZPOOL version and ZFS version after updating the OS to 8.2-STABLE?
3. What is the recordsize for the database storage?
4. Are You using the deduplication?
5. Are You using compression?
6. zfs-stats (with the -a option) can show You some issues with ZFS (pkg_add -r zfs-stats); posting the results here can also be helpful.
7. Show us output of zpool history command.
 
1. amd64; I thought that was obvious after posting the RAM, sorry.
2. It's still v15.
3. The mysql fileset has a recordsize of 8k, which I believe matches MyISAM.
4. No.
5. I was previously, but it was causing instability so it is currently disabled; however, that was a while ago.
6.
Code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Oct  5 15:56:54 2011
------------------------------------------------------------------------

System Information:

        Kernel Version:                         802512 (osreldate)
        Hardware Platform:                      amd64
        Processor Architecture:                 amd64

        ZFS Storage pool Version:               28
        ZFS Filesystem Version:                 5

FreeBSD 8.2-STABLE #1: Fri Sep 30 10:40:59 EEST 2011 root
 3:56PM  up  3:03, 3 users, load averages: 2.12, 1.64, 1.51

------------------------------------------------------------------------

System Memory:

        21.75%  2.52    GiB Active,     14.39%  1.67    GiB Inact
        41.54%  4.82    GiB Wired,      0.05%   5.78    MiB Cache
        22.26%  2.58    GiB Free,       0.01%   1.14    MiB Gap

        Real Installed:                         12.00   GiB
        Real Available:                 99.83%  11.98   GiB
        Real Managed:                   96.81%  11.60   GiB

        Logical Total:                          12.00   GiB
        Logical Used:                   64.53%  7.74    GiB
        Logical Free:                   35.47%  4.26    GiB

Kernel Memory:                                  4.17    GiB
        Data:                           99.72%  4.16    GiB
        Text:                           0.28%   11.86   MiB

Kernel Memory Map:                              17.53   GiB
        Size:                           22.25%  3.90    GiB
        Free:                           77.75%  13.63   GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                18.18k
        Recycle Misses:                         10.36k
        Mutex Misses:                           5
        Evict Skips:                            5

ARC Size:                               99.39%  3.98    GiB
        Target Size: (Adaptive)         100.00% 4.00    GiB
        Min Size (Hard Limit):          75.00%  3.00    GiB
        Max Size (High Water):          1:1     4.00    GiB

ARC Size Breakdown:
        Recently Used Cache Size:       63.11%  2.52    GiB
        Frequently Used Cache Size:     36.89%  1.48    GiB

ARC Hash Breakdown:
        Elements Max:                           255.25k
        Elements Current:               99.45%  253.84k
        Collisions:                             611.94k
        Chain Max:                              7
        Chains:                                 66.29k

------------------------------------------------------------------------

ARC Efficiency:                                 136.75m
        Cache Hit Ratio:                99.84%  136.53m
        Cache Miss Ratio:               0.16%   221.47k
        Actual Hit Ratio:               99.84%  136.53m

        Data Demand Efficiency:         99.88%  134.87m
        Data Prefetch Efficiency:       0.00%   289

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           1.11%   1.52m
          Most Frequently Used:         98.89%  135.01m
          Most Recently Used Ghost:     0.00%   1.08k
          Most Frequently Used Ghost:   0.01%   13.74k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  98.67%  134.71m
          Prefetch Data:                0.00%   0
          Demand Metadata:              1.33%   1.82m
          Prefetch Metadata:            0.00%   66

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  71.15%  157.57k
          Prefetch Data:                0.13%   289
          Demand Metadata:              28.39%  62.87k
          Prefetch Metadata:            0.33%   739

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------


------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
        kern.maxusers                           768
        vm.kmem_size                            19327352832
        vm.kmem_size_scale                      1
        vm.kmem_size_min                        0
        vm.kmem_size_max                        329853485875
        vfs.zfs.l2c_only_size                   0
        vfs.zfs.mfu_ghost_data_lsize            2637667328
        vfs.zfs.mfu_ghost_metadata_lsize        40979968
        vfs.zfs.mfu_ghost_size                  2678647296
        vfs.zfs.mfu_data_lsize                  1198331904
        vfs.zfs.mfu_metadata_lsize              103634432
        vfs.zfs.mfu_size                        1329270272
        vfs.zfs.mru_ghost_data_lsize            1520379904
        vfs.zfs.mru_ghost_metadata_lsize        73915392
        vfs.zfs.mru_ghost_size                  1594295296
        vfs.zfs.mru_data_lsize                  1993404928
        vfs.zfs.mru_metadata_lsize              378664960
        vfs.zfs.mru_size                        2670475776
        vfs.zfs.anon_data_lsize                 0
        vfs.zfs.anon_metadata_lsize             0
        vfs.zfs.anon_size                       3869184
        vfs.zfs.l2arc_norw                      1
        vfs.zfs.l2arc_feed_again                1
        vfs.zfs.l2arc_noprefetch                1
        vfs.zfs.l2arc_feed_min_ms               200
        vfs.zfs.l2arc_feed_secs                 1
        vfs.zfs.l2arc_headroom                  2
        vfs.zfs.l2arc_write_boost               8388608
        vfs.zfs.l2arc_write_max                 8388608
        vfs.zfs.arc_meta_limit                  1073741824
        vfs.zfs.arc_meta_used                   1073811120
        vfs.zfs.arc_min                         3221225472
        vfs.zfs.arc_max                         4294967296
        vfs.zfs.dedup.prefetch                  1
        vfs.zfs.mdcomp_disable                  0
        vfs.zfs.write_limit_override            0
        vfs.zfs.write_limit_inflated            38588792832
        vfs.zfs.write_limit_max                 1607866368
        vfs.zfs.write_limit_min                 33554432
        vfs.zfs.write_limit_shift               3
        vfs.zfs.no_write_throttle               0
        vfs.zfs.zfetch.array_rd_sz              1048576
        vfs.zfs.zfetch.block_cap                64
        vfs.zfs.zfetch.min_sec_reap             2
        vfs.zfs.zfetch.max_streams              8
        vfs.zfs.prefetch_disable                1
        vfs.zfs.mg_alloc_failures               8
        vfs.zfs.check_hostid                    1
        vfs.zfs.recover                         0
        vfs.zfs.txg.synctime_ms                 1000
        vfs.zfs.txg.timeout                     5
        vfs.zfs.scrub_limit                     10
        vfs.zfs.vdev.cache.bshift               16
        vfs.zfs.vdev.cache.size                 67108864
        vfs.zfs.vdev.cache.max                  16384
        vfs.zfs.vdev.write_gap_limit            4096
        vfs.zfs.vdev.read_gap_limit             32768
        vfs.zfs.vdev.aggregation_limit          131072
        vfs.zfs.vdev.ramp_rate                  2
        vfs.zfs.vdev.time_shift                 6
        vfs.zfs.vdev.min_pending                1
        vfs.zfs.vdev.max_pending                1
        vfs.zfs.vdev.bio_flush_disable          0
        vfs.zfs.cache_flush_disable             0
        vfs.zfs.zil_replay_disable              0
        vfs.zfs.zio.use_uma                     0
        vfs.zfs.version.zpl                     5
        vfs.zfs.version.spa                     28
        vfs.zfs.version.acl                     1
        vfs.zfs.debug                           0
        vfs.zfs.super_owner                     0

------------------------------------------------------------------------
7. zpool history is mainly me changing things like compression on and then off again, plus many set quota commands from a script that runs periodically, so it is far too big to post here (example commands below). Does zpool history ever get pruned? It seems to grow indefinitely.
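
For reference, the recordsize from point 3 and the tail of the history from point 7 can be checked like this (dataset and pool names are placeholders):

Code:
zfs get recordsize tank/mysql
zpool history tank | tail -n 20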

The ZFS vdev cache size was at its default until today; I increased it to see if I/O performance goes up. The rest that aren't at defaults have been that way a while, and I will answer questions if needed as to why they are set like that. Also, AHCI is currently enabled; overall performance is better and more stable with AHCI on.
 
I would not yet consider myself a ZFS or FreeBSD expert, but I can tell you what I know...

Your L2ARC says it is disabled. You could add an SSD or some other fast device as a cache vdev.
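
Something like this, assuming a pool named tank and a spare device ada2 (both placeholders); a cache vdev can be removed again at any time:

Code:
# add an L2ARC cache device to the pool
zpool add tank cache ada2
# remove it later if it does not help
zpool remove tank ada2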

Since you are using a database (which I assume uses synchronous writes), you should add a separate ZIL device if you don't have one already. But be warned: you cannot remove a ZIL device until pool version 19, so if your ZIL dies, the pool is lost. You should therefore either have a good mirror, or upgrade to at least version 19. And if you upgrade, the pool can't be read later on older software.
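
A sketch of adding one, with placeholder pool and device names:

Code:
# add a mirrored separate log (ZIL) device
zpool add tank log mirror ada3 ada4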

Read here about the relevance of a ZIL when you use a database:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

Others usually say to avoid raidz when you want write performance, and to use mirrors instead. I haven't tried that; I might eventually test it out, adding a performance mirrored zpool alongside my larger raidz2 one.

My system has 48 GB of RAM. I got that much because I wanted top performance from ZFS, and then I added a mirror of SSDs as a ZIL for the synchronous writes that NFS does. Consider getting more RAM; ZFS wants more RAM... expect it to want lots.

My system is not yet under heavy load, and crashes because of the mps driver, but it seems to be fast and responsive when copying files, scrubbing and running rsync all at the same time.

And despite my ZIL, the synchronous writes still don't go very fast. To avoid this well-known problem, you could also try a UFS zvol or a separate slice for your MySQL data. Unfortunately though, in my testing on FreeBSD 8.2-RELEASE, the performance of my UFS zvol shared over NFS was very poor, only going at about 40 MB/s (over a 10-gigabit connection) and causing 100% load on the zvol's disk in gstat (I don't know what effect that has on the other real disks). I haven't yet tested it on 8-STABLE.
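
The UFS-on-zvol setup itself is simple; a rough sketch with placeholder names, size and mount point:

Code:
# create a 20 GB zvol, put UFS (with soft updates) on it, mount it
zfs create -V 20g tank/mysqlvol
newfs -U /dev/zvol/tank/mysqlvol
mount /dev/zvol/tank/mysqlvol /var/db/mysql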

I hope that helps at least a tiny bit.
 
@chrcol

You may want to increase ARC if You have any RAM available for it:

vfs.zfs.arc_min (/boot/loader.conf)
vfs.zfs.arc_max (/boot/loader.conf)

... and also modifying these may increase performance (syntax example after the list):

vfs.zfs.txg.timeout (/boot/loader.conf)
vfs.zfs.vdev.min_pending (/boot/loader.conf)
vfs.zfs.vdev.max_pending (/boot/loader.conf)
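
These are boot-time tunables, so they go into /boot/loader.conf as strings; the values below are only illustrative, not recommendations:

Code:
# /boot/loader.conf -- example values only
vfs.zfs.arc_min="2147483648"
vfs.zfs.arc_max="6442450944"
vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.min_pending="4"
vfs.zfs.vdev.max_pending="8"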

2. It's still v15.
Upgrade the pool to v28 and the ZFS filesystem version to v5, and check if that helps.
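
The upgrade itself is two commands (pool name is a placeholder), but remember it is one-way:

Code:
zpool upgrade tank
zfs upgrade -r tank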


Code:
vfs.zfs.arc_meta_limit                  1073741824
vfs.zfs.arc_meta_used                   1073811120
I am no ZFS expert, but it seems that You have already hit the cache size limit for metadata.

Try increasing the vfs.zfs.arc_meta_limit oid.
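
If I remember right this is a loader tunable; for example, doubling it to 2 GiB (example value only):

Code:
# /boot/loader.conf
vfs.zfs.arc_meta_limit="2147483648"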

I would also add 2 * SSD for L2ARC (You may add 2 * cheap 8 GB USB drives to test whether it helps).

A 2 * SLC SSD mirror for the ZIL would also probably help.

Also, what are those HDDs You have in there?
 
Answers to the points you made.

You may want to increase ARC if You have any RAM available for it:

This has to be capped, as the backups to UFS disks make RAM usage go through the roof; when RAM is exhausted by UFS and ZFS together, the server prefers to swap data out, and that has a significant effect on performance. With the ARC capped at 4 GB, there is about 2-3 GB of free RAM during the day and light swap usage at night.

... and also modifying these may increase performance:

vfs.zfs.txg.timeout (/boot/loader.conf)
vfs.zfs.vdev.min_pending (/boot/loader.conf)
vfs.zfs.vdev.max_pending (/boot/loader.conf)

These have had hours of tuning and research. Setting the pending values to 1 I found gave the greatest performance; when they are set higher, disk latency goes through the roof and even a simple thing like 'nano /etc/make.conf' can take several seconds to open. The same goes for the txg timeout: older versions of ZFS defaulted it to a higher value and then reduced it, I guess for the same reason I discovered, namely that it makes writes starve reads when too large an amount of data is flushed to disk at once.

I am very hesitant to upgrade ZFS as it's a one-way path, and v28 is by some accounts not considered stable yet. I have also done no personal testing on v28, whilst I had done a lot of testing on v15 before I upgraded to it.

The owners of the server refuse to add an SSD.

All the drives, including the UFS ones, are Seagate Barracuda 7200.12s.

Thanks.
 
chrcol said:
This has to be capped, as the backups to UFS disks make RAM usage go through the roof; when RAM is exhausted by UFS and ZFS together, the server prefers to swap data out, and that has a significant effect on performance. With the ARC capped at 4 GB, there is about 2-3 GB of free RAM during the day and light swap usage at night.

That's quite a well-known problem with heavy use of both UFS and ZFS: UFS wants to buffer/cache as much as possible in RAM, and the same goes for ZFS. The solution may be to use only one of them for heavy use, e.g. UFS everywhere, or just /boot (or a 512 MB read-only /) on UFS and all the rest on ZFS, or even boot from the ZFS pool.

I am very hesitant to upgrade ZFS as it's a one-way path, and v28 is by some accounts not considered stable yet.
I haven't heard any 'bad stories' about v28; any links/bad experiences?

... also think about CC'ing this problem to freebsd-questions@freebsd.org or freebsd-stable@; there are a lot more developers there than here.
 
Well, it seems this is a major performance regression from 8.2-RC.

Backups are now taking about 4x longer, and as such are still running during the day.

I have this error reported in various logs also:

'Can't read physical znode'

I am also aware of the UFS/ZFS memory issues and talked about them on the mailing lists months back. The problem is that UFS uses tons of RAM for its caching (it's more aggressive than ZFS, despite people thinking otherwise). The logical solution I proposed was to add a cap to UFS file caching the same way ZFS has one, but it fell on deaf ears. E.g. before I set the minimum ARC size to 3 GB, UFS would force the ZFS ARC down to almost 0, then start making things swap out, and had over 9 GB of data cached with almost all processes running in swap. So why am I using UFS for backups? Because when I was doing heavy reads and writes to ZFS at the same time it was even worse: the server got so laggy that services would go offline as unresponsive, and this would carry on until the backups were complete.

I accept that ZFS v28 may yield some improvement (assuming no complications); however, what shouldn't be happening is the performance regression I am seeing now. It's not as if I have downgraded ZFS; all I did was update FreeBSD to newer code, which I know includes various ZFS patches, and it would seem at least one of these patches has a performance regression not picked up by light users.

There is a possibility it is caused by a setting: previously I had throttled writes on ZFS, and the sysctl has changed syntax, meaning it is not currently set. I cannot remember why I throttled writes, but I think it was to make reads faster, because on ZFS when it is writing at high speed reads get starved.
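
If the knob in question is the write limit override (it shows as 0, i.e. unset, in the zfs-stats dump above), it can be checked and set at runtime; the value below is only an example:

Code:
sysctl vfs.zfs.write_limit_override
sysctl vfs.zfs.write_limit_override=1073741824   # example: cap txg writes at 1 GiB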

The backups in question are 14 GB in size and took 12 hours and 40 minutes to run; during those 12 hours the server was noticeably laggy as well.

On another server with similarly sized backups that's just UFS, the backups take just 1 hour 50 minutes to run. Granted, that server also lags during backups, but it completes them about 6x as fast with similar CPU and HDD power.

Prior to updating FreeBSD the backups took approx. 3-4 hours to complete, so the regression is roughly a threefold slowdown.

I wonder if the error I pasted sheds any light, as the only hit I found on Google was inside the source code.
 
OK, some good news and a further update.

The plan is now to upgrade to v28. I have found the same bad stories, but they were all from people on CURRENT, so I am hoping/assuming those issues won't exist on STABLE. One was a guy in the same situation as me, who had a large performance regression before he even upgraded the filesystem (he had just updated the OS code).

Basically, prior to the update gsched was enabled and I had the ZFS writes throttled via the override setting. Last night I turned both of these back to how they were, and I have had reports that performance is back to how it was; the backups seemed to finish much faster as well. This is of course only one day, but I hope it will stick now.

I plan to utilise the 2 spare HDDs in the machine to add a mirrored ZIL device, as I have determined there are constant writes going on, which should make this a prime candidate for a separate ZIL device. As I know newer ZFS has improvements to the ZIL (it can tolerate failures and log devices can be removed), this is added reasoning for upgrading ZFS. I won't immediately enable deduplication though, but that will of course be an option.

Incidentally, disk utilisation is reporting very high again though.

Example here: low throughput but high utilisation.

Code:
dT: 1.009s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    3     61     46   1278   29.4      9    960   14.5   76.8| ada0
    2     57     42   1663   21.3      9    960    9.7   66.5| ada1
 
FYI: chrcol, when you upgrade to v28, do not enable dedup until you have analyzed the performance impact. My system has 48GB of memory, half free, and dedup horribly slows down "zfs recv" and scrub, and slows all other writes but not by as much. Turning off dedup fixes writes, but does not fix the scrub speed because the data on disk is already deduped.
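
If in doubt, dedup status and the achieved on-disk ratio can be checked per pool before relying on it (pool name is a placeholder):

Code:
zfs get dedup tank
zpool list tank           # the DEDUP column shows the achieved ratio
zfs set dedup=off tank    # only affects newly written data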
 
peetaur said:
FYI: chrcol, when you upgrade to v28, do not enable dedup until you have analyzed the performance impact. My system has 48GB of memory, half free, and dedup horribly slows down "zfs recv" and scrub, and slows all other writes but not by as much. Turning off dedup fixes writes, but does not fix the scrub speed because the data on disk is already deduped.

I also had a warning from someone else, thanks.

The other guy who warned me runs a heavy server: previously he had 16 GB of RAM with a 12 GB ARC; when he enabled dedup, the demand on his RAM went through the roof and he had to upgrade to 24 GB of RAM and increase the ARC to 20 GB. I have been very busy lately though, so haven't done much testing. What I have done to relieve the immediate problems is the following.

Made a RAM disk for /tmp and /var/tmp (it seems random large files are being dumped to these folders, so this is helping); see the fstab sketch after this list.
Disabled flushing on ZFS, as MyISAM does an fsync on every write; my friend told me about this, as he had a huge performance boost when he disabled flushes. Note ZFS still does its batch flushes, so this doesn't disable those.
Moved (temporarily) the databases to a UFS drive.
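
For the record, the swap-backed RAM disks can be set up from /etc/fstab on FreeBSD 8 like this (sizes are examples):

Code:
# /etc/fstab -- md-backed memory filesystems
md  /tmp      mfs  rw,-s1g  0  0
md  /var/tmp  mfs  rw,-s1g  0  0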

The system is now stable and no one is complaining, although with the 24/7 UFS usage and the RAM disk, RAM is under a lot of pressure; the main thing is that at the moment it's working much better.

There is a glut of ZFS tuning guides for InnoDB, but not for MyISAM. In addition, I think there is hardly anyone yet using ZFS in production on busy servers; it's still mainly hobbyists and developers.

Finally, I also increased the vnode limit on the system; I think vnodes were being starved a bit.
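
The limit and current usage can be inspected and raised at runtime; the value below is purely illustrative:

Code:
sysctl kern.maxvnodes vfs.numvnodes
sysctl kern.maxvnodes=400000    # example value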

However, in gstat the ZFS drives still show poor I/O per second, high latency and low throughput relative to the disk utilisation.

To balance this out though, I have another ZFS server which is under about 1/3 of the load, albeit on weaker hardware; in gstat that one looks much better and ZFS seems to perform significantly better on it. I do know that server only uses InnoDB though, and its ARC is under less pressure.

The ZIL device is definitely still planned, and mirrored as well.
 
@chrcol,

Just a few things I noticed. You say that you are using ZFS v15, but I see in the output of zfs-stats:

Code:
...
        ZFS Storage pool Version:               28
        ZFS Filesystem Version:                 5
...

This means that the system is running ZFS v28 code.

Code:
System Memory:

        21.75%  2.52    GiB Active,     14.39%  1.67    GiB Inact
        41.54%  4.82    GiB Wired,      0.05%   5.78    MiB Cache
        22.26%  2.58    GiB Free,       0.01%   1.14    MiB Gap

Your free memory looks very good for a heavily loaded ZFS system.

Code:
ARC Size:                               99.39%  3.98    GiB
        Target Size: (Adaptive)         100.00% 4.00    GiB
        Min Size (Hard Limit):          75.00%  3.00    GiB
        Max Size (High Water):          1:1     4.00    GiB

This doesn't look good to me, though. Are you tuning both arc_min and arc_max?
As a good rule of thumb, try using at least half of your RAM for arc_max.
Limiting this value too much could cause deadlocks.

I am also curious to find out what this server does.

You did mention dovecot, apache and mysql. May I assume that this is a mail server?

George
 
gkontos, I believe he said he limited it so UFS would have some memory to use. But you are right... I believe that it is his most significant limitation.

Why don't you upgrade your RAM? RAM is cheap... I added 8 more GB to my desktop for only 40€; that's twice what you gave ZFS on your server (though ECC memory costs around double, I guess). And time is money, so since that is only worth a few hours of your time, it should save you money overall.

Here is 16 GB non-ECC for 100 CAD:

Here is 8 GB ECC for 100 USD

And here is my story about ZFS performance and RAM:

Doing any sort of simple test, such as scp'ing some files, dd, or a scrub, I found that my ZFS pool (2 raidz2 vdevs of 8 SATA disks each) could read or write around 500-1000 MB/s. Comparing this to another fast system (22-disk XFS SAS hardware RAID6), the ZFS system was roughly 40% slower (on the same overly simple tests). So I thought things were all working just fine.

But then I found that my server was going rather slow when reading and writing at the same time (40-250 MB/s combined r+w), even when just copying a file with "cp", and I got a laggy shell with some tests (I forget which) but not with "cp". So I used things like zfs-stats and found, to my surprise, that I was barely using any memory at all. Therefore, something I had read about the ARC is completely wrong (on FreeBSD): they generally say you shouldn't need to tune the ARC unless you have very low memory or some special requirements. So I had left the defaults (except for upping kernel memory to 1 GB to get my 10-gigabit network to work).

After figuring out why it was using so little memory, I set up the kernel boot parameters (mostly things you are already familiar with); for example, I set kernel memory to 46 GB (but it only uses around 34 GB for the ARC). Now it performs the same on the overly simple tests if I make sure it isn't caching the data itself, and copying files goes more like 700 MB/s (r+w), which is almost double the speed of the 22-disk XFS SAS system. I don't know why it needs to cache so much (metadata?) just to copy a single file; it seems terrible, but it is apparently what ZFS needs to perform properly, so I accept it. It is not just to improve things beyond already-good performance. (And I think dedup still goes slow.)
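
Assuming the figures above, the relevant /boot/loader.conf entry would look roughly like this (the explicit ARC cap is my assumption; I mainly raised kernel memory):

Code:
# /boot/loader.conf -- example for a 48 GB machine
vm.kmem_size="46G"
#vfs.zfs.arc_max="34G"    # optional explicit cap (assumption, not what I set)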
 
An update:

It is now v28.

I enabled prefetch and reduced the prefetch size.

This seems to have been a positive change; however, if doing something like tarballing, then all other I/O almost stops. Even when the tarball is writing to another disk, simply reading the data for the tarball cripples the server, but not on UFS. An ARC can only do so much, since if I am tarballing 200 GB of data then ultimately that's not all going to fit in the cache and will need reads. Even when not tarballing, the server has a lot of reads.
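
For reference, the prefetch knobs involved are loader tunables; these values are placeholders rather than the exact ones I used:

Code:
# /boot/loader.conf -- re-enable prefetch with a smaller prefetch window
vfs.zfs.prefetch_disable="0"
vfs.zfs.zfetch.block_cap="16"     # example value (default 64)
vfs.zfs.zfetch.max_streams="4"    # example value (default 8)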

What seems to be the issue here, aside from the UFS RAM usage I identified earlier, is the lack of a proper disk scheduler that aims to keep latencies low. Response time is king on a server, even if it means throughput is reduced as a penalty.
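
The gsched(8) GEOM scheduler mentioned earlier is the closest thing I know of; if I remember the syntax right, the round-robin algorithm can be inserted in front of a provider like this (device name is a placeholder):

Code:
# scheduler class plus the round-robin algorithm
kldload geom_sched gsched_rr
# transparently insert the scheduler in front of ada0
gsched insert -a rr ada0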
 