ZFS performance degradation over time

Hi,

I'm having problems with ZFS performance. When my system comes up, read/write speeds are excellent (testing with dd if=/dev/zero of=/tank/bigfile and dd if=/tank/bigfile of=/dev/null); I get at least 100MB/s on both reads and writes, and I'm happy with that.
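For reference, the tests are essentially the following (the block size and count here are just examples, not necessarily the exact values used every time):
Code:
# write test
dd if=/dev/zero of=/tank/bigfile bs=1m count=10240
# read test
dd if=/tank/bigfile of=/dev/null bs=1m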

The longer the system is up, the worse my performance gets. Currently my system has been up for 4 days, and read/write performance is down to about 10MB/s at best.

The system is only accessed by 3 clients: myself, my roommate, and our HTPC. Usually, only one client will be doing anything at a time, so it is not under heavy load or anything.

Software:
Code:
FreeBSD leviathan 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:02:08 UTC 2009
root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
The following apps are running and touching data on the zpool:
  • rTorrent - read and write, usually active, not doing much for reads (not doing any major seeding)
  • SABnzbd+ - write only, not always active
  • Lighttpd - running ruTorrent (web interface for rTorrent); nothing else
  • samba - all of our clients are running Windows, so we use samba to network-mount the zpool

Hardware:
  • AMD Athlon II X2 250 Dual Core Processor Socket AM3 3.0GHZ
  • Gigabyte MA790GP-UD4H AMD790GX ATX AM2+/AM3 Sideport 2PCI-E Sound GBLAN HDMI CrossFireX Motherboard
  • Corsair XMS2 TWIN2X4096-6400C5 4GB DDR2 2X2GB
  • Supermicro AOC-USASLP-L8I LSI 1068E 8-PORT RAID 0/1/10 Uio SATA/SAS Controller W/ 16MB Low Profile
  • 8x Western Digital WD15EADS Caviar Green 1.5TB SATA 32MB Cache 3.5IN

ZFS setup:
I have the 1.5TB drives in one RAIDZ pool. All 8 drives are connected to the Supermicro L8I controller. The controller is set to 'disabled', so it isn't doing anything with the drives except presenting them to the system untouched. (So I'm really only using it as an expansion card, for the extra ports).
Code:
[root@leviathan ~]# zpool status
  pool: tank
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors

Any suggestions as to what might be causing the performance to degrade with system uptime? If I missed anything or more information is needed, please let me know. Thanks in advance.
 
No suggestions, but I have exactly the same problem as you do.

All my hardware is different, but the setup is mostly the same (small file server, only used by me, etc.).

One thing I have found is that "top" will say "Mem: 962M Active" even when I have closed all programs, and nothing should be using memory at all.

The memory stays active, and is never marked inactive. At the same time I see zfskern using a lot of processing power.

I've been trying to debug this for some time. I have even done a complete reinstall of the system. No luck thus far.

I've just tried disabling the ZIL in loader.conf; if it does anything, I will write back.
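For reference, the line I mean is something like this (using the vfs.zfs.zil_disable tunable that shows up in sysctl):
Code:
# /boot/loader.conf -- roughly what I added to disable the ZIL
vfs.zfs.zil_disable="1"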
 
OK, that did nothing.

Code:
last pid: 95448;  load averages:  0.00,  0.00,  0.00                                                                                 up 0+09:06:40  12:02:25
71 processes:  3 running, 68 sleeping
CPU:  0.4% user,  0.0% nice,  2.6% system,  0.4% interrupt, 96.7% idle
Mem: 1099M Active, 109M Inact, 666M Wired, 68M Cache, 213M Buf, 29M Free
Swap: 8192M Total, 7896K Used, 8184M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
 1900 <username>      1  64   20 85344K 36460K select   3:07  0.00% /usr/local/bin/python2.6 -OO /usr/local/bin/SABnzbd.py
I have sorted it by size (i.e., memory usage). In the 9 hours since I rebooted, nothing has really been done.

I moved less than 10GB from a UFS to a ZFS drive. irssi has been running in a screen session, and a few other programs have been doing minor stuff as well, but nothing major.

The program that is currently using the most memory is SABnzbd, which is using 85MB (it has done NOTHING since I started it). So I have no idea why so much memory is marked as active.

From experience, killing all the programs I have running will NOT free any more memory.
 
This could be related to the prefetch bug.

Have you tried disabling prefetch or applying the new prefetch patches?
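For anyone who wants to try it, disabling prefetch should just be a loader tunable, something like:
Code:
# /boot/loader.conf -- disable ZFS file-level prefetch (takes effect on reboot)
vfs.zfs.prefetch_disable="1"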
 
wonslung said:
This could be related to the prefetch bug.

Have you tried disabling prefetch or applying the new prefetch patches?

ZFS prefetch isn't used with <=4GB of memory.
 
You will need some ZFS tuning. First, try limiting the ARC size.
For example:
Code:
vfs.zfs.arc_min: 122880000
vfs.zfs.arc_max: 983040000
These values are a nice start for a 2GB system.
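These are loader tunables, so roughly speaking they would go into /boot/loader.conf like this:
Code:
# /boot/loader.conf -- example ARC limits for a 2GB machine (values from above)
vfs.zfs.arc_min="122880000"
vfs.zfs.arc_max="983040000"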

Regards,

George
 
oliverh said:
ZFS prefetch isn't used with <=4GB of memory.

I swear I read it wrong... I thought he had more than 4GB of memory.

My bad. Yeah, it's probably an ARC issue then...

But honestly, the best option would be to get more memory... memory is cheap. Tuning helps in the short term, but unless the motherboard won't support it, get 4+ GB.

I use 8GB myself and it works well.

EDIT:
Went back and read his post again... he's got 4GB of memory. I thought prefetch was only disabled if you had LESS than 4GB of memory, but was ON if you had exactly 4GB.
 
Hi,

I have similar problems with ZFS. I want to move my home server from Debian/Linux to FreeBSD. Now FreeBSD 8 is up and running and everything seems to be fine, except the long-term ZFS performance.
I created a 4-drive raidz pool and am currently copying the data from my old drives (the Linux ones with ext3) to the new pool. The first 10GB are copied really fast, but then the performance decreases drastically. When I start copying, the Linux drive usually reads at 20-30MB/s (according to iostat 1), and the ZFS raidz writes even faster; zpool iostat 1 reported values > 200MB/s.
After some time the performance drops to 5MB/s (on the last drive I copied); today I started to copy another drive, and I currently get about 1MB/s (according to zpool iostat).
The drives I copy are between 160 and 320GB, so copying at 1MB/s takes some time.
I followed the ZFS tuning guide in the wiki (http://wiki.freebsd.org/ZFSTuningGuide), which basically says that you should increase the kern.maxvnodes setting.
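For what it's worth, that change is a one-liner; the value below is purely an example, the guide leaves the exact number up to your RAM:
Code:
# example only -- raise kern.maxvnodes from its default; pick a value for your system
sysctl kern.maxvnodes=400000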

Note that in my experience the last days, there seems no connection between uptime and ZFS performance. It looks more like the amount of ZFS usage leads to ZFS performance degradation.

At the beginning, I also tried different values of the other settings mentioned in the ZFS tuning guide, but found no notable differences (I played with vm.kmem_size_max, vm.kmem_size, vfs.zfs.arc_max).

It is good to know that others have similar problems, so I guess my hardware is not the cause.

And NFS with ZFS is even worse. As already mentioned, I am currently copying a drive to the ZFS raidz. When I try to access this raidz via NFS during the copy, the share is so slow it is barely usable.
Note that I am directly connected via a gigabit network link, and NFS access to the ext3 HDDs of my old Linux systems is really fast.
 
wonslung said:
I swear I read it wrong... I thought he had more than 4GB of memory.

My bad. Yeah, it's probably an ARC issue then...

But honestly, the best option would be to get more memory... memory is cheap. Tuning helps in the short term, but unless the motherboard won't support it, get 4+ GB.

I use 8GB myself and it works well.

EDIT:
Went back and read his post again... he's got 4GB of memory. I thought prefetch was only disabled if you had LESS than 4GB of memory, but was ON if you had exactly 4GB.

I'm using it with 2GB and 4GB (both of them 64-bit) without any problems. Well, I don't have a datacenter, but sometimes I have to transfer big data (sat pictures, high-res photography (photogrammetry), etc.). I wouldn't try it with anything <2GB or 32-bit; too much fuss about nothing ;-)
 
gkontos said:
You will need some ZFS tuning. First, try limiting the ARC size.
For example:
Code:
vfs.zfs.arc_min: 122880000
vfs.zfs.arc_max: 983040000
These values are a nice start for a 2GB system.

Regards,

George

Since you are saying 2GB, that only applies to me, I think.

Those values are way higher than what I currently have for the ARC, but I'll try them anyway.

Currently it's at
Code:
vfs.zfs.arc_min: 53856640
vfs.zfs.arc_max: 430853120
 
It seems like we have a decent number of people with the same issue, based on how many replies there have been already.

It would be really helpful if someone familiar with the codebase could step in; they would probably have some insight into which tuning parameters are involved, and since we have several people to test with, we should be able to narrow it down.

I'm going to take an initial stab at it and guess that it's some sort of performance tuning ZFS is attempting to do "on the fly" which fails utterly for our usage patterns.
 
Agreed.

Something that might help ease the pain a bit would be something like ionice (for when moving stuff around), but I haven't been able to find anything like it for FreeBSD yet.
 
I noticed a funny thing while moving data from a UFS drive (ad22) to a raidz pool (ad12, ad14, ad16, ad18).

gstat output
Code:
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
-snip-
    4    278      2    128    6.2    270   1894    9.0   72.9| ad12
    6    182      0      0    0.0    176   1975    7.3   65.2| ad14
    5    189      2    128    6.8    181   2015    7.2   59.7| ad16
    6    184      2    128    4.8    176   1975    7.2   68.7| ad18
    0     87     40   5105    2.7     47   2903    2.3   15.6| ad22
-snip-
Can it really be true that it only uses 15% on ad22 (which is also getting data moved TO it from another computer) while it uses ~70% on the raidz drives when moving data from ad22?

The only disk activity on the raidz is the move. This is how it looks constantly.
 
tobiastheviking said:
Can it really be true that it only uses 15% on ad22 (which is also getting data moved TO it from another computer) while it uses ~70% on the raidz drives when moving data from ad22?

The only disk activity on the raidz is the move. This is how it looks constantly.
If I'm understanding you correctly, the answer is yes, it could very well be. If it's taking 15% to do just one side of the operation on UFS, then it's not necessarily unreasonable for the other side to be taking up ~70%.

Assuming you're moving data from ad22 to the raidz (or vice versa), it's not that unreasonable at all. You're not just doing the work of copying the files; the machine also has to write parity across the other disks in the array and compute the relevant checksums.
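A rough back-of-the-envelope check using the gstat numbers above (my arithmetic, so treat it as approximate):
Code:
raidz writes:  1894 + 1975 + 2015 + 1975 kBps  ~= 7.9 MB/s total
ad22 reads:    ~5.1 MB/s; with raidz1 parity on 4 disks (~4/3 overhead),
               5.1 * 4/3 ~= 6.8 MB/s of expected raidz writes, plus metadata
I/O sizes:     ad22 does ~40 reads/s of ~128kB each, while each raidz disk
               does 176-270 writes/s of only ~7-11kB each, which is why the
               raidz members show a much higher %busy.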
 
Some info about my case (if anyone is interested)

Hardware:
- Memory: 3065 MB
- Controller: Intel ICH7
- HDDs: 4x WDC WD10EARS-00Y5B1 (Western Digital Green 1TB)

Software
- FreeBSD 8
- ZFS RaidZ with the 4 mentioned HDDs
- 3 file systems on the pool

To find out the speed of degradation, I rebooted the server and copied an 8GB file from my old Linux drive to the ZFS pool several times.
Write speed was measured with "zpool iostat 1"

1. copy:
- throughput start: 25MB/s
- throughput end: 20MB/s
2. copy:
- throughput start: 20MB/s
- throughput end: 5-10MB/s (alternating between 5 and 10MB/s)
3. copy:
- throughput start: 5-10MB/s
- throughput end: 5MB/s
4. copy:
- throughput start: 5MB/s
- throughput end: 5MB/s

# sysctl -a | grep zfs
The command was run after the 4th copy.
Code:
vfs.zfs.arc_meta_limit: 168369920
vfs.zfs.arc_meta_used: 15162272
vfs.zfs.mdcomp_disable: 0
vfs.zfs.arc_min: 84184960
vfs.zfs.arc_max: 673479680
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.recover: 0
vfs.zfs.txg.synctime: 5
vfs.zfs.txg.timeout: 30
vfs.zfs.scrub_limit: 10
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 35
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.version.zpl: 3
vfs.zfs.version.vdev_boot: 1
vfs.zfs.version.spa: 13
vfs.zfs.version.dmu_backup_stream: 1
vfs.zfs.version.dmu_backup_header: 2
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
kstat.zfs.misc.arcstats.hits: 183630
kstat.zfs.misc.arcstats.misses: 38211
kstat.zfs.misc.arcstats.demand_data_hits: 114956
kstat.zfs.misc.arcstats.demand_data_misses: 13889
kstat.zfs.misc.arcstats.demand_metadata_hits: 68674
kstat.zfs.misc.arcstats.demand_metadata_misses: 24322
kstat.zfs.misc.arcstats.prefetch_data_hits: 0
kstat.zfs.misc.arcstats.prefetch_data_misses: 0
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 0
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 0
kstat.zfs.misc.arcstats.mru_hits: 67371
kstat.zfs.misc.arcstats.mru_ghost_hits: 19058
kstat.zfs.misc.arcstats.mfu_hits: 116259
kstat.zfs.misc.arcstats.mfu_ghost_hits: 14086
kstat.zfs.misc.arcstats.deleted: 268507
kstat.zfs.misc.arcstats.recycle_miss: 56554
kstat.zfs.misc.arcstats.mutex_miss: 38
kstat.zfs.misc.arcstats.evict_skip: 35618
kstat.zfs.misc.arcstats.hash_elements: 2645
kstat.zfs.misc.arcstats.hash_elements_max: 10052
kstat.zfs.misc.arcstats.hash_collisions: 15960
kstat.zfs.misc.arcstats.hash_chains: 46
kstat.zfs.misc.arcstats.hash_chain_max: 3
kstat.zfs.misc.arcstats.p: 33528320
kstat.zfs.misc.arcstats.c: 84184960
kstat.zfs.misc.arcstats.c_min: 84184960
kstat.zfs.misc.arcstats.c_max: 673479680
kstat.zfs.misc.arcstats.size: 15296928
kstat.zfs.misc.arcstats.hdr_size: 551824
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 415384
kstat.zfs.misc.vdev_cache_stats.delegations: 7553
kstat.zfs.misc.vdev_cache_stats.hits: 35799
kstat.zfs.misc.vdev_cache_stats.misses: 13472
 
Default configuration for me as well. I created my zpool with zpool create tank raidz da{0,1,2,3,4,5,6,7}.

Maybe what we need to do is a test like tty23 did, except with the output from sysctl -a | grep zfs immediately after the system boots, and also after each copy test, to see how the parameters are changing. If we can narrow it down a bit more, it may mean something to someone.

Would this be useful data?
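Something like this, maybe? Purely a sketch; the paths, file names, and sizes are placeholders:
Code:
#!/bin/sh
# capture ZFS sysctls right after boot, then after each copy run
sysctl -a | grep zfs > /root/zfs-sysctls-boot.txt
for i in 1 2 3 4; do
    dd if=/dev/zero of=/tank/testfile$i bs=1m count=8192   # ~8GB per run
    sync
    sysctl -a | grep zfs > /root/zfs-sysctls-run$i.txt
done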
 
garrettmoore said:
Default configuration for me as well. I created my zpool with zpool create tank raidz da{0,1,2,3,4,5,6,7}.

Maybe what we need to do is a test like tty23 did, except with the output from sysctl -a | grep zfs immediately after the system boots, and also after each copy test, to see how the parameters are changing. If we can narrow it down a bit more, it may mean something to someone.

Would this be useful data?

I have 2 zpools on my system and I have no issues participating in these tests. As long as the commands used are all laid out, I'll post my results and we can compare.

My results will be skewed as I have enabled compression globally on all of my pools (using lzjb compression), but the idea of performance degradation due to usage should still apply.
 
If it helps - I can't see the same results with a 100MB file on raidz (4 virtual disks) under VMware.
 
Here are my results:
AMD X2 2GHz, 4GB RAM, single WD hard disk with ZFS on ad0s2, on an empty pool, running 8.0-RELEASE.

Code:
[matty@fb ~]$ cat /boot/loader.conf 
zfs_load="YES"
vm.kmem_size="4G"
kern.maxusers=2048 
vfs.zfs.txg.timeout="5"

Code:
tank  type                  filesystem             -
tank  creation              Tue Jan  5 14:07 2010  -
tank  used                  2.93G                  -
tank  available             100G                   -
tank  referenced            2.93G                  -
tank  compressratio         1.00x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           off                    default
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  jailed                off                    default
tank  snapdir               hidden                 default
tank  aclmode               groupmask              default
tank  aclinherit            restricted             default
tank  canmount              on                     default
tank  shareiscsi            off                    default
tank  xattr                 off                    temporary
tank  copies                1                      default
tank  version               3                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0                      -
tank  usedbydataset         2.93G                  -
tank  usedbychildren        55.5K                  -
tank  usedbyrefreservation  0                      -

I did
Code:
dd if=/dev/urandom of=./file1 bs=1m count=1000
6 times, incrementing the number in of=./fileX each time. Between runs I waited until the disk had written everything from its cache/RAM.
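In other words, roughly this (shown here as a loop; I actually ran the command by hand each time):
Code:
for i in 1 2 3 4 5 6; do
    dd if=/dev/urandom of=./file$i bs=1m count=1000
    sync   # wait for the data to actually hit the disk before the next run
done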

Here are the results:

Code:
1048576000 bytes transferred in 16.884601 secs (62102504 bytes/sec)
1048576000 bytes transferred in 16.782244 secs (62481274 bytes/sec)
1048576000 bytes transferred in 17.148042 secs (61148439 bytes/sec)
1048576000 bytes transferred in 16.877827 secs (62127429 bytes/sec)
1048576000 bytes transferred in 16.804360 secs (62399044 bytes/sec)
1048576000 bytes transferred in 17.143705 secs (61163908 bytes/sec)

And one with a 10GB dump
Code:
10485760000 bytes transferred in 200.109637 secs (52400075 bytes/sec)
 
My results posted above were on an empty pool.

After filling the pool to 90%, I also see some degradation, from 61MB/s down to 35-40MB/s.
 