Question about zfs raidz and geli

Hi

I have been using geli+UFS on my 7.1 amd64 backup server with no problems. However, I then bought 3x 1 TB drives and created a raidz on top of the geli devices, like:
Code:
# zpool status
  pool: array1
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        array1        ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            ad12.eli  ONLINE       0     0     0
            ad10.eli  ONLINE       0     0     0
            ad14.eli  ONLINE       0     0     0

errors: No known data errors
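
For completeness, the pool was created roughly like this (from memory, so treat the exact geli flags as a sketch; a larger geli sector size, e.g. -s 4096, is often suggested to cut per-sector crypto overhead):
Code:
# geli init /dev/ad10
# geli attach /dev/ad10
(same init/attach for ad12 and ad14, then)
# zpool create array1 raidz ad10.eli ad12.eli ad14.eli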

After some tuning it was stable, but REALLY SLOW. The machine has the following specs:
CPU: Intel Core 2 Duo E6600
RAM: 6 GB 667 MHz DDR2
Mobo: Asus P5N32-SLI SE Deluxe

The HDDs are 1 TB Seagate SATA2 drives (got them cheap because they had the firmware bug, which I have since flashed away).

The HDDs are connected to the onboard SATA controller, and a PCI Intel NIC is used.

Transferring a large file (2 GB) over SFTP, I get an average speed of ~5 MB/s with CPU usage going from 50 to 100%. For the first 200 MB or so I max out the switch at 11.6 MB/s (CPU at 50%); then the CPU gets pegged at 100% and the transfer stops for a while, copies some smaller part of the file, and stops again. This repeats until the file is done copying, or until the transfer speed falls to about 4-5 MB/s, where it stays fairly stable.

Doing the same thing but writing to UFS+geli instead of ZFS+geli, the average speed is stable at 11.6 MB/s (maxing out the 100 Mbit switch) with around 50% CPU usage.

I thought this was a problem with ZFS v6, so I installed 8.0-RELEASE (still amd64) and ran
Code:
# zpool upgrade -a
but the same problem persists. I have tried tuning kernel memory etc., but this only affects when the slowdown will occur; even with
Code:
vm.kmem_size_max=5000M 
vfs.zfs.arc_max=4096M
vfs.zfs.vdev.cache.size=10M
vs
Code:
vm.kmem_size_max=1536M
vfs.zfs.arc_max=128M
vfs.zfs.vdev.cache.size=5M

the difference is minor.
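
(These are loader tunables, so they went into /boot/loader.conf with a reboot between tests; e.g. the smaller set looked like this:)
Code:
# /boot/loader.conf
vm.kmem_size_max="1536M"
vfs.zfs.arc_max="128M"
vfs.zfs.vdev.cache.size="5M"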

I have also tried setting vfs.zfs.mdcomp_disable=1 and turning kern.maxvnodes up to 800000, but no dice.

Setting vfs.zfs.zil_disable=1 appeared to make things worse.

Am I wrong in assuming I should get at least 11-12 MB/s from this system running geli+ZFS?
Does anyone have any ideas where I should look? Or has anyone maybe even had a similar issue and solved it?
 
I’m pretty sure ZFS+geli is _not_ the bottleneck here. To test ‘real’ I/O performance (and not network performance), use
Code:
dd if=/dev/zero of=/some/place/on/your/raidz/test bs=512 count=10000
or something similar.
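Keep in mind that bs=512 mostly measures syscall and per-sector crypto overhead; for a plain sequential-throughput figure, a larger block size is more representative, e.g.:
Code:
dd if=/dev/zero of=/some/place/on/your/raidz/test bs=1m count=1000
(that writes ~1 GB in 1 MB blocks; /dev/zero compresses perfectly, so make sure compression is off on the test dataset)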
However, your CPU usage seems troubling. Neither 50 nor 100% is anywhere near acceptable for a simple sftp/scp copy operation. Which processes are using so much CPU? How is memory usage?
 
Output of dd. As you can see, it starts off OK with smallish files and then really slows down:

Code:
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=10000
10000+0 records in
10000+0 records out
5120000 bytes transferred in 0.136940 secs (37388637 bytes/sec)
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 1.256747 secs (40740101 bytes/sec)
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes transferred in 69.689158 secs (7346910 bytes/sec)

Output of top while the ~500 MB dd was running:

Code:
21 processes:  2 running, 19 sleeping
CPU:  0.4% user,  0.0% nice, 99.1% system,  0.4% interrupt,  0.2% idle
Mem: 10M Active, 13M Inact, 258M Wired, 36K Cache, 623M Buf, 5650M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1821 root        1 111    0  4776K  1024K RUN     0   1:19 66.55% dd
 1804 user        1  44    0  8304K  2336K CPU0    0   0:04  0.59% top
 1012 user        1  44    0 37040K  5488K select  0   2:28  0.00% sshd
 1031 root        1  44    0 10276K  2944K pause   0   0:01  0.00% csh
  697 root        1  44    0  5992K  1536K select  0   0:00  0.00% syslogd
 1010 root        1  44    0 37040K  5128K sbwait  0   0:00  0.00% sshd
  937 root        1  44    0  6920K  1612K nanslp  0   0:00  0.00% cron
 1802 user        1  46    0 10276K  2808K pause   1   0:00  0.00% csh
 1016 user        1  45    0 10276K  2784K pause   1   0:00  0.00% csh
 1030 user        1  45    0 20636K  2016K wait    0   0:00  0.00% su
  118 root        1  76    0  2736K  1072K pause   1   0:00  0.00% adjkerntz
  990 root        1  76    0  5860K  1288K ttyin   0   0:00  0.00% getty
  993 root        1  76    0  5860K  1288K ttyin   1   0:00  0.00% getty
  989 root        1  76    0  5860K  1288K ttyin   1   0:00  0.00% getty
 1009 root        1  44    0  5860K  1288K ttyin   1   0:00  0.00% getty
  992 root        1  76    0  5860K  1288K ttyin   0   0:00  0.00% getty
  994 root        1  76    0  5860K  1288K ttyin   0   0:00  0.00% getty
  995 root        1  76    0  5860K  1288K ttyin   0   0:00  0.00% getty

At the start of each copy the CPU load stays at about 50% and then goes up to 100%, the same way a copy over SSH does.
 
Hmm, I'm running without any "optimization" on a slower CPU, also with .eli devices and raidz2, and I don't have this issue. It's not fast, but OK (~20-25 MByte/sec).

Did you try without any optimization settings too?
 
Neco said:
You need this patch to get acceptable performance with ZFS+GELI.

I'm running a mirrored pool and a raidz pool with geli-encrypted devices and get about 25-30 MB/sec write in "real world" tests I did. My system is set up with default settings, pretty much.

I believe the patch above reduces the priority that the geli threads run at, so "userland" applications aren't affected as much by the heavy load ZFS and geli can put on the system when it commits a batch of writes. However, it didn't seem to have any noticeable effect for me. My main issue with my setup is still that my userland processes take a heavy performance hit during writes.

See the discussion here: http://lists.freebsd.org/pipermail/freebsd-geom/2009-December/003812.html

You can use "top" then "shift+s" to see the load the geli threads are putting on your system.
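Or start it that way directly:
Code:
# top -S
(-S shows the system/kernel processes, same as pressing shift+s inside top; the g_eli threads appear there)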
 
phatfish said:
[...] However, it didn't seem to have any noticeable effect for me. My main issue with my setup is still that my userland processes take a heavy performance hit during writes. [...]

Without that patch my system (both i386 and amd64) grinds to a halt, with the load average way over 200, and then becomes unresponsive, basically forcing a reboot.
And yeah, with top -S it was geli:w that was doing it.
 
Neco said:
You need this patch to get acceptable performance with ZFS+GELI.

I found your post about this patch on some other forum and have tried patching and compiling a new kernel; however, I did not notice any improvement in system performance. I will try building GENERIC with the patched geli module and see what happens (my current kernel has some smallish tweaks that should not have anything to do with any filesystem or crypto stuff, but I might as well try).
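
For reference, the rebuild procedure I'm using is the standard one, roughly (the patch path is just wherever I saved the posted patch):
Code:
# cd /usr/src
# patch < /path/to/geli-priority.patch
# make buildkernel KERNCONF=GENERIC
# make installkernel KERNCONF=GENERIC
# shutdown -r now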


fgordon said:
Did you try without any optimization settings too?

I don't remember now, but I think I did, since I read somewhere that unlike 7, most tuning in 8.0 should be automagic (or at least have better defaults). Will try that again though, just to make sure.
 
Neco said:
Without that patch my system (both i386 and amd64) grinds to a halt, with the load average way over 200, and then becomes unresponsive, basically forcing a reboot.
And yeah, with top -S it was geli:w that was doing it.

The top -S comment was meant for CrazyEmperor really, so he could see where his load was. Sorry.

I certainly don't get that behaviour; my system runs fine under load and over long usage periods. It's more an annoyance for me that my userland processes get nailed because of the high system load. I'm not sure if it's the ZFS+geli combination, or whether the same thing would happen with ZFS alone. Whichever it is, I would like a way to stop it.

I will add a faster (3 GHz) CPU soon and see if there is any effect. I have the feeling it will just speed up my transfer rates and still cause issues with my userland processes, as data gets written as fast as the CPU will allow.
 
I tried compiling a GENERIC kernel with the patched geli, and it looks like there was some improvement; however, performance still drops a lot when doing larger writes.

Output of dd:

Code:
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 1.162083 secs (44058818 bytes/sec)
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes transferred in 20.464974 secs (25018356 bytes/sec)
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=10000000
10000000+0 records in
10000000+0 records out
5120000000 bytes transferred in 356.679843 secs (14354610 bytes/sec)


Output of top -S while the large dd is running:

Code:
last pid:  1296;  load averages:  6.67,  2.66,  1.71    up 0+01:18:30  17:25:33
134 processes: 14 running, 102 sleeping, 18 waiting
CPU:  0.0% user,  0.0% nice, 99.8% system,  0.2% interrupt,  0.0% idle
Mem: 13M Active, 7220K Inact, 1892M Wired, 4K Cache, 12M Buf, 4014M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1087 root        1  48    -     0K    16K RUN     0   3:36 44.58% g_eli[0] ad1
 1088 root        1 105    -     0K    16K RUN     1   4:37 40.87% g_eli[1] ad1
 1294 root        1  76    0  4776K  1024K RUN     1   0:33 21.00% dd
   11 root        2 171 ki31     0K    32K RUN     0 108:31 19.14% idle
 1090 root        1  63    -     0K    16K RUN     0   4:01 11.87% g_eli[0] ad1
 1085 root        1  98    -     0K    16K RUN     1   4:54 10.69% g_eli[1] ad1
    3 root        1  -8    -     0K    16K CPU1    0   4:40  9.28% g_up
   19 root        1  56    -     0K    16K RUN     0   1:04  7.47% syncer
    0 root       42  -8    0     0K   656K -       1   4:22  4.39% kernel
 1084 root        1  76    -     0K    16K RUN     0   4:10  3.17% g_eli[0] ad1
 1091 root        1  46    -     0K    16K RUN     1   3:30  2.69% g_eli[1] ad1
    4 root        1  -8    -     0K    16K RUN     0   1:14  0.98% g_down
   42 root        7  -8    -     0K   108K zio->i  1   0:04  0.10% zfskern
    2 root        1  -8    -     0K    16K -       1   0:49  0.00% g_event
   12 root       18 -60    -     0K   288K WAIT    0   0:28  0.00% intr
   13 root        1  44    -     0K    16K RUN     0   0:17  0.00% yarrow
   23 root        1  44    -     0K    16K geli:w  0   0:02  0.00% g_eli[0] ad0

My ZFS drives are ad10, ad12 and ad14; I'm assuming there are 2 geli threads per drive.
These cycle between geli:w and RUN. Load seems to be quite uneven between threads for some reason.
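If I read geli(8) right, the thread count comes from the kern.geom.eli.threads tunable (0 = one geli thread per CPU, which would match the two threads per provider I see on this dual-core):
Code:
# sysctl kern.geom.eli.threads
kern.geom.eli.threads: 0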

I also noticed something odd: when I tried setting primarycache and secondarycache to none, speed increased by ~15% on a large write.
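For the record, that was just:
Code:
# zfs set primarycache=none array1
# zfs set secondarycache=none array1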

Code:
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=10000000
10000000+0 records in
10000000+0 records out
5120000000 bytes transferred in 297.481420 secs (17211159 bytes/sec)

With a pure GENERIC kernel and no optimizations, speed is really slow, so it looks like patching geli did improve performance a bit:

Code:
# dd if=/dev/zero of=/usr/ftp/zfs/Os/testfile.test bs=512 count=10000000
10000000+0 records in
10000000+0 records out
5120000000 bytes transferred in 603.252060 secs (8487331 bytes/sec)
 
Hmm, strange. I ran your tests on my system, and above count=1000000 I always get constant transfer rates, no matter how high I go.

On my system all the "g_eli[0]" threads are higher than all the "g_eli[1]" threads while dd is running; I never saw a [0] below a [1]. 99.9% of the time it's perfectly sorted.


Code:
last pid:  1728;  load averages:  4.29,  3.06,  1.50    up 0+00:08:10  11:01:26
197 processes: 3 running, 176 sleeping, 18 waiting
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 60M Active, 60M Inact, 924M Wired, 2508K Cache, 312M Buf, 1890M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
   11 root          2 171 ki31     0K    32K RUN     0   7:40 137.55% idle
    0 root         36  -8    0     0K   560K -       0   0:47  3.86% kernel
   16 root          1  45    -     0K    16K syncer  0   0:26  1.07% syncer
 1340 root          1  -8    -     0K    16K geli:w  0   0:06  0.78% g_eli[0] a
 1323 root          1  -8    -     0K    16K geli:w  0   0:06  0.78% g_eli[0] a
 1351 root          1  -8    -     0K    16K geli:w  0   0:06  0.68% g_eli[0] a
 1357 root          1  -8    -     0K    16K geli:w  0   0:07  0.59% g_eli[0] a
 1334 root          1  -8    -     0K    16K geli:w  0   0:06  0.59% g_eli[0] a
 1329 root          1  -8    -     0K    16K geli:w  0   0:06  0.59% g_eli[0] a
 1318 root          1  -8    -     0K    16K geli:w  0   0:06  0.59% g_eli[0] a
 1312 root          1  -8    -     0K    16K geli:w  0   0:06  0.59% g_eli[0] a
 1346 root          1  -8    -     0K    16K geli:w  0   0:06  0.59% g_eli[0] a
 1362 root          1  -8    -     0K    16K geli:w  0   0:07  0.49% g_eli[0] a
 1368 root          1  -8    -     0K    16K geli:w  0   0:07  0.49% g_eli[0] a
 1373 root          1  -8    -     0K    16K geli:w  0   0:06  0.49% g_eli[0] a
 1374 root          1  -8    -     0K    16K geli:w  1   0:06  0.10% g_eli[1] a
    3 root          1  -8    -     0K    16K -       0   0:19  0.00% g_up
   12 root         18 -60    -     0K   288K WAIT    0   0:10  0.00% intr
    4 root          1  -8    -     0K    16K -       1   0:07  0.00% g_down
 1347 root          1  -8    -     0K    16K geli:w  1   0:07  0.00% g_eli[1] a
 1313 root          1  -8    -     0K    16K geli:w  1   0:07  0.00% g_eli[1] a
 1319 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1324 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1335 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1341 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1330 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1352 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1358 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1363 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
 1369 root          1  -8    -     0K    16K geli:w  1   0:06  0.00% g_eli[1] a
  593 www          19  44    0   940M 66268K ucond   1   0:06  0.00% java
 1378 root         16  -8    -     0K   252K tx->tx  1   0:02  0.00% zfskern
 1112 haldaemon     1  44    0 43100K  6828K select  1   0:00  0.00% hald
   13 root          1  44    -     0K    16K -       1   0:00  0.00% yarrow
 1479 root          1  44    0  9328K  2752K CPU0    1   0:00  0.00% top
 
Ah, dd had just finished in the top -S output above; sorry, I could not edit my post.

This is top -S while dd is running. Strange how [0] and [1] are ordered in my top -S and not in yours? Could that be why my transfer rates are perfectly stable?

Above count=1000000 my transfer rates are very stable at ~20 MByte/sec (12x 2 TByte drives on cheap SiI 3114 PCI controllers); no matter how high I set count=, the results only differ by +/- 0.5 MByte/sec.

Code:
last pid:  2622;  load averages:  4.81,  2.64,  1.82    up 0+00:24:27  11:17:43
201 processes: 5 running, 178 sleeping, 18 waiting
CPU:  0.6% user,  0.0% nice, 79.2% system,  2.2% interrupt, 18.0% idle
Mem: 61M Active, 60M Inact, 902M Wired, 2508K Cache, 312M Buf, 1911M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 2484 root          1 106    0  4776K  1008K RUN     1   1:28 47.07% dd
   11 root          2 171 ki31     0K    32K RUN     0  32:37 40.48% idle
    0 root         36  -8    0     0K   560K -       0   1:35  8.50% kernel
   16 root          1  47    -     0K    16K syncer  0   0:50  7.18% syncer
    3 root          1  -8    -     0K    16K RUN     0   0:37  2.98% g_up
 1334 root          1  -8    -     0K    16K geli:w  0   0:12  2.29% g_eli[0] a
 1357 root          1  -8    -     0K    16K geli:w  0   0:13  2.20% g_eli[0] a
 1318 root          1  -8    -     0K    16K geli:w  0   0:12  2.20% g_eli[0] a
 1373 root          1  -8    -     0K    16K geli:w  0   0:13  2.10% g_eli[0] a
 1368 root          1  -8    -     0K    16K geli:w  0   0:13  1.95% g_eli[0] a
 1362 root          1  -8    -     0K    16K geli:w  0   0:13  1.86% g_eli[0] a
 1351 root          1  -8    -     0K    16K geli:w  0   0:12  1.86% g_eli[0] a
 1312 root          1  -8    -     0K    16K geli:w  0   0:12  1.86% g_eli[0] a
 1340 root          1  -8    -     0K    16K geli:w  0   0:12  1.86% g_eli[0] a
 1323 root          1  -8    -     0K    16K geli:w  0   0:12  1.86% g_eli[0] a
 1329 root          1  -8    -     0K    16K geli:w  0   0:12  1.66% g_eli[0] a
 1346 root          1  -8    -     0K    16K geli:w  0   0:11  1.46% g_eli[0] a
 1369 root          1  -8    -     0K    16K geli:w  1   0:12  0.59% g_eli[1] a
 1374 root          1  -8    -     0K    16K geli:w  1   0:12  0.49% g_eli[1] a
 1358 root          1  -8    -     0K    16K geli:w  1   0:12  0.49% g_eli[1] a
 1363 root          1  -8    -     0K    16K geli:w  1   0:12  0.49% g_eli[1] a
 1341 root          1  -8    -     0K    16K geli:w  1   0:13  0.29% g_eli[1] a
   12 root         18 -60    -     0K   288K WAIT    0   0:20  0.20% intr
 1347 root          1  -8    -     0K    16K geli:w  1   0:14  0.20% g_eli[1] a
 1324 root          1  -8    -     0K    16K geli:w  1   0:13  0.20% g_eli[1] a
 1313 root          1  -8    -     0K    16K geli:w  1   0:13  0.20% g_eli[1] a
 1319 root          1  -8    -     0K    16K geli:w  1   0:13  0.20% g_eli[1] a
 1330 root          1  -8    -     0K    16K geli:w  1   0:13  0.20% g_eli[1] a
 1335 root          1  -8    -     0K    16K geli:w  1   0:13  0.10% g_eli[1] a
    4 root          1  -8    -     0K    16K -       1   0:13  0.10% g_down
 1352 root          1  -8    -     0K    16K geli:w  1   0:13  0.10% g_eli[1] a
  593 www          19  44    0   939M 66508K ucond   0   0:07  0.00% java
 1378 root         16  -8    -     0K   252K zio->i  1   0:04  0.00% zfskern
   13 root          1  44    -     0K    16K -       1   0:00  0.00% yarro
 
CrazyEmperor said:
Output of top -S while the large dd is running:
[top output snipped; see the post above]

My geli threads end up with a similar load to fgordon's, never above ~10%. So 40% seems odd, as does the fact that only one device has the higher load. I wonder if a hardware issue with that disk or that SATA channel could cause it.
 
@fgordon
Very strange indeed. After some playing with top, I found out that mine are ordered by drive, which does explain some slowness: since all drives have to do an equal amount of write operations, things will be slow if they are done "one drive at a time".

What CPU do you use? It looks like geli barely affects your system load, while for me it is the main load.

I'm starting to suspect some hardware issue myself; I'm thinking maybe some NCQ/TCQ problem, or that the crappy onboard SATA controller breaks something.
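First thing to check is probably what DMA mode and capabilities the ata(4) driver negotiated, e.g. with atacontrol(8):
Code:
# atacontrol mode ad10
# atacontrol cap ad10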

I will look into this tomorrow.
 
I'm using an AMD Socket 939 dual-core CPU (a 3800+), so not really a racing machine :D

The drives are 2 TByte Seagate drives (5400 rpm) on "cheap" SiI 3114 PCI controllers.

I ran the tests with a really, really high count= (resulting in a test file of about 90 GByte) to be sure, but the transfer rate is perfectly stable, and the "ordered" tasks were always the same whenever I took a look.

Really strange. I'm using 8.0-RELEASE amd64 with the default kernel and no optimization; the only additional tasks running are smb, tomcat, and a headless X.Org. Maybe (though really a big maybe) this is because my system HD is an SSD (EIDE) and not SATA? Just a real wild guess...
 