Questions about users' experience with ZFS on FreeBSD

I've been using ZFS on FreeBSD for a while as my main home file server. So far I've had NO problems whatsoever and I absolutely love it. It's been fast on cheap hardware that doesn't work on Solaris, and I've really enjoyed using it for jails and other FreeBSD-specific stuff.

I've been reading the Sun ZFS mailing list; I subscribed to it a while back when I was first looking into using ZFS. There are a few threads there where users have made various mistakes and lost all their data, and more where catastrophic failures have happened with no real user error. I know that ZFS is technically still considered experimental in FreeBSD and therefore comes with no guarantees, but I was wondering if these types of failures have been happening to folks on FreeBSD. I plan to build a backup server for my main file server when I can afford it, but that isn't an option right now. I do have a redundant setup, but so did some of the others.

My main question is: has anyone lost a pool due to a power outage?

Is there a problem with using consumer-grade hardware (desktop RAM/motherboard/processor) with ZFS on FreeBSD? I see a lot of threads saying that ZFS should have ECC memory, and I plan to use server-grade hardware on the next file server, but is using non-ECC memory going to put me at major risk from memory errors?

My system is built on FreeBSD 7.2: an Intel Q9550 processor, 8 GB of DDR2-800, twelve 1 TB hard drives, and two CompactFlash drives mirrored with gmirror for the boot device.
I configured ZFS with 3 raidz1 vdevs of 4 drives each, plus an 80 GB log device.
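For reference, creating a layout like that would look something along these lines (the device names below are just placeholders; mine are actually glabel labels):
Code:
# 3 raidz1 vdevs of 4 drives each, plus a separate log device
zpool create tank \
    raidz1 da0 da1 da2 da3 \
    raidz1 da4 da5 da6 da7 \
    raidz1 da8 da9 da10 da11 \
    log da12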

I've had a few power outages with no problems; there are a lot of thunderstorms where I live. I plan to invest in a UPS, but right now I just have a normal surge protector.

I know I should have asked all of these questions to begin with. I should be able to build a complete backup server for this in 2-4 months' time.

Thanks
 
I've been using ZFS on a sub-$500 consumer desktop without really any problems. I have yet to have any issues with ZFS itself, even though the computer will crash from time to time.

I've had more problems with my own actions, with the weirdness related to mixing MBR and GPT on the same disk, and with a short foray into ZFS version 13 on one half of my mirror.
 
I've had no problems either, but if you read the sun-zfs mailing list you'd understand my concern. I get the impression that on Solaris, ZFS is one of those things that "works great until it doesn't," if you know what I mean.


I've found it to be amazingly fast and wonderful; I love snapshots and clones. But that list makes me paranoid. I just hope it'll be OK until I can build a backup server; I have about 5 TB of videos and music on it so far. =)
 
I've been using ZFS (RAID-Z) for a few months on a development box without issues. It does take a lot of tuning and tweaking to get it running smoothly at first, but it runs like a dream afterward. I've gone as far as pulling the plug on the machine, pulling a drive out, and dd'ing random parts of a drive to corrupt it, and each time the file system has come back without issues. The one thing I do have a complaint about is the cheap HighPoint RocketRAID card in said development machine; it didn't want to cooperate, but that's because of their shitty card, not ZFS.

I'm currently building a new file server for media and offsite/at-home backups of a few servers; it will be running 16 drives in 4-drive vdevs, all in one pool.

So far I have about 8 TB of data on ZFS file systems, and overall I'm very satisfied with ZFS on FreeBSD thus far.
 
Voltar said:
I've been using ZFS (RAID-Z) for a few months on a development box without issues. It does take a lot of tuning and tweaking to get it running smoothly at first, but it runs like a dream afterward. I've gone as far as pulling the plug on the machine, pulling a drive out, and dd'ing random parts of a drive to corrupt it, and each time the file system has come back without issues. The one thing I do have a complaint about is the cheap HighPoint RocketRAID card in said development machine; it didn't want to cooperate, but that's because of their shitty card, not ZFS.

I'm currently building a new file server for media and offsite/at-home backups of a few servers; it will be running 16 drives in 4-drive vdevs, all in one pool.

So far I have about 8 TB of data on ZFS file systems, and overall I'm very satisfied with ZFS on FreeBSD thus far.

Cool, cool. Yeah, when I originally set mine up I did it wrong... sort of. It's very hard to find a good but cheap JBOD SATA RAID card; most of the multi-drive ones are designed for hardware RAID or only hold 4 drives. I ended up getting a second-hand HighPoint RocketRAID 2340. It's not the best card in the world, but even without any real dedicated hardware, with my system I get speeds up to 350 MB/s.

The problem is, when I set it up I hooked 12 new, unconfigured drives up to the card, and it didn't pass the empty drives through to FreeBSD, so I couldn't format them for what I'd consider a "normal" JBOD setup. The RAID controller has a JBOD setting, but it's not what I should have used. I made one "JBOD" per drive; then they showed up in FreeBSD and I was able to create the ZFS file system.

LATER I found out that if you format/partition a drive on a normal system and then hook it up to the RAID card, it shows up as "LEGACY", which is what I SHOULD have done. So what I need to do is fail one drive at a time, format it on another system with one large FreeBSD slice, and add it back. I've been kind of paranoid about trying this because right now everything is working perfectly.
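In case it helps anyone else, the drive-at-a-time swap I keep putting off would look roughly like this (a sketch only; the pool and label names are placeholders):
Code:
# rough outline only
zpool offline tank label/disk01
# (pull the drive, put one big FreeBSD slice on it in another box, then re-label the slice)
glabel label disk01 /dev/da0s1
zpool replace tank label/disk01
# wait for the resilver in "zpool status" to finish before touching the next drive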

Thanks to phoenix I was smart enough to at least use glabel labels when I set this system up. I guess I've been putting off the resilvering until I get the backup done. Anyway, glad to hear that it's working well for you; I have 8 more slots for hard drives to fill.

Currently, the 3 vdevs I have are each made up of four 1 TB drives.

If I added a new vdev of four 1.5 TB drives, would it make a difference, or should I only use 1 TB drives unless I replace the others?
 
On my desktop PC (25 GB RAM, 1x160 GB + 1x250 GB HDD) I have almost no problems.

I do have some small lags when writing fast to the HDD, but they are much smaller and less frequent since BETA1.

I also have this problem:
http://www.freebsd.org/cgi/query-pr.cgi?pr=137037
But even with that, I still use ZFS, and I have never lost even a single bit of my data.


I don't plan to switch back to the old GPT/UFS setup.
 
wonslung said:
Cool, cool. Yeah, when I originally set mine up I did it wrong... sort of. It's very hard to find a good but cheap JBOD SATA RAID card; most of the multi-drive ones are designed for hardware RAID or only hold 4 drives. I ended up getting a second-hand HighPoint RocketRAID 2340. It's not the best card in the world, but even without any real dedicated hardware, with my system I get speeds up to 350 MB/s.

The problem is, when I set it up I hooked 12 new, unconfigured drives up to the card, and it didn't pass the empty drives through to FreeBSD, so I couldn't format them for what I'd consider a "normal" JBOD setup. The RAID controller has a JBOD setting, but it's not what I should have used. I made one "JBOD" per drive; then they showed up in FreeBSD and I was able to create the ZFS file system.

LATER I found out that if you format/partition a drive on a normal system and then hook it up to the RAID card, it shows up as "LEGACY", which is what I SHOULD have done. So what I need to do is fail one drive at a time, format it on another system with one large FreeBSD slice, and add it back. I've been kind of paranoid about trying this because right now everything is working perfectly.

Son of a ... I just went through the same issue with my RocketRAID 2320. I bought the card a while ago for use as just a controller card, no RAID features used, because I didn't trust it and didn't want to be stuck with a dead card and no access to my data.

After painstaking tinkering, I came up with a solution that works so far. The RocketRAID card will force you to initialize a disk if it doesn't have a partition table, but then you're tied into using that controller. I also wanted to use the full disks, not slices, because I've read that ZFS enables certain features (write caching?) when using a disk instead of a slice. I don't know that for certain, but I figured I would just go with the entire disk for simplicity. Since that wasn't an option, and with HighPoint's customer support being completely lackluster and clueless, I took a few old drives and did some experimenting. You can pop a single drive into a system and create a partition on it using the entire disk. Then pop that disk into your HighPoint controller and let it boot once. Check to make sure the drive shows up (daX, most likely). Now, I don't know about all RocketRAID controllers, but the 2320 writes info to sector 9 of the hard drive when a disk is used in legacy mode.

Now that you have a drive that the controller sees in legacy mode, with a partition table on it, use dd to copy the MBR and partition table to a file (# dd if=/dev/daX of=~/rr_mbr bs=512 count=1). Then, whenever you want to add a drive to the RR controller, just hook the drive up to your system's SATA bus (or eSATA/USB, etc.), dd the saved MBR/partition table onto it, and when you hook it up to the RocketRAID controller it will already be in legacy mode.
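To spell that out (the device names here are just examples; double-check which daX is which before pointing dd at anything):
Code:
# save the MBR/partition table from a drive the controller already treats as legacy
dd if=/dev/da0 of=~/rr_mbr bs=512 count=1
# later, stamp that saved sector onto a new drive while it's on the motherboard's SATA ports
dd if=~/rr_mbr of=/dev/da1 bs=512 count=1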

After that, you can create your vdevs/zpool(s) using the full disks. In my tests with the above method I made a four-drive raidz vdev and tortured it, corrupted it, and resilvered it, and it held up time after time.

This leads me to believe that even though you give ZFS the entire disk, it doesn't touch the partition table. The plus side is that you can use disks instead of slices in your vdevs, and your drives can be swapped from controller to controller.

That's basically a rough write-up of what I did; I have an article going up soon that goes into a bit more detail about all of this.


Thanks to phoenix I was smart enough to at least use glabel labels when I set this system up

I neglected to do that on my original setup for my current storage server, so it's on my list of things to do. Not terribly important, but it'll be nice if I end up moving a bunch of drives around.


If I added a new vdev of four 1.5 TB drives, would it make a difference, or should I only use 1 TB drives unless I replace the others?

The size of the drives doesn't matter as long as the replication level is the same.
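So growing the pool would just be a matter of appending another raidz1 vdev, something like this (hypothetical pool and device names):
Code:
# add a fourth raidz1 vdev, this time of 1.5 TB drives, to the existing pool
zpool add tank raidz1 da12 da13 da14 da15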
 
Voltar said:
Son of a ... I just went through the same issue with my RocketRAID 2320. I bought the card a while ago for use as just a controller card, no RAID features used, because I didn't trust it and didn't want to be stuck with a dead card and no access to my data.


A lot of people knock HighPoint cards because they aren't "hardware RAID", but for me it's been amazing. I get great speeds; the card itself is PCIe x8 and quite fast. So what if it passes everything off to the CPU? I have four cores. =)

I also wanted to use the full disks, not slices, because I've read that ZFS enables certain features (write caching?) when using a disk instead of a slice.

Well, that's not true for FreeBSD. On Solaris it's SOMEWHAT true: by default, if you give ZFS a slice on Solaris, it disables write caching because Solaris has no way of splitting the cache (or something like that), and due to the way ZFS writes data in short bursts every 5-30 seconds, that can be REALLY BAD. But if the disk has only a single slice on Solaris, you can re-enable write caching without problems. I've read in these forums and elsewhere that this is NOT a problem on FreeBSD, so using slices is fine. I intend to make one large FreeBSD slice and not format it, just so the card will "give" it to FreeBSD as a da device. From everything I've read, the write cache still works on FreeBSD; maybe someone else would care to comment.
This leads me to believe that even though you give ZFS the entire disk, it doesn't touch the partition table. The plus side is that you can use disks instead of slices in your vdevs, and your drives can be swapped from controller to controller.
See, this is EXACTLY why I want it to be in legacy mode. But you're saying I can use the whole disk and HighPoint's controller will still see it? What about when you hook it up to another machine; how does THAT work? Doesn't that machine still think it's a HighPoint device and therefore not read it?

When I initialized mine, I made each drive a single JBOD. Maybe I didn't really understand what you were saying and I should read it again. =)

I neglected to do that on my original setup for my current storage server, so it's on my list of things to do. Not terribly important, but it'll be nice if I end up moving a bunch of drives around.
I did it because I was scared I might somehow end up with the device names switching around; after asking around, I learned that can be catastrophic for ZFS.
The size of the drives doesn't matter as long as the replication level is the same.
OK, another question then: my case holds 20 drives, and I have 12 right now. I'd like to keep 1 or 2 drives as hot spares. Would it be a problem to use smaller raidz vdevs?
Right now I have 3 groups of 4; would it be OK to add another group of 4 and one group of 3?
 
Oh, after re-reading what you did, it sounds like you're doing exactly what I plan on doing, just in a different way.

When you have it in legacy mode it HAS to have some kind of slice on it, even if it's the entire DISK as one slice.
That's exactly what I plan to do; I just need to do it one drive at a time, but so far I've been way too much of a puss to try.

When I bought the card I didn't know about "legacy mode", but that's EXACTLY what I want. Did you get it into this mode WITHOUT putting a slice on it? If so, I'm totally confused...
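For what it's worth, putting one big slice on a bare drive before it ever touches the RocketRAID should only take a couple of commands; this is just a sketch, and the label name is made up:
Code:
# write an MBR with a single FreeBSD slice covering the whole disk
fdisk -I /dev/da0
# optionally give the slice a stable name so device renumbering can't bite later
glabel label disk13 /dev/da0s1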
 
The only two major issues I've run into so far were due to user error:
  1. using 24 drives in a single raidz2 vdev
  2. using device nodes directly instead of labelled devices via glabel
The first issue gave us all kinds of performance problems, and killed the system when the first drive died.

The second issue bit us when a drive died: I pulled the drive, and then the server was rebooted. The 3Ware card renumbered all the drives, the device nodes were shifted down by one for every drive after the missing one, and the pool became corrupted. Thankfully, putting even a random drive back into the slot and booting allowed the 3Ware card to number the drives correctly, and the pool came back up in a degraded state.
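Labelling the drives up front avoids that renumbering mess entirely; a quick sketch (label, pool, and device names are just examples):
Code:
# give each drive a stable name before it goes into the pool
glabel label disk01 /dev/da0
glabel label disk02 /dev/da1
glabel label disk03 /dev/da2
glabel label disk04 /dev/da3
# then build the vdevs from the labels instead of the raw device nodes
zpool create storage raidz1 label/disk01 label/disk02 label/disk03 label/disk04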

Beyond that, we've gone through lots of system lockups during the initial tuning phase on FreeBSD 7.0-STABLE/7.1-RELEASE, lots of power failures, and a couple of drive replacements, and things are still running along nicely.

We now do a weekly "zpool scrub" via crontab to pick up any data corruption. In just over a month of doing that, we haven't found any issues.
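The crontab entry for that is nothing fancy; something along these lines would do (the schedule and pool name are examples):
Code:
# /etc/crontab -- scrub the pool every Sunday at 03:00
0  3  *  *  0  root  /sbin/zpool scrub storage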

So far, things are working quite nicely.
 
I'm pretty happy with my setup. It's just amazing to read the ZFS mailing list for Solaris. It's likely that more OpenSolaris users run it, especially for ZFS, or that OpenSolaris uses it by default, but there seem to be LOTS of issues there. I really love how well it works for jails. If any other users come along who have large pools, I'd be interested in hearing from you.
 
I am looking at building a FreeBSD mail server for 20-30 users. I am really keen on using ZFS, but I was wondering what the best FreeBSD version to use for this application is. I have 8 GB of RAM, so I assume I need an amd64 FreeBSD version to support that much RAM.

The hardware is as follows:

Intel Core 2 Quad Q8300 CPU
8 GB RAM
1 x 64 GB SSD boot drive
4 x 1.5 TB SATA drives for ZFS


I was looking at raidz1 but am open to suggestions.

- Ernie.
 
ernie said:
I am looking at building a FreeBSD mail server for 20-30 users. I am really keen on using ZFS, but I was wondering what the best FreeBSD version to use for this application is. I have 8 GB of RAM, so I assume I need an amd64 FreeBSD version to support that much RAM.

The hardware is as follows:

Intel Core 2 Quad Q8300 CPU
8 GB RAM
1 x 64 GB SSD boot drive
4 x 1.5 TB SATA drives for ZFS


I was looking at raidz1 but am open to suggestions.

- Ernie.

7.2 is fine, but one thing I'd suggest is this:
get 7.2-STABLE or 8.0, and instead of using the SSD as a boot drive, get a CompactFlash card as your boot drive and use the SSD as your ZIL (slog).

For a mail server, having the SSD as either a ZIL or an L2ARC device is going to be much better than having it as your boot drive.
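Once the pool exists, hanging the SSD off it as a log or cache device is a one-liner; a sketch, with made-up pool and device names:
Code:
zpool add tank log ada1      # dedicate the SSD to the ZIL (slog)
# -- or --
zpool add tank cache ada1    # use it as an L2ARC cache device instead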
 
ernie said:
I am looking at building a FreeBSD mail server for 20-30 users. I am really keen on using ZFS, but I was wondering what the best FreeBSD version to use for this application is. I have 8 GB of RAM, so I assume I need an amd64 FreeBSD version to support that much RAM.

I'd suggest either waiting for FreeBSD 8.0 to be released, or installing an 8.0-BETA (BETA3 should be released soon) in order to get the best / most current ZFS support (ZFSv13).

If you can't wait, installing FreeBSD 7.2-RELEASE, and then updating to 7-STABLE (tag=RELENG_7 in the cvsup supfile) will get you ZFSv13 support as well, but without the major kernel changes that 8.0 will have.

ZFSv13 has quite a few little extras compared to ZFSv6 (available in FreeBSD 7.0-7.2).

The hardware is as follows:

Intel Core 2 Quad Q8300 CPU
8 GB RAM
1 x 64 GB SSD boot drive
4 x 1.5 TB SATA drives for ZFS

I was looking at raidz1 but am open to suggestions.

For the best performance, you should consider creating two mirrored vdevs instead of a raidz vdev:
Code:
# zpool create storage mirror da0 da1 mirror da2 da3
That will create a pool named storage containing two mirror vdevs. The pool automatically stripes across the two mirrors, in effect creating a RAID10 array.

Writes should be much faster in this setup than with a 4-drive raidz1 or raidz2. You'll have 3.0 TB of disk space and still be able to lose up to 2 drives (as long as it's one from each mirror), giving you specs similar to a raidz2 setup.

If you absolutely need the disk space, then a 4-drive raidz1 would be doable. That would give you 4.5 TB of space, but then you could only lose 1 drive, and your disk I/O would be lower.

If you have space in the case, then I'd also recommend getting a CF-to-SATA adapter, and using a 4 GB CompactFlash disk for the OS install. Then you can use the SSD as either a separate intent log (ZIL/slog device), or as a cache device (L2ARC). Leave / and /usr on the CF disk, and create ZFS filesystems for /var, /usr/local, /usr/src, /usr/obj, /usr/ports, /home, and so on. And then maybe use tmpfs(4) for /tmp.
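Creating those filesystems is quick; a rough sketch, with the pool name and mountpoints as examples:
Code:
zfs create -o mountpoint=/var       storage/var
zfs create -o mountpoint=/usr/local storage/local
zfs create -o mountpoint=/usr/ports storage/ports
zfs create -o mountpoint=/home      storage/home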
 
I am "only just curious" as to how (GPT, gjournal, softupdates,)
might be (irrelevant to/useful for/able to be used with/not to
be used with) the setup mentioned above. (Just curious because
I could probably plan for it but maybe not find the time to
put it to use).
 
phoenix said:
I'd suggest either waiting for FreeBSD 8.0 to be released, or installing an 8.0-BETA (BETA3 should be released soon) in order to get the best / most current ZFS support (ZFSv13).

If you can't wait, installing FreeBSD 7.2-RELEASE, and then updating to 7-STABLE (tag=RELENG_7 in the cvsup supfile) will get you ZFSv13 support as well, but without the major kernel changes that 8.0 will have.

ZFSv13 has quite a few little extras compared to ZFSv6 (available in FreeBSD 7.0-7.2).
Can do. I have a few weeks to get this ready, so if 8.0-BETA3 is near I can run with that. I will run the system alongside the existing mail server in case 8.0-BETA3 has issues.


For the best performance, you should consider creating two mirrored vdevs instead of a raidz vdev:
Code:
# zpool create storage mirror da0 da1 mirror da2 da3
That will create a pool named storage containing two mirror vdevs. The pool automatically stripes across the two mirrors, in effect creating a RAID10 array.

Writes should be much faster in this setup than with a 4-drive raidz1 or raidz2. You'll have 3.0 TB of disk space and still be able to lose up to 2 drives (as long as it's one from each mirror), giving you specs similar to a raidz2 setup.

If you absolutely need the disk space, then a 4-drive raidz1 would be doable. That would give you 4.5 TB of space, but then you could only lose 1 drive, and your disk I/O would be lower.
Sounds good. I will try both raidz1 and what you suggest, and run iozone on them to see how they feel.

If you have space in the case, then I'd also recommend getting a CF-to-SATA adapter, and using a 4 GB CompactFlash disk for the OS install. Then you can use the SSD as either a separate intent log (ZIL/slog device), or as a cache device (L2ARC). Leave / and /usr on the CF disk, and create ZFS filesystems for /var, /usr/local, /usr/src, /usr/obj, /usr/ports, /home, and so on. And then maybe use tmpfs(4) for /tmp.

What are the characteristics of a ZIL/slog device that make an SSD suitable? Can the SSD be partitioned to hold both / and the ZIL/slog, or does that require the whole drive? I know little about the ZIL.

- Ernie.
 
jb_fvwm2 said:
I am "only just curious" as to how (GPT, gjournal, softupdates,)
might be (irrelevant to/useful for/able to be used with/not to
be used with) the setup mentioned above. (Just curious because
I could probably plan for it but maybe not find the time to
put it to use).

Why would you need GPT for a 4 GB flash disk? :) There's nothing wrong with using it, as it will (hopefully/probably) eventually replace the BIOS/DOS/PC (whatever the technical term is) partition table. It's only needed for the disks that are not being used by ZFS.

GJournal can be used, but since / and /usr are only updated during OS upgrades, there's not really much need for it. (Obviously, the ZFS filesystems don't need/use it.)

GMirror is a handy tool to use for the CF disks, to create a RAID1 array between two of them (that's what we do).
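Setting that up is only a couple of commands; a sketch, with placeholder device names for the two CF cards:
Code:
# mirror the two CompactFlash devices into one gm0 provider
gmirror label -b round-robin gm0 ad0 ad1
# make sure the mirror module loads at boot
echo 'geom_mirror_load="YES"' >> /boot/loader.conf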

Softupdates isn't really needed either, again, since the / and /usr filesystems should only be updated during OS upgrades.
 
Phoenix is correct. I used his method for my boot device and it works brilliantly. I also got to benefit hugely from his experience when it came to glabel and raidz vdevs (I used glabel for my ZFS drives and didn't make one giant raidz group, thanks to his helpful posts).
 
wonslung said:
I've been reading the Sun ZFS mailing list, ... There are a few threads there ... where catastrophic failures have happened with no real user error.

I haven't used ZFS with FreeBSD yet (waiting for 8.0 :), but at work we have a couple of 12 and 13 TB ZFS pools under Solaris 10 (Sparc). No problems so far...
 
trev said:
I haven't used ZFS with FreeBSD yet (waiting for 8.0 :), but at work we have a couple of 12 and 13 TB ZFS pools under Solaris 10 (Sparc). No problems so far...

I have been trying 8.0-BETA2, but it's too unstable; threads just keep stopping. ZFS itself worked fine, though.

I just tried 7.2-STABLE; much better. ZFS is version 13 in 7.2-STABLE as well, so it was happy importing the pool I made in 8.0-BETA2. That was after I did a buildworld to update all the command-line tools; just updating the kernel to 7.2-STABLE was not enough, as the old tools wouldn't talk cleanly to it.
 
Do any of you use a zvol as a swap device? I do, but swapping makes my system really slow, sometimes even locked up for minutes. Any suggestions?

Code:
$ zfs get all system/swap 
NAME         PROPERTY              VALUE                  SOURCE
system/swap  type                  volume                 -     
system/swap  creation              Mon Jun  1 13:26 2009  -     
system/swap  used                  3G                     -     
system/swap  available             14.4G                  -     
system/swap  referenced            654M                   -
system/swap  compressratio         1.00x                  -
system/swap  reservation           none                   default
system/swap  volsize               3G                     -
system/swap  volblocksize          8K                     -
system/swap  checksum              off                    local
system/swap  compression           off                    local
system/swap  readonly              off                    default
system/swap  shareiscsi            off                    default
system/swap  copies                1                      default
system/swap  refreservation        3G                     local
system/swap  primarycache          all                    default
system/swap  secondarycache        all                    default
system/swap  usedbysnapshots       0                      -
system/swap  usedbydataset         654M                   -
system/swap  usedbychildren        0                      -
system/swap  usedbyrefreservation  2.36G                  -
system/swap  org.freebsd:swap      on                     local

$ zpool status
  pool: system
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        system      ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

$ gpart show
=>       34  156301421  ada0  GPT  (75G)
         34        128     1  freebsd-boot  (64K)
        162  156301293     2  freebsd-zfs  (75G)

Perhaps acting on the primarycache and secondarycache properties could help?
 
I use a zvol as swap.

I added it like this:
Code:
zfs set org.freebsd:swap=on pool/swap
as mentioned in another thread, and I've had no problems.

But my pool is 3 raidz vdevs; I don't know if that makes a difference. It might, because it's likely a lot faster than a single drive.
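For reference, a swap zvol like the one shown in the output above can be set up along these lines; the size and pool name come from that output, and the cache-property tweak at the end is just something to experiment with:
Code:
zfs create -V 3G -o checksum=off -o compression=off system/swap
zfs set org.freebsd:swap=on system/swap   # picked up at boot, or enable it right away:
swapon /dev/zvol/system/swap
# if swapping still stalls the box, keeping swap pages out of the ARC may be worth a try
zfs set primarycache=none system/swap
zfs set secondarycache=none system/swap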
 
wonslung said:
I added it like this:
Code:
zfs set org.freebsd:swap=on pool/swap
as mentioned in another thread, and I've had no problems.

I have this set as well, as you can see.

But my pool is 3 raidz vdevs; I don't know if that makes a difference. It might, because it's likely a lot faster than a single drive.

Yes, that's probably the reason...
 
Anecdotal evidence: I've used Linuxes for about 10 years, Solaris 10 for a while, and for the last 2 years or so primarily BSD. The only catastrophic loss of data I have ever had was with ZFS on Solaris 10.


Not sure why. I still have the same disks but a new motherboard.
One disk has uncorrectable sectors (according to smartd).

Also, be careful not to rm zpool.cache.
 
bigearsbilly said:
Anecdotal evidence: I've used Linuxes for about 10 years, Solaris 10 for a while, and for the last 2 years or so primarily BSD. The only catastrophic loss of data I have ever had was with ZFS on Solaris 10.

That was years ago, if I understand correctly; things have changed, and ZFS was much younger then.
And it wouldn't be a problem for me anyway: all my files are stored on a server, this is just a laptop, and ZFS has too many nice features (like snapshots) for me to go back to UFS.

Also, be careful not to rm zpool.cache.

I know the system won't boot; are there any other risks?
 