zfs, raidz and total number of vdevs?

Hello, I'm looking to upgrade my home NAS. My current one is about a year or two old; it's running Linux 2.6 and using mdadm software RAID. While researching the upgrade I've been drawn to ZFS, and now that I've been using ZFS on FreeBSD, that's the route I want to go.


My question is this: Is there really a problem with using more than 9 drives in a raidz setup? And/or is there a way to separate them into smaller groups but still have them all in the same pool?

I was looking at going with 12 1 TB hard drives in raidz: 11 + 1 spare. The case I have is upgradeable to 20 drives, leaving me room for 8 more, which I eventually planned to use. If I did smaller raidz setups like (9 + 1 spare) x 2, they would be in separate pools. Would it be possible to somehow POOL those into 1 large pool, or is that just a terrible idea?


Or can I run 19+1 or 18+2 without much trouble?
 
wonslung said:
My question is this: Is there really a problem with using more than 9 drives in a raidz setup? And/or is there a way to separate them into smaller groups but still have them all in the same pool?

Yes!! Most definitely there is!! See the post killasmurf86 linked to for the details on our backup servers, which use 24 drives in a single pool. :D

The way raidz works, you get the IOps (I/O operations per second) of a single drive for each raidz vdev. Also, when resilvering (rebuilding the array after replacing a drive) ZFS has to touch every drive in the raidz vdev. If there are more than 8 or 9, this process will thrash the drives and take several days to complete (if it ever does).
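
To put rough numbers on it (the per-drive IOPS figure here is an assumption for illustration, not a measurement):
Code:
# Assume each SATA drive can do ~100 random IOPS.
# 1 x 24-drive raidz2 vdev  -> 1 vdev  x ~100 IOPS = ~100 IOPS for the pool
# 4 x  6-drive raidz2 vdevs -> 4 vdevs x ~100 IOPS = ~400 IOPS for the pool
# Same 24 drives, roughly 4x the random I/O capacity.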

We made the mistake of building a storage server with a single 24-drive raidz2 vdev. It gave us over 10 TB of storage (400 GB drives), but was extremely slow for writes. Then a drive died. Spent over a week trying to get it to resilver. It was horrible.

Then I started reading some of the Sun blogs about ZFS, and came across all their info on IOps and recommendations for their 48-drive Thumper servers (multiple mirror and raidz vdevs in a single pool). The consensus is "Don't use more than 8-9 drives in any single raidz vdev".

I was looking at going with 12 1 TB hard drives in raidz: 11 + 1 spare. The case I have is upgradeable to 20 drives, leaving me room for 8 more, which I eventually planned to use. If I did smaller raidz setups like (9 + 1 spare) x 2, they would be in separate pools. Would it be possible to somehow POOL those into 1 large pool, or is that just a terrible idea?

You should only have (need) 1 pool per server. That's the point of pooled storage. You just keep adding vdevs into the pool as time goes by.

For 12 drives, I'd go with 2x 6-drive raidz2 vdevs. That will give you 8 TB of disk space (using 1 TB drives), and will allow you to lose up to 4 drives (2 from each vdev) before losing data.

Later, when you add the extra 8 drives, you can set them up as a 6-drive raidz2 vdev, and 2 spares (or, a spare and a cache/log device, if using ZFSv13 in FreeBSD 7-STABLE or 8-CURRENT).

The zpool commands would look something like:
Code:
# zpool create mypool raidz2 da0 da1 da2 da3 da4 da5
# zpool add mypool raidz2 da6 da7 da8 da9 da10 da11
# zpool add mypool raidz2 da12 da13 da14 da15 da16 da17
# zpool add mypool spare da18
# zpool add mypool log da19

(That last line might be cache instead of log, can't remember off-hand, as I haven't used ZFSv13 yet, and am going by memory of a blog post.)
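
Either way, once the pool is built you can sanity-check the layout with the usual commands (mypool is just the example name from above):
Code:
# zpool status mypool   # lists each raidz2 vdev, the spare, and the log device
# zpool list mypool     # overall size and free space for the pool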

Or can I run 19+1 or 18+2 without much trouble?

Definitely do NOT do that. :)
 
Oh wow, I didn't realize it worked like this.
I guess I'm still thinking about things in the traditional RAID way.

I was thinking that each group of drives had to be in its own pool.

You're saying I can make 3 raidz pools and then pool THOSE together into one big pool.

This is awesome.

So I'm going to have 12 drives at first... can I add new drives to the vdevs later if I decide to?

Let's say I do decide to make 2 groups of 5 drives, 1 spare, and 1 log device (I'm still kind of fuzzy on how/why I would do that; I'll have to look it up, but I'm just going by your example).

Later, if I wanted to add more drives, could I add one to each group, or would I have to add a completely new third group?
Code:
# zpool create mypool raidz da0 da1 da2 da3 da4
# zpool add mypool raidz da5 da6 da7 da8 da9 
# zpool add mypool spare da10
# zpool add mypool log da11

Later, if I wanted to add more drives, I'd pretty much only be able to add them as groups of 5?
 
wonslung said:
Oh wow, I didn't realize it worked like this.
I guess I'm still thinking about things in the traditional RAID way.

I was thinking that each group of drives had to be in its own pool.

You're saying I can make 3 raidz pools and then pool THOSE together into one big pool.

3 raidz vdevs, added into a single pool.

ZFS is organised like this:
Code:
(drive) (drive) (drive)   (drive) (drive) (drive)    (drive)         (drive)
   \       |       /         \       |       /          \               /
    -(raidz vdev)--           -(raidz vdev)--            -(mirror vdev)-
           \                         |                           /
            \                        |                          /
             ----------------------(pool)-----------------------
              /     /      |       |        |      |     \     \
             /      |      |       |        |      |      |     \
          (fs)     (fs)   (fs)   (fs)     (fs)    (fs)  (fs)    (fs)

The pool is comprised of vdevs (virtual devices). A vdev can be a single file, a single slice, a single drive, a mirrored set of drives, or a raidz set of drives. You can add as many vdevs to a pool as needed. (Obviously, using files or slices for vdevs is not recommended for production use, but it can be useful for testing and playing.)
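
For example, a throw-away pool built from files, just to experiment with the commands (the file names and sizes here are made up):
Code:
# truncate -s 128m /tmp/disk0 /tmp/disk1 /tmp/disk2
# zpool create testpool raidz /tmp/disk0 /tmp/disk1 /tmp/disk2
# zpool status testpool
# zpool destroy testpool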

Note: adding a non-redundant vdev to the pool can compromise the integrity of the pool, as losing the non-redundant vdev will cause data loss in the pool (possibly the loss of the entire pool).

Just to clarify some terminology. :)

So I'm going to have 12 drives at first... can I add new drives to the vdevs later if I decide to?

No. You cannot extend a raidz vdev (i.e. change a 6-drive raidz vdev into an 8-drive one).

However, you can replace the drives in a raidz vdev with larger drives to expand the total size of the vdev. You have to replace each drive individually. Once all the drives in the raidz vdev have been replaced, you export and re-import the pool, and all the extra space becomes available.
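
A hedged sketch of what that process looks like (device names are placeholders):
Code:
# For each disk in the raidz vdev, one at a time,
# waiting for the resilver to finish before doing the next:
# zpool replace mypool da3 da20   # da20 is the new, larger disk
# zpool status mypool             # watch for the resilver to complete
#
# After the last disk has been swapped:
# zpool export mypool
# zpool import mypool             # the extra space now shows up in zpool list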

And you can always add more raidz (or mirror) vdevs to a pool.

Let's say I do decide to make 2 groups of 5 drives, 1 spare, and 1 log device (I'm still kind of fuzzy on how/why I would do that; I'll have to look it up, but I'm just going by your example).

With later versions of ZFS, you can move the ZIL (ZFS Intent Log, the journal) to a separate drive, which spreads the I/O around a bit better and improves performance. It's not available in ZFSv6, which is what ships in FreeBSD 7.0-7.2. You can also add drives as cache devices to speed up certain operations (mainly reads), which is really useful if you have an SSD that can sit between the slow hard drives and the fast RAM. (Advanced topics, perhaps.)
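
If/when you do end up on a version with that support, adding them is one command each (da18/da19 are just placeholder devices):
Code:
# zpool add mypool log da18     # dedicated ZIL device; helps synchronous writes
# zpool add mypool cache da19   # L2ARC read cache; ideally an SSD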

Later, if I wanted to add more drives, could I add one to each group, or would I have to add a completely new third group?

See above. You have to add the drives as another vdev.

Later, if I wanted to add more drives, I'd pretty much only be able to add them as groups of 5?

No, the vdevs don't have to be symmetrical. You can create a pool with a mirrored vdev, a 5-drive raidz1 vdev, a 6-drive raidz2 vdev, a single-drive vdev, and so on. ZFS will then create, in essence, a RAID0 stripe across all the vdevs.
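
For example, something like this is perfectly legal (hypothetical devices; zpool will warn about the mismatched redundancy levels and want -f for some of these additions):
Code:
# zpool create mypool raidz da0 da1 da2 da3 da4        # 5-drive raidz1
# zpool add -f mypool raidz2 da5 da6 da7 da8 da9 da10  # 6-drive raidz2
# zpool add -f mypool mirror da11 da12                 # 2-drive mirror
# Writes are then striped across all three vdevs.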
 
Yah, one question though: is it SAFE to have a single drive as the log device, or should it be mirrored?

Also, how big does the log device need to be?

Thanks again.

And yah, I understand the idea of a vdev now... for some reason I was thinking it meant a drive or partition. Now I see it CAN be a drive, a partition, a raidz group, or a mirror... it's basically just anything that you use to make the pool, and the pool is pretty much a single drive or a RAID0 group of vdevs... is that right?

Also, originally I was planning on having the OS on its own mirrored pool of 2 smaller drives, but if I understand this correctly it's almost a waste of space to do that. I would get much more out of spending that money on extra big drives for the pool, and the added I/O of a new vdev, than having it separate.

Is this correct?

Also, can I have a single spare that belongs to the whole pool, or do I need a spare for each vdev?
 
wonslung said:
Yah, one question though: is it SAFE to have a single drive as the log device, or should it be mirrored?

It's fine on a single drive. If ZFS notices any issues with it, it switches back automatically to using the pool for the ZIL. There are a couple of blogs on the Sun site that cover it in more detail.

Also, how big does the log device need to be?

Not very big at all. I don't recall the specifics, but the ZIL is written out to disk fairly often, so the pending data is never that big.

And yah, I understand the idea of a vdev now... for some reason I was thinking it meant a drive or partition. Now I see it CAN be a drive, a partition, a raidz group, or a mirror... it's basically just anything that you use to make the pool

Correct.

...and the pool is pretty much a single drive or a RAID0 group of vdevs... is that right?

A pool is a collection of vdevs.

Also, originally I was planning on having the OS on its own mirrored pool of 2 smaller drives, but if I understand this correctly it's almost a waste of space to do that.

What we've been doing, and it's working very well, is to grab a couple of 2 or 4 GB CompactFlash cards, pop them into either a CF-to-IDE or CF-to-SATA adapter, and use those for / and /usr. Create a single large slice and partition on one, install to it, then create a gmirror(8) array from it and the other CF disk to get a software RAID 1.

Then create a storage pool out of all the real hard drives in the system. And finally, create ZFS filesystems for /var, /home, /usr/local, /usr/ports, /usr/src, /usr/obj, and /tmp.

That way, only the base OS is on the CF, and everything else is on ZFS.
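
A rough sketch of that layout, with assumed device and pool names (not meant as an exact recipe):
Code:
# Mirror the two CF disks for the base OS (ad4 holds the fresh install):
# gmirror label -v -b round-robin gm0 ad4
# gmirror insert gm0 ad6
#
# Then hang everything else off the big ZFS pool:
# zfs create -o mountpoint=/var       mypool/var
# zfs create -o mountpoint=/home      mypool/home
# zfs create -o mountpoint=/usr/local mypool/local
# zfs create -o mountpoint=/usr/ports mypool/ports
# zfs create -o mountpoint=/usr/src   mypool/src
# zfs create -o mountpoint=/usr/obj   mypool/obj
# zfs create -o mountpoint=/tmp       mypool/tmp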

Alternatively, you can use USB flash drives, although I had a bad USB stick in one server that corrupted a bunch of files in the gmirror array, requiring several days of careful surgery to get things working with new drives.

Also, can I have a single spare that belongs to the whole pool, or do I need a spare for each vdev?

Spares are assigned to the pool, and are available for use by any vdev that needs it. So, yes, 1 spare for the whole pool would be doable.
 
Thanks. So if I understand correctly, a small, fast device would be better for the log device... something like an SSD?

So for the CompactFlash, just to be clear:
do I need 2 or 4?

2 for / and 2 for /usr?

Also, can you add the log device later, or does it have to be added when you make the original pool?

What about using a USB thumb drive as a log device? Would that be a bad idea? From what I gather it just needs to be about half the size of your RAM.
 
wonslung said:
Thanks. So if I understand correctly, a small, fast device would be better for the log device... something like an SSD?

Correct.

So for the CompactFlash, just to be clear:
do I need 2 or 4?

FreeBSD will install into less than 1 GB. 2 GB is plenty. I believe the 4 GB and larger CF disks support DMA, though. We're currently using 2 GB disks in one server, and 4 GB disks in the other. Use whatever is least expensive. :)

Also, can you add the log device later, or does it have to be added when you make the original pool?

Can be added at any time.
 
Ah, it seems that in certain versions of ZFS, the failure of a separate log vdev is treated like the failure of a "root" vdev (whatever that is), and can render a pool unusable. Hence the note to create mirrored log vdevs.

There's a fix available that is incorporated into later versions of ZFS that allows a system to continue on automatically if a single, separate log vdev becomes unusable.

There's no (easy) indication of which versions of ZFS this applies to, though.
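
So if you're on one of the affected versions and still want a separate log device, a mirrored log vdev is cheap insurance (devices assumed):
Code:
# zpool add mypool log mirror da18 da19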
 
That's kinda scary =)

Do you find booting from CF to be as fast as or faster than traditional hard drives? I'm mainly using this as a media server, so everything ZFS brings to the table is very appealing. I'm also wondering about the compression settings, and which one I should use.
 
Booting off CF using a SATA adapter is very speedy. The longest part of the boot, on these servers, is the POST, RAID controller initialisation, and 10-second count-down at the loader prompt. :) The rest scrolls by too fast for me to read anything.

Booting off CF using an IDE adapter is still speedy, but I can read bits and pieces of the kernel/bootup messages.

Booting off a USB stick is slow, although these are 2 GB consumer (i.e. < $10) sticks.

Except for the USB boot, it's as fast (if not faster) than booting off a normal SATA drive.

We use gzip-9 compression for our backups filesystem. The CPU usage (monitored via SNMP every 2 minutes) never goes above 20% (or 5% per CPU as there are 4) during the backup runs.

We also use lzjb compression on /usr/ports and /usr/src. Additional CPU usage is barely noticeable during cvsup, portsnap, or buildworld.
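
For reference, compression is just a per-filesystem property, so it's along these lines (the filesystem names are assumed):
Code:
# zfs set compression=gzip-9 mypool/backups
# zfs set compression=lzjb   mypool/ports
# zfs set compression=lzjb   mypool/src
# zfs get compressratio mypool/backups   # see how much it's actually saving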

So long as you have over 1 GHz of CPU and 2 GB of RAM, using ZFS doesn't really add any load to the system.
 
I'm upgrading my E7300 dual core to a Q9550 quad core (Intel),
going from 4 GB DDR2-800 to 8 GB DDR2-800.
I decided to order 2 CF-to-SATA2 adapters and 2 233x 8 GB CompactFlash cards for the boot device (thanks for that), and
ordered the smallest SATA drive I could find for the log device; I might upgrade it to an SSD when I can afford it.

I already have 6 1 TB SATA drives in mdadm Linux RAID which I'll be reusing, ordered 6 more SATA drives, and found a great 4U 20-bay hot-swap case on Newegg.

Which brings me to my next question. To migrate the data to the new server, what I'm going to have to do is copy it to single 1 TB drives, then put the server together starting with just 1 raidz vdev, then copy the data from the single drives to the pool, then add the second vdev to the pool. In traditional RAID it would take a long time; I understand ZFS will let me do it right away, but all the data will be only on the first vdev.

Is there a way to force it to split the data across both vdevs after I add the second group of drives?
 
No, the vdevs don't have to be symmetrical. You can create a pool with a mirrored vdev, a 5-drive raidz1 vdev, a 6-drive raidz2 vdev, a single-drive vdev, and so on. ZFS will then create, in essence, a RAID0 stripe across all the vdevs.

I've read on a couple of wikis that you have to use like vdevs. Did this change in 7.2?


From the ZFS Wikipedia entry:
# You cannot mix vdev types in a zpool. For example, if you had a striped ZFS pool consisting of disks on a SAN, you cannot add the local-disks as a mirrored vdev.
http://en.wikipedia.org/wiki/ZFS
 
wonslung said:
I've read on a couple of wikis that you have to use like vdevs. Did this change in 7.2?

From the ZFS Wikipedia entry:
# You cannot mix vdev types in a zpool. For example, if you had a striped ZFS pool consisting of disks on a SAN, you cannot add the local-disks as a mirrored vdev.
http://en.wikipedia.org/wiki/ZFS

I'm pretty sure that's talking about mixing storage technologies in vdevs (iSCSI vs. local), and not about mixing vdev "types" (mirror, raidz1, raidz2). I'll have to look up all the blogs about configuring the Thumper storage servers (48-drive behemoths), as I recall they all recommended using a mirrored vdev (OS), a mirrored log/cache vdev, and raidz2 vdevs for bulk storage.

Maybe I'll test this out tomorrow, as we have a spare 16-drive server sitting on the work bench.
 
I was just curious =)
Thanks for all of your wonderful help, though.

Another thing I meant to ask you, but forgot:

For swap space, do you use a separate partition/drive, or do you use a zvol on top of ZFS?

I was thinking zvol swap would be better, if it works properly.
 
wonslung said:
Which brings me to my next question. To migrate the data to the new server, what I'm going to have to do is copy it to single 1 TB drives, then put the server together starting with just 1 raidz vdev, then copy the data from the single drives to the pool, then add the second vdev to the pool. In traditional RAID it would take a long time; I understand ZFS will let me do it right away, but all the data will be only on the first vdev.

Is there a way to force it to split the data across both vdevs after I add the second group of drives?

Not AFAIK. But any new writes will be done primarily to the new vdev, as the pool tries to balance the storage usage across the vdevs in the pool.

I don't think there's any way to check how much storage space is used by any one vdev. If there was, you could probably do some manual copying of data between directories, then deleting directories, and checking the storage usage.
 
Well, when you do zpool list it shows how much is on each group.
I DO know that much.

But what I meant was this:

I know that when I copy all the data over, it's going to be ONLY on the first raidz vdev, because the second one WON'T exist yet.
Then I'm going to add the second vdev.
Is there a command to FORCE it to spread the data between the two, or does it just do it slowly on its own later?

edit:

Going back over this thread, I can't help but feel how exciting this is. I've fallen in love with having a home media server. I've been using XFS on Linux software RAID5, but I've run into some limitations that ZFS really just blows out of the water. My system will be able to hold 20 drives, and I'm going to have 12 drives to start with. It would almost be better for me to do smaller vdevs, (4 drives in raidz) x 3 instead of (6 drives in raidz) x 2, wouldn't it?
 
No, when you do a zpool list, it shows you the stats for the entire pool, not for the individual vdevs in the pool.

If you could get the stats for the individual vdevs, then you could do copy/delete tricks to spread the load around and watch to make sure it's actually doing it. Without knowing how much is on each vdev, though, you can't really do this.

ZFS will favour writing to newer/emptier vdevs, and (over time) will balance the writes across all the vdevs. There's no way, that I know of yet, to force it to rebalance the data across all the vdevs to spread the load onto new vdevs. This probably wouldn't work, anyway, due to the way snapshots work.
 
I got the command wrong.


Here:
Code:
We can see where the data is currently written in our pool using zpool iostat -v:

zpool iostat -v trout
                                 capacity     operations    bandwidth
pool                           used  avail   read  write   read  write
----------------------------  -----  -----  -----  -----  -----  -----
trout                         64.5M   181M      0      0  13.7K    278
  mirror                      64.5M  58.5M      0      0  19.4K    394
    /home/ocean/disk2             -      -      0      0  20.6K  15.4K
    /home/ocean/disk1             -      -      0      0      0  20.4K
  mirror                          0   123M      0      0      0      0
    /home/ocean/disk3             -      -      0      0      0    768
    /home/ocean/disk4             -      -      0      0      0    768
----------------------------  -----  -----  -----  -----  -----  -----

Taken from this page:
http://flux.org.uk/howto/solaris/zfs_tutorial_01
 
phoenix said:
This probably wouldn't work, anyway, due to the way snapshots work.

Yeah, I didn't think of that... OK, well, I guess ZFS is smart enough to work it out on its own. My issue is that I don't have enough drives to just build all the vdevs and then copy the data over; I'll have to build 1 or 2 of the vdevs (depending on which way I go), copy the data, then build the last one.

I still haven't decided if I want to go with 2 raidz vdevs of 6 drives each or 3 raidz vdevs of 4 drives each... I lose 1 TB the second way, but it would be much faster, right?

I'm also still curious about using a zvol for swap space. It would seem to me that it would be faster than a single drive or partition for swap because of all the added I/O, but of course I don't know this as well as you probably do.
 
wonslung said:
I got the command wrong: zpool iostat -v

Ah, cool. Didn't realise that was available. Thanks.

I've always used gstat(8) to monitor disk throughput in real-time, although I have used iostat a couple of times. Guess I never paid attention to the output from -v. :)
 
wonslung said:
I still haven't decided if I want to go with 2 raidz vdevs of 6 drives each or 3 raidz vdevs of 4 drives each... I lose 1 TB the second way, but it would be much faster, right?

Yes, going with 3 vdevs would be faster than 2. In theory, it should be 50% faster. :)

I'm also still curious about using a zvol for swap space. It would seem to me that it would be faster than a single drive or partition for swap because of all the added I/O, but of course I don't know this as well as you probably do.

It's kind of a catch-22 to use swap-on-zvol. ZFS needs to allocate a bit of memory to track various things when accessing stuff on a zvol. If the OS is short on RAM, it will write stuff out to swap ... generating more memory requests, and the cycle continues. In theory, things should work. In practice, some Solaris and FreeBSD users have run into kernel panics and out-of-memory situations due to using swap-on-zvol.

I've been using ZFS since August of last year and have not run into this issue on any of the 3 servers I run it on. [knock wood] All three use swap-on-zvol.
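
For what it's worth, swap-on-zvol on FreeBSD looks something like this (the size and names are placeholders):
Code:
# zfs create -V 4G mypool/swap
# swapon /dev/zvol/mypool/swap
# swapinfo                       # confirm the zvol is listed as a swap device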
 
Well, I will have 8 GB of RAM, so I'm guessing swap on a zvol will be OK, as I doubt I'll ever really need swap.

So far, this is what I'll have on Monday.

I already have a system I'm going to upgrade. It's a regular desktop motherboard, socket 775, with an E7400 Intel Core 2 Duo and 6 1 TB hard drives. I've ordered a new 4U case which holds 20 hot-swap drives (my old case is 4U as well, but it only holds 6 hot-swap drives, so to expand I really need the new one).
I also ordered an Intel Q9550 quad core and 4 more GB of DDR2-800 for a total of 8 GB,
and 6 more 1 TB hard drives, so I'll have a total of 12.
I also ordered 2 8 GB CompactFlash cards and the SATA-to-CF adapters, and an extra 80 GB hard drive.

My original plan didn't include the CF cards or the 80 GB drive, but after listening to your ideas/suggestions I decided to go ahead and go that route. I was thinking I'd use the small hard drive as a log device, and later I plan on buying one of those super-fast SSD drives to use as a cache device. I guess now I've decided to go with raidz vdevs made up of 4 disks each. My next order in a month or two will probably be for the SSD and at least one more 1 TB drive for a hot spare. I'm just hoping I'll be OK with no hot spare until then.


I'm just curious, but why did you go with UFS instead of ZFS for root?

I was thinking how cool it would be to have ZFS snapshots for when you decide to upgrade stuff.

I guess you could maintain a backup on ZFS as well; is that what you do?
 