Making sense of ZFS

So the thing is, I've been using ZFS for a while but I'm not really sure how it works. In my mind it looks something like a cake, and I've made a picture.. just in case :D

zfsexplained.png


I hope the image explains at least something (correctly) :D

I was also wondering how adding and removing vdevs works. When I add a vdev to a pool, does the new vdev stay empty or does it equalize with the rest of the pool? And if I remove a vdev, what happens to the data?

And what is the difference between a scrub and a resilver? (Also, what's the time difference?)

And also, can someone give me some numbers on raidz with 6 vs. 8 disks? Like transfer speed and resilvering time. I've heard that somewhere around 6-8 disks in a raidz the resilver time goes WAY up.

The thing is, I have a file server right now with 6x 1TB in RAIDZ and it has used about 80% of the space. So I was wondering how things would go if I start to add more space. I was thinking about adding 6 or 8 new 2TB disks in a RAIDZ to my current pool. So I was hoping someone here could come and recommend stuff, showering me in their wisdom.. (maybe that last part was a little too much..)


A little off topic, but I'm getting these messages.. and I'm thinking it's a sign of re0 dying?
Code:
re0: watchdog timeout
re0: reset never completed!
re0: PHY write failed

Note: I added the picture as an attachment.. just in case; online image hosting seems to be somewhat unreliable sometimes..
Note 2: why would you set resolution limits on attachments?? Now I have to zip it, and that should be unnecessary. :(
 

Attachments

  • zfs_explained.zip
    51.1 KB
When you add new vdevs, ZFS will start using them. It dynamically stripes across all available vdevs. Old data doesn't get re-written; new data is allocated based on a formula, and vdevs with more free space get higher priority.
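For example (a minimal sketch; the pool name "tank" and the device names are just placeholders), adding another raidz vdev is a single command:
Code:
# add a second raidz vdev to the existing pool; ZFS starts striping new writes across both
zpool add tank raidz da6 da7 da8 da9 da10 da11
# confirm the new vdev shows up next to the old one
zpool status tank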
 
Bobbla said:
When I add a vdev to a pool, does the new vdev stay empty or does it equalize with the rest of the pool?

It starts empty. ZFS will write to all vdevs in the pool; however, it will give slightly more precedence (write more) to emptier vdevs. Eventually, after writing enough new data and deleting old snapshots/data, the vdevs will reach a state of equilibrium.
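If you want to watch that happen, the per-vdev allocation is visible with (pool name is just an example):
Code:
# show space allocated/free for each vdev, not just the pool as a whole
zpool iostat -v tank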

And if I remove a vdev, what happens to the data?

You cannot remove data vdevs from a pool. You can only remove "spare", "cache", and "log" vdevs.
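For the auxiliary vdev types it looks something like this (device name is made up); there is no equivalent for a raidz or mirror data vdev:
Code:
# cache (L2ARC) and log (ZIL) devices can be added and later removed
zpool add tank cache da12
zpool remove tank da12
# a raidz/mirror/plain data vdev cannot be removed once it has been added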

And what is the difference between a scrub and a resilver? (Also, what's the time difference?)

A scrub reads every block of data from all the vdevs in the pool, computes the checksum for that block, then compares the computed checksum with the checksum stored on disk. If the checksums are the same, the data on disk is OK. If the checksums don't match, the data on disk is considered "bad", and a correct copy is rewritten using the pool's redundancy. If there is no redundancy in the pool (1 disk, or non-redundant vdevs), then ZFS can't fix the data on disk and just tells you that it's corrupted. Think of it like "fsck", only it checks and corrects data as well as metadata.
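Starting one and checking the result looks something like this (pool name is an example):
Code:
# read and verify every block in the pool
zpool scrub tank
# see progress, plus any checksum errors found and repaired
zpool status -v tank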

Resilver is the process of rebuilding a data disk in a vdev using the redundant data stored on the other disks in the vdev. Think of it like the rebuild process of a hardware RAID controller when you replace a failed disk.

A resilver will be much faster than a scrub. A resilver only touches data in one vdev, and is only limited by the write speed of the disk it is rebuilding. A scrub touches every single byte of data in the pool.
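A resilver is normally kicked off by replacing a disk, e.g. (device names are hypothetical):
Code:
# swap a failing disk for a new one; ZFS resilvers the vdev onto the replacement
zpool replace tank da3 da9
# watch the resilver progress
zpool status tank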

And also, can someone give me some numbers on raidz with 6 vs. 8 disks?

Search the zfs-discuss mailing list archives. This question comes up many, many, many times every month.

The thing is, I have a file server right now with 6x 1TB in RAIDZ and it has used about 80% of the space. So I was wondering how things would go if I start to add more space. I was thinking about adding 6 or 8 new 2TB disks in a RAIDZ to my current pool.

Ideally, you would add new vdevs to a pool when the existing vdev(s) reach 50% full, thus always keeping the pool under ~70% usage, and allowing the pool to stripe data across all vdevs. If you wait until all the vdevs are 90% full, then add new vdevs, new writes will go predominantly to the new vdev, thus limiting the speed of the pool to the speed of that one vdev. You really want to keep writes going to all vdevs in the pool. To do that, you need free space on all vdevs.
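A quick way to keep an eye on that (pool name is just an example):
Code:
# overall size, used space and capacity; plan to add vdevs well before CAP gets high
zpool list tank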
 
phoenix said:
Ideally, you would add new vdevs to a pool when the existing vdev(s) reach 50% full, thus always keeping the pool under ~70% usage, and allowing the pool to stripe data across all vdevs. If you wait until all the vdevs are 90% full, then add new vdevs, new writes will go predominantly to the new vdev, thus limiting the speed of the pool to the speed of that one vdev. You really want to keep writes going to all vdevs in the pool. To do that, you need free space on all vdevs.

Is there a way to equalize it in a straightforward way? Or do I have to cut stuff out and paste it back again?
 
The only way to do it, right now, is to move data off the pool, then move it back into the pool, so that it gets written out to all the vdevs.

Eventually, once the "block pointer (bp) rewrite" feature is complete, we'll be able to balance the data in a pool across all vdevs after either adding a new vdev, or increasing the size of a vdev.

Until then, though, all you can do is add new vdevs before the existing ones are full; or move data off and move data back to force it to be re-written to all vdevs. :(
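The "move off and back" part can be done with a snapshot plus zfs send/receive, assuming you have somewhere outside the pool to park the data (pool and dataset names below are made up):
Code:
# snapshot, send the data out of the pool, then bring it back so it gets re-striped
zfs snapshot tank/data@move
zfs send tank/data@move | zfs receive backuppool/data
zfs destroy -r tank/data
zfs send backuppool/data@move | zfs receive tank/data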
 
Another way to balance vdev load is to create a new filesystem within the same pool and move the data there -- the writes are then balanced across all vdevs.

Indeed, there is very little utility in adding new vdevs for the purpose of spreading data when the pool is already full -- you will have to move data around many times so that it first frees space on the 'full' vdevs. Only on new writes will ZFS start using up that 'freed' space again.
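In practice that is just something like this (dataset names are made up; rsync is from ports, plain cp works too):
Code:
# create a sibling filesystem and copy the data; the copy is written across all vdevs
zfs create tank/data2
rsync -a /tank/data/ /tank/data2/
# if the old copy is its own filesystem, drop it and rename the new one into place
zfs destroy -r tank/data
zfs rename tank/data2 tank/data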

By the way, you got the cake a bit wrong. Think of ZFS as a large restaurant, possibly empty at the beginning. You may have tables (drives, vdevs) that are grouped together. But unless you group together a sufficiently large number of tables, there will be no place to accommodate a single large group of people (filesystem). Instead, they will have to be spread all over separate small groups (filesystems).

If you insist on the cake -- then the cake is a single zpool, with the layers being the vdevs.
 
danbi said:
Another way to balance vdev load is to create a new filesystem within the same pool and move the data there -- the writes are then balanced across all vdevs.

I don't suppose there are any programs that will read data from the pool/"old vdevs" into RAM and then write it back, to bring balance to the force. :e

danbi said:
If you insist on the cake -- then the cake is a single zpool, with the layers being the vdevs.

I think that was kind of what I drew, but it is a little messy and I think I'm confusing some terms a little, etc. etc... so here is a new version ;D

zfsexplainedv2.png


Note: hope it's more correct :D
Note2: ssssssh, there is always cake.... cake... cake :O
Note3: there was something more, but I forgot it during the cake process.. :/
 
This isn't exactly right. On a real ZFS filesystem, transactions don't happen the way you have them pictured (in layers for each filesystem).


You can't use the cake imagery to really explain ZFS because at the lower levels, it doesn't look anything like that.
 
If the horse won't go, keep beating it till it does.

wonslung said:
You can't use the cake imagery to really explain ZFS because at the lower levels, it doesn't look anything like that.

So you're saying the analogy is a horse, but at some point you're going to try to make that horse get up and walk, and since it's an analogy and not a horse, it won't be able to pull your sulky?
 
Hm, I've never actually used ZFS, so I have a question about a storage pool just like the one in your picture: can you actually tell ZFS which data must be on the mirror, which on the 3-disk mirror, which on the raidz, etc., within one big pool?

Also, I've read there are several memory and performance problems on systems with both UFS2 and ZFS present. Is that true? How hard would it be to migrate ~3 TB of data from UFS2 to ZFS?

Also, in your opinion, guys: which is the better way to use ZFS, GEOM-based software RAID with ZFS on top of it, or everything via the ZFS tools?
 
fronclynne said:
So you're saying the analogy is a horse, but at some point you're going to try to make that horse get up and walk, and since it's an analogy and not a horse, it won't be able to pull your sulky?

No, that isn't what I'm saying at all.

I'm saying that trying to mentally picture ZFS in raidz using his cake imagery will leave you confused or give you the wrong ideas.
 
wonslung said:
This isn't exactly right. On a real ZFS filesystem, transactions don't happen the way you have them pictured (in layers for each filesystem).

Well, how does it work? I can't say that reading your post made me any wiser.. and I still think that my cake metaphor is acceptably right. Bottom line, I can't see where I'm wrong, so please explain...?

nekoexmachina said:
Hm, I've never actually used ZFS, so I have a question about a storage pool just like the one in your picture: can you actually tell ZFS which data must be on the mirror, which on the 3-disk mirror, which on the raidz, etc., within one big pool?

The way I understand it, no. You can't decide where the data goes, as wonslung and phoenix pointed out earlier in this thread.. (ZFS decides where stuff goes..)

wonslung said:
When you add new vdevs, ZFS will start using them. It dynamically stripes across all available vdevs. Old data doesn't get re-written; new data is allocated based on a formula, and vdevs with more free space get higher priority.

phoenix said:
It starts empty. ZFS will write to all vdevs in the pool; however, it will give slightly more precedence (write more) to emptier vdevs. Eventually, after writing enough new data and deleting old snapshots/data, the vdevs will reach a state of equilibrium.

and I don't see where in my picture it says or implies that you can tell what goes where..?

nekoexmachina said:
Also, I've read there are several memory and performance problems on systems with both UFS2 and ZFS present. Is that true? How hard would it be to migrate ~3 TB of data from UFS2 to ZFS?

Also, in your opinion, guys: which is the better way to use ZFS, GEOM-based software RAID with ZFS on top of it, or everything via the ZFS tools?

Performance and memory: I suppose one could say that you get what you wish for, but it will cost you. I've used ZFS in raidz for a while, and I have not had any performance problems.. However, when I transfer 100+ GB of data there is a possibility of a crash, and a scrub will affect performance, but I only scrub while asleep or away. XP/SMB is a bottleneck, Win7/SMB2 is not, and usually your local disk will be slower anyway.

However, if you use compression and prefetching you will most likely get problems if you don't have the hardware for it.. And in my opinion you should use ZFS with its integrated features rather than layering it on top of GEOM RAID.. The way I understand it, RAID5 gives you disk redundancy, while raidz (with ZFS checksums) also protects the data itself against silent corruption.. if you understand what I mean..
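For what it's worth, compression is just a per-filesystem property, and you can check what it actually saves you (dataset name is only an example):
Code:
# turn on compression for a dataset; only newly written data gets compressed
zfs set compression=on tank/data
# check how much it is actually saving
zfs get compressratio tank/data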


note: good one fronclynne, made me laugh. :e
 