Future of filesystems

Chris Mason left Oracle. Today he starts working at a funny company. I checked their website. Funny is the nicest word I could use.

[FLAME]

I would say that this is the end of BTRFS, however everything must have a beginning before having an end. After reading random emails from the BTRFS development list, anyone who has ever seen an IT project should know that BTRFS will never work.

[END OF FLAME]

ZFS is very nice, however we don't have the "block pointer rewrite" thing. If you ever filled up a ZFS pool more than 80% or added new devices to a large pool you should know that the lack of that feature hurts. Also Oracle is very good at destroying software.

HAMMER is DragonFly-only.

I wonder what the future will bring to us. I wish we had BPR.
 
cra1g321 said:
I'm guessing it's the block pointer rewrite he mentioned.

Funny. I need to read more carefully. I actually did a Google search on the acronym, and the first result was "Business process reengineering".
 
Mage said:
ZFS is very nice, however we don't have the "block pointer rewrite" thing. If you ever filled up a ZFS pool more than 80% or added new devices to a large pool you should know that the lack of that feature hurts.

The ZFS Feature Flags have been imported [1] to HEAD, making ZFS v28 into ZFS v5000 + features.

This means that ANY feature (including the "block pointer rewrite" one) can now be added to ZFS as a separate feature.

[1] http://freshbsd.org/commit/freebsd/r236884
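
As a hedged illustration of how this looks from the admin side (the pool name "tank" is made up), the feature flags show up in the upgrade listing and as per-pool properties:

# list the supported legacy versions and feature flags
zpool upgrade -v

# show each feature's state (disabled/enabled/active) on a pool
zpool get all tank | grep feature@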
 
Can't we put ZFS on gvinum RAIDs and extend the volumes at the GEOM level instead of the ZFS level?

But yes, I would like to get BPR, too.
 
Any good engineer / sysadmin / whatever makes use of the tools he or she has, with whatever features those tools have or lack. Today's kids seem to want to have everything fixed for them, not fixing anything themselves. I fail to see how that can be a good life.
 
Mage said:
Chris Mason left Oracle. Today he starts working at a funny company. I checked their website. Funny is the nicest word I could use.

I would say that this is the end of BTRFS, however everything must have a beginning before having an end. After reading random emails from the BTRFS development list, anyone who has ever seen an IT project should know that BTRFS will never work.

About half of the Btrfs development comes from RedHat, so not sure why you think "losing" one Oracle employee will begin the death spiral of a fs that's part of the mainline Linux kernel.

ZFS is very nice, however we don't have the "block pointer rewrite" thing. If you ever filled up a ZFS pool more than 80% or added new devices to a large pool you should know that the lack of that feature hurts. Also Oracle is very good at destroying software.

So long as you keep adding vdevs to a pool, or replacing drives in vdev with larger ones, then there's no problem. :)
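
For concreteness, a hedged sketch of those two growth paths (the pool name "tank" and the device names are made up):

# grow the pool by adding another vdev (here a mirror)
zpool add tank mirror da4 da5

# or grow an existing vdev by swapping each disk for a larger one;
# once every disk is replaced, autoexpand picks up the extra space
zpool set autoexpand=on tank
zpool replace tank da0 da6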

Not really sure what you're so afraid of. UFS isn't going anywhere, is getting new/improved features over time, and is great for small-ish filesystems. ZFS isn't going anywhere, is getting new/improved features over time, and is great for medium-to-huge storage setups. What more do we need? ;)
 
phoenix said:
Not really sure what you're so afraid of. UFS isn't going anywhere, is getting new/improved features over time, and is great for small-ish filesystems. ZFS isn't going anywhere, is getting new/improved features over time, and is great for medium-to-huge storage setups. What more do we need? ;)

A portable filesystem that could be used between the BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.
 
phoenix said:
About half of the Btrfs development comes from RedHat, so not sure why you think "losing" one Oracle employee will begin the death spiral of a fs that's part of the mainline Linux kernel.

It is not the beginning of a death spiral. It is a sign of something we have known for at least two years. A main developer said ages ago that BTRFS is broken by design. Design is something that doesn't change. BTRFS was never alive, so it can't die. Do you read the btrfs-devel list? Every thread is like "I cannot mount, please help", "Please recover my data", "One week ago I deleted some lines of the source but I put them back yesterday and wrote a 15-line-long comment for myself to avoid deleting those lines again."

phoenix said:
So long as you keep adding vdevs to a pool, or replacing drives in vdev with larger ones, then there's no problem. :)

If you fill your pool above 80%, it gets 3-10 times slower due to fragmentation. Please don't tell me it doesn't need defragmentation, because there are plenty of examples proving it does. And please don't tell me I shouldn't fill it above 80%, because the top is at 100% and sooner or later every drive in the world reaches that percentage.

If you change anything like checksum, compression, copies or dedup, you would need BPR to rewrite the existing data. If only we had it. The only fix is zfs send | zfs receive two times (to another pool and back). That means downtime.
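
For the record, the double hop looks roughly like this (pool, dataset and snapshot names are made up; a sketch, not a recipe):

# copy the dataset to a scratch pool
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive scratch/data

# destroy the original and send the copy back; the rewritten blocks
# pick up the pool's current checksum/compression/copies/dedup settings
zfs destroy -r tank/data
zfs send scratch/data@migrate | zfs receive tank/data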

As far as I remember, I read that one of the developers who was working on BPR said it will never be finished.

I am not bashing ZFS. It is my favourite filesystem. I would just like to see a bright future for it.
 
tingo said:
Today's kids seem to want to have everything fixed for them, not fixing anything themselves. I fail to see how that can be a good life.

I wonder how many people are on this Earth who could properly implement block pointer rewrite.
 
graudeejs said:
A portable filesystem that could be used between the BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.

I use a shared pool with FreeBSD and Ubuntu. Ubuntu runs in a VM under Windows. The only annoying thing is that I have to force the import from time to time after I change OS.
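
(The forced import is just the -f flag; "mypool" is a made-up name:)

# the pool still looks active under the other OS, so force the import
zpool import -f mypool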

However, I lost about 40GB of data when I imported a Gentoo MBR-partitioned SSD drive. The pool was created with ZFS on Linux. It had several vdevs, cache and log on the same MBR drive. When I imported it under FreeBSD for the first time, my data said goodbye.

This never happened again; however, my current shared pool uses raw disks and GPT.
 
Just an idea, so feel free to say if it is stupid.
To defragment only some files on Windows, I used a trick with dd: search a build directory for all libraries, then make a copy of each file under a new name, delete the old one, and rename the new one. This should also redistribute a file over a new set of vdevs, would it not?
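
A minimal sketch of that trick on ZFS (paths are made up; as noted below, it assumes no snapshots or dedup reference the file):

# rewrite one file so its blocks are reallocated across the current vdevs
cp /tank/build/libfoo.so /tank/build/libfoo.so.new
rm /tank/build/libfoo.so
mv /tank/build/libfoo.so.new /tank/build/libfoo.so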
 
graudeejs said:
A portable filesystem that could be used between the BSDs, GNU/Linux, and other Unixes. FAT doesn't count, because it doesn't preserve Unix file attributes.

We do.. UFS. Now we just need the rest to support it ;)

How come the Linux'ii do not support UFS? Surely it cannot be too hard to implement compared to NTFS, and the license can't be the issue.

The Linux folk spend time porting Dreamcast filesystems to the kernel, why not a useful one?
 
kpedersen said:
We do.. UFS. Now we just need the rest to support it ;)

How come the Linux'ii do not support UFS? Surely it cannot be too hard to implement compared to NTFS, and the license can't be the issue.

The Linux folk spend time porting Dreamcast filesystems to the kernel, why not a useful one?

No, we don't. Unless it's UFS1, and even then I doubt (feel free to correct me if I'm wrong) that it will work under OpenBSD.
 
Crivens said:
Just an idea, so feel free to say if it is stupid.
To defrag only some files on windows I used a trick with dd, searching a build directory for all libraries and then making a copy of the file with a new name, delete old, rename new. This should also re-distribute a file over a new set of vdevs, would it not?

The copy, move, rename, etc. methods don't really work as defragmentation on ZFS if you have, for example, dedup enabled or if the pool has snapshots.

As far as I know, zfs send | zfs receive does most of the things BPR should do. I am not sure it is a 100% perfect solution, I mean beyond the fact that it is offline and needs double the space.
 
Mage said:
The copy, move, rename, etc. methods don't really work as defragmentation on ZFS if you have, for example, dedup enabled or if the pool has snapshots.

Snapshots are a problem, sure, and dedup would screw up such things also.
Basically, this balance step should be part of a scrub operation. Maybe I should spend at least some minutes browsing the source for this.
 
Crivens said:
Basically, this balance step should be part of a scrub operation. Maybe I should spend at least some minutes browsing the source for this.

It should be there, but it isn't. I read an email written by one of the original ZFS developers, who said he had spent some time on BPR. However, it is too hard to implement and even harder to maintain when you add new features.

Also, Josef Bacik, another main developer of BTRFS, left RedHat some days ago. It seems that ZFS will be the only available FS with checksumming, dedup and pool management. I hope that will bring improvements soon.
 
Block pointer rewrite is absolutely needed for ZFS. I really like ZFS, but it really doesn't scale big very well. For example, if you have a 40x4TB disk raidz3 setup which is almost 100% full and lose one disk, you have to scan almost all 160TB of data to rebuild the new disk. It would take months...

In my opinion, the metadata should be stored in a separate database, and if one disk fails the FS should easily figure out what data is missing and quickly build a new disk to replace the faulted one.

Since I really love ZFS with its extraordinary features, I've started to research developing my own distributed filesystem. I've been thinking about the Cassandra database for storing all the metadata. I just wish I had more time so I could create something real and not just prototype a proof of concept.
 
@olav: Somewhere there is a paper/article concerning the best disk size for RAID systems. I failed to find it again after having read it prior to setting up my home server. It seems that one sweet spot for SATA is about 500GB per disk, so a resilver does not catch you with 'your pants down' for longer than necessary. A resilver of 4TB disks, as you have, would likely thrash the remaining disks for about 15 to 20 hours, during which the remaining disks (prolly from the same batch) are running flat out and are thus likely to fail in that time as well.

Also, it would be SOP to divide the disks into several vdevs, so only the rest of the affected vdev needs to be read while the resilver is running. This leaves the other vdevs idle, and hopefully your really important data, which is on "copies=3", is safe on the other vdevs until the resilver is done.
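
(For anyone unfamiliar with that knob, copies is an ordinary per-dataset property; the dataset name here is made up:)

# keep three copies of every block of the important data
zfs set copies=3 tank/important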
 
vermaden said:
@olav

Now that the ZFS 'Feature Flags' are merged, it's probably just a matter of time until such a 'feature' is added.

ZFS is being worked on all the time; for example, here are benchmarks of the LZ4 algorithm compared to others, mostly to LZJB:
http://thread.gmane.org/gmane.os.illumos.devel/8701/focus=8731

That's cool! Though adding a compression algorithm is a walk in the park compared to adding block pointer rewrite :)
I hope someone will add support for LZMA soon! Yeah, I know the compression speed is überslow, but it compresses data amazingly well. Decompression speed is usable, though.
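
(Assuming the LZ4 patch keeps the same property interface as the existing algorithms, switching a dataset over would presumably look like this; "tank" is a made-up name:)

# today: LZJB or gzip
zfs set compression=lzjb tank
# once LZ4 is merged, presumably:
zfs set compression=lz4 tank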
 
olav said:
Block pointer rewrite is absolutely needed for ZFS. I really like ZFS, but it really doesn't scale big very well. For example, if you have a 40x4TB disk raidz3 setup which is almost 100% full and lose one disk, you have to scan almost all 160TB of data to rebuild the new disk. It would take months...

That's why every single ZFS howto, best-practices guide, and tuning guide says to never use more than 10 disks in a single vdev, especially when using raidz.

The random write IOps of a raidz vdev are limited to the IOps of the slowest drive in it. In order to increase the IOps of the pool, you add more vdevs.
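
As a hedged sketch of what that advice means for the 40-disk example above (device names are made up): build the pool from several small raidz2 vdevs instead of one 40-disk raidz3, so a resilver only reads the affected vdev and random IOps scale with the number of vdevs.

# 40 disks as five 8-disk raidz2 vdevs instead of one 40-disk raidz3
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
    raidz2 da16 da17 da18 da19 da20 da21 da22 da23 \
    raidz2 da24 da25 da26 da27 da28 da29 da30 da31 \
    raidz2 da32 da33 da34 da35 da36 da37 da38 da39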
 
And that's my point: it would be completely unnecessary if the metadata were stored somewhere else and carried more information. That way it would be possible to resilver a new drive without scanning through all the data you have.

Because the metadata database would know exactly what data was stored on the defective drive.
As it is now, ZFS simply does not "scale out".
 