What is the best way to create a 1-2PB storage solution with ZFS for 5000+ users

  • Optimal performance is not a requirement; it just needs to perform "OK".
  • Each user should have their own ZFS filesystem (see the sketch just after this list). This is because some prefer hourly snapshots and want to keep snapshots for a long time, while others prefer daily snapshots kept for 30 days.
  • Users should access the "same" server; we don't want to manage tens of servers with LDAP and other services.
  • It should just work, with as little downtime as possible.
  • It should be possible to scale even further; there might be as many as 10,000+ users in a few years.
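
To make the per-user filesystem point concrete, this is roughly what I mean (pool and user names are just examples; the actual scheduling would be done by cron or a snapshot tool):

  # One ZFS filesystem per user under a common parent
  zfs create tank/home
  zfs create tank/home/alice
  zfs create tank/home/bob

  # Different snapshot/retention policies per user, driven from cron
  zfs snapshot tank/home/alice@hourly-201201011200   # kept long-term
  zfs snapshot tank/home/bob@daily-20120101          # pruned after 30 days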

Our first idea was to create many NFS filesystems and mount them all on one server. After reading this, though, we figured out that it might be a bad idea, even with amd.

Our second idea is using iSCSI: creating a ZFS pool on top of zvol-backed iSCSI targets. In theory this should be great, but I know that iSCSI on FreeBSD is still quite new and there are some known issues with stability and memory leaks. Is there anyone with positive experience running FreeBSD and iSCSI just fine in a production environment?
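
Roughly what I have in mind, in command form (names and sizes are made up, and I've left out the iSCSI target/initiator configuration itself, since that is exactly the part I'm unsure about on FreeBSD):

  # On each storage node: carve a zvol out of the local pool to export as an iSCSI LUN
  zfs create -V 10T storage/lun0

  # On the head node: once the iSCSI disks show up as local devices, build a pool from them
  zpool create bigpool raidz2 da1 da2 da3 da4 da5 da6
  zfs create bigpool/home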

I also have some questions about how ZFS on top of ZFS behaves when it comes to data consistency.

Is it enough to scrub the zvols on the storage nodes? Do I also have to run a scrub on the initiator? I guess scrubbing 1PB would never finish. Could I perhaps disable checksums on the initiator? And what about deduplication?
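
To be explicit, these are the knobs I'm asking about (pool names are examples):

  # On each storage node: scrub the pool that actually holds the zvols
  zpool scrub storage

  # On the head node / initiator side: the settings I'm wondering whether I can use
  zfs set checksum=off bigpool
  zfs set dedup=on bigpool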

What happens if a zvol target dies? When I google this I find a lot of complaints that the system will just hang or panic. Most of those comments are about OpenSolaris; how is this on FreeBSD? Should I use failmode=continue?
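
That is, something like this on the head node's pool (pool name is an example):

  # failmode controls behaviour on catastrophic device loss: wait (default), continue or panic
  zpool set failmode=continue bigpool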

Our third idea is creating file volumes on top of ZFS on each storage node, exporting them with NFS, and then creating a ZFS pool on top of those large files. This will probably perform worse than iSCSI, but will it be more reliable?
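
In sketch form (paths, names and sizes are made up, and a two-node mirror is just the smallest example):

  # On each storage node: a large file on top of the local pool, shared over NFS
  truncate -s 10T /storage/export/vol0
  zfs set sharenfs=on storage/export

  # On the head node: mount the exports and build a pool out of the file-backed vdevs
  mkdir -p /mnt/node1 /mnt/node2
  mount -t nfs node1:/storage/export /mnt/node1
  mount -t nfs node2:/storage/export /mnt/node2
  zpool create bigpool mirror /mnt/node1/vol0 /mnt/node2/vol0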

Suggestions and criticism are very welcome :)
 
Our current hardware setup looks like this:

Main server
  • Phenom II, 6 cores
  • 8GB ECC RAM
  • Intel dual gigabit PCIe NIC

Storage node
  • Asus M5A99X
  • Athlon II
  • 2GB ECC RAM
  • Intel PCIe gigabit NIC
  • PCI ATI graphics card
  • 3 IBM M1015 SAS controllers
  • 24 × 3TB hard drives

Planning to set up 3× raidz2 vdevs of 8 disks each.
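
In other words, something along these lines (device names are examples):

  # 24 drives as three 8-disk raidz2 vdevs in one pool
  zpool create storage \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
    raidz2 da16 da17 da18 da19 da20 da21 da22 da23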
 
The "normal" way to do massive storage using ZFS is like so:
  • A head node that contains the motherboard, CPUs, RAM, and several SAS controllers (SATA controllers can be used in a pinch) that support multi-lane cables
  • Individual JBOD chassis stuffed full of drives connected to either SATA backplanes or SAS expander backplanes
  • Cables run from the controllers in the head node to the backplanes in the JBOD chassis, ideally laid out so that each raidz vdev gets only one drive from each JBOD, letting you lose an entire JBOD chassis without losing the pool (sketched below)
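
For example, a deliberately small sketch with three JBODs whose drives show up as da0-7, da8-15 and da16-23 (device names are made up); each raidz1 vdev takes exactly one drive from each JBOD, so a whole chassis can drop out and every vdev is merely degraded, not faulted:

  zpool create tank \
    raidz1 da0 da8  da16 \
    raidz1 da1 da9  da17 \
    raidz1 da2 da10 da18 \
    raidz1 da3 da11 da19 \
    raidz1 da4 da12 da20 \
    raidz1 da5 da13 da21 \
    raidz1 da6 da14 da22 \
    raidz1 da7 da15 da23

With raidz2 vdevs you could afford two drives per vdev from each chassis and still survive losing one JBOD.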

There are a couple of threads on the zfs-discuss mailing list that detail different ways to do this. Sun (now Oracle) and SuperMicro setups seem to be the "norm".

For an example of what this looks like, check out this graphic from DataOnStorage.

You create your ZFS pool on the head node, create all your ZFS filesystems, and then export them via NFS or iSCSI to your client stations.
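
For example, on the head node (dataset names and sizes are placeholders):

  # One filesystem per user or group, exported over NFS via the sharenfs property
  zfs create -p tank/home/alice
  zfs set sharenfs=on tank/home/alice

  # Or, for block access, a zvol per client exported over iSCSI instead
  zfs create -p -V 500G tank/luns/client01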

When you want to expand the pool, you just add more JBOD chassis into the mix, or you replace the existing drives with larger ones.
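
In command terms that's simply (device names are examples again):

  # Add another chassis' worth of drives as a new vdev
  zpool add tank raidz2 da24 da25 da26 da27 da28 da29 da30 da31

  # Or grow in place: with autoexpand on, a vdev grows once every drive in it
  # has been replaced with a larger one and the last resilver has finished
  zpool set autoexpand=on tank
  zpool replace tank da0 da32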
 
olav said:
Planning to set up 3× raidz2 vdevs of 8 disks each.

Isn't that more like 50TB than 1PB (3 raidz2 vdevs × 6 data disks × 3TB ≈ 54TB usable)? Or is the idea to add more storage servers as time goes by?
Otherwise, if the plan is to have just two servers, then isn't the best solution actually to have just one server? Directly attaching the disks to the main server would, I'm sure, save you a lot of problems and result in much better performance...

cheers Andy.
 
Yes, the idea is to add more storage servers as time goes by.

I've experimented a bit with a file volume over NFS and with iSCSI. A file volume over NFS is useless; it's just way too slow. The problem is IOPS.
iSCSI works great when it comes to speed. But there is a problem when you remove an iSCSI device to test reliability: ZFS will just hang and a hard restart is needed. I've been testing with FreeBSD-CURRENT from May; I will update to the latest and test again. Though I don't think the problem is with ZFS, just with the FreeBSD iSCSI initiator (having to kill it to disconnect devices isn't exactly an elegant solution). As the head node doesn't really need any FreeBSD-specific features, maybe a Solaris-based OS would be more reliable?
 
Uhm, you're not trying to create a head node that uses iSCSI volumes from stand-alone ZFS servers, are you? Such that the stack looks like:

disk --> ZFS storage node --> iSCSI --> ZFS head node --> whatever

Your storage nodes should be "dumb" devices: just a power supply and a bunch of disks plugged into a backplane, with cables going to the SAS/SATA controllers in the head node.
 
Why do they have to be 'dumb' devices? I don't really need extremely high performance. I've looked into expander solutions, but hell, they are expensive and have compatibility issues. It's cheaper to just buy 3 SAS controllers. The only thing I wonder about is how a resilver would perform after one storage node has had some downtime.
 