500TB fileserver?

Anyone have hints and experience building (or simply buying) file servers in the 300-500TB range? Is FreeBSD still a good solution for this scale? How should one deal with backups at this scale?
 
At that sort of scale, if you don't know what you're doing I would strongly suggest hiring someone who does.

To ensure such a setup is reliable, you will want redundant controllers, redundant networking (multiple 10 GbE ports), etc.

Yes, an EMC or NetApp (for example) may look expensive on paper, but you get real support. Don't underestimate the complexity of building something this big and ensuring it is actually fault tolerant.

I'm not saying that it can't be done with FreeBSD, but this isn't really the scale at which you want to be learning on the fly. If you have 500 TB of data go off-line, people are going to want blood.
 
throAU said:
If you have 500 TB of data go off-line, people are going to want blood.
Preferably yours if you're the one that screwed up x(
 
Better to look at a data store with one level of redundancy higher (redundancy across network nodes, not just disks), like Ceph or some other redundant system that spreads data over independent nodes (RAIN).
 
Nukama said:
Better to look at a data store with one level of redundancy higher (redundancy across network nodes, not just disks), like Ceph or some other redundant system that spreads data over independent nodes (RAIN).

Ceph is certainly interesting and compelling, however be aware of a few things:

1. the filesystem support must be enabled to do anything that closely resembles traditional CLI backup methods (Ceph stores files in blocks that are placed all over)

2. Ceph is not meant for anything but a 100% local SAN - that is, it is not good for WAN applications

3. "cloud" access requires FastCGI and a dedicated HTTP/S gateway infrastructure (and their S3 API is a little wonky)
 
monkeyboy said:
Anyone have hints and experience building (or simply buying) file servers in the 300-500TB range? Is FreeBSD still a good solution for this scale? How should one deal with backups at this scale?

Our largest FreeBSD + ZFS install is currently 120 TB, but the architecture used will support much, much, much more. It's only a backups storage server, though, so it's geared more toward maximum storage size and not maximum storage throughput. It does saturate the gigabit link, though, so it's not exactly pokey, either. :)

We use a head unit / storage unit split setup.

The head unit uses:
  • SuperMicro SC216 2U chassis with 24x 2.5" HD bays
  • SuperMicro H8DGi-6F motherboard
  • 2x AMD Opteron 6100-series CPUs (8 cores each at 2 GHz; the motherboard supports up to 16-core CPUs)
  • 128 GB ECC DDR3 RAM (mobo supports 512 GB of RAM)
  • 2x Intel 330 SSD for the OS and ZFS log device
  • 2x Intel 520 SSD for the L2ARC and swap
  • 4x LSI 9200-8e SATA controller (each plugged into a PCIe 2.0 x8 slot)

The OS is installed into a separate ZFS pool using a single mirror vdev (on the 330 SSDs). The onboard SATA controller is used for the SSDs. The LSI controllers are connected via external SAS cables to the storage units.

The storage unit uses:
  • SuperMicro SC847 4U chassis with 45x 3.5" HD bays
  • 45x 2 TB Western Digital and Seagate SATA HDs (standard consumer 7200 RPM drives)

There are two separate backplanes in the storage unit; one for the front 24 bays, and the other for the 21 bays on the back. We connect each backplane to a separate channel on the LSI controller in the head unit. Thus, there are 4 SATA channels (~24 Gbps) per backplane.

Our largest box currently has 2 storage units connected, although only 24 drives are installed in the second one at this time.

This setup supports up to 4 storage units directly attached, or 8 storage units in a daisy-chain setup (second storage unit connects to first storage unit which then connects to the SATA controller).

Using 2 TB drives, there's 90 TB of raw storage in each storage unit, so 4 units gives 360 TB of raw storage and 8 units gives 720 TB of raw storage. Actual available storage will depend on the ZFS pool config.
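
For a rough idea of the pool side, here's one way a 45-bay storage unit could be carved up - this is just a sketch, not necessarily our exact layout, and the device names are examples (they depend on controller probe order):

Code:
# One 45-bay unit as five 9-disk raidz2 vdevs (2 parity disks per vdev).
zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  \
    raidz2 da9  da10 da11 da12 da13 da14 da15 da16 da17 \
    raidz2 da18 da19 da20 da21 da22 da23 da24 da25 da26 \
    raidz2 da27 da28 da29 da30 da31 da32 da33 da34 da35 \
    raidz2 da36 da37 da38 da39 da40 da41 da42 da43 da44

# Log and cache would live on (partitions of) the SSDs in the head unit;
# again, the names here are placeholders.
zpool add tank log mirror ada2p1 ada3p1
zpool add tank cache ada4p1 ada5p1

With 2 TB drives that layout gives roughly 70 TB usable out of the 90 TB raw per unit; wider vdevs or raidz3 shift the capacity vs. resilience trade-off.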
 
The other questions I would have are:

What is this going to be used for? If you're looking to use it for VM storage, you will want to consider whether or not VMware/Hyper-V/etc. certification is important (this is a business decision I suspect). Is Windows based VSS snapshot support (for application consistent snapshots) important? How do you plan to back it up? Does your backup vendor have any concerns with the solution you're planning to install?
 
use: medical research data, including MR imaging data, genomic sequencing data, microscopy imaging data, etc...

backups: this is part of my question... I really only have dealt with and think of tapes... I don't quite grok using disks as backups unless they are going to be hot-removable... but open to more modern ideas...

NetApp et al... I don't think the budget is there for systems like those... I've been looking at Synology, which is probably doable budget-wise...

A sales guy has been pushing Lustre at closeout prices. The Backblaze Storage Pods or RAIDzillas are interesting...
 
monkeyboy said:
The Backblaze Storage Pods or RAIDzillas are interesting...
Well, I'm the RAIDzilla guy (RAIDzilla II for a couple of years now). Feel free to ask me any questions and I'll try to answer anything not already covered in that link. I've stated my opinion on Pods there as well: they're great for what Backblaze does, but may not be the optimal solution for others. A couple of people have asked me for a comparison of a 'zilla with something that I'd consider as a direct competitor, but from a "real" vendor. I'll be posting a RAIDzilla II vs. Dell NX3100 article in the next couple of weeks to answer that question.
 
monkeyboy said:
use: medical research data, including MR imaging data, genomic sequencing data, microscopy imaging data, etc...

So your data would cost quite a lot of man-hours/$ to re-generate then? That data sounds kind of important; if it were me, I wouldn't be trying to cheap out on it too much - I'd focus on providing the most reliable solution possible and, if the budget comes out too big, then work out ways it can be trimmed. But yes, unless you can answer the question of approximately how much your data costs to generate, determining that protecting it is "too expensive" is premature.

I'd get the bean counters involved to work out how much it would cost if the data were to go away - then work out some figures on hardware failure rates and other catastrophic events (building fire, etc.), ask management what they want to protect against, and outline the exclusions in your project scope. You don't want to be the one holding the ball when it all goes pear-shaped and they say "Why didn't you prevent this?". Much better if you can reply that they signed off that protecting against that particular risk was deemed too expensive.

Don't forget downtime due to time-to-recover from disaster - it's all well and good to back up to tape or what have you, but if you will go out of business before you can recover from a failure, there's no point.

If you go in low-ball initially, they'll try to cut it anyway and you'll end up compromising too much. Every time IT wants to spend money it is always too much, unless management are convinced properly :) In my experience, if you present the business case appropriately, most sane management want to protect themselves - but they may not have sat down and actually thought about the risks (the "we pay IT to just make magic happen" line of thinking).
monkeyboy said:
NetApp et al... I don't think the budget is there for systems like those... I've been looking at Synology, which is probably doable budget-wise...

Who's setting the budget? This is a business decision of risk vs. cost to mitigate. If you're comparing NetApp or EMC vs. roll-your-own, are you including the wages of a full-time on-call storage administrator to build and maintain the system (including testing of new gear before rolling it to production, etc.)? If being a storage administrator is not your job, this will have a significant impact on your work - which is a real cost.

Do you actually need the 500TB to be live, or is a lot of it archive-able? Are there any performance requirements?

Not necessarily saying don't go ZFS on FreeBSD, but if you're doing comparisons, you need to compare like for like... if you get hit by a bus (or go on holiday or whatever) will the business be able to support the solution in your absence? Will it have redundant controllers? Etc.

Whether it is NetApp, EMC, ixSystems, or whoever, there will likely be costs beyond the cost of purchasing the hardware that should be taken into account.



edit:
I realise I haven't answered much here, but this is because a lot of it is a case of "well, that depends...".

Hopefully some of the questions above have given you some things to consider and ask management about - at the end of the day, it's their data. If they do not value it, and don't want to spend, then so be it. So long as you make them aware of the trade-offs.

Backups: restoring 500 TB from tape will take weeks. It also won't be cheap (I'm guessing what... $3000-5000 in tapes alone for a single full backup). If that time-frame is unacceptable, you'll be forced to do disk to disk, preferably to another dataset in a different building, cloud service offering, etc.
 
A question others have asked several times: How much is the data worth? Which means: If all the 500 TB were to suddenly vanish, how much would it cost to recreate the data? I bet the answer will be a big number.

From that viewpoint, it is silly to economize on the storage device. I think a sensible budget for this amount of storage will be on the order of 10^5 to 10^6 dollars.

The active disk storage is the easy part. Get an x86 server, with two SAS cards, two Xyratex 84-disk drawers, stuff it fully with 4TB drives. That's enough capacity to store 500 TB, even with a sensible RAID code (even with 80% storage efficiency that system is still more than 500 TB). But: Does a single server provide enough availability and speed for this kind of storage solution? I doubt it.

If you go to a real vendor (your typical 3-letter company, including HP and Dell as honorary members of the 3-letter club, and NetApp because they have two words with three letters each), the sales people will be happy to prepare a quote, which will probably include installation, management, and service. Get a few such quotes, to figure out what constitutes a sensible system. Then see whether you want to tackle this yourself.

Rock-bottom cost estimate: 160 drives, 4TB each, at about $350 (street price in sensible quantities), $56K. Two disk enclosures with power supplies, I'm guessing about $12K each, makes another $24K. Server, 2-socket, 24-core, with 128-256 GB of memory, and a half dozen SAS and network cards (for the enclosures and tape drives), about $20K. The raw hardware for the first copy of the data is already nearly $100K. If you go with FreeBSD and ZFS (or any similar solution), the software is free. A sensible and reliable system will be MUCH MUCH higher; not 10% higher, but probably 2x or 3x.

Backup: Obviously, at that scale, backup to a second set of disks is the easiest answer. But tape is also perfectly capable of doing this. If you go with IBM Jaguar or STK 10000 tapes, each cartridge has a capacity of 4-5 TB. So you need about 100 cartridges for one full set. At a cost of $150 per cartridge (that's very generous), that's not a lot of money, compared to the fact that each drive itself costs tens of thousands of $. Both IBM and STK will be happy to quote you a tape robot that can hold a few hundred tapes and a small number of drives. I think the cost or speed of backup really isn't the limiting factor. At 250 MB/s on the drives, 2x or 3x that with compression, and the fact that you will probably have multiple drives, a full backup only takes a week. Again, not a limiting factor, as it will take a long time (many days) to fill this much capacity.

I think you really need to begin by figuring out speeds & feeds (how much data in and out), what's the value of the data, how available and reliable does the storage have to be, what type of disasters do you need to guard against. These things will change the architecture of a proposed solution and the cost by factors of several over the raw disk cost.
 
ralphbsz said:
The active disk storage is the easy part. Get an x86 server, with two SAS cards, two Xyratex 84-disk drawers, stuff it fully with 4TB drives. That's enough capacity to store 500 TB, even with a sensible RAID code (even with 80% storage efficiency that system is still more than 500 TB). But: Does a single server provide enough availability and speed for this kind of storage solution? I doubt it.
I'm doing 500 Mbyte/sec (24 hour average) on a 32 TB 16-drive 'zilla (3 * 5-drive raidz1 for the pool). Assuming enough PCIe bus bandwidth, a 500 TB unit should be able to get substantially higher speed. I don't think the original poster said what his I/O bandwidth requirements were, or how the pool would be accessed (local, NFS, SMB, ...).

I'm with you 100%+ on the need to have a second server. That can be a live mirror or some sort of scheduled synchronization, but it needs to be available at a moment's notice.

If you go to a real vendor (your typical 3-letter company, including HP and Dell as honorary members of the 3-letter club, and NetApp because they have two words with three letters each), the sales people will be happy to prepare a quote, which will probably include installation, management, and service. Get a few such quotes, to figure out what constitutes a sensible system. Then see whether you want to tackle this yourself.
There are a number of us here doing various medium- to large-scale deployments. I think phoenix has the largest disclosed storage set, though I'm sure there are lurkers with much bigger ones (if not here, definitely on the mailing lists).

Don't underestimate the fact that you will be ultimately responsible for a home-built system. You may wind up with an unexplained fault, and telling your boss "No, there isn't anyone to call to get this fixed faster" isn't going to go over well. Nor is burning through a good chunk of your budget and then deciding "I can't deal with this any more". Having said all that, there are advantages - cost savings, having the source code for everything, and being able to (if you have the skill set) find any problem yourself can be quite satisfying. If you're doing a one-off deployment (hopefully involving multiple, redundant servers), the time / money / learning effort to build it may tip the scales toward a commercial solution.

Rock-bottom cost estimate: 160 drives, 4TB each, at about $350 (street price in sensible quantities), $56K. Two disk enclosures with power supplies, I'm guessing about $12K each, makes another $24K. Server, 2-socket, 24-core, with 128-256 GB of memory, and a half dozen SAS and network cards (for the enclosures and tape drives), about $20K. The raw hardware for the first copy of the data is already nearly $100K. If you go with FreeBSD and ZFS (or any similar solution), the software is free. A sensible and reliable system will be MUCH MUCH higher; not 10% higher, but probably 2x or 3x.
With deployments of this size, it pays to carefully look at each of the tradeoffs. After factoring in floor space / cooling / power, more lower-capacity drives may present a savings over fewer high-capacity drives. Or, in the other direction, perhaps the new 6TB drives might be more cost-effective (unlikely at this point, but I don't know how near-term this purchase will be).

Backup: Obviously, at that scale, backup to a second set of disks is the easiest answer. But tape is also perfectly capable of doing this. If you go with IBM Jaguar or STK 10000 tapes, each cartridge has a capacity of 4-5 TB. So you need about 100 cartridges for one full set. At a cost of $150 per cartridge (that's very generous), that's not a lot of money, compared to the fact that each drive itself costs tens of thousands of $. Both IBM and STK will be happy to quote you a tape robot that can hold a few hundred tapes and a small number of drives. I think the cost or speed of backup really isn't the limiting factor. At 250 MB/s on the drives, 2x or 3x that with compression, and the fact that you will probably have multiple drives, a full backup only takes a week. Again, not a limiting factor, as it will take a long time (many days) to fill this much capacity.
I'd consider a second set of disks to be replication, since they're almost certainly going to be on-site, while one of the advantages of tapes is easy off-site storage. And tapes can be loaded one at a time, while with disks you need to attach enough of them to have a valid pool.

I would also suggest looking at LTO drives. LTO has steamrollered most of the competition, except in niche markets. There are multiple actual manufacturers for most of the LTO products (drives, media, etc.) and competition helps keep costs low. Backup could be done with multiple drives at once. An IBM TS3200/Dell TL4000 (same thing) is an entry library that can be configured with 4 LTO-6 drives and 48 tape slots. Each drive has its own SAS or FC attachment to one or two hosts. Though, with the size of this deployment, looking at a library with hundreds of slots might be prudent. A lot will depend on how much of the data needs to be backed up at a given time. Obviously, needing to back up all 500 TB once a week during a single overnight shift is a very different animal from needing to back up 250 TB over the span of a month.

I think you really need to begin by figuring out speeds & feeds (how much data in and out), what's the value of the data, how available and reliable does the storage have to be, what type of disasters do you need to guard against. These things will change the architecture of a proposed solution and the cost by factors of several over the raw disk cost.
Agreed 100%+, again.
 
thanks for all the thoughtful replies...

re: how much the data costs, etc.
I understand this argument, and yes, the data costs $100,000s to $millions to generate, not to mention years or more in time. Nevertheless, the budget is the budget. It is not possible to go to the granting agencies (who themselves are having their budgets slashed) and ask for more. NIH's budget has been slashed, for example; now only 1/20 of grants are funded and everyone is running on shoestrings. So as much as *I* fully understand those arguments, the money simply isn't there for the best, most robust system/architecture. We are looking for "good enough", whatever that means...

What is our budget? It's enough to do Synology or similar, and more than enough to do StoragePods or RAIDzilla, but not enough to do Dell, IBM, EMC, or NetApp... seems like there are surprisingly big gaps in pricing between these classes of solutions...

re: backup
the above is partly why I still think (but please convince me otherwise) that traditional tape backup still makes sense, if for no other reason than it provides a lower-cost "backstop" against loss or if we overflow the storage system, even if it means many hours/days to restore the data, or to move data around from "online" to archival.

Yes, I don't think we need to back up 500 TB every week... more like maybe 20-50 TB every week or less.

I do like the idea of a 2nd server, though. I gather the point would be to put it in another building, which means working out high-speed links. This 2nd server could be slower, perhaps more "near line" in performance...
 
monkeyboy said:
I do like the idea of a 2nd server, though. I gather the point would be to put it in another building, which means working out high-speed links. This 2nd server could be slower, perhaps more "near line" in performance...
It depends what you're guarding against - if you're only concerned about hardware failure, then in the same room is fine. If you're concerned about power failures / HVAC failures, another room in the same building with its own UPS and HVAC is fine. If you're concerned about fire, a block or two away is fine and you may be able to rent space on the utility poles and put up your own fiber (I've done that). If you're worried about hurricanes, etc. then your second system needs to be in another city and you need a fast link between them and someone to take care of maintenance tasks on that remote system. You need to price out all of those options. Note that fast links are nowhere near as expensive as in the past - I can get a 10 GbE link from Manhattan to Newark for well under $3000/month (but I buy in bulk).

Note that none of that is "backup". It is "replication". If somebody deletes data from your main server, that change will propagate to the second server after some time (depending on the synchronization method chosen). You can read more about this sort of thing on my 'zilla page and the links on it. Something I said to DEC (which didn't endear me to them) when they announced the VAXft system bears repeating: "A couple holding hands while walking off the edge of a cliff".
 
Terry_Kennedy said:
Note that none of that is "backup". It is "replication". If somebody deletes data from your main server, that change will propagate to the second server after some time

That's when you have to make use of snapshots; with ZFS you can maintain snapshots on both the live and backup systems to provide point-in-time copies of the data. If you don't have tape or any other offline backups then this is a must, and it should be factored into your total available storage when designing the system...
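
As a rough sketch of what that looks like in practice (pool, dataset and host names are just placeholders, not a recommendation for your exact setup):

Code:
#!/bin/sh
# Hourly: snapshot the live dataset and ship the increment to the backup box.
NOW=hourly-$(date +%Y%m%d-%H00)
zfs snapshot tank/data@${NOW}

# Find the previous snapshot (the very first run needs a full send instead:
#   zfs send tank/data@${NOW} | ssh backuphost zfs receive backup/data).
PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 tank/data | tail -2 | head -1)

# Send only the changes since the previous snapshot; the backup system ends up
# with the same point-in-time snapshot history as the live system.
zfs send -i "${PREV}" tank/data@${NOW} | ssh backuphost zfs receive backup/data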
 
AndyUKG said:
Terry_Kennedy said:
Note that none of that is "backup". It is "replication". If somebody deletes data from your main server, that change will propagate to the second server after some time
That's when you have to make use of snapshots; with ZFS you can maintain snapshots on both the live and backup systems to provide point-in-time copies of the data. If you don't have tape or any other offline backups then this is a must, and it should be factored into your total available storage when designing the system...
I maintain that it's still replication. At some point the large number of snapshots becomes unwieldy and you end up pruning some of them. Maybe you snapshot every hour and delete the hourly ones after some months, keeping a daily snapshot. Likewise for daily/weekly, weekly/monthly, etc.

I realize that snapshot overhead is minimal in the case where very little has changed. However, the OP said that this will be used for things like MRI datasets, etc., which are both big and non-compressible, so I think this won't be the case for his snapshots (at least in the case of a deleted file).

At the time I built my RAIDzilla II's, ZFS was still quite new (FreeBSD 8.0 / ZFS v13) and concerns about stability led me to a different solution (based on a modified sysutils/rdiff-backup) instead of ZFS snapshots and send / receive. If I were starting from scratch today, I might use those instead, and I admit this is coloring my opinion somewhat.
 
monkeyboy said:
What is our budget? It's enough to do Synology or similar, and more than enough to do StoragePods or RAIDzilla, but not enough to do Dell, IBM, EMC, or NetApp... seems like there are surprisingly big gaps in pricing between these classes of solutions...
The Register posted an interesting article the other day, about storage becoming a commodity due to pricing pressure from the cloud.

I'd also point out that there's no reason you couldn't buy Dell / IBM / whatever hardware and run FreeBSD w/ ZFS on it. That at least gives you a single vendor for the hardware with warranty coverage. A sneak peek at my RAIDzilla II / PowerVault NX3100 comparison shows that in 2010, a RAIDzilla II w/ 16 * 2 TB drives cost $12,529 while a NX3100 with 12 * 2 TB drives cost $16,676. The NX3100 includes the "Windows Tax" of a mandatory copy of Windows Storage Server 2008 R2, so the NX3100 price is somewhat inflated. The price of the 'zilla fell a lot faster than the NX3100, though. However, this does show that a commercial, single-vendor box isn't that much more expensive than a roll-your-own using quality components.
 
Terry_Kennedy said:
I maintain that it's still replication.
It is; it's replication of a filesystem and its snapshots, but it addresses the issue you raised about recovering accidentally deleted files.

Terry_Kennedy said:
At some point the large number of snapshots becomes unwieldy and you end up pruning some of them.
I manage systems with snapshots which are maintained for years; it's not a problem in my case, so it may work for the original poster. For example, I have systems where hourly snapshots are maintained for a month, weekly snapshots for a year, and monthly snapshots indefinitely.
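
That kind of retention can be driven by nothing more than dated snapshot names and a couple of cron jobs. A rough sketch for the hourly tier (the dataset name and the 30-day cutoff are placeholders; the weekly and monthly tiers would be analogous):

Code:
#!/bin/sh
# Destroy hourly snapshots older than 30 days. Assumes names like
# tank/data@hourly-YYYYmmdd-HHMM created by an hourly cron job.
CUTOFF=$(date -v-30d +%Y%m%d)   # FreeBSD date(1) syntax for "30 days ago"

zfs list -H -t snapshot -o name -d 1 tank/data | grep '@hourly-' | \
while read SNAP; do
    STAMP=${SNAP##*@hourly-}    # e.g. 20140301-0500
    DAY=${STAMP%%-*}            # e.g. 20140301
    if [ "$DAY" -lt "$CUTOFF" ]; then
        zfs destroy "$SNAP"
    fi
done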

Terry_Kennedy said:
I realize that snapshot overhead is minimal in the case where very little has changed. However, the OP said that this will be used for things like MRI datasets, etc., which are both big and non-compressible, so I think this won't be the case for his snapshots (at least in the case of a deleted file).
We're starting to speculate about the OP's needs here. It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and maintain the old ones too; in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really :p

cheers Andy.
 
AndyUKG said:
It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and maintain the old ones too; in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really :p
I was addressing situations like "Oops - I deleted the wrong directory. Arrrggghhh!"
 
Terry_Kennedy said:
I was addressing situations like "Oops - I deleted the wrong directory. Arrrggghhh!"

And that is indeed a very important situation, which happens all the time, and needs to be addressed. Humans (including the system administrator) are the single largest cause of data loss, and therefore the single largest problem for the system administrator. Old joke: What's the right way to run a computer center? You have to hire a man and a dog. The man is there to feed the dog. The dog is there to bite the man if he tries to touch the computer.

OK, all joking aside. There is a cheaper (but less reliable) solution for the problem of a human deleting the wrong file or directory: arrange filesystem protection so they can't do it. The obvious step is: don't log in as root unless absolutely necessary, and make sure that files, and in particular directories, are neither group- nor world-writable. At that point, the attack surface has at least been reduced to just the owner of the files. The next step is to use permission bits, ACLs, and filesystem flags so that things can't be deleted by "normal" means.

On my previous server (using OpenBSD and its native version of the Berkeley FS, before I switched to ZFS) I had a script that would go through my directories of "archival" files (scanned documents and MP3 files ripped from CD) and change their permissions to 444 (r--r--r--) and set some combination of the schg and uchg flags, which prevented the files from being modified or deleted. It would only do that once their mtime was older than 24 hours, the theory being that after a day the human is done working on a file, and from then on the file will typically not change or be deleted. Porting that script to ZFS hasn't happened yet (it's on my to-do list), partly because I haven't had time to figure out how "unchangeable" flags or ACLs work on ZFS.
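
For illustration, the UFS-era version boiled down to something like the following - the paths are examples, not my real directories, and chflags is supposed to work on FreeBSD's ZFS as well, though that's exactly the part I haven't verified yet:

Code:
#!/bin/sh
# Lock down "archival" files once they are more than a day old:
# read-only permissions plus the user-immutable flag (undo with chflags nouchg).
for dir in /archive/scans /archive/mp3; do
    find "$dir" -type f -mtime +1 -print0 | xargs -0 chmod 444
    find "$dir" -type f -mtime +1 -print0 | xargs -0 chflags uchg
done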

Is such a solution perfect? No, because a determined fool who logs in as root can still delete the files. But at least it prevents casual mistakes, which are the more common type.
 
Automatic version control for files, like VMS had, would be a nice feature. Of course, it should be something you could turn on for a particular file when needed, not something forced by the system. I'm not sure how it could be done most efficiently. Knowing that Git started as a content-addressable filesystem and not just as a version control system makes me think that's one possibility to look at.
 
Terry_Kennedy said:
AndyUKG said:
It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and maintain the old ones too; in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really :p
I was addressing situations like "Oops - I deleted the wrong directory. Arrrggghhh!"

Which means you go into the snapshots directory and copy the file out of there to the live filesystem.

Or, you go to the second server, into the snapshots directory, and copy the file out of there to the live filesystem.

We do this on an almost daily basis.

Linux servers in the school have the live filesystem. Every night, that is rsync'd to the backups server running ZFS and a snapshot is created. Every day, that snapshot is sent (via zfs send) to the off-site storage server. The first storage server keeps 22 months of daily snapshots. The off-site storage server keeps just shy of 5 years of daily snapshots.
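
Stripped down, the nightly job on the backups server amounts to something like this (host names, paths and the pool name are examples only):

Code:
#!/bin/sh
# Pull the live filesystems from the (non-ZFS) servers, then freeze the result
# as a dated ZFS snapshot. The servers end up as directories under one dataset,
# which is what makes the .zfs/snapshot/<date>/server/path lookups below work.
TODAY=$(date +%Y-%m-%d)

for HOST in server1 server2; do
    rsync -aH --delete ${HOST}:/export/ /backup/${HOST}/
done

zfs snapshot backup@${TODAY}

# The daily off-site copy is then an incremental zfs send of backup@${TODAY}
# to the second storage server.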

If a user deletes a file on the Linux server, we go look for it on the backups server. Even a simple # ls /path/.zfs/snapshot/<date>*/server/path/*filename* can be used to find copies of the file. And if it's not there, the other server has more snapshots to search through.

So, for the OP, doing hourly snapshots on the live ZFS system is the primary level of "backups" (actually data protection). If a user deletes a file, you just grab it from the snapshots. The second level of data protection is to send that data to a second, off-site system. A potential third level is to write the data out to tape.

Snapshots on the primary system are created hourly, culled weekly.

Snapshots on the second system are created via hourly zfs sends, culled on a different schedule (maybe monthly).

And, before you delete the oldest snapshots, you write the data to tape.
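
Writing a snapshot out to tape before culling it can be as simple as reading it from the .zfs directory - here the tape device is FreeBSD's no-rewind sa(4) device and the paths/names are examples (if each server were its own dataset you'd loop over them):

Code:
#!/bin/sh
SNAP=2012-06-01

# Plain tar of the snapshot contents: restorable anywhere, no ZFS required.
tar -cvf /dev/nsa0 -C /backup/.zfs/snapshot/${SNAP} .

# Alternative: archive the raw ZFS stream instead (preserves properties, but
# can only be restored with zfs receive):
#   zfs send backup@${SNAP} | dd of=/dev/nsa0 bs=1m

# Only after the tape has been written (and verified) do we cull the snapshot:
zfs destroy backup@${SNAP}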
 