Stack several of something like that as hardware.
monkeyboy said: Anyone have hints and experience building (or simply buying) file servers in the 300-500TB range? Is FreeBSD still a good solution for this scale? How should one deal with backups at this scale?
Preferably yours, if you're the one that screwed up x(
throAU said: If you have 500 TB of data go off-line, people are going to want blood.
Nukama said: Better to look at a data store with one more level of redundancy (redundancy over network nodes, not just disks), like Ceph or some other redundant system that spreads over independent nodes (RAIN).
monkeyboy said: Anyone have hints and experience building (or simply buying) file servers in the 300-500TB range? Is FreeBSD still a good solution for this scale? How should one deal with backups at this scale?
Well, I'm the RAIDzilla guy (RAIDzilla II for a couple of years now). Feel free to ask me any questions and I'll try to answer anything not already covered in that link. I've stated my opinion on Pods there as well: they're great for what Backblaze does, but may not be the optimal solution for others. A couple of people have asked me for a comparison of a 'zilla with something that I'd consider a direct competitor, but from a "real" vendor. I'll be posting a RAIDzilla II vs. Dell NX3100 article in the next couple of weeks to answer that question.
monkeyboy said: The Backblaze Storage Pod or RaidZillas are interesting...
monkeyboy said: Use: medical research data, including MR imaging data, genomic sequencing data, microscopy imaging data, etc...
NetApp et al... I don't think the budget is there for systems like those... I've been looking at Synology, which is probably doable budget-wise...
I'm doing 500 Mbyte/sec (24-hour average) on a 32 TB 16-drive 'zilla (3 * 5-drive raidz1 for the pool). Assuming enough PCIe bus bandwidth, a 500 TB unit should be able to get substantially higher speed. I don't think the original poster said what his I/O bandwidth requirements were, or how the pool would be accessed (local, NFS, SMB, ...).
ralphbsz said: The active disk storage is the easy part. Get an x86 server, with two SAS cards, two Xyratex 84-disk drawers, stuff it fully with 4 TB drives. That's enough capacity to store 500 TB, even with a sensible RAID code (even with 80% storage efficiency that system is still more than 500 TB). But: does a single server provide enough availability and speed for this kind of storage solution? I doubt it.
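For reference, a 3 * 5-drive raidz1 layout like the one described above would be built along these lines; the pool name "tank" and the da* device numbers are placeholders, and the 16th drive is shown as a hot spare purely for illustration (it could just as well be a boot disk):

zpool create tank \
    raidz1 da0 da1 da2 da3 da4 \
    raidz1 da5 da6 da7 da8 da9 \
    raidz1 da10 da11 da12 da13 da14
zpool add tank spare da15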
There are a number of us here doing various medium- to large-scale deployments. I think phoenix has the largest disclosed storage set, though I'm sure there are lurkers with much bigger ones (if not here, definitely on the mailing lists).
If you go to a real vendor (your typical 3-letter company, including HP and Dell as honorary members of the 3-letter club, and NetApp because they have two words with three letters each), the sales people will be happy to prepare a quote, which will probably include installation, management, and service. Get a few such quotes to figure out what constitutes a sensible system. Then see whether you want to tackle this yourself.
With deployments of this size, it pays to carefully look at each of the tradeoffs. After factoring in floor space / cooling / power, more lower-capacity drives may present a savings over fewer high-capacity drives. Or, in the other direction, perhaps the new 6 TB drives might be more cost-effective (unlikely at this point, but I don't know how near-term this purchase will be).
Rock-bottom cost estimate: 160 drives, 4 TB each, at about $350 (street price in sensible quantities), $56K. Two disk enclosures with power supplies, I'm guessing about $12K each, makes another $24K. Server, 2-socket, 24-core, with 128-256 GB of memory, and a half dozen SAS and network cards (for the enclosures and tape drives), about $20K. The raw hardware for the first copy of the data is already nearly $100K. If you go with FreeBSD and ZFS (or any similar solution), the software is free. A sensible and reliable system will cost MUCH, MUCH more; not 10% more, but probably 2x or 3x.
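Just to make that arithmetic explicit (all of the prices here are the rough street-price assumptions from the estimate above, not quotes):

# capacity for 160 x 4 TB drives, at roughly 80% efficiency after RAID overhead
echo "raw:    $((160 * 4)) TB"                          # 640 TB
echo "usable: $((160 * 4 * 80 / 100)) TB"               # 512 TB
# hardware cost for the first copy of the data
echo "drives:     \$$((160 * 350))"                     # $56,000
echo "enclosures: \$$((2 * 12000))"                     # $24,000
echo "server:     \$20000"
echo "total:      \$$((160 * 350 + 2 * 12000 + 20000))" # ~$100,000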
I'd consider a second set of disks to be replication, since they're almost certainly going to be on-site, while one of the advantages of tapes is easy off-site storage. And tapes can be loaded one at a time, while with disks you need to attach enough to have a valid pool.
Backup: Obviously, at that scale, backup to a second set of disks is the easiest answer. But tape is also perfectly capable of doing this. If you go with IBM Jaguar or STK 10000 tapes, each cartridge has a capacity of 4-5 TB, so you need about 100 cartridges for one full set. At a cost of $150 per cartridge (that's very generous), that's not a lot of money compared to the fact that each drive itself costs tens of thousands of dollars. Both IBM and STK will be happy to quote you a tape robot that can hold a few hundred tapes and a small number of drives. I don't think the cost or speed of backup is really the limiting factor: at 250 MB/s on the drives, 2x or 3x that with compression, and the fact that you will probably have multiple drives, a full backup only takes about a week. Again, not a limiting factor, as it will take a long time (many days) to fill this much capacity.
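Rough tape math for that scenario, assuming ~5 TB cartridges and three drives at 250 MB/s each with 2x compression (all of those figures are the assumptions from the post above):

echo "cartridges for one full set: $((500 / 5))"             # ~100
echo "media cost: \$$((100 * 150))"                          # ~$15,000
# 3 drives * 250 MB/s * 2x compression = 1500 MB/s effective
echo "full backup: about $((500000000 / 1500 / 3600)) hours" # ~92 hours, i.e. just under 4 days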
Agreed 100%+, again.
I think you really need to begin by figuring out speeds & feeds (how much data in and out), what the value of the data is, how available and reliable the storage has to be, and what type of disasters you need to guard against. These things will change the architecture of a proposed solution, and the cost, by factors of several over the raw disk cost.
It depends what you're guarding against. If you're only concerned about hardware failure, then in the same room is fine. If you're concerned about power failures / HVAC failures, another room in the same building with its own UPS and HVAC is fine. If you're concerned about fire, a block or two away is fine, and you may be able to rent space on the utility poles and put up your own fiber (I've done that). If you're worried about hurricanes, etc., then your second system needs to be in another city, and you need a fast link between them and someone to take care of maintenance tasks on that remote system. You need to price out all of those options. Note that fast links are nowhere near as expensive as in the past - I can get a 10 GbE link from Manhattan to Newark for well under $3000/month (but I buy in bulk).
monkeyboy said: I do like the idea of a 2nd server though. I gather the point would be to put it in another building, which means working out high-speed links. This 2nd server could be slower, perhaps more "near line" in performance...
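For a sense of scale on the link: a full re-sync of 500 TB over a single 10 GbE connection, assuming (optimistically) that the pipe can be kept full, works out to roughly:

# 10 Gbit/s ~= 1250 MB/s; 500 TB = 500,000,000 MB
echo "full sync: about $((500000000 / 1250 / 3600)) hours"   # ~111 hours, or about 4.6 days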
Terry_Kennedy said: Note that none of that is "backup". It is "replication". If somebody deletes data from your main server, that change will propagate to the second server after some time.
I maintain that it's still replication. At some point the large number of snapshots becomes unwieldy and you end up pruning some of them. Maybe you snapshot every hour and delete the hourly ones after some months, keeping a daily snapshot. Likewise for daily/weekly, weekly/monthly, etc.
AndyUKG said: That's when you have to make use of snapshots; in ZFS you can maintain snapshots on both live and backup systems to provide you with point-in-time copies of data. If you don't have any tape or any other offline backups then this is a must, and should be factored into your total available storage when designing the system...
Terry_Kennedy said: Note that none of that is "backup". It is "replication". If somebody deletes data from your main server, that change will propagate to the second server after some time.
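The kind of snapshot-based replication being discussed here is usually driven with zfs send / zfs receive. A minimal sketch; the pool/dataset names, snapshot labels and the host name "backuphost" are placeholders, not anyone's actual setup:

# take a new recursive snapshot on the live server
zfs snapshot -r tank/data@2014-07-02
# send the changes since the previous snapshot to the backup box
zfs send -R -i tank/data@2014-07-01 tank/data@2014-07-02 | \
    ssh backuphost zfs receive -d backup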
The Register posted an interesting article the other day about storage becoming a commodity due to pricing pressure from the cloud.
monkeyboy said: What is our budget? It's enough to do Synology or similar, and more than enough to do StoragePods or RAIDzillas, but not enough to do Dell, IBM, EMC, NetApp... seems like there are surprisingly big gaps in pricing between these classes of solutions...
It is; it's replication of a file system and snapshots, but it addresses the issue you raised about recovering accidentally deleted files.
Terry_Kennedy said: AndyUKG said: I maintain that it's still replication.
I manage systems with snapshots which are maintained for years; it's not a problem in my case, so it may work for the original poster. For example, I have systems where hourly snapshots are maintained for a month, weekly snapshots for a year, and monthly snapshots indefinitely.
Terry_Kennedy said: At some point the large number of snapshots becomes unwieldy and you end up pruning some of them.
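If you do end up pruning, it's simple enough to script. A rough sketch for the "hourly snapshots kept for a month" part of a schedule like that, assuming snapshots are named tank/data@hourly-<something> and that your zfs list supports -p for parseable creation times:

# destroy hourly snapshots older than 31 days
cutoff=$(date -v-31d +%s)    # 31 days ago in epoch seconds (FreeBSD date syntax)
zfs list -Hp -t snapshot -o name,creation -r tank/data | \
    awk -v cutoff="$cutoff" '$1 ~ /@hourly-/ && $2 < cutoff { print $1 }' | \
    while read snap; do
        zfs destroy "$snap"
    done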
We're starting to speculate about the OP's needs here. It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and keep the old ones too, and in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really.
Terry_Kennedy said: I realize that snapshot overhead is minimal in the case where very little has changed. However, the OP said that this will be used for things like MRI datasets, etc., which are both big and non-compressible, so I think this won't be the case for his snapshots (at least in the case of a deleted file).
I was addressing situations like "Oops - I deleted the wrong directory. Arrrggghhh!"
AndyUKG said: It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and keep the old ones too, and in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really.
Terry_Kennedy said: I was addressing situations like "Oops - I deleted the wrong directory. Arrrggghhh!"
AndyUKG said: It may be that, as it's image data, the data never changes and is never deleted - why would you change an MRI image? I'd guess you'd upload a newer one and keep the old ones too, and in this scenario keeping snapshots puts very little demand for space on the system. But better the OP tells us really.
# ls /path/.zfs/snapshot/<date>*/server/path/*filename*
can be used to find copies of the file. And if it's not there, the other server has more snapshots to search through.