ZFS: pogolinux.org has a 48-disk SAS array with 6 TB disks. RAID 10, anyone?

How fast would a 48-disk SAS RAID 10 go?

Would you even need SSDs?

Economically, would this be a much better buy than SSDs?
 
Pretty darn fast.

We have 90-drive storage boxes using 6-disk raidz2 vdevs and can get over 3 Gbps of scrub throughput (we're limited to a single gigabit link at the moment so our actual data throughput is much lower than the system is capable of), so a setup using mirror vdevs should be even faster. This is with plain-jane 7200 RPM SATA drives (2 and 4 TB mixed).
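For reference, the scrub rate comes straight from the pool status output; roughly like this (the pool name "tank" is just a placeholder):

    # Kick off a scrub and watch the rate ZFS reports on the "scan:" line
    zpool scrub tank
    zpool status tank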

For mirror vdevs, you should get the aggregate write I/O of half the drives (each write goes to both disks in a mirror pair simultaneously), and in theory the read I/O of every drive (two read requests can be satisfied by the two halves of a mirror independently).

As fast as the SAS drives may be, adding SSDs for caching could be faster still. SAS drives are still limited to around 100-200 IOps, while SSDs are measured in the thousands of IOps. Not to mention that latency on SSDs is much lower than on SAS drives, which are limited by spindle speed; it's virtually impossible to get below ~7 ms for uncached reads on 7200 RPM drives.
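As a rough sketch of what the pool layout could look like (device names da0..da47 and nvd0..nvd2 are placeholders for whatever your HBAs expose; adjust to taste):

    # 24 two-way mirror vdevs: the ZFS equivalent of RAID 10 across 48 disks
    vdevs=""
    for i in $(seq 0 2 46); do
        vdevs="$vdevs mirror da$i da$((i+1))"
    done
    zpool create tank $vdevs

    # Optional SSDs: an L2ARC read cache plus a mirrored SLOG for synchronous writes
    zpool add tank cache nvd0
    zpool add tank log mirror nvd1 nvd2

The cache device only helps reads that miss ARC, and the log device only helps synchronous writes, so whether the SSDs pay off depends entirely on the workload.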
 
The simple question in these kinds of games is always: Where do you want your bottleneck to be, and how much money do you have?

Maximum sequential throughput of an individual spinning disk (SAS or SATA): up to 200 MByte/s. Take that times 48 disks, and you have 9.6 GByte/s. That's nearly 80 Gbit/second. It is possible to provision enough SAS and PCIe bandwidth to actually utilize this fully. It may seem expensive (you'll probably need multiple high-end HBAs and a server with lots of PCIe slots), but the cost of that infrastructure is small compared to the cost of the disks (48 x $300, roughly $15K). Then running RAID 1(0) over it cuts the bandwidth available to users in half for writes.
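Spelling out that back-of-the-envelope arithmetic (the 200 MByte/s per-disk figure is an assumption, not a measurement):

    # 48 spinning disks at ~200 MB/s sequential each, best case
    echo "$((48 * 200)) MB/s aggregate"               # 9600 MB/s = 9.6 GB/s
    echo "$((48 * 200 * 8 / 1000)) Gbit/s"            # ~77 Gbit/s of raw streaming bandwidth
    echo "$((48 * 200 / 2)) MB/s usable for writes"   # mirroring (RAID 10) halves write bandwidth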

I know that modern x86 hardware is capable of doing this, but it isn't easy: It will require significant tuning and careful configuration.

In terms of cost per throughput ($ / GB/s) for sequential IO, spinning disks are still unbeatable. For random IO, the situation is different: as Phoenix said, every disk will do no more than 100-200 seeks per second (IOps), so the aggregate is only ~5,000-10,000 IOps (half that for writes on mirrors). Even a single cheap SSD can do way more than that; $15K worth of SSDs or NVMe hardware would be faster by orders of magnitude. For best performance, you probably want a combination of disks and SSDs. But that only makes the problem much harder: Which data goes on which storage device?
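And the random-IO side of the same envelope (again using assumed per-disk numbers):

    # ~100-200 seeks per second per spinning disk
    echo "$((48 * 100)) to $((48 * 200)) IOps aggregate"   # ~4800-9600 IOps; half that for mirrored writes
    # For comparison, even a single consumer SSD is rated in the tens of thousands of IOps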

So: Do you know what your workload is? What is the ratio of sequential to random, of read to write? What are IO sizes, striding, all that stuff?
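If you don't know, measuring is cheap; something like fio against the intended pool will tell you. The job parameters below are just placeholders to adapt, not recommendations:

    # Small random reads at high queue depth: a stand-in for a seek-heavy workload
    fio --name=randread --filename=/tank/fio.test --size=10G \
        --rw=randread --bs=4k --ioengine=posixaio --iodepth=32 \
        --numjobs=16 --group_reporting --runtime=60 --time_based

    # Large sequential reads: a stand-in for a streaming workload
    fio --name=seqread --filename=/tank/fio.test --size=10G \
        --rw=read --bs=1m --ioengine=posixaio --iodepth=4 \
        --numjobs=4 --group_reporting --runtime=60 --time_based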

The harder question is: What are you going to do with the data? If you want to actually process it right on that machine, that's going to be awfully difficult. It is very hard to do something productive with 5 or 10 GB/s of data; there are just not enough CPU cycles to perform interesting analysis on that data using a single CPU. About two years ago, I went to a Hadoop+HDFS conference, and in talking to people in the hallways discovered that the throughput of data analysis tasks on typical 1U rackmount Intel servers is often only 3-4 MByte/s, which is a factor of a thousand slower than this set of disks could deliver data.

So maybe you want to build a cluster, with this machine as a file server and 1000 machines as data analysis (CPU) boxes. Now: What cluster data distribution mechanism do you want to use? Have you ever tried using NFS or CIFS/Samba with 1000 clients? It's doable, but not fun, and a full-time job, in particular where performance is concerned. Furthermore, if this machine is now a file server, it needs network interfaces to move the data from CPU/memory back out to the network, which doubles the required PCIe bandwidth. The machine already has multiple high-end SAS HBAs, and now you'll add multiple InfiniBand or 100 Gbit Ethernet cards to it? I've literally seen smoke come out of Intel servers, when the power supply and cooling infrastructure was so overloaded by building this kind of system that things overheated. And even if you get that (seemingly simple and superficial) problem under control, configuring and tuning a storage server and cluster file system at that scale is a serious task.

Required anecdote: We had a SAS HBA whose temperature we were monitoring through its management interface. At one point it reported a chip temperature slightly above 100 degrees C, and then we never heard from the card again. When we opened the server, we found that the PC board around the chip was no longer green; it was brown. Oops ... we tried to pump too much data through it, and it burned up. In the normal world (excluding pre-production prototypes), these smoke and fire problems don't occur, because most good server-class hardware has protection against overcurrent and overheating (which we promptly disable in our lab).
 
https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
 
It seems at work we have financial clients, and the speed is unoptimized and crappy because the programmers are so removed from the real problem. Everything is a hacked-together combination of Oracle DB, C++ engines, and IIS. It would seem that if the programmers got nearer to the problem they could solve it far faster and better, but with offshoring, agile, interface teams, project managers, and friends, it's all a morass.
 