HOWTO: FreeBSD Enterprise 1 PB Storage

vermaden


Today the FreeBSD operating system turns 26 years old.
19 June is International FreeBSD Day.
That is why I have something special today :)

FreeBSD Enterprise 1 PB Storage

#verblog #backup #beadm #enterprise #freebsd #freenas #linux #nas #storage #tyan #zfs #FreeBSDDay
 

ralphbsz


Very nice system, and very well thought out. I used to build similar ones for a living, and doing things at this scale is quite an adventure.

Two questions. First, disk identity and management. You did not use symbolic names for the disk partitions. And you don't have geographic identity for the disks: How do you know that da91p1 is the 7th disk from the left in the 4th row from the front? To do disk maintenance in the future, something like that will be required. This is related to: are there indicator lights associated with the disks? Ideally, you should have a green power/activity light for each disk (right next to the disk, so you know which disk is idle, dead, or always busy), and a controllable light next to each disk (which you can use to orchestrate maintenance operations, like "replace the disk next to the blinking red light"). Plus, can you ask the enclosure services which physical slots contain disks? And if you have both disk naming and geographic identity, and indicator lights that can be controlled, and the ability to detect which disk is missing, then you can build quite a comfortable maintenance management system around it.

Second, performance. It seems the setup tops out at roughly 3 GByte/s for large sequential IOs. But each physical disk should be capable of roughly 100 MByte/s or more (as much as 200 for outer edge large sequential), so with 90 disks you should be getting at least 9 GByte/s. So there is a bottleneck somewhere. Do you know what the bottleneck is? As a simple test, you could try the following: For each of the physical disks, run "dd if=/dev/daXXp1 of=/dev/null bs=16M count=...", 90 copies of this, and then add the results. Would this get closer to the 9 GByte/s theoretical limit?
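The parallel dd test suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: the device names da0 through da89 with a p1 partition, the default block count, and the helper names `sum_dd_rates` and `read_all_disks` are all hypothetical, and the parsing assumes dd's usual summary line ending in "(N bytes/sec)". Adding up per-disk rates is only meaningful while all readers run at the same time.

```sh
#!/bin/sh
# Sketch: read every disk in parallel with dd and add up the per-disk rates.
# Assumptions (not from the thread): disks are da0..da89, each with a p1
# partition, and dd prints a summary line ending in "(N bytes/sec)".

# Sum the "(N bytes/sec)" figures from dd summary lines on stdin; print MB/s.
sum_dd_rates() {
    awk -F'[()]' '/bytes\/sec/ { split($2, f, " "); total += f[1] }
                  END { printf "%.0f\n", total / 1000000 }'
}

# Start one dd reader per disk, wait for all of them, then total the rates.
read_all_disks() {
    count=${1:-1024}                 # number of 16M blocks per disk (hypothetical default)
    logdir=$(mktemp -d) || return 1
    i=0
    while [ "$i" -lt 90 ]; do
        dd if="/dev/da${i}p1" of=/dev/null bs=16M count="$count" \
            2> "$logdir/da${i}.log" &
        i=$((i + 1))
    done
    wait                             # block until every background dd has finished
    cat "$logdir"/*.log | sum_dd_rates
    rm -rf "$logdir"
}
```

Running `read_all_disks 1024` as root would read 16 GiB from each disk and print the summed rate in MB/s, which can then be compared with the 9 GByte/s estimate.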

In reality, the performance question might be irrelevant to you, since you only have 2 GByte/s of network bandwidth, so probably there is no point making the IO and file system work any faster than that.

Again, wonderful description of a good system.
 
vermaden


ralphbsz said:
Very nice system, and very well thought out. I used to build similar ones for a living, and doing things at this scale is quite an adventure.

Two questions. First, disk identity and management. You did not use symbolic names for the disk partitions. And you don't have geographic identity for the disks: How do you know that da91p1 is the 7th disk from the left in the 4th row from the front? To do disk maintenance in the future, something like that will be required. This is related to: are there indicator lights associated with the disks? Ideally, you should have a green power/activity light for each disk (right next to the disk, so you know which disk is idle, dead, or always busy), and a controllable light next to each disk (which you can use to orchestrate maintenance operations, like "replace the disk next to the blinking red light"). Plus, can you ask the enclosure services which physical slots contain disks? And if you have both disk naming and geographic identity, and indicator lights that can be controlled, and the ability to detect which disk is missing, then you can build quite a comfortable maintenance management system around it.

These two are your friends :)
# sesutil locate all off
# sesutil locate da3 on
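The two commands above can be wrapped into a small helper for maintenance work. This is a minimal sketch, assuming the enclosure supports SES and exposes controllable locate LEDs; the function names `blink_only` and `blink_clear` are hypothetical.

```sh
#!/bin/sh
# Sketch of a "which disk do I pull" helper built on sesutil(8).
# Assumption: the enclosure speaks SES and its locate LEDs are controllable.

# Light the locate LED of exactly one disk, clearing all others first.
blink_only() {
    disk=$1                          # for example da3
    sesutil locate all off && sesutil locate "$disk" on
}

# Turn every locate LED off again once the swap is done.
blink_clear() {
    sesutil locate all off
}
```

For example, `blink_only da91` before walking to the rack and `blink_clear` afterwards. As for asking the enclosure which slots hold disks, `sesutil map` prints the enclosure elements together with the devices found in them.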

ralphbsz said:
Second, performance. It seems the setup tops out at roughly 3 GByte/s for large sequential IOs. But each physical disk should be capable of roughly 100 MByte/s or more (as much as 200 for outer edge large sequential), so with 90 disks you should be getting at least 9 GByte/s. So there is a bottleneck somewhere. Do you know what the bottleneck is? As a simple test, you could try the following: For each of the physical disks, run "dd if=/dev/daXXp1 of=/dev/null bs=16M count=...", 90 copies of this, and then add the results. Would this get closer to the 9 GByte/s theoretical limit?

From the description of the SAS3008 I/O controller:
BROCADE said:
Deliver more than million IOPS and 6,000 MB/s throughput performance
50 12 TB drives at 200 MB/s each should top out at 10,000 MB/s, so the controller is the 'slower' part, I think.
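The comparison between disk and controller throughput is simple multiplication, sketched here with the figures quoted in this thread (200 MB/s and 100 MB/s per disk, 6,000 MB/s for the controller). The function names are hypothetical and the numbers are back-of-the-envelope estimates, not measurements.

```sh
#!/bin/sh
# Back-of-the-envelope check: which side limits sequential throughput?

# Aggregate sequential throughput of N disks at R MB/s each, in MB/s.
aggregate_mbs() {
    echo $(( $1 * $2 ))
}

# Name whichever side is the limit: the disks or the controller.
bottleneck() {
    disks_total=$(aggregate_mbs "$1" "$2")
    controller=$3
    if [ "$disks_total" -gt "$controller" ]; then
        echo "controller (${controller} MB/s < ${disks_total} MB/s)"
    else
        echo "disks (${disks_total} MB/s <= ${controller} MB/s)"
    fi
}

bottleneck 50 200 6000   # prints: controller (6000 MB/s < 10000 MB/s)
bottleneck 90 100 6000   # prints: controller (6000 MB/s < 9000 MB/s)
```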

ralphbsz said:
Again, wonderful description of a good system.

Thanks, I wanted to describe the details as much as possible :)
 
vermaden


Sure, maybe in a somewhat more 'limited form', but the GEOM framework is pretty solid, and RAID5 is one of the supported algorithms.
 

usdmatt


vermaden said:
These two are your friends :)
# sesutil locate all off
# sesutil locate da3 on

Ah, that's cool (if the system supports it).
It's the one thing I noticed, as you just have a sea of daX devices. My preferred method these days is to label a GPT partition something like data-XXXX, where XXXX is the end of the serial number. Then I can physically label the disk carriers with the serial number and have a bit more confidence when it comes to knowing which disk to pull.
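The labelling scheme described above might look like this as a script. It is a sketch only: it assumes the data partition is index 1 on each disk, that `diskinfo -s` reports the serial number, and that "the end of the serial number" means the last four characters. The function names `label_for_serial` and `label_disk` are hypothetical.

```sh
#!/bin/sh
# Sketch: label a disk's GPT partition after the tail of its serial number.
# Assumptions: partition index 1 holds the data, and the last four characters
# of the serial are enough to tell the disks apart.

# Build a label such as data-2C3D from a full serial number.
label_for_serial() {
    printf 'data-%s\n' "$(printf '%s' "$1" | tail -c 4)"
}

# Read the serial number of a disk and apply the label to partition 1.
label_disk() {
    disk=$1                          # for example da3
    serial=$(diskinfo -s "$disk") || return 1
    gpart modify -i 1 -l "$(label_for_serial "$serial")" "$disk"
}
```

After `label_disk da3` the partition should also be reachable as /dev/gpt/data-XXXX, and the same label can go on the disk carrier.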
 