Specs
OS: FreeBSD 10.3
FS: ZFS
SAS HBA: LSI 9201-16e
SAS HBA FW/DRIVER: P19
DISK ENCLOSURE: Supermicro SC847E16-RJBOD1
CAPACITY-DISKS: Seagate ST6000NM0024 6TB
FAST-DISKS: Samsung 850 Pro MZ-7KE1T0BW 1TB
Hey All,
I'm having an issue with my FreeBSD build for our homebrew NAS solution here at the office. I recently built it out with just one Supermicro enclosure full of disks and everything went swimmingly: zpool created, replication and snapshotting all functional, boot time very quick.
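(To be clear about what I mean by snapshotting and replication, it's along these lines; this is only a sketch assuming plain zfs send/recv, and the dataset/host names are made up, the real layout doesn't matter for this issue.)
Code:
# hypothetical names -- just illustrating the snapshot/replication flow
zfs snapshot tank/data@2016-07-01
zfs send -i tank/data@2016-06-30 tank/data@2016-07-01 | ssh backuphost zfs recv -F tank/data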
Today more gear arrived and I installed the second enclosure full of disks. During this install I unfortunately broke the SAS chain while some writes were happening, and of course the disks were none too happy about that (none of this data is production, so it's not a huge deal yet). After getting everything cabled back up correctly, I investigated the damaged zpool and decided to reboot to see if it would come back up happier. Much to my dismay, on boot the server generates a lot of disk-related timeout errors on the console.
Example errors are pasted below:
Code:
mps0: mpssas_ata_id_timeout checking ATA ID command 0xfffffe0000af4440 sc 0xfffffe0000aa8000
mps0: ATA ID command timeout cm 0xfffffe0000af4440
mpssas_get_sata_identify: request for page completed wth error 0mps0: Sleeping 3 seconds after SATA ID error to wait for spinup
These error numbers increment up, and I can only assume one is generated for every one of the 90 disks in the system (perhaps even more than one per disk). There's an associated wait time between each error, so the boot takes forever: 37 minutes (some rough math on that below, after the next log excerpt). After it finally boots up, everything appears to be functional and healthy. Midway through the boot I also get another type of console message:
Code:
mps0: mpssas_add_device: failed to get disk type (SSD or HDD) for SATA device with handle 0x002a
mps0: mpssas_add_device: sending Target Reset for stuck SATA identify command (cm = 0xfffffe0000adf830)
(noperiph:mps0:0:130:0): SMID 1 sending target reset
(xpt0:mps0:0:130:ffffffff): SMID 1 recovery finished after target reset
mps0: Unfreezing devq for target ID 130
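Rough math on that boot time (my own back-of-envelope; the per-disk penalty is a guess, not something I've measured exactly):
Code:
# ~37 minutes of delay spread across 90 disks:
echo "scale=1; 37 * 60 / 90" | bc
# => 24.6 seconds per disk, which is in the right ballpark for an ATA
# IDENTIFY timeout plus the 3-second spinup sleep plus a retry or two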
As for those add_device messages: it seems to go through adding a fair few disks identified only by target ID, because they fail to identify via the SATA identify command. At a certain point, though, it runs into this type of error:
Code:
mps0: _mapping_add_new_device: failed to add the device with handle 0x0038 to persistent table because there is no free space available.
It hits a big string of these errors amidst all the timeouts being generated, and I'm not sure what to make of them. After that it goes back to the timeouts, with the occasional target reset, until the target IDs have incremented from ID 130 (the first one) up to something like ID 154 (the last one).
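In case it's useful, the only checks I know of for that are counting how many devices actually hit the persistent-table message and confirming the firmware/driver pairing on the HBA (I believe these are the right mps(4) sysctls, but correct me if there's a better way):
Code:
# how many devices failed to get into the persistent mapping table
dmesg | grep -c "_mapping_add_new_device"
# confirm the P19 firmware/driver pairing on the HBA
sysctl dev.mps.0.firmware_version dev.mps.0.driver_version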
After this point it grinds through a ton of errors that look like:
Code:
mps0: mpssas_add_device: failed to get disk type (SSD or HDD)for SATA device with handle 0x007a
mps0: mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus =0x47
After generating a butt ton of those, it finally runs through the disks, turns on the ethernet ports, decrypts all the partitions and boots up to happiness.
In summary: With only one enclosure in place, none of these disk errors occurred.
We currently have a production system built from parts identical to these, including two enclosures, so I know the configuration is theoretically possible. Once the system finally finishes booting after those 36 minutes, the zpools are healthy and all the disks come up fine in camcontrol. It only causes me grief when the box is starting, but that's no good if we have to patch and reboot some time in the future; the thing shouldn't take this long to spin up.
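(For what it's worth, "healthy" above just means the usual post-boot checks come back clean, roughly:)
Code:
# all pools report healthy once it's finally up
zpool status -x
# camcontrol sees every disk (90 data disks plus the enclosure SES devices)
camcontrol devlist | wc -l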
Any ideas?