Hello,
Maybe this should go in General, not sure... I was attempting to use a zpool of SSDs when I ran into some odd behavior and a kernel panic.
For the holidays I got a 4-bay hot-swap enclosure and four hand-me-down 1 TB SSDs (which were working fine in the previous Linux NAS). I was able to create a 4-way mirror with them and copy a bunch of data over. Everything seemed to work fine at this point. This machine has 4 zpools on it: one 7-disk HDD pool, one 1-disk USB boot drive, one 1-disk USB data drive, and the new 4-disk SSD mirror. I have 8 ports on the motherboard (currently unused) and two LSI SAS2008/9211-8i cards (mps), which are currently hosting the HDDs. This is an old machine: ASRock A75 Extreme mobo with an AMD A6-3500 CPU and 8 GB RAM, running 15.0-RELEASE. It's primarily a file server.
A couple days ago I tried to run a scrub on everything and later discovered that the system was completely frozen. The monitor was blank, SSH wouldn't connect, my USB keyboard didn't even register key presses. I power cycled and eventually discovered that it was due to the scrubbing. At first I thought it was because I scrubbed everything at once and it ran out of memory, but then discovered that I could scrub all pools individually except the SSD pool. After many hours of testing data cables/ports, power cables/ports, direct plug/enclosure, different controllers, etc., I think I narrowed it down to the SSDs, regardless of where/how they are connected. Scrubbing the SSD pool causes it to freeze. I'm not sure why the monitor was blank initially, but my last attempt was with the SSDs on direct power and data cables from an LSI card, and in that config I saw a panic on the monitor when it froze. I was able to grab a photo and then found another one in /var/log:
Code:
spin lock 0xfffffe000d3dc400 (sched lock 1) held by 0xfffff800b6c8a780 (tid 102500) too long
timeout stopping cpus
panic: spin lock held too long
cpuid = 1
time = 1769469202
KDB: stack backtrace:
#0 0xffffffff80bbe1ed at kdb_backtrace+0x5d
#1 0xffffffff80b71576 at vpanic+0x136
#2 0xffffffff80b71433 at panic+0x43
#3 0xffffffff80b4d1b4 at _mtx_lock_indefinite_check+0x64
#4 0xffffffff80b4d2fb at thread_lock_flags_+0xdb
#5 0xffffffff80ba4176 at sched_preempt+0x16
#6 0xffffffff810471b6 at ipi_bitmap_handler+0x86
#7 0xffffffff81052243 at Xipi_intr_bitmap_handler+0xb3
#8 0xffffffff804bfafd at acpi_cpu_idle+0x2cd
#9 0xffffffff8103cbe6 at cpu_idle_acpi+0x46
#10 0xffffffff8103cc9d at cpu_idle+0x9d
#11 0xffffffff80ba5a36 at sched_idletd+0x546
#12 0xffffffff80b2786b at fork_exit+0x7b
This trace seems to be the same every time, though I have only been able to catch a few. In the most recent panic, with the SSDs connected to the LSI card, the following appeared on screen after the trace above (this part isn't in /var/log):
Code:
#13 0xffffffff81050f3e at fork_trampoline+0xe
Uptime 9m2s
mps1: Sending StopUnit: (xpt0:mps1:0:1:ffffffff): handle 14
mps1: Incrementing SSU count
mps1: Sending StopUnit: (xpt0:mps1:0:2:ffffffff): handle 13
mps1: Incrementing SSU count
mps1: Sending StopUnit: (xpt0:mps1:0:3:ffffffff): handle 15
mps1: Incrementing SSU count
mps1: Sending StopUnit: (xpt0:mps1:0:4:ffffffff): handle 16
mps1: Incrementing SSU count
mps1: Time has expired waiting for SSU commands to complete.
The SSDs are Silicon Power 1TB A55, which reportedly use a Silicon Motion controller. Since the machine froze every time I scrubbed these drives regardless of where they were connected, it seems like the SSDs themselves are the issue (or at least part of it). However, they seemed to work fine in the previous NAS (I'm not sure if they were ever scrubbed there, so maybe it is load related). Unfortunately I don't have other SSDs right now to test anything with.
I guess the next thing to try is to break up the pool and use one or more SSDs independently to see what happens... but the panic was unexpected.
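For the single-disk test, something like the following is roughly what I have in mind. It is only a sketch: `da1` is a placeholder device name (not taken from this system), and `read_test` is a made-up helper. The idea is a plain sequential read of one SSD with dd, outside of ZFS, to see whether sustained read load on its own is enough to trigger the freeze. By default it only prints the command; setting RUN=1 actually runs it.

```sh
#!/bin/sh
# Sketch: read one disk end to end without involving ZFS.
# "da1" is a placeholder device name; substitute the real one.

# Print the dd command by default; set RUN=1 to actually execute it.
read_test() {
    dev="$1"
    # Read-only: data goes to /dev/null, nothing is written to the disk.
    cmd="dd if=/dev/${dev} of=/dev/null bs=1m"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

read_test "${1:-da1}"
```

If one disk survives a full read, the same test could be repeated on the others one at a time, and then on several at once, to see whether the freeze depends on how many SSDs are busy.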