SESUTIL command got stuck in D state

Not sure which sub-forum this belongs in, so I put it here. This is also not a vanilla FreeBSD system (FreeNAS 11.2, which should be based on FreeBSD 11.2), but I feel the problem is more FreeBSD related.

I wrote myself a little script that uses sesutil locate $disk on to blink the LED of failed disks and sesutil locate $disk off to un-blink good disks automatically. I set up a cron job to run it every minute, and everything seemed to work fine. However, after a week or so the script hung because the sesutil locate $disk off command (the one that turns off the LED for good disks) got stuck in D state (uninterruptible wait). Furthermore, every new sesutil locate $disk off gets stuck in the same state, so appending a & would just create a pile of them. Only a reboot solves the problem for a while, and then it repeats.
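To keep the per-minute cron job from piling up stuck processes, one option is to take a non-blocking lock at the start of each run and exit if a previous run still holds it. Below is a minimal Python sketch of that idea, not the actual script: the lock path, the function names, and the disk lists are all hypothetical, and only the sesutil locate $disk on/off form comes from the description above.

```python
import fcntl
import os
import subprocess

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = "/tmp/led_blink.lock"


def try_lock(path):
    """Return an fd holding an exclusive lock, or None if it is already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def locate_cmd(disk, on):
    """Build the sesutil command line used in the post."""
    return ["sesutil", "locate", disk, "on" if on else "off"]


def run_once(failed_disks, good_disks, lock_path=LOCK_PATH):
    """Blink failed disks and un-blink good ones, unless a previous run is alive."""
    fd = try_lock(lock_path)
    if fd is None:
        # A previous run (possibly stuck in D state) still holds the lock.
        return False
    try:
        for disk in failed_disks:
            subprocess.run(locate_cmd(disk, True), check=False)
        for disk in good_disks:
            subprocess.run(locate_cmd(disk, False), check=False)
    finally:
        os.close(fd)  # closing the fd releases the lock
    return True
```

This does not cure the hang. A run that gets stuck in D state keeps its lock, so later runs return immediately instead of adding more stuck sesutil processes. It would be called as, for example, run_once(["da3"], ["da4", "da5"]), where the disk names are placeholders.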

I feel some insight into the source code is needed here. Why is this command, with a seemingly non-critical function, made uninterruptible (I believe there should be a good reason)? Or is it not the command but the drivers? Any ideas are welcome.

my hardware:
Dell R620 as head unit:
Dual E5 2690 v2
128GB RAM
2 s3500 80G, boot drive
T520-CR
LSI 9207 8e
P3700, slog
NetApp DS4243 as DAS:
24 3.5 Bay
IOM6
8 4TB HGST NL SAS drives HUS726040AL5210: 4 2-way mirrors
8 10TB WD: 1 Raid Z2
 
D state means the program is currently in the kernel (it has started a syscall, probably an IO request), and that IO request is not finishing. The root cause is deep down in the IO stack. Most likely it is some bizarre incompatibility between the OS kernel driver, the HBA (a.k.a. SCSI controller, or in your case the LSI card), the SAS expanders (most likely the only ones are in your JBOD), the JBOD itself (in your case the DS4243), and perhaps a disk, although for blinking LEDs the disks should not even be involved.

Once you are in a D state, you might as well disconnect everything or reboot. It's not coming back to life. Usually, this is not the application's fault (so looking at source code in user space is not the first step), although serious debugging will probably require checking that source code, or tracing or instrumenting it, to see what exactly caused the problem. And note that "cause" is not the same as "at fault".
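To confirm which processes are stuck, and then see where in the kernel they are waiting, something like the following may help. It is a minimal Python sketch under stated assumptions: it parses the output of ps -axo pid,state,command, the helper names (parse_ps, stuck, procstat_cmd) are made up for illustration, and the sample ps output in the usage note is invented. procstat -kk is the FreeBSD tool that prints a process's kernel stack; running it as root is likely necessary.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -axo pid,state,command` output into a list of dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 2)     # pid, state, rest of line is the command
        if len(parts) < 3:
            continue
        pid, state, command = parts
        rows.append({"pid": int(pid), "state": state, "command": command})
    return rows


def stuck(rows, name="sesutil"):
    """Keep processes whose state starts with D and whose command mentions name."""
    return [r for r in rows
            if r["state"].startswith("D") and name in r["command"]]


def procstat_cmd(pid):
    """Command line that prints the kernel stack of one process on FreeBSD."""
    return ["procstat", "-kk", str(pid)]


def report():
    """Run ps, then procstat for every stuck sesutil process (FreeBSD only)."""
    out = subprocess.run(["ps", "-axo", "pid,state,command"],
                         capture_output=True, text=True, check=True).stdout
    for row in stuck(parse_ps(out)):
        print(row["pid"], row["command"])
        subprocess.run(procstat_cmd(row["pid"]), check=False)
```

Given made-up input such as a header line followed by "1234 D    sesutil locate da3 off", stuck(parse_ps(text)) would return the entry for PID 1234. The function names at the top of the kernel stack printed by procstat should hint at which layer (SES driver, CAM, or HBA driver) the request is waiting in.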

The LSI 92xx cards are very good. And I'm quite familiar with the NetApp (former LSI/Engenio) JBODs; those are also very good. But any complex SAS system has lots of gotchas and potential problems. I would begin by checking the firmware versions on everything involved, and updating as much as possible.

Can you reproduce the problem in Linux or Windows? Linux would be easier, since sesutil exists there, using nearly the same source code. This might help with debugging.

Are you the kind of customer who can reach out to LSI and NetApp technical support? In my experience, the best way to handle this is: first, eliminate any problems on your end by making sure all your firmware is up to date and the hardware is stress tested and good to go. Collect all the debugging information you can think of. Then contact both LSI (now known as Avago) and NetApp tech support, and ask them for a conference call or physical meeting with all three parties. This prevents the "fingerpointing" problem, where LSI says it's NetApp's problem, and vice versa.
 
Thanks for the reply. Unfortunately this is a homelab, so I am on my own here. I can confirm that the HBA's firmware is up to date (I believe LSI stopped updating SAS2 cards a few years ago). As for the DAS, I don't think NetApp's firmware downloads are open to the public; I could be wrong though.

I can try to reproduce the problem in Linux... once I find a chance to shut down my lab for a week or two, if I get the chance. It would be a lot easier if I could trace it in FreeNAS/BSD.