Solved: zpool import crashes the box with a CPU NMI panic

Hi Guys,

I moved my data pool to a different SAS HBA. I exported the pool with the old LSI-9201-16e HBA installed, swapped it for an LSI-9207-8e HBA and tried to import the pool, but every time the box crashes with a CPU NMI hardware error. I also use 2 NVMe cache disks. When I try it with the old LSI HBA, it works. The new HBA is brand new and has the IT firmware release 20 from LSI. How can I find out what the problem is?

Best regards, ré
 
Step 1: Tell us what the actual error messages are. Saying "crash" and "CPU NMI hardware problem" isn't enough information to go on.
Step 2: Tell us what software you are running, at least the FreeBSD version number.
Step 3: Look for power supply and cooling problems. The LSI/Avago/Broadcom cards are famously power hungry and run hot (I have posted some anecdotes about them); your problem might simply be that your power supply is overwhelmed, or that the cards are getting too hot. Also, check the SAS cables; I've seen many problems with them.
Step 4: Go to LSI's website, and find up-to-date firmware. Version 20 was current when I was working with those cards, which was ~2015-2016, which is a long time ago.
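
For steps 1 and 2, here is a minimal sketch of a script that gathers the usual details into one block of text you can paste into a reply. It assumes the HBA attaches through the mps(4) driver (hence mpsutil); the helper names run_if_present and collect are made up for illustration. Commands that are not installed are skipped rather than treated as errors.

```python
import shutil
import subprocess

# Read-only commands found on a stock FreeBSD install.
# Assumption: the HBA is driven by mps(4), so mpsutil can query it.
COMMANDS = [
    ["freebsd-version", "-k", "-u"],  # installed kernel and userland versions
    ["pciconf", "-lv"],               # PCI devices, including the HBA
    ["camcontrol", "devlist"],        # disks visible through CAM
    ["mpsutil", "show", "adapter"],   # adapter and firmware details
    ["zpool", "import"],              # only lists importable pools, but it
                                      # still reads the labels on the disks
]


def run_if_present(argv, timeout=30):
    """Return the command's combined output, or a note if it cannot run."""
    if shutil.which(argv[0]) is None:
        return f"{argv[0]}: not found, skipped"
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"{argv[0]}: timed out after {timeout}s"
    return done.stdout.strip()


def collect(commands=COMMANDS):
    """Run every command and join the results under '== command' headers."""
    sections = []
    for argv in commands:
        sections.append(f"== {' '.join(argv)}\n{run_if_present(argv)}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(collect())
```

Run it as root on the affected box and post the output together with the full panic text. If even listing the pools brings the machine down, drop the zpool entry from the list.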
 
Hi Ralph,

Thanks for the answer.

FreeBSD 12.1p3 amd64, error message is:
Code:
panic: NMI indicates Hardware failure
cpuid = 0
time = 1584564198
KDB: stack backtrace:
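
A side note on the output above: the time field looks like seconds since the Unix epoch, so it can be turned into a wall-clock time and compared against the hardware event log in iLO. A minimal sketch, assuming that interpretation of the field; parse_panic and panic_time_utc are hypothetical helper names, not part of any FreeBSD tool.

```python
import re
from datetime import datetime, timezone

# The panic message and the time field, copied from the post above.
PANIC_TEXT = """\
panic: NMI indicates Hardware failure
time = 1584564198
"""


def parse_panic(text):
    """Split a panic header into its message and its 'name = value' fields."""
    message = None
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("panic:"):
            message = line[len("panic:"):].strip()
            continue
        match = re.fullmatch(r"(\w+)\s*=\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return message, fields


def panic_time_utc(fields):
    """Interpret the 'time' field as Unix epoch seconds (an assumption)."""
    return datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc)


if __name__ == "__main__":
    message, fields = parse_panic(PANIC_TEXT)
    print(message)
    print(panic_time_utc(fields).isoformat())
```

Under that assumption the panic above happened at 2020-03-18 20:43:18 UTC, which is the timestamp to look for in the server's own hardware logs.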

The computer is a DL380p_G8 with 2x 750W power supplies; the hard disks are in 2 external shelves. Firmware 20.00.07.00 is the latest for this controller. The server's cooling is good: I checked the temperatures in iLO and they are OK. I ran this server before and everything was fine. I only changed the SAS controller.

Now I also have this type of error message:
Code:
ioat9: IVB IOAT Ch1 mem 0xblabla irq 63 at device 4.1 on pci15
(and so on, up to)
ioat15: IVB IOAT Ch1 mem 0xblabla irq 55 at device 4.7 on pci15

Best regards ré
 
That leaves me with only two suggestions, neither very good. First, try pulling your PCI cards, checking the contacts, perhaps cleaning the sockets (can of compressed air), and reseating them. What if there is a little metal sliver shorting pins in the PCI socket, or corrosion on the card edge connector? If that fails, the next step is replacing the card. I understand that for most people that is a big hassle (extra money, or having to deal with a vendor for a return, and all that).
 
Hi,

Thanks for the response. I spent some time on this and isolated and fixed the problem. The first issue was a bad controller in one of the 2 external shelves. The second issue was that one of the NVMe controllers works in one slot but not in another :-( I also tried different NVMe adapter boards, one with a capacitor and one without; it made no difference.

best regards ré
 