Server(s) crashing. No errors in logs

I have two similar servers, the first deployed one month ago and the second deployed last week. They are AMD Epyc servers from the cloud provider OVH (they call them: ADVANCE-STOR | AMD EPYC 4344P), and are used as storage servers with ZFS and Samba. Both have 128 GB of RAM, 2x 960 GB NVMe (which I'm using as a ZFS boot mirror), and 8x 22 TB HDDs (4x2 striped mirrors for the actual shared storage). MSI motherboard and 2x Intel E810 25 Gbps in each. The servers are in different datacenters 100 km apart.

Both servers crashed for the first time with the system fully booted and functional, after the reboot that followed applying updates on a fresh install (nothing was installed apart from the base system at that point). By crashing I mean unresponsive to pings, login console frozen, no input possible from IPMI KVM. The only way out is to reset the server via the IPMI management. I tried to reproduce it on the 2nd server, but I was unable to (reinstalled, upgraded, no crash). The second server has been reinstalled with another OS after that first crash. The 1st server crashed again last week (after one month working without issues), twice on the same day (samba419-4.19.9_5 already installed).

I haven't seen anything obvious in either /var/log/messages or the output of dmesg. The only content of /var/crash is a file called minfree containing the value 2048. During the last crash of the first server I had left a session open with a tail -f /var/log/messages. It showed some irrelevant (I think) Samba messages like these:

Code:
May  6 17:30:20 xxxx nmbd[1575]:   query_name_response: Multiple (2) responses received for a query on subnet 192.168.110.21 for name WORKGROUP<1d>.
May  6 17:30:20 xxxx nmbd[1575]:   This response was from IP 192.168.110.27, reporting an IP address of 192.168.110.27.
May  6 17:35:26 xxxx nmbd[1575]: [2025/05/06 17:35:26.892786,  0] ../../source3/nmbd/nmbd_namequery.c:109(query_name_response)
May  6 17:35:26 xxxx nmbd[1575]:   query_name_response: Multiple (2) responses received for a query on subnet 192.168.110.21 for name WORKGROUP<1d>.
May  6 17:35:26 xxxx nmbd[1575]:   This response was from IP 192.168.110.27, reporting an IP address of 192.168.110.27.

The only failure message I've found appears in /var/log/dmesg.today or /var/log/dmesg.yesterday. I've seen it for both network cards:
Code:
ice1: Malicious Driver Detection Rx event 'Descriptor fetch failed' on Rx queue 1024 PF# 1 VF# 0
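
To see how often this event fires and on which interface, the saved dmesg files can be summarized. The following is only a minimal sketch: the function name `count_mdd_events` is made up for illustration, and it assumes dmesg-style lines where the interface name (for example `ice1:`) is the first field on the line.

```shell
#!/bin/sh
# Count "Malicious Driver Detection" events per ice(4) interface.
# Reads kernel log text on stdin and prints "<interface> <count>" per line.
# Assumption: the interface name is the first field, as in dmesg output.
count_mdd_events() {
    awk '/Malicious Driver Detection/ {
             name = $1
             sub(/:$/, "", name)      # "ice1:" -> "ice1"
             count[name]++
         }
         END { for (n in count) print n, count[n] }' | sort
}
```

It could be run as `count_mdd_events < /var/log/dmesg.today` and again on `/var/log/dmesg.yesterday` to compare the two days.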

After seeing this I looked more closely at the network-related boot messages and found messages stating that the ice_ddp module couldn't be loaded. I added it to loader.conf, and it now loads correctly. I'm not sure this is the issue, because the crash on the 2nd system happened with this module already loaded.
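
For reference, the loader.conf entry for this is typically the one documented in ice(4). The exact line used on these servers is not shown in this post, so treat the fragment below as an assumption about what was added:

```
# /boot/loader.conf
ice_ddp_load="YES"    # load the DDP package module for ice(4) at boot
```

After a reboot, `kldstat | grep ice_ddp` should list the module if it was loaded.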

Attached is the /var/log/messages extract from a boot after a crash that eventually crashed too (marked with ---CRASH---):

Captures of some ssh sessions that I also had open when the system crashed:
top:
xds1-crash-top.png


vmstat
xds1-crash-vmstat.png


It doesn't seem that the ARC took over the memory. And the processor was totally idle.

As PCI extension cards the system has the E810 network card, and an HBA.

Does anyone have any suggestions on where to look to continue the troubleshooting? Thanks!

Some more details on the hardware:
# pciconf -lv:
Code:
hostb0@pci0:0:0:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14d8 subvendor=0x1022 subdevice=0x14d8
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
none0@pci0:0:0:2:       class=0x080600 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14d9 subvendor=0x1022 subdevice=0x14d9
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = base peripheral
    subclass   = IOMMU
hostb1@pci0:0:1:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
pcib1@pci0:0:1:1:       class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
pcib2@pci0:0:1:2:       class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
pcib3@pci0:0:1:3:       class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
hostb2@pci0:0:2:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
pcib4@pci0:0:2:1:       class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
hostb3@pci0:0:3:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb4@pci0:0:4:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb5@pci0:0:8:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
pcib12@pci0:0:8:1:      class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14dd subvendor=0x1022 subdevice=0x14dd
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
pcib13@pci0:0:8:3:      class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14dd subvendor=0x1022 subdevice=0x14dd
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = PCI-PCI
none1@pci0:0:20:0:      class=0x0c0500 rev=0x71 hdr=0x00 vendor=0x1022 device=0x790b subvendor=0x1022 subdevice=0x790b
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'FCH SMBus Controller'
    class      = serial bus
    subclass   = SMBus
isab0@pci0:0:20:3:      class=0x060100 rev=0x51 hdr=0x00 vendor=0x1022 device=0x790e subvendor=0x1022 subdevice=0x790e
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'FCH LPC Bridge'
    class      = bridge
    subclass   = PCI-ISA
hostb6@pci0:0:24:0:     class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e0 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb7@pci0:0:24:1:     class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e1 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb8@pci0:0:24:2:     class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e2 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb9@pci0:0:24:3:     class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e3 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb10@pci0:0:24:4:    class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e4 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb11@pci0:0:24:5:    class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e5 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb12@pci0:0:24:6:    class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e6 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
hostb13@pci0:0:24:7:    class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e7 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = bridge
    subclass   = HOST-PCI
nvme0@pci0:1:0:0:       class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa80a subvendor=0x144d subdevice=0xaa09
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller PM9A1/PM9A3/980PRO'
    class      = mass storage
    subclass   = NVM
mpr0@pci0:2:0:0:        class=0x010700 rev=0x00 hdr=0x00 vendor=0x1000 device=0x00e6 subvendor=0x1000 subdevice=0x4060
    vendor     = 'Broadcom / LSI'
    device     = 'Fusion-MPT 12GSAS/PCIe Secure SAS38xx'
    class      = mass storage
    subclass   = SAS
nvme1@pci0:3:0:0:       class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa80a subvendor=0x144d subdevice=0xaa09
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller PM9A1/PM9A3/980PRO'
    class      = mass storage
    subclass   = NVM
pcib5@pci0:4:0:0:       class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f4 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Upstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib6@pci0:5:2:0:       class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Downstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib8@pci0:5:3:0:       class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Downstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib9@pci0:5:8:0:       class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Downstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib10@pci0:5:12:0:     class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Downstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib11@pci0:5:13:0:     class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset PCIe Switch Downstream Port'
    class      = bridge
    subclass   = PCI-PCI
pcib7@pci0:6:0:0:       class=0x060400 rev=0x06 hdr=0x01 vendor=0x1a03 device=0x1150 subvendor=0x1a03 subdevice=0x1150
    vendor     = 'ASPEED Technology, Inc.'
    device     = 'AST1150 PCI-to-PCI Bridge'
    class      = bridge
    subclass   = PCI-PCI
vgapci0@pci0:7:0:0:     class=0x030000 rev=0x52 hdr=0x00 vendor=0x1a03 device=0x2000 subvendor=0x1a03 subdevice=0x2000
    vendor     = 'ASPEED Technology, Inc.'
    device     = 'ASPEED Graphics Family'
    class      = display
    subclass   = VGA
ahci0@pci0:8:0:0:       class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060
    vendor     = 'ASMedia Technology Inc.'
    device     = 'ASM1061/ASM1062 Serial ATA Controller'
    class      = mass storage
    subclass   = SATA
ice0@pci0:9:0:0:        class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x159b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller E810-XXV for SFP'
    class      = network
    subclass   = ethernet
ice1@pci0:9:0:1:        class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x159b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller E810-XXV for SFP'
    class      = network
    subclass   = ethernet
xhci0@pci0:11:0:0:      class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43f7 subvendor=0x1b21 subdevice=0x1142
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset USB 3.2 Controller'
    class      = serial bus
    subclass   = USB
ahci1@pci0:12:0:0:      class=0x010601 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43f6 subvendor=0x1b21 subdevice=0x1062
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = '600 Series Chipset SATA Controller'
    class      = mass storage
    subclass   = SATA
none2@pci0:13:0:0:      class=0x130000 rev=0xc5 hdr=0x00 vendor=0x1022 device=0x14de subvendor=0x1002 subdevice=0x164e
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Phoenix PCIe Dummy Function'
    class      = non-essential instrumentation
none3@pci0:13:0:2:      class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1649 subvendor=0x1022 subdevice=0x1649
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Family 19h PSP/CCP'
    class      = encrypt/decrypt
xhci1@pci0:13:0:3:      class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b6 subvendor=0x1022 subdevice=0x15b6
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = serial bus
    subclass   = USB
xhci2@pci0:13:0:4:      class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b7 subvendor=0x1022 subdevice=0x15b6
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = serial bus
    subclass   = USB
xhci3@pci0:14:0:0:      class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b8 subvendor=0x1022 subdevice=0x15b6
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    class      = serial bus
    subclass   = USB
 

"By crashing I mean unresponsive to pings, login console frozen, no input possible from IPMI KVM."
That's an important distinction. So it doesn't panic(9) and automatically reboot; it appears to lock up? Any messages on the screen when that happens? You say no input is possible; does that include ctrl-alt-del? How about ctrl-alt-esc?

Might be a bit more involved, but there's a bunch of debugging you can turn on that might help pin down a deadlock:
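
As a rough sketch of what such debugging can look like on FreeBSD (not necessarily the exact settings meant here), a custom kernel configuration can enable lock-order checking and the deadlock resolver. The file name and ident `LOCKDEBUG` are placeholders. The options are standard kernel options, and DDB relies on the KDB framework, which GENERIC is assumed to enable already.

```
# /usr/src/sys/amd64/conf/LOCKDEBUG  (hypothetical file name)
include GENERIC
ident   LOCKDEBUG

options DDB                 # interactive in-kernel debugger
options INVARIANTS          # extra run-time consistency checks
options INVARIANT_SUPPORT   # required by INVARIANTS
options WITNESS             # lock-order verification
options WITNESS_SKIPSPIN    # skip spin mutexes to reduce overhead
options DEADLKRES           # panic if threads stay blocked for too long
```

It would be built and installed from /usr/src with `make buildkernel KERNCONF=LOCKDEBUG` followed by `make installkernel KERNCONF=LOCKDEBUG`. WITNESS adds noticeable overhead, so expect lower throughput on a storage server while it is enabled.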
 
Sorry for the delayed response. It happened again after 3 weeks of normal behavior.

No, there's no panic(9) output on the console, and it doesn't reboot. It sure seems like a deadlock. No messages on the screen apart from the typical login prompt (but totally frozen and unresponsive). Ctrl-Alt-Del gets no response; Ctrl-Alt-Esc was not properly tested because that combination is not available in the IPMI. The only solution is always a physical server reset.

I will try to dig into the manual's recommendations for deadlock debugging that you shared. Thanks!
 
Any hints in the BMC/IPMI health logs? E.g. ECC errors?
Given there isn't even a last cry for help on the console output, this smells like a hardware/firmware related issue...
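
A minimal sketch of how the BMC event log could be skimmed for such entries, assuming ipmitool (from ports/packages) is installed and can reach the BMC. The function name `filter_sel_events` and the keyword list are assumptions; adjust the keywords to whatever wording this BMC uses.

```shell
#!/bin/sh
# Filter IPMI System Event Log text (on stdin) for entries that commonly
# accompany hardware trouble. The keyword list is an assumption, not a
# complete or vendor-specific set.
filter_sel_events() {
    grep -i -E 'ecc|memory|correctable|temperature|thermal|power supply|voltage|watchdog|critical'
}
```

Typical use would be `ipmitool sel elist | filter_sel_events`. Current sensor readings can be listed with `ipmitool sdr type Temperature`.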
 
Might want to check the system's hardware stats too. I could easily imagine the system shutting down when, for example, the CPU overheats for whatever reason.
 
Unresponsive to pings? In my (limited) experience, that's the big distinction between "fully crashed, kernel not running at all" (which more often than not is caused by a hardware problem), and "kernel still running, interrupts being serviced, but user processes can't make progress" (which is typically a software bug).

Is it possible to view the console messages when the "crash" (or hang or ...) happens? Some cloud providers can do that.
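
One way to narrow down the moment userland stopped making progress is a small heartbeat written to disk. This is only a sketch: the function name `heartbeat_once` and the log path in the usage line are hypothetical.

```shell
#!/bin/sh
# Append one heartbeat line (UTC timestamp plus a free-form note) to a log
# file and flush it to disk, so the last line survives a hard reset.
heartbeat_once() {
    logfile=$1
    note=${2:-alive}
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$note" >> "$logfile"
    sync
}
```

Run in a loop, for example `while :; do heartbeat_once /var/log/heartbeat.log; sleep 10; done &`, the last timestamp after a reset shows when user processes stopped. Comparing it with ping logs kept on another machine helps tell a fully dead kernel from one that still answers interrupts.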
 
Sporadic/infrequent crashes that leave no signs behind scream hardware issue. It would be weird with two different devices at the same time, but if they are from the same lot, a bad batch of power supplies or RAM isn't out of the question.
 