I have two similar servers, the first deployed one month ago and the second deployed last week. They are AMD Epyc servers from the cloud provider OVH (they call them: ADVANCE-STOR | AMD EPYC 4344P), and are used as storage servers with ZFS and Samba. Both have 128 GB of RAM, 2x 960 GB NVMe (which I'm using as a ZFS boot mirror), and 8x 22 TB HDDs (4x2 striped mirrors for the actual shared storage). MSI motherboard, 2x Intel E810 25 Gbps interfaces each. The servers are in different datacenters 100 km apart.
Both servers crashed for the first time with the system fully booted and functional, after the reboot that followed applying updates on a fresh install (nothing installed apart from the base system at that point). By crashing I mean unresponsive to pings, login console frozen, no input possible from the IPMI KVM. The only way out is a server reset via the IPMI management. I tried to reproduce it on the 2nd server, but I was unable to (reinstalled, upgraded, no crash). The second server has since been reinstalled with another OS. The 1st server crashed again last week (after one month working without issues), twice on the same day (samba419-4.19.9_5 already installed).
I haven't seen anything obvious either in /var/log/messages or in the output of dmesg. The only content of /var/crash is a file called minfree containing the value 2048. During the last crash of the first server I had left a session open with tail -f /var/log/messages. It showed some Samba messages (irrelevant, I think) like these:
Code:
May 6 17:30:20 xxxx nmbd[1575]: query_name_response: Multiple (2) responses received for a query on subnet 192.168.110.21 for name WORKGROUP<1d>.
May 6 17:30:20 xxxx nmbd[1575]: This response was from IP 192.168.110.27, reporting an IP address of 192.168.110.27.
May 6 17:35:26 xxxx nmbd[1575]: [2025/05/06 17:35:26.892786, 0] ../../source3/nmbd/nmbd_namequery.c:109(query_name_response)
May 6 17:35:26 xxxx nmbd[1575]: query_name_response: Multiple (2) responses received for a query on subnet 192.168.110.21 for name WORKGROUP<1d>.
May 6 17:35:26 xxxx nmbd[1575]: This response was from IP 192.168.110.27, reporting an IP address of 192.168.110.27.
The only failure message I've found appears in /var/log/dmesg.today or /var/log/dmesg.yesterday. I've seen it for both network interfaces:
Code:
ice1: Malicious Driver Detection Rx event 'Descriptor fetch failed' on Rx queue 1024 PF# 1 VF# 0
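To see how often this message shows up and on which interface, the saved dmesg files can be summarized. The following is only a sketch: the function name `mdd_summary` is hypothetical, and it assumes every event line follows the exact format shown above.

```shell
# Hypothetical helper: count "Malicious Driver Detection" (MDD) events
# per interface, direction and event text in the given log files.
mdd_summary() {
    # -h drops file names so identical events from different files merge.
    grep -h 'Malicious Driver Detection' "$@" 2>/dev/null |
        # Reduce each line to: <interface> <direction> <event text>
        sed -E "s/.*(ice[0-9]+): Malicious Driver Detection ([A-Za-z]+) event '([^']*)'.*/\1 \2 \3/" |
        sort | uniq -c |
        # Move the count from the front to the end of the line.
        awk '{ n = $1; $1 = ""; sub(/^ /, ""); print $0 ": " n }'
}

# Example usage on the files mentioned above:
# mdd_summary /var/log/dmesg.today /var/log/dmesg.yesterday
```

If the events cluster on one interface or shortly before a freeze, that would be worth including in a bug report.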
After seeing this I looked more closely at the network-related boot messages and found messages stating that the ice_ddp module couldn't be loaded. I added it to loader.conf, and now it is loaded correctly. I'm not sure this is the issue, because the crash on the 2nd system happened with this module already loaded.
Attached is the /var/log/messages extract from a boot after a crash that eventually crashed too (marked with ---CRASH---):
Captures of some SSH sessions that I also had open when the system crashed:
top:
vmstat:
It doesn't seem that the ARC took over the memory, and the processor was totally idle.
As PCI expansion cards, the system has the E810 network card and an HBA.
Does anyone have any suggestions on where to look next to continue the troubleshooting? Thanks!
Some more details on the hardware:
# pciconf -lv:
Code:
hostb0@pci0:0:0:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14d8 subvendor=0x1022 subdevice=0x14d8
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
none0@pci0:0:0:2: class=0x080600 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14d9 subvendor=0x1022 subdevice=0x14d9
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = base peripheral
subclass = IOMMU
hostb1@pci0:0:1:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
pcib1@pci0:0:1:1: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
pcib2@pci0:0:1:2: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
pcib3@pci0:0:1:3: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
hostb2@pci0:0:2:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
pcib4@pci0:0:2:1: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14db subvendor=0x1022 subdevice=0x1453
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
hostb3@pci0:0:3:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb4@pci0:0:4:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb5@pci0:0:8:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14da subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
pcib12@pci0:0:8:1: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14dd subvendor=0x1022 subdevice=0x14dd
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
pcib13@pci0:0:8:3: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x14dd subvendor=0x1022 subdevice=0x14dd
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = PCI-PCI
none1@pci0:0:20:0: class=0x0c0500 rev=0x71 hdr=0x00 vendor=0x1022 device=0x790b subvendor=0x1022 subdevice=0x790b
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = 'FCH SMBus Controller'
class = serial bus
subclass = SMBus
isab0@pci0:0:20:3: class=0x060100 rev=0x51 hdr=0x00 vendor=0x1022 device=0x790e subvendor=0x1022 subdevice=0x790e
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = 'FCH LPC Bridge'
class = bridge
subclass = PCI-ISA
hostb6@pci0:0:24:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e0 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb7@pci0:0:24:1: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e1 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb8@pci0:0:24:2: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e2 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb9@pci0:0:24:3: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e3 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb10@pci0:0:24:4: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e4 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb11@pci0:0:24:5: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e5 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb12@pci0:0:24:6: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e6 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
hostb13@pci0:0:24:7: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x14e7 subvendor=0x0000 subdevice=0x0000
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = bridge
subclass = HOST-PCI
nvme0@pci0:1:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa80a subvendor=0x144d subdevice=0xaa09
vendor = 'Samsung Electronics Co Ltd'
device = 'NVMe SSD Controller PM9A1/PM9A3/980PRO'
class = mass storage
subclass = NVM
mpr0@pci0:2:0:0: class=0x010700 rev=0x00 hdr=0x00 vendor=0x1000 device=0x00e6 subvendor=0x1000 subdevice=0x4060
vendor = 'Broadcom / LSI'
device = 'Fusion-MPT 12GSAS/PCIe Secure SAS38xx'
class = mass storage
subclass = SAS
nvme1@pci0:3:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa80a subvendor=0x144d subdevice=0xaa09
vendor = 'Samsung Electronics Co Ltd'
device = 'NVMe SSD Controller PM9A1/PM9A3/980PRO'
class = mass storage
subclass = NVM
pcib5@pci0:4:0:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f4 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Upstream Port'
class = bridge
subclass = PCI-PCI
pcib6@pci0:5:2:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Downstream Port'
class = bridge
subclass = PCI-PCI
pcib8@pci0:5:3:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Downstream Port'
class = bridge
subclass = PCI-PCI
pcib9@pci0:5:8:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Downstream Port'
class = bridge
subclass = PCI-PCI
pcib10@pci0:5:12:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Downstream Port'
class = bridge
subclass = PCI-PCI
pcib11@pci0:5:13:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43f5 subvendor=0x1b21 subdevice=0x3328
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset PCIe Switch Downstream Port'
class = bridge
subclass = PCI-PCI
pcib7@pci0:6:0:0: class=0x060400 rev=0x06 hdr=0x01 vendor=0x1a03 device=0x1150 subvendor=0x1a03 subdevice=0x1150
vendor = 'ASPEED Technology, Inc.'
device = 'AST1150 PCI-to-PCI Bridge'
class = bridge
subclass = PCI-PCI
vgapci0@pci0:7:0:0: class=0x030000 rev=0x52 hdr=0x00 vendor=0x1a03 device=0x2000 subvendor=0x1a03 subdevice=0x2000
vendor = 'ASPEED Technology, Inc.'
device = 'ASPEED Graphics Family'
class = display
subclass = VGA
ahci0@pci0:8:0:0: class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1b21 subdevice=0x1060
vendor = 'ASMedia Technology Inc.'
device = 'ASM1061/ASM1062 Serial ATA Controller'
class = mass storage
subclass = SATA
ice0@pci0:9:0:0: class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x159b subvendor=0x8086 subdevice=0x0000
vendor = 'Intel Corporation'
device = 'Ethernet Controller E810-XXV for SFP'
class = network
subclass = ethernet
ice1@pci0:9:0:1: class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x159b subvendor=0x8086 subdevice=0x0000
vendor = 'Intel Corporation'
device = 'Ethernet Controller E810-XXV for SFP'
class = network
subclass = ethernet
xhci0@pci0:11:0:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43f7 subvendor=0x1b21 subdevice=0x1142
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset USB 3.2 Controller'
class = serial bus
subclass = USB
ahci1@pci0:12:0:0: class=0x010601 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43f6 subvendor=0x1b21 subdevice=0x1062
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = '600 Series Chipset SATA Controller'
class = mass storage
subclass = SATA
none2@pci0:13:0:0: class=0x130000 rev=0xc5 hdr=0x00 vendor=0x1022 device=0x14de subvendor=0x1002 subdevice=0x164e
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = 'Phoenix PCIe Dummy Function'
class = non-essential instrumentation
none3@pci0:13:0:2: class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1649 subvendor=0x1022 subdevice=0x1649
vendor = 'Advanced Micro Devices, Inc. [AMD]'
device = 'Family 19h PSP/CCP'
class = encrypt/decrypt
xhci1@pci0:13:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b6 subvendor=0x1022 subdevice=0x15b6
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = serial bus
subclass = USB
xhci2@pci0:13:0:4: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b7 subvendor=0x1022 subdevice=0x15b6
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = serial bus
subclass = USB
xhci3@pci0:14:0:0: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15b8 subvendor=0x1022 subdevice=0x15b6
vendor = 'Advanced Micro Devices, Inc. [AMD]'
class = serial bus
subclass = USB
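To pick the network and storage controllers out of a long listing like the one above, a small filter can help. This is a sketch only: the function name `pci_by_class` is hypothetical, and it assumes `pciconf -lv` style input on stdin, with or without the usual indentation.

```shell
# Hypothetical helper: print "<name>: <device string>" for PCI devices whose
# "class = ..." line matches the given pattern (default: network, mass storage).
pci_by_class() {
    awk -v want="${1:-network|mass storage}" '
        # Header line, e.g. "ice0@pci0:9:0:0: class=0x020000 ..."
        /^[^[:space:]]+@pci/ { split($1, a, "@"); name = a[1]; device = ""; next }
        # Remember the human-readable device string, if present.
        /^[[:space:]]*device[[:space:]]*=/ { sub(/^[^=]*= */, ""); device = $0; next }
        # Compare the textual class against the wanted classes.
        /^[[:space:]]*class[[:space:]]*=/ {
            sub(/^[^=]*= */, ""); sub(/[[:space:]]+$/, "")
            if ($0 ~ ("^(" want ")$")) print name ": " device
        }
    '
}

# Example usage:
# pciconf -lv | pci_by_class
# pciconf -lv | pci_by_class network
```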