How to investigate hardware/software bug?

Hi to everyone!

Preamble: I have Hewlett-Packard ProLiant DL380G4 server with 7.1R installed on it. Here is the list of hardware installed on the server (sorry for long list):

Code:
Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.1-RELEASE #2: Sat Nov 21 14:07:02 EET 2009
    root@planetmk.eeu.mkcorp.com:/usr/src/sys/i386/compile/ULTRA2
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.40GHz (3400.14-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf41  Stepping = 1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x649d<SSE3,DTES64,MON,DS_CPL,EST,CNXT-ID,CX16,xTPR>
  AMD Features=0x20000000<LM>
  Logical CPUs per core: 2
real memory  = 3221172224 (3071 MB)
avail memory = 3150790656 (3004 MB)
ACPI APIC Table: <HP     00000083>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
ioapic2 <Version 2.0> irqs 48-71 on motherboard
ioapic3 <Version 2.0> irqs 72-95 on motherboard
ioapic4 <Version 2.0> irqs 96-119 on motherboard
kbd1 at kbdmux0
acpi0: <HP P51> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x908-0x90b on acpi0
pcib0: <ACPI Host-PCI bridge> on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci2: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib2
bge0: <HP NC7782 Gigabit Server Adapter, ASIC rev. 0x2100> mem 0xfdef0000-0xfdefffff irq 25 at device 1.0 on pci3
miibus0: <MII bus> on bge0
brgphy0: <BCM5704 10/100/1000baseTX PHY> PHY 1 on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: Ethernet address: 00:13:21:6b:c3:e7
bge0: [ITHREAD]
bge1: <HP NC7782 Gigabit Server Adapter, ASIC rev. 0x2100> mem 0xfdee0000-0xfdeeffff irq 26 at device 1.1 on pci3
miibus1: <MII bus> on bge1
brgphy1: <BCM5704 10/100/1000baseTX PHY> PHY 1 on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1: Ethernet address: 00:13:21:6b:c3:e6
bge1: [ITHREAD]
pcib3: <ACPI PCI-PCI bridge> at device 0.2 on pci2
pci4: <ACPI PCI bus> on pcib3
ciss0: <HP Smart Array 6i> port 0x4000-0x40ff mem 0xfdff0000-0xfdff1fff,0xfdf80000-0xfdfbffff irq 51 at device 3.0 on pci4
ciss0: [ITHREAD]
pcib4: <ACPI PCI-PCI bridge> at device 6.0 on pci0
pci5: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> at device 0.0 on pci5
pci6: <ACPI PCI bus> on pcib5
pcib6: <ACPI PCI-PCI bridge> at device 0.2 on pci5
pci10: <ACPI PCI bus> on pcib6
uhci0: <Intel 82801EB (ICH5) USB controller USB-A> port 0x2000-0x201f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: <Intel 82801EB (ICH5) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801EB (ICH5) USB controller USB-B> port 0x2020-0x203f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
uhci1: [ITHREAD]
usb1: <Intel 82801EB (ICH5) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0x2040-0x205f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
uhci2: [ITHREAD]
usb2: <Intel 82801EB (ICH5) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb2
uhub2: 2 ports with 2 removable, self powered
uhci3: <Intel 82801EB (ICH5) USB controller USB-D> port 0x2060-0x207f irq 16 at device 29.3 on pci0
uhci3: [GIANT-LOCKED]
uhci3: [ITHREAD]
usb3: <Intel 82801EB (ICH5) USB controller USB-D> on uhci3
usb3: USB revision 1.0
uhub3: <Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb3
uhub3: 2 ports with 2 removable, self powered
ehci0: <Intel 82801EB/R (ICH5) USB 2.0 controller> mem 0xfbef0000-0xfbef03ff irq 23 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
ehci0: [ITHREAD]
usb4: EHCI version 1.0
usb4: companion controllers, 2 ports each: usb0 usb1 usb2 usb3
usb4: <Intel 82801EB/R (ICH5) USB 2.0 controller> on ehci0
usb4: USB revision 2.0
uhub4: <Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1> on usb4
uhub4: 8 ports with 8 removable, self powered
pcib7: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci1: <ACPI PCI bus> on pcib7
vgapci0: <VGA-compatible display> port 0x3000-0x30ff mem 0xfc000000-0xfcffffff,0xfbff0000-0xfbff0fff at device 3.0 on pci1
pci1: <base peripheral> at device 4.0 (no driver attached)
pci1: <base peripheral> at device 4.2 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x500-0x50f at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
acpi_tz0: <Thermal Zone> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
psm0: [ITHREAD]
psm0: model Generic PS/2 mouse, device ID 0
sio0: <Standard PC COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
cpu0: <ACPI CPU> on acpi0
acpi_perf0: <ACPI CPU Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
cpu1: <ACPI CPU> on acpi0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 112d0000112d
device_attach: est1 attach returned 6
p4tcc1: <CPU Frequency Thermal Control> on cpu1
cpu2: <ACPI CPU> on acpi0
est2: <Enhanced SpeedStep Frequency Control> on cpu2
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 112d0000112d
device_attach: est2 attach returned 6
p4tcc2: <CPU Frequency Thermal Control> on cpu2
cpu3: <ACPI CPU> on acpi0
est3: <Enhanced SpeedStep Frequency Control> on cpu3
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 112d0000112d
device_attach: est3 attach returned 6
p4tcc3: <CPU Frequency Thermal Control> on cpu3
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcbfff,0xee000-0xeffff pnpid ORM0000 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
sio1: [FILTER]
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ipfw2 initialized, divert enabled, nat loadable, rule-based forwarding enabled, default to accept, logging unlimited
acd0: DVDROM <DV-28E-N/C.6B> at ata0-master UDMA33
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
da0 at ciss0 bus 0 target 0 lun 0
da0: <COMPAQ RAID 5  VOLUME OK> Fixed Direct Access SCSI-0 device
da0: 135.168MB/s transfers
da0: 560039MB (1146960784 512 byte sectors: 255H 63S/T 65535C)

The problem description: It was noticed that the server stops responding when it receives a lot of data via network adapter and stores it into the MySQL database. In two words trhis behavior could be descibed as "When server receives a lot of data constantly or periodically it stops responding every two or three days". No network respond. Black screen. No mouse or keyboard respond. No messages in /var/log/messages. We tried to investigate this issue but after reboot no core dumps were found. All "green" functions are disabled. We have been tried to also disable hpasmcli (HP ProLiant Advanced System Management CLI) but with no effect. The server still stops responding every two or three days.

Do you have any idea on how to investigate this issue more deeply?

Thank you in advance.
 
If you can try out another NIC (not Broadcom (bge)). I had the same problem with bge onboard NICs. After switching to a Intel Gigabit NIC the freezes disappeared. Sorry, i cant say more about that because i had no time to investigate this problem on the server.
 
Back
Top