PXE boot hangs during kernel or module load

We've had a netboot setup in our co-location for some time now. It's not used often, as it's mainly intended for new installs or rescue purposes. Last time I had to netboot something, I had no problems.

Tonight I'm seeing a box hang during the process of loading the kernel or the modules (we use mfsBSD, so ZFS, OpenSolaris, geom_uzip, and zlib kernel modules get loaded):

Code:
Intel(R) Boot Agent GE v1.2.28
Copyright (C) 1997-2005, Intel Corporation

CLIENT MAC ADDR: 00 E0 81 D0 15 85  GUID: 00000000 0000 0000 0000 000000000000
CLIENT IP: 10.0
Building the boot ler and the BTX
Star/boot/kernel/kernel text=0x63e133 data=0xc27a8+0xa3048 syms=[0x8+0xa8d68+0x8+0x9b5a0]
/boot/kernel/zfs.ko size 0x19eb18 at 0xae8000
loading required module 'opensolaris'
/boot/kernel/opensolaris.ko size 0x3868 at 0xc87000
/boot/kernel/geom_uzip.ko size 0x31d8 at 0xc8b000
loading required module 'zlib'
/boot/kernel/zlib.ko size 0xdc40 at 0xc8f000
/

Note that the serial console output is a bit garbled. I've come to expect this with most serial console redirection implementations, and I see similar junk on working setups.

The only way to recover here is to power cycle or reset. Even with a keyboard available locally the box appears locked up.

Another datapoint: we have older FreeBSD netboot NFS trees exported as well. The above is trying to boot 8.1. If I try to boot an 8.3 kernel, it doesn't even finish loading the kernel over NFS. The 8.1 kernel is a few megabytes smaller, which really makes me wonder if I'm exhausting some memory resource here.

The DHCP configuration is pretty simple, and the root path contains an mfsBSD mfsroot:

Code:
host h21.i.xxx.com {
        hardware ethernet 00:e0:81:d0:15:85;
        fixed-address 10.99.88.121;
        next-server 10.99.88.111;
        filename "/freebsd83-64/boot/pxeboot";
        option root-path "10.99.88.111:/tank1/exports/netboot/freebsd83-64";
}

If I run tcpdump during the boot, I simply see the traffic stop. I believe the checksum errors shown here are just the result of the network card doing TX and RX checksum offloading. h11 is the NFS/TFTP/DHCP server, and h21 is the host trying to netboot.

Code:
   h21.i.xxx.com.4031 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.347065 IP (tos 0x0, ttl 64, id 4853, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dae)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4031: reply ok 1152 read REG 555 ids 0/0 [|nfs]
02:52:25.349063 IP (tos 0x0, ttl 20, id 4204, offset 0, flags [none], proto UDP (17), length 132)
    h21.i.xxx.com.4032 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.349113 IP (tos 0x0, ttl 64, id 4854, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dad)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4032: reply ok 1152 read REG 555 ids 0/0 [|nfs]
02:52:25.351111 IP (tos 0x0, ttl 20, id 4205, offset 0, flags [none], proto UDP (17), length 132)
    h21.i.xxx.com.4033 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.351162 IP (tos 0x0, ttl 64, id 4855, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dac)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4033: reply ok 1152 read REG 555 ids 0/0 [|nfs]
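
For a rough sense of how fast the loader was pulling data while traffic was still flowing, the three reply timestamps in the capture above can be turned into a rate. This is only a sketch: the function names are made up for illustration, and it assumes the 1152 in "reply ok 1152" is the size of the whole RPC reply, so the file data inside each reply is somewhat less.

```python
from datetime import datetime

# Reply timestamps copied from the tcpdump excerpt above.
reply_times = ["02:52:25.347065", "02:52:25.349113", "02:52:25.351162"]

# Assumption: "reply ok 1152" is the RPC reply size, headers included,
# so this overstates the file data actually delivered per request.
RPC_REPLY_BYTES = 1152


def mean_gap_seconds(stamps):
    """Average spacing between consecutive timestamps, in seconds."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


def bytes_per_second(stamps, reply_bytes=RPC_REPLY_BYTES):
    """Upper-bound transfer rate implied by one reply per gap."""
    return reply_bytes / mean_gap_seconds(stamps)


if __name__ == "__main__":
    gap_ms = mean_gap_seconds(reply_times) * 1000
    rate_kib = bytes_per_second(reply_times) / 1024
    print(f"mean gap {gap_ms:.3f} ms, at most {rate_kib:.0f} KiB/s")
```

The replies are about 2 ms apart, one request in flight at a time. Three packets is a tiny sample, so this only says the link looked healthy right up until the traffic stopped.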

I have to dig around a bit to try another client since there's nothing there that I can just randomly pull out of service to test.
 
Any ideas? Hardware? Software?

Interesting development: when I looked at the serial console the next day, it had finished booting. After being stalled it came back to life and booted over the course of 4+ hours. The console server adds a timestamp every hour so you can see how slow the progress is here:

Code:
(sometime prior to 3 a.m.)
Initializing Intel® Boot Agent
All Rights Reserved
/boot/kernel/kernel data=0x988bdc 

[-- MARK -- Thu Sep 19 03:00:00 2013]
[-- MARK -- Thu Sep 19 04:00:00 2013]

data=0x136e38+0xc43f0 syms=[0x8+0xf0660+0x8+0xe28ec]
/boot/kernel/zfs.ko size 0x2072f8 at 0xf57000
loading required module 'opensolaris'
/boot/kernel/opensolaris.ko 

[-- MARK -- Thu Sep 19 05:00:00 2013]
[-- MARK -- Thu Sep 19 06:00:00 2013]

size 0x4a08 at 0x115f000
/boot/kernel/geom_uzip.ko size 0x3190 at 0x1164000
loading required module 'zlib'
/boot/kernel/zlib.ko size 0xdc80 at 0x1168000

[-- MARK -- Thu Sep 19 07:00:00 2013]

/boot/kernel/tmpfs.ko size 0xb748 at 0x33a8000

(assume the following is the beastie boot menu - snipped all the ESC codes and such)

|    |    |_|   |_|  \___|\___|
     ____   _____ _____    |  _ \ / ____| __ \    | |_) | (___ | |  | |
   |  _ < \___ \| |  |_) |____) | |__| |

Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
       The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.3-RELEASE #0: Mon Apr  9 21:23:18 UTC 2012
   root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Timecounter "i8254" frequency 1193182 Hz quality
[…]
Starting sshd.
Starting background file system checks in 60 seconds.

Thu Sep 19 11:14:58 UTC 2013

FreeBSD/amd64 (blinstall) (ttyu0)

login: 

[-- MARK -- Thu Sep 19 08:00:00 2013]

Now that it's booted, I am able to verify that em1 is running clean. ifconfig shows it at 1000/FD and netstat reports no errors on the interface.
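
With both loader logs now in the thread, my earlier remark that the 8.1 kernel is a few megabytes smaller can be checked from the hex sizes the loader prints. A minimal sketch, under two assumptions: the 8.3 line is stitched together across the hourly MARK lines, and its first field (shown as "data=" on the garbled console) is really the text segment. The helper names are hypothetical.

```python
import re

# Loader lines copied from the two console logs in this post.
# Assumption: the first 8.3 field, printed as "data=" on the garbled
# console, corresponds to the text segment of the kernel.
LOADER_81 = "text=0x63e133 data=0xc27a8+0xa3048 syms=[0x8+0xa8d68+0x8+0x9b5a0]"
LOADER_83 = "data=0x988bdc data=0x136e38+0xc43f0 syms=[0x8+0xf0660+0x8+0xe28ec]"


def hex_sizes(line):
    """Return every 0x... value on a loader line as an integer."""
    return [int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]+", line)]


def mib(n):
    """Convert a byte count to MiB."""
    return n / (1024 * 1024)


if __name__ == "__main__":
    for name, line in (("8.1", LOADER_81), ("8.3", LOADER_83)):
        sizes = hex_sizes(line)
        print(f"{name}: first segment {mib(sizes[0]):.2f} MiB, "
              f"all fields {mib(sum(sizes)):.2f} MiB")
    delta = hex_sizes(LOADER_83)[0] - hex_sizes(LOADER_81)[0]
    print(f"first-segment difference: {mib(delta):.2f} MiB")
```

The "all fields" total is only a rough footprint, since not every number on the line is necessarily data read over NFS. The first-segment difference alone is a bit over 3 MiB, which matches what I remembered.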
 
The problem still exists and it's making it hard to clear this box for production since PXE is our rescue path if something goes awry.

Should I take this to the lists? If so, what's appropriate? freebsd-net? I'm a bit lost as to where to ask since the problem seems like it's limited to the boot loader.
 