PXE boot hangs during kernel or module load

spork · Sep 19, 2013

We've had a netboot setup in our co-location for some time now. It's not used often, as it's mainly intended for new installs or rescue purposes. Last time I had to netboot something, I had no problems.

Tonight I'm seeing a box hang while loading the kernel or the modules (we use mfsBSD, so the ZFS, OpenSolaris, geom_uzip, and zlib kernel modules get loaded):

Code:

Intel(R) Boot Agent GE v1.2.28
Copyright (C) 1997-2005, Intel Corporation

CLIENT MAC ADDR: 00 E0 81 D0 15 85  GUID: 00000000 0000 0000 0000 000000000000
CLIENT IP: 10.0
Building the boot ler and the BTX
Star/boot/kernel/kernel text=0x63e133 data=0xc27a8+0xa3048 syms=[0x8+0xa8d68+0x8+0x9b5a0]
/boot/kernel/zfs.ko size 0x19eb18 at 0xae8000
loading required module 'opensolaris'
/boot/kernel/opensolaris.ko size 0x3868 at 0xc87000
/boot/kernel/geom_uzip.ko size 0x31d8 at 0xc8b000
loading required module 'zlib'
/boot/kernel/zlib.ko size 0xdc40 at 0xc8f000
/

Note that the serial console output is a bit garbled. I've come to expect this with most serial console redirection implementations, and I see similar junk on working setups.

The only way to recover here is to power cycle or reset. Even with a keyboard available locally the box appears locked up.

Another datapoint: we have older FreeBSD netboot NFS trees exported as well. The above is trying to boot 8.1. If I try to boot an 8.3 kernel, it doesn't even finish loading the kernel over NFS. The 8.1 kernel is a few megabytes smaller, which really makes me wonder if I'm exhausting some memory resource here.
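A rough way to sanity-check the memory theory is to see how high the loader's module addresses reach. The sketch below is only an illustration: it assumes the "size ... at ..." line format seen in the loader output above, the helper name `highest_end_address` is made up for this example, and it ignores the kernel line itself and anything the loader places in memory without printing a size.

```python
import re

# Matches loader lines such as "/boot/kernel/zfs.ko size 0x19eb18 at 0xae8000".
MODULE_LINE = re.compile(r"^(\S+) size (0x[0-9a-fA-F]+) at (0x[0-9a-fA-F]+)")


def highest_end_address(lines):
    """Return the largest (load address + size) among module lines, or None."""
    ends = []
    for line in lines:
        m = MODULE_LINE.match(line.strip())
        if m:
            size = int(m.group(2), 16)
            base = int(m.group(3), 16)
            ends.append(base + size)
    return max(ends) if ends else None


# Module lines copied from the 8.1 loader output in this post.
loader_output = [
    "/boot/kernel/zfs.ko size 0x19eb18 at 0xae8000",
    "loading required module 'opensolaris'",
    "/boot/kernel/opensolaris.ko size 0x3868 at 0xc87000",
    "/boot/kernel/geom_uzip.ko size 0x31d8 at 0xc8b000",
    "loading required module 'zlib'",
    "/boot/kernel/zlib.ko size 0xdc40 at 0xc8f000",
]

end = highest_end_address(loader_output)
print(f"highest module end: {end:#x} ({end / 2**20:.1f} MiB)")
```

Running the same thing against the loader output of a larger kernel would show whether the extra few megabytes push the load addresses noticeably higher.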

The DHCP configuration is pretty simple, and the root path contains an mfsBSD mfsroot:

Code:

host h21.i.xxx.com {
        hardware ethernet 00:e0:81:d0:15:85;
        fixed-address 10.99.88.121;
        next-server 10.99.88.111;
        filename "/freebsd83-64/boot/pxeboot";
        option root-path "10.99.88.111:/tank1/exports/netboot/freebsd83-64";
}

If I run tcpdump during the boot, I simply see the traffic stop. I believe the checksum errors shown here are just the result of the network card doing TX and RX checksum offloading. h11 is the NFS/TFTP/DHCP server, h21 is the host trying to netboot.

Code:

   h21.i.xxx.com.4031 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.347065 IP (tos 0x0, ttl 64, id 4853, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dae)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4031: reply ok 1152 read REG 555 ids 0/0 [|nfs]
02:52:25.349063 IP (tos 0x0, ttl 20, id 4204, offset 0, flags [none], proto UDP (17), length 132)
    h21.i.xxx.com.4032 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.349113 IP (tos 0x0, ttl 64, id 4854, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dad)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4032: reply ok 1152 read REG 555 ids 0/0 [|nfs]
02:52:25.351111 IP (tos 0x0, ttl 20, id 4205, offset 0, flags [none], proto UDP (17), length 132)
    h21.i.xxx.com.4033 > h11.i.xxx.com.nfs: 104 read [|nfs]
02:52:25.351162 IP (tos 0x0, ttl 64, id 4855, offset 0, flags [none], proto UDP (17), length 1180, bad cksum 0 (->9dac)!)
    h11.i.xxx.com.nfs > h21.i.xxx.com.4033: reply ok 1152 read REG 555 ids 0/0 [|nfs]
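To pin down exactly when the traffic stops, the capture can be scanned for long pauses between packets. This is a minimal sketch, assuming tcpdump's default `HH:MM:SS.micro` timestamps as in the output above; `find_gaps` and the one-second threshold are illustrative choices, not anything from the actual setup.

```python
import re
from datetime import datetime

# Packet header lines start with a timestamp like "02:52:25.347065 IP ...".
# Indented continuation lines do not match and are skipped.
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP")


def find_gaps(lines, threshold_s=1.0):
    """Return (previous, current, seconds) for packet pairs further apart than threshold_s."""
    gaps = []
    prev = None
    for line in lines:
        m = TIMESTAMP.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta < 0:
                # No date in the timestamp, so assume a single midnight rollover.
                delta += 86400
            if delta > threshold_s:
                gaps.append((prev.time(), ts.time(), delta))
        prev = ts
    return gaps


if __name__ == "__main__":
    import sys

    # Example: tcpdump -v -r capture.pcap | python find_gaps.py
    for before, after, seconds in find_gaps(sys.stdin):
        print(f"{seconds:10.3f}s pause between {before} and {after}")
```

In the excerpt above the request/reply pairs are about 2 ms apart, so any pause of a second or more would stand out immediately.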

I have to dig around a bit to try another client since there's nothing there that I can just randomly pull out of service to test.

spork · Sep 22, 2013

Any ideas? Hardware? Software?

Interesting development: when I looked at the serial console the next day, it had finished booting. After being stalled it came back to life and booted over the course of 4+ hours. The console server adds a timestamp every hour so you can see how slow the progress is here:

Code:

(sometime prior to 3 a.m.)
Initializing Intel® Boot Agent
All Rights Reserved
/boot/kernel/kernel data=0x988bdc 

[-- MARK -- Thu Sep 19 03:00:00 2013]
[-- MARK -- Thu Sep 19 04:00:00 2013]

data=0x136e38+0xc43f0 syms=[0x8+0xf0660+0x8+0xe28ec]
/boot/kernel/zfs.ko size 0x2072f8 at 0xf57000
loading required module 'opensolaris'
/boot/kernel/opensolaris.ko 

[-- MARK -- Thu Sep 19 05:00:00 2013]
[-- MARK -- Thu Sep 19 06:00:00 2013]

size 0x4a08 at 0x115f000
/boot/kernel/geom_uzip.ko size 0x3190 at 0x1164000
loading required module 'zlib'
/boot/kernel/zlib.ko size 0xdc80 at 0x1168000

[-- MARK -- Thu Sep 19 07:00:00 2013]

/boot/kernel/tmpfs.ko size 0xb748 at 0x33a8000

(assume the following is the beastie boot menu - snipped all the ESC codes and such)

|    |    |_|   |_|  \___|\___|
     ____   _____ _____    |  _ \ / ____| __ \    | |_) | (___ | |  | |
   |  _ < \___ \| |  |_) |____) | |__| |

Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
       The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.3-RELEASE #0: Mon Apr  9 21:23:18 UTC 2012
   root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Timecounter "i8254" frequency 1193182 Hz quality
[…]
Starting sshd.
Starting background file system checks in 60 seconds.

Thu Sep 19 11:14:58 UTC 2013

FreeBSD/amd64 (blinstall) (ttyu0)

login: 

[-- MARK -- Thu Sep 19 08:00:00 2013]

Now that it's booted, I am able to verify that em1 is running clean. ifconfig shows it at 1000/FD and netstat reports no errors on the interface.

spork · Nov 28, 2013

The problem still exists and it's making it hard to clear this box for production since PXE is our rescue path if something goes awry.

Should I take this to the lists? If so, what's appropriate? freebsd-net? I'm a bit lost as to where to ask since the problem seems like it's limited to the boot loader.
