FreeBSD 9.0 fails to boot after memory upgrade.

I ran into a little problem when trying to upgrade the installed RAM in my system, and eventually found a fix which I thought I'd share, especially since the symptoms were initially deceiving (looked like bad RAM).

You should know that I am relatively new to FreeBSD; having played with it a bit about 9 years ago, I am only just now returning to the fold. I can see a lot has changed in my absence. I have a fair amount of other Unix and Linux experience, however, and I find many things about FreeBSD refreshing. So far, putting together a small server/workstation has been a fairly straightforward proposition, and has presented me with few problems which could not be solved by referring to the extensive documentation, or just applying some generic *nix troubleshooting steps. However, at least one thing has really caught me off guard; here is an account of my most unexpected issue to date:

I have been running FreeBSD 9.0 since shortly after the official release last month, on a quad core Intel Xeon (Sandy Bridge), Asus P8B WS motherboard, with 8GB (2 x 4GB UDIMMs) of Kingston PC3 10600 ECC RAM. It has been running fine with a custom kernel, but note that my customizations have been quite mild--mainly removing support for hardware I don't, and never will, use (e.g. the floppy disk drive, hordes of Ethernet cards, etc). As an aside, I have rebuilt kernel and world using Clang/LLVM, and to date everything has been running smoothly.

One of the uses I am putting this machine to is as an NFS server for ZFS filesystems, with dedup enabled. ZFS can be very memory hungry, and dedup in particular, so I decided to install an additional 8GB of RAM, for 16GB total. Yesterday I put another 2 DIMMs, identical to the 2 Kingston sticks I already had, into the machine. Much to my chagrin, shortly after I powered the machine back on, a kernel trap forced an immediate reboot... and again... and again. The fault consistently happened late in the boot process (almost immediately before the login prompt would normally appear), at apparently the same point every time.

My first suspicion was naturally that the RAM I just installed was bad, so I took the new sticks out, and tried to start up again. Of course, the machine booted normally. So, I swapped the "old" sticks for the new ones (identical in model, but the "old" ones are at least known to be good). With just the new RAM in the machine, it once again booted normally. To make a long story short, I tried various combinations of the old and new RAM, in different banks, etc, and no matter which combination I tried, FreeBSD would boot with 8GB installed, but never with all 16GB. Bummer.

I ran at least one pass of memtest86, just to be sure, and with all 16GB installed, memtest86 passed with no errors. I also discovered that I could boot into single user mode with all the RAM installed, which was a relief (not least because it meant I could more easily mess with things to find out what the problem was).

At that point, I began to suspect that the trouble was with a kernel module--one which apparently does not load until relatively late in the startup order--which did not agree with the change in memory configuration. However, I had difficulty ascertaining exactly what was causing the crashes, because I couldn't find anything specific in the logs I reviewed; perhaps the system was going down before it had time to log anything relating to the misbehaving module itself. But surely the maximum allowable RAM would not be hard-coded into any part of the kernel, or kernel modules, would it? If there was a RAM expectation, I decided, it could only be a result of compile-time optimizations. So, just for the heck of it, I decided to rebuild my kernel from single user mode (my understanding is that using make buildkernel is also supposed to pull in any kernel modules and rebuild them as well).
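For anyone wanting to repeat the rebuild, the usual sequence from single user mode looks roughly like the sketch below. It is not a record of the exact commands used here. The kernel config name CUSTOM comes from this thread; the run() wrapper and the DRY_RUN variable are illustrative helpers (not part of FreeBSD), and the mount step assumes /usr/src lives on UFS.

```shell
#!/bin/sh
# Sketch: rebuild and install a kernel from single user mode.
# run() and DRY_RUN are hypothetical helpers that let the sequence be
# previewed without executing anything.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_kernel() {
    kernconf="${1:-CUSTOM}"
    # Check filesystems, then remount root read-write.
    run fsck -p &&
    run mount -u / &&
    # Assumption: /usr/src is on UFS. ZFS datasets would need
    # "zfs mount -a" here instead.
    run mount -a -t ufs &&
    run swapon -a &&
    run cd /usr/src &&
    # buildkernel also rebuilds the modules for this config.
    run make buildkernel KERNCONF="$kernconf" &&
    run make installkernel KERNCONF="$kernconf"
}
```

Calling `DRY_RUN=1 rebuild_kernel CUSTOM` only prints the commands; without DRY_RUN they are executed, after which a reboot picks up the new kernel.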

It worked. After building and installing the new kernel, with the 16GB RAM installed the whole time, I rebooted and the system came up normally. I checked the memory stats once it was up, and all 16GB is clearly recognized.
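One way to check how much memory the kernel sees is the hw.physmem sysctl. The snippet below is a minimal sketch; bytes_to_mib is a helper name made up for this example, and the value reported is typically somewhat below the installed total because it counts usable memory.

```shell
#!/bin/sh
# Convert a byte count to whole MiB (integer division).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# hw.physmem is a FreeBSD sysctl; on systems without it the
# report is simply skipped.
if physmem=$(sysctl -n hw.physmem 2>/dev/null); then
    echo "usable physical memory: $(bytes_to_mib "$physmem") MiB"
fi
```

With 16GB installed, the figure should land a little under 16384 MiB.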

I never would have expected that the OS would be so specifically tied to the amount of installed memory, short of the obvious technical limitations that might come with 32-bit systems, PAE, etc, but those clearly don't apply to this situation (this has been an amd64 build from the beginning). However, I'm glad that a simple recompilation of my kernel (and presumably the associated modules) solved my problem.

Having found almost no information of value on the net while googling this problem, I decided I would post my experience here, in case the solution proves to be of value to anyone else in a similar circumstance. Forgive me if I am expounding on something glaringly obvious to the FreeBSD-savvy, but as I said, I am only just returning to the platform, and this situation has never come up for me before (nor in the context of many Linux systems I have managed).

On that note, can anyone with more technical knowledge of the inner workings of FreeBSD explain to me exactly what was happening? I obviously found a solution that worked, but I would be interested to know why the change in RAM would have caused the problem to begin with.
 
If you have the time, could you try rebuilding/reinstalling the world/kernel using GCC, with only 8 GB of RAM installed during the build? Make sure everything boots correctly. Then add the extra 8 GB of RAM and see if things still boot correctly.

That will narrow down whether it's a GCC or LLVM "optimisation" causing the issue.

Also, what (if anything) do you have in /etc/make.conf and /etc/src.conf?

If you rename those files and rebuild the world again with 8 GB installed, then install the extra RAM, does it still fail? If not, then the issue is an "optimisation" in one of those files (most likely make.conf).
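A minimal sketch of the renaming step, assuming the files sit in /etc and that a ".disabled" suffix is acceptable (both the function name and the suffix are made up for this example):

```shell
#!/bin/sh
# Move make.conf and src.conf out of the way so the next build
# runs with stock settings. The directory is a parameter so the
# function can be tried on a scratch copy first.
set_aside_conf() {
    etc_dir="${1:-/etc}"
    for f in make.conf src.conf; do
        if [ -f "$etc_dir/$f" ]; then
            mv "$etc_dir/$f" "$etc_dir/$f.disabled"
        fi
    done
}
```

Run `set_aside_conf` as root before the rebuild, and move the files back afterwards if they turn out not to be the cause.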
 
To clang or not to clang

Thanks for the reply, and your suggestions.

I _may_ have the time this weekend to at least recompile the kernel, as you suggest, and see what happens. Rebuilding world takes substantially longer, and may not even be necessary to expose the truth. Although, I can foresee potential problems with running a kernel compiled with gcc in a world previously compiled with clang. We'll have to see how much time I have. Pity we're not testing in the other direction, as I've noticed that clang/LLVM does seem to build kernel/world much faster than gcc.

There are some customizations in my /etc/make.conf; I've listed my changes below. I have not messed with /etc/src.conf. For the record, my source tree is csuped with 9.0-Release twice weekly by cron.

Perhaps before even rebuilding with gcc, I should try to take the -O flags out of make.conf and see if that makes a difference (still building with clang)?


Code:
# Generic optimizations:
CFLAGS= -O2 -fno-strict-aliasing -pipe

# More conservative optimization for building kernel:
COPTFLAGS= -O -pipe -ffast-math -fno-strict-aliasing

# Two kernel configuration files exist on my system (I am currently running "CUSTOM"):
KERNCONF= CUSTOM GENERIC

# Use clang (instead of gcc) for any makefile that does not
# explicitly insist otherwise:
.if !defined(CC) || ${CC} == "cc"
CC=clang
.endif
.if !defined(CXX) || ${CXX} == "c++"
CXX=clang++
.endif
.if !defined(CPP) || ${CPP} == "cpp"
CPP=clang-cpp
.endif

# Misc optimization flags:
OPTIMIZED_CFLAGS=      YES
BUILD_OPTIMIZED=       YES
WITH_CPUFLAGS=	       YES
WITHOUT_DEBUG=	       YES
WITH_OPTIMIZED_CFLAGS= YES

# I am using the proprietary Nvidia driver for X11 (for a Quadro 400):
# It is worth noting that I did need to mess with the nvidia configuration 
# files a bit to get it to build/install on FreeBSD 9.0; their makefile seems to
# abandon installation if it detects a version > 8.x.
WITH_NVIDIA_GL=	       YES
WITH_NVIDIA=	       YES
WITHOUT_NOUVEAU=       YES
 
Faustus said:
Perhaps before even rebuilding with gcc, I should try to take the -O flags out of make.conf and see if that makes a difference (still building with clang)?

Yes. Custom CFLAGS are a common source of problems.
 
Yeah, I'd remove (comment out) all of the "optimisation" entries in make.conf. Then rebuild the world with 8 GB installed. Make sure that works. Then add the extra RAM and boot.

9 times out of 10, "optimisations" in make.conf do the exact opposite of what people think, or generally just break things. :)
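Commenting the entries out by hand is easy enough, but here is a rough sketch of doing it with sed. The list of variable names is taken from the make.conf posted above and is an assumption about which entries count as "optimisations"; disable_opt_flags is a made-up function name.

```shell
#!/bin/sh
# Comment out optimisation-related assignments in a make.conf-style
# file, keeping a .bak copy of the original next to it.
disable_opt_flags() {
    conf="$1"
    cp "$conf" "$conf.bak" &&
    sed -E 's/^(CFLAGS|COPTFLAGS|OPTIMIZED_CFLAGS|BUILD_OPTIMIZED|WITH_CPUFLAGS|WITHOUT_DEBUG|WITH_OPTIMIZED_CFLAGS)[[:space:]]*=/#&/' \
        "$conf.bak" > "$conf"
}
```

Something like `disable_opt_flags /etc/make.conf` leaves KERNCONF, the clang block, and the Nvidia knobs untouched, so the rebuild still uses the same compiler and kernel config.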
 
Will do

I'll rebuild with clang/LLVM, but without the custom make.conf, sometime this weekend and let you know how it goes. Thanks.

However, further testing will probably have to wait until my new CPU cooler arrives, and I get a chance to install it. The stock CPU fan and heatsink that came with the Xeon lack a backplate for mounting to the motherboard; the assembly is only held on by some (fairly flimsy looking) plastic clips which pop into the mounting holes from the top side of the board. While I was installing the RAM I happened to notice that the cooling apparatus was "peeling away" from my CPU for lack of proper purchase on the motherboard! I re-secured the cooler, for the time being, but I don't trust it. This situation will be corrected by Noctua shortly; until then, CPU-intensive tasks of long duration are on hold. I don't need to figure out this RAM situation only to have my CPU go up in flames! FedEx tracking estimates delivery of the new cooler tomorrow.
 