I ran into a problem while upgrading the RAM in my system and eventually found a fix, which I thought I'd share, especially since the symptoms were initially deceptive (they looked like bad RAM).
You should know that I am relatively new to FreeBSD; having played with it a bit about 9 years ago, I am only just now returning to the fold. I can see a lot has changed in my absence. I have a fair amount of other Unix and Linux experience, however, and I find many things about FreeBSD refreshing. So far, putting together a small server/workstation has been a fairly straightforward proposition, and has presented me with few problems which could not be solved by referring to the extensive documentation, or just applying some generic *nix troubleshooting steps. However, at least one thing has really taken me off guard; here is an account of my most unexpected issue to date:
I have been running FreeBSD 9.0 since shortly after the official Release last month, on a quad-core Intel Xeon (Sandy Bridge), Asus P8B WS motherboard, with 8GB (2 x 4GB UDIMMs) of Kingston PC3 10600 ECC RAM. It has been running fine with a custom kernel, but note that my customizations have been quite mild--mainly removing support for hardware I don't, and never will, use (e.g. the floppy disk drive, hordes of ethernet cards, etc). As an aside, I have rebuilt kernel and world using Clang/LLVM, and to date everything has been running smoothly.
One of the uses I am putting this machine to is as an NFS server for ZFS filesystems, with dedup enabled. ZFS can be very memory hungry, and dedup in particular, so I decided to install an additional 8GB of RAM, for 16GB total. Yesterday I put another 2 DIMMs, identical to the 2 Kingston sticks I already had, into the machine. Much to my chagrin, shortly after I powered the machine back on, a kernel trap forced an immediate reboot... and again... and again. The fault consistently happened late in the boot process (almost immediately before the login prompt would normally appear), at apparently the same point every time.
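For anyone wondering how memory hungry dedup can get, here is a rough back-of-the-envelope sketch in Python. It is not anything ZFS itself provides: the figure of about 320 bytes of RAM per dedup table entry is a commonly quoted rule of thumb that I am assuming here, the 64 KiB average block size is just a guess, and the function name is made up for illustration.

```python
# Rough estimate of the RAM needed to keep a ZFS dedup table (DDT) in memory.
# ASSUMPTIONS (not from any measurement on my system):
#   - about 320 bytes of RAM per DDT entry (a commonly quoted rule of thumb)
#   - one DDT entry per unique block
#   - 64 KiB average block size (real pools vary a lot)

BYTES_PER_DDT_ENTRY = 320  # assumed rule-of-thumb value
GIB = 1024 ** 3
TIB = 1024 ** 4


def ddt_ram_bytes(unique_data_bytes, avg_block_bytes=64 * 1024):
    """Return the estimated DDT size in bytes for the given amount of unique data."""
    # Ceiling division: a partially filled block still needs its own entry.
    entries = -(-unique_data_bytes // avg_block_bytes)
    return entries * BYTES_PER_DDT_ENTRY


if __name__ == "__main__":
    for tib in (1, 2, 4):
        need = ddt_ram_bytes(tib * TIB)
        print(f"{tib} TiB of unique data -> roughly {need / GIB:.1f} GiB of DDT")
```

Under those assumptions the table grows linearly with the amount of unique data and shrinks as the average block size goes up, which is why I wanted more than 8GB in the box.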
My first suspicion was naturally that the RAM I just installed was bad, so I took the new sticks out, and tried to start up again. Of course, the machine booted normally. So, I swapped the "old" sticks for the new ones (identical in model, but the "old" ones are at least known to be good). With just the new RAM in the machine, it once again booted normally. To make a long story short, I tried various combinations of the old and new RAM, in different banks, etc, and no matter which combination I tried, FreeBSD would boot with 8GB installed, but never with all 16GB. Bummer.
I ran at least one pass of memtest86, just to be sure, and with all 16GB installed, memtest86 passed with no errors. I also discovered that I could boot into single user mode with all the RAM installed, which was a relief (not least because it meant I could more easily mess with things to find out what the problem was).
At that point, I began to suspect that the trouble was with a kernel module--one which apparently does not load until relatively late in the startup order--which did not agree with the change in memory configuration. However, I had difficulty ascertaining exactly what was causing the crashes, because I couldn't find anything specific in the logs I reviewed; perhaps the system was going down before it had time to log anything relating to the misbehaving module itself. But surely the maximum allowable RAM would not be hard-coded into any part of the kernel, or kernel modules, would it? If there was a RAM expectation, I decided, it could only be a result of compile-time optimizations. So, just for the heck of it, I decided to rebuild my kernel from single user mode (my understanding is that using make buildkernel is also supposed to pull in any kernel modules and rebuild them as well).
It worked. After building and installing the new kernel, with the 16GB RAM installed the whole time, I rebooted and the system came up normally. When I checked memory stats after the system came up, all 16GB was clearly recognized.
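If you want to script that last check, something like the following sketch should do. It assumes a FreeBSD box where `sysctl -n hw.physmem` prints the physical memory in bytes; the function names are my own, and the sample value in the comment is simply 16 x 1024^3, not output captured from my machine.

```python
# Minimal sketch: turn the output of `sysctl -n hw.physmem` into GiB.
# ASSUMPTION: runs on FreeBSD, where hw.physmem reports bytes of physical memory.
import subprocess

GIB = 1024 ** 3


def physmem_gib(sysctl_output):
    """Convert the text printed by `sysctl -n hw.physmem` (a byte count) to GiB."""
    return int(sysctl_output.strip()) / GIB


def read_physmem_gib():
    """Ask the running system for hw.physmem; raises if sysctl fails."""
    result = subprocess.run(
        ["sysctl", "-n", "hw.physmem"],
        capture_output=True,
        text=True,
        check=True,
    )
    return physmem_gib(result.stdout)


# Example with an illustrative value (exactly 16 GiB expressed in bytes):
#   physmem_gib("17179869184\n")
```

On a real machine I would expect the reported figure to come in a little under the installed amount, since some memory is typically reserved before the kernel counts it, so compare against a threshold rather than an exact number.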
I never would have expected that the OS would be so specifically tied to the amount of installed memory, short of the obvious technical limitations that might come with 32-bit systems, PAE, etc, but those clearly don't apply to this situation (this has been an amd64 build from the beginning). However, I'm glad that a simple recompilation of my kernel (and presumably the associated modules) solved my problem.
Having found almost no information of value on the net while googling this problem, I decided I would post my experience here, in case the solution proves to be of value to anyone else in a similar circumstance. Forgive me if I am expounding on something glaringly obvious to the FreeBSD-savvy, but as I said, I am only just returning to the platform, and this situation has never come up for me before (nor in the context of many Linux systems I have managed).
On that note, can anyone with more technical knowledge of the inner workings of FreeBSD explain to me exactly what was happening? I obviously found a solution that worked, but I would be interested to know why the change in RAM would have caused the problem to begin with.