Serious bug - X11 killed my system!

Just a warning to those who may be inclined to follow the line in the handbook that reads:

Code:
As of version 7.3, Xorg can often work without any configuration file by simply typing at prompt:

startx

DON'T DO IT!!! I installed my system, then spent the better part of a week compiling/installing various ports. There were issues, like libreoffice and the KDE screensaver crashing, and I was really hoping that a standalone video card would help clear up these problems. I shut down the system, installed an nVidia PCI video card, and booted back up. Everything was going well until I decided to believe that line in the handbook...

Knowing my old /etc/X11/xorg.conf wouldn't work because the onboard video was based on a radeon chipset, I deleted the file. Then I foolishly decided to see if I could bypass the X11 configuration that I went through last time by simply typing startx from a user account, figuring the worst that could happen was that X/KDE would fail to start and that I'd have to break out from a different console. How wrong I was!

I got a fast clicking/buzzing noise from my speakers after the screen went black. I knew something had gone wrong, but figured it was no big deal - I'll just Control-Alt-F1 to get to the root shell I had open on the first console and kill the process from there. No such luck. Nothing I did was able to get the screen to do anything but stay black (not no signal, but black) and nothing aside from unplugging the speaker would stop that annoying buzz. Tried switching to different consoles, tried using control-alt-delete to reboot, tried accessing over the network (before realizing I hadn't set that up yet :( ), and still couldn't get any response. Tapping scroll lock twice to switch to another machine on my KVM didn't even work. I let it sit for about 20 minutes in the hope it'd eventually respond. It never did and I ended up having to hold down the power button 4 seconds to get it to shut off.

Powered back on and what a surprise - the file system check complained that / wasn't properly dismounted and started a check. After the file system check completed (with the below lines resulting), it continued booting. Unfortunately, it went extremely slowly and eventually stalled out, refusing to boot any farther. Again the keyboard was frozen and I had to do a hard power-off.

Code:
INCORRECT BLOCK COUNT I=1885000 (4 should be 0) (CORRECTED)
INCORRECT BLOCK COUNT I=2802692 (4 should be 0) (CORRECTED)
INCORRECT BLOCK COUNT I=2826427 (4 should be 0) (CORRECTED)

The same process repeated itself twice more. Then I tried booting from both other drives in the mirror with the same result. One interesting point is that it doesn't freeze up at the same point every time - one time the last line shown will be the start of inetd, the next when hald is starting, etc. There's no pattern that I can determine. I even let the system sit overnight in the hope that the fsck was still running and somehow blocking the boot process, and hence would eventually complete and allow the boot to continue, but to no avail - it was still frozen in the same spot when I got in this morning.

I can get into single user mode, but because the boot process doesn't stop/lock up in any consistent way, don't really know what to do in order to repair whatever damage was done. The file system check completes at boot time, so I don't see where that would help. Since the mirror was active at the time this happened, all 3 copies of my system were taken down. (Already made a mental note to have one drive be a static backup that is not in the mirror; I'll write something that will mirror the root file system once a week or so if/when the system gets back up & running.) I thought that disabling one or more startup daemons might help, but it doesn't seem to. At this point, I'm about ready to wipe it and start over, which I REALLY hate to do after putting so much time & work in on it, but I really don't see what other option I have. :(



I guess the point of this post is to not believe what it says in the handbook about startx working without configuration first. This is exactly what I did as a normal user and it nuked the system somehow. If anybody has ideas of how I could troubleshoot or isolate or repair what was screwed up, I'd appreciate them. If a developer reads this, you might want to look at how something that a normal user does makes a system unbootable. (I'm willing to provide any information you ask in the way of log files and such, but unless I figure something out, the system's getting wiped and reinstalled so it can compile over the weekend.)
 
I've had no trouble running X without a config on multiple machines. I also don't know whether I'd blame X for the system being unbootable. It sounds like you ran into an issue specific to your setup, and your filesystem got hosed by the hard reboot.

Sucks :(
 
I've been running Xorg without an Xorg config file and haven't had issues. I really doubt starting X without a config file hosed your system... but it could have contributed to the original lockup when it was trying to probe and load the correct video driver and monitor settings.

The error you posted looks like something was probably writing to your filesystem and the hard reboot hosed it. For the slowness experienced during startup, I have noticed that FreeBSD actually runs a filesystem check while the system is still starting up (someone correct me if I am wrong). For example, after hard rebooting my system, it will normally boot up and run slowly for a few minutes (with the HDD light showing activity) while /usr is being scanned/checked. Maybe this is what is happening to you?

I would try booting into single user mode, check the filesystem and then proceed and see if it locks up.
 
redw0lfx said:
For the slowness experienced during startup, I have noticed that FreeBSD actually runs a filesystem check while the system is still starting up (someone correct me if I am wrong).
Correct. The fsck(8) gets run in the background. Traditionally you'd have to wait for it to finish before it continued to boot.
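
If you would rather have the traditional behaviour, where the boot waits for the check to finish, a minimal sketch of the /etc/rc.conf setting is below. It assumes the stock background_fsck knob; check /etc/defaults/rc.conf on your release before relying on it.

```shell
# /etc/rc.conf fragment (a sketch, not a complete file).
# Run the boot-time fsck in the foreground so the boot waits for it
# instead of starting services while the check is still running.
background_fsck="NO"
```

The trade-off is a longer wait at boot after an unclean shutdown, in exchange for not running daemons on top of filesystems that are still being repaired.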
 
Ruler2112 said:
I guess the point of this post is to not believe what it says in the handbook about startx working without configuration first.

Sorry, but your point is invalid.

Your system is broken somewhere, X has nothing to do with it. The X crash was unrecoverable because Xorg directly accesses hardware. It's a hardware problem.

X autoconfiguration is a pretty normal thing: it uses PCI IDs to autoload the graphics driver, EDID for display configuration, etc. It's not running "without configuration"; it's compiling the configuration on the fly. It's what we used to do with X -configure, just done at run time.

You'd have the same configuration if you executed
# X -configure
and then copied the file to /etc/X11/xorg.conf.
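
For reference, a rough sketch of that static route, run as root from a text console with no X server running. It assumes Xorg -configure leaves xorg.conf.new in root's home directory (taken to be /root here). The function name is made up for illustration, and the commands are wrapped in a function so nothing runs until you call it.

```shell
# Sketch only: generate, test, and install a static xorg.conf.
# make_static_xorg_conf is a hypothetical helper name; /root is assumed to be
# the home directory where Xorg -configure writes xorg.conf.new.
make_static_xorg_conf() {
    new_conf=/root/xorg.conf.new

    # Probe the hardware and write a skeleton configuration file.
    Xorg -configure || return 1

    # Try the generated file before installing it. The server runs in the
    # foreground, so this step returns only once the server has been stopped.
    Xorg -config "$new_conf" || return 1

    # Install the file only if the test run above exited cleanly.
    cp "$new_conf" /etc/X11/xorg.conf
}
```

Testing the generated file first means a bad probe costs you a failed test run rather than a broken /etc/X11/xorg.conf.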

People who used to autodetect settings and then ran X can run X without generating the static configuration file. People who used 3rd-party drivers or tweaks can't. Simple as that.
 
I think that the timeline of your apocalypse is:

Ruler2112 said:
I shut down the system, installed an nVidia PCI video card, and booted back up.
<cut>
Knowing my old /etc/X11/xorg.conf wouldn't work because the onboard video was based on a radeon chipset
Maybe there is a conflict between the PCI NVidia and the onboard Radeon? Or a bug in X autoconfiguration with this type of video card mix? I don't know, but the result is that X crashed leaving you with a black screen.

Ruler2112 said:
I got a fast clicking/buzzing noise from my speakers after the screen went black.
This sounds like a kernel panic with data written to unknown/random addresses.

Ruler2112 said:
I'll just Control-Alt-F1 to get to the root shell
When the kernel panics you have 15 seconds to hit a key to stop the countdown - otherwise the system reboots.
But you have stopped the countdown by trying to get into a console, so now you are in a deadlock: you are waiting for your system, and your system is waiting for you. The only way out is a reset.

Ruler2112 said:
Powered back on and what a surprise - the file system check complained that / wasn't properly dismounted and started a check. After the file system check completed (with the below lines resulting), it continued booting. Unfortunately, it went extremely slowly and eventually stalled out, refusing to boot any farther. Again the keyboard was frozen and I had to do a hard power-off.
After the reset (and even after the kernel panic) your filesystems need to be checked (remember that they were not cleanly unmounted), and while fsck runs any disk access is slow, even though the check runs in the background. Moreover, you reset the machine while fsck was trying to fix your already dirty filesystems.

You should let the system boot and check the filesystems - it may take some time. When this happens to me I usually log in at the console, run top(1) and wait for fsck to finish its work.
Next, you could try removing the PCI card and using only the onboard one, with VESA drivers.

Hope this helps.
 
Dies_Irae said:
You should let the system boot and check the filesystems - it may take some time. When this happens to me I usually log in at the console, run top(1) and wait for fsck to finish its work.
Even better would be to boot to single user mode and run fsck there. Remember that fsck cannot fix certain filesystem errors if the filesystem is mounted.
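
A minimal sketch of that sequence, for use at the single user prompt. It assumes UFS filesystems listed in /etc/fstab; the function name is only for illustration, and you can just as well type the three commands by hand.

```shell
# Sketch only: check filesystems from single user mode, then mount them.
# single_user_check is a hypothetical name, not a system command.
single_user_check() {
    # Check every filesystem listed in /etc/fstab, answering "yes" to repairs.
    # In single user mode / is mounted read-only and the others are not
    # mounted yet, so fsck can repair things it cannot touch on a live system.
    fsck -y || return 1

    # Remount / read-write, then mount the remaining UFS filesystems.
    mount -u / || return 1
    mount -a -t ufs
}
```

Typing exit at the single user shell afterwards continues the boot into multi-user mode.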
 
Dies_Irae said:
I think that the timeline of your apocalypse is:

Ruler2112 said:
I shut down the system, installed an nVidia PCI video card, and booted back up.
<cut>
Knowing my old /etc/X11/xorg.conf wouldn't work because the onboard video was based on a radeon chipset

Maybe there is a conflict between the PCI NVidia and the onboard Radeon? Or a bug in X autoconfiguration with this type of video card mix? I don't know, but the result is that X crashed leaving you with a black screen.

I have a setup like this. I had to disable the onboard video in the BIOS or all heck breaks loose.
 
Thanks for all the input, guys.



redw0lfx said:
The error you posted looks like something was probably writing to your filesystem and the hard reboot hosed it.

Mormegil said:
...your filesystem got hosed by the hard reboot.

Sucks :(

I believe this is what happened. Went back and discovered I hadn't enabled soft updates or journalling when installing initially, so the hard reboot is what probably killed something.

I did notice that journalling isn't even available from the menu without going into the newfs options and manually adding -J there... Isn't journalling what saves you in the event of a power loss/unexpected reboot? (UPDATE: The system complains about a journal provider not being found on reboot with this config.)


roddierod said:
I have a setup like this. I had to disable the onboard video in the BIOS or all heck breaks loose.

Dies_Irae said:
Maybe there is a conflict between the PCI NVidia and the onboard Radeon?

From past nightmares trying to mix onboard & standalone video cards, I disabled the onboard video in the BIOS before booting the first time after slapping the PCI card in, so there should be no problem there.


redw0lfx said:
For the slowness experienced during startup, I have noticed that FreeBSD actually runs a filesystem check while the system is still starting up (someone correct me if I am wrong). For example, after hard rebooting my system, it will normally boot up and run slowly for a few minutes (with the HDD light showing activity) while /usr is being scanned/checked. Maybe this is what is happening to you?

Dies_Irae said:
You should let the system boot and check the filesystems - it may take some time. When this happens to me I usually log in at the console, run top(1) and wait for fsck to finish its work.
Next, you could try removing the PCI card and using only the onboard one, with VESA drivers.

It's more than slow when starting up - it's slow and then stops completely. Left it trying to boot overnight and when I came back ~18 hours later, it was still sitting at the same exact place just as unresponsive as when I left.

The onboard video is quite slow and I was hoping that a standalone video card would eliminate the KDE screen saver crashing, the bug report about which was linked in my original post on the subject. (Obviously not everybody with KDE 3.5.10 has the problem...)


redw0lfx said:
I would try booting into single user mode, check the filesystem and then proceed and see if it locks up.

SirDice said:
Even better would be to boot to single user mode and run fsck there. Remember that fsck cannot fix certain filesystem errors if the filesystem is mounted.

Just tried that - fixed a bunch of stuff, but made no difference. Rebooted and froze up during the boot sequence again, so I did a binary wipe and am reinstalling now. :(


Dies_Irae said:
Or a bug in X autoconfiguration with this type of video card mix? I don't know, but the result is that X crashed leaving you with a black screen.

That was my thought. I wouldn't think that a PCI nVidia GeForce6200 is all that uncommon of a video card though...

If X were to have just crashed and returned me to a prompt, it wouldn't have been a problem. Being stuck at a black screen and unable to do anything to affect the system is ridiculous IMO. (Out of curiosity, do the Num Lock, Caps Lock, and Scroll Lock lights flash on a kernel panic in FreeBSD like they do in Linux? Never had a kernel panic in BSD before... I ask because the lights were not flashing, but neither did the associated keys toggle the on/off status of the light.)


Dies_Irae said:
When the kernel panics you have 15 seconds to hit a key to stop the countdown - otherwise the system reboots.
But you have stopped the countdown by trying to get into a console, so now you are in a deadlock: you are waiting for your system, and your system is waiting for you. The only way out is a reset.

This doesn't sound like the type of good design I've come to expect from BSD. Basically what this means is that if the system crashes in such a way that you don't know what's happened and you try to do anything to figure out what's happened, your only recourse is to do a hard reboot.


Zare said:
Your system is broken somewhere, X has nothing to do with it. The X crash was unrecoverable because Xorg directly accesses hardware. It's a hardware problem.

I don't understand - the system would boot, I tried running startx as a normal user and it had a problem, then the system wouldn't boot. How does X have nothing to do with it???

You obviously know more about X than I probably ever will (or care to ;) ), so I'm not going to argue the point. However, while I understand that a hard reset without having soft updates/journalling enabled is what was most likely the ultimate cause of the system dying, the auto-config of X causing a hard lockup is what precipitated the whole mess.

We'll never know if going through the configuration steps would have prevented the hard lockup X caused because the system wouldn't boot and is now wiped, but I know I would appreciate somebody posting a warning to not bypass those steps if this had happened to them, so this is exactly what I did.


Zare said:
X autoconfiguration is a pretty normal thing: it uses PCI IDs to autoload the graphics driver, EDID for display configuration, etc. It's not running "without configuration"; it's compiling the configuration on the fly. It's what we used to do with X -configure, just done at run time.

You'd have the same configuration if you executed
# X -configure
and then copied the file to /etc/X11/xorg.conf.

People who used to autodetect settings and then ran X can run X without generating the static configuration file. People who used 3rd-party drivers or tweaks can't. Simple as that.

Don't have any tweaks or 3rd party drivers that I'm aware of - just followed directions in the handbook. The configuration steps in the handbook worked for the onboard video, but I decided to be lazy and try to skip it with the standalone nvidia card because the handbook said it would be fine. It'll be interesting (to me at least) to see if the configuration steps in the handbook work for the standalone as they did for the onboard video the first time...

I'll post back once the system is installed and X/KDE compiled/installed with the results.
 
Ruler2112 said:
I did notice that journalling isn't even available from the menu without going into the newfs options and manually adding -J there... Isn't journalling what saves you in the event of a power loss/unexpected reboot? (UPDATE: The system complains about a journal provider not being found on reboot with this config.)

Never mind about this. Doing reading on the subject has made me realize that only soft updates are needed to protect the system in the event of a power loss. (From what I've found, gjournal eliminates the need to fsck after a hard reboot at the cost of writing everything twice and consuming considerable drive space.)
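
In case it helps anyone else, here is a rough sketch of checking for and turning on soft updates on an existing UFS filesystem without reinstalling. It assumes the filesystem is unmounted or mounted read-only (for example / in single user mode); the function name is made up.

```shell
# Sketch only: enable soft updates on an existing UFS filesystem.
# enable_soft_updates is a hypothetical name; pass a mount point or device.
enable_soft_updates() {
    fs=$1
    [ -n "$fs" ] || { echo "usage: enable_soft_updates <filesystem>" >&2; return 2; }

    # Show the current tuning, which includes the soft updates setting.
    tunefs -p "$fs"

    # Turn soft updates on for this filesystem.
    tunefs -n enable "$fs"
}
```

From single user mode this would be called as enable_soft_updates / before remounting anything read-write.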
 
Something to consider... The "open source" nv driver is actually highly obfuscated. It was written by developers at nvidia, and they even announced a while back that they were going to stop development on it. So you essentially left your computer in the hands of a nearly closed-source "nv" driver that, frankly, doesn't get the same attention from nvidia as their actual closed source "nvidia" drivers, and doesn't get the peer review that actual open source drivers receive.

BTW, you may want to confirm that the BIOS really did disable the on-board GPU by checking the output of 'pciconf'.
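
A small sketch of that check, assuming the usual pciconf -lv layout where each device starts with an unindented header line followed by indented "class = ..." detail lines. The function name is made up; it only filters text on stdin.

```shell
# Sketch only: print the header line of every display-class PCI device
# found in `pciconf -lv` output read from stdin.
# list_display_devices is a hypothetical helper name.
list_display_devices() {
    awk '
        # An unindented line starts a new device record; remember it.
        /^[^ \t]/ { header = $0 }
        # An indented "class = display" line marks a video device.
        /class[ \t]*=[ \t]*display/ { print header }
    '
}

# Usage on the FreeBSD box:
#   pciconf -lv | list_display_devices
```

If the onboard GPU really is disabled in the BIOS, you would expect only the add-in card to show up in the output.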

Adam
 
Ruler2112 said:
This doesn't sound like the type of good design I've come to expect from BSD. Basically what this means is that if the system crashes in such a way that you don't know what's happened and you try to do anything to figure out what's happened, your only recourse is to do a hard reboot.

It's not a matter of good or bad design - when you have a kernel panic the game is over.
We are not talking of the crash of a simple app, but of the system itself.

When a kernel panics you have two choices: let the system reboot automatically after 15 seconds, or stop the countdown.

If you stop the countdown and your kernel is compiled with
Code:
options KDB    # kernel debugger framework, needed for DDB
options DDB    # interactive kernel debugger
you could use the kernel debugger to investigate the crash - but in the end you have to reboot, you cannot go back from a panic.

You are in a critical situation, beyond the point of no return, but despite that the kernel will give you the opportunity to investigate the problem.

Compare this with the (in)famous BSOD of Windows :e

For more info, see here and ddb(4)
 
Probably too late now as you reinstalled, but I had a problem where I tried to start Xorg after an unclean exit that also left me with a blank screen. This was solved by removing the .Xauthority* files in my home directory.
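
If anyone wants to try this without deleting anything outright, a minimal sketch is below. It moves the files aside instead of removing them; the function name and the backup directory name are made up.

```shell
# Sketch only: move stale X authority files out of the way.
# stash_xauthority and the xauthority-backup directory are hypothetical names.
stash_xauthority() {
    backup="$HOME/xauthority-backup"
    mkdir -p "$backup" || return 1

    for f in "$HOME"/.Xauthority*; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$f" ] || continue
        mv "$f" "$backup"/
    done
}
```

Run stash_xauthority as the affected user from a text console, then try startx again.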
 