EQ overflow FreeBSD 12.0 + nVidia 1660Ti with 430 driver + Ryzen 2400G

Rob S · Jun 29, 2019

Hello.

I am getting EQ overflow errors in my Xorg log using the nVidia 430 driver. These correspond to my system freezing up. The mouse and keyboard work for about 10 seconds before this happens. However, I am able to get a stable display with the VESA driver.

* I'm running FreeBSD 12.0 with the GENERIC kernel (I also tried a custom kernel without option VESA). I have switched off the on-board graphics in my ASRock AB350 BIOS. My primary display is on an nVidia 1660Ti over HDMI. I was using a KVM switch (Belkin Flip) but have since plugged both mouse and keyboard directly into the PC, and this does not solve the problem.

* The 430 driver was installed from a patched version of nvidia-driver in the /usr/ports tree. The patch was taken from this bug report page: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

* I'm running nvidia_modeset and nvidia kernel modules (specified in /boot/loader.conf) and dbus and hald (specified in rc.conf). Also, I am using the xorg.conf generated automatically by nvidia-xconfig (430). I'm using vga textmode (in loader.conf) but it doesn't make any difference with/without.

Can someone offer advice on how to resolve this, please? I would be grateful.

Thanks,

Rob.

This has also been raised on reddit and bugs.freebsd.org (from where I was redirected here): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

shkhln · Jun 30, 2019

What does grep -i -E "(nvidia|NVRM)" /var/log/messages show? Also, sysctl hw.nvidia.registry.ResmanDebugLevel=0 will get you more verbose debug output from the driver, although I'm not sure how useful it is in practice.

Rob S · Jun 30, 2019

Hi. Thanks for the advice. I did as you suggested. There's some interesting output in /var/log/messages (attached) but I'm not sure what it means.

I did note that the BusID is shown as 0:10:0:0 in /var/log/messages but 0:16:0:0 in pciconf -lv. Possibly one is hex and one is dec - so nothing to worry about?
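The hex/decimal guess is easy to check: 0x10 is 16 in decimal, so the two bus numbers agree if the NVRM line is hexadecimal and pciconf prints decimal. A minimal sketch in sh (the helper name hex_bus_to_dec is made up for illustration):

```sh
# Hypothetical helper: convert a hexadecimal PCI bus number to decimal.
hex_bus_to_dec() {
    printf '%d\n' "0x$1"
}

hex_bus_to_dec 10   # prints 16
```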

I haven't got any further towards identifying the problem.

Thanks,

Rob.

Rob S · Jun 30, 2019

Also wondered if it's my choice of hardware. Using MSI Armor OC 1660 Ti. Maybe the overclocking is throwing things off? If so, I don't know how to avoid this.

shkhln · Jun 30, 2019

By the way, there's no need to attach files shorter than 100 lines; it's totally fine to include them inline.

Jun 30 22:17:16 robs-pc kernel: NVRM: Xid (PCI:0000:10:00): 79, GPU has fallen off the bus.

Quite generic error, unfortunately. Make sure you don't have hardware issues: PSU is strong enough, power connector is properly attached, GPU isn't overheating. Check whether your video card works properly under Windows/Linux, if you must.
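To see whether any other Xid codes show up before the "fallen off the bus" one, the log can be filtered for them. This is only a sketch: it assumes the Xid lines have the same shape as the one quoted above, and xid_codes is a made-up helper name.

```sh
# Hypothetical helper: list the distinct NVRM Xid codes found in a log.
# Reads the named files, or standard input if none are given.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$@" | sort -nu
}

# Example: xid_codes /var/log/messages
```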

Rob S · Jul 1, 2019

PSU is 750W...easily enough for this system. Card works fine under heavy load in Win10.

shkhln · Jul 1, 2019

1. You can try to send Nvidia a crash dump, as the NVRM message suggests. They are unlikely to react to it, though.
2. Do any non-NVRM messages between the "RmInitAdapter succeeded!" and "GPU has fallen off the bus." lines look interesting?
3. Did you test the most basic X11 desktop environment? I usually suggest X -retro, which doesn't even start a terminal: just a mouse pointer on a gray background.

Rob S · Jul 3, 2019

Interesting. It is apparently stable with the X -retro setup. I can move the mouse pointer and it doesn't freeze, even after a few minutes.

However, vty switching doesn't seem to work. When I do this, the monitor says it stopped getting a signal. So I can't kill X without a reboot. I think that's an issue that others have had and possibly is unrelated to the freezing problem on starting X normally. If I try to vty switch and then do a hard reboot with my power button, I can capture some error messages, as in the log below. However, if I just do a hard reboot from X, I don't get these messages.

Another odd thing is that I get kernel log messages (before I try to vty switch) that say "interrupt storm detected on "irq:259"; throttling interrupt source". Perhaps that's just me moving the mouse a lot when I check if my desktop is frozen? I'm using sysmouse - could that be a source of the problem? Is there an alternative?

Thanks,

Rob S.

Code:

Jul  3 00:19:00 robs-pc kernel: NVRM: GPU at PCI:0000:10:00: GPU-890b60a8-d9b3-824a-784b-648e84db328b
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU Board Serial Number: 
Jul  3 00:19:00 robs-pc kernel: NVRM: Xid (PCI:0000:10:00): 79, GPU has fallen off the bus.
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU 0000:10:00.0: GPU has fallen off the bus.
Jul  3 00:19:00 robs-pc kernel: NVRM: GPU 0000:10:00.0: GPU is on Board .
Jul  3 00:19:00 robs-pc kernel: NVRM: A GPU crash dump has been created. If possible, please run
Jul  3 00:19:00 robs-pc kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Jul  3 00:19:00 robs-pc kernel: NVRM: the NVIDIA kernel module is unloaded.
Jul  3 00:19:04 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
Jul  3 00:19:16 robs-pc syslogd: last message repeated 10 times
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57d:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:1:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:3:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:5:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:7:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57d:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:1:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:3:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:5:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:7:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:2:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Jul  3 00:19:17 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished
Jul  3 00:19:48 robs-pc syslogd: last message repeated 25 times
Jul  3 00:19:49 robs-pc devd[721]: check_clients:  dropping disconnected client
Jul  3 00:19:50 robs-pc kernel: uhub_reattach_port: giving up port reset - device vanished

shkhln · Jul 3, 2019

Rob S said:
Interesting. It is apparently stable with the X -retro setup. I can move the mouse pointer and it doesn't freeze, even after a few minutes.

Now we (well, you) need to find what actually crashes the driver. Start a twm session (startx with default settings, i.e. without ~/.xinitrc), run glxgears from Mesa, then maybe something heavier like the Unigine Valley benchmark.

Rob S said:
However, vty switching doesn't seem to work. When I do this, the monitor says it stopped getting a signal.

Switching from (modern and relatively new) vt back to syscons might or might not help with it. Also see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237050.

Rob S said:
Another odd thing is that I get kernel log messages (before I try to vty switch) that say "interrupt storm detected on "irq:259"; throttling interrupt source". Perhaps that's just me moving the mouse a lot when I check if my desktop is frozen?

Probably not.

Rob S said:
I'm using sysmouse - could that be a source of the problem?

No, nothing of the sort. It doesn't talk to hardware directly, that would be either ums or psm driver.

Rob S · Jul 7, 2019

I tried running a full xfce4 desktop. It was actually stable(-ish) this time. The weird thing is that the fans spun up on starting X and remained on even when idling. These fans are supposed to stop when the card is idle (and they do on Win10). I did get a few short freezes before it finally died when I opened a terminal. There were a few clusters of "interrupt storm" in the message log.

T-Daemon · Jul 7, 2019

If you haven't done it yet, update to version 12.0-RELEASE-p7. According to dmesg.txt, your system is at r341666.
freebsd-update fetch
freebsd-update install
reboot

Remove from /boot/loader.conf:

Code:

linux_enable="YES"
nvidia_load="YES"
nvidia-modeset_load="YES"

Edit /etc/rc.conf and set:

Code:

linux_enable="YES"
kld_list="nvidia-modeset"

Rename any existing xorg.conf file (e.g. to xorg.conf.nvidia), then
create the file /usr/local/etc/X11/xorg.conf.d/nvidia.conf containing:

Code:

Section "Device"
   Identifier "Card0"
   Driver     "nvidia"
EndSection

reboot

login as user, execute
startx

If the problems persist report back with
dmesg
pciconf -lv |grep -B4 VGA
/var/log/Xorg.0.log

Rob S · Jul 7, 2019

Thank you for your reply. I have upgraded FreeBSD, changed the /boot/loader.conf and /etc/rc.conf and changed the Xorg config, as you suggested. Of course, it was also necessary to recompile the driver module.

I found that the standard startx setup worked stably for about 5 minutes before I did a soft reset (because I can't switch back to VT). Subsequently, I tried startx and startxfce4. Both of these crashed quite quickly, after 30 seconds or less. I didn't get the fans spinning up as before but I think that was happening randomly anyway.

I attach/paste the requested logs.

Thank you.

Code:

vgapci0@pci0:16:0:0:    class=0x030000 card=0x37501462 chip=0x218210de rev=0xa1 hdr=0x00
    vendor     = 'NVIDIA Corporation'
    device     = 'TU116 [GeForce GTX 1660 Ti Rev. A]'
    class      = display
    subclass   = VGA

For info, the card is an MSI Armor OC GeForce GTX 1660Ti.

There are two Xorg logs below. The "old" one has a suspicious error message in it. The other one did not report any messages but crashed anyway.

shepper · Jul 7, 2019

Your nVidia card is circa 2010 and has nowhere near the capability of the built-in graphics of your Ryzen 2400G. Plus nVidia support is not great for older cards. Someone needs to ask, why the complexity of a separate card? This forum has several threads reporting success with Ryzen graphics.

Rob S · Jul 7, 2019

shepper said:
Your nVidia card is circa 2010 and has nowhere near the capability of the built-in graphics of your Ryzen 2400G. Plus nVidia support is not great for older cards. Someone needs to ask, why the complexity of a separate card? This forum has several threads reporting success with Ryzen graphics.

Hi Shepper. My card was released about 5 months ago.

Edit: I updated the typo from "1600Ti" to "1660Ti" in my post.

shkhln · Jul 7, 2019

shkhln said:
run glxgears from Mesa, then maybe something heavier like the Unigine Valley benchmark.

?

Rob S · Jul 8, 2019

shkhln said:
?

Hi shkhln. Thanks - sorry I didn't follow up on the glxgears thing earlier. I did get glxgears to work once with my setup (this was when X had been running for about 5 minutes before I decided to reset). However, the last time I tried (just now), glxgears just hung on the command line. glxinfo also hung on the command line (after printing one line about the display name). I then ran firefox in another window and the whole of X crashed.

I will try glxgears again.

Rob S · Jul 8, 2019

Attempt 1:

startx
Run glxgears in login term - hangs
Run glxinfo in login term - hangs after generic one-line message
Run firefox - X stops responding

Attempt 2:

startx
Run glxgears in login term - glxgears works - reports 60 fps three times before I exit
Run glxinfo in login term - hangs after generic one-line message
Run glxgears in login term - hangs
Run firefox in other term - no response ( ps STAT has state D ).
... try last few lines a few more times with same results...
quit by pressing hw reset

On an earlier attempt I did get glxinfo to work and it reported that 3D rendering was enabled.

What I notice is that both times I get:

interrupt storm detected on "irq276:"; throttling interrupt source

This appears in my dmesg or /var/log/messages. I got this message about 60 times on attempt 2, before I reset. I was getting these messages (but with irq259) before I upgraded FreeBSD in response to an earlier post in this thread.
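For comparing runs, the storm messages can be totalled instead of counted by eye. The following is a rough sketch only: count_storms is a made-up helper, and it assumes that a syslogd "last message repeated N times" line refers to the storm message whenever it directly follows one.

```sh
# Hypothetical helper: count "interrupt storm detected" messages in a log,
# including syslogd "last message repeated N times" lines that directly
# follow a storm message. Reads the named files, or standard input.
count_storms() {
    awk '
        /interrupt storm detected/ { n++; prev_storm = 1; next }
        /last message repeated [0-9]+ times/ {
            if (prev_storm)
                for (i = 1; i < NF; i++)
                    if ($i == "repeated") n += $(i + 1)
            next
        }
        { prev_storm = 0 }
        END { print n + 0 }
    ' "$@"
}

# Example: count_storms /var/log/messages
```

A repeat line that follows some other message (for example the uhub_reattach_port lines) is deliberately not counted.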

toorski · Jul 8, 2019

I also didn’t see anything in your Xorg log that would indicate issues with NVIDIA’s GPU driver or the display output.

All I can think of is your video card’s OC setting/configuration. Maybe you should use MSI’s GPU tuner software to reset the video card to its default settings, with no OC, if there is such an option.

Or else there's some kind of DMA/IRQ hardware conflict that FreeBSD cannot deal with.

Edit:
I would also run:
kldstat | grep nvidia
to make sure that the modules are loaded, and then re-run:
nvidia-xconfig

Then, reboot and try startx again.

Rob S · Jul 8, 2019

Thanks toorski. There are two xorg logs. The "old" one has a driver backtrace in it.

I will definitely try the overclocking thing. The Win10 MSI tool allows the clocks to be slowed down relative to the current OC setting, so I'll need to look up the values for the stock clocks. Hopefully those settings will persist after a reboot.
I have to go offline now for about 20 hours.

shkhln · Jul 8, 2019

Rob S said:
Run glxgears in login term - glxgears works - reports 60 fps three times before I exit
Run glxinfo in login term - hangs after generic one-line message
Run glxgears in login term - hangs

All in the same session? Can you post truss glxgears output (where it hangs)?

Rob S · Jul 8, 2019

OK, so I tried again with glxgears. It ran fine at 60 FPS for about 1 minute, then the framerate dropped to about 3 FPS and the desktop became barely responsive. I pressed Ctrl+C to quit. Then I ran glxgears again and it didn't even start. I ran truss glxgears this time (output attached). I also attach the output of truss glxinfo from when it hung.

System is very choppy with random freezes.

I'm going to try reducing the overclock now. Not sure if it is possible to make this persist across a reboot with a factory overclocked card but we'll see. I have no way to adjust the overclock in FreeBSD apparently - I need to reboot to Win10.

I got the usual interrupt storm detected on irq276 (repeated 11 times). Also I get EQ overflow in the Xorg log ( not the "card has fallen off the bus" like I did last time ). Seems to be intermittently freezing / crashing with one of those two errors.

(EE) [mi] EQ overflowing. Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (?+0x0) [0x3dd360]
(EE) 1: /usr/local/bin/X (?+0x0) [0x2a1d30]
(EE) 2: /usr/local/bin/X (?+0x0) [0x2de8b0]
(EE) 3: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a25990]
(EE) 4: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a22e10]
(EE) 5: /usr/local/lib/xorg/modules/input/mouse_drv.so (?+0x0) [0xe06a21e90]
(EE) 6: /usr/local/bin/X (?+0x0) [0x2cf780]
(EE) 7: /usr/local/bin/X (?+0x0) [0x2f3030]
(EE) 8: /lib/libthr.so.3 (pthread_sigmask+0x536) [0x800ae9916]
(EE) 9: /lib/libthr.so.3 (pthread_getspecific+0xe12) [0x800ae96f2]
(EE) 10: ? (?+0xe12) [0x7fffffffee15]
(EE) 11: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (nvidiaAddDrawableHandler+0x52c89) [0x80230df82]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause. It is a victim.
[ 129.232] [mi] Increasing EQ size to 1024 to prevent dropped events.
[ 129.233] [mi] EQ processing has resumed after 43 dropped events.
[ 129.233] [mi] This may be caused by a misbehaving driver monopolizing the server's resources.
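To track whether the overflows get better or worse between attempts, the dropped-event counts can be summed from the Xorg log. A minimal sketch, assuming the "EQ processing has resumed after N dropped events" wording shown above; dropped_events is a made-up helper name.

```sh
# Hypothetical helper: total the events the X server reports as dropped.
# Reads the named files, or standard input if none are given.
dropped_events() {
    sed -n 's/.*EQ processing has resumed after \([0-9][0-9]*\) dropped events.*/\1/p' "$@" |
        awk '{ total += $1 } END { print total + 0 }'
}

# Example: dropped_events /var/log/Xorg.0.log
```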

Rob S · Jul 8, 2019

Further note: I have been running X for 15 minutes now - possibly a record. However, whenever I open a new window there is a ~10 second freeze. This also happens when I open a new tab in the browser, and intermittently at other times. The /var/log/messages interrupt storm count on irq276 has now increased to "repeated 98 times".

Rob S · Jul 8, 2019

shkhln said:
All in the same session?

Yes, all in the same session.

shkhln · Jul 11, 2019

Rob S said:
The /var/log/messages interrupt storm on irq276 has now increased "repeated 98 times"

vmstat -i?
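The point of vmstat -i is to see which device sits behind irq276 and how fast its counter grows. A small sketch for picking out that one line, assuming the output starts with a header line and each source line begins with "irqN:"; irq_source is a made-up helper name.

```sh
# Hypothetical helper: keep the header line and the line for one IRQ
# from `vmstat -i` output read on standard input.
irq_source() {
    awk -v irq="irq$1:" 'NR == 1 || $1 == irq'
}

# Example (on the affected machine): vmstat -i | irq_source 276
```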

Amzo · Jul 11, 2019

Did you build the driver yourself from outside of the ports tree? I'm just curious: since the issue is only with X, it could be that you failed to patch or address something. The nvidia-driver port is still at 390.87, which was released before the GTX 1660 Ti, if I remember correctly.
