EQ overflow FreeBSD 12.0 + nVidia 1660Ti with 430 driver + Ryzen 2400G

Amzo

Active Member

Reaction score: 35
Messages: 101

Some time today I'll upgrade to the same driver version with the patches. If it is a driver issue relating to the newest FreeBSD Nvidia release then, based on your information, I should be able to reproduce it and go from there.
 
OP

Rob S

Member


Messages: 37

Did you build the driver yourself from outside of the ports tree? I'm just curious because, since the issue is only with X, it could be that you failed to patch or address something. The ports nvidia-driver is still at 390.87, which was released before the GTX 1660 Ti if I remember correctly.
Hi Amzo. Thanks for your reply. I built the driver from the ports tree using a patch:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

But yes, perhaps it's possible it's out-of-sync with my version of ports? I installed FreeBSD less than a day after building the driver.

Maybe I could try the 418 version instead.

Thanks,

RobS.
 

Rob S

Member


Messages: 37

vmstat -i?
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
interrupt storm detected on "irq275:"; throttling interrupt source
$ vmstat -i
interrupt            total   rate
cpu0:timer           17283     97
cpu1:timer           10466     59
cpu2:timer           15061     85
cpu3:timer           11164     63
cpu4:timer           92594    521
cpu5:timer            8643     49
cpu6:timer           12006     68
cpu7:timer           11184     63
irq259: hdac0            8      0
irq261: xhci1          179      1
irq262: ahci0        11564     65
irq263: ahci1          268      2
irq265: re0          16651     94
irq266: nvme0           14      0
irq267: nvme0          168      1
irq268: nvme0           45      0
irq269: nvme0           44      0
irq270: nvme0          274      2
irq271: xhci2         4664     26
irq275: vgapci0       7181     40
Total               219461   1235
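
As far as I know, the rate column of vmstat -i is an average since boot, so a short burst on one source can hide behind a modest figure (irq275 shows only 40 here despite the storm messages). A minimal sketch for picking out busy sources from that output follows; the function name busy_interrupts and the limit of 500 are illustrative assumptions, not part of any FreeBSD tool.

```sh
# Read `vmstat -i` output on stdin and print the sources whose rate
# (last column, interrupts per second) is at or above the limit in $1.
busy_interrupts() {
    awk -v limit="$1" '
        # skip the header line and the Total summary row
        NR == 1 || $1 == "Total" { next }
        # force a numeric comparison on the last column
        $NF + 0 >= limit + 0 { print }
    '
}

# Typical use on the affected machine:
#   vmstat -i | busy_interrupts 500
```

Running it a few times while X is active and comparing the totals for irq275 between runs should show whether the GPU interrupt count is climbing unusually fast.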
 

Rob S

Member


Messages: 37

Some time today I'll upgrade to the same driver version with the patches. If it is a driver issue relating to the newest FreeBSD Nvidia release then, based on your information, I should be able to reproduce it and go from there.
Cool!
 

Rob S

Member


Messages: 37

Also I have this but not sure if it's saying anything useful:

root@robs-pc:/usr/home/robs # nvidia-debugdump -z -D
nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpSystemComponent() failed, return code: 0x3e7
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x8fb
GetCaptureBuffer succeeded, bufSize: 0x83d
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
Nvlog: Dumping component(nvlog.log): nvlog.
internal_dumpNvLogComponent() succeeded
Device: GeForce GTX 1660 Ti : 0: Dumping component: debug_buffers.
GetCaptureBufferSize succeeded, bufSize: 0x22
GetCaptureBuffer succeeded, bufSize: 0x2
nvdZip_AddFile succeeded
internal_dumpGpuComponent() succeeded
Device: GeForce GTX 1660 Ti : 0: Dumping component: rm.
GetCaptureBufferSize succeeded, bufSize: 0x41c0
GetCaptureBuffer succeeded, bufSize: 0x3af3
nvdZip_AddFile succeeded
internal_dumpGpuComponent() succeeded
Nvlog: Dumping component(nvlog.gpu000.log): nvlog.
internal_dumpNvLogComponent() succeeded
nvdZip_Close() succeeded
 

T-Daemon

Well-Known Member

Reaction score: 87
Messages: 269

On a 12.0-RELEASE test system I have installed the 430.34 NVIDIA driver, not from ports but from the tarball downloaded from NVIDIA, without Linux compatibility support. The video card is an old, passively cooled GeForce GT 630. So far I haven't had any problems. In your case the issues could be related to the overclocking. Have you tried slowing down the card as you mentioned in your post #19?
 

Rob S

Member


Messages: 37

On a 12.0-RELEASE test system I have installed the 430.34 NVIDIA driver, not from ports but from the tarball downloaded from NVIDIA, without Linux compatibility support. The video card is an old, passively cooled GeForce GT 630. So far I haven't had any problems. In your case the issues could be related to the overclocking. Have you tried slowing down the card as you mentioned in your post #19?
I looked at this, but it seemed the only way was to use the MSI Afterburner tool (dual-booting with Win10). I'm not sure if the settings will persist across a reboot, but I will try now. The card is supposed to have a boost clock of 1860 MHz, which is more than stock.

I will also try a different tool now because the version of MSI Afterburner I have seems to make the settings obscure.

For info (as you probably saw), I am running Linux compatibility support (or so I understand).

Thanks for trying it!
 

Rob S

Member


Messages: 37

Some overclock settings in Win 10 attached.

As far as I can tell, the card is operating at the default clock speeds but it is a model that claims to be overclocked.
 

Attachments


Rob S

Member


Messages: 37

Some overclock settings in Win 10 attached.

As far as I can tell, the card is operating at the default clock speeds but it is a model that claims to be overclocked.
Perhaps a VBIOS flash would solve it but I don't really know.
 

toorski

Member

Reaction score: 10
Messages: 59

For info (as you probably saw), I am running Linux compatibility support (or so I understand).
I've noticed that your Linux kernel module is enabled in /etc/rc.conf.
I'm not sure which is the correct way to enable the Linux module, especially in 12.0 :(
In my case, on 11.2, I load the module from /boot/loader.conf.
On 11.2, my nvidia driver module is also loaded from /boot/loader.conf.
On 12.0, I don't have an nvidia GPU to test the nvidia driver with.

Moreover, I would also try this:

On a 12.0-RELEASE test system I have installed the 430.34 NVIDIA driver, not from ports but from the tarball downloaded from NVIDIA,
I remember that some time ago I had to build the latest nvidia-driver (from the tarball) to play with CUDA on my GTX 960, when 11.*? didn't have it in pkg or the ports tree. The driver worked fine and so did CUDA.

I would even try the nvidia-driver 390.* from pkg install, just to see what would happen :confused:

Edit:
I just verified and corrected how the Linux module is loaded on my 12.0 system. It's from /etc/rc.conf.
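
For comparison, a sketch of the two loading mechanisms discussed in this post, assuming the stock file locations. The module name nvidia-modeset is an assumption about what this driver generation installs (older setups use plain nvidia); normally each module is listed in only one of the two files.

```conf
# /boot/loader.conf: modules loaded by the boot loader, before the kernel starts
linux_load="YES"
nvidia-modeset_load="YES"

# /etc/rc.conf: modules loaded later, during rc(8) startup
linux_enable="YES"
kld_list="nvidia-modeset"
```

Either route should leave the modules visible in the output of kldstat once the system is up, which is an easy way to confirm what was actually loaded.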
 

shkhln

Aspiring Daemon

Reaction score: 203
Messages: 610

Also I have this but not sure if it's saying anything useful:
It doesn't. Only Nvidia has the means to analyze crash/debug dumps.

The card is supposed to have a boost clock of 1860 MHz, which is more than stock.
Factory overclocks are completely meaningless with GPU Boost. Each card (including non-OC versions) boosts as much as it can, which should be somewhere in the 19xx MHz range.

I will also try a different tool now because the version of MSI Afterburner I have seems to make the settings obscure.
You can lower the power limit with the -pl option of the nvidia-smi utility. For some reason it does require starting Xorg first, though. Another setting you can play with is the Coolbits X config option, which unlocks a few things in the nvidia-settings utility.
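
A hedged sketch of both suggestions. With Xorg running, `nvidia-smi -q -d POWER` should report the current and allowed power limits, and `nvidia-smi -pl <watts>` (as root) sets a new one; the wattage is left as a placeholder because valid values depend on the card. For Coolbits, the fragment below assumes the usual FreeBSD location for X config snippets; the file name and the Identifier string are hypothetical, and the value 8 is the bit that NVIDIA's README describes as enabling clock offset controls.

```conf
# /usr/local/etc/X11/xorg.conf.d/20-nvidia.conf (file name is an assumption)
Section "Device"
    Identifier "NVIDIA Card"        # arbitrary label, hypothetical
    Driver     "nvidia"
    Option     "Coolbits" "8"       # bit 3: clock offset controls in nvidia-settings
EndSection
```

After restarting X, the extra controls should appear in nvidia-settings if the driver accepts the option for this card.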
 

Amzo

Active Member

Reaction score: 35
Messages: 101

Hi Amzo. Thanks for your reply. I built the driver from the ports tree using a patch:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232645

But yes, perhaps it's possible it's out-of-sync with my version of ports? I installed FreeBSD less than a day after building the driver.

Maybe I could try the 418 version instead.

Thanks,

RobS.
The reason I was curious is that the new driver has new IRQ code, and I was assuming there was a bug in it. I can't test it yet as I'm busy working on the tensorflow port at the moment. Since an interrupt storm is when the processor receives too many interrupt requests, I figured it may be a bug.

Try increasing the interrupt storm threshold as a temporary fix; the default is 1000.

Code:
hw.intr_storm_threshold="9000"
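
A sketch of where this setting can go, assuming a stock FreeBSD 12 layout. The quoted line above is /boot/loader.conf syntax and takes effect at the next boot; the unquoted form below is the usual /etc/sysctl.conf spelling. As far as I know, the same OID can also be changed on a running system as root with `sysctl hw.intr_storm_threshold=9000`, which avoids a reboot while testing. The value 9000 is simply the one suggested in this post, not a tuned figure.

```conf
# /boot/loader.conf: read by the boot loader, applied before the kernel starts
hw.intr_storm_threshold="9000"

# /etc/sysctl.conf: applied by rc(8) during startup (use one place or the other)
hw.intr_storm_threshold=9000
```

Raising the threshold only changes when the kernel starts throttling the source; it does not remove whatever is generating the interrupts.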
 

Rob S

Member


Messages: 37

I would even try the nvidia-driver 390.* from pkg install, just to see what would happen :confused:
I did this accidentally at one point. An error was reported on loading the kernel module, which basically said the driver was incompatible with my card.
 

Rob S

Member


Messages: 37

The reason I was curious is that the new driver has new IRQ code, and I was assuming there was a bug in it. I can't test it yet as I'm busy working on the tensorflow port at the moment. Since an interrupt storm is when the processor receives too many interrupt requests, I figured it may be a bug.

Try increasing the interrupt storm threshold as a temporary fix; the default is 1000.

Code:
hw.intr_storm_threshold="9000"
Hi Amzo,

Thanks for that. I haven't been able to keep running X windows long enough to see the interrupt storm messages, so I haven't been able to test this fix. The last two times I tried running X, it crashed after about 10 seconds with the EQ overflow. I'll post about that separately.
 

Rob S

Member


Messages: 37

I've noticed that your Linux kernel module is enabled in /etc/rc.conf.
I'm not sure which is the correct way to enable the Linux module, especially in 12.0 :(
In my case, on 11.2, I load the module from /boot/loader.conf.
On 11.2, my nvidia driver module is also loaded from /boot/loader.conf.
On 12.0, I don't have an nvidia GPU to test the nvidia driver with.

Moreover, I would also try this:

I remember that some time ago I had to build the latest nvidia-driver (from the tarball) to play with CUDA on my GTX 960, when 11.*? didn't have it in pkg or the ports tree. The driver worked fine and so did CUDA.

I would even try the nvidia-driver 390.* from pkg install, just to see what would happen :confused:

Edit:
I just verified and corrected how the Linux module is loaded on my 12.0 system. It's from /etc/rc.conf.
toorski, T-Daemon: I just tried uninstalling the ports tree 430.26 driver (patched from 390) and then built and installed the official NVIDIA 430.34 driver from source. With this driver, I got the EQ overflow error message (in Xorg.0.log) after about 10 seconds, and then my system froze. This is the same error I got when I tried the 430.26 driver. I also get "GPU has fallen off the bus" in /var/log/messages.

So it seems the interrupt storms on irq275 are causing random slowdowns but not crashes, and the EQ overflow is causing the actual crash.

I would try re-seating my card inside the PC, but it works fine under heavy load in Win10, which suggests the hardware is fine.
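
To collect both failure signatures in one pass after a crash, something like the following may help. It is a minimal sketch: the function name gpu_error_lines is hypothetical, and the two search strings are just the phrases quoted in this post, so the real log lines may be worded slightly differently.

```sh
# Print every line mentioning either failure signature from this thread,
# prefixed with the file name and line number. Arguments: log files to scan.
gpu_error_lines() {
    grep -H -n -i -E 'EQ overflow|fallen off the bus' "$@"
}

# Typical use on the affected machine (default FreeBSD log locations):
#   gpu_error_lines /var/log/Xorg.0.log /var/log/messages
```

Comparing the timestamps of the matches in the two files should show which message appears first, which is useful when deciding whether the X error is a cause or a symptom.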
 

Attachments


Rob S

Member


Messages: 37

What's the point?
One of the others reported that their system runs fine with this NVIDIA source build. Also, this is a later driver version, which I thought could contain a bug fix. Alas, it has not solved my issue.
 

shkhln

Aspiring Daemon

Reaction score: 203
Messages: 610

Don't do that again.

It's crazy how advice on this forum constantly switches between "never mix packages and ports" and no respect for package management whatsoever. It drives me nuts. (Yes, I know that typically these are different people.)
 

Amzo

Active Member

Reaction score: 35
Messages: 101

One last thing to try: have you tried rebuilding Xorg and its dependencies? I'm wondering if you have any options enabled that could be causing issues. Could you post /etc/make.conf? Other users on the NVIDIA forums reported that they solved the "Failed to query display engine channel state" issue by re-seating the card and memory, as the problem was caused by bad contact / hardware.

Also what is the wattage of your power supply?
 

Rob S

Member


Messages: 37

One last thing to try: have you tried rebuilding Xorg and its dependencies? I'm wondering if you have any options enabled that could be causing issues. Could you post /etc/make.conf? Other users on the NVIDIA forums reported that they solved the "Failed to query display engine channel state" issue by re-seating the card and memory, as the problem was caused by bad contact / hardware.

Also what is the wattage of your power supply?
I will try re-seating stuff. I have also removed my RAID controller. When I booted Linux it reported some IO page fault errors relating to that.

Wattage of PSU is 750W (overkill). As mentioned it's stable under load with Win10.

I'm rebuilding FreeBSD now. Btw, the ports driver install asks "WBINVD Flush CPU caches directly". I left this unselected.

My make.conf is blank.
 

Rob S

Member


Messages: 37

I will try re-seating stuff. I have also removed my RAID controller. When I booted Linux it reported some IO page fault errors relating to that.

Wattage of PSU is 750W (overkill). As mentioned it's stable under load with Win10.

I'm rebuilding FreeBSD now. Btw, the ports driver install asks "WBINVD Flush CPU caches directly". I left this unselected.

My make.conf is blank.
Reseating the graphics card seems to have made no difference. However, one of the connectors on my power cable seemed to be dead (I noticed this when I tried swapping connectors, just in case). To be safe, I swapped the entire cable for a different one (the PSU is modular), which now apparently works (just as the original setup did).

Rebuilding FreeBSD and the driver from the ports tree doesn't seem to have made much difference.
 