I've been very unhappy with the instability of the nvidia.ko binary blobs. One wrong access from it triggers a page fault, and the next thing you know your system has rebooted itself without cleanly unmounting your hard disks. No fun.
Intel and AMD require the new KMS stuff, which seems to need a new console. Not really ready to jump into experimentation there yet. nouveau also needs this, but isn't ported over yet, it would seem.
So that leaves me with two drivers: nv, which is so hopelessly slow that maximizing a terminal window takes around 2s to complete; and vesa (pure software), which is paradoxically faster. But vesa was still much too slow to begin to be usable.
It didn't seem logical how slow the vesa driver was, so I developed some benchmarking software. I'll post the source to that at the end of this post. It's essentially modeled off of the neglected/forgotten libvgl, and sets up a 1280x1024x32bpp VESA LFB mode. It then flood fills the screen and benchmarks how long it takes.
Measurements were taken on a Core 2 Duo E6600 system with DDR3-1333 RAM and a GeForce GT 210 PCIe x16 graphics card. Compilation was with gcc47 -O3. PCIe 2.0 x16 has a theoretical bandwidth limit of 8000MB/s in each direction.
The result was 1.48s to write 60 frames, i.e. 1s worth of video data, so 60fps is not attainable. This works out to 202.7MB/s of bandwidth (60 frames of 1280*1024*4 bytes is 300MB, counting 1MB as 2^20 bytes). Yet if I instead replace vid_mem with malloc(1280*1024*4), the bandwidth increases to an astounding 6696.4MB/s.
So it's reasonable to assume that passing through the PCIe host controller to the video card is incurring tremendous overhead. Uploading textures via OpenGL with the nvidia driver is substantially faster, but likely involves DMA. I am told it's around 5000MB/s from someone with a similar card.
Thus, I started looking into how hobbyist OS developers addressed the incredible bandwidth limitations. I found out about MTRR (memory-type range registers) present on x86/amd64 CPUs. It seems that the optimal setting for the VESA LFB is write-combine, and that the way to control these on FreeBSD is through memcontrol. Out of the box, FreeBSD leaves the VESA LFB range uncacheable, so every write goes out to the card individually.
First, find out the LFB for your VESA modes:
# vidcontrol -i mode
On my card, all of the VESA 32-bit modes are at 0xd1000000. This will be different for you. Also round up the frame buffer size to the next power of two: 1280*1024*4 = 0x500000, which rounds up to 0x800000. Now enable write-combine for this range:
# memcontrol set -b 0xd1000000 -l 0x800000 -o BIOS write-combine
And if you have a need to remove it:
# memcontrol clear -b 0xd1000000 -l 0x800000
To view current settings:
# memcontrol list
The results were spectacular. My test dropped to 0.25s (on average) to complete. The bandwidth increased from 202.7MB/s up to 1200MB/s.
It's still nowhere near the theoretical peak, but at this point, the vesa driver is lightning fast. I can play back movies fullscreen* with zero stuttering. I can maximize/minimize windows with no delay. There's obviously no OpenGL or X-Video acceleration, but it's perfectly usable if you aren't playing 3D games. (* I suspect it'll probably struggle again at 2560x1600, or possibly at 1920x1200.)
So if you're stuck without hardware acceleration, or you find it buggy, this is a great way to produce an absolutely rock-solid desktop environment. Only issue is you'll want to go for an AMD card that has widescreen VESA resolutions.
If anyone has further acceleration tips, or knows of any dangers in changing the MTRR on the VESA LFB, please let me know.
Code:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/fbio.h>
#include <sys/kbio.h>
#include <sys/mman.h>
#include <sys/consio.h>
#include <sys/memrange.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main() {
    //remember the current video mode so it can be restored on exit
    int oldMode;
    ioctl(0, CONS_GET, &oldMode);

    video_info_t modeInfo = {0};
    video_adapter_info_t adpInfo = {0};
    modeInfo.vi_mode = M_VESA_FULL_1280;
    ioctl(0, CONS_MODEINFO, &modeInfo);

    //switch the console into the 1280x1024x32bpp VESA LFB graphics mode
    ioctl(0, KDENABIO, 0);
    ioctl(0, VT_WAITACTIVE, 0);
    ioctl(0, KDSETMODE, KD_GRAPHICS);
    ioctl(0, SW_VESA_FULL_1280, 0);
    ioctl(0, CONS_ADPINFO, &adpInfo);
    ioctl(0, CONS_SETWINORG, 0);

    //int vid_fd = open("/dev/mem", O_RDWR);
    //open(stdout, 0) == open("/dev/mem", adpInfo.va_window); but does not require root privileges for /dev/mem access
    //MAP_SHARED added: mmap(2) wants MAP_SHARED or MAP_PRIVATE, and shared is the one that makes sense for device memory
    uint8_t* vid_mem = (uint8_t*)mmap(0, adpInfo.va_window_size, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, STDOUT_FILENO, 0); //vid_fd, adpInfo.va_window);
    //close(vid_fd);

    clock_t a = 0, b = 0;
    if(vid_mem != MAP_FAILED) {
        //flood fill the frame buffer 60 times (1s worth of video at 60fps) and time it
        a = clock();
        for(unsigned f = 0; f < 60; f++) {
            memset(vid_mem, f, adpInfo.va_window_size);
        }
        b = clock();
        munmap(vid_mem, adpInfo.va_window_size);
        //leave the last frame on screen for 5s
        usleep(5 * 1000 * 1000);
    }

    //restore the original video mode and return the console to text mode
    ioctl(0, _IO('S', oldMode), 0);
    ioctl(0, KDDISABIO, 0);
    ioctl(0, KDSETMODE, KD_TEXT);
    struct vt_mode smode = {0};
    smode.mode = VT_AUTO;
    ioctl(0, VT_SETMODE, &smode);

    //time in seconds taken by the 60 fills
    printf("%f\n", (float)(b - a) / CLOCKS_PER_SEC);
    return 0;
}