Xorg vesa driver - massive speedup using MTRR write-combine

I've been very unhappy with the instability of the nvidia.ko binary blobs. One wrong access from it triggers a page fault, and the next thing you know your system has rebooted itself without cleanly unmounting your hard disks. No fun.

Intel and AMD require the new KMS stuff, which seems to need a new console. Not really ready to jump into experimentation there yet. nouveau also needs this, but isn't ported over yet, it would seem.

So that leaves me with two drivers: nv, which is so hopelessly slow that maximizing a terminal window takes around 2s to complete; and vesa (pure software), which is paradoxically faster. But vesa was still much too slow to begin to be usable.

It didn't seem logical how slow the vesa driver was, so I developed some benchmarking software. I'll post the source to that at the end of this post. It's essentially modeled off of the neglected/forgotten libvgl, and sets up a 1280x1024x32bpp VESA LFB mode. It then flood fills the screen and benchmarks how long it takes.

Measurements are taken on a Core 2 Duo E6600 system with DDR3-1333 RAM and a Geforce GT210 PCIe x16 graphics card. Compilation was with gcc47 -O3. PCIe x16 has a theoretical bandwidth limit of 8000MB/s.

The result was 1.48s to simulate 1s worth of video data, so not possible to obtain 60fps. This works out to 202.7MB/s of bandwidth. Yet if I instead replace vid_mem with malloc(1280*1024*4), the bandwidth increases to an astounding 6696.4MB/s.

So it's reasonable to assume that passing through the PCIe host controller to the video card is incurring tremendous overhead. Uploading textures via OpenGL with the nvidia driver is substantially faster, but likely involves DMA. I am told it's around 5000MB/s from someone with a similar card.

Thus, I started looking into how hobbyist OS developers addressed the incredible bandwidth limitations. I found out about MTRR (memory-type range registers) present on x86/amd64 CPUs. It seems that the optimal setting for the VESA LFB is write-combine, and that the way to control these on FreeBSD is through memcontrol. Out of the box, FreeBSD sets the VESA LFB range to write-back.

First, find out the LFB for your VESA modes:

# vidcontrol -i mode

On my card, all of the VESA 32-bit modes are at 0xd1000000. This will be different for you. Also round the frame buffer size up to the next power of two: pow2(1280*1024*4) = 0x800000. Now enable write-combine for this range:

# memcontrol set -b 0xd1000000 -l 0x800000 -o BIOS write-combine

And if you have a need to remove it:

# memcontrol clear -b 0xd1000000 -l 0x800000

To view current settings:

# memcontrol list

The results were spectacular. My test dropped to 0.25s (on average) to complete. The bandwidth increased from 202.7MB/s up to 1200MB/s.

It's still nowhere near the theoretical peak, but at this point, the vesa driver is lightning fast. I can play back movies fullscreen* with zero stuttering. I can maximize/minimize windows with no delay. There's obviously no OpenGL or X-Video acceleration, but it's perfectly usable if you aren't playing 3D games. (* I suspect it'll probably struggle again at 2560x1600, or possibly at 1920x1200.)

So if you're stuck without hardware acceleration, or you find it buggy, this is a great way to produce an absolutely rock-solid desktop environment. Only issue is you'll want to go for an AMD card that has widescreen VESA resolutions.

If anyone has further acceleration tips, or knows of any dangers in changing the MTRR on the VESA LFB, please let me know.

Code:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/fbio.h>
#include <sys/kbio.h>
#include <sys/mman.h>
#include <sys/consio.h>
#include <sys/memrange.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main() {
  int oldMode;
  ioctl(0, CONS_GET, &oldMode);

  video_info_t modeInfo = {0};
  video_adapter_info_t adpInfo = {0};
  modeInfo.vi_mode = M_VESA_FULL_1280;
  ioctl(0, CONS_MODEINFO, &modeInfo);

  ioctl(0, KDENABIO, 0);
  ioctl(0, VT_WAITACTIVE, 0);
  ioctl(0, KDSETMODE, KD_GRAPHICS);
  ioctl(0, SW_VESA_FULL_1280, 0);
  ioctl(0, CONS_ADPINFO, &adpInfo);
  ioctl(0, CONS_SETWINORG, 0);

  //Map the frame buffer through the console descriptor. This is equivalent to opening
  ///dev/mem and mapping at offset adpInfo.va_window, but does not require root privileges.
  //MAP_SHARED makes the writes go to the device; as far as I know, newer FreeBSD releases
  //reject a file mapping that specifies neither MAP_SHARED nor MAP_PRIVATE.
  uint8_t* vid_mem = (uint8_t*)mmap(0, adpInfo.va_window_size, PROT_READ | PROT_WRITE, MAP_SHARED, STDOUT_FILENO, 0);

  clock_t a = 0, b = 0;
  if(vid_mem != MAP_FAILED) {
    a = clock();
    for(unsigned f = 0; f < 60; f++) {
      memset(vid_mem, f, adpInfo.va_window_size);
    }
    b = clock();
    munmap(vid_mem, adpInfo.va_window_size);
    usleep(5 * 1000 * 1000);
  }

  ioctl(0, _IO('S', oldMode), 0);
  ioctl(0, KDDISABIO, 0);
  ioctl(0, KDSETMODE, KD_TEXT);
  struct vt_mode smode = {0};
  smode.mode = VT_AUTO;
  ioctl(0, VT_SETMODE, &smode);

  printf("%f\n", (float)(b - a) / CLOCKS_PER_SEC);

  return 0;
}
 
This is really cool! Thank you for posting it. There might be members of the mailing lists who would appreciate it, but I'm not sure which list. freebsd-x11, maybe.
 
You may try uncached access as well. It sounds counterproductive, but it may help. I'll explain later if wanted, but now I'm off to my workplace.
 
Some more testing done.

My Geforce GTX760 OC actually has a 2560x1600x16bpp VESA2 LFB mode. No 32bpp version, but I planned to use 16bpp for half the bandwidth requirement anyway. Amazingly, it does not have 1920x(1080|1200). Good thing my main PC is a 2560x1600 display. I think nVidia and AMD intentionally choose modes that they don't think anyone would ever actually use for their VESA display lists. It's really ridiculous the modes available on some of these cards.

My Quadro FX580 supports 1440x900x32bpp, but no other widescreen modes. With the Quadro, my test remains 1.48s before write-combine, and drops to 0.148s after. So it's a full 10x speedup on this card. Both no mapping and uncacheable result in the same 1.48s speed, only write-combine gives me 0.148s. Oddly, write-back deadlocks my system. Which is weird, because my primary PC defaults to write-back for the VESA LFB region and it works fine there.

My primary system has a really evil MTRR: it maps 0x0/0x400000000 to write-back. Since I am limited to only eight entries, and some were already used, I had to unmap this and create a few new ones: 0x0/0x80000000 + 0x80000000/0x40000000 + 0xc0000000/0x20000000 + 0xf9000000/0x800000 (VESA LFB) + 0x100000000/0x100000000 + 0x200000000/0x200000000. End result is that 0xe0000000-0xf8ffffff and 0xf9800000-0xffffffff are now unmapped, but those areas don't appear to be used anyway. It's really fun when you unmap write-back on the low 2GB of the address space. The system starts running about a hundred times slower until you remap it back. Point of all this: it may not be super simple to set up MTRR on some boxes.

The new replacement which isn't as limited in mapping ranges/counts seems to be PAT (page attribute tables), which FreeBSD supports. But I'm not yet sure how to configure those, or if I even can for the Xorg VESA driver. I'll look into it tomorrow.

On the main PC, I can actually drive 2560x1600x16bpp VESA2 LFB very well after the MTRR adjustment. Showing/hiding a maximized window seems instantaneous with no redrawing lags, whereas it stuttered for about 100ms before the write-combine adjustment. A fullscreen scaled H.264 video tends to play back at about 30-40fps. So it looks like you're capped at about 1080p if you want 60fps video playback. Aside from fullscreen animated video, I can't tell any difference between vesa and the official nvidia driver, which is just wonderful.

There doesn't appear to be any way to use a video mode not in your VESA display mode list*. The modes are stored in your VBIOS, and it doesn't seem that FreeBSD lets you substitute the in-memory copy of it like you can do with EDID. But you can hex edit the modes in the file (replace one you don't use with a resolution you want) and reflash the card using a DOS utility. Seems to be a 50/50 chance of working if you do that, although user error might be why the success rate is so low. Really glad I don't have to do that on my main PC. (* VBIOS is basically 16-bit x86 code that does the video-card specific writes to change video modes. I'm actually really curious how FreeBSD/amd64 manages to execute 16-bit code in long mode where VM86 is not available, but that's another discussion.) Fun side tangent: I actually had to take apart a monitor and physically lift a pin on the EDID flash memory chip (to disable write protection) so that I could reflash a monitor that reported HDTV modes, which makes nVidia cards forcefully enable horrifying overscan+comb filters. Now that was fun.

Lastly, looks like the 'successor' to VESA is going to be UEFI/GOP. When you boot in pure UEFI mode, there's no VGA or VESA at all. I "can't wait" to see what fun new problems that causes.
 
byuu said:
Some more testing done.
...
It's really fun when you unmap write-back on the low 2GB of the address space. The system starts running about a hundred times slower until you remap it back. Point of all this: it may not be super simple to set up MTRR on some boxes.

That is not really unexpected, is it? That is where the RAM is in physical address space, so forcing a write to it when something changes in the CPU cache will slow things down immensely. Case in point: when you set the frame buffer memory to uncached, the performance did not change, meaning that you only do writes to the frame buffer. Now that is not unexpected, but I thought that it might improve performance, as it would only write maybe 4 bytes in one action while a complete cache line is 32 bytes in most cases. But if writing a full cache line back to memory (write-through) is as fast as writing smaller chunks more often, then the memory interface is tuned for these sizes. That is good to know.

What also could speed up things would be super pages to map the frame buffer, as it is only one object with one possible mapping. This would reduce table walks from the MMU.

byuu said:
There doesn't appear to be any way to use a video mode not in your VESA display mode list*. The modes are stored in your VBIOS, and it doesn't seem that FreeBSD lets you substitute the in-memory copy of it like you can do with EDID. But you can hex edit the modes in the file (replace one you don't use with a resolution you want) and reflash the card using a DOS utility. Seems to be a 50/50 chance of working if you do that, although user error might be why the success rate is so low. Really glad I don't have to do that on my main PC. (* VBIOS is basically 16-bit x86 code that does the video-card specific writes to change video modes. I'm actually really curious how FreeBSD/amd64 manages to execute 16-bit code in long mode where VM86 is not available, but that's another discussion.) Fun side tangent: I actually had to take apart a monitor and physically lift a pin on the EDID flash memory chip (to disable write protection) so that I could reflash a monitor that reported HDTV modes, which makes nVidia cards forcefully enable horrifying overscan+comb filters. Now that was fun.

Hmm, the smell of soldering in the morning. :)

As far as I know there is actually an emulator around somewhere in xorg to do this, switching CPU modes around would be insane.
But re-flashing a monitor? I wish I had the tools, and the time, to do that. And the knowledge, of course.

byuu said:
Lastly, looks like the 'successor' to VESA is going to be UEFI/GOP. When you boot in pure UEFI mode, there's no VGA or VESA at all. I "can't wait" to see what fun new problems that causes.

Me too, there will be all kinds of enhancements in user experience. Like requiring signed code to actually call anything there, so you will not simply copy out the frame buffer. No banking trojan shall see what you do there, shall it? That's the reason, honestly. (We need a smiley banging its head into a wall.)
 
Found something interesting. When running the official nvidia driver, nvidia-settings lets you toggle the clocking mode of the chip. The important one being the memory transfer rate. Default is adaptive, which is the slowest mode when the chip isn't being heavily utilized. Good for power. Bad for VESA. Obviously with no nvidia binary driver, you'll be in the slowest mode always.

If I force my GPU into the performance mode always, then switch to another TTY and run my benchmark, the bandwidth speed doubles.

This is one of the reasons why nouveau is so slow, in fact. After years of trying, they still aren't able to get the chips out of the slowest clock speeds. So there's no real hope for us to kick up our GPU clock speeds, short of reflashing VBIOS to set the default modes. But then you might not want to burn your GPU at max speed 24/7 either.

...

Still not having much luck figuring out PAT under FreeBSD. BSD may have great documentation, but all OSes seem to break down on the kernel-level API documentation.

Crivens said:
That is not really unexpected, is it?

Not really, but I guess the degree of slowdown was surprising. Until recently I hadn't realized MTRR existed, let alone that it was so important to performance.

Crivens said:
while a complete cache line is 32 bytes in most cases

Read the manual from Intel on MTRR, looks like it was 32 bytes on P6, but reserved to be any size in the future. But is likely still 32 bytes. A shame we can't boost that somehow. It stands to reason that 64 bytes would give us an even bigger speedup, so long as we always kept framebuffer writes sequential (no idea if Xorg/vesa uses dirty rectangles.)

Crivens said:
What also could speed up things would be super pages to map the frame buffer, as it is only one object with one possible mapping. This would reduce table walks from the MMU.

Any ideas on how one might go about doing that? I'm up for experimentation.

Crivens said:
Hmm, the smell of soldering in the morning.

Worse. I just used a sewing needle to bend the pin up. I know that floating pins can technically get stuck in either state, but they usually end up reading as logic low. And indeed, it got the job done and that was that. All for OS X too, the only OS that you can't override EDID on through the nVidia driver. Worst part: the monitor's power supply died a year later. Was a lovely 16:10 LP246WP P-MVA panel, which they don't really make any more.

I can/would solder it to a ground, but the pitch sizes on those flash chips are horrifying.

Crivens said:
As far as I know there is actually an emulator around somewhere in xorg to do this, switching CPU modes around would be insane.

Yeah I was thinking it might be a 16-bit x86 emulator. The alternative would be a transition to protected mode, then another to VM86. Then bouncing back up both. Lots and lots of potential problems there. Even just protected<>VM86 was something I was never able to pull off on my own. But in Xorg itself? Wow. I had read that FreeBSD didn't support VESA in the amd64 port for a good while, but couldn't find where they added it in.

Crivens said:
Me too, there will be all kinds of enhancements in user experience. Like requiring signed code to actually call anything there, so you will not simply copy out the frame buffer. No banking trojan shall see what you do there, shall it? That's the reason, honestly.

I am still unsure how these Microsoft-signed shims are going to work. It sounds like a requirement is that nothing be loaded into kernel space. FreeBSD and/or Linux allowing that anyway (as many are saying signed kernel extensions will be optional) would seem to be grounds for key revocation.

As always, less about content protection and more about control. Even in the absolute worst case of perfected DRM, it'd only take one single person wiring up a bus sniffer right between the monitor display matrix and the output from HDCP to a RAID array, a quick H.264 re-encode, and the content is available for everyone on TPB. You and I aren't going to do that, but only one person has to.

UEFI SecureBoot has a lot more to do with locking down new platforms like ARM tablets, where it is required to not be optional to run Windows. They want that Apple-style app store money for every program sold.

Really wish Microsoft and Apple would move entirely to tablets, and leave desktops for people that want to actually own their hardware and do real work. I'll be quite alright with never being able to watch Netflix on my server. That's what I have a Roku box for.
 
I am trying to use memcontrol(8), but it throws an error:
Code:
# memcontrol set -b 0xd0000000 -l 0x800000 -o BIOS write-combine
memcontrol: can't set range: Invalid argument
Anybody got an idea what's wrong? This is the output of # vidcontrol -i mode:
Code:
mode#  flags  type  size  font  window  linear buffer
------------------------------------------------------------------------------
  0 (0x000) 0x00000001 T 40x25  8x8  0xb8000 32k 32k 0x00000000 32k
  1 (0x001) 0x00000001 T 40x25  8x8  0xb8000 32k 32k 0x00000000 32k
  2 (0x002) 0x00000001 T 80x25  8x8  0xb8000 32k 32k 0x00000000 32k
  3 (0x003) 0x00000001 T 80x25  8x8  0xb8000 32k 32k 0x00000000 32k
  4 (0x004) 0x00000003 G 320x200x2 C  8x8  0xb8000 32k 32k 0x00000000 32k
  5 (0x005) 0x00000003 G 320x200x2 C  8x8  0xb8000 32k 32k 0x00000000 32k
  6 (0x006) 0x00000003 G 640x200x1 C  8x8  0xb8000 32k 32k 0x00000000 32k
13 (0x00d) 0x00000003 G 320x200x4 4  8x8  0xa0000 64k 64k 0x00000000 256k
14 (0x00e) 0x00000003 G 640x200x4 4  8x8  0xa0000 64k 64k 0x00000000 256k
16 (0x010) 0x00000003 G 640x350x2 2  8x14  0xa0000 64k 64k 0x00000000 128k
18 (0x012) 0x00000003 G 640x350x4 4  8x14  0xa0000 64k 64k 0x00000000 256k
19 (0x013) 0x00000001 T 40x25  8x14  0xb8000 32k 32k 0x00000000 32k
20 (0x014) 0x00000001 T 40x25  8x14  0xb8000 32k 32k 0x00000000 32k
21 (0x015) 0x00000001 T 80x25  8x14  0xb8000 32k 32k 0x00000000 32k
22 (0x016) 0x00000001 T 80x25  8x14  0xb8000 32k 32k 0x00000000 32k
23 (0x017) 0x00000001 T 40x25  8x16  0xb8000 32k 32k 0x00000000 32k
24 (0x018) 0x00000001 T 80x25  8x16  0xb8000 32k 32k 0x00000000 32k
26 (0x01a) 0x00000003 G 640x480x4 4  8x16  0xa0000 64k 64k 0x00000000 256k
27 (0x01b) 0x00000003 G 640x480x4 4  8x16  0xa0000 64k 64k 0x00000000 256k
28 (0x01c) 0x00000003 G 320x200x8 P  8x8  0xa0000 64k 64k 0x00000000 64k
30 (0x01e) 0x00000001 T 80x50  8x8  0xb8000 32k 32k 0x00000000 32k
32 (0x020) 0x00000001 T 80x30  8x16  0xb8000 32k 32k 0x00000000 32k
34 (0x022) 0x00000001 T 80x60  8x8  0xb8000 32k 32k 0x00000000 32k
37 (0x025) 0x00000003 G 320x240x8 V  8x8  0xa0000 64k 64k 0x00000000 256k
112 (0x070) 0x00000000 T 80x43  8x8  0xb8000 32k 32k 0x00000000 32k
113 (0x071) 0x00000001 T 80x43  8x8  0xb8000 32k 32k 0x00000000 32k
256 (0x100) 0x0000001f G 640x400x8 P  8x16  0xa0000 64k 64k 0xd0000000 250k
257 (0x101) 0x0000001f G 640x480x8 P  8x16  0xa0000 64k 64k 0xd0000000 300k
259 (0x103) 0x0000001f G 800x600x8 P  8x14  0xa0000 64k 64k 0xd0000000 487k
261 (0x105) 0x0000001f G 1024x768x8 P  8x16  0xa0000 64k 64k 0xd0000000 768k
263 (0x107) 0x0000001f G 1280x1024x8 P  8x16  0xa0000 64k 64k 0xd0000000 1280k
272 (0x110) 0x0000001f G 640x480x16 D  8x16  0xa0000 64k 64k 0xd0000000 600k
273 (0x111) 0x0000001f G 640x480x16 D  8x16  0xa0000 64k 64k 0xd0000000 600k
275 (0x113) 0x0000001f G 800x600x16 D  8x14  0xa0000 64k 64k 0xd0000000 975k
276 (0x114) 0x0000001f G 800x600x16 D  8x14  0xa0000 64k 64k 0xd0000000 975k
278 (0x116) 0x0000001f G 1024x768x16 D  8x16  0xa0000 64k 64k 0xd0000000 1536k
279 (0x117) 0x0000001f G 1024x768x16 D  8x16  0xa0000 64k 64k 0xd0000000 1536k
281 (0x119) 0x0000001f G 1280x1024x16 D  8x16  0xa0000 64k 64k 0xd0000000 2560k
282 (0x11a) 0x0000001f G 1280x1024x16 D  8x16  0xa0000 64k 64k 0xd0000000 2560k
289 (0x121) 0x0000001f G 640x480x32 D  8x16  0xa0000 64k 64k 0xd0000000 1200k
290 (0x122) 0x0000001f G 800x600x32 D  8x14  0xa0000 64k 64k 0xd0000000 1950k
291 (0x123) 0x0000001f G 1024x768x32 D  8x16  0xa0000 64k 64k 0xd0000000 3072k
292 (0x124) 0x0000001f G 1280x1024x32 D  8x16  0xa0000 64k 64k 0xd0000000 5120k
323 (0x143) 0x0000001f G 1400x1050x8 P  8x16  0xa0000 64k 64k 0xd0000000 1443k
325 (0x145) 0x0000001f G 1400x1050x16 D  8x16  0xa0000 64k 64k 0xd0000000 2887k
326 (0x146) 0x0000001f G 1400x1050x32 D  8x16  0xa0000 64k 64k 0xd0000000 5775k
355 (0x163) 0x0000001f G 1280x960x8 P  8x16  0xa0000 64k 64k 0xd0000000 1200k
357 (0x165) 0x0000001f G 1280x960x16 D  8x16  0xa0000 64k 64k 0xd0000000 2400k
358 (0x166) 0x0000001f G 1280x960x32 D  8x16  0xa0000 64k 64k 0xd0000000 4800k
371 (0x173) 0x0000001f G 1600x1200x8 P  8x16  0xa0000 64k 64k 0xd0000000 1875k
373 (0x175) 0x0000001f G 1600x1200x16 D  8x16  0xa0000 64k 64k 0xd0000000 3750k
374 (0x176) 0x0000001f G 1600x1200x32 D  8x16  0xa0000 64k 64k 0xd0000000 7500k
465 (0x1d1) 0x0000001f G 1920x1080x8 P  8x16  0xa0000 64k 64k 0xd0000000 2025k
466 (0x1d2) 0x0000001f G 1920x1080x16 D  8x16  0xa0000 64k 64k 0xd0000000 4050k
468 (0x1d4) 0x0000001f G 1920x1080x32 D  8x16  0xa0000 64k 64k 0xd0000000 8100k
 
I did some research and it seems that BIOS write-back is preventing write-combine from being set. Any advice on how to proceed?
Relevant snippet from memcontrol list:
Code:
0x0/0x200000000 BIOS write-back set-by-firmware active
0x200000000/0x100000000 BIOS write-back set-by-firmware active
0x300000000/0x40000000 BIOS write-back set-by-firmware active
0xc0000000/0x40000000 BIOS uncacheable set-by-firmware active
0xbff00000/0x100000 BIOS uncacheable set-by-firmware active
 
OK, I finally got it working by remapping using memcontrol(8). Results are really fast. Benchmark improved from 8.16s to 0.21s! But I'm still unsure if there are any side effect because of the unmapped ranges. Any insights on this?
Code:
0x0/0x80000000 BIOS write-back active
0x200000000/0x100000000 BIOS write-back set-by-firmware active
0x300000000/0x40000000 BIOS write-back set-by-firmware active
0xd0000000/0x800000 BIOS write-combine active
0xbff00000/0x100000 BIOS uncacheable set-by-firmware active
0x80000000/0x40000000 BIOS write-back active
0xc0000000/0x10000000 BIOS uncacheable active
0x100000000/0x100000000 BIOS write-back active
 
Care to share how you managed that? I also have a 'global' section that is marked write-back, just like your
0x0/0x200000000 BIOS write-back set-by-firmware active

I was able to clear the other smaller sections, but the global chunk seemed to resist any attempt to clear it.
 
You may have to adapt this to your configuration:

Code:
memcontrol clear -b 0x0 -l 0x200000000
memcontrol set -b 0x0 -l 0x80000000 -o BIOS write-back
memcontrol set -b 0x80000000 -l 0x40000000 -o BIOS write-back
memcontrol set -b 0xc0000000 -l 0x10000000 -o BIOS write-back
memcontrol set -b 0x100000000 -l 0x100000000 -o BIOS write-back
memcontrol set -b 0xc0000000 -l 0x10000000 -o BIOS uncacheable
memcontrol clear -b 0xc0000000 -l 0x40000000
memcontrol set -b 0xd0000000 -l 0x800000 -o BIOS write-combine

What does your relevant snap from memcontrol list show as output?
 
Thank you very much for your reply! In fact, I was expecting something rather different to be done. But I guess the problem is probably machine dependent, because when I tried the first step, clearing the global write-back chunk, my machine sort of 'hung': the shell didn't come back, no input was accepted (including interrupt sequences, Ctrl+Alt+Del, etc.), and virtual terminal switching was not possible, although the console screen saver still worked. The relevant part of the output is this:

Code:
0x0/0x100000000 BIOS write-back set-by-firmware active
0x100000000/0x20000000 BIOS write-back set-by-firmware active
0xe0000000/0x20000000 BIOS uncacheable set-by-firmware active
0xde000000/0x2000000 BIOS uncacheable set-by-firmware active
0xdd000000/0x1000000 BIOS uncacheable set-by-firmware active
0x11fe00000/0x200000 BIOS uncacheable set-by-firmware active

If I do # memcontrol clear -b 0x0 -l 0x100000000, then the system 'hangs'. Apart from the first one, I am able to clear all of the items above.
 
After clearing the large chunk, the system is expected to become extremely slow, so it may seem to hang.
It becomes responsive again after the other ranges are set to write-back / write-combine. So it's good practice to use a shell script to execute all the memcontrol commands consecutively.
Also, at least on my system, it's mandatory to set the last range to uncacheable before clearing (as seen in the code in the posts above). Otherwise the system reboots.
 
Very sorry for the delay! The machine was out of my reach for the last few days, until several hours ago.

Thanks to jurgenxiv's help, I was able to set the desired range to write-combine mode. Using the test program provided by byuu, the result showed a roughly 150-fold improvement, from around 4.62s to around 0.03s. I also don't know if there is any side effect. Nonetheless, while waiting for driver support for my Intel HD 4600 integrated GPU to arrive, I think that this provides a comfortable way to use Xorg.

Curiously, if I set the memory using a script, then the procedure can be done within 0.51s, whilst if I call memcontrol from the command line, then it takes forever for the global clearing step memcontrol clear -b 0x0 -l 0x100000000 to complete. As a result, it is very practical to do the memory reassignment at boot.

Thank you so much guys!
 
Amazing, where did you get the idea?

config MTRR
def_bool y
prompt "MTRR (Memory Type Range Register) support" if EXPERT
---help---
On Intel P6 family processors (Pentium Pro, Pentium II and later)
the Memory Type Range Registers (MTRRs) may be used to control
processor access to memory ranges. This is most useful if you have
a video (VGA) card on a PCI or AGP bus. Enabling write-combining
allows bus write transfers to be combined into a larger transfer
before bursting over the PCI/AGP bus. This can increase performance
of image write operations 2.5 times or more. Saying Y here creates a
/proc/mtrr file which may be used to manipulate your processor's
MTRRs. Typically the X server should use this.

...
 
For a real fix, you should really be using PAT instead of MTRR. That is what the nvidia-driver is doing. It does this for various texture buffers, not just the frame buffer. However, the frame buffer would be easy to do, and it's also something the EFI frame buffer could easily do. In the kernel, the low-level API to change pages to WC is to use pmap_change_attr(). However, that is a bit clunky to use directly. Instead, you can create an sglist(9) that describes the frame buffer's physical address range (probably a very simple list with one entry) and create an OBJT_SG VM object backed by that sglist(9). You can tag that VM object to use WC mappings via vm_object_set_memattr() and you can also set the object to use superpages starting at offset 0 by default. You then just need Xorg to mmap this object when mapping the frame buffer. I'm not sure how Xorg maps the frame buffer for VESA (perhaps it is using /dev/mem, in which case this isn't quite so easy to accomplish). If you can fix Xorg to map on a specific cdev for frame buffers, then the VM object approach is fairly easy to implement using d_mmap_single() to return the frame buffer object for attempts to map the frame buffer.

The reason to prefer PAT over MTRR is that PAT takes precedence and is much more flexible.
 