Load pattern differences between file-backed mmap and plain file access

The setup is simple: a large (0.2-1 TB) file with totally random access and small (a few KB) records (the hardest workload for spinning disks). The underlying filesystem is UFS.

We are noticing some interesting performance differences between the following two methods:

  1. mmap the entire file and write to random locations through the mapping (i.e. as memory), at maximum speed.
    • performance is good, "100".
    • process itself uses 3-4% of CPU
    • pagedaemon and geom together use 80-100% of one core.
    • the affected disk is 100% busy
    • system is non-responsive; it takes ages to do anything (like running ps).
  2. open the file and do random lseeks and writes, in exactly the same pattern as #1.
    • performance is a bit lower, "80"
    • process itself uses 80-90% CPU
    • pagedaemon and geom together use 1-2% of one core.
    • the affected disk is 10-20% busy
    • system is fully responsive
In both cases, the entire holding file is first pre-filled with 0s to avoid any startup issues.
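
For concreteness, here is roughly what the two methods boil down to in code (a stripped-down sketch only; the path, size, and record handling below are placeholders, not our actual code):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REC_SIZE 4096                 /* an assumed "few KB" record */

int main(void)
{
    off_t size = 1ULL << 40;          /* 1 TB holding file, pre-filled with 0s */
    int fd = open("/data/bigfile", O_RDWR);   /* placeholder path */
    char rec[REC_SIZE] = { 0 };
    if (fd < 0)
        return 1;

    /* Method 1: map the whole file once, then store records as memory. */
    char *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base != MAP_FAILED) {
        off_t off = (random() % (size / REC_SIZE)) * REC_SIZE;
        memcpy(base + off, rec, REC_SIZE);    /* dirties a page; the pagedaemon flushes it */
        munmap(base, size);
    }

    /* Method 2: keep the descriptor and issue positioned writes. */
    off_t off = (random() % (size / REC_SIZE)) * REC_SIZE;
    if (pwrite(fd, rec, REC_SIZE, off) != REC_SIZE)   /* lseek()+write() in one call */
        return 1;

    close(fd);
    return 0;
}
```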

We'll go for #2, but I'd appreciate any insights into why memory/disk flush management for mmap pages is so dramatically different from (and worse than) that for file buffers.
 
My impression is that mmap is an often misunderstood facility that, to make things worse, frequently lends itself to being mistaken for a cheap ticket to both high performance and convenience.

Your use is kind of the anti-case. You are working on small random pieces of a (no matter how gigantic) large file. That is, mmap has to do pretty much what any classical off-board (file) operation would have to do, too, albeit without the application-specific know-how. Typically mmap works by (more or less brainlessly) doing what one would do in the classical approach, while offering memory-style access mechanisms on top.
Sure, there are cases where mmap comes in very handy and may even offer some speed advantage thanks to being close to, or even inside, the kernel's memory management. But no matter what, it can't change the world (e.g. the fact that spindles are slow and that there is a file system layer doing a job that needs to be done), and it doesn't know about your application; it simply shifts data around and offers a comfort layer.

What you experience, in my mind's eye, is that harsh reality. Or, seen from the other side, with the classical approach (your case 2) you see what a fine - and considerate (not at all blocking the whole system) - job the FreeBSD storage subsystem does. The major, or at least a very decisive, difference is probably that in case 2, the classical approach, the system knows (because you use it that way) that you're dealing with small, well-cacheable chunks and file accesses; in other words, you give the system a chance to put its algorithms, enhanced and fine-tuned over many years, to work.

Now, of course, mmap can be advantageous in certain scenarios, probably all of which share as common ground small files being read/written as structured data, simply due to it being deeply embedded in the system and close to memory handling. In many if not most cases, however, it seems to be non- or misunderstood, and mistaken as "cool", "awesome" and cheap.
 
The problem could well be that mmap creates a mapping of the whole file. Can you imagine how large the page table for 1 TB of space is? I would start looking at the code behind this and see if there is a performance impact with larger maps. You may be faster with a set of windows of maybe 64 MB, kept in an LRU list, re-mapping the least recently used one onto a new place in the file when needed instead of mapping the whole file. If my memory serves me right, this would still keep the pages on the cache list, since they are bound to the inode, and unmapping them from your process would not change that. Could be worth a try.
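
A rough sketch of that windowing idea, just to make it concrete (the 64 MB window size comes from above; the window count and the trivial round-robin stand-in for real LRU bookkeeping are mine, purely illustrative):

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>

#define WIN_SIZE   (64UL << 20)   /* 64 MB windows, as suggested above */
#define WIN_COUNT  16             /* arbitrary number of cached windows */

struct window {
    char  *base;                  /* NULL if the slot is unused */
    off_t  start;                 /* file offset this window covers */
};

static struct window wins[WIN_COUNT];
static int next_victim;           /* round-robin stand-in for real LRU */

/* Return a pointer to file offset `off` inside some mapped window,
 * remapping a victim window when the offset is not covered yet. */
static char *window_ptr(int fd, off_t off)
{
    off_t start = off - (off % WIN_SIZE);

    for (int i = 0; i < WIN_COUNT; i++)
        if (wins[i].base != NULL && wins[i].start == start)
            return wins[i].base + (off - start);

    /* Miss: unmap one window and map the wanted region in its place.
     * The pages it covered stay cached, since they belong to the file. */
    struct window *w = &wins[next_victim];
    next_victim = (next_victim + 1) % WIN_COUNT;
    if (w->base != NULL)
        munmap(w->base, WIN_SIZE);

    w->base = mmap(NULL, WIN_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, start);
    if (w->base == MAP_FAILED) {
        w->base = NULL;
        return NULL;
    }
    w->start = start;
    return w->base + (off - start);
}
```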
 
mmap() functionality mostly serves the purpose of preventing double/triple caching by exposing the OS's filesystem cache directly to the application, while stream I/O operations require the application to allocate its own memory to hold copies of the I/O data.

Doing so, however, the OS needs to keep the semantics: if one process writes into a file through a page backed by mmap(), a second process which has a descriptor for that file and tries to access it needs to see the changes immediately. The MAP_NOSYNC flag only specifies that data may be written to backing store out of order; the change itself becomes visible to any other process immediately (which might be troublesome if the system crashes after other processes have learned that the file contains data that was then lost in the crash). This is done by having the page-fault handler refresh the corresponding page (which is shared with the filesystem cache) and mark it read-only again (to trigger another page fault on the next write access), so other processes reading that fragment of the file at a later time hit the cache, which already contains the update.
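
If one does stay with mmap for writing, the FreeBSD-specific MAP_NOSYNC flag mentioned above, combined with explicit msync(), is one way to take the flush timing out of the pagedaemon's hands; a minimal sketch under that assumption (the helper name and the per-record flush policy are mine):

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* base: page-aligned start of a MAP_SHARED | MAP_NOSYNC mapping of the file,
 * e.g. base = mmap(NULL, filesize, PROT_READ | PROT_WRITE,
 *                  MAP_SHARED | MAP_NOSYNC, fd, 0);
 * Writes one record and schedules its pages for write-back explicitly,
 * instead of leaving all of the timing to the pagedaemon. */
static int put_record(char *base, off_t off, const void *rec, size_t len)
{
    long pgsz = sysconf(_SC_PAGESIZE);

    memcpy(base + off, rec, len);

    /* msync() wants a page-aligned address, so round down to the page
     * containing the record and extend the length by the same amount. */
    char *pg = base + (off & ~(off_t)(pgsz - 1));
    return msync(pg, len + (size_t)(off & (pgsz - 1)), MS_ASYNC);
}
```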

Stream I/O, on the other hand, issues a single system call to copy a block of some size, which can update both non-atomic-sized ranges within a page and multiple pages at once, thereby causing less resource contention. I don't know whether FreeBSD has a negative cache for open file descriptors to quickly determine that the file is not accessible by any other process, but even if it has one, it still needs to obtain the corresponding lock, which is a performance issue.

As such, when doing lots of I/Os, use mmap for reading and stream I/O for writing (unless of course it's already mmapped and the write I/Os are individually small).
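
In code that split might look roughly like this (names are illustrative; it assumes the file is already open and mapped read-only once at startup):

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads are served through a read-only shared mapping (no private copy of
 * the cached data); writes go through the descriptor so a single syscall
 * can cover a whole record, however it straddles page boundaries. */
struct bigfile {
    int    fd;
    char  *map;      /* mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0) */
    off_t  size;
};

static const char *bf_read(const struct bigfile *bf, off_t off)
{
    return bf->map + off;                     /* comes from the page cache */
}

static ssize_t bf_write(struct bigfile *bf, off_t off,
                        const void *rec, size_t len)
{
    return pwrite(bf->fd, rec, len, off);     /* one syscall, buffer cache path */
}
```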

The problem could well be that mmap creates a mapping of the whole file. Can you imagine how large the page table for 1TB of space is?

2.5M entries, which is AFAIK the size of a 2009 Xeon's TLB. However, mmap()-ing a file causes it to be mapped on demand, which also means the TLB is updated incrementally. Most systems these days will suffer involuntary reschedules - which cause TLB invalidation, because the process has to be stopped and dirty data written out - long before they hit the TLB size limit, if the whole terabyte is really active.
 
struct vm_page has (on my amd64) a size of 104 bytes, which would be 2.54% overhead for mapped memory.
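
Spelled out for the 1 TB case (my arithmetic, using 4 KB pages and the 104-byte struct vm_page quoted above; it only applies to pages that are actually resident, of course):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long file_bytes = 1ULL << 40;   /* 1 TB file */
    unsigned long long page_size  = 4096;         /* 4 KB pages */
    unsigned long long vm_page_sz = 104;          /* struct vm_page on amd64, per above */

    unsigned long long pages    = file_bytes / page_size;   /* 268435456 pages */
    unsigned long long overhead = pages * vm_page_sz;       /* ~26 GB if all resident */

    printf("pages: %llu, vm_page overhead: %llu MB (%.2f%%)\n",
           pages, overhead >> 20, 100.0 * vm_page_sz / page_size);
    return 0;
}
```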

The mmapped memory for such a large file could now grow to occupy several GB of main memory, each GB being represented by 262144 pages, each of which has a vm_page associated with it. The pmap module in the kernel will need to walk this set of information when a page fault occurs in the object; that is what I meant. Using 4 MB pages for such a mapping would greatly reduce this overhead. Maybe some dtrace magic can point out where this is taking so long. I remember there was a YouTube video about dtrace pinpointing a performance pig in Solaris, which turned out to be the GNOME load meter: it caused high load by using mmap in a stupid way, forcing TLB invalidations on all CPUs in the system. Maybe I can dig that out as a reference, if time permits.
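
If superpages are the angle to test, recent FreeBSD's mmap() takes a MAP_ALIGNED_SUPER flag to request superpage-aligned mappings (2 MB on amd64, 4 MB on i386 without PAE); whether resident 4K pages actually get promoted depends on the pmap and on how densely the region is touched, so treat this only as a sketch:

```c
#include <sys/mman.h>
#include <sys/types.h>

/* Ask for superpage alignment so the pmap layer at least has the chance to
 * promote runs of resident 4K pages, reducing the per-page bookkeeping. */
static void *map_super(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ALIGNED_SUPER, fd, 0);
}
```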
 