C mmap() /dev/ada drive or partition?

ybungalobill · Jul 20, 2021

I'd like to mmap(2) a raw /dev/ada drive or partition, but mmap(2) returns "Invalid argument". Is it unsupported?

C:

    int fd = open("/dev/ada0p2", O_RDONLY);
    if(fd < 0) err(EX_OSERR, "open");

    void *p = mmap(0, 32*1024*1024, PROT_READ, MAP_SHARED, fd, 0);
    if(p == MAP_FAILED) err(EX_OSERR, "mmap");

ralphbsz · Jul 20, 2021

mmap'ing of devices is implemented by device drivers. Not all do. Some hardly ever do, for example mapping a serial port just makes no sense, so I bet the serial port drivers don't implement it. Others it makes perfect sense (frame buffer for example), so they are pretty much guaranteed to do it. I suspect that the disk driver does not implement it. You might want to consult the "black daemon book" or the kernel source.

Emrion · Jul 20, 2021

Maybe because you open the device in read only mode and use the flag MAP_SHARED which implies that all modifications in memory will be written to the device.

I'd try MAP_PRIVATE instead.

ybungalobill · Jul 20, 2021

ralphbsz said:
mmap'ing of devices is implemented by device drivers. Not all do. Some hardly ever do, for example mapping a serial port just makes no sense, so I bet the serial port drivers don't implement it. Others it makes perfect sense (frame buffer for example), so they are pretty much guaranteed to do it. I suspect that the disk driver does not implement it. You might want to consult the "black daemon book" or the kernel source.

I'm not asking to mmap a FIFO -- that, of course, would be insane. I expected mmap'ing a partition to be a rather trivial and useful case, and was surprised that it didn't work. I don't know what the "black daemon book" you're referring to is. I did look at the kernel source, however, but with my limited experience of it I cannot find the missing link between the ata device driver and the struct fp mmap interface.

Emrion said:
Maybe because you open the device in read only mode and use the flag MAP_SHARED which implies that all modifications in memory will be written to the device.

I'd try MAP_PRIVATE instead.

Thanks, but that makes no difference. The code works perfectly on regular read-only files either way.
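
For reference, here is a minimal sketch of the regular-file case that does work. The helper name map_readonly is made up for illustration, and the sketch assumes the file is at least len bytes long. It only shows that a read-only descriptor can be mapped with either MAP_SHARED or MAP_PRIVATE as long as the protection is PROT_READ.

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any library): open path read-only and
 * map its first len bytes with the given flags, which should be either
 * MAP_SHARED or MAP_PRIVATE. Returns the mapping, or MAP_FAILED if
 * open() or mmap() fails.
 */
static void *
map_readonly(const char *path, size_t len, int flags)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(NULL, len, PROT_READ, flags, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);
    return p;
}
```

Calling map_readonly(path, len, MAP_SHARED) and map_readonly(path, len, MAP_PRIVATE) on the same regular file should both succeed and show the same bytes; release each mapping with munmap(p, len).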

mark_j · Jul 21, 2021

I think it's equally insane to be opening a partition you know nothing about. For instance, how is mmap(2) to interpret your file descriptor when it has no understanding of the underlying file structure?

For example, I bet stat(2) can't return a st_size better than 0, effectively undefined.

It works fine on regular read-only files because mmap(2) understands the underlying file structure and is able to read it. This is what ralphbsz said.

Remember what mmap is trying to do, in respect of a file in this instance. It's a map to the file, so when you read from the mmap area, the file is read from and when you write to the mmap area, the file is written to. This cannot be achieved on a 'raw' partition. You would have to use read(2)/write(2) on the device to achieve this and then you would be performing much the same activities a file system or partition manager does.

Edit:
Ah, I didn't connect the dots until seeing an alert (the bell on the upper right forum) but you're the user that lost files with an inadvertent rm. Is this related?

bakul · Jul 21, 2021

You may want to ask this question on FreeBSD-hackers@freebsd.org. Alternatively, file a bug report.

ybungalobill · Jul 21, 2021

mark_j said:
I think it's equally insane to be opening a partition you know nothing about. For instance, how is mmap(2) to interpret your file descriptor when it has no understanding of the underlying file structure?

For example, I bet stat(2) can't return a st_size better than 0, effectively undefined.

... This cannot be achieved on a 'raw' partition. You would have to use read(2)/write(2) on the device to achieve this and then you would be performing much the same activities a file system or partition manager does.

You seem to assume that /dev/ada is some sort of a black box -- but it is not. AFAIK all mass storage devices are presented logically as a linearly addressable sequence of bytes, that can be read and written with pread(2) and pwrite(2) like any regular file. That's how you'd use dd to flash an image onto a flash drive, for example; or how a user-land utility, like fsck, would access the filesystem circumventing any mount points.
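
As a rough sketch of what "read it like any regular file" means in practice: the helper below (read_exact_at is a made-up name) loops over pread(2) until the requested range has been read. It is only an illustration. On a raw disk device you would, as far as I know, also have to keep the offset and length multiples of the sector size, which this sketch does not enforce.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): read exactly len bytes
 * starting at byte offset off, retrying on short reads and EINTR.
 * Returns 0 on success, -1 on error or premature end of file/device.
 */
static int
read_exact_at(int fd, void *buf, size_t len, off_t off)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = pread(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return -1;
        }
        if (n == 0)
            return -1;          /* ran out of data early */
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Because pread(2) takes the offset as an argument, the descriptor's file position is never touched, so several readers can share one fd.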

A mmap(2) of a raw disk or partition would do just that -- it would fault in the necessary pages as if through pread(2)/pwrite(2). In fact I just checked and Linux allows mmap'ing raw disks and partitions.

mark_j said:
Ah, I didn't connect the dots until seeing an alert (the bell on the upper right forum) but you're the user that lost files with an inadvertent rm. Is this related?

It's tangentially related. I wish I could mmap the thing, use the OS pager for a cache, and ignore memory management altogether (that's what the OS is for!). But reading with pread isn't much of a problem either.

I find this topic useful nonetheless. AFAIK there are database implementations that sit directly on raw devices to avoid the overhead of a filesystem (databases are filesystems in their own right), and there are databases that use mmap. Would be great if one could marry the two.

ralphbsz · Jul 21, 2021

ybungalobill said:
AFAIK all mass storage devices are presented logically as a linearly addressable sequence of bytes, that can be read and written with pread and pwrite like any regular file.

Right. That's the crucial difference between direct access devices (disks, on which you can seek, and pread/pwrite are nothing but an atomic seek/read/write combination), and the sequential devices (serial ports are the classic example, seeking on that would make no sense). Files are similarly direct access: you can seek anywhere and read or write (and if you seek to

But: Let's think about why mmap. What's the purpose of using mmap, versus the "traditional" way of interacting with a random-access object? It's all about caching: Normal files use the buffer cache of the file system (or OS). From that viewpoint, it makes sense to use mmap: When you first touch a page (with a memory access) on a mmap'ed file, the underlying file system can set up the cache content, read the page from disk (if the operation was a read), or mark the page as having been touched and needing to be written later. Because of caching and its intimate interaction with the VM (virtual memory) subsystem in the kernel, file systems are already organized around memory pages, so mmap is a good fit there.

And now try to do the same with an uncached raw disk. Conceptually, all hell breaks loose. Say you have a disk, and have mmap'ed the whole thing into the address space of the process. Now someone reads 1 byte from that address space (with a memory read instruction). Do you really want to read a whole 512-byte sector? What do you do with the rest of the page? When do you drop it from cache? How do you deal with the fact that disk sectors are 512 bytes, while memory pages are 4K? Even worse: Someone writes 1 byte (or 1 32-bit or 64-bit word) to a previously unused sector. What now, brown cow? To be correct, you have to either tag the remaining 511 (or 508 or 504) bytes as unknown, or you have to do a read-modify-write operation. And to make matters worse, disk drivers don't have any mechanism to keep track of dirty sectors that need write-behind. In a nutshell, if you want mmap on a disk device, the device driver will end up implementing a significant fraction of the VM subsystem.
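
To make the single-byte write case concrete, here is a user-space sketch of that read-modify-write step using pread/pwrite. The helper name write_byte_rmw and the fixed SECTOR_SIZE of 512 are assumptions for illustration only; real devices may report other sector sizes, and a real implementation would also have to track which sectors are dirty.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512     /* assumed for this sketch; may be 4096 */

/*
 * Hypothetical helper: change one byte at byte offset off by reading
 * the whole sector that contains it, patching the copy in memory and
 * writing the whole sector back. Returns 0 on success, -1 on failure.
 */
static int
write_byte_rmw(int fd, off_t off, unsigned char value)
{
    unsigned char sector[SECTOR_SIZE];
    off_t base = off - (off % SECTOR_SIZE);   /* start of the sector */

    /* Read: fetch the other 511 bytes so they are not clobbered. */
    if (pread(fd, sector, SECTOR_SIZE, base) != SECTOR_SIZE)
        return -1;

    /* Modify: patch the single byte in the in-memory copy. */
    sector[off - base] = value;

    /* Write: put the full sector back where it came from. */
    if (pwrite(fd, sector, SECTOR_SIZE, base) != SECTOR_SIZE)
        return -1;
    return 0;
}
```

One byte of payload costs two full-sector transfers here, which is the overhead a page cache normally hides by batching such updates.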

And all that effort for what? Who would ever want to mmap a disk? File systems don't: They run in the kernel, and they use IO calls that are logical relatives of pread and pwrite (in reality, they are significantly more complex, and typically have scatter/gather capability for efficiency). Databases (which also used to use disk partitions for storage) have very elaborate IO backends, with interestingly complex caching implementations, and they know really well how to use pread/pwrite and relatives. For example, the typical software engineering department for a database IO backend for a commercial database will include several dozen software engineering staff members, so they don't need mmap to be implemented for them (they can do it better themselves). Fsck (and a few relatives like mkfs) are written by file system people, and they have libraries for access. So a lot of effort would be needed, for no real-world benefit.

Note that above, the whole problem with implementing mmap on a device arises because of the impedance mismatch between the inherently memory-like interface of mmap, and the absence of caching on disk devices. In Linux, the situation is different: Early on, the Linux kernel made the (back then pretty radical) decision to integrate the cache layer and the connections to the VM subsystem in the block device layer, not in the file system layer. This is quite a different way to architect the lower half of the kernel, and I think at the time, nobody else was doing this in a production system. Matter-of-fact, in Linux if you want to do accesses to a disk device *without* the memory cache, you have to go out of your way to turn caching off by selecting a raw disk. For this reason, mmap'ing a block device is natural and easy in Linux.

I've not heard of there being a cacheable block device layer in FreeBSD; but then, I've not gone out of the way to look for one.

If you are serious about learning the why and how of this, you will end up reading the "black daemon book": It's really called "Design and Implementation of the FreeBSD operating system", the main author is Kirk McKusick (there are other authors), and the current edition is mostly black, and has the head of the daemon on the cover.

bakul · Jul 21, 2021

You can probably write a thin layer driver that adds mmap + caching on top of disks controlled by specific drivers and see what that buys you. struct cdevsw that drivers use does have a couple of fields for adding device specific mmap code!

bakul · Jul 21, 2021

ralphbsz said:
Early on, the Linux kernel made the (back then pretty radical) decision to integrate the cache layer and the connections to the VM subsystem in the block device layer, not in the file system layer.

Caching can make sense at multiple levels, just like processors have L{1,2,3} caches. Though any major surgery in the filesystem level code in FreeBSD would be very difficult.

covacat · Jul 21, 2021

looks like fsck_msdosfs uses mmap (for the FATs)
so it should work

astyle · Jul 22, 2021

Back when I was first learning about UNIX systems administration (2004), I did have to write some C programs that dealt with file descriptors and pointers. I ended up using the system() C API to call a UNIX command to ultimately write out the file to disk. That was an easier way to do it from within a C program. If the disk is mounted, the UNIX command doesn't care where it writes. All this is off the top of my head, I'm a bit too lazy to go link hunting right now.

Emrion · Jul 22, 2021

covacat said:
looks like fsck_msdosfs uses mmap (for the FATs)
so it should work

I was just working on this subject and found that fsck_msdosfs tries to use mmap() but it has a plan B in case that fails:

Code:

    /* Attempt to mmap() first */
    if (allow_mmap) {
        fat->fatbuf = mmap(NULL, fat->fatsize,
                PROT_READ | (rdonly ? 0 : PROT_WRITE),
                MAP_SHARED, fd_of_(fat), off);
        if (fat->fatbuf != MAP_FAILED) {
            fat->is_mmapped = true;
            return 1;
        }
    }

    /*
     * Unfortunately, we were unable to mmap().
     *
     * Only use the cache manager when it's necessary, that is,
     * when the FAT is sufficiently large; in that case, only
     * read in the first 4 MiB of FAT into memory, and split the
     * buffer into chunks and insert to the LRU queue to populate
     * the cache with data.
     */

I tried code similar to ybungalobill's and got exactly the result he described: mmap() works on a regular file but not on a device (I used a USB stick), even when it is not mounted (and open() itself produces no error).
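
The same "try mmap(), otherwise read it in" idea can be sketched in a stand-alone form. This is not the fat.c code: struct region, region_load and region_free are made-up names, and the fallback simply reads the whole range with pread(2) instead of using a cache manager.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical container for a read-only view of part of a file. */
struct region {
    void   *data;
    size_t  len;
    bool    is_mmapped;
};

/*
 * Try to map len bytes at offset off; if mmap() refuses (as it does
 * for the disk devices discussed here), fall back to a malloc'ed
 * buffer filled with pread(). Returns 0 on success, -1 on failure.
 */
static int
region_load(struct region *r, int fd, off_t off, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
    if (p != MAP_FAILED) {
        r->data = p;
        r->len = len;
        r->is_mmapped = true;
        return 0;
    }

    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
        if (n <= 0) {           /* error or early EOF: give up */
            free(buf);
            return -1;
        }
        done += (size_t)n;
    }
    r->data = buf;
    r->len = len;
    r->is_mmapped = false;
    return 0;
}

/* Release whichever kind of buffer region_load() produced. */
static void
region_free(struct region *r)
{
    if (r->is_mmapped)
        munmap(r->data, r->len);
    else
        free(r->data);
    r->data = NULL;
}
```

The caller only ever touches r.data and r.len, so the rest of the program does not need to know which path was taken.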

Emrion · Jul 23, 2021

I modified the code of /usr/src/sbin/fsck_msdosfs/fat.c. Just added some printf(), compiled and installed it. It turns out that it fails to use mmap().

So, this is a real problem.

covacat · Jul 23, 2021

Emrion said:
I modified the code of /usr/src/sbin/fsck_msdosfs/fat.c. Just added some printf(), compiled and installed it. It turns out that it fails to use mmap().

So, this is a real problem.

it only uses mmap when you fsck an image file. fails / falls back on block devices. tried on 11.4 and 12.2

astyle · Jul 23, 2021

Sometimes, you have to ask yourself, "What are you trying to accomplish?".

mmap(2) is an API that clearly makes use of file descriptors, memory (RAM) address, and offsets that are measured in bytes.

Emrion said:
I tried code similar to ybungalobill's and got exactly the result he described: mmap() works on a regular file but not on a device (I used a USB stick), even when it is not mounted (and open() itself produces no error).

If the problem to be solved is something like what Emrion posed, then yeah, keep chugging. I'd love to see the mounting headaches on FreeBSD go away. Otherwise, my personal recommendation would be to look for a simpler solution.

Emrion · Jul 23, 2021

covacat said:
it only uses mmap when you fsck an image file. fails / falls back on block devices. tried on 11.4 and 12.2

No. It tries to use mmap() anyway unless you specify the -M option (see main.c, variable allow_mmap). I'm not saying that 13.0-RELEASE is the problem. The fact that previous versions don't work either is irrelevant.

So far, I haven't a clue how to make it work. Maybe it's not possible by (FreeBSD) design, but then someone should explain why. No one has given even the beginning of an explanation.

And yes, there is at least another means to get the job done, fat.c uses it. But that wasn't the question.