Other: Porting Linux MD RAID to BSD

I think this discussion is valid in multiple sub-forums, but I had to make a choice, so it is here under the storage forum.

BSD = FreeBSD in all subsequent text.

As I've mentioned before on here, the lack of compatibility with Linux MD devices is a major reason I can't move more of my R&D infrastructure to BSD. I'm aware of ZFS and I don't want to go there, nor do I want to get into a discussion of the pros/cons.

My question is whether anyone has considered, or even attempted, a port of the Linux md-raid drivers and user tools to work under BSD. I've considered it, but without getting too far into the mental exercise I start thinking about a couple of challenges in porting: 1) BSD's lack of block device drivers for disks, and 2) possible differences in expected on-disk layouts between the platforms.

I know. I know... It's time for someone to start chirping about the evil GPL. Again, no need to debate. I believe that if the source of a port is made available, and the end user is expected to build it themselves, then the GPL doesn't have a leg to stand on WRT integration into the BSD world.

Anyway, just wondering if others have gone down this particular rabbit hole before me.
 
I once added Linux md support to the CCD driver, which is no longer in the kernel.

What raid level do you run there? Raid0 and raid1 are fairly trivial.
 
In general I'm leery of implementing a storage system that is compatible with another one without copying the code, because there is too much potential for subtle mistakes. This is particularly true if the same on-disk data structures have to be shared back and forth by two implementations.

For the GPL worries: Person 1 could study the Linux MD implementation, and write down the exact algorithm and on-disk data structures (metadata) used by it. Then give that document to person 2, who could implement it on BSD without ever having seen the GPL'ed code. I think in a corporate setting, that would not be sufficient to circumvent the GPL worries, as some of the GPL'ed ideas are in the design of the metadata. For amateurs who are not worth suing, this might be sufficient.

I think the better idea is to implement a "data hoover" (named after the vacuum cleaner): run a process on a Linux machine (might be a VM) that reads the MD array and outputs the content as a sequential stream; a second process on a FreeBSD machine then ingests it into a new native array. This requires doing a complete copy (not a bad idea for dealing with format and compatibility worries), and requires extra space.
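A minimal sketch of the Linux-side half, assuming the array is assembled as /dev/md0 (device name is just an example); pipe its output over ssh or nc to the FreeBSD box and write it into the new native volume there:

Code:
#!/usr/bin/env python3
# "Data hoover", Linux side: read the assembled md array sequentially and
# stream its contents to stdout. /dev/md0 and the 1 MiB read size are
# example values only.
import sys

SRC = "/dev/md0"      # assembled md array (example path)
CHUNK = 1 << 20       # 1 MiB per read

with open(SRC, "rb", buffering=0) as src:
    while True:
        buf = src.read(CHUNK)
        if not buf:
            break
        sys.stdout.buffer.write(buf)

The FreeBSD side is just the mirror image: read stdin and write it into the new array (or skip the script entirely and use dd on both ends of an ssh pipe).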
 
md raid1 and raid0 were straightforward to do. You could just re-write the header and use them in geom raid that way (assuming here the GEOM header isn't too big). Rebuilding after a disk error would have to be tested.
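For anyone poking at the header: here's a quick sketch of reading the v1.2 superblock off a member device. The 4 KiB offset and the field offsets are my reading of struct mdp_superblock_1 in Linux's md_p.h, so verify against the header before trusting them.

Code:
#!/usr/bin/env python3
# Peek at an md v1.2 superblock on a member device (not the assembled array).
# The 4096-byte offset and the field offsets below follow my reading of
# struct mdp_superblock_1 in Linux's md_p.h -- double-check against the header.
import struct
import sys

MD_SB_MAGIC = 0xa92b4efc
dev = sys.argv[1]                      # e.g. /dev/loop0 (example member device)

with open(dev, "rb") as f:
    f.seek(4096)                       # v1.2 superblock offset from device start
    sb = f.read(256)

magic, major = struct.unpack_from("<II", sb, 0)
if magic != MD_SB_MAGIC or major != 1:
    sys.exit("no v1.x md superblock found")

level, layout = struct.unpack_from("<iI", sb, 72)    # RAID level, layout
(size,) = struct.unpack_from("<Q", sb, 80)           # component size, 512-byte sectors
chunk, raid_disks = struct.unpack_from("<II", sb, 88)
print(f"level={level} layout={layout} chunk={chunk} sectors disks={raid_disks}")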

I didn't look at raid5 and higher.
 
Obviously, the path would be a simple code copy, modifying where necessary to work with the BSD kernel. For striping there is no such thing as a rebuild, but for mirroring a rebuild would need to be tested.

As I said, I don't want to discuss the GPL, because too many people have too many opinions about what it is and is not. If the source for the mods is available and the end user compiles it for their system(s), then the GPL is satisfied: the code is "inspectable", and no steps are taken to restrict use/modification rights further along the chain.

The data hoover is irrelevant because the purpose of the exercise is to make media interchangeable between platforms in a dual-boot config.
 
The trick is finding somebody to do the work when the end result can be used on, but not committed into, the FreeBSD tree.

There are also edge cases. Let's say a resync is in progress when you switch OSes and you need to continue the resync on FreeBSD. Lots of opportunity to screw up.
 
I did not consider the interrupted rebuild, but that is indeed important. Thanks. Yeah, I'm expecting that the project would need to be maintained apart from the BSD tree. I've done some driver work for Linux but not BSD, so I'm anxious to see how similar the driver models are.

I envision development being done with BSD and Linux both running as VMs, with the target disk devices being large flat files on the host that can be assigned to the VMs for testing. Not fast by any means, but a functional way to test.
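For the backing files, something like this makes sparse images the hypervisor can attach as virtual disks (the names and the 8 GiB size are arbitrary examples):

Code:
#!/usr/bin/env python3
# Create sparse flat files to hand to the VMs as pseudo-disks.
# File names and the 8 GiB size are arbitrary examples.
import os

SIZE = 8 * 1024**3                      # 8 GiB apparent size
for i in range(4):
    path = f"mdtest-disk{i}.img"
    with open(path, "wb") as f:
        f.truncate(SIZE)                # sparse: no data blocks allocated yet
    print(path, os.path.getsize(path), "bytes (apparent)")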
 
I think it would be far faster to adapt the geom raid code to understand the on-disk format of md.

As we said, the trickier bits would be array creation, resync, and the like.
 
The hairiest part of RAID5 (or in general any parity-based RAID) is how to deal with the write hole. As far as I remember, Linux MD uses a partial parity log (parity of the data blocks not affected by a small write), and I don't know whether it is a true rotating log or a buffer area that can hold a limited number of IOs. If MD has a clean-shutdown marker, and the FreeBSD version were to be read-only, then this would be easy to deal with: (a) refuse to start if the Linux side didn't do a clean shutdown, meaning the log might be needed for recovery; (b) ignore parity, and refuse to operate in degraded mode; (c) never update anything, in particular not parity or the log. In a nutshell, this turns RAID5 into a strange version of striping, where the data is scattered a little wider than normal striping.
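To illustrate the "strange striping" read path: for the one layout I remember (left-symmetric, which I believe is the md default for RAID5), mapping a logical chunk to a (member disk, chunk-on-disk) pair is just arithmetic. Other layouts rotate parity and data differently, so treat this as a sketch to be checked against raid5_compute_sector() in Linux's raid5.c.

Code:
# RAID5 "left-symmetric" layout (I believe the md default): map a logical
# chunk number to (member disk index, chunk offset on that disk). Ignores the
# per-device data_offset from the superblock; other layouts rotate
# differently -- check raid5_compute_sector() in raid5.c before relying on it.
def raid5_left_symmetric(logical_chunk: int, n_disks: int):
    data_disks = n_disks - 1
    stripe = logical_chunk // data_disks     # stripe (row) number
    idx = logical_chunk % data_disks         # data chunk index within the stripe
    parity_disk = (n_disks - 1) - (stripe % n_disks)
    data_disk = (parity_disk + 1 + idx) % n_disks
    return data_disk, stripe                 # chunk `stripe` on member `data_disk`

# 3-disk example: where do the first six logical chunks land?
for chunk in range(6):
    print(chunk, raid5_left_symmetric(chunk, 3))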

If recovery from an unclean shutdown is required, then the log has to be processed. If running in degraded mode is required, parity layout needs to be considered. And if writes are required (in particular in degraded mode with crash protection for the write hole), then everything needs to be implemented.

What I'm really saying here: If the requirements can be reduced to something that might solve 80% of the problem (readonly, no degraded mode, only after a clean shutdown), that might be doable with 20% of the effort.

Not that I'm volunteering ... I haven't implemented RAID in C in over 5 years, and don't care to start doing it again.
 
To discover the disk format, calls, etc., you could experiment on Linux using a user-mode block device driver and capture precisely what changes: start from a set of ublk "disks", form a RAID5 out of them, and try sample accesses that cover all (or at least the common) scenarios.
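For the "sample accesses" part, even something as crude as writing tagged blocks through the array and then scanning the raw members shows where each logical chunk lands. The device paths and the 64 KiB chunk size below are example values, and this of course destroys whatever is on the test array.

Code:
#!/usr/bin/env python3
# Write a recognizable tag into each chunk of the assembled (throw-away!)
# array, then scan the raw member devices to see where each tag landed.
# Device paths and the 64 KiB chunk size are example values.
import os

ARRAY = "/dev/md0"
MEMBERS = ["/dev/loop0", "/dev/loop1", "/dev/loop2"]
CHUNK = 64 * 1024

with open(ARRAY, "r+b", buffering=0) as md:
    for n in range(16):
        tag = f"LOGICAL-CHUNK-{n:04d}".encode()
        md.seek(n * CHUNK)
        md.write(tag.ljust(CHUNK, b"\0"))
os.sync()                                  # flush so the members show the writes

for dev in MEMBERS:
    with open(dev, "rb", buffering=0) as m:
        off = 0
        while True:
            buf = m.read(CHUNK)
            if not buf:
                break
            i = buf.find(b"LOGICAL-CHUNK-")
            if i >= 0:
                print(dev, "byte offset", off + i, buf[i:i + 18].decode())
            off += len(buf)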
 
Yes, as mentioned above, the dev and test work would be done using flat-file pseudo-block devices assigned to VMs, so the actual RAID processing happens safely.
 
So my assumption is that the algorithmic heavy lifting has already been done in the existing Linux tools. I'd simply port the driver(s) and the userspace tool, then test the crap out of it. I understand, though, that the driver API might differ between the two systems, which will require some analysis.
 