Why netlink over BSD routing sockets?

spmzt

Developer
I was reading the source code for route(8) and noticed that we have implemented netlink support for our base utilities.

I'm trying to understand why we decided to switch our userland sockets to support netlink.

I have read several comparison articles on the internet, and their reasoning includes:
  1. "Netlink is async": BSD sockets can be used asynchronous too; they can be used with poll(2) and SOCK_NONBLOCK sockets.
  2. "Netlink is a loadable kernel module": not justify the reason behind changing the default route.c to route_netlink.c
  3. "Netlink socket supports multicast": Again, This alone does not justify replacing BSD sockets with netlink in our base utilities.
  4. "Netlink socket includes firewall features": We don't have this feature, which is understandable, and it is also not related to our use case.
Could anyone help me understand the reasoning behind this decision?

I also saw commit messages regarding netlink that mentioned "not abusing" BSD sockets. What exactly constitutes abuse of an interface that was designed for precisely this purpose?
 
I can't really speak for the routing sockets (although if I had to guess, I'd say it might be for compatibility with Linux), but I'm in the process of converting the pf ioctl interface to netlink.

The reasons for the pf conversion are:
- extensibility
- performance
- memory use

Netlink uses a type/length/value encoding, so if either the kernel or userspace adds fields the other side doesn't know about, they can easily be ignored. ioctl calls copy structs back and forth, so any time you extend one you break the interface. I keep adding features to pf, and this makes that a lot easier.
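
To illustrate the extensibility point, here is a minimal, self-contained sketch of type/length/value parsing. The struct layout and the ATTR_COUNTER attribute id are made up for illustration (real netlink uses struct nlattr plus alignment rules), but the principle is the same: the consumer walks the buffer by length and silently skips types it does not recognize, so a newer peer can add attributes without breaking an older one.

    /*
     * TLV-parsing sketch. The header layout and ATTR_COUNTER are
     * invented for illustration; real netlink attributes differ in
     * detail but are walked the same way.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct tlv {
        uint16_t type;  /* attribute identifier */
        uint16_t len;   /* header + payload length, in bytes */
    };

    #define ATTR_COUNTER 1  /* hypothetical attribute id */

    static void
    parse_attrs(const unsigned char *buf, size_t buflen)
    {
        size_t off = 0;

        while (off + sizeof(struct tlv) <= buflen) {
            struct tlv hdr;

            memcpy(&hdr, buf + off, sizeof(hdr));
            if (hdr.len < sizeof(hdr) || off + hdr.len > buflen)
                break;          /* malformed attribute, stop */

            switch (hdr.type) {
            case ATTR_COUNTER: {
                uint64_t v;

                if (hdr.len >= sizeof(hdr) + sizeof(v)) {
                    memcpy(&v, buf + off + sizeof(hdr), sizeof(v));
                    printf("counter: %ju\n", (uintmax_t)v);
                }
                break;
            }
            default:
                /* Attribute added by a newer peer: ignore it. */
                break;
            }
            off += hdr.len;
        }
    }

    int
    main(void)
    {
        unsigned char buf[sizeof(struct tlv) + sizeof(uint64_t)];
        struct tlv hdr = { .type = ATTR_COUNTER, .len = sizeof(buf) };
        uint64_t v = 42;

        memcpy(buf, &hdr, sizeof(hdr));
        memcpy(buf + sizeof(hdr), &v, sizeof(v));
        parse_attrs(buf, sizeof(buf));
        return (0);
    }

With fixed ioctl structs, by contrast, both sides must agree on the exact struct size and layout, so adding a field means either versioning the ioctl or breaking existing binaries.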

The performance improvement is mostly relevant when compared to the alternative that also provides extensibility, namely nvlists. Those are a different way of providing type/length/value encoding, but a much slower one. I ran into those performance issues when trying to use nvlists for pf's state export.

Finally, memory use. Netlink allows userspace to start processing the reply while the kernel is still generating it. In the ioctl case (with or without nvlists) we have to generate the entire reply before we can copy it out.
For very large messages this makes a huge difference. See https://cgit.freebsd.org/src/commit/?id=f218b851da0480460195c1128d0c0b41d3b6a6d4 for exact numbers, but for a large state export we went from needing multiple gigabytes of memory to a few tens of megabytes, while actually being slightly faster, by converting the ioctl call to netlink.
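
As a rough sketch of what that looks like from userspace (assuming the standard netlink message macros; on FreeBSD they come from <netlink/netlink.h>, on Linux from <linux/netlink.h>; the socket setup, the dump request, and the pf-specific record handling are omitted), a large dump can be drained in fixed-size chunks, so only one small buffer is ever resident:

    /*
     * Sketch: draining a large netlink dump incrementally. The kernel
     * keeps producing messages while we consume them; at no point does
     * the full reply have to exist in memory on either side.
     */
    #include <sys/socket.h>
    #include <netlink/netlink.h>

    static void
    drain_dump(int fd)
    {
        char buf[8192];     /* arbitrary chunk size */

        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n <= 0)
                return;

            int len = (int)n;
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
            for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
                if (nlh->nlmsg_type == NLMSG_DONE ||
                    nlh->nlmsg_type == NLMSG_ERROR)
                    return;     /* end of the multipart dump */
                /*
                 * Handle one record here (e.g. print one state);
                 * the chunk is reused for the next recv().
                 */
            }
        }
    }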
 