syscall-ioctl performance FreeBSD vs Linux

Hi,
I have an application that runs on FreeBSD and Linux and needs to issue a lot of ioctl calls. With a simple "for" loop calling a custom ioctl, I get the following durations:
C:
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <unistd.h>

/* IOCTL_TEST is defined by the custom driver's header (not shown here) */

int main(void)
{
    struct timeval tv1, tv2;
    int fd, i;

#ifdef __linux__
    fd = open("/dev/test_device", O_RDWR);
#else
    fd = open("/dev/test", O_RDWR);
#endif
    if (fd < 0)
    {
        printf("Cannot open device file...\n");
        return 0;
    }

    gettimeofday(&tv1, NULL);
    for (i = 0; i < 0xffffff; i++)
        ioctl(fd, IOCTL_TEST, NULL);
    gettimeofday(&tv2, NULL);
    printf("Total time = %f us\n",
           (double) (tv2.tv_usec - tv1.tv_usec) +
           (double) ((tv2.tv_sec - tv1.tv_sec) * 1000000));

    close(fd);
    return 0;
}
Duration (us):
FreeBSD 14.0                  1559924
Linux (Ubuntu 20.04.6 LTS)     841341

Is there a way to optimize the ioctl call in FreeBSD to have similar performance as Linux?
 
The function gettimeofday is not a syscall in the Linux world (it is a vsyscall, see https://lwn.net/Articles/446528/), so it is faster on Linux than on FreeBSD. This may skew the timing computation.
You could use dtrace to profile the execution time and visualize it with a flame graph, to get a more precise measurement of the real duration of the ioctl call.
 
The gettimeofday calls are only there to measure the duration of running 16M ioctls:
Code:
for (i = 0; i < 0xffffff; i++)
  ioctl(fd, IOCTL_TEST, NULL);
I wrote a simple kernel module with a fake ioctl.
Code:
#include <sys/types.h>
#include <sys/systm.h>
#include <sys/param.h>  /* defines used in kernel.h */
#include <sys/module.h>
#include <sys/kernel.h> /* types used in module initialization */
#include <sys/conf.h>   /* cdevsw struct */

static d_open_t  test_open;
static d_close_t test_close;
static d_ioctl_t test_ioctl;

static struct cdev *test_cdev;
static struct cdevsw test_cdevsw =
{
    .d_version = D_VERSION,
    .d_open    = test_open,
    .d_close   = test_close,
    .d_ioctl   = test_ioctl,
    .d_name    = "test",
};

static int test_loader(struct module *m __unused, int what, void *arg __unused)
{
    int err = 0;

    switch (what)
    {
        case MOD_LOAD:
            err = make_dev_p(MAKEDEV_CHECKNAME | MAKEDEV_WAITOK,
                             &test_cdev,
                             &test_cdevsw,
                             0,
                             UID_ROOT,
                             GID_WHEEL,
                             0600,
                             "test");
            printf("test loaded\n");
        break;
        case MOD_UNLOAD:
            destroy_dev(test_cdev);
            printf("test removed\n");
        break;
        default:
            err = EOPNOTSUPP;
        break;
    }
    return err;
}

static int test_open(struct cdev *dev __unused, int oflags,
                     int devtype, struct thread *td)
{
    printf("test opened\n");
    return 0;
}

static int test_close(struct cdev *dev __unused, int fflag,
                      int devtype, struct thread *td)
{
    printf("test closed\n");
    return 0;
}

static int test_ioctl(struct cdev *dev, u_long cmd, caddr_t data,
                      int fflag, struct thread *td)
{
    return 0;
}

DEV_MODULE(test, test_loader, NULL);
 
Is there a way to optimize the ioctl call in FreeBSD to have similar performance as Linux?
No idea.

I would be very concerned with any design that literally issues that many ioctls per second. As an example, why write one byte of data to a device at a time when you can write hundreds or thousands of bytes in the same operation? I'd focus on making the API presented by the ioctl layer more efficient.

If the userland code needs to poll something, move the polling into the driver with some measure of rate limiting so the rest of the system isn't burdened.
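
As a rough sketch of what I mean (IOCTL_TEST_BATCH and struct test_batch are names I made up for illustration, not anything from the driver above): a batched request lets a single ioctl carry many operations, so the user/kernel transition cost is paid once per batch.
Code:
/*
 * Hypothetical batched interface -- the names and layout are illustrative.
 * The point: pay the user/kernel transition once per N operations instead
 * of once per operation.
 */
#include <sys/ioctl.h>
#include <stddef.h>
#include <stdint.h>

#define TEST_BATCH_MAX 256

struct test_batch {
    size_t   count;                 /* number of valid entries in ops[] */
    uint32_t ops[TEST_BATCH_MAX];   /* payload the driver walks in a single pass */
};

#define IOCTL_TEST_BATCH _IOW('T', 1, struct test_batch)

static int
submit_batch(int fd, const uint32_t *ops, size_t n)
{
    struct test_batch b;

    if (n > TEST_BATCH_MAX)
        n = TEST_BATCH_MAX;
    b.count = n;
    for (size_t i = 0; i < n; i++)
        b.ops[i] = ops[i];
    return ioctl(fd, IOCTL_TEST_BATCH, &b);   /* one syscall covers n operations */
}
On the kernel side the driver would walk ops[] in one pass; the exact layout of course depends on what the real ioctls do.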
 
I agree that I can reduce the number of ioctl calls in my application.
The point here is to test the ioctl call and measure the time it takes.
It shows almost twice the time on FreeBSD compared to the Linux kernel to complete the iterations, so I am checking whether there is a way to optimize the syscall path in FreeBSD.
 
What hardware and compiler?
More important: why?
This smells of a design problem. You are not supposed to poll; you read from the device and that call returns when data is available. You may want to read up on FreeBSD Device Drivers, the book. It is a good reference and starting point. But this smells like someone in the Linux camp was cutting corners to make a bad design more performant instead of fixing the design.
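
To make that concrete (the device path and buffer size below are just placeholders, not from this thread): instead of polling, block in read(2) and let the driver wake the process when data is ready.
Code:
/* Minimal sketch: block in read(2) instead of polling the device. */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char buf[4096];
    ssize_t n;
    int fd;

    fd = open("/dev/test", O_RDONLY);
    if (fd < 0)
        err(1, "open");

    /* read(2) sleeps in the driver until data is available:
       no busy loop, no per-poll syscall overhead */
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* process n bytes ... */
    }

    close(fd);
    return (0);
}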
 
I think the test program and the kernel driver are actually shown above. In user space there is fundamentally nothing to test, and the flame graph would just show one call to ioctl(). My guess is that the hardware is standard 64-bit Intel, and the compiler is the standard production compiler for Linux and FreeBSD.

Ultimately, this is a really simple question: what is the "performance" (latency, CPU consumption, whatever you want to call it) of the kernel's syscall interface? The ioctl method of testing it is reasonable, as it is a particularly simple syscall (except for having to dereference the device number that comes from user space into the kernel's internal structure, which is not too hard). Other traditional methods of testing this include getuid and sched_yield (while gettimeofday is an outlier, having been moved into user space on Linux). The systems research community has put significant work into making syscalls more efficient, but the conclusion from all these efforts is that the transition between user and kernel space will always be expensive: thousands of instructions, or typically hundreds of nanoseconds.

That brings up the question: cui bono? For disk IO (which includes file systems), a penalty of a few hundred ns doesn't matter much, as real disk IO (even to SSD, even over fast storage interconnects such as SAS) takes microseconds. Where this becomes an issue is networking, starting at 10GbE and InfiniBand. This is why the networking stack has often been moved into user space, with buzzwords such as RDMA, Verbs, OFED, RoCE, iWARP and iSER, and so on. Or the networking stack has been moved off-host, into adapter cards that handle a whole slew of tasks (routing, encryption, DMA) and are triggered from user space. So today, does anyone actually care about this performance aspect any longer?
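
For what it's worth, a minimal version of that classic measurement (the iteration count and clock choice here are arbitrary, not from this thread) could look like this:
Code:
/* Time a tight loop of getuid() calls; since the syscall itself does almost
   nothing, the result is mostly the cost of the user/kernel transition. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    struct timespec t1, t2;
    const long iters = 10000000;    /* 10M calls, arbitrary */
    double ns;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < iters; i++)
        (void)getuid();
    clock_gettime(CLOCK_MONOTONIC, &t2);

    ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
    printf("%.1f ns per getuid() call\n", ns / iters);
    return (0);
}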
 
I appreciate chngjp for signing up to this forum two days ago to offer this insight. One thing to note here is that the actual ioctl method 'test_ioctl' does nothing more than return. A real ioctl method would execute dozens or hundreds of instructions, so the overhead of the actual call gets diluted. Mind you, this *IS* an interesting point, but likely not as important as it might seem.

I do thank chngjp for sharing.
 
The function gettimeofday is not a syscall in the Linux world (it is a vsyscall, see https://lwn.net/Articles/446528/), so it is faster on Linux than on FreeBSD. This may skew the timing computation.
You could use dtrace to profile the execution time and visualize it with a flame graph, to get a more precise measurement of the real duration of the ioctl call.

FreeBSD has a fast gettimeofday, too. About 10x faster than, say, getrusage(2).
 
I guess this is down to the different memory mapping models. Flushing the TLBs and such.

Now the question is: would changing that make everyday workloads faster, or only this edge case? We gain a lot from the current memory handling as objects, the vnode cache for example.

A 100% difference sounds big, but if that is the difference between one and two microseconds, for an operation you should not be doing that often anyway, is it still worth changing a single line?

ralphbsz one of the biggest reasons for the move of network stacks to user space and off to special HW is unlikely to be syscall performance. I suspect the impact of cross-core shared mapping flushes to be higher. But that is better discussed with real data and maybe a cold beer.
 
ioctls should not affect memory mapping or TLB flushing. The test driver above allows measuring ioctl overhead, and this is the minimum any ioctl has to do. A 2:1 ratio is not too bad actually, as FreeBSD likely does more work. IMHO these syscall code paths would be very hard to "optimize" without a thorough understanding of the whole syscall machinery and why it is the way it is. To get an idea, apart from reading the code, one suggestion is to run gdb on the kernel and single-step through the code until the driver's ioctl routine is called.
 
Oh, but it does affect mapping. You switch from user mode to kernel mode and suddenly the reachable memory area increases. And it decreases when you switch back.
 
Could you post a complete tarfile with both kernel modules and the test program?

Or maybe you are able to make Flame Graphs yourself? https://www.brendangregg.com/flamegraphs.html

#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff8313d0c9 at test_ioctl+0x9
#2 0xffffffff809d10dc at devfs_ioctl+0xcc
#3 0xffffffff80c3b9b4 at vn_ioctl+0xd4
#4 0xffffffff809d177e at devfs_ioctl_f+0x1e
#5 0xffffffff80bb1535 at kern_ioctl+0x255
#6 0xffffffff80bb1273 at sys_ioctl+0x123
#7 0xffffffff8100d129 at amd64_syscall+0x109
#8 0xffffffff80fe413b at fast_syscall_common+0xf8
 

Attachments

  • ioctl_test.zip (2.9 KB)
I did some tests by rebuilding the kernel with the Capsicum option commented out in the kernel config:
#options CAPABILITIES # Capsicum capabilities
The test time dropped from 1559924 us to 1426744 us.
It seems to be related to this part of kern_ioctl():
Code:
int
kern_ioctl(struct thread *td, int fd, u_long com, caddr_t data)
{
    struct file *fp;
    struct filedesc *fdp;
    int error, tmp, locked;

    AUDIT_ARG_FD(fd);
    AUDIT_ARG_CMD(com);

    fdp = td->td_proc->p_fd;

    switch (com) {
    case FIONCLEX:
    case FIOCLEX:
        FILEDESC_XLOCK(fdp);
        locked = LA_XLOCKED;
        break;
    default:
#ifdef CAPABILITIES
        FILEDESC_SLOCK(fdp);
        locked = LA_SLOCKED;
#else
        locked = LA_UNLOCKED;
#endif
        break;
    }

#ifdef CAPABILITIES
    if ((fp = fget_noref(fdp, fd)) == NULL) {
        error = EBADF;
        goto out;
    }
    if ((error = cap_ioctl_check(fdp, fd, com)) != 0) {
        fp = NULL;    /* fhold() was not called yet */
        goto out;
    }
    if (!fhold(fp)) {
        error = EBADF;
        fp = NULL;
        goto out;
    }
    if (locked == LA_SLOCKED) {
        FILEDESC_SUNLOCK(fdp);
        locked = LA_UNLOCKED;
    }
#else
    error = fget(td, fd, &cap_ioctl_rights, &fp);
    if (error != 0) {
        fp = NULL;
        goto out;
    }
#endif
 
I have difficulty compiling the Linux module on kernel 6.6.3:
Code:
  CC [M]  /home/cracauer/work/ioctl_test/test_ioctl.o
In file included from ./include/linux/linkage.h:7,
                 from ./include/linux/kernel.h:17,
                 from /home/cracauer/work/ioctl_test/test_ioctl.c:1:
/home/cracauer/work/ioctl_test/test_ioctl.c: In function 'test_driver_init':
./include/linux/export.h:29:22: error: passing argument 1 of 'class_create' from incompatible pointer type [-Werror=incompatible-pointer-types]
   29 | #define THIS_MODULE (&__this_module)
      |                     ~^~~~~~~~~~~~~~~
      |                      |
      |                      struct module *
/home/cracauer/work/ioctl_test/test_ioctl.c:65:41: note: in expansion of macro 'THIS_MODULE'
   65 |     if (IS_ERR(dev_class = class_create(THIS_MODULE, "test_class")))
      |                                         ^~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ./include/linux/cdev.h:8,
                 from /home/cracauer/work/ioctl_test/test_ioctl.c:6:
./include/linux/device/class.h:230:54: note: expected 'const char *' but argument is of type 'struct module *'
  230 | struct class * __must_check class_create(const char *name);
      |                                          ~~~~~~~~~~~~^~~~
/home/cracauer/work/ioctl_test/test_ioctl.c:65:28: error: too many arguments to function 'class_create'
   65 |     if (IS_ERR(dev_class = class_create(THIS_MODULE, "test_class")))
      |                            ^~~~~~~~~~~~
./include/linux/device/class.h:230:29: note: declared here
  230 | struct class * __must_check class_create(const char *name);
      |                             ^~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:243: /home/cracauer/work/ioctl_test/test_ioctl.o] Error 1
make[2]: *** [/usr/src/linux-6.6.3-cracauerdlwavehh/Makefile:1913: /home/cracauer/work/ioctl_test] Error 2
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-6.6.3-cracauerdlwavehh'
make: *** [Makefile:6: all] Error 2
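Looks like the class_create() API change in Linux 6.4, which dropped the owner argument. A version guard roughly like this should presumably build on both old and new kernels (dev_class being the pointer from the attached source, which I'm only guessing at):
Code:
#include <linux/version.h>

/* class_create() lost the 'owner' argument in Linux 6.4 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 4, 0)
    dev_class = class_create("test_class");
#else
    dev_class = class_create(THIS_MODULE, "test_class");
#endif
    if (IS_ERR(dev_class))
        return PTR_ERR(dev_class);   /* placeholder; keep whatever error path the module uses */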
 
#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff8313d0c9 at test_ioctl+0x9
#2 0xffffffff809d10dc at devfs_ioctl+0xcc
#3 0xffffffff80c3b9b4 at vn_ioctl+0xd4
#4 0xffffffff809d177e at devfs_ioctl_f+0x1e
#5 0xffffffff80bb1535 at kern_ioctl+0x255
#6 0xffffffff80bb1273 at sys_ioctl+0x123
#7 0xffffffff8100d129 at amd64_syscall+0x109
#8 0xffffffff80fe413b at fast_syscall_common+0xf8
If you gdb kernel.full in /usr/obj/usr/src/amd64.amd64/sys/GENERIC (or for whichever kernel conf you used) you will get <file>:<line> as opposed to just offset.
 
Code:
linux5.19:     
Total time = 111143824.000000 us
1:51.14 111.14 real 33.87 user 77.23 sys 99% CPU 0/80 faults

freebsd13:
Total time = 152667448.000000 us
2:32.67 152.67 real 25.84 user 126.81 sys 99% CPU 0/88 faults

freebsd15:
Total time = 174724613.000000 us
time/call = 81.36 ns
2:54.72 174.72 real 26.53 user 148.18 sys 99% CPU 0/108 faults

As you can see I am not getting a 2x slowdown with FreeBSD. But the slowness I do get is interesting, because small system calls are often faster on FreeBSD than on Linux. (These runs use a much larger iteration count than the 16M loop above; at 81.36 ns/call, 16M calls would take only about 1.4 s.)
 