C How to use kevent; confused by manpage

JanBeh · Feb 20, 2024

I recently got into using kevent(). Reading the manpage (as of FreeBSD 14.0), I'm left with a lot of questions and I didn't find better documentation yet.

In particular, the following questions remain unclear to me:

What is the purpose of EV_ENABLE and EV_DISABLE (in contrast to deleting an event with EV_DELETE when no longer needed). Is this performance related only, or are there any other effects when disabling an event temporarily (either manually or through EV_DISPATCH) instead of removing it (e.g. with EV_ONESHOT or manually)?
Why does the manpage mention the "remaining space in eventlist" in regard to EV_RECEIPT, and explain that when it's too small "subsequent changes will not get processed"? As far as I understand, EV_RECEIPT should avoid "draining any pending events". So how could the space be exhausted (other than being too small in the first place)? I don't understand what the word "remaining" means here.
What is the purpose of not specifying EV_CLEAR for those events where it's not implicitly enabled? How long would an event be reported as pending? Until I disable the event? Until I delete the event? Is there any way to "acknowledge" the event? Bonus question: Which events in particular "report state transitions instead of the current state" (as written in the manpage)?
Which flags are used as an input (in struct kevent *changelist) and which flags are an output (in struct kevent *eventlist)? I assume that EV_EOF and EV_ERROR are only outputs and all other flags are only used as inputs? Is there any reason why all flags are listed in the same list on the manpage? Is it correct that no flag is used both as input and output?
Monitoring a process to exit (EVFILT_PROC with NOTE_EXIT) results in the process' exit status (to be checked with WIFEXITED and related macros) to be returned in the data field of the struct kevent in the eventlist. However, the manpage is unclear about whether the process' status is collected or not. Do I have to subsequently call waitpid() to reap zombie processes? Does the behavior depend on EV_ONESHOT or EV_CLEAR, or any other flags?

The last question is actually a blocker for me. I guess I could read the source and/or experiment with how FreeBSD (currently) behaves, but I would prefer if this was clarified by the documentation (also because I want to work on other platforms, e.g. using libkqueue).

Maybe I'm just having difficulties in understanding/interpreting the manpage, but I feel like it's suboptimally written. Is there any better (ideally authoritative/normative) documentation on this subject?

JanBeh · Feb 20, 2024

JanBeh said:
I guess I could read the source and/or experiment with how FreeBSD (currently) behaves, but I would prefer if this was clarified by the documentation […]

I did some tests. My observations so far:

Child processes must be explicitly reaped with waitpid() as they become zombies otherwise. This is not (explicitly) documented.
For EVFILT_PROC and NOTE_EXIT, an event seems to get automatically removed from the queue once the monitored process terminates (which is also not documented), i.e. the behavior seems to be as if EV_ONESHOT were set (even if it's not), so a subsequent EV_DELETE fails with ENOENT.

Can I rely on this behavior?

JanBeh · Feb 23, 2024

My current solution is to:

Assume that setting EV_RECEIPT will ensure that no pending events are drained if the output buffer is not longer than the input buffer; i.e. ensure that the output buffer has the exact same size as the input buffer when I only want to add/remove multiple events without reacting to any other potentially pending events.
Explicitly specify EV_ONESHOT when there is doubt how kevent() might behave.
Always reap child processes through waitpid().

I still think the manpage is quite confusing at the least, and I might open an issue on that matter.

unitrunker · Feb 23, 2024

JanBeh said:
I still think the manpage is quite confusing at the least, and I might open an issue on that matter.

Please do!

JanBeh · Feb 23, 2024

Yet another issue regarding kqueue and timers:

The section on EVFILT_TIMER does not specify which timer is used for absolute times (NOTE_ABSTIME), i.e. whether CLOCK_REALTIME, CLOCK_MONOTONIC, or CLOCK_UPTIME is used. The manpage speaks of "non-monotonic timers". Does this mean NOTE_ABSTIME uses CLOCK_MONOTONIC? Or does it refer to relative timers (where CLOCK_MONOTONIC is used internally)? When exactly is a timer "non-monotonic" and what does that mean? What effect does the (then implicitly set) EV_CLEAR flag have on timers?

unitrunker said:
Please do!

I just did here: PR 277238 (including the above issue regarding kqueue and timers).

unitrunker · Feb 23, 2024

Some generic background (for anyone else looking):

https://people.freebsd.org/~jlemon/papers/kqueue.pdf

https://freebsdfoundation.org/wp-content/uploads/2014/05/Kqueue-Madness.pdf

unitrunker · Feb 23, 2024

JanBeh said:
What is the purpose of EV_ENABLE and EV_DISABLE (in contrast to deleting an event with EV_DELETE when no longer needed). Is this performance related only, or are there any other effects when disabling an event temporarily (either manually or through EV_DISPATCH) instead of removing it (e.g. with EV_ONESHOT or manually)?

From the second PDF linked above:

EV_ENABLE(I) – Turn on a kqueue event. This may seem redundant but its not, since when you add an event, unless you specify this enable flag, it will not be watched for. You also use this after an event has triggered and you wish to re-enable it (if it does not automatically re-enable itself).
EV_DISABLE(I) – This is the opposite of enable, so you can send this down with a previously enabled filter and have the event disabled, but the internal kqueue structure inside the kernel will not be removed (thus its ready to be re-enabled with EV_ENABLE when you want).

wjwithagen · Feb 23, 2024

Right,

Had a similar experience when porting Ceph to FreeBSD...
And I was even using parts of the work Alan Sommers had done..

The manual page is very condensed and ambiguous (or incomplete) in a lot of places.
So that requires quite some searching, but there is not too much info available.

I remember I had a lot of trouble getting the callback functions and parameter list
out of kevent_value to work without crashing every time.
But lots of cycles later, I got it to work.

Even searching through /usr/src, there are not too many programs using kqueue().
I tried finding examples in /usr/ports, but gave up in the end...

So it was a big trial and error exercise.
It's OK if it works as expected, but once you miss events, or events keep popping up..
Hard to debug.

unitrunker · Feb 27, 2024

So far ...

1. VNODE filters don't seem usable without EV_CLEAR.
2. closing a descriptor automatically deletes the associated event filter (as hinted at by JanBeh).
3. you can create an EVFILT_READ or EVFILT_WRITE on a POSIX message queue using the under-documented mq_getfd_np call (grep /usr/include/mqueue.h to see it). You'll get notified each time a message is added to or removed from a queue.
4. NOTE_TRACK with EVFILT_PROC is awesome (go look it up).

As wjwithagen says, it is all trial and error. One thing that has helped me is to write a small function to dump out the contents of the retrieved event.

Code:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>

#define _countof(arg) ( (sizeof(arg)) / (sizeof((arg)[0]) ) )

struct Pair { unsigned mask; const char *label; };

static void shownotes(FILE *output, const char *prefix, unsigned notes, const struct Pair *pairs, unsigned size)
{
    fprintf(output, " %s:", prefix);
    for (unsigned i = 0; i < size; i++) {
        if ((notes & pairs[i].mask) == pairs[i].mask)
            fprintf(output, " %s", pairs[i].label);
    }
}

static void showfilter(FILE *output, short filter)
{
    const struct { short filter; const char *label; } pairs[] = {
        {EVFILT_AIO, "AIO"}, {EVFILT_EMPTY, "EMPTY"}, {EVFILT_FS, "FS"}, {EVFILT_LIO, "LIO"},
        {EVFILT_PROC, "PROC"}, {EVFILT_PROCDESC, "PROCDESC"}, {EVFILT_READ, "READ"},
        {EVFILT_SENDFILE, "SENDFILE"}, {EVFILT_SIGNAL, "SIGNAL"}, {EVFILT_SYSCOUNT, "SYSCOUNT"},
        {EVFILT_TIMER, "TIMER"}, {EVFILT_USER, "USER"}, {EVFILT_VNODE, "VNODE"}, {EVFILT_WRITE, "WRITE"}};
    for (unsigned i = 0; i < _countof(pairs); i++) {
        if (filter == pairs[i].filter)
        {
            fprintf(output, " %s", pairs[i].label);
            return;
        }
    }
    /* Unknown filter: print its numeric value to the same stream. */
    fprintf(output, " %hd", filter);
}

void showevent(FILE *output, struct kevent *what)
{
    const struct Pair verbs[] = {
        {EV_ADD, "ADD"}, {EV_CLEAR, "CLEAR"}, {EV_DELETE, "DELETE"}, {EV_DISABLE, "DISABLE"},
        {EV_ENABLE, "ENABLE"}, {EV_DISPATCH, "DISPATCH"}, {EV_DROP, "DROP"}, {EV_EOF, "EOF"},
        {EV_ERROR, "ERROR"}, {EV_FORCEONESHOT, "FORCEONESHOT"}, {EV_KEEPUDATA, "KEEPUDATA"},
        {EV_ONESHOT, "ONESHOT"}, {EV_RECEIPT, "RECEIPT"}, {EV_FLAG1, "FLAG1"}, {EV_FLAG2, "FLAG2"} };
    const struct Pair vnodes[] = {
        {NOTE_DELETE, "DELETE"}, {NOTE_CLOSE, "CLOSE"}, {NOTE_ATTRIB, "ATTRIB"}, {NOTE_CLOSE_WRITE, "CLOSE-WRITE"},
        {NOTE_EXTEND, "EXTEND"}, {NOTE_LINK, "LINK"}, {NOTE_OPEN, "OPEN"}, {NOTE_READ, "READ"}, {NOTE_WRITE, "WRITE"} };
    const struct Pair procs[] = { {NOTE_EXIT, "EXIT"}, {NOTE_EXEC, "EXEC"}, {NOTE_FORK, "FORK"}, {NOTE_CHILD, "CHILD"}, {NOTE_TRACK, "TRACK"}, {NOTE_TRACKERR, "TRACKERR"} };
    const struct Pair users[] = { {NOTE_TRIGGER, "TRIGGER"}, {NOTE_FFAND, "AND"}, {NOTE_FFCOPY, "COPY"}, {NOTE_FFNOP, "NOP"}, {NOTE_FFOR, "OR"} };
    const struct Pair timers[] = { {NOTE_SECONDS, "SECONDS"}, {NOTE_MSECONDS, "MSECONDS"}, {NOTE_USECONDS, "USECONDS"}, {NOTE_NSECONDS, "NSECONDS"}, {NOTE_ABSTIME, "ABSTIME"} };

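    /* Cast the fixed-width fields so the format also works where long is 32 bits. */
    fprintf(output, "Ident %llu Data %lld Verbs %04hx Notes %08x Ext [%llu %llu %llu %llu]",
        (unsigned long long)what->ident, (long long)what->data, what->flags, what->fflags,
        (unsigned long long)what->ext[0], (unsigned long long)what->ext[1],
        (unsigned long long)what->ext[2], (unsigned long long)what->ext[3]);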
    showfilter(output, what->filter);
    shownotes(output, "verbs", what->flags, verbs, _countof(verbs));
    unsigned notes = what->fflags;
    switch (what->filter) {
        case EVFILT_SIGNAL:
        case EVFILT_AIO:
        case EVFILT_EMPTY:
        case EVFILT_READ:
        case EVFILT_WRITE:
            if (notes > 0)
                fprintf(output, " notes: %08X", notes);
            break;
        case EVFILT_TIMER:
            shownotes(output, "notes", notes, timers, _countof(timers));
            break;
        case EVFILT_VNODE:
            shownotes(output, "notes", notes, vnodes, _countof(vnodes));
            break;
        case EVFILT_PROC:
            shownotes(output, "notes", notes, procs, _countof(procs));
            break;
        case EVFILT_USER:
            shownotes(output, "notes", notes, users, _countof(users));
            break;
    }
    fprintf(output, "\n");
}

This allows me to dump out the events from the main event loop, like so:

Code:

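    // needs <errno.h> and <stdbool.h>; kq is the descriptor returned by kqueue()
    struct kevent captured;

    bool alive = true;
    while (alive)
    {
        // Block until one event is available (no changelist, no timeout).
        int result = kevent(kq, NULL, 0, &captured, 1, NULL);
        if (result == -1)
        {
            if (errno == EINTR)
                continue;   // interrupted by a signal; just retry
            perror("kevent");
            alive = false;
        }
        else if (result == 1)
        {
            showevent(stdout, &captured);
            // do some work here.
        }
    }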

Some sample output for two VNODE filters. Ident 4 is /tmp and Ident 5 is some random file under /tmp.

Code:

Ident 5 Data 0 Verbs 0020 Notes 00000080 Ext [0 0 0 0] VNODE verbs: CLEAR notes: OPEN
Ident 5 Data 0 Verbs 0020 Notes 00000500 Ext [0 0 0 0] VNODE verbs: CLEAR notes: CLOSE READ
Ident 4 Data 0 Verbs 0020 Notes 00000080 Ext [0 0 0 0] VNODE verbs: CLEAR notes: OPEN
Ident 4 Data 0 Verbs 0020 Notes 00000400 Ext [0 0 0 0] VNODE verbs: CLEAR notes: READ
Ident 4 Data 0 Verbs 0020 Notes 00000400 Ext [0 0 0 0] VNODE verbs: CLEAR notes: READ
Ident 4 Data 0 Verbs 0020 Notes 00000100 Ext [0 0 0 0] VNODE verbs: CLEAR notes: CLOSE
Ident 4 Data 0 Verbs 0020 Notes 00000180 Ext [0 0 0 0] VNODE verbs: CLEAR notes: CLOSE OPEN
Ident 4 Data 0 Verbs 0020 Notes 00000012 Ext [0 0 0 0] VNODE verbs: CLEAR notes: LINK WRITE

It's mildly entertaining to see VNODE /tmp events pop up while compiling, opening windows in a browser or entering a wild-card at a shell prompt.

Sample output from EVFILT_PROC where you can monitor EXEC, FORK, and EXIT (and track child processes):

Code:

Ident 5916 Data 5915 Verbs 0030 Notes 00000004 Ext [0 0 0 0] PROC verbs: CLEAR ONESHOT notes: CHILD
Ident 5915 Data 0 Verbs 0020 Notes 40000000 Ext [0 0 0 0] PROC verbs: CLEAR notes: FORK
Ident 5916 Data 512 Verbs 8030 Notes 80000000 Ext [0 0 0 0] PROC verbs: CLEAR EOF ONESHOT notes: EXIT
Ident 5915 Data 256 Verbs 8030 Notes 80000000 Ext [0 0 0 0] PROC verbs: CLEAR EOF ONESHOT notes: EXIT

Note that I did not need to set EV_CLEAR for EVFILT_PROC. Likewise, if a process exits, you don't need to delete the associated filter. This is consistent with how socket and file descriptors are handled.

Legend: verbs = kevent.flags and notes = kevent.fflags.
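The Data values on the two EXIT lines are wait(2)-style statuses, so they can be decoded with the usual macros. A minimal sketch; describe_status is a made-up helper name and the wording of the strings is arbitrary.

```c
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: turn a wait(2)-style status (for example the data
 * field of a NOTE_EXIT event) into a short human-readable description.
 * Returns whatever snprintf() returns. */
static int describe_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status))
        return snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    if (WIFSIGNALED(status))
        return snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    return snprintf(buf, len, "other status 0x%x", (unsigned int)status);
}
```

For the sample output above, Data 512 decodes to exit code 2 and Data 256 to exit code 1.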

unitrunker · Feb 27, 2024

I just noticed this. You can create an EVFILT_READ on a kqueue descriptor. You can have a kqueue of kqueues.

unitrunker · Feb 27, 2024

unitrunker said:
2. closing a descriptor automatically deletes the associated event filter (as hinted at by JanBeh).

This is accurately described in the existing kqueue(2) docs.

Code:

EV_DELETE    Removes the event from the kqueue.  Events which are
            attached to file descriptors are automatically deleted on
            the last close of the descriptor.

unitrunker · Feb 27, 2024

JanBeh said:
Monitoring a process to exit (EVFILT_PROC with NOTE_EXIT) results in the process' exit status (to be checked with WIFEXITED and related macros) to be returned in the data field of the struct kevent in the eventlist. However, the manpage is unclear about whether the process' status is collected or not. Do I have to subsequently call waitpid() to reap zombie processes? Does the behavior depend on EV_ONESHOT or EV_CLEAR, or any other flags?

I've found that EVFILT_PROC is self-clearing. NOTE_EXIT is a one-time event for a given pid. I tested NOTE_FORK with NOTE_TRACK two levels deep (forked a child which then forked once more) - all self-clearing. I had no trouble getting the exit status for each child (see the logged output shared above). The need to call wait(2) is a separate concern for the parent that spawned the child process. In other words, kevent does not change fork/posix_spawn/wait behavior.

My personal opinion on the kevent(2) syscall is it is simpler to only pass one change-list parameter or one event-list parameter but not both at the same time. Either add / remove one filter from the kqueue or retrieve one event from the kqueue - don't mix these two. This makes debugging easier since the scope for any error is limited to the one kevent object passed to kevent(2). You can raise the number of events to drain at once after testing proves the code works with one event at a time.