C Process getting own CPU time

Summary: various ways of getting user CPU time get stuck at about 105.824 hours; is there a better way to get this info?

OS: FreeBSD 12.0-RELEASE-p3, but this behavior was observed on FreeBSD 11.1 and FreeBSD 11.2 also. amd64 architecture.

I have a program that generates a lot of permutations and checks whether each one satisfies some criteria that I want (example: arrange numbers on a many-sided die so that the numbers at each vertex add up to the same total). This program tends to run a long time and essentially emulates a CPU infinite loop, except for outputting a solution every few minutes, days, years, or geologic eons. I want to get some idea of how long a particular run will take, so I put in code that uses CPU time and real time to estimate time to completion for a run. Currently it spits out status every 5 minutes. The process is single-threaded.
The process is run at idle priority on systems with 2 and 4 cores, with one core left free for light use and overnight maintenance jobs. It typically gets 99.9% user CPU averaged over a good portion of a day, even with Firefox or Libreoffice (at normal priority) sucking up CPU a few hours a day.

I have discovered, though, that user CPU time seems to quit advancing after 380967 seconds (105.824 hours or 4.409 days; this is sampled every 5 minutes real time) and then just returns the same number over and over. Mostly I'm using getrusage() to get user CPU time, but clock() and clock_gettime(CLOCK_VIRTUAL) seem to get their information from the same underlying code. I spent a bit of time looking for 64-bit overflows in my time-formatting routines before realizing they weren't the problem. Is there a better clock to use with clock_gettime()? A per-thread clock?

Is there a better way to get this information? I don't *have* to have just user CPU; user + system + interrupt time would probably be indistinguishable. "top" seems to get the run time right. How portable is kinfo_getproc(), which "top" uses? How long has that interface been around? Will it stay around? How efficient is it compared to getrusage()? Does anyone know when *IT* will overflow?

The only type I can use for the estimated time for a run is "long double", because of its exponent range. However, just because a full run will take way longer than the age of the universe doesn't mean I won't get a few solutions in a few weeks.
 
Replying to my own post after some research and testing.
CPU time is divided into three different types for a process: user, system, and interrupt. These are obtained by multiplying total CPU time by the proportion of clock ticks where the system was in the relevant state. This involves a 64-bit number multiplied by a 64-bit number and then divided by a 64-bit number, which really needs, but doesn't have, a 128-bit intermediate result, so it overflows after about 4.41 days of CPU time. The fix would be either a 128-bit intermediate or floating point (long double), which I understand is problematic in the kernel. Either way, it's slow.

Just about any method of getting CPU time that tries to separate user and system time will have a problem with overflow: getrusage(), and clock_gettime using CLOCK_VIRTUAL, CLOCK_PROF, or CLOCK_PROCESS_CPUTIME_ID.

Ways you can get CPU time for your own process (in all cases, this gets you user + system + interrupt):
- If your process has a single thread,
Code:
clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...)
will give you the CPU time of your process in a struct timespec.

- Call
Code:
struct kinfo_proc *p;
p = kinfo_getproc(getpid());  /* <libutil.h>, link with -lutil; free(p) when done */
and use the returned p->ki_runtime value (in microseconds). This transfers a lot of info from the kernel and uses very little of it. Unfortunately, kinfo_getproc() doesn't seem to expose the inputs to the calculation in calcru1() that overflows, so I can't just redo the calculation with more bits to get it right.

- I'm still testing whether you can use setitimer(ITIMER_VIRTUAL, ...). Set up a periodic interrupt for some CLOCK_INTERVAL and a signal handler for SIGVTALRM to count interrupts. I suggest CLOCK_INTERVAL be a multiple of 1 second: about 1-10 minutes for testing, and 1 or 2 days for production use. Call getitimer(ITIMER_VIRTUAL, ...), then compute (interrupt_count + 1) * CLOCK_INTERVAL - time_from_getitimer as the CPU time. This does work, but I haven't managed to run it up to the point where overflow of the total time might screw it up (even though no single interval will overflow). Strictly speaking, I need mutexes or something, but since the interrupt count changes only once every day or two, I can probably get away with re-reading the interrupt count until I get the same value twice in a row. Note: making CLOCK_INTERVAL 1 or 2 days doesn't reduce the granularity of the result. You can still get sub-second results.
 