/bin/timeout stopped sending SIGKILL signals on one of my FreeBSD 13.2 servers

Greetings,
I am facing a very strange problem.
On one of my servers /bin/timeout does not work as expected: it does not send SIGKILL (-9)!

If I issue:
/bin/timeout -k 3s 3s dig @7.7.7.7 google.com

This times out on every other FreeBSD 13.2 server, but on one particular VPS instance timeout does not kill dig, which keeps trying and retrying (7.7.7.7 is not a valid DNS server).
On every other FreeBSD server the above command terminates in 6 seconds. On this particular one (FreeBSD 13.2-RELEASE-p1 GENERIC amd64)
it never terminates.
If I issue from another terminal
killall -9 timeout
it is killed. So one day /bin/timeout just stopped sending SIGKILL signals?
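
For reference, "-k 3s 3s" means: send SIGTERM after 3 seconds, then send SIGKILL 3 seconds later if the command is still running, which is where the 6 seconds come from. A minimal sketch that reproduces this without any DNS or network dependency, assuming a child that ignores SIGTERM and durations scaled down to 1 second each (both are illustration choices, not part of the original test):

```shell
# The child ignores SIGTERM, so only the follow-up SIGKILL can end it.
# Durations are scaled down: TERM after 1 second, KILL 1 second later.
start=$(date +%s)
timeout -k 1 1 sh -c 'trap "" TERM; sleep 30'
status=$?
end=$(date +%s)
elapsed=$((end - start))

# On a healthy system this should report roughly 2 seconds, not 30.
echo "exit status: $status, elapsed: ${elapsed}s"
```

If this returns after about 2 seconds while the dig test hangs, the problem is more likely in what the child is blocked on than in timeout itself.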

On the same server if I issue

/usr/local/bin/gtimeout -k 3s 3s dig @7.7.7.7 google.com


the command terminates.

Does this server require a reboot? This is very strange even for me, and I thought I had seen many things over the last 30 years on BSD.

# uptime
9:47AM up 39 days, 9:28, 10 users, load averages: 0.44, 0.71, 0.65

All my FreeBSD servers have the same MD5 (both the ones that work OK and this VPS that does not).
# md5sum /bin/timeout
4fccec7e1914ba2a302c898bc872760d /bin/timeout

freebsd-update IDS does not print any error.

If you want to give me some commands to try in order to help debug this, I will run them.
I will perform the reboot soon, because many of my scripts utilize /bin/timeout and it is essential for the well-being of the machines.

Thank you all!
 
Could you run truss on the timeout process and report what exactly timeout is doing at that moment? Do the following:
1) run "/bin/timeout -k 3s 3s dig @7.7.7.7 google.com"
2) after 10 seconds, from another terminal, find the PID of the command from step 1) and run "truss -p <PID>"
Also, are there perhaps any messages in dmesg or /var/log/messages?
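
The two steps above could be wrapped roughly as follows. This is only a sketch: the function names newest_pid and trace_timeout are made up for illustration, and it assumes the hanging timeout is the most recently started process with that exact name.

```shell
# Print the PID of the most recently started process whose name is exactly $1.
newest_pid() {
    pgrep -n -x "$1"
}

# Show the state of the newest timeout process, then attach truss(1) to it.
trace_timeout() {
    pid=$(newest_pid timeout) || { echo "no timeout process found" >&2; return 1; }
    ps -o pid,stat,wchan,command -p "$pid"   # process state and wait channel
    truss -p "$pid"                          # live trace; interrupt with Ctrl-C to detach
}
```

Start the hanging command in one terminal, wait about 10 seconds, then call trace_timeout as root in another.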
 
[…] On one of my servers /bin/timeout does not work as expected. […]
Now you should take that conclusion with a grain of salt. Do you observe the same issue with other commands, too? E.g. timeout 3s sleep 42 “finishes” after 3 seconds, right?
[…] /bin/timeout -k 3s 3s dig @7.7.7.7 google.com […]
Wait a minute, why are you using timeout(1) for dig? dig(1) has a +timeout=3s parameter (in conjunction with +tries=2). There is no need for external tools.
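
A sketch of what that could look like. The dig(1) manual documents the option as +time=T with T in whole seconds, so that spelling is used here; the names DIG_TIME, DIG_TRIES, dig_worst_case and dig_bounded are hypothetical and only serve the illustration.

```shell
DIG_TIME=3    # seconds to wait per attempt (dig +time)
DIG_TRIES=2   # attempts per server (dig +tries)

# Rough upper bound, in seconds, for a query against a single dead server.
dig_worst_case() {
    echo $((DIG_TIME * DIG_TRIES))
}

# Query server $1 for name $2 using dig's own time limits.
dig_bounded() {
    dig +time="$DIG_TIME" +tries="$DIG_TRIES" "@$1" "$2"
}
```

With these values, dig_bounded 7.7.7.7 google.com should give up by itself after roughly 6 seconds, without an external wrapper.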
 
Hi,
thanks for the replies.

The problem was resolved without a reboot; I believe it was caused by issues on other servers.

This machine runs many scripts every 2 minutes, and every script wraps each command in /bin/timeout. On Saturday a power outage knocked out a room full of servers, and this server (which stayed operational) had NFS and iSCSI mounts and was using files from those servers. Timeout was unable to kill the processes because they were blocked in disk wait. New executions kept piling up, and after some time I had 1200 more processes than usual, none of which could be killed. In this situation /bin/timeout could not deliver signals to new processes either (not even to sleep). Please note that gtimeout kept working in this extreme case without any problem.
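
Processes stuck like this typically show a state beginning with D (uninterruptible disk wait) in ps(1). A minimal sketch for spotting them early; the function names disk_wait_only and list_disk_wait are made up for illustration:

```shell
# Keep the header line plus every row whose STAT column (2nd field) starts with "D".
disk_wait_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List all processes currently in disk wait, with their wait channel.
list_disk_wait() {
    ps -ax -o pid,stat,wchan,command | disk_wait_only
}
```

A steadily growing output from list_disk_wait would have pointed at the dead NFS/iSCSI mounts rather than at timeout.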

On Monday, somebody managed to bring the server room back up (it is at a university, not mission critical), the servers came online, all processes came out of disk wait and were killed, and now /bin/timeout works as expected. See the attached munin graph for the spike in sleeping/zombie/idle processes.


TL;DR
The conclusion is: when many processes are in disk wait and cannot accept SIGKILL from /bin/timeout (while kill -9 from another terminal does work), /bin/timeout cannot send any signal even to newly created processes, whereas gtimeout keeps operating and delivers SIGKILL as expected. I am changing my scripts on this server to use gtimeout, which seems more robust.
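
One way the scripts could make that switch in a single place is sketched below. The helper names pick_timeout and run_bounded are hypothetical; the sketch assumes gtimeout comes from the coreutils package and falls back to plain timeout where it is not installed.

```shell
# Print the name of the timeout utility to use: gtimeout if present, else timeout.
pick_timeout() {
    if command -v gtimeout >/dev/null 2>&1; then
        echo gtimeout
    elif command -v timeout >/dev/null 2>&1; then
        echo timeout
    else
        return 1
    fi
}

# run_bounded TERM_AFTER KILL_AFTER command [args...]
# Sends SIGTERM after TERM_AFTER seconds, SIGKILL KILL_AFTER seconds later.
run_bounded() {
    t=$(pick_timeout) || { echo "no timeout utility found" >&2; return 127; }
    term_after=$1
    kill_after=$2
    shift 2
    "$t" -k "$kill_after" "$term_after" "$@"
}
```

A script line would then read, for example, run_bounded 3 3 dig @7.7.7.7 google.com.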


For the technical info, this is a VPS with dmesg:
FreeBSD 13.2-RELEASE-p1 GENERIC amd64
CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (2300.06-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x306f2 Family=0x6 Model=0x3f Stepping=2
Features=0xf83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS>
Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRANH>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x21<LAHF,ABM>
Structured Extended Features=0x7ab<FSGSBASE,TSCADJ,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID>
Structured Extended Features3=0x84000400<MD_CLEAR,IBPB,SSBD>
XSAVE Features=0x1<XSAVEOPT>
Hypervisor: Origin = "KVMKVMKVM"
real memory = 4294967296 (4096 MB)
avail memory = 4104073216 (3913 MB)



Thank you again for your useful queries. Problem solved.
 

Attachments

  • tooManyProcesses.png