What is causing sbwait in php-fpm, and how to debug it?

Hi guys,

We upgraded the server to FreeBSD 10.3-STABLE, and under high load php-fpm started doing strange things (taking all the CPU, locks, sbwait, select, etc.). When the server is not under load, everything works perfectly.

top:
Code:
59 processes:  4 running, 55 sleeping
CPU: 11.7% user,  0.0% nice,  6.1% system,  0.4% interrupt, 81.8% idle
Mem: 3024M Active, 7578M Inact, 1956M Wired, 11M Cache, 1588M Buf, 3351M Free
Swap: 32G Total, 8016K Used, 32G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
63826 www           1  30    0   310M 45056K accept 19   0:06  14.70% php-fpm
63815 www           1  30    0   310M 45780K accept 21   0:06  13.77% php-fpm
63827 www           1  29    0   306M 42532K accept 14   0:06  13.77% php-fpm
63828 www           1  30    0   306M 42468K accept  9   0:05  13.28% php-fpm
63822 www           1  30    0   306M 42380K accept  5   0:06  13.18% php-fpm
...
 5085 www           1  23    0   118M 62504K kqread 18  44:06   6.49% lighttpd
  763 root          1  34    0   298M 23264K kqread 11   0:30   0.00% php-fpm
11824 www           1  52    0   314M 38628K sbwait 11   0:22   0.00% php-fpm
13634 www           1  36    0   306M 35432K sbwait 21   0:21   0.00% php-fpm
14140 www           1  52    0   310M 35948K sbwait 14   0:19   0.00% php-fpm
13141 www           1  20    0   314M 35884K sbwait 18   0:18   0.00% php-fpm
11133 www           1  40    0   310M 36260K sbwait 19   0:17   0.00% php-fpm
...

What I don't like here is the sbwait state of the php-fpm processes. As you can see, they are not respawned, and they should be in the accept state. I believe they end up in this state after some timeout (?) on something. First they go to the select state, then they end up in sbwait using 0% CPU. I would like to find out what is causing this sbwait (truss -p <PID> never finishes). Please help me debug this and find out what they are actually waiting for.

Some values from php-fpm.conf:
Code:
emergency_restart_threshold = 10
emergency_restart_interval = 1m
pm = dynamic
pm.max_children = 250
pm.start_servers = 30
pm.min_spare_servers = 10
pm.max_spare_servers = 30
pm.max_requests = 500

Within php-fpm there are connections to various resources (mysql, memcache, redis, sphinxsearch...), but none of them report timeouts in the logs.
 
You need to diagnose this further, as 10.1 soon won't be supported.

Here is a quote from a FreeBSD developer:

The sbwait wchan is present when a thread has invoked the in-kernel
sbwait() function to wait for a socket event. It's used in a number of
situations, but the main ones are:

- The thread is trying to send on a blocking socket, but there's
insufficient socket buffer space, so it must wait for space. This might
occur if it has managed to max out the bandwidth available to a TCP
connection, or flow control is in use and the receiver does not wish to
receive more data yet.

- The thread is trying to receive on a blocking socket, but there's not
enough data to satisfy the read request, so it must wait for data to be
received. It might be waiting for a remote TCP sender to have data
available, or for in-flight data to arrive.

Robert N M Watson
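
To make the second case concrete, here is a minimal Python sketch (not taken from php-fpm) of a reader blocked in recv() on a blocking socket while the other side is slow to answer. The function name blocking_recv_demo and the "slow backend" thread are made up for illustration. On FreeBSD, I would expect the reader to show up in sbwait during the wait, which is the same picture as a PHP worker waiting on a backend that has not replied yet.

```python
import socket
import threading
import time


def blocking_recv_demo(delay=0.5):
    """Return (data, seconds the reader spent blocked in recv)."""
    # A connected pair of stream sockets stands in for the worker
    # (reader) and a slow backend (writer). Both names are hypothetical.
    reader, writer = socket.socketpair()

    def slow_backend():
        time.sleep(delay)          # backend takes a while to answer
        writer.sendall(b"reply")

    backend = threading.Thread(target=slow_backend)
    backend.start()

    start = time.monotonic()
    data = reader.recv(1024)       # no timeout set: blocks until data arrives
    waited = time.monotonic() - start

    backend.join()
    reader.close()
    writer.close()
    return data, waited


if __name__ == "__main__":
    data, waited = blocking_recv_demo()
    print(data, round(waited, 2))
```

If the backend never answered and no read timeout were configured on the client side, the recv() call would simply stay blocked, which would match workers that sit in sbwait at 0% CPU.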

So from that I would look into the following:

Increasing socket buffer sizes
Increasing open socket limits
Increasing socket queue limits; one that comes to mind is 'somaxconn', which has a very low default of 128
Checking the firewall configuration if the communication is over TCP
Also, ideally the PHP connections to mysql/memcache/redis etc. should go over Unix domain sockets rather than TCP when those services run on the same host.
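
Before changing anything, it may help to record the current values. Below is a rough Python sketch that reads a handful of socket-related sysctls and parses the "name: value" output. The selection of sysctl names is my own guess at what is relevant here, and the helper names parse_sysctl and read_sysctls are hypothetical.

```python
import subprocess

# Assumed to be the relevant FreeBSD tunables for the checks above;
# adjust the list to taste.
SOCKET_SYSCTLS = [
    "kern.ipc.somaxconn",      # listen queue limit
    "kern.ipc.maxsockbuf",     # maximum socket buffer size
    "kern.ipc.maxsockets",     # open socket limit
    "net.inet.tcp.sendspace",  # default TCP send buffer
    "net.inet.tcp.recvspace",  # default TCP receive buffer
]


def parse_sysctl(text):
    """Parse 'name: value' lines into a dict, keeping integer values only."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            values[name.strip()] = int(value)
    return values


def read_sysctls(names=SOCKET_SYSCTLS):
    """Run sysctl(8) for the given names and return the parsed values."""
    result = subprocess.run(
        ["sysctl", *names], capture_output=True, text=True, check=False
    )
    return parse_sysctl(result.stdout)
```

Calling read_sysctls() on the server before and after tuning gives a simple baseline to compare. On FreeBSD, netstat -Lan should also show the listen queue sizes per socket, which is a quick way to see whether the php-fpm listen queue is filling up.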
 