timeout and socket problems with mysql server

Hi everyone!
We are using FreeBSD 10.1p4 with databases/mysql56-server 5.6.24 and are seeing performance problems compared to a Linux system with the same hardware and mysqld setup.

While searching for the cause we found the following behaviour, which does not occur on the Linux boxes. Maybe you have an idea about it?

truss -p `cat /var/db/mysql/*.pid` produces tons of output like this (maybe 10-20 lines per second or more, depending on load):

Code:
sendto(445,"\^A\0\0\^A\^AT\0\0\^B\^Cdef\^Pna"...,117,0x0,NULL,0x0) = 117 (0x75)
recvfrom(445,0xaa50bf000,4,0x0,0x0,0x0)          ERR#35 'Resource temporarily unavailable'
poll({445/POLLIN|POLLPRI},1,28800000)            = 1 (0x1)
recvfrom(445,"s\0\0\0",4,0x0,NULL,0x0)           = 4 (0x4)
recvfrom(445,"\^CSELECT `main_table`.* FROM `b"...,115,0x0,NULL,0x0) = 115 (0x73)
sendto(445,"\^A\0\0\^A\^T?\0\0\^B\^Cdef\^Pna"...,1641,0x0,NULL,0x0) = 1641 (0x669)
recvfrom(445,0xaa50bf000,4,0x0,0x0,0x0)          ERR#35 'Resource temporarily unavailable'
poll({445/POLLIN|POLLPRI},1,28800000)            = 1 (0x1)
recvfrom(445,"{\0\0\0",4,0x0,NULL,0x0)           = 4 (0x4)
recvfrom(445,"\^CSELECT `main_table`.* FROM `b"...,123,0x0,NULL,0x0) = 123 (0x7b)
sendto(445,"\^A\0\0\^A\^T?\0\0\^B\^Cdef\^Pna"...,1641,0x0,NULL,0x0) = 1641 (0x669)
recvfrom(445,0xaa50bf000,4,0x0,0x0,0x0)          ERR#35 'Resource temporarily unavailable'
poll({445/POLLIN|POLLPRI},1,28800000)            = 1 (0x1)
recvfrom(445,"d\0\0\0",4,0x0,NULL,0x0)           = 4 (0x4)
recvfrom(445,"\^CSELECT `core_session`.* FROM "...,100,0x0,NULL,0x0) = 100 (0x64)
sendto(445,"\^A\0\0\^A\^CR\0\0\^B\^Cdef\^Pna"...,295,0x0,NULL,0x0) = 295 (0x127)
recvfrom(445,0xaa50bf000,4,0x0,0x0,0x0)          ERR#35 'Resource temporarily unavailable'
poll({445/POLLIN|POLLPRI},1,28800000)            = 1 (0x1)
recvfrom(445,"\M^U\n\0\0",4,0x0,NULL,0x0)        = 4 (0x4)
recvfrom(445,"\^CINSERT INTO `core_session` (`"...,2709,0x0,NULL,0x0) = 2709 (0xa95)
sendto(445,"\a\0\0\^A\0\^A\0\^B\0\0\0",11,0x0,NULL,0x0) = 11 (0xb)
recvfrom(445,0xaa50bf000,4,0x0,0x0,0x0)          ERR#35 'Resource temporarily unavailable'

The 445 is one of the file descriptors in use for /tmp/mysql.sock; the other descriptors used by the server show up in the same way.
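If someone wants to cross-check on their own box which descriptor belongs to the socket, something like this should do (pid file path assumed as above; the exact output columns may differ between FreeBSD versions):

Code:
sockstat -u | grep mysql.sock             # unix-domain sockets and the processes holding them
procstat -f `cat /var/db/mysql/*.pid`     # all descriptors of the running mysqld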

The /tmp directory is a tmpfs and we have plenty of RAM and SSD storage on these machines.
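Just for completeness, that is verified with nothing more than:

Code:
mount | grep /tmp     # should list a tmpfs mount on /tmp
df -h /tmp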

The second thing we get from this check is:

Code:
_umtx_op(0x80122d128,0xf,0x0,0x18,0x7ffffe3f1de0,0x0) ERR#60 'Operation timed out'
clock_gettime(13,{1432061827.000000000 })        = 0 (0x0)
gettimeofday({1432061827.170639 },0x0)           = 0 (0x0)
_umtx_op(0x80122d110,0xf,0x0,0x18,0x7ffffe5f2dd0,0x0) ERR#60 'Operation timed out'
select(0,0x0,0x0,0x0,{0.999000 })                = 0 (0x0)
select(0,0x0,0x0,0x0,{1.000000 })                = 0 (0x0)
gettimeofday({1432061827.215284 },0x0)           = 0 (0x0)
gettimeofday({1432061827.216048 },0x0)           = 0 (0x0)
gettimeofday({1432061827.216698 },0x0)           = 0 (0x0)
gettimeofday({1432061827.217732 },0x0)           = 0 (0x0)
gettimeofday({1432061827.217487 },0x0)           = 0 (0x0)
clock_gettime(13,{1432061827.000000000 })        = 0 (0x0)
_umtx_op(0x80122d128,0xf,0x0,0x18,0x7ffffe3f1de0,0x0) ERR#60 'Operation timed out'
_umtx_op(0x80122d110,0xf,0x0,0x18,0x7ffffe5f2dd0,0x0) ERR#60 'Operation timed out'

which shows up less often - maybe one 'pair' of such timeouts per second.

We can only assume that this might point to the source of our performance problems, as we cannot find anything obvious in the logs or elsewhere.
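For reference, here is roughly what we check on the MySQL side to see whether the server itself reports any contention; these are only the standard status commands, using the socket path from above:

Code:
mysql -S /tmp/mysql.sock -e "SHOW FULL PROCESSLIST"
mysql -S /tmp/mysql.sock -e "SHOW GLOBAL STATUS LIKE 'Threads_%'"
mysql -S /tmp/mysql.sock -e "SHOW ENGINE INNODB STATUS\G"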

Any help is really welcome, and if I missed any needed information, please let me know!

Many thanks in advance!
Jimmy
 
bump :)

Is there anyone who can at least confirm seeing something like

Code:
ERR#35 'Resource temporarily unavailable'

in the truss output of a heavily used mysqld process?

We would be quite interested to know whether we are the only ones suffering from this.

Greetings,
Jimmy
 
Sorry for reanimating this rather old thread, but this (or something quite similar) actually happens on AMD64 10.2-RELEASE-p13 as well, with MySQL version 5.6.27. Did you ever figure out what the problem was?

Trussing mysqld in my case shows a lot of _umtx_op / "Operation timed out" events like the ones you describe above, and the mysqld process is unresponsive. This happens after sending SIGHUP to the mysqld process in an attempt to recover resources: forced closing of connections is logged to the MySQL server error log, and mysqld then appears hung. Trussing the mysqld process in this state shows the _umtx_op lines. Sending SIGTERM to the mysqld process does nothing, and SIGKILL works "sometimes". The mysqld_safe process then restarts mysqld, the databases are recovered, and everything works again for a few days. When the trouble starts, it leads with "ERROR 1040 (HY000): Too many connections" when trying to use the mysql monitor locally via the MySQL socket.
 
I'm sorry to say that this persists here and that we have not found any solution for it :(
But the 'Too many connections' error could be the cause in your case, which would make it different from ours - unless you have already rechecked that you are not hitting any connection limits.
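If you want to recheck that quickly, the standard lookups should be enough:

Code:
mysql> SHOW VARIABLES LIKE 'max_connections';
mysql> SHOW GLOBAL STATUS LIKE 'Max_used_connections';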
 
Sending SIGTERM to the mysqld process does nothing, and SIGKILL works "sometimes".
That's a good way to corrupt your databases. You really shouldn't do that.

When the trouble starts, it leads with "ERROR 1040 (HY000): Too many connections" when trying to use the mysql monitor locally via the MySQL socket.
Increase your max_connections.
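For example (the value here is only an illustration, pick one that fits your workload and file descriptor limits):

Code:
mysql> SET GLOBAL max_connections = 1000;    -- takes effect immediately
# and to make it persistent, in my.cnf under [mysqld]:
max_connections = 1000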
 
I realise that stopping mysqld with SIGKILL is a risk with regard to data corruption. I do use SIGHUP and SIGTERM and check what's open with fstat before taking the SIGKILL leap. As I see it, there is no gentler alternative, a reboot included. The process is "dead" for all practical purposes and there is no clean way to shut it down that I've found so far. So I'm looking for a reason why this lockup may happen and how to prevent it.
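To make that concrete, this is roughly the escalation I go through (pid file path assumed, as earlier in the thread):

Code:
PID=`cat /var/db/mysql/*.pid`
kill -HUP $PID        # try to get mysqld to recover resources
kill -TERM $PID       # ask for a normal shutdown
fstat -p $PID         # see what the process still has open
kill -KILL $PID       # last resort only - InnoDB runs crash recovery on the restart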

I've had a much higher setting for max_connections (4000 IIRC) compared to the more moderate 1400 right now (while being careful with kern.threads.max_threads_per_proc). The current idea is that something causes a resource or threading problem/race/deadlock in mysqld, leading to more and more connections getting held up and eventually to the "Too many connections" problem. If that's true, setting a higher max_connections will not help in this case (and that has been tested). It's intermittent and not easily reproduced.
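For reference, the threads sysctl mentioned above can be inspected and raised like this (the value is just the one I happened to use, not a recommendation):

Code:
sysctl kern.threads.max_threads_per_proc          # current limit
sysctl kern.threads.max_threads_per_proc=4096     # raise it at runtime
# make it persistent across reboots in /etc/sysctl.conf:
kern.threads.max_threads_per_proc=4096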

I'm in kind of a bind here, and any and all suggestions are welcome as to what data to collect, which OS or mysqld settings to consider, and so on.
 
I'd have a close look at MySQL's processlist to see what it's doing. It's possible you have some really slow queries taking up too much time. If the client hits 'refresh', the current query will keep running and the 'refresh' will actually start a new one, which will also take a long time to finish; the client hits 'refresh' again, another new query, and so on. If this keeps up long enough, and the client hits 'refresh' often enough, it will eventually bring the server down.
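Roughly like this, for example (the threshold and log path are only illustrations):

Code:
mysql> SHOW FULL PROCESSLIST;
mysql> SET GLOBAL slow_query_log = 'ON';
mysql> SET GLOBAL long_query_time = 1;            -- seconds
mysql> SET GLOBAL slow_query_log_file = '/var/db/mysql/slow.log';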
 
Yeah, that may be one reason. I'm convinced it's a resource starvation issue somewhere. The problem is that this happens on multiple servers that previously worked fine with MySQL v5.5. I will attempt an upgrade to v5.7 as well, but I really can't shake the feeling that something is up with resource management, either in FreeBSD v10.1/10.2 or in the MySQL v5.6 line (as both icecoke and I saw similar issues).
 
One thing you could easily verify is whether there have been any changes in the defaults, especially regarding time-outs and the like. It's not unthinkable that some of these settings have changed without you being aware of it.
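Something along these lines should be enough to compare (just the generic ways to list the values, nothing version-specific):

Code:
mysql> SHOW GLOBAL VARIABLES LIKE '%timeout%';
# or dump the compiled-in defaults of the installed server binary:
mysqld --no-defaults --verbose --help | grep -i timeout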
 