Apache crashing - lockf status

Hello all!

I'm facing a critical and strange situation. I have a production box running Apache 2.2.17 + PHP 5.2.17 + MySQL 5.5.13 + memcached 1.4.5. Traffic has increased this week from 60 to 100~120 concurrent users. Apache stops responding several times a day, with its processes sitting in the lockf state, as you can see:

Code:
last pid: 38618;  load averages: 143.17, 40.30, 15.75     up 0+22:36:01  10:30:30
2261 processes:335 running, 1926 sleeping
CPU:  0.1% user,  0.0% nice, 99.9% system,  0.0% interrupt,  0.0% idle
Mem: 1622M Active, 4311M Inact, 1721M Wired, 827M Buf, 244M Free
Swap: 512M Total, 512M Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
38507 daemon           1  47    0 86836K  3480K ls_loc 11   0:05 30.18% httpd
38322 daemon           1  47    0 86836K  3368K lockf   1   0:07 19.09% httpd
38151 daemon           1  46    0 86836K  3364K lockf  10   0:07 18.90% httpd
38300 daemon           1  50    0 86836K  3364K RUN     9   0:05 18.36% httpd
38338 daemon           1  51    0 86836K  3368K lockf  13   0:04 16.55% httpd
38319 daemon           1  51    0 86836K  3368K lockf  15   0:05 16.36% httpd
38310 daemon           1  49    0 86836K  3368K RUN    10   0:05 15.38% httpd
38225 daemon           1  51    0 86836K  3368K lockf   6   0:05 15.28% httpd
38361 daemon           1  50    0 86836K  3368K CPU12  11   0:03 14.99% httpd
38308 daemon           1  49    0 86836K  3372K ls_loc  2   0:04 14.79% httpd
38399 daemon           1  51    0 86836K  3368K lockf   9   0:03 14.70% httpd
38415 daemon           1  55    0 86836K  3368K RUN    12   0:03 14.26% httpd

I read the manual but could not find any clue about what it could be.
http://www.freebsd.org/cgi/man.cgi?query=lockf&sektion=1

I also looked in /var/log/messages and the Apache error log: no errors. The box is an HP ProLiant with 2x Xeon 2.17GHz quad core (2 package(s) x 4 core(s) x 2 SMT threads) + 8GB of RAM.

Does anyone have an idea about what it could be?

Thanks in advance.
 
Hi,

lockf seems to mean that the process is waiting for an exclusive lock on a file, so something is hanging in that process. It could be a bug in Apache, an Apache module, PHP, etc. It could also be a file system error; can you reboot to make sure all file systems are clean? It could also be a hardware problem.
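
To illustrate what waiting on such a lock looks like, here is a minimal Python sketch. It is not taken from Apache, and the helper names `try_lock_in_child` and `demo` are made up for illustration. The parent takes an exclusive advisory lock on a temporary file, and a forked child then fails to get the same lock until the parent releases it. With a blocking request instead of `LOCK_NB`, the child would simply sleep until the lock is free, which is roughly what a process shown in the lockf state is doing.

```python
import fcntl
import os
import tempfile


def try_lock_in_child(path):
    """Fork a child that attempts a non-blocking exclusive lock on path.

    Returns True if the child obtained the lock, False if it was refused.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: advisory locks are per process, so this attempt
        # conflicts with any lock the parent holds on the same file.
        code = 0
        try:
            with open(path, "r+") as f:
                fcntl.lockf(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            code = 1
        os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    """Return (refused_while_parent_holds_lock, granted_after_release)."""
    with tempfile.NamedTemporaryFile() as tmp:
        fcntl.lockf(tmp.fileno(), fcntl.LOCK_EX)   # parent holds the lock
        refused = not try_lock_in_child(tmp.name)
        fcntl.lockf(tmp.fileno(), fcntl.LOCK_UN)   # parent releases it
        granted = try_lock_in_child(tmp.name)
    return refused, granted


if __name__ == "__main__":
    print(demo())
```

So a process waiting on a lock is not necessarily broken itself; the interesting question is which process holds the lock and why it is not letting go.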

But first of all you can run these two commands against one of the hung PIDs to get a bit more info:

procstat -kk <PID>
lsof -p <PID>

cheers Andy.
 
Hello AndyUKG!

Thanks for the reply.

Apache just crashed right now. I was able to run procstat on some of the processes, but the information is not clear to me!

Code:
[root@www ~]# procstat -kk 32246
  PID    TID COMM             TDNAME           KSTACK                       
32246 101658 httpd            -                mi_switch+0x16f sleepq_catch_signals+0x31f
 sleepq_wait_sig+0xc _sleep+0x26b lf_advlockasync+0xf2e lf_advlock+0x47 vop_stdadvlock+0xb3
 kern_fcntl+0xd47 fcntl+0x3b syscall+0x246 Xfast_syscall+0xe1

Could you please advise?

I'm also trying to figure out which file could be causing this issue.

Thanks!
 
Hi,

To me that doesn't mean too much. Can you also try lsof? If it's not already installed, install it from sysutils/lsof. Hopefully lsof will tell you what file it's trying to access/lock.
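
For what it's worth, the function names in the KSTACK column can at least be read mechanically. The frames are listed innermost first, so reading them backwards gives the call path from the system call down to the point where the thread sleeps. A small Python sketch of that, using the stack posted above (the function names `frames`, `call_path` and `blocked_on_advisory_lock` are my own, and the check is only a rough heuristic):

```python
import re

# KSTACK column from the procstat -kk output posted above, joined into one line.
KSTACK = (
    "mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_wait_sig+0xc "
    "_sleep+0x26b lf_advlockasync+0xf2e lf_advlock+0x47 vop_stdadvlock+0xb3 "
    "kern_fcntl+0xd47 fcntl+0x3b syscall+0x246 Xfast_syscall+0xe1"
)


def frames(kstack):
    """Split a kstack string into function names, dropping the +0x offsets."""
    return [re.sub(r"\+0x[0-9a-fA-F]+$", "", tok) for tok in kstack.split()]


def call_path(kstack):
    """Return the frames outermost first, joined as a readable call path."""
    return " -> ".join(reversed(frames(kstack)))


def blocked_on_advisory_lock(kstack):
    """Heuristic: the thread sleeps inside an fcntl() advisory lock request."""
    names = frames(kstack)
    return all(name in names for name in ("_sleep", "lf_advlock", "fcntl"))


if __name__ == "__main__":
    print(call_path(KSTACK))
    print(blocked_on_advisory_lock(KSTACK))
```

Read that way, the stack says the process called fcntl(), went into the advisory locking code and is sleeping there. It does not say which file the lock is on, which is why the lsof output would help.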

Andy.
 
Hi AndyUKG,

I installed lsof.

Apache crashed minutes ago, but I was not able to get any information with lsof; it seems it was not able to access the locked PID :(

Code:
[root@www ~]# ps -aux | grep lsof
root    41237  0.2  0.3 33512 28480   4  R+   12:07PM   0:00.95 lsof -o 40854
root    40520  0.0  0.6 52932 48436   0  R+   12:06PM   0:01.54 lsof -o 39890
root    41789  0.0  0.2 23272 18520   5  R+   12:08PM   0:00.56 lsof -o 40964

After around two minutes, I killed lsof and restarted Apache to bring the services back.

Code:
[root@www ~]# lsof -o 40964
Killed: 9
[root@www ~]#

Any more tips would be appreciated!

Thanks for the help!
 
Ok thanks AndyUKG.

I'm waiting for Apache to crash again so I can try it with lsof -p! It could be in the next minute or the next day, hehe. Let's see!

If anyone has any other comments, they are fully welcome!

Thanks
 
Hi AndyUKG,

The server crashed again, and this time I was able to capture some processes. I used pastebin so as not to flood the post here. I got a few of them.

Here is the link: http://pastebin.com/XJhZiAqB

I could not find the issue myself by looking at the lsof -p output.

Thanks.
 
Hi,

Can't say much from that info either, but others may spot something or have other suggestions for debugging info.

Did you manage to confirm all the file systems are clean? And they have free space, right? Have you made any configuration changes since this started? Upgraded anything, enabled anything (Apache modules, memcache, etc.)?

Andy.
 
Yes, there are free resources as well. I also set the Apache log level to info to check, but it didn't give much information. The system has been the same for a year. I think the application is causing this issue; I'm trying to analyze it with truss to figure it out.

Suggestions are welcome, team =)
 
Well, the last thing in the truss output is this, including an error:

Code:
81477: open("/usr/local/apache/htdocs/images/3motivos/1.png",O_RDONLY,00) = 22 (0x16)
81477: fcntl(22,F_GETFD,)			 = 0 (0x0)
81477: fcntl(22,F_SETFD,FD_CLOEXEC)		 = 0 (0x0)
81477: mmap(0x0,153,PROT_READ,MAP_SHARED,22,0x0) = 34365829120 (0x8005cf000)
81477: read(20,0x803922048,8000)		 ERR#35 'Resource temporarily unavailable'
81477: writev(0x14,0x7fffffffe6c0,0x2,0x1,0x7fffffffe7e8,0x1) = 453 (0x1c5)
81477: munmap(0x8005cf000,153)			 = 0 (0x0)
81477: write(11,"187.17.143.93 - - [06/Jul/2011:16:39:51 -0300] "GET /images/3motivos/1.png HTTP/1.0" 200 153\n",93) = 93 (0x5d)
81477: close(22)				 = 0 (0x0)
81487: SIGNAL 9 (SIGKILL)

Is there any issue with the above file? Or the file system in which it sits?

Andy.
 
I checked the file, and it seems to be OK!

Code:
[rafael@www /usr/local/apache/htdocs/images/3motivos]$ ll 1*
-rw-r--r--  1 webmaster  webmaster  153 Sep  1  2010 1.png

The file system is the same one that holds other files that are working right now. How can I know if there is an issue with the file system or even with the file itself?
 
OK, firstly I should say I don't understand the truss output exactly, so I'm guessing a bit, as I would when debugging one of my own systems. I'm not sure that the "Resource temporarily unavailable" error is related to the file system. But you can test the file system by, for example, doing a tar dump of the entire file system to /dev/null and seeing if it completes OK. You already mentioned that there are no errors in the logs, so there is nothing to do there...
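
The same kind of read test can be sketched in a few lines of Python if you want a list of the files that fail rather than a single pass/fail from tar. This is only a rough equivalent, not a replacement for proper file system checks: the function name `read_all_files` is made up, and unlike a per-file-system dump it will happily cross mount points below the starting directory.

```python
import os


def read_all_files(root, chunk_size=1 << 20):
    """Read every regular file under root and discard the data.

    Returns (number_of_files_read, list_of_(path, error_message)).
    """
    files_read = 0
    errors = []

    def on_walk_error(exc):
        # Directories that cannot be listed are recorded, not fatal.
        errors.append((exc.filename, str(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, devices and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    while f.read(chunk_size):
                        pass
                files_read += 1
            except OSError as exc:
                errors.append((path, str(exc)))
    return files_read, errors


if __name__ == "__main__":
    # Path taken from the truss output earlier in the thread.
    count, problems = read_all_files("/usr/local/apache/htdocs")
    print(count, "files read,", len(problems), "errors")
    for path, message in problems:
        print(path, message)
```

If every file reads back without an I/O error, the disk and file system are at least not failing on reads.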

You might want to ask the Apache experts here:
http://httpd.apache.org/userslist.html

Andy.
 
The file is not the problem:

Code:
"GET /images/3motivos/1.png HTTP/1.0" 200

The 200 status means it was served OK.

So the resource error relates to something else. With limited information, this could be anything, from a missing /tmp or /var/tmp or /dev/null, to lack of swap, a missing filesystem, or even a hardware malfunction. So you have to look further back in the trace for the context.
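
On the ERR#35 line itself: on FreeBSD errno 35 is EAGAIN, which is what a read on a non-blocking socket returns when no data has arrived yet. In the truss output the failing read is on descriptor 20 and the writev right after it is on 0x14, which is also 20, so this looks like the client connection rather than the PNG file. A small Python sketch of that behaviour (the function name is just for illustration, and the numeric errno value differs between operating systems):

```python
import errno
import socket


def read_from_idle_nonblocking_socket():
    """Try to read from a non-blocking socket that has no data waiting.

    Returns the errno of the resulting error, or None if the read succeeded.
    """
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        try:
            a.recv(8000)          # nothing has been sent by the peer yet
        except BlockingIOError as exc:
            return exc.errno      # EAGAIN / EWOULDBLOCK
        return None
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    err = read_from_idle_nonblocking_socket()
    print(err, errno.errorcode.get(err))
```

If that reading is right, the error only means there was nothing more to read from the client at that moment, and the process carries on to write the response and the access log line, as the trace shows.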
 