Nginx + FUSE stalls with 'grbmaw'

ScopeDog

Member


Messages: 23

Hi.
I'm developing a distributed file system with FUSE (+ fusefs-libs) on FreeBSD 12-RELEASE and 12-STABLE. However, while nginx is accessing the FUSE-mounted file system, it stalls in the 'grbmaw' state after working correctly for a while.
The grbmaw state seems to be related to virtual memory and is used by vm_page_busy_sleep(). Meanwhile, my FUSE program is just waiting for the next command in fuse_session_loop_mt().
When this happens I need to reboot the server, but it doesn't always shut down, and then I have to turn it off manually. (This is a real pain now that I'm working from home.)

Does anybody know what's going on with it?
 

ScopeDog

Member


Messages: 23

I guess it's better to send-pr this. It seems to be related to fuse.ko, and a userland program should never be able to cause a VM stall.
 

bigbrother

New Member


Messages: 10

I can report that this happens not only with FUSE but also with NFS file systems.

Short summary: nginx gets stuck while sending a file from an NFS-mounted directory.

nginx stuck:
# ps axuwww | grep nginx
www 12593 0.0 0.1 19552 5980 - D 03:04 0:00.30 nginx: worker process (nginx)
www 12594 0.0 0.1 19356 5752 - D 03:04 0:01.35 nginx: worker process (nginx)


# procstat -kka | grep nginx
12593 100132 nginx - mi_switch+0xe2 sleepq_wait+0x2c _sleep+0x247 vm_page_busy_sleep+0x8f vm_page_grab_pages+0x417 allocbuf+0x34a getblkx+0x5c4 breadn_flags+0x3d vfs_bio_getpages+0x323 ncl_getpages+0x2be VOP_GETPAGES_APV+0x7c vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7c vnode_pager_getpages_async+0x7e vn_sendfile+0xd9c sendfile+0x12b amd64_syscall+0x364 fast_syscall_common+0x101
12594 101754 nginx - mi_switch+0xe2 sleepq_wait+0x2c _sleep+0x247 vm_page_busy_sleep+0x8f vm_page_grab_pages+0x417 allocbuf+0x34a getblkx+0x5c4 breadn_flags+0x3d vfs_bio_getpages+0x323 ncl_getpages+0x2be VOP_GETPAGES_APV+0x7c vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7c vnode_pager_getpages_async+0x7e vn_sendfile+0xd9c sendfile+0x12b amd64_syscall+0x364 fast_syscall_common+0x101



# procstat -t 12593
PID TID COMM TDNAME CPU PRI STATE WCHAN
12593 100132 nginx - -1 120 sleep grbmaw
root@bigb5:/ # procstat -t 12594
PID TID COMM TDNAME CPU PRI STATE WCHAN
12594 101754 nginx - -1 120 sleep grbmaw





This happened while I was downloading a 500 MB file from my nginx server; the file is located in an NFS-mounted directory. After some minutes, all NFS accesses to this directory stalled, with processes sleeping on vfs_busy:

# ps axuww | grep df
root 86591 0.0 0.0 11292 2216 1 DN 21:26 0:00.00 df -h
root 86599 0.0 0.0 11432 2368 1 SN+ 21:27 0:00.00 grep df
root 85320 0.0 0.0 11292 2220 6 DN 21:23 0:00.00 df -h
# procstat -t 86591
PID TID COMM TDNAME CPU PRI STATE WCHAN
86591 100591 df - -1 255 sleep vfs_busy

The NFS server was operating successfully for the other machines on the LAN. The problem existed only on this machine and only with this mounted directory. I could access other NFS-mounted directories on this machine without any problem.


FreeBSD XXXX 12.1-RELEASE-p8 FreeBSD 12.1-RELEASE-p8 GENERIC amd64

# kldstat
Id Refs Address Size Name
1 48 0xffffffff80200000 2448f20 kernel
2 1 0xffffffff82649000 2ca0 coretemp.ko
3 3 0xffffffff8264c000 49ba8 ipfw.ko
4 1 0xffffffff82a21000 cb50 geom_eli.ko
5 1 0xffffffff82a2e000 88d8 tmpfs.ko
6 1 0xffffffff82a37000 18a0 uhid.ko
7 1 0xffffffff82a39000 1aa0 wmt.ko
8 1 0xffffffff82a3b000 19c8 ipdivert.ko
9 1 0xffffffff82a3d000 2450 ipfw_nat.ko
10 1 0xffffffff82a40000 ac32 libalias.ko
11 1 0xffffffff82a4b000 1010 cpuctl.ko
12 3 0xffffffff82a4d000 529c8 vboxdrv.ko
13 2 0xffffffff82aa0000 2ce0 vboxnetflt.ko
14 2 0xffffffff82aa3000 9e30 netgraph.ko
15 1 0xffffffff82aad000 1710 ng_ether.ko
16 1 0xffffffff82aaf000 3f30 vboxnetadp.ko
17 1 0xffffffff82ab3000 2472e0 zfs.ko
18 1 0xffffffff82cfb000 7628 opensolaris.ko
19 1 0xffffffff82d03000 2940 nullfs.ko
20 1 0xffffffff82d06000 30c1 if_tap.ko




Unable to do anything else, I resorted to the unclean_reboot.c program that I wrote. Its main() contains only one function call:
return reboot(RB_NOSYNC);

I used it because this is a remote server and there was a great possibility of it hanging during a normal shutdown.
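For anyone who wants to try the same thing, a program of that kind could look roughly like the sketch below. It is a guess at the shape, not the actual unclean_reboot.c: the unclean_reboot() helper, its dry_run parameter and the STANDALONE macro are assumptions added so the binary can be checked without actually rebooting. On FreeBSD, reboot(2) is declared in <unistd.h> and RB_NOSYNC in <sys/reboot.h>.

```c
#include <stdio.h>
#include <unistd.h>      /* FreeBSD declares reboot(2) here */
#include <sys/reboot.h>  /* RB_NOSYNC on FreeBSD */

/*
 * Hypothetical helper: reboot immediately without syncing the disks.
 * With dry_run != 0 it only reports what it would do and returns 0.
 * Otherwise it returns whatever reboot(2) returns (-1 on failure,
 * for example when not run as root); on success it does not return.
 */
int unclean_reboot(int dry_run)
{
    if (dry_run) {
        puts("dry run: would call reboot(RB_NOSYNC)");
        return 0;
    }
#ifdef RB_NOSYNC
    return reboot(RB_NOSYNC);
#else
    /* Assumption: platforms without RB_NOSYNC are simply not supported. */
    fputs("RB_NOSYNC is not defined on this platform\n", stderr);
    return 1;
#endif
}

#ifdef STANDALONE
/* Build with: cc -DSTANDALONE -o unclean_reboot unclean_reboot.c */
int main(int argc, char **argv)
{
    (void)argv;
    /* Any extra command-line argument selects the dry run. */
    return unclean_reboot(argc > 1) == 0 ? 0 : 1;
}
#endif
```

Because RB_NOSYNC skips the sync, data that has not been written out yet can be lost, so this is only a last resort for a machine that will not shut down normally.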

After the reboot, I set sendfile to off in the nginx configuration. If the problem reappears, I will post a follow-up; otherwise, if you have any suggestions, let me know.
 

glebius@

New Member
Developer

Reaction score: 5
Messages: 19

> 13-current seems to have totally new sendfile implemented by Netflix and I also recommend to try with 13-current.

The new sendfile appeared in 12.0-RELEASE. There are some differences between 12 and 13, of course, but nothing substantial.
 