I've just deployed two FreeBSD 10.1-R (fully updated to -p5 as of this post) servers which are using two Crucial 960M500 SSDs (running the latest firmware, MU05) in a GMIRROR. Partitions are aligned to 1 MiB boundaries. I'm using UFS with TRIM enabled.
When I run bonnie++ on the volume, the write/read tests start out fast but degrade. By the time it hits the file-creation phase I can see under gstat that I/O "busy" is maxed out with very little throughput. I would expect this to some degree, since the file-creation phase is very random reads/writes, but this busyness can stretch a test that should take 5-10 minutes into hours. What's even more interesting is that the "busy" state lasts past the bonnie++ run, sometimes for hours.
What's even odder: originally, before GMIRRORing these disks, there was an Adaptec 6504 RAID card and I had the SSDs in a hardware RAID 1 with it, and I saw the same issue. There are two servers configured identically and both show the problem.
Here is a typical example of gstat during this period:
Code:
dT: 1.020s w: 1.000s filter: ada[0-9]$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
29 56 0 0 0.0 0 0 0.0 101.0| ada0
25 53 0 0 0.0 0 0 0.0 99.2| ada1
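For what it's worth, a back-of-envelope check of that gstat sample using Little's law (W = L / λ; my own estimate, not something gstat reports): with roughly 29 requests queued (the L(q) column) and about 56 ops/s completing on ada0, each request is sitting in the queue for about half a second.

```shell
# Little's law: average wait W = queue length L / completion rate lambda
# (numbers taken from the ada0 row of the gstat sample above)
awk 'BEGIN { printf "~%.2f s average queue wait per request\n", 29 / 56 }'
```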
Additionally while this is happening the system can freeze when any other I/O is required, even causing the occasional console message like this:
Code:
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 120335, size 28672
Load also climbs to 4-5.0 during this time (as I'd expect if I/O was backed up).
We've used bonnie++ for years and it's standard for us on new deployments to test the hardware before going into production; I've never seen something like this.
Here is the bonnie++ output. For an SSD, write performance is very low (meaning this issue is very write-specific, in my view) and read performance is on par with what I'd expect (roughly 2x a single SSD -- yay GMIRROR!).
Read: 960 MiB/sec
Write: 172 MiB/sec
Code:
Version 1.97 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
db1.michigan.ls 64G 796 99 172216 10 99444 7 1313 99 960844 39 3428 27
Latency 31966us 1347ms 15310ms 18280us 9752us 1991ms
Version 1.97 ------Sequential Create------ --------Random Create--------
db1.michigan.ls.pri -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1 0 115 0 +++++ +++ 2 0 +++++ +++ +++++ +++
Latency 9118s 141s 36us 5462s 9us 19us
1.97,1.97,db1.michigan.ls.privsub.net,1,1423046069,64G,,796,99,172216,10,99444,7,1313,99,960844,39,3428,27,16,,,,,1,0,115,0,+++++,+++,2,0,+++++,+++,+++++,+++,31966us,1347ms,15310ms,18280us,9752us,1991ms,9118s,141s,36us,5462s,9us,19us
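The headline numbers come straight from the block-transfer columns of that output: "Sequential Output, Block" is the write path and "Sequential Input, Block" is the read path (here I'm dividing the K/sec figures by 1000 to match my rounding above).

```shell
# bonnie++ "Sequential Output, Block" = writes; "Sequential Input, Block" = reads
echo "write: $((172216 / 1000)) MB/s"   # every write goes to both mirror halves
echo "read: $((960844 / 1000)) MB/s"    # reads are balanced across both disks
```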
System specs:
FreeBSD 10.1-RELEASE-p5 64-bit; Kernel GENERIC
Intel® Xeon® CPU E5-1650 v2 @ 3.50GHz; Hyper-Threaded; 64-bit; 6x Physical Cores
32.0 GiB RAM: 2.7 GiB used / 327.0 MiB cache / 29.0 GiB free
Intel Patsburg AHCI SATA controller: Channel 0 (ahci0:ahcich0)
Things I have tried to resolve the issue:
- Disabling power management entirely in BIOS
- Disabling powerd
- Disabling SATA aggressive link power management
- Switching SATA from AHCI to IDE mode
- Disabling VT-d and other virtualization options
If I do a dd to test throughput I can get 400 MiB/sec writes, and under "normal" use the system seems fine -- however, it's not in production yet, so it's not seeing real load. I worry that something is broken and that when it does start doing its job it's going to start freezing (these will become DB servers).
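For anyone who wants to reproduce the sequential-write check, this is the general shape of the dd run, not my exact command; the path, block size, and count here are placeholders (note FreeBSD's dd takes lowercase size suffixes, e.g. bs=1m).

```shell
# Sketch of a simple sequential-write throughput test; the target path,
# block size, and count are illustrative, not the exact command used
TESTFILE=/tmp/dd-write-test
dd if=/dev/zero of="$TESTFILE" bs=1048576 count=16
ls -l "$TESTFILE"
rm -f "$TESTFILE"
```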
This may very well be some kind of artifact of bonnie++, but the state it puts the system in (sometimes for HOURS) is very worrying, and I feel like it should not be happening.
Any suggestions/questions/etc. welcome. I've got time to continue testing things easily for now.