SSD hangs on heavy writing after upgrading to 11.0

neptunium · Oct 26, 2016

Hi,

I have been running FreeBSD-10.X on an SSD (Lenovo IdeaPad S400) for more than two years without an issue.

After upgrading to 11.0, my SSD hangs on heavy writing, for example when I try to copy a large file (>10G) from one partition to another, or when doing a simple cp -Rp on a large directory (~1G). "Hanging" means that the disk is apparently being detached or something similar; I can still switch between ttys, but I cannot log in, enter any command, or cancel anything. All I can do is press the power button and turn off the laptop.

Of course, my first thought was that the SSD is dying; so I reverted my laptop to 10.X and repeated all problematic actions; I experienced no problems. So it is something with 11.0.

This is so strange that I do not even know how to debug it or what to try. So I hope that someone can help with an idea or with information about changes in 11.0 that could cause such strange behavior. For example, are there any new sysctl switches that I should try to disable?

Of course, I am ready to post any relevant information, just say what.

Thanks.

wblock@ · Oct 27, 2016

There is an NCQ trim feature that was added recently, which some SSDs do not like at all. The Samsung ones I tried, for instance, would have difficulty booting. You did not say which SSD is present.

neptunium · Oct 27, 2016

This is my SSD as seen on 11.0:

Code:

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Crucial CT240M500SSD3 MU03> ACS-2 ATA SATA 3.x device
ada0: Serial Number 1351095F93D6
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 228936MB (468862128 512 byte sectors)
ada0: quirks=0x2<NCQ_TRIM_BROKEN>

As far as I understand, this means that NCQ trimming is not activated:

Code:

# sysctl kern.cam.ada.0
kern.cam.ada.0.sort_io_queue: 0
kern.cam.ada.0.max_seq_zones: 0
kern.cam.ada.0.optimal_nonseq_zones: 0
kern.cam.ada.0.optimal_seq_zones: 0
kern.cam.ada.0.zone_support: None
kern.cam.ada.0.zone_mode: Not Zoned
kern.cam.ada.0.rotating: 0
kern.cam.ada.0.unmapped_io: 1
kern.cam.ada.0.write_cache: -1
kern.cam.ada.0.read_ahead: -1
kern.cam.ada.0.delete_method: DSM_TRIM

This is the same SSD on 10.0:

Code:

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Crucial CT240M500SSD3 MU03> ATA-9 SATA 3.x device
ada0: Serial Number 1351095F93D6
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 228936MB (468862128 512 byte sectors: 16H 63S/T 16383C)

Anything suspicious?

wblock@ · Oct 28, 2016

That one has a quirk set to prevent NCQ trim, so it's not that problem. I don't have any Crucial SSDs, so have no experience with them.
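
As an illustrative sketch (not from the original posts), the quirk check described above can be automated by scanning the boot messages for the `NCQ_TRIM_BROKEN` flag. The helper name `has_ncq_trim_quirk` is hypothetical; the line format it matches is the one shown in the dmesg excerpt earlier in this thread.

```shell
# has_ncq_trim_quirk DISK: read a dmesg excerpt on stdin and succeed if
# DISK (e.g. ada0) is listed with the NCQ_TRIM_BROKEN quirk.
# The function name is hypothetical, chosen for illustration only.
has_ncq_trim_quirk() {
    disk="$1"
    grep -q "^${disk}: quirks=.*NCQ_TRIM_BROKEN"
}

# Usage on a live system (not executed here):
#   dmesg | has_ncq_trim_quirk ada0 && echo "NCQ TRIM is disabled by a quirk"
```

If the function succeeds, NCQ TRIM is already disabled for that drive, which is the reasoning used above to rule it out as the cause.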

neptunium · Oct 28, 2016

Is there anything strange in the fact that the disk was identified on 10.0 as ATA-9, and on 11.0 as ACS-2 ATA?

neptunium · Nov 1, 2016

I found the cause of this strange behavior. It is -- tmpfs! With tmp filesystem like this

Code:

tmpfs /tmp tmpfs rw,mode=777,size=1G 0 0

(which I have used since 9.0 with no problems), 11.0 misbehaves as described above. I will start a separate thread about this issue.

kpa · Nov 1, 2016

That's very odd to say the least, the tmpfs(5) does use the swap as backing store but it shouldn't cause such hangups on an SSD disk.

One note though, you're using a wrong mode for /tmp. It should be 1777, not 777. It may not be significant on a single user system but on a system with many users that's a security disaster.
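
As an illustrative sketch (not from the original post), the difference between the two modes is the sticky bit (octal 1000), which stops users from deleting each other's files in a world-writable directory. The helper name `has_sticky_bit` is hypothetical.

```shell
# has_sticky_bit MODE: succeed if the octal permission string MODE
# (for example 1777) includes the sticky bit (octal 1000).
# The function name is hypothetical, chosen for illustration only.
has_sticky_bit() {
    # Prefix a 0 so that shell arithmetic parses the value as octal.
    [ $(( 0$1 & 01000 )) -ne 0 ]
}

# The fstab entry from this thread with the suggested mode would read:
#   tmpfs /tmp tmpfs rw,mode=1777,size=1G 0 0
# After mounting, `ls -ld /tmp` should show drwxrwxrwt (note the final "t").
```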

neptunium · Nov 1, 2016

kpa said:
That's very odd to say the least, the tmpfs(5) does use the swap as backing store but it shouldn't cause such hangups on an SSD disk.

Yes, I agree, it is strange. I don't know if it's the cause of my problems, but after removing tmpfs line from fstab, my disk behaves normally. I'll investigate this further.

kpa said:
One note though, you're using a wrong mode for /tmp. It should be 1777, not 777. It may not be significant on a single user system but on a system with many users that's a security disaster.

Sure. This is a single user system.

neptunium · Nov 2, 2016

kpa said:
One note though, you're using a wrong mode for /tmp. It should be 1777, not 777. It may not be significant on a single user system but on a system with many users that's a security disaster.

BTW, mode=777 is taken from here: https://wiki.freebsd.org/TMPFS.

topcat · Feb 17, 2017

Hope it's okay to reply to this thread, because I have a very similar issue with a new install of FreeBSD 11 on SSD. My initial setup was based on this guide. For me the problem showed up during heavy disk i/o as well: complete freeze with no logs.

Excerpt from my dmesg:

Code:

ada0 at ahcich0 bus 0 scbus2 target 0 lun 0
ada0: <LITEONIT LCS-128M6S 2.5 7mm 128GB DC7110D> ATA8-ACS SATA 3.x device
ada0: Serial Number TW032GYJ5508536I4326
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors)

After I saw the OP's remarks, I tried the following:
1. Disable tmpfs based /tmp and just use a directory on the SSD.
2. Move the swap file off the SSD to another drive.

After these two changes I tortured the system with a big ports-mgmt/synth build, launching big programs like www/firefox, copying large files, etc., but was unable to hang the system even with 50+% swap usage. In the previous setup the system would easily freeze when swap usage approached 20%.

Please let me know if I can provide any other details which might help.

Code:

% sysctl kern.cam.ada.0
kern.cam.ada.0.sort_io_queue: 0
kern.cam.ada.0.max_seq_zones: 0
kern.cam.ada.0.optimal_nonseq_zones: 0
kern.cam.ada.0.optimal_seq_zones: 0
kern.cam.ada.0.zone_support: None
kern.cam.ada.0.zone_mode: Not Zoned
kern.cam.ada.0.rotating: 0
kern.cam.ada.0.unmapped_io: 1
kern.cam.ada.0.write_cache: -1
kern.cam.ada.0.read_ahead: -1
kern.cam.ada.0.delete_method: DSM_TRIM

topcat · Feb 17, 2017

neptunium said:
BTW, mode=777 is taken from here: https://wiki.freebsd.org/TMPFS.

Hi, did you learn anything further during your investigations?

neptunium · Feb 18, 2017

topcat said:
Hi, did you learn anything further during your investigations?

Well, it is great that somebody else with the same problem has appeared. But unfortunately no, disabling tmpfs didn't solve the problem (I still have swap as a file on the SSD). And I am no longer sure that the problem has any connection to the SSD as such, because I noticed some correlation (albeit not 100%) between heavy RAM usage and the freezes.

My situation after 3 months:

1. I can freeze the system every time, with no logs, through heavy use of Chromium (for example by opening several instances of Google Drive, or several sites containing videos and/or heavy scripting). The only thing I can see before the freeze is that RAM usage goes beyond 75-80% and that, sometimes, chrome (or several instances of it) starts to eat >>100% of CPU (according to conky's interpretation).
2. I can almost surely freeze the system if I recursively grep a huge directory.

For three months I wasn't able to figure out anything else that could be the source of this annoying problem.

I have gotten used to this behavior; I simply use chrome less (which is actually a very good thing...) and 99% of the time it's OK. But before switching back to 10.3, I'd be glad to try something else. So, please share any new ideas or observations here. Maybe we will come up with a real solution.

wblock@ · Feb 18, 2017

Have you run memtest on that system?

neptunium · Feb 18, 2017

wblock@ said:
Have you run memtest on that system?

No, because I have no problems with this machine on FreeBSD 10.*. Should I? If yes, what type of test do you recommend?

topcat · Feb 19, 2017

I'll run the memtest86 program and report back. For me, it's the swap. Moving the swap file to a spinning drive stopped the freezes. I even pushed the system to 100% swap usage (wasn't easy) but it still recovered in a few seconds. I'd expect this because FreeBSD is very stable when pushed to the limit.

I'm not sure exactly what's wrong. Maybe RAM (I'll know after memtest86), some issue with the SSD? I have only run 11 on this machine so can't comment on 10.3.

topcat · Feb 20, 2017

Update: memtest86 reported no errors after 2 passes.

I'm keen to get to the bottom of this. After all, a lot of people must be using 11.0 on SSD-based machines.

TheDreamer · Feb 20, 2017

I think swap on SSD is a strong candidate for being the issue. I don't have any Crucial SSDs, but I know that my system would often hang when I was using a Corsair SSD for swap. However, I haven't had issues with swap on SanDisk SSDs, though those SSDs are only running at SATA-II, while I have the Corsair on a PCIe card running at SATA-III.

I now mainly use it for L2ARC, VirtualBox VDI's, and scratch (wrkdir) for ports-mgmt/poudriere.

The Dreamer

FWIW, I had previously run memtest at various times, and even did a full replacement of all my memory once or twice (I'd have to check my notes). Before, it was original + added; I went to a full set from a single vendor, since there was a very minor timing difference between the original memory and what I had added. I know of at least one time when I upgraded another machine using the old memory....

The 'new' motherboard I plan to swap in is fully populated with ECC RAM. I'm thinking I need to set up a second backup server before I make the switch, though.

neptunium · Mar 1, 2017

Update. Memtest86 passed. Topcat and TheDreamer, I tried your suggestions, and my conclusion is: the problem is swap as a file:

Code:

md99        none                swap        sw,file=/usr/swap,late    0    0

In other words, I'm not actually sure if it's SSD.

Namely, when I moved my swap file to a spinning drive, freezes with no logs continued (even with very low swap/ram usage). When I moved to a standard swap partition on a spinning drive, freezes stopped. I even reenabled tmpfs and it works (see message #6 of this thread). Several questions:

1. At the moment, I am not able to test a swap partition on an SSD drive. Is it desirable at all? Is anyone able to try it?
2. Since this is an 11.0 problem (I had no such problem on 10.*), does anybody know what changes to the swap code could cause this annoying problem? Maybe we should come up with something and file a PR.
3. I don't want swap on a spinning drive. It is my backup drive, it eats a lot of battery when it's constantly mounted, and it's slow. Can anyone suggest an alternative approach for solving this 11.0-related swap-as-a-file problem?
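
As an illustrative sketch (not from the original post), the distinction drawn above between a file-backed swap entry and a plain swap device can be read straight from fstab. The helper name `swap_kind` is hypothetical; it treats any swap entry whose options contain `file=` as file-backed, as in the md99 line above, and everything else as a device.

```shell
# swap_kind: read fstab(5)-style lines on stdin and, for every swap entry,
# print the device name followed by "file" (md-backed swap file) or "device".
# The function name is hypothetical, chosen for illustration only.
swap_kind() {
    awk '
        /^[[:space:]]*#/ { next }    # skip comment lines
        $3 == "swap" {
            # Prepend a comma so "file=" only matches at the start of an option.
            kind = (index("," $4, ",file=") > 0) ? "file" : "device"
            print $1, kind
        }
    '
}

# Usage on a live system (not executed here):
#   swap_kind < /etc/fstab
```

For the entry quoted above this prints `md99 file`, while a conventional entry such as `/dev/gpt/swap0 none swap sw 0 0` is reported as `device`.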

chrbr · Mar 2, 2017

neptunium said:
At the moment, I am not able to test a swap partition on an SSD drive. Is it desirable at all? Does someone have a possibility to try it?

Since FreeBSD 11.0 I have been trying ZFS, and I guess I will stick with it. I can also confirm the issues with a swap file on ZFS. On the other hand, it is no surprise if swap on ZFS has issues when ZFS is in trouble because the system is running out of resources. My configuration is a mirror of two SSDs with two swap partitions, as shown below. I have no more issues with swap and I am happy with the performance.

Code:

 gpart show -l
=>       40  468862048  ada0  GPT  (224G)
         40       1024     1  boot0  (512K)
       1064        984        - free -  (492K)
       2048  419430400     2  zfs0  (200G)
  419432448    8388608     3  swap0  (4.0G)
  427821056   41041032        - free -  (20G)

=>       40  468862048  ada1  GPT  (224G)
         40       1024     1  boot1  (512K)
       1064        984        - free -  (492K)
       2048  419430400     2  zfs1  (200G)
  419432448    8388608     3  swap1  (4.0G)
  427821056   41041032        - free -  (20G)

Code:

swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/swap0    4194304        0  4194304     0%
/dev/gpt/swap1    4194304        0  4194304     0%
Total             8388608        0  8388608     0%

tmpfs(5) is active as well. I hope it helps.
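
As an illustrative sketch (not from the original post), the swap usage percentages mentioned throughout this thread can be computed from `swapinfo` output like the listing above. The helper name `swap_used_percent` is hypothetical; it assumes the first input line is the column header and sums the per-device rows, ignoring the `Total` row that appears when several devices are configured.

```shell
# swap_used_percent: read swapinfo-style output on stdin and print the
# overall swap usage as an integer percentage.
# The function name is hypothetical, chosen for illustration only.
swap_used_percent() {
    awk '
        # Skip the header line and the summary row; sum the device rows.
        NR > 1 && $1 != "Total" { total += $2; used += $3 }
        END {
            if (total > 0)
                printf "%d\n", used * 100 / total
            else
                print 0
        }
    '
}

# Usage on a live system (not executed here):
#   swapinfo -k | swap_used_percent
```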

topcat · Mar 2, 2017

Two comments:

1. Do not do swap on ZFS. It is not recommended. I have a ZFS based 10.3 machine with spinning disks, on which I have a small swap partition. I needed more swap, so I added a UFS partition containing a swap file. A swap file is convenient as I can change the size without rebooting.

2. neptunium, it's interesting that the freezes continue for you with a swap file on a spinning disk. I can't get mine to crash in this setup (again the swap file is on a UFS partition). I have repeatedly maxed out the swap on this machine during my testing and it just shrugs it off.

Also, putting the swap partition on an SSD means that it will appear permanently in use to the disk firmware, and hence cannot be used for wear leveling. When swap is a file on a filesystem, TRIM solves this issue.

labiol · Jun 26, 2017

I have the same issue. In my opinion (just my observation) the problem is with the swap file (on some SSDs). Since I moved from a swap file to a swap partition, I have had no more unexpected system freezes.

chrcol · Jun 27, 2017

wblock, when an SSD has the ncq_trim quirk flag, does that mean standard TRIM is still issued to the drive? I ask because all the SSDs I use on FreeBSD are affected.

Thanks.

Also, neptunium, tuning the metadata cache for ZFS may help with the issue of grepping a large directory.

The limit is set by this sysctl, which I believe defaults to 1/4 of the ARC size:

'kstat.zfs.misc.arcstats.arc_meta_limit'

This sysctl will tell you how full it is:

'kstat.zfs.misc.arcstats.arc_meta_used'
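
As an illustrative sketch (not from the original post), the two values above can be combined into a single fill percentage. The helper name `arc_meta_percent` is hypothetical; it only does integer arithmetic on the two numbers it is given.

```shell
# arc_meta_percent USED LIMIT: print how full the ZFS metadata cache is,
# as an integer percentage of its limit.
# The function name is hypothetical, chosen for illustration only.
arc_meta_percent() {
    used="$1"
    limit="$2"
    # Guard against division by zero if the limit is reported as 0.
    if [ "$limit" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( used * 100 / limit ))
}

# Usage on a FreeBSD system with ZFS (not executed here):
#   arc_meta_percent \
#       "$(sysctl -n kstat.zfs.misc.arcstats.arc_meta_used)" \
#       "$(sysctl -n kstat.zfs.misc.arcstats.arc_meta_limit)"
```

A value that stays close to 100 during a recursive grep would suggest the metadata limit is the bottleneck.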

execve · Aug 2, 2017

I faced this issue recently, and this thread actually helped me find out what was causing it. I was not using an SSD but was facing a system freeze while doing intensive I/O operations. The root cause seems to be the swap file I use, since I don't have a dedicated swap partition available on that system.

I raised a bug report for this (220971), but wanted to ask here whether anyone knows what information I could provide to help the developers. Any suggestions? There are no logs and no system panic, just a freeze.