Hanging ZFS mount

cracauer@

Developer
I have another Seagate problem here. However, I think there is an opportunity for the ZFS code to do better. This behavior shows up nowhere in dmesg or in zpool status, and it probably should, instead of the mount just hanging indefinitely.

Long story:

I have one zpool containing one, now two, ZFS filesystems that will make the command-line zfs mount hang indefinitely.

Both got into this state during heavy deletions (via rsync) while running 10.1-RELEASE (amd64, 12 GB RAM). I believe both deletion runs ended in a machine hang. After a reboot, all other filesystems in the pool can still be mounted.

zfs mount sits in STATE zio->i and makes no system calls. The kernel is not CPU-busy.
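When a process is stuck like this, its kernel stack usually names the subsystem it is sleeping in. A quick way to look, assuming the stuck process is the only `zfs` process on the box (the pgrep pattern is my assumption):

```shell
# Print the kernel stack(s) of the hung zfs process; a stack full of
# zio_wait()/txg_wait frames would confirm it is blocked on pool I/O.
procstat -kk $(pgrep -x zfs)
```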

I booted into a 10.2-PRERELEASE kernel to see whether that one would mount it, but the behavior is the same. I did one scrub, which neither found errors nor fixed this. As you can see below, a new scrub I started now is very slow.
Code:
~(flippers)2% sud zpool status
  pool: cbackup6
state: ONLINE
  scan: scrub in progress since Sun Jul  5 11:58:25 2015
        10.9M scanned out of 7.90T at 37.7K/s, (scan is slow, no estimated time)
        0 repaired, 0.00% done
config:

        NAME        STATE     READ WRITE CKSUM
        cbackup6    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada4p3  ONLINE       0     0     0

errors: No known data errors

Let's test the individual disks for read throughput.

Code:
~(flippers)7% sud sh -c "cstream -o- -T1 < /dev/ada0p3"
126541824 B 120.7 MB 1.00 s 126026756 B/s 120.19 MB/s
257990656 B 246.0 MB 2.02 s 127887250 B/s 121.96 MB/s
^C~(flippers)8% sud sh -c "cstream -o- -T1 < /dev/ada1p3"
124739584 B 119.0 MB 1.00 s 124602276 B/s 118.83 MB/s
255311872 B 243.5 MB 2.01 s 127209809 B/s 121.32 MB/s
^C~(flippers)9% sud sh -c "cstream -o- -T1 < /dev/ada2p3"
129359872 B 123.4 MB 1.03 s 125215607 B/s 119.41 MB/s
260931584 B 248.8 MB 2.05 s 127561604 B/s 121.65 MB/s
^C~(flippers)10% sud sh -c "cstream -o- -T1 < /dev/ada3p3"
126869504 B 121.0 MB 1.01 s 125361627 B/s 119.55 MB/s
257261568 B 245.3 MB 2.02 s 127624332 B/s 121.71 MB/s
^C~(flippers)11% sud sh -c "cstream -o- -T1 < /dev/ada4p3"
0 B 0.0 B 1.53 s 0 B/s 0.00 B/s
8192 B 8.0 KB 3.07 s 2664 B/s 2.60 KB/s
49152 B 48.0 KB 4.62 s 10628 B/s 10.38 KB/s
90112 B 88.0 KB 6.16 s 14633 B/s 14.29 KB/s
114688 B 112.0 KB 7.70 s 14895 B/s 14.55 KB/s
155648 B 152.0 KB 9.23 s 16857 B/s 16.46 KB/s


All right, so as you can see, ada4 is "a bit" slow. It is probably stuck in sector-reallocation retries, courtesy of that great storage manufacturer's firmware.

smartctl reveals where the actual problem is; the key to the brokenness is the model family.

Code:
/tmp(flippers)14% sud smartctl -a /dev/ada4
smartctl 6.2 2014-02-18 r3874 [FreeBSD 10.2-PRERELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Serial Number:    W1F0KCXQ
LU WWN Device Id: 5 000c50 051118b68
Firmware Version: CC9E
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jul  5 12:09:41 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 338) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3081) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       19330312
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       98
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       4506389487
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       18124
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       98
183 Runtime_Bad_Block       0x0032   096   096   000    Old_age   Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   075   075   000    Old_age   Always       -       25
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       4 6 10
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   054   048   045    Old_age   Always       -       46 (Min/Max 42/46)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       79
193 Load_Cycle_Count        0x0032   096   096   000    Old_age   Always       -       9423
194 Temperature_Celsius     0x0022   046   052   000    Old_age   Always       -       46 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       9
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12943h+01m+35.960s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       68465224119808
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       77857227983056

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 7095 hours (295 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
 
I wonder what should be done here instead.

Myself, I will kick this disk out of the pool, since it makes some filesystems outright inaccessible and probably doesn't provide good redundancy for the others anyway. I wouldn't be comfortable doing that automatically to a user.

Putting a timeout into the zfs command-line tool and then printing a complaint to the terminal sounds easy, but it doesn't work that simply, because you would have to take care of cancelling the mount in the kernel. You can't let the command-line utility return to the shell while the mount system call is still pending; if the mount then happened to go through eventually, the user would be misinformed.
 
Actually, it is worse. I kicked out the Seagate, but the hangs continue. Now everything is slow while I try to mount this ZFS filesystem, even without the Seagate, which was the only slow disk before.

I wonder whether the dedup=on property has something to do with this.

Both filesystems had it on. Both got into the broken state after lots of deletions. Trying to set dedup=off right now also seems to hang, on a filesystem that mounted successfully, is not broken, and has only two files in it... Hmmm...

I also wonder whether the mount command will eventually succeed. It causes slow but constant disk I/O; it is doing something.
 
You should probably know that deduplication is not a feature you can turn on just to try it. It requires lots and lots of RAM (how much depends on your pool size; I've heard figures up to double the pool size) and a powerful machine. And while you can set dedup=off again, already-deduplicated data stays deduplicated until it is rewritten.
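For sizing it before turning it on, `zdb -S` can simulate dedup on an existing pool and print the would-be dedup table histogram without changing anything (pool name taken from the thread above):

```shell
# Dry-run dedup simulation; the summary at the end shows the estimated
# dedup ratio and how many DDT entries the pool would need in core.
zdb -S cbackup6
```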
 
It sounds like the biggest issue is simply the behavioral difference between consumer and enterprise drives: enterprise drives return a failure almost immediately, while consumer drives retry over and over. I'm not sure what the best course of action on the software side is for dealing with the drive firmware, since making assumptions based on broken or bad firmware behavior may come with unintended consequences.
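On drives that support it, that retry window can be capped with SCT ERC (what vendors market as TLER/CCTL). A sketch, with the caveat that many desktop drives, reportedly including this ST3000DM001, simply don't honor it:

```shell
# Show the drive's current error-recovery-control timers (if supported).
smartctl -l scterc /dev/ada4
# Try to cap read and write recovery at 7.0 seconds (units of 100 ms),
# so a bad sector fails fast instead of stalling the whole bus.
smartctl -l scterc,70,70 /dev/ada4
```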
 
There are no read errors on the disk involved.

Both the dedup=off and the mount of one of the lost filesystems finally succeeded; the latter took a day. As for what caused the hangs on heavy deletions, I don't know; either the Seagate was blocking the bus, or maybe it is RAM. I would have thought that 12 GB for these lame 7200 rpm disks with a single operation would be sufficient, and even if not, that there would be a better error mode (such as unlink(2) returning an error).

I'll retry the unlink with the Seagate out of the array.
 
Disregard some of my train of thought. For some reason my eyes lined the reallocated-sectors row up with the 4506389487. Oops! That is a pretty big difference.
 
So, doing the heavy-deletion thing again on the then-mounted ZFS was hanging again, as in locking the machine up hard (no panic). This really looks like a case for a debug kernel; it looks like a kernel deadlock. At this point dedup was turned off, but as mentioned, it was originally on when this filesystem was filled with the contents I now want to nuke. I did not try unlinking a lot more slowly; I'll do that with a debug kernel.

This was with the Seagate kicked out of the pool. As hard as that is to believe, this problem in a Seagate-equipped system might not have been caused by the Seagate. Or maybe the Seagate got the ZFS into the wedged state?
 
Disregard some of my train of thought. For some reason my eyes lined the reallocated-sectors row up with the 4506389487. Oops! That is a pretty big difference.

That's not a count. That is a Seagate-specific bitmap.
 
This is going very badly. Mounting some of these pre-crash ZFS filesystems inside the pool will now consistently make various 10.x versions hang with no notice.

FreeBSD-CURRENT with lock debugging enabled gets into a panic, but it says there is a locking problem in fork(). The backtrace has nothing ZFS- or block-device-related in it.

I am pretty sure the original problem came about because the Seagate disk went into its slow mode, leaving some ZFS operations unfinished. I am ready to redo the entire zpool (without the Seagate; I got another Toshiba), but I would like to mount the zpool one more time to get some data off. Right now it's not looking good. I tried both with a degraded array (no Seagate) and with the Seagate in.

FreeBSD keeps trying to mount the ZFS in question for many hours before the machine hangs. There is nothing in dmesg or anywhere else useful; it is silent. iostat shows activity in the tens to hundreds of KB/sec.

I have no reason to accuse the hardware of instability.

I think my list of options now is:
  • See whether the locking-during-fork panic is an unrelated -CURRENT instability and, if so, use a -STABLE kernel with lock detection. Any ideas how to make it print something useful during those hours of mount attempts?
  • Try Linux or Solaris
  • Try resilvering the array and see whether that makes the mount faster. It is currently in degraded mode with the Seagate dropped; I could put in the new Toshiba.
  • The root of this machine is on NFS (PXE boot), and the mountpoints for the ZFS filesystems are on NFS. I wonder whether moving them local could help somehow.
 
I tracked this down.

The problem is that this damaged ZFS somehow needs a lot more memory to mount, or to add a drive to the pool, or anything else. The memory in use is pure RAM, and no paging happens, so the thing just grows until all of userland is dead and the machine hangs.

Going from a 12 GB machine to 24 GB didn't do it, but I did catch the growth via vmstat 1 and by no longer being able to execute a command from the shell for lack of memory.
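For anyone reproducing this, the growth is visible without a debugger. A sketch of what to watch on FreeBSD while the mount attempt runs (the sysctl names are standard, but treat the exact set as my assumption):

```shell
# Sample ARC size (bytes) and wired pages every few seconds while the
# mount runs; monotonically growing wired memory with a starved userland
# points at unpageable kernel allocations being exhausted.
while sleep 5; do
    sysctl -n kstat.zfs.misc.arcstats.size vm.stats.vm.v_wire_count
done
```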

By coincidence, memory to go to 80 GB arrived in the mail today (not making this up). Another try showed that I needed 64 GB of RAM, but it did mount now. Copying out the data as we speak. This ZFS holds 365 GB; the pool has:
Code:
NAME       SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
cbackup6  13.5T  8.43T  5.07T    50%         -    62%  1.27x  DEGRADED  -

I wonder whether the dedup is involved here.
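A back-of-envelope check, using the commonly cited figure of roughly 320 bytes of in-core DDT entry per unique block (both that constant and the 128 KiB average block size are assumptions):

```shell
# Estimate dedup-table RAM for 8.43 TiB allocated at 128 KiB per block.
awk 'BEGIN {
    tib    = 1024 ^ 4                 # bytes per TiB
    alloc  = 8.43 * tib               # pool ALLOC from zpool list above
    blocks = alloc / (128 * 1024)     # unique-block count (assumed size)
    printf "%.1f GiB\n", blocks * 320 / 1024 ^ 3   # ~21 GiB
}'
```

Smaller average blocks inflate this fast: at an 8 KiB average, the same pool would need sixteen times as much, which fits the 64 GB observation far better than the 12 GB this machine started with.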
 
This debugging brought to you by testing with:
  • ZFS on Linux, FreeBSD-CURRENT, FreeBSD-10.1-RELEASE and FreeBSD-10.2-STABLE (same hangs for all, except -CURRENT, which had a different panic)
  • All debug zfs sysctls (nothing useful)
  • Two mainboards (hence going 12 to 24 GB)
  • Two power supplies
  • Replacing that Seagate (except the zpool had become unsyncable at some point)
  • Stupid Linux somehow wiping out the GPT table on the new additional drive that I already had boot-ready :mad:

Overall I was surprised how similar Linux and FreeBSD behaved. Same hang in about the same time.
 
Yes, that is the top suspect here: RAM piling up when deleting a lot of things fast while ZFS tries to sort out dedup.

It looks like the whole scenario will be reasonably easy to reproduce, and with a single disk, so I don't have to play with this big array. I have no idea what the code looks like, but given that FreeBSD and Linux hung in identical manners, I think a fix would be very satisfying to have. If this is really centered on dedup, it seems likely that you could do one of:
  • Use pageable memory
  • Trade time for memory
  • Abort instead of just locking up the machine
  • Give up on dedup in the particular situation

Most of my dedup there comes from renaming, BTW. I don't actually have the same file twice; files have been renamed or moved across directories on the primary machine, and this ZFS is a backup. Instead of teaching rsync to track renames, which is difficult except within the same directory, I opted for a filesystem with dedup. This is a side effect.

Coding time might also be well spent teaching rsync to do whole-run-scope rename tracking. On filesystems with unique inode numbers, that might be doable with a bit of additional metadata stored in a file.
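A sketch of that idea: snapshot an inode-to-path map of the source tree at each run, then join on the inode to recover moves. This uses GNU find's -printf for brevity (on FreeBSD you'd substitute stat(1)), the /data path is hypothetical, and it ignores hard links and inode reuse:

```shell
# At backup time N: record "inode path" for every file in the source.
find /data -type f -printf '%i %p\n' | sort > inodes.old
# At backup time N+1, after renames/moves happened on the source:
find /data -type f -printf '%i %p\n' | sort > inodes.new
# Same inode, different path => a rename the backup tool could replay
# with mv on the destination instead of re-transferring the file.
join inodes.old inodes.new | awk '$2 != $3 { print $2, "->", $3 }'
```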
 
Have you shared this on freebsd-fs@FreeBSD.org yet? I'm betting that some fellow-developer assistance from the ZFS gurus will help you come up with the best course of action. I seem to remember recommendations to add enough RAM to get the pool to behave, or to import it on another machine with much more RAM. I don't know whether this has changed; the mailing list may be the best source of current information on this one.
 
I'll repeat this so that no one forgets it later:
You should probably know that deduplication is not a feature you can turn on just to try it. It requires lots and lots of RAM (how much depends on your pool size; I've heard figures up to double the pool size) and a powerful machine. And while you can set dedup=off again, already-deduplicated data stays deduplicated until it is rewritten.
 
I actually like the semi-manual dedup that btrfs has. You make a pass over the filesystem when (and if) you have plenty of RAM, then submit the blocks to be deduplicated to the kernel/filesystem. If you run out of RAM, you stop adding new signatures to the dedup run but continue deduping against what you already have. You can also start it inside directories, at less than filesystem scope.

Personally, I also have no need for block-level deduplication. My "duplicates", resulting from files having been moved before an rsync, are all file-level duplicates.
 