ZFS Unable to import pool - I/O error

For the last four years I've been successfully using a script to make daily incremental backups to a USB hard drive with zfs send/receive. The script imports the backup pool for the duration of the backup with zpool import -N -o cachefile=none m3. Last night the script failed with the message:
Code:
cannot import 'm3': I/O error
Destroy and re-create the pool from
a backup source
There was no further info in any logs apart from the following in the console log:
Code:
Feb 15 20:11:58 curlew kernel: Feb 15 20:11:58 curlew ZFS[20269]: failed to load zpool $m3
Feb 15 20:11:58 curlew kernel: Feb 15 20:11:58 curlew ZFS[20279]: failed to load zpool $m3
The script checks the status of all pools with zpool status immediately before exporting the backup pool on completion, and reports any problems. Whatever caused this must therefore have happened either as the pool was exported the previous night or the moment the script tried to import it again, because the drive was disconnected and powered down during the intervening time.
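For context, here is a minimal sketch of the kind of flow the script follows; the import command is the one actually used, but the pool/dataset names and the send/receive line are only placeholders:
Code:
#!/bin/sh
# Sketch of the nightly backup flow (tank, @yesterday and @today are placeholders).
set -e

# Import the backup pool without mounting anything and without touching
# the system cachefile.
zpool import -N -o cachefile=none m3

# Incremental replication from the previous snapshot to the new one.
zfs send -R -i tank@yesterday tank@today | zfs receive -Fdu m3

# Report any pool problems before exporting; prints only a single line
# when all pools are healthy.
zpool status -x

zpool export m3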

This isn't a major disaster because I can recreate the backup pool and send a new replication dataset, but I'm concerned that an entire pool can be silently destroyed so easily. I'm interested to know the possible causes and what can be done to minimise the risk of it happening again.
 
Is the disk itself still in good order? Disks tend to break over time. Check with smartctl(8), although not all USB enclosures pass the SMART commands through to the disk.
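Something along these lines, assuming the drive shows up as da4; the -d sat flag is only needed if the USB bridge requires the SCSI-to-ATA pass-through:
Code:
# Full SMART report: health, attributes, error and self-test logs.
smartctl -a -d sat /dev/da4

# Quick overall health verdict only.
smartctl -H -d sat /dev/da4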
 
Is the disk itself still in good order?
Thanks for the hint. It did indeed turn out to be the cause of the problem.

I had expected to see lots of read-error messages and, in their absence, wrongly assumed that the disk was physically OK, but a subsequent full surface scan with SeaTools detected lots of errors. smartctl now shows over 1000 reallocated sectors, compared to zero only three days before.

Needless to say, a new drive is on order and the old one has been relegated to temporary short-term storage of less important data.

Is it normal for bad sectors to totally destroy an entire ZFS pool, or was I just unlucky that they coincided with some essential metadata defining the pool?
 
Bad sectors are going to happen over a drive's lifetime. Normally the drive's firmware remaps them to a spare area of the disk; you won't notice this because it's all done behind the scenes within the firmware. However, when that spare area is full the drive can't remap bad blocks any more and you start getting read and/or write errors. If such a read error happens to hit an important part of ZFS's (meta)data, there's not much ZFS can do about it.
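If the enclosure passes SMART through, you can watch the remapping counters directly; a sketch assuming the disk is da4:
Code:
# Attribute 5 counts sectors already remapped to the spare area,
# 197/198 count sectors the drive could not read or correct.
smartctl -A -d sat /dev/da4 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'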
 
In this case smartctl shows there should still be spare blocks available, so I assume the failed blocks had somehow been damaged to the point of being uncorrectable some time between last being written and the attempt to read them.

The backup pool that failed was only on a single drive, but the system pool is mirrored over two drives, so there should be less risk of total loss if a similar thing happens there.
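For what it's worth, a single-drive pool can be turned into a mirror later without recreating it, if a second USB drive becomes available; a sketch with assumed device names:
Code:
# Attach a second disk to the existing single-disk vdev, turning it into
# a two-way mirror (da4 is the current disk, da5 the new one).
zpool attach m3 da4 da5

# Watch the resilver progress.
zpool status m3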
 
There are typically two values you need to keep an eye on: 197 Current_Pending_Sector and 198 Offline_Uncorrectable. Especially that last one. When Offline_Uncorrectable starts showing up and steadily increases over time, you really need to replace the drive. Errors showing up during the short or long self-tests are also a good indication it might be time to replace it.
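Assuming the device node is /dev/da4, the self-tests can be run and checked with something like:
Code:
# Start a long (full surface) self-test; -t short runs the quick version.
smartctl -t long -d sat /dev/da4

# Check the results once the test has finished.
smartctl -l selftest -d sat /dev/da4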

On my servers I have smartd(8) running, which runs those tests on my drives on a regular basis. That way I get notified if one of them starts going bad before it stops working completely, so I can order a new disk and replace it before anything really bad happens. Most disks run for years without issues, some die within a couple of months. Disks die, that's just a reality; the only question is when.
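An example smartd.conf entry along those lines (on FreeBSD it lives in /usr/local/etc/smartd.conf; the device, schedule and mail address are just examples):
Code:
# Monitor all SMART attributes on da4 via the SAT pass-through, run a
# short self-test every day at 02:00 and a long one every Saturday at
# 03:00, and mail warnings to root.
/dev/da4 -a -d sat -s (S/../.././02|L/../../6/03) -m root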
 
Yes, I'm not going to rely on this drive any more for anything important.
Code:
S.M.A.R.T report for Maxtor M3 disk S/N WDE2Z95Q
+------------+------+--- 5 ---+-- 187 --+---- 188 ----+-- 197 --+-- 198 --+-- 199 --+
|    Date    | Dev  | ReAlloc | Reported|   Command   | Pending | Offline |   CRC   |
|            |      | Sectors |Uncorrect|   Timeout   | Sector  |Uncorrect|  Error  |
|            |      |  Count  |         |    Count    |  Count  |  Count  |         |
+------------+------+---------+---------+-------------+---------+---------+---------+
| 2019-12-30 |  da4 |       0 |       0 |           2 |       0 |       0 |       0 |

<snip>

| 2021-02-12 |  da4 |       0 |       0 |           5 |       0 |       0 |       0 |
| 2021-02-17 |  da4 |    1154 |      59 | 17180131337 |      40 |      40 |       0 |
 