A few days ago I decided to convert my home server from Ubuntu Linux to FreeBSD 8.2 so that I could use ZFS for data integrity. I am aware that ZFS is available on Linux, but from what I've read the FreeBSD implementation is better.
I installed FreeBSD onto a mirrored ZFS pool, and created a raidz2 pool with five 2 TB drives (two Seagate, two Western Digital, one Samsung) for my data. I deliberately mixed manufacturers to guard against the possibility of a bad batch from a single factory.
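For reference, I created the data pool with something like the command below (the device names match the zpool status output further down; the exact invocation is reconstructed from memory):
Code:
# 5-disk double-parity pool; device names taken from the status
# output below, the exact command is from memory
zpool create tank raidz2 ad4 ad6 ad10 ad16 ad18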
It worked fine for about a day, until my system suddenly locked up and I was forced to do a hard power-down. When I tried to restart the system, it always hung in the same place. Eventually I suspected a bad drive, so I disconnected every drive except one member of the mirrored pool, after which the system was able to start. With a little trial and error, I discovered that one of the Western Digital drives was the culprit: the system booted fine with all the other disks reattached.
The problem was that my 5-disk raidz2 pool was reported as faulted, even though only a single disk was unavailable, and I could find no way to fix it. Eventually I gave up and recreated the pool, using another Seagate drive in place of the failed Western Digital. I wasn't happy trusting my data to a filesystem that could be taken out by a single drive failure, so I decided to do some experiments.
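My expectation was that a single failed disk could simply be swapped out with zpool replace, roughly like this (the device names here are only illustrative; ad20 is a made-up name for the replacement disk), but with the pool reported as faulted rather than degraded that never seemed to be possible:
Code:
# what I expected to be able to do: swap the dead WD drive for
# the new Seagate (ad20 is a hypothetical replacement device)
zpool replace tank ad18 ad20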
I tried swapping disks between SATA ports, which didn't seem to cause any problems. I also disconnected one or two disks at a time, and the pool remained available in a degraded state.
To reproduce what had happened to my earlier pool, I disconnected the entire pool and reconnected only some of the drives. I discovered that the pool sometimes became available with 1 or 2 missing disks, but other times a single missing disk was enough to put it in a faulted state. It seemed to depend on which disk was missing. Here is an example:
Code:
[pollochr@Warspite ~]$ zpool status tank
  pool: tank
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        FAULTED      0     0     1  corrupted data
          raidz2    DEGRADED     0     0     6
            ad6     ONLINE       0     0     1
            ad4     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad18    UNAVAIL      0     0     0  cannot open
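The action line refers to bringing the device back with zpool online, which I take to mean something like this once the disk is physically reattached (ad18 being the device reported UNAVAIL above):
Code:
# suggested action from the status output: after reattaching the
# disk, bring it back online (ad18 is the UNAVAIL device above)
zpool online tank ad18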
I shut down the system, reattached the disk that was missing above, detached 2 different disks, and restarted. I now see this output:
Code:
[pollochr@Warspite ~]$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            ad6     UNAVAIL      0     0     0  cannot open
            ad4     REMOVED      0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad18    ONLINE       0     0     0

errors: No known data errors
My understanding is that raidz2 should be able to cope with the loss of any two drives without losing data, so I can't understand what's going on. I've searched online and found a few examples of other people having similar problems, but no clear explanation of the cause.
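If it would help with diagnosis, I believe I can dump the ZFS labels from each member disk with zdb, something like this (assuming zdb can read the raw devices directly), and post the results:
Code:
# print the ZFS label of one pool member; repeat for each of
# ad4, ad6, ad10, ad16 and ad18
zdb -l /dev/ad4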
Any help would be much appreciated.