What is wrong with my ZFS?

Hi, I have been googling and reading forums for a while now and tried many different things to fix my unstable ZFS file server. I am at a loss as to where to look next, so I hope someone here has a good idea.

The Story:
A year ago I made a file server with FreeBSD (AMD64, 2GB RAM), and three 2TB disks. It worked well at first, but then it started getting kernel panics. So I tuned it the best I could from guides/tips and suggestions I found online. Finally I got rid of all the panics, but it was still slow and disks kept "disappearing" randomly. I ran checks on the disks (Samsung's ES-tool) without results. So I fiddled with the SATA cables, and presumed that it was the problem.

Over the following months the disks only "disappeared" a few times. I bought more RAM to boost performance, removed all my tuning, and was pleased to find a significantly better response. The joy was short-lived, though, as the disks started "disappearing" at an increased rate. Some further "testing" showed that the errors occur mostly when the disks are being written to. Another scan with ES-tool now reports bad sectors with code AJ36, and ES-tool suggests a low-level format. I am inclined to try it, but I need my data.

I hope there are some bright minds out there who can help me. My FreeBSD knowledge is very limited beyond everyday usage, installing, and updating.

Here is a printout of the latest dmesg (the RAID is resilvering two disks):
http://pastebin.com/qScqzaKp
 
I would guess a power supply problem, but it is hard to say. Could you give the brand and model number of the power supply?

Low-level formatting is usually ignored by modern drives. They return a "yes, did it" code instantly, but don't actually do anything.
 
These are all the components I bought for the server:

- Chieftec Smart Series 550W PSU
ATX 12V V2.3, Standard, 1x 6pin+1x 6+2pin PCIe, 6x SATA, 120mm

- ASUS M4A78LT-M LE, Socket-AM3
mATX, AMD760G+SB710, DDR3, 1xPCIe(2.0)x16, GbLAN,

- AMD Athlon II X2 250
Dual Core, 3.0GHz, AM3, 2MB, 65W, Boxed

- Kingston ValueR. DDR3 1333MHz 2GB, CL9
Kit w/two matched ValueRAM 1024MB DDR3

- Samsung SpinPoint F4EG 2TB
5400RPM, SATA 3.0Gbps, 3.5", 32MB Cache, 8.9ms


As I said, I later upgraded the RAM and now have:

- Corsair XMS3 DDR3 1333MHz 8GB CL9
Kit w/2x 4GB XMS3, CL9-9-9-24, for Phenom II and C


Edit: I also used an old IDE drive for the system disk.
 
The Chieftec appears to be a Delta power supply, which is not too bad. It still might be worth swapping in another as a test. The interesting part is that it was initially okay but started having problems later; something degraded, whether hardware or software. Are the disks in a hardware RAID? Some consumer disks will drop out of RAID arrays because they lack TLER.
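If you want to check whether these drives expose an error-recovery timeout (ERC/TLER) setting at all, smartctl from sysutils/smartmontools can show it. Just a sketch, device name assumed:

Code:
# smartctl -l scterc /dev/ada0    # prints the SCT Error Recovery Control read/write timeouts, if the drive supports them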
 
I will try another PSU as soon as I find one.
Besides the components I mentioned in my last post there are only a case and some cables, so no HW RAID :)

The problem has been there all along but appears to have increased in frequency over time.
It may even have been there from the very beginning; I just didn't notice it until I got rid of the panics.

It might also be worth mentioning that I started with FreeBSD 8.0 (and thus an older ZFS version) and have since upgraded to FreeBSD 9.0 with ZFS v28.
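In case it matters, I believe the pool itself reports v28 now; checking that is just something like:

Code:
# zpool get version storage
# zpool upgrade          # with no arguments this only reports pools not at the current version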
 
Post what tuning you have done; some settings that were recommended for earlier versions might even be harmful on 9.0.
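Something like the following should show everything that is still in effect (loader.conf is where most of the old tuning advice ends up); the sysctl names are the usual ones, adjust if yours differ:

Code:
# cat /boot/loader.conf                  # boot-time ZFS tunables, if any remain
# sysctl vfs.zfs                         # current ZFS sysctl settings
# sysctl kstat.zfs.misc.arcstats.size    # current ARC size in bytes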
 
I remember removing/changing some tuning when I updated, and I removed all tuning when I upgraded to 8GB RAM.
 
I am seeing these errors in the dmesg:

Code:
GEOM: ada0: corrupt or invalid GPT detected.
GEOM: ada0: GPT rejected -- may not be recoverable.
GEOM: ada1: corrupt or invalid GPT detected.
GEOM: ada1: GPT rejected -- may not be recoverable.
GEOM: ada2: corrupt or invalid GPT detected.
GEOM: ada2: GPT rejected -- may not be recoverable.

From what I can find online, this happens because there are leftover MBR or GPT partition tables on the disks. Could this have something to do with the problems I am seeing, or is it totally unrelated?
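I assume something like this would show what GEOM actually sees on each disk (I am only guessing at the right commands here):

Code:
# gpart show ada0    # whatever partition table GEOM still recognizes
# gpart list ada0    # more detail, including the scheme and its state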

I have also found discussions saying that the 'AJ36: Bad sector' result I am getting on the disks might be a false positive. I am inclined to believe that, since buying three new disks that all have bad sectors seems unlikely.
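If the SMART counters are any good as a cross-check, I guess they would tell a similar story. This assumes sysutils/smartmontools installed from ports:

Code:
# smartctl -H /dev/ada0    # overall health assessment
# smartctl -A /dev/ada0    # check Reallocated_Sector_Ct and Current_Pending_Sector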
 
Another update:
After the resilvering of the two disks, I did:
Code:
# zpool clear storage
# zpool scrub storage
The scrub completed and said:
Code:
scan: scrub repaired 256K in 6h6m with 2 errors on Thu Jun 14 03:39:35 2012
It recommended restoring the broken files from backup, so I checked them and just deleted them instead (they were incomplete torrents).
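For completeness, the list of affected files is what the verbose status shows:

Code:
# zpool status -v storage    # -v lists the files hit by permanent errors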

Then, when starting a new download to the RAID, it almost immediately said:
Code:
ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 80 serr 00000000 cmd 0000e117
(ada0:ahcich0:0:0:0): lost device
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 000007fe ss 000007fe rs 000007fe tfd 80 serr 00000000 cmd 0000e117
(ada0:ahcich0:0:0:0): removing device entry
Closed disk ada0 -> 6
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 10 port 0
ahcich0: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000 cmd 0000ea17
GEOM: ada1: corrupt or invalid GPT detected.
GEOM: ada1: GPT rejected -- may not be recoverable.
Limiting open port RST response from 592 to 200 packets/sec
And:
Code:
        NAME                     STATE     READ WRITE CKSUM
        storage                  DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            8235960250751855291  REMOVED      0     0     0  was /dev/ada0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0
 
The GPT errors make me think that ZFS is overwriting GPT sectors, specifically the primary table at the beginning of the disk; it would be a different error if the secondary GPT table at the end of the disk were being overwritten. Why ZFS would do that probably comes down to the initial setup: the pool was created on the whole disk rather than on a GPT partition.
 
Okay. So I will consider the GPT errors unrelated to the current problem, but it might then be a good idea to back up and recreate the pool without GPT partitions later?
 
That could be the source of the problem. GEOM would protect the GPT tables and prevent writes to those sectors if they were in use (usually if a partition is mounted). It seems unlikely that GEOM would rewrite those blocks if ZFS overwrote them, but maybe ZFS is getting the GPT data back instead of the data it wrote.
 
Okay, I will try to fix that then. So if I have three disks in my pool, how would I go about fixing that problem without damaging my data? Can I remove one disk from the RAID, get rid of the GEOM partitions, and re-add the disk?
 
Step 0: make a full backup on some other media. There is no safe alternative.
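If the backup target can hold a ZFS filesystem, a recursive snapshot plus zfs send is one way to do it. Pool and target names here are only placeholders:

Code:
# zfs snapshot -r storage@backup
# zfs send -R storage@backup | zfs receive -F backuppool/storage
# zfs send -R storage@backup > /mnt/usb/storage.zfs    # or dump the stream to a file on other media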

If you want GPT partition tables on the drives, it's not a big deal; just refer to the partitions when allocating space for ZFS (or anything else). In other words, with a single GPT partition that covers the whole drive, use ada0p1 rather than just ada0.
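A rough sketch of what that looks like when the pool is recreated from the backup; device names are taken from your dmesg:

Code:
# gpart create -s gpt ada0         # new GPT table on the bare drive
# gpart add -t freebsd-zfs ada0    # one partition covering the drive -> ada0p1
# (repeat for ada1 and ada2)
# zpool create storage raidz ada0p1 ada1p1 ada2p1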
 
Alright, I will try that then :) If I don't want GPT partition tables on the disks, what do I have to do? The three disks are used exclusively for the storage RAID, so I don't see any point in having partitions.
 
Use [cmd=]gpart destroy -F[/cmd] to remove the GPT partition tables. Then the drives are effectively blank, and can be referred to just by device name.
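That is, one per drive, assuming the disks still show up as ada0 through ada2:

Code:
# gpart destroy -F ada0
# gpart destroy -F ada1
# gpart destroy -F ada2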
 
Okay, time for an update. I have taken a backup so I can experiment some more without worrying about data loss. Scrapping the GPT partition tables and recreating the zpool got rid of the GEOM errors, but it has not solved the original problem.

Code:
ahcich2: Timeout on slot 18 port 0
ahcich2: is 00000000 cs 0ffc0000 ss 0ffc0000 rs 0ffc0000 tfd 80 serr 00000000 cmd 0000f217
(ada2:ahcich2:0:0:0): removing device entry
Closed disk ada2 -> 6
ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich2: Poll timeout on slot 27 port 0
ahcich2: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0000fb17
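
For reference, the pool is now just the three bare devices, recreated roughly like this:

Code:
# zpool create storage raidz ada0 ada1 ada2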
 
UPDATES:
Okay, so I submitted a warranty return for the three disks and have received one back so far, with the other two on their way. I actually lucked out on this one, because I am getting these disks in return:

Western Digital AV-GP 2TB, SATA 2.0, 1 million hours MTBF, 64MB, 4.5W, 24x7 reliability, whisper quiet.

But anyway, I have made a zpool of the one disk I have received and restored all my data to it. I will add the other two once they arrive. So far so good: no errors or anything. I don't want to call it a success just yet, as I want to wait and see what happens once I add the other two.

(fingers crossed, knock on wood, etc...)
 