What is wrong with my ZFS?

Hi, I have been googling and reading forums for a while now and tried many different things to fix my unstable ZFS file server. I am at a loss as to where to look next, so I hope someone here has a good idea.

The Story:
A year ago I made a file server with FreeBSD (AMD64, 2GB RAM), and three 2TB disks. It worked well at first, but then it started getting kernel panics. So I tuned it the best I could from guides/tips and suggestions I found online. Finally I got rid of all the panics, but it was still slow and disks kept "disappearing" randomly. I ran checks on the disks (Samsung's ES-tool) without results. So I fiddled with the SATA cables, and presumed that it was the problem.

Over the following months the disks only "disappeared" a few times. I bought more RAM to boost performance, removed all my tuning, and was pleased to find a significantly better response. The joy was short-lived, though, as the disks started "disappearing" at an increased rate. Some further "testing" showed that the errors occur mostly when the disks are being written to. Another scan with ES-tool now reports bad sectors with code AJ36, and ES-tool suggests a low-level format. I am inclined to try it, but I need my data.

I hope there are some bright minds out there who can help me. My FreeBSD knowledge is very limited beyond everyday usage, installing, and updating.

Here is a printout of the latest dmesg (the RAID is resilvering two disks):
http://pastebin.com/qScqzaKp
 
I would guess a power supply problem, but it is hard to say. Could you give the brand and model number of the power supply?

Low-level formatting is usually ignored by modern drives. They return a "yes, did it" code instantly, but don't actually do anything.
 
These are all the components I bought for the server:

- Chieftec Smart Series 550W PSU
ATX 12V V2.3, Standard, 1x 6pin+1x 6+2pin PCIe, 6x SATA, 120mm

- ASUS M4A78LT-M LE, Socket-AM3
mATX, AMD760G+SB710, DDR3, 1xPCIe(2.0)x16, GbLAN,

- AMD Athlon II X2 250
Dual Core, 3.0GHz, AM3, 2MB, 65W, Boxed

- Kingston ValueR. DDR3 1333MHz 2GB, CL9
Kit w/two matched ValueRAM 1024MB DDR3

- Samsung SpinPoint F4EG 2TB
5400RPM, SATA 3.0Gbps, 3.5", 32MB Cache, 8.9ms


As I said, I later upgraded the RAM and now have:

- Corsair XMS3 DDR3 1333MHz 8GB CL9
Kit w/2x 4GB XMS3, CL9-9-9-24, for Phenom II and C


Edit: I also used an old IDE drive for the system disk.
 
The Chieftec appears to be a Delta power supply, which is not too bad. It still might be worth swapping in another as a test. The interesting part is that it was initially okay but started having problems later; something degraded, whether hardware or software. Are the disks in a hardware RAID? Some consumer disks will drop out of RAID arrays because they lack TLER.
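If you want to check whether these drives expose an error-recovery timeout (ERC/TLER) setting at all, smartctl from sysutils/smartmontools can show it. Just a sketch, device name assumed:

Code:
# smartctl -l scterc /dev/ada0    # prints the SCT Error Recovery Control read/write timeouts, if the drive supports them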
 
I will try another PSU as soon as I find one.
Besides the components I mentioned in my last post there are only a case and some cables, so no HW RAID :)

The problem has been there all along but appears to have increased in frequency over time.
It may even have been there from the very beginning; I just didn't notice it until I got rid of the panics.

It might also be worth mentioning that I started with FreeBSD 8.0 (and thus an older ZFS version) and have since upgraded to FreeBSD 9.0 with ZFS v28.
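In case it matters, I believe the pool itself reports v28 now; checking that is just something like:

Code:
# zpool get version storage
# zpool upgrade          # with no arguments this only reports pools not at the current version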
 
Post what tuning you have done; some settings that were recommended for earlier versions might even be harmful on 9.0.
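Something like the following should show everything that is still in effect (loader.conf is where most of the old tuning advice ends up); the sysctl names are the usual ones, adjust if yours differ:

Code:
# cat /boot/loader.conf                  # boot-time ZFS tunables, if any remain
# sysctl vfs.zfs                         # current ZFS sysctl settings
# sysctl kstat.zfs.misc.arcstats.size    # current ARC size in bytes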
 
I remember removing/changing some tuning when I updated, and I removed all tuning when I upgraded to 8GB RAM.
 
I am seeing these errors in the dmesg:

Code:
GEOM: ada0: corrupt or invalid GPT detected.
GEOM: ada0: GPT rejected -- may not be recoverable.
GEOM: ada1: corrupt or invalid GPT detected.
GEOM: ada1: GPT rejected -- may not be recoverable.
GEOM: ada2: corrupt or invalid GPT detected.
GEOM: ada2: GPT rejected -- may not be recoverable.

From what I can find online, this happens because there are leftover MBR or GPT partition tables on the disks. Could this have something to do with the problems I am seeing, or is it totally unrelated?
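I assume something like this would show what GEOM actually sees on each disk (I am only guessing at the right commands here):

Code:
# gpart show ada0    # whatever partition table GEOM still recognizes
# gpart list ada0    # more detail, including the scheme and its state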

I have also found discussions saying that the 'AJ36: Bad sector' result I am getting on the disks might be a false positive. I am inclined to believe that, since buying three new disks that all have bad sectors seems unlikely.
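If the SMART counters are any good as a cross-check, I guess they would tell a similar story. This assumes sysutils/smartmontools installed from ports:

Code:
# smartctl -H /dev/ada0    # overall health assessment
# smartctl -A /dev/ada0    # check Reallocated_Sector_Ct and Current_Pending_Sector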
 
Another update:
After the resilvering of the two disks, I did:
Code:
# zpool clear storage
# zpool scrub storage
The scrub completed and said:
Code:
scan: scrub repaired 256K in 6h6m with 2 errors on Thu Jun 14 03:39:35 2012
It recommended restoring the broken files from backup, so I checked them and just deleted them instead (they were incomplete torrents).
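For completeness, the list of affected files is what the verbose status shows:

Code:
# zpool status -v storage    # -v lists the files hit by permanent errors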

Then, when starting a new download to the RAID, it almost immediately said:
Code:
ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 80 serr 00000000 cmd 0000e117
(ada0:ahcich0:0:0:0): lost device
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 000007fe ss 000007fe rs 000007fe tfd 80 serr 00000000 cmd 0000e117
(ada0:ahcich0:0:0:0): removing device entry
Closed disk ada0 -> 6
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 10 port 0
ahcich0: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000 cmd 0000ea17
GEOM: ada1: corrupt or invalid GPT detected.
GEOM: ada1: GPT rejected -- may not be recoverable.
Limiting open port RST response from 592 to 200 packets/sec
And:
Code:
        NAME                     STATE     READ WRITE CKSUM
        storage                  DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            8235960250751855291  REMOVED      0     0     0  was /dev/ada0
            ada1                 ONLINE       0     0     0
            ada2                 ONLINE       0     0     0
 
The GPT errors make me think that ZFS is overwriting GPT sectors, specifically the primary table at the beginning of the disk; it would be a different error if the secondary GPT table at the end of the disk were being overwritten. Why ZFS would do that probably comes down to the initial setup: the pool was created on the whole disk rather than on a GPT partition.
 
Okay. So I will consider the GPT errors unrelated to the current problem, but it might then be a good idea to back up and recreate the pool without GPT partitions later?
 
That could be the source of the problem. GEOM would protect the GPT tables and prevent writes to those sectors if they were in use (usually if a partition is mounted). It seems unlikely that GEOM would rewrite those blocks if ZFS overwrote them, but maybe ZFS is getting the GPT data back instead of the data it wrote.
 
Okay, I will try to fix that then. So if I have three disks in my pool, how would I go about fixing that problem without damaging my data? Can I remove one disk from the RAID, get rid of the GEOM partitions, and re-add the disk?
 
Step 0: make a full backup on some other media. There is no safe alternative.
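If the backup target can hold a ZFS filesystem, a recursive snapshot plus zfs send is one way to do it. Pool and target names here are only placeholders:

Code:
# zfs snapshot -r storage@backup
# zfs send -R storage@backup | zfs receive -F backuppool/storage
# zfs send -R storage@backup > /mnt/usb/storage.zfs    # or dump the stream to a file on other media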

If you want GPT partition tables on the drives, it's not a big deal; just refer to the partitions when allocating space for ZFS (or anything else). In other words, with a single GPT partition that covers the whole drive, use ada0p1 rather than just ada0.
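A rough sketch of what that looks like when the pool is recreated from the backup; device names are taken from your dmesg:

Code:
# gpart create -s gpt ada0         # new GPT table on the bare drive
# gpart add -t freebsd-zfs ada0    # one partition covering the drive -> ada0p1
# (repeat for ada1 and ada2)
# zpool create storage raidz ada0p1 ada1p1 ada2p1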
 
Alright, I will try that then :) If I don't want GPT partition tables on the disks, what do I have to do? The three disks are used exclusively for the storage RAID, so I don't see any point in having partitions.
 
Use [cmd=]gpart destroy -F[/cmd] to remove the GPT partition tables. Then the drives are effectively blank, and can be referred to just by device name.
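That is, one per drive, assuming the disks still show up as ada0 through ada2:

Code:
# gpart destroy -F ada0
# gpart destroy -F ada1
# gpart destroy -F ada2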
 
Okay, time for an update. I have taken a backup so I can experiment some more without worrying about data loss. Scrapping the GPT partition tables and recreating the zpool got rid of the GEOM errors, but it has not solved the original problem.

Code:
ahcich2: Timeout on slot 18 port 0
ahcich2: is 00000000 cs 0ffc0000 ss 0ffc0000 rs 0ffc0000 tfd 80 serr 00000000 cmd 0000f217
(ada2:ahcich2:0:0:0): removing device entry
Closed disk ada2 -> 6
ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich2: Poll timeout on slot 27 port 0
ahcich2: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0000fb17
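
For reference, the pool is now just the three bare devices, recreated roughly like this:

Code:
# zpool create storage raidz ada0 ada1 ada2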
 
UPDATES:
Okay, so I submitted a warranty return for the three disks and have received one back so far, with the other two on their way. I actually lucked out on this one, because I am getting these disks in return:

Western Digital AV-GP 2TB, SATA 2.0, 1 million hours MTBF, 64MB, 4.5W, 24x7 reliability, whisper quiet.

But anyway, I have made a zpool of the one disk I have received and restored all my data to it. I will add the other two once they arrive. So far so good: no errors or anything. I don't want to call it a success just yet, as I want to wait and see what happens once I add the other two.

(fingers crossed, knock on wood, etc...)
 