Random disks dropping out of ZFS pools

Hello,

I have moved a backup server running Bacula from Solaris 11 Express to FreeBSD 9.0, using the same hardware, but I'm having issues with ZFS since the switch.

My problem is that random disks in my 2 ZFS pools get disconnected from time to time, and the pools lose redundancy. A simple reboot and the pool resilvers without problems. Since I use single-disk redundancy, this can potentially be bad news if I don't spot the problem quickly.

My hardware setup is a 4-year-old Core 2 Duo machine with 4 GB of RAM, 12 HDDs and 1 SSD used as L2ARC. The drives are connected to the internal SATA ports as well as to an LSI Logic 8-port SAS/SATA PCIe card.

This is the current pool config:
Code:
        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            da6     ONLINE       0     0     0
        cache
          da7p2     ONLINE       0     0     0

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
        cache
          da7p1        ONLINE       0     0     0

The only thing this server does is run the Bacula backup software, backing up 4 other machines; it also has an rsync job mirroring some VM images to a deduplicated ZFS filesystem. The deduplicated part consumes about 200 GB, and the non-deduplicated Bacula files consume around 9 TB.

I know I'm on the low side when it comes to memory, but I hope the SSD L2ARC will help with the dedup tables, and I don't think the OS should lose disks even if I run low on memory.
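
If it helps, I can post the dedup table size; as far as I understand, these commands should show it (datapool is my pool name):

Code:
# Summary of the dedup table (entries, size on disk and in core)
zpool status -D datapool
# More detailed DDT histogram; can take a while on a large pool
zdb -DD datapool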

Since this is my first FreeBSD installation, I'm not sure how to debug this; I have mostly used Linux and Solaris before, but I don't want to go back to Solaris just to fix this.

Any help would be appreciated.
 
Hi LasseKongo,

I had the same thing on my FreeBSD 9 amd64 machine when I installed two new WDC WD2002FAEX-007BA0 drives on a Marvell 88SX7042 controller.

I am not a hundred percent sure of the cause, but I did upgrade to RC3 and put them on a different controller; after that, the disk dropping stopped.

Another thing to look at is smartctl, which can tell you whether power management etc. is enabled.
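
For example (smartmontools is in the sysutils/smartmontools port; the device name is just an example):

Code:
# Drive identity and firmware; newer smartmontools also reports APM status here
smartctl -i /dev/da3
# Full SMART report: health, attributes, error and self-test logs
smartctl -a /dev/da3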
 
LasseKongo said:
I have moved a backup server running Bacula from Solaris 11 Express to FreeBSD 9.0, using the same hardware, but I'm having issues with ZFS since the switch.

My problem is that random disks in my 2 ZFS pools get disconnected from time to time, and the pools lose redundancy.
Are there any console messages logged by the kernel about the disks? Are the drops limited to the onboard ports or the LSI ports?
 
Terry_Kennedy said:
Are there any console messages logged by the kernel about the disks? Are the drops limited to the onboard ports or the LSI ports?

Yes, there were console messages like this:

Code:
Nov  6 07:43:26 backup kernel: (da3:mpt0:0:3:0): lost device - 0 outstanding
Nov  6 07:43:30 backup kernel: (da3:mpt0:0:3:0): removing device entry
Nov  6 07:43:31 backup kernel: da3 at mpt0 bus 0 scbus0 target 3 lun 0
Nov  6 07:43:31 backup kernel: da3: <ATA SAMSUNG HD154UI 1118> Fixed Direct Access SCSI-5 device 
Nov  6 07:43:31 backup kernel: da3: 300.000MB/s transfers
Nov  6 07:43:31 backup kernel: da3: Command Queueing enabled
Nov  6 07:43:31 backup kernel: da3: 1430799MB (2930277168 512 byte sectors: 255H 63S/T 182401C)


I have lost disks on both the internal and the LSI controller. I actually replaced disks the first few times it happened, but when I ran diagnostics on them in another machine they were OK.
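
Side question: since the log above shows the drive reattaching on its own as da3, am I right that something like the following should bring it back into the pool and resilver without a full reboot? I have not tried it yet.

Code:
# See which device is marked REMOVED/UNAVAIL
zpool status datapool
# Bring the reattached device back online; ZFS should resilver it
zpool online datapool da3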
 
gkontos said:
I think that an upgrade to 9.1-RC3 will solve those issues.

I have been thinking about that, but I think I will wait for the 9.1 release, which I believe should be out soon. Do you have any links to information about my problem and its eventual solution in 9.1?

Slightly unrelated to my problem, but should I expect problems with a 9.1 upgrade when booting from a ZFS mirror? The way I installed it was not supported by the FreeBSD installer and required quite a bit of manual configuration; will freebsd-update retain this configuration? Since I'm a total newbie to FreeBSD, I have never used freebsd-update and am unsure how it works.
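
From what I can tell from the Handbook, the binary upgrade would go roughly like this, but I don't know how well it copes with a hand-rolled ZFS-on-root mirror, hence the question:

Code:
# Fetch and stage the upgrade to the target release
freebsd-update -r 9.1-RELEASE upgrade
freebsd-update install
# Reboot into the new kernel, then run install again to finish the userland
shutdown -r now
freebsd-update install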

I really like the addition of ZFS boot environments in 9.1; that worked great in Solaris and made it less risky to upgrade the OS.
 
LasseKongo said:
I have been thinking about that, but I think I will wait for the 9.1 release, which I believe should be out soon. Do you have any links to information about my problem and its eventual solution in 9.1?

I am not sure exactly which LSI model you are using, but I had issues in FreeBSD 9.0-RELEASE with some controllers until the new driver was MFC'd to 9-STABLE.

LasseKongo said:
Slightly unrelated to my problem, but should I expect problems with a 9.1 upgrade when booting from a ZFS mirror? The way I installed it was not supported by the FreeBSD installer and required quite a bit of manual configuration; will freebsd-update retain this configuration? Since I'm a total newbie to FreeBSD, I have never used freebsd-update and am unsure how it works.

To be honest, I have never used freebsd-update either, and I am not a newbie. I don't think you would have any issues, but without knowing how you installed I can't be certain.

LasseKongo said:
I really like the addition of ZFS boot environments in 9.1; that worked great in Solaris and made it less risky to upgrade the OS.

I don't think this is a particular feature of 9.1. But again I might be wrong.
 
throAU said:
Possibly a silly question - but how is your power supply for all those drives?

Not silly at all; I know from experience that a weak PSU can cause exactly this kind of problem. I had a Linux machine with these symptoms a few years back and solved it by upgrading the PSU.
In this box I have a 650 W PSU, which I think should be enough.
 
650 W PSU for 12-ish drives? Seems a little under-powered to me. All our boxes with 12+ drive bays have 900+ W PSUs.

Also, what firmware/driver revision are you running on your mpt(4) controllers?
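
If you're not sure, the mpt(4) driver prints version information when it attaches, so something like this should show it (assuming the controller is unit 0):

Code:
# Driver attach messages include controller and version details
dmesg | grep -i mpt
# Device description as reported by the driver
sysctl dev.mpt.0.%desc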
 
I think it depends on what else you have in the box... If you're able to cripple a 650 W PSU using 12 drives on a single-CPU board, then I'd say the PSU is defective or really poor quality... regular drives don't draw that much power apart from the startup sequence (so a controller/drive setting that can do a staggered spin-up sequence is always nice to have, of course).
 
phoenix said:
650 W PSU for 12-ish drives? Seems a little under-powered to me. All our boxes with 12+ drive bays have 900+ W PSUs.

Also, what firmware/driver revision are you running on your mpt(4) controllers?


I don't think it is a problem; the box consumes around 160 W when powered up, and the 5400/7200 RPM SATA drives are pretty low power. I think I would have noticed problems during power-up if the PSU was not up to it.

I don't know the FW version, but I flashed it some time back, and it is a fairly old controller chip (LSI 1068), so I'm not sure there is any newer FW out there.
Since I have lost drives on the internal controller as well as on the LSI card, I suspect something in FreeBSD or some faulty hardware, like the motherboard.
 
LasseKongo said:
I don't think it is a problem; the box consumes around 160 W when powered up,...

@offtopic:
How can you read out the power consumption of the running system?
Are there any sysctl flags?
 
lockdoc said:
@offtopic:
How can you read out the power consumption of the running system?
Are there any sysctl flags?

I simply used a power meter in the wall socket. It would be really nice to have it through the OS though, but since I'm a noob with FreeBSD I don't know if that is possible.
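
From what I have read you would need hardware support for that, e.g. a board with a BMC; in that case something like ipmitool (sysutils/ipmitool) can sometimes read power or PSU sensors, but my old desktop board has nothing of the sort:

Code:
# Only applies if the board has a BMC; load the IPMI driver first
kldload ipmi
# List sensor data records; power/PSU readings appear here if the BMC exposes them
ipmitool sdr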
 
Just a small update on my problem. I decided to update to 9.1-RC3 about 2 weeks ago; the update went smoothly, but yesterday I lost a disk again, so the update didn't seem to cure the problem.
 
Sfynx said:
I think it depends on what else you have in the box... If you're able to cripple a 650 W PSU using 12 drives on a single-CPU board, then I'd say the PSU is defective or really poor quality... regular drives don't draw that much power apart from the startup sequence (so a controller/drive setting that can do a staggered spin-up sequence is always nice to have, of course).


12 is too much for 650W.
 
zero said:
12 is too much for 650W.

Twelve 7200 RPM drives draw around 120 watts or something when in full operation? How would that be a big problem for a 650 W PSU, provided that you do not put them all on the same wire and use a staggered spin-up boot sequence to flatten the power-up load? My 12-disk server draws a lot less power than my average desktop machine, which still uses a 550 W PSU. Not having a fat graphics card in a server also helps a lot there.

It is a whole different story if it's not a uniprocessor machine, because a couple of fully loaded CPUs can use quite some juice, but I cannot get my single-Xeon file server to cross the 300 W mark even when trying really hard.
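
Rough numbers, assuming typical 3.5" datasheet figures (actual values vary per model):

Code:
12 drives x ~8 W  (7200 RPM, read/write)     = ~96 W steady state
12 drives x ~25 W (12 V rail, spin-up peak)  = ~300 W if all spin up at once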
 
LasseKongo said:
Just a small update on my problem. I decided to update to 9.1RC3 about 2 weeks ago, the update went smooth, but yesterday I lost a disk again, so the update didn´t seem to cure the problem.

Is the controller still under warranty?
 
gkontos said:
Is the controller still under warranty?

No, the warranty has expired. However, I don't think it is a problem with the controller itself, since it worked just fine in Solaris for about a year. I also have exactly the same controller model in another system running Linux (no ZFS), and it has worked fine for several years.
 
Sfynx said:
Twelve 7200 RPM drives draw around 120 watts or something when in full operation? How would that be a big problem for a 650 W PSU, provided that you do not put them all on the same wire and use a staggered spin-up boot sequence to flatten the power-up load? My 12-disk server draws a lot less power than my average desktop machine, which still uses a 550 W PSU. Not having a fat graphics card in a server also helps a lot there.

It is a whole different story if it's not a uniprocessor machine, because a couple of fully loaded CPUs can use quite some juice, but I cannot get my single-Xeon file server to cross the 300 W mark even when trying really hard.


I agree; as I mentioned earlier, the server draws around 160 W from the wall socket when idling, so 650 W should give plenty of headroom. I have also made sure not to connect too many disks to the same cable from the PSU; doing that can cause disks to drop out, as I know from first-hand experience.
 
Many power supplies are hilariously overrated, or just plain bad. I've resolved to only buy Seasonic now.

Assuming 80% efficiency, 160 W from the wall is only 128 W actual use. Recent hard drives are between 5 and 10 W each.
 