Enclosure removes disks in pool - CAM status: CCB request completed with an error

antperval

New Member

Reaction score: 1
Messages: 12

Hi all,

We are experiencing a really strange issue that is driving us crazy and that we have never seen before.

We use a Supermicro enclosure with an LSI SAS3008 SAS/SATA controller and FreeBSD 11.0. We know this version has reached its EOL date and we are working on moving off it. Nevertheless, the issue we are experiencing is quite strange and we are not sure it is directly related to that, as this is not expected behaviour even on a relatively old version.

We created 3 pools of 12 disks (6 TB each) in RAIDZ2 and launched a couple of robocopy jobs against one of the pools (/MT:10). We are using the OPT1 interface, which is a failover between 10Gb Ethernet and 10Gb FC, but the data is being transferred over the FC interface. Each copy should transfer 18 TB of data, filling the pool with 36 TB. Initially the copy seemed to be working but, after a while, we began to get read/write errors in zpool status (even CKSUM). Once these errors appeared the copy process continued, but after some hours disks began to be removed from the pool and the copies failed. Not just one or two disks: most of them were removed, causing pool degradation and finally data loss.

Reviewing the system log, we found tons of CAM error messages for each disk (retries exhausted, write operation finished with errors...). Our first thought was that the issue pointed to the disks used in the enclosures (WD60EFAX), as they do not look like the best choice for ZFS. Strangely, running smartctl against the disks showed that parameters such as Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable (attributes 5, 197, 198...) were all 0, so the disks did not appear to be really damaged. But it is a fact that the disks do not work properly, because if we boot the enclosure from a live CD we see the same CAM errors.
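For reference, the three attributes mentioned can be pulled out of a `smartctl -A` table with a one-line awk filter. A minimal sketch, using a made-up sample table in place of a real run (the attribute rows and values below are illustrative, not from the attached smartctl-da3.txt):

```shell
# Extract the reallocation-related SMART attributes (IDs 5, 197, 198)
# from a smartctl -A style table; the sample stands in for a real run.
smart_sample='
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
'
# In real use: smartctl -A /dev/da3 | awk '$1==5 || $1==197 || $1==198 {print $2, $NF}'
printf '%s\n' "$smart_sample" | awk '$1==5 || $1==197 || $1==198 {print $2, $NF}'
# prints:
#   Reallocated_Sector_Ct 0
#   Current_Pending_Sector 0
#   Offline_Uncorrectable 0
```

All-zero raw values here, as in the OP's case, point away from media damage and toward the transport layer.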

After replacing the WD disks with Hitachi ones, a new 12-disk RAIDZ2 pool was configured and the processes were relaunched. Unfortunately, we are experiencing more or less the same issue. The enclosure does not remove the disks and the pool still stands, but zpool status shows read/write errors after 30 hours. The copies are still running.

This behaviour has been observed not in just one enclosure but in four. They have exactly the same hardware and firmware and are located in different data centres, so we have discarded a hardware failure (such as a faulty cable or electrical component), as we don't believe there are four faulty components across different enclosures. Temperature issues have also been discarded, as the data centre temperature is fine and the fans have been forced to run at 100%.

We are attaching some log files with the errors on one of the (new) disks, the current zpool status and dmesg to help understand the issue.

Our initial thought was that the issue could be related to heavy IO over the FC link exceeding the hard drives' capacity (their 6 Gb/s interface), stressing the disks to their limit. We could expect the copy to get slow, or the enclosure to stop responding to other requests, but not disks failing or being removed outright.

We are working with the enclosure provider to try to fix the issue, but we have opened a thread here in case anyone else has experienced this and can provide extra information that may help us understand the behaviour.

Thank you very much in advance.
 

Attachments

  • smartctl-da3.txt
    10.5 KB
  • zpoolstatus.png
    13.6 KB
  • dmesg.txt
    95.9 KB

covacat

Well-Known Member

Reaction score: 171
Messages: 366

probably not much help here but
i have an intel (SRCSASBB8I) branded lsi sas/sata which i was using in a raid5 config in an ubuntu box
box was mostly idle and used for storage backup
from time to time one disk was removed from the raid and the raid degraded because of various timeout errors
the rejected "bad" disks were perfectly functional and we traced the problem to the fact that the hba was picky about the disks
somehow if a disk write was taking longer than the hba expected it was marked as failed
in the end i found out that not all sata drives were recommended for use on that HBA, and the "good" ones had to have
something in the firmware that the common ones didn't (something related to the max time a write op may take, IIRC)
in the end i discarded the controller and hooked the drives directly to onboard sata ports + SW raid
this was 5 years ago and the drives still work

my errors/problems were something like
 

T-Daemon

Daemon

Reaction score: 790
Messages: 1,627

There are these forum threads:

Also have a look at the following PR involving LSI SAS3008 and mpr(4) driver. There are several workarounds mentioned.

The mpr(4) driver has global and per device loader tunables to diagnose further in case the solutions or workarounds are no good.
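For anyone following along, those diagnostics can be enabled from /boot/loader.conf. A sketch, with the debug bit values taken from the mpr(4) man page (0x0004 fault, 0x0008 event, 0x0020 recovery, 0x0040 error); verify the exact bits against the man page shipped with your FreeBSD version:

```shell
# /boot/loader.conf -- verbose mpr(4) diagnostics
# 0x0004 | 0x0008 | 0x0020 | 0x0040 = 0x006C (fault+event+recovery+error)
hw.mpr.debug_level="0x006C"        # global, all mpr(4) controllers
dev.mpr.0.debug_level="0x006C"     # per-device form, first controller only
```

The per-device value is also a sysctl, so it can be raised at runtime (`sysctl dev.mpr.0.debug_level=0x6C`) without a reboot while reproducing the errors.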
 

ralphbsz

Son of Beastie

Reaction score: 2,190
Messages: 3,147

The smartctl output looks perfect.

The dmesg doesn't show any actual disk IO errors (nothing from the head and platter), only communications errors. All errors seem to be caused by bus resets (look at the unit attentions). That means that your LSI card is issuing bus resets. The difficult question is: why is it doing that? Most likely some other low-level communication error (such as a packet timeout) forced the LSI card to retry a command, and it resets the buses first. Only problem is: we don't know what the root-cause errors were.
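A quick way to see that ratio of reset noise to real media errors is to count error classes in the dmesg capture. A sketch using a fabricated sample (device names, CDBs and sense data are made up) in place of the attached dmesg.txt:

```shell
# Count CAM error classes: transport failures vs. reset-induced
# unit attentions. The sample mimics typical mpr(4)/CAM log lines.
dmesg_sample='
(da3:mpr0:0:15:0): READ(10). CDB: 28 00 12 34 56 78 00 00 80 00
(da3:mpr0:0:15:0): CAM status: CCB request completed with an error
(da3:mpr0:0:15:0): Retrying command
(da5:mpr0:0:17:0): UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpr0:0:15:0): Error 5, Retries exhausted
'
# In real use, replace the printf with: dmesg | grep -c ...
printf '%s\n' "$dmesg_sample" | grep -c  'CCB request completed with an error'   # 1
printf '%s\n' "$dmesg_sample" | grep -ci 'unit attention'                        # 1
```

Many "reset occurred" unit attentions with zero medium-error sense data is the signature of the reset behaviour described above, rather than failing platters.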

Usually, I would start by reseating all the power and SAS/SATA cables to the disk drives. But if the same behavior happens on multiple machines, that's not it. More likely it is a firmware problem. You say you have a disk enclosure. That enclosure has a backplane, and on that backplane must be some SAS expander (de-facto multiplexer) chip. Find out who the manufacturer of that chip and its firmware is (SuperMicro can figure that out), and check the firmware on it. Also update the firmware on the LSI card.

Upgrading FreeBSD should be obvious; such problems can be caused (indirectly) by incorrect error handling in the OS too.
 
OP
A

antperval

New Member

Reaction score: 1
Messages: 12

Hi all,

First of all I would like to apologise for not answering in this thread for nearly one month. Thanks everybody for your responses. They have driven us to perform hundreds of tests on the enclosures to try to fix the issue although, unfortunately, we are still experiencing it. We have tried running the tests with the latest FreeBSD version, swapped the controller, checked the backplane and played with different firmware versions, and we are still receiving IO errors. The good news is that we have found a configuration that is more or less stable. After days of copying, creating and stressing the storage, the enclosure does not expel the disks. There are IO errors, but a zpool clear resolves them, the pools remain healthy and there is no loss of performance. The data stays safe.

We are still investigating and testing. I promise an update when the issue is fully solved, so that all our effort (and yours, answering my questions) can help other colleagues on the forum.

Thanks!
 

Terry_Kennedy

Aspiring Daemon

Reaction score: 337
Messages: 968

We are experiencing a really strange issue that is driving us crazy and that we have never seen before.

We are still investigating and testing. I promise an update when the issue is fully solved, so that all our effort (and yours, answering my questions) can help other colleagues on the forum.
If I had to guess, I'd say you were running SATA drives behind a SAS expander backplane and not directly connected to a controller that has a port for each drive.

IMHO, that is a Bad Thing. Read "Technical mumbo-jumbo" below if you want the technical description...

Another possibility is that you are using a controller that is something more than a dumb host adapter - if it has on-board RAID, cache, etc. it can 'fight' with filesystems like ZFS, or at least conceal things that ZFS should know about.

Other more remote possibilities include:
  • Buggy controller firmware. As one example, some years ago, LSI released "Phase 20" firmware for some controllers. It took a number of re-releases until it started working properly again. The problem arose in 20.00.00.00, got mostly fixed by 20.00.04.00 but didn't get completely fixed until 20.00.07.00.
  • OEM-specific controller firmware. If this is an OEM-branded or OEM-built controller, it may have firmware that is subtly different from the generic LSI firmware. There are a number of schools of thought about why OEMs do this, but it does happen.
  • Bad drive cabling. Unlikely, but can happen. Supermicro is pretty good, so look at non-Supermicro parts like the cables between the controller and the backplane.
  • Bad power. As above, Supermicro is pretty good. But they don't prevent you from configuring a chassis with insufficient power to run everything you put in there.
  • FreeBSD bug. Also quite unlikely - I've been running multiple systems with 16 drives on LSI controllers since FreeBSD 8, and other than the occasional bug that hits everybody using that hardware (and gets fixed rapidly) I haven't seen any problems like this.
  • Bad system memory. Exceedingly unlikely, as it is always clobbering the disk subsystem and not anything else.
Frankly, anything other than the first 2 bullets falls into "sacrifice a rubber chicken, take 2 aspirin and call me in the morning" - things that almost NEVER happen, but if I said they never happen, someone would have experienced them.

Lastly (speaking of sacrificing chickens), back in the FreeBSD 10 days I was helping someone well-known in the FreeBSD community who had a half dozen or so servers configured with 250TB of storage or so on 24 ports. Despite multiple replacements of parts and later entire systems, we could never get them to operate reliably. We joked that his systems were "cursed", and he eventually decided to send them all back and buy completely different hardware (everything) from a different vendor. That fixed it. Getting those systems back into production was more important than continuing to troubleshoot (we were at the point of talking about renting SAS protocol analyzers by the time they threw in the towel).

Technical mumbo-jumbo:

SAS controllers use a translation layer called SAT to map SAS commands into the appropriate SATA ones. When you add a SAS expander to the configuration, the expander handles SAT and the controller thinks it is talking to multiple SAS drives on that controller port. Which is fine almost all of the time, except when it isn't. When the controller thinks it is having a problem on that port, it sends a SAS reset request to the expander. Unfortunately, some expander implementations see that, start (virtually) jumping up and down and yelling "Resets for EVERYBODY!!!". That reset affects other drives than the one that (might have) had the initial problem. Which causes the controller to issue another reset. Lather, rinse, repeat until enough drives are kicked out of the array.

Additionally, expanders are "out of sight, out of mind" and vendors often don't make firmware updates available, so you're stuck with whatever firmware came with the expander.

This is why a number of system vendors (I'm mostly familiar with Dell, but I assume HP, Lenovo, etc. do the same thing) either don't allow configurations with SATA drives on expander backplanes, or put the SATA drives in special trays with "interposer" boards that are single-port SAS / SATA converters. That way the expander sees all SAS drives.

You save a lot of money going with Supermicro and doing your own integration and testing, but unfortunately problems like this can (very rarely) arise. I've been doing this for 15+ years and have run into most of the problems along the way. This was my first public report on what I was building, and this is my most recent public report.
 

ralphbsz

Son of Beastie

Reaction score: 2,190
Messages: 3,147

You save a lot of money going with Supermicro and doing your own integration and testing, but unfortunately problems like this can (very rarely) arise.
Allow me to respectfully disagree: Problems like this don't arise "very rarely". They arise way more often than people think.

OK, now serious: I 99% agree with what the distinguished Terry said. In particular, I agree with his technical diagnosis (which is not a diagnosis, but at this point mostly a hunch) that this is likely caused by reset storms.

Building complex (big) storage systems like this takes a lot of effort. You have several layers: The OS driver (in the case of Linux, written or at least supervised by LSI staff, I don't know who is in charge in FreeBSD), the HBA which contains highly complex firmware (see Terry's comments about version 20, that was NOT fun), the SAS expander on the backplane (way more complex firmware, often from a vendor other than LSI, or modified by third parties, LSI ships an SDK that vendors such as Supermicro can "improve", been there done that), and finally the drives, where you have again multiple vendors, all of which implement the standards 99.9% correctly. So you have four layers, written by typically 3-4 companies (and communications within one company are hard, across companies they become darn near impossible), and the vendors can only lab test a very small subset of possible configurations.

Put yourself in the shoes of any component of that hardware. You are trying to communicate with the layers above and below. Something goes wrong, like one of your communications partners doesn't follow the protocol, or you get into a situation that you know is impossible to get into (like you sent a command, and neither got a reply nor a NAK). In a sufficiently large system (I used to build them with ~400 drives per host, and the attendant number of HBAs and enclosures), everything that is unlikely will happen all the time, and everything that's impossible will happen occasionally. What now? There is nothing you can do to make forward progress. So what you do: Ask everyone you know how to reach to please reset themselves, forget everything they thought they knew (some of which must be wrong, otherwise we wouldn't have gotten into an impossible situation), and hope that the IOs that need to be done will be retried. But: The reset storm that this causes disrupts everything, and causes a break in IOs being served, which means that when all the pieces come back up, the system has to work extra hard. If there is something flaky in there, it is likely to break again right away, another reset storm (you can have them happen every millisecond), and after a few retries, things go down.

Here is the right way to build the system: First, simplify. For example, get rid of SATA disks: one more unnecessary complexity of protocol translation; SAS disks only cost a little more. Try to get all the hardware from one vendor. Set up a testing lab, with say 30 or 50 complete systems. Hire a few administrators to manage that lab and set things up. Hire a dozen testers, who write scripts to torture the system (for example pull cables during intense workload, or pull power on redundant supplies). Be super religious about tracking firmware versions, and stay in touch with engineering (get advance warning of firmware changes). Buy a SAS analyzer, and train two EEs on how to use them. Set up weekly phone meetings, with engineering staff from your company, from LSI, the enclosure vendor, and the disk drive vendor. Set up an e-mail bug tracking system. Make sure all the executives of the companies involved are on board. Schedule 3 to 6 months to get the integration done. If you do all this, you can ship a reliable system.
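As a trivial example of the kind of torture script meant here, a write/re-read integrity loop. This sketch targets a temp directory so it is safe to run anywhere; on real hardware you would point TARGET at a dataset on the pool under test, raise the loop count, and flush caches between reads:

```shell
# Minimal integrity-torture sketch: write random chunks, read each one
# back twice, and verify the checksums match. TARGET is a temp dir here
# so the sketch is harmless; point it at the pool under test on real runs.
TARGET=$(mktemp -d)
i=1
while [ "$i" -le 3 ]; do                 # raise the loop count for a real soak test
    dd if=/dev/urandom of="$TARGET/chunk.$i" bs=64k count=16 2>/dev/null
    sum1=$(cksum < "$TARGET/chunk.$i")
    sum2=$(cksum < "$TARGET/chunk.$i")   # second read; flush caches between reads on real runs
    [ "$sum1" = "$sum2" ] || { echo "MISMATCH on chunk.$i"; exit 1; }
    i=$((i + 1))
done
echo "3 chunks verified"
rm -rf "$TARGET"
```

A real harness would run this concurrently from many processes while someone pulls cables and power, which is exactly the sort of abuse that shakes out reset storms before production does.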

This is an area where amateurs or small players will have a very hard time, because the effort required for thorough integration and testing is out of reach. When you buy the (expensive) kit from places like EMC, IBM, NetApp or Hitachi (or indirectly by setting up your workload at AWS, Google or Microsoft), you pay for exactly this kind of thorough testing and integration. If you don't pay for it, you need a lot of luck, patience, and elbow grease.
 

Terry_Kennedy

Aspiring Daemon

Reaction score: 337
Messages: 968

Allow me to respectfully disagree: Problems like this don't arise "very rarely". They arise way more often than people think.
My "very rarely" was based on an (unmentioned, my bad) assumption that people building larger, more complex systems are building them of various name-brand parts from quality manufacturers, purchased new from authorized suppliers, and not built from the junk bin with whatever parts were on hand. I'm sure there is a whole spectrum there, and if you've read my linked posts above you can see why I tend to assume that people are working at the high end, because that's what I do. FreeBSD has users and developers that do everything from "junk bin" systems to custom-engineered systems of their own design, and everything in between.

It is also vitally important to think problems through in a logical manner and resist the urge to "run around and shout". If that's not possible for someone, I agree with you that they shouldn't be acting as their own system integrator.
[snip]
For example, get rid of SATA disks: one more unnecessary complexity of protocol translation; SAS disks only cost a little more.
I agree about the small price difference if you're buying new disks through the white-box distribution channel. If you're buying through retail channels, you may only have a limited number of SATA drive models available. If you can get a retail SAS drive at all it will usually cost a lot more. If buying used through places like eBay, used SAS drives are generally a lot more expensive than used SATA drives, until you get to old models that no longer have any enterprise value. And the SAS drives likely have oddball OEM firmware on them. People may even wind up with T10-PI drives with oddball sector sizes that don't work at all without low-level reformatting.

Remember, only a tiny percentage of drive production moves through the authorized white-box and retail channels. Vast numbers of drives go to OEMs who have special deals with drive manufacturers. Some of those OEM drives get diverted into the white-box channel where they get sold to unsuspecting buyers.
Try to get all the hardware from one vendor. Set up a testing lab, with say 30 or 50 complete systems. Hire a few administrators to manage that lab and set things up. Hire a dozen testers, who write scripts to torture the system (for example pull cables during intense workload, or pull power on redundant supplies). Be super religious about tracking firmware versions, and stay in touch with engineering (get advance warning of firmware changes). Buy a SAS analyzer, and train two EEs on how to use them. Set up weekly phone meetings, with engineering staff from your company, from LSI, the enclosure vendor, and the disk drive vendor. Set up an e-mail bug tracking system. Make sure all the executives of the companies involved are on board. Schedule 3 to 6 months to get the integration done. If you do all this, you can ship a reliable system.
That's probably a bit excessive (and totally impractical) for the FreeBSD user who is building a system or three for their own use. If you're talking about a Netflix-scale entity, then it is possible to get closer to that ideal.
When you buy the (expensive) kit from places like EMC, IBM, NetApp or Hitachi (or indirectly by setting up your workload at AWS, Google and Microsoft), you pay for exactly this kind of thorough testing and integration.
Well, at least you hope you do. It isn't always that easy. I had a ringside seat for the mail.com / EMC bloodbath 20+ years ago, which was an "all hands on deck" exercise that ran 24 hours a day for a few weeks.

Aside from the much higher cost of systems from those places, you run into the "FreeBSD isn't a supported operating system" routine. Generally you get a few specific Windows Server editions and a couple (if you're lucky) Linux distributions that are officially supported.

The era where a manufacturer would routinely send top people out to deal with intractable problems* at any customer site has been gone for many, many years. It isn't practical for manufacturers to do that except for their very largest customers. The best you can hope for is that nearly all of the oddball corner cases have been addressed by a top-tier manufacturer. Unfortunately, most of them have probably been addressed by making many possible combinations of parts un-orderable because the manufacturer never validated that combination of devices. So if you have the money, can get at least something close to exactly what you wanted with some manufacturer's canned configuration, and you're willing to deal with the "Sorry, we don't support FreeBSD - can you reproduce this problem under Windows Server?" runaround, go for it.

Otherwise, start small with a minimum configuration and test thoroughly, then build the full-blown system that you want. As an example, when I was designing the RAIDzilla 2.5, I started with a single HGST He8 SAS drive because I wasn't sure if the LSI controller and FreeBSD would support 4Kn drives. It did, so then I started buying drives by the case.

* I've told the story before, so I'll just do the ultra-condensed version: Years ago, I had a problem with an IBM 370/138 mainframe that all 3 levels of IBM service escalation could not resolve. So IBM sent one of the system's designers out to fix it. At approximately the same time I found a microcode bug in the DEC PDP-11/44 CPU. DEC's answer was pretty much "too bad, so sad". At the time, those were the #1 and #2 computer companies in the world.
 