Help with CAM DEBUG / troubleshooting

Hello community!
This is my first post, so please correct me if I make any mistakes or violate the guidelines.

I'm a FreeNAS (FreeBSD 11.0-STABLE) user and I'm struggling with a CAM Timeout problem.
I've already posted here:

https://forums.freenas.org/index.ph...olve-cam-status-command-timeout-issues.59549/

I would like to ask you, the FreeBSD community, as well, to help me get as deep as possible into this problem.
I've even asked Seagate support; they say something like: no SMART failures, not our problem.

After some RTFM for CAM debugging, I'm a bit confused.
Has anyone ever tried to debug CAM? What kind of result/log should I expect?
Are there other ways to understand what's happening?
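
For reference, this is roughly what I pieced together from camcontrol(8) so far; I don't even know whether the FreeNAS kernel is built with the CAMDEBUG option, so please treat it as an untested sketch rather than something I know works:

    # find the bus:target:lun of each disk first
    camcontrol devlist -v

    # enable informational + CDB tracing for everything, then watch /var/log/messages
    camcontrol debug -I -c all

    # turn it off again, it gets very noisy
    camcontrol debug off

Is this the right direction at all?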

Many thanks for your time.
 
Have you considered power issues? If the PSU is barely holding up, this can cause drives to go offline at random. And by barely holding up I mean running close to its maximum power output.
 
These are SMR drives. According to Seagate, they are not suitable for NAS applications. What's worse: you are using them in a RAID group (your zpool status shows that they are mirrored), which is a really bad idea with drives that have a complex internal structure (like SMR), because it leads to correlated failures. In a nutshell, these are the wrong disk drives. Use them for the non-redundant archival workloads they were intended for, not for a RAIDed NAS workload. If you insist on using them this way, at a minimum your performance will be bad and in particular highly variable; at worst you could have data reliability problems.

I have two theories of what is going wrong here. Theory 1: it is a simple hardware problem causing communication problems. If the problem affected only one drive, I would suspect cables. But it seems to affect all four, so it must be something common to all four, which makes the power supply the most likely suspect. A static voltage measurement with a DVM is not useful, since the problem could be short power dropouts and spikes. I would simply start by swapping the power supply for another one and see what happens. But the whole power-supply theory is unlikely: your 650 W supply should be more than sufficient to handle one motherboard and four disk drives; the only way the power supply could cause these correlated problems is if it is simply defective.
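
One cheap data point for Theory 1, before you swap any hardware: the drives keep signal-integrity counters in SMART, and a UDMA CRC error count that climbs on several drives at once points at something shared, like cabling, the backplane or the power supply, rather than at the media. Something along these lines should show it (smartmontools ships with FreeNAS; adjust the device names, your disks may appear as daN instead of adaN depending on the controller):

    # attribute 199 (UDMA_CRC_Error_Count) and 188 (Command_Timeout) are the
    # interesting ones; rising numbers across all four drives suggest a common cause
    for d in ada0 ada1 ada2 ada3; do
        echo "=== $d ==="
        smartctl -A /dev/$d | grep -i -e crc -e command_timeout
    done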

Theory 2 is much nastier. These are SMR drives, so they have an internal translation layer. It could be that, due to that layer, some SATA commands take so long that the FreeBSD SATA driver declares a timeout. Why would some commands take so long? Good question, and all I can offer is an educated guess: if, due to internal fragmentation, the translation layer has to do a reorganization or defragmentation pass while servicing a specific command, that command will take much longer. If that's true, then the issue is a fundamental incompatibility between the NAS workload, these drives, and the FreeBSD kernel. This will be very hard to debug; my suggestion would be to get a SATA analyzer, log all commands and responses in the analyzer, and whenever the problem happens, search the analyzer's records for hints. That would cost tens of thousands of dollars, plus weeks of work for a skilled SATA engineer.
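
A much cheaper experiment than a SATA analyzer, if Theory 2 is what you want to test first: FreeBSD exposes the ATA-disk command timeout and retry count as tunables, so you could raise them far beyond the default (30 seconds, if I remember correctly) and see whether the CAM timeouts disappear or merely become rarer. A rough sketch, assuming the disks attach as ada(4) devices (camcontrol devlist will tell you); this is a diagnostic, not a fix:

    # raise the per-command timeout well beyond any plausible SMR housekeeping stall
    sysctl kern.cam.ada.default_timeout=300
    sysctl kern.cam.ada.retry_count=10

    # to keep the settings across reboots, put the same values in /boot/loader.conf

If the errors stop completely, that strongly suggests the drives are simply stalling longer than the driver is willing to wait, which brings you straight back to the "wrong drives for this workload" conclusion.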

Personally, I would try two things: replace all the drives with different models, and, separately, replace FreeNAS with Linux. See which helps.

P.S. For fun, I googled for a review of these drives and found one: http://www.storagereview.com/seagate_archive_hdd_review_8tb They claim that they saw write IO latencies of up to 212 seconds under an intense but sane workload (queue depth 16). That should trigger a timeout on any operating system. If that is indeed true, these drives are only suitable for carefully managed archive workloads with low write traffic, and in particular not suitable for any RAID implementation that requires resilvering or rebuilds. What's even worse is this: one drive gets an IO error. That causes ZFS to start a rebuild. That causes a write-intensive workload, which drives latency up, which causes all drives to get timeouts, which causes ZFS to start even more rebuilding. If that's what is happening, this will not end well.
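
If you want to see whether that feedback loop is actually running, watch the resilver state and the per-disk latency side by side; both tools are already on the box, and a sketch would be:

    # is a resilver in progress, and is it actually making headway?
    zpool status -v

    # live per-disk service times; ms/w climbing into the hundreds or thousands
    # while the resilver runs is exactly the stall behaviour described above
    gstat -p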
 