ZFS CCB request was invalid

Hello,

I'm building a new machine as a home server. I am consistently getting the following error, which makes it impossible to keep any storage pool stable. It happens with any disk on any SATA port after writing at most a few gigabytes of data.
It happens with a 3-disk RAID-Z as well as with just a single disk in striped mode. I created fresh pools for each test, over and over again.

Code:
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68 e4 0a 40 1c 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 f8 93 4b 40 01 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
(ada0:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 48 cf b8 40 01 00 00 01 00 00
(ada0:ahcich5:0:0:0): CAM status: CCB request was invalid
(ada0:ahcich5:0:0:0): Error 22, Unretryable error
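For what it's worth, the ACB bytes in those messages can be decoded. WRITE_FPDMA_QUEUED (opcode 0x61) is an NCQ write, and as far as I understand the CAM ATA code the register order in the printout is command, features, lba_low/mid/high, device, the three _exp bytes, features_exp, sector_count, sector_count_exp (that layout is my assumption, not something from the logs themselves). Under that assumption, a small script recovers the LBA and transfer size:

```python
# Sketch, ASSUMING the ACB bytes are printed in this register order:
# command, features, lba_low, lba_mid, lba_high, device,
# lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
# sector_count, sector_count_exp.
# For NCQ commands like WRITE_FPDMA_QUEUED (0x61), the sector count
# lives in features/features_exp and the queue tag in sector_count
# bits 7:3.

def decode_acb(acb_hex: str):
    b = [int(x, 16) for x in acb_hex.split()]
    (cmd, features, lba_low, lba_mid, lba_high, device,
     lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
     sector_count, sector_count_exp) = b
    lba = (lba_low | (lba_mid << 8) | (lba_high << 16)
           | (lba_low_exp << 24) | (lba_mid_exp << 32) | (lba_high_exp << 40))
    nsect = features | (features_exp << 8)   # NCQ: count in features regs
    tag = sector_count >> 3                  # NCQ: tag in sector_count 7:3
    return {"command": cmd, "lba": lba, "sectors": nsect,
            "bytes": nsect * 512, "tag": tag}

# Second ACB line from the log above:
info = decode_acb("61 00 f8 93 4b 40 01 00 00 01 00 00")
print(info)  # 256 sectors = 128 KiB, a typical ZFS record-sized write
```

If that layout is right, the failing requests are ordinary 128 KiB NCQ writes at unremarkable LBAs, so nothing about the commands themselves looks exotic or points at a specific bad sector.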

Hardware Configuration:
- AMD Ryzen 5 3400G
- ASRock X570 Pro4 with BIOS version 3.20 (AMD AGESA Combo-AM4 V2 1.0.8.0 Patch A)
- 32GB G.Skill Fortis DDR4-2400 DIMM CL15 Dual Kit
- Harddrives (connected to onboard SATA controller)
3x brand-new 8 TB WD Red Plus WD80EFAX, plus an older 500 GB Samsung HD502HJ for testing
- old Crucial M4 SATA SSD as boot device (no problems with this drive)

I already ruled out the following possible reasons for the issue:
- operating system: it happens on FreeNAS 11.3-U5 as well as TrueNAS CORE 12.0-RELEASE
- RAM was tested with multiple passes of memtest86 without errors
- changed SATA cables; tried multiple new ones and one from another working machine
- happens with all 3 new hard drives and with an older one that I know works fine. SMART readings for all drives are fine, and I doubt that all 4 drives are failing at the same time. To be sure, I also ran a long SMART test on one of the new drives, which completed without errors.
- swapped the PSU including the cables with one from another working machine
- went through the BIOS many times and tried different explicit and AUTO settings for the onboard SATA controller. All power saving features are off and it is definitely running in AHCI mode.

By now, I am a bit lost, and the only thing I can think of is that either the mainboard (SATA controller) is broken out of the box or there is some major hardware incompatibility going on. But since the machine is running fine apart from the issue described, I would suspect the board is fine.
Unfortunately, I do not have a spare SATA controller lying around to check on this theory. What do you think? Send in the mainboard? Try a BIOS upgrade (newest BIOS uses AMD AGESA ComboAM4v2 1.1.0.0 Patch C)?

Thank you for any input!
 
You say you get this problem on all SATA disks. Given that there are disks from two different vendors (with very different firmware), the problem is unlikely to be the disks. Individual SATA cables are also highly unlikely. That leaves the things they have in common: power supply, motherboard, and the OS. You have a motherboard based on very new technology (AMD), with a very old FreeBSD version, which is then buried in an OS distribution that FreeBSD people don't even know.

My advice would be: Try installing an up-to-date FreeBSD version, like SirDice said, and report back. This might be hard, if you are relying on FreeNAS/TrueNAS features.
 
Thank you for the feedback so far. I did not realize that they had reached EoL with their FreeBSD base; good to know.
To check on the issue, I installed FreeBSD 12.2-RELEASE and I see the exact same errors. I think this narrows it down to the mainboard, unless you suspect a bug in the OS here.
 
To rule out compatibility issues, I installed Debian Buster and created a similar setup. The syslog is more verbose, and I'm getting unaligned writes and hard resets on the link. I'm pretty sure this is the controller, so I'm sending the mainboard back.
 
You could try with a BIOS upgrade first; it's easy and it rules out another possibility if it doesn't help.
 
I tried upgrading the BIOS as well before swapping the board. That made the errors rarer, but they still happened. Unfortunately, the new board shows the same kind of errors, although it takes a lot more stress to make them appear. With a real write load over a 1 Gbit network, it took about 4 hours/850GB to produce another write error.
I switched to benchmarking with fio, and a combination of about 800GB of sequential writes plus random 4k writes is what it takes to induce these write failures. I used:


Code:
fio --rw=write --name=test --size=800G --filename=testfile
fio --rw=randwrite --name=IOPS-write --bs=4k --direct=1 --filename=iopstest --size=800G --numjobs=4 --iodepth=32 --refill_buffers --group_reporting --runtime=60 --time_based


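As a sanity check on how much stress the network case really was, here is a quick back-of-the-envelope calculation (my own arithmetic, based only on the rough figures above):

```python
# Rough sanity check of the failure rate reported above (approximate
# figures from my own observations): ~850 GB over ~4 hours via a
# gigabit network before the first error appeared.

network_bytes = 850e9                 # ~850 GB written
network_secs = 4 * 3600               # ~4 hours
net_rate = network_bytes / network_secs / 1e6   # sustained MB/s

gbit_ceiling = 1e9 / 8 / 1e6          # ~125 MB/s theoretical 1 Gbit/s limit

print(f"~{net_rate:.0f} MB/s sustained vs ~{gbit_ceiling:.0f} MB/s line rate")
```

That works out to roughly 59 MB/s sustained, about half of gigabit line rate and well below what even a single modern SATA drive can handle, so the errors appear under what should be a completely ordinary workload.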
Altogether, having swapped all the parts, and considering the system's performance and the amount of stress it takes to induce errors, I don't think this can be hardware related. I even mounted the drives rock solid in another case to rule out that too much vibration is the issue. I also got word from ASRock; all they said is that they do not support anything but Windows, so things might not work 100%.

Any other ideas? Could this be platform related, i.e. could the new AMD hardware be the problem? Or is it normal that some write errors happen?
 