CAM errors FreeBSD 11.1+

Hi all.
I've run into a strange problem.
After upgrading to FreeBSD 11.1 or to CURRENT, I get something like this for all disks during boot:
Code:
(ada0:ata2:0:0:0): WRITE_DMA48. ACB: 35 00 50 29 10 40 6c  00 0c 00
(ada0:ata2:0:0:0): CAM status: Command timeout
(ada0:ata2:0:0:0): Retrying command

What was tried so far:
1) Placed the disk, with the same SATA cables, in another PC - boots fine
2) Tried a separate RAID controller on this baseboard - doesn't boot properly
3) Used a separate power supply for the disk - the same errors.
Also, if I boot in single-user mode, mount all ZFS partitions, run something like "find / -name something" and then bring the OS to multiuser state, it boots without errors.
Any thoughts on what can cause this?
 
Hi Wozzeck.Live.
Thanks for quick response.
It was late here when I created this topic, so I forgot to add some details)
Currently the host is
$ uname -a
FreeBSD tower.local 11.0-RELEASE-p11 FreeBSD 11.0-RELEASE-p11 #0 r321493: Wed Jul 26 01:21:58 EEST 2017 root@tower.local:/usr/obj/usr/src/sys/MY amd64

Hardware: Supermicro X8DTN+-F / 6xWD1502FYPS-02W3B0 /2xE5649
The HDDs are connected to the SATA ports on the baseboard.
As for vfs.unmapped_buf_allowed, it is equal to 1 by default.
I will try to update to 11.1 again; maybe the default for this sysctl has changed.
Will post the results.
 
Pity, but
Code:
vfs.unmapped_buf_allowed=0
didn't help.
I have a few thoughts about further testing.
Will try to check.
 
This does look like a low-level problem with SATA communication, which is below the file system. Not clear to me what it would have to do with high-level operations like mapping buffers in the VFS layer (which is the topmost part of the kernel file IO, above the actual file system).

Weird. The error message you quote above says "retrying command". Does this happen for all SATA IOs? Or does it happen occasionally, and when retrying the command the problem goes away (you can tell by looking at the ACB, whether it changes or not)? The part that makes me say it's weird is: It seems implausible that an upgrade to the SATA driver introduced a new bug in low-level IO; that's the kind of thing the developers would have seen very early.
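One way to check whether the ACB changes between retries is to count how often each distinct ACB shows up in the kernel messages. The following is only a rough sketch; the function name count_acbs is made up for illustration, and it assumes the messages have the same layout as the lines quoted earlier in this thread.

```shell
# Hypothetical helper: read kernel messages on stdin and count how often
# each distinct ACB appears. A count above 1 means the same command block
# was logged again unchanged, i.e. it was retried rather than replaced.
count_acbs() {
    grep 'ACB:' |              # keep only the lines that carry an ACB
        sed 's/.*ACB: *//' |   # drop everything up to and including "ACB:"
        tr -s ' ' |            # squeeze repeated spaces so equal ACBs compare equal
        sort | uniq -c | sort -rn
}

# Possible usage on the affected machine, if the errors are still in the
# kernel message buffer:
#   dmesg | count_acbs
```

If one ACB shows up several times in a row, the same command is being retried; if every ACB appears only once, different commands are failing.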
 
ACB stays the same 5 times and then I receiving something like retries exhausted.
It happens for all sata drivers.
Usually such errors appears for some random one driver and then for all other.
It happens only during boot and system isn't really responsible later. Or init will fail or just can't login or something like that, so can't say what happening.
If I add something like "find / -name something" in zfs rc script after mounting, it boots fine but such "workaround" causes huge delays in boot and a little bit ugly)
Totally agree that this issue is really weird. Maybe something wrong on HW layer and causing such issues.
Will try to get suitable power module to check if old make my system crazy and, if not, maybe something wrong with south bridge on baseboard
Also will try particular changes in drivers.
 
(Apologies for correcting your spelling below ... just makes it easier to communicate.)

ACB stays the same 5 times and then I receiving something like retries exhausted.
It happens for all sata drivers.
Usually such errors appears for some random one driver and then for all other.
(You mean SATA drives: A drive is a piece of hardware, while a driver is a piece of software)
What this means is that the problem is consistent enough that retrying the operation doesn't fix it, so it's not just a rare random occurrence, but systematic. On the other hand, it clearly cannot occur for every SATA IO operation, otherwise the system wouldn't get this far. Weird.

It happens only during boot and system isn't really responsible later. Or init will fail or just can't login or something like that, so can't say what happening.
(You mean the system isn't responsive, meaning it does not respond. Not responsible would mean that the system can't be blamed for what happened.)
That makes sense: If a disk IO completely fails (see above), then some important OS operation (like starting init, or starting getty, or ...) will not work, and the system will not work.

If I add something like "find / -name something" in zfs rc script after mounting, it boots fine but such "workaround" causes huge delays in boot and a little bit ugly)
Even weirder: If the disk drive has a heavy workload (caused by find), it has *fewer* errors, and therefore the problems don't occur? Usually intense workload makes for *more* errors. But assuming your trick actually works, you could try the following as an emergency measure to keep the system going: Instead of "find ..." (which causes a big delay), put the find command in the background, and let it run a little slower: "nice find ... &". Maybe that trick is a compromise that works a little better.
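A minimal sketch of that background variant, for whoever wants to try it. The function name warm_cache is hypothetical, and whether a background traversal still avoids the timeouts is an untested assumption.

```shell
# Hypothetical helper: traverse a directory tree in the background at
# reduced priority, so the boot sequence does not wait for it to finish.
warm_cache() {
    # "$1" is the directory to walk. The output is discarded because only
    # the disk reads matter here, not the search results.
    nice find "$1" -name something >/dev/null 2>&1 &
}

# Possible usage, placed after the ZFS file systems are mounted:
#   warm_cache /
```

Because the find runs in the background, nothing waits for it; if the errors come back, the foreground version may be the only one that helps.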

Totally agree that this issue is really weird. Maybe something wrong on HW layer and causing such issues. Will try to get suitable power module to check ...
Checking the hardware is a good idea (it is called a power supply, not power module). On the other hand, you said above that the problem started after the upgrade to FreeBSD 11.1, so perhaps the problem really is caused by the operating system. Perhaps your hardware is a little unhealthy, and in some fashion 11.1 is more unforgiving. Personally, I find a problem in the south bridge chip unlikely, while a problem with the power supply is possible (although still a weird coincidence that it would fail right when you do the OS upgrade).

As an alternative attempt to debug this: How difficult would it be to boot your system from an older FreeBSD install CD (some old version that worked correctly for you), and then see whether access to the disks works correctly?

Good luck!
 
Sorry for my English, it's far from perfect)
I have ZFS snapshots for 11.0.
After rolling back to 11.0, everything works just fine. After upgrading to 11.1 (no difference whether via freebsd-update or from sources), the issue happens again.
Tried multiple times)
As for find, I believe it helps because of caching at the OS level, which reduces the number of read operations later. One additional point: there are no errors at all during the find. That's why I think there is something wrong with the power supply, because writing consumes more power than reading.
 
Indeed, your theory with power supply and read versus write sounds possible. Go try it.

Still, it would be very weird (amusing?) if the same system with a "somewhat bad" power supply functioned under 11.0, and not under 11.1. Amazing what large effects a minor software change can have.
 
I had something very similar when I first tried FreeBSD and had to select option 3 on the boot menu and enter

set hint.ahci.0.msi=0
boot

in order to get a clean boot. After that I added
hint.ahci.0.msi="0"
to /boot/device.hints and I've not had any problems since.
 
Sadly, the new power supply hasn't solved the issue.
Neither did hint.ahci.0.msi. Besides, the disks are in IDE mode, so hint.ahci.0.msi="0" will probably have no effect.
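For anyone who wants to check which controller driver their disks are attached through: CAM prefixes its messages with (device:controller:bus:target:lun), so the controller name can be pulled out of the error lines. This is just a sketch, and the function name cam_controllers is made up for illustration.

```shell
# Hypothetical helper: read CAM messages on stdin and print each distinct
# "device controller" pair, taken from the (device:controller:bus:target:lun)
# prefix at the start of the line.
cam_controllers() {
    sed -n 's/^(\([a-z]*[0-9]*\):\([a-z]*[0-9]*\):[0-9]*:[0-9]*:[0-9]*).*/\1 \2/p' |
        sort -u
}

# Possible usage:
#   dmesg | cam_controllers
```

For the messages in the first post this gives "ada0 ata2", which suggests the disk is handled by the ata driver and not by ahci, so ahci hints would presumably not apply to it.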
 