I am using three SATA SSDs as raidz1 storage. I bought QLC SSDs, namely "Patriot Burst Elite 1920GB". I am fully aware that QLC drives do not work very well with ZFS (buying them was actually a mistake), so I basically use them as slow storage for personal data where I do not need the performance: pictures from my camera, some web articles and ebooks, backups of my phone. After a write operation, such as copying 2 gigabytes of photos onto this raidz1, I experience a slowdown, and I am fine with that. However, yesterday I lost data. Copying seemed exceptionally slow, so I started to investigate. The zpool seemed to be fine, with all three vdevs ONLINE.
But I noticed the following messages in dmesg:
ahcich8: Timeout on slot 14 port 0
ahcich8: is 00000000 cs 00008000 ss 0000c000 rs 0000c000 tfd 40 serr 00000000 cmd 0060ce17
(ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 df 9d 40 c5 00 00 01 00 00
(ada4:ahcich8:0:0:0): CAM status: Command timeout
(ada4:ahcich8:0:0:0): Retrying command, 3 more tries remain
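To see whether only ada4 is affected, the dmesg lines above could be tallied per device. This is just a rough sketch in Python, not anything FreeBSD ships: the function name `count_timeouts` and the regular expression are my own assumptions, modelled only on the lines quoted above, so other CAM message formats may well not match.

```python
import re
from collections import Counter

# Pattern modelled only on the CAM status line quoted above, e.g.
# "(ada4:ahcich8:0:0:0): CAM status: Command timeout"
CAM_STATUS = re.compile(
    r"^\((?P<dev>[a-z]+\d+):(?P<chan>[a-z]+\d+):[\d:]+\): "
    r"CAM status: (?P<status>.+)$"
)


def count_timeouts(dmesg_text):
    """Count 'Command timeout' CAM status lines per device name."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_STATUS.match(line.strip())
        if match and match.group("status").strip() == "Command timeout":
            counts[match.group("dev")] += 1
    return counts


# The five lines from the dmesg output above, used as sample input.
SAMPLE = """\
ahcich8: Timeout on slot 14 port 0
ahcich8: is 00000000 cs 00008000 ss 0000c000 rs 0000c000 tfd 40 serr 00000000 cmd 0060ce17
(ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 df 9d 40 c5 00 00 01 00 00
(ada4:ahcich8:0:0:0): CAM status: Command timeout
(ada4:ahcich8:0:0:0): Retrying command, 3 more tries remain
"""

if __name__ == "__main__":
    for device, n in count_timeouts(SAMPLE).items():
        print(device, n)
```

In practice the text would come from saved dmesg output rather than the hard-coded sample.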
The copy job finished. I verified the written data, but only a small subset: I opened maybe five of the 50 photos and viewed them, and everything seemed fine. As usual, I then took a snapshot. That process hung for many minutes. I tried to rsync the new data to my backup storage, but after the first file was processed, rsync stalled as well. I kept checking zpool status and dmesg to see whether anything had changed or additional problems were being reported. That was not the case. After roughly 45 minutes, my system became unusable (the operating system and my home directory are on a different zpool), so I logged in via ssh and looked for further information: no additional errors, and the zpool was still reported as online (both the zfs snap and the rsync commands were still running but made no progress). I tried to take the vdev in question offline, but that command did not finish either. I waited for maybe another 10 to 15 minutes and rebooted the machine; the reboot hung, so I did a hard reset. Then I checked the cables of the disks and switched the system on again. It booted and the zpool was there again.
The snapshot I had tried to create after copying the data to the dataset did not exist. To my astonishment, of the 2 folders, each containing roughly 25 photos, only one folder with about 10 photos had been saved on the dataset (not a single file I had verified before was there!), while the zpool still reported that everything was fine. Considering how long I waited, I would have expected some information about the errors from either zpool status or dmesg. Luckily, I could use testdisk to recover the deleted files from the exFAT SD card.
What is your opinion on this? Should I file a bug report? Could I try to reproduce this behaviour using bhyve and somehow simulate disk timeouts to help debug this (I lack the knowledge to really debug this low-level stuff...)?