ZFS lost my data

I am using three SATA SSDs as my raidz1 storage. I bought QLC SSDs, namely "Patriot Burst Elite 1920GB". I am fully aware that QLC drives do not work very well with ZFS (buying them was actually a mistake). So I am basically using them as slow storage for my personal data - pictures from my camera, some web articles and ebooks, backups from my phone - where I do not need the performance. After a write operation, like copying e.g. 2 gigabytes of photos onto this raidz1, I experience a slowdown, and I am fine with that. However, yesterday I lost data. Copying seemed exceptionally slow, so I started to investigate. The zpool seemed to be fine, with all three vdevs ONLINE.

But I noticed the following messages in dmesg:

ahcich8: Timeout on slot 14 port 0
ahcich8: is 00000000 cs 00008000 ss 0000c000 rs 0000c000 tfd 40 serr 00000000 cmd 0060ce17
(ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 df 9d 40 c5 00 00 01 00 00
(ada4:ahcich8:0:0:0): CAM status: Command timeout
(ada4:ahcich8:0:0:0): Retrying command, 3 more tries remain

The copy job finished. I verified the data written - but just a small subset: clicking on the photos and viewing them. Maybe five or so out of 50, and everything seemed fine. As usual, I made a snapshot. That process hung for many minutes. I tried to rsync the new data to my backup storage, but realized that after the first file was processed this process also stalled. I constantly checked via zpool status and dmesg whether something had changed or additional problems were being reported. That was not the case. After roughly 45 minutes, my system became unusable (the operating system and my home directory are on a different zpool), so I logged in via ssh and looked for further information - no additional errors, and the zpool still reported as online (both the zfs snapshot and the rsync command were still running but made no progress). I tried to take the vdev in question offline; however, this command also did not finish. I waited for maybe another 10 to 15 minutes and rebooted the machine; the reboot hung, and I did a hard reset. Then I checked the cables of the disks and switched the system on again. It booted, and the zpool was there again.

The snapshot I tried to create after copying data to the dataset was not created. To my astonishment, of the two folders each containing roughly 25 photos, only one folder with about 10 photos was saved on the dataset (not a single file I had verified before was there!), and the zpool still reported that everything was fine. Considering the amount of time I waited, I would have expected some information regarding the errors from either zpool status or dmesg. Luckily, I could use testdisk to recover the deleted files from the exFAT SD card. What is your opinion on this? Should I file a bug report? Could I try to reproduce this behaviour using bhyve and somehow simulate disk timeouts to help debug this (I lack the knowledge to really debug this low-level stuff...)?
 
It seems that one (or more?) of your QLC drives became very ill. So much so that they were not responding to operations, such as write commands. Therefore ZFS probably had a considerable amount of data (such as newly created files, their directory entries and content) still in memory, and was trying to write it to those disks. It's not clear to me that "zpool status" would or should have reported anything, because as far as ZFS is concerned, the underlying hardware is just being really slow. On the other hand, you did get some timeouts, and it isn't clear how far up the stack those were propagated; if ZFS ends up receiving IO errors that are caused by timeouts, that should/might have been visible. You should also have seen dozens of timeout messages in dmesg, given that this went on for 45 minutes.

Once you rebooted, the copies of the data in memory were gone, and now you're back to whatever ZFS was able to write before the disks went to heaven (or hell, depending what religion one follows). I fear that data is probably gone and not retrievable, since ZFS would show you whatever it finds on disk, and memory is just gone.

Trying to simulate and reproduce it to file bug reports is very generous and honorable of you, but in practice, very difficult. In theory, the ZFS group should have system tests that involve injecting errors (such as timeouts that are not reported as IO errors). It might be good to mention your experience on the kernel mailing list that covers ZFS, but don't expect immediate action to come from it: testing the behavior of file systems in error cases is tedious and hard, and I bet the ZFS group is doing as much as humanly possible.
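
For what it's worth, OpenZFS ships a fault-injection tool, zinject(8), which its own test suite uses for exactly this kind of scenario, so you would not have to build the whole simulation in bhyve from scratch. A rough sketch, assuming a throwaway pool called "testpool" with member disk ada4 (both names are placeholders, and the exact options may vary between OpenZFS versions):

# make writes to one member vdev fail with I/O errors
zinject -d ada4 -e io -T write testpool
# or inject artificial latency instead of hard errors (500 ms, one I/O lane)
zinject -d ada4 -D 500:1 testpool
# list the active handlers, then clear them when done
zinject
zinject -c all

Only do this on a pool whose data you do not care about.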
 
This smells just like the SMR vs. CMR hard drive issue with ZFS. You knew the drives in question were unreliable under ZFS but still decided to use them. Not much else to say really, time to pick another filesystem.
 
Not much else to say really, time to pick another filesystem.

No file system will help with worn-out SSDs, and that's most likely what we're dealing with here. It's worth noting that such a RAIDZ pool of SSDs provides almost no protection against drive failures. Yes, it's better than a stripe, but it will only help in the event of random media problems.

Each SSD has limited write endurance. They are destined to die after a certain amount of data is written. Since all drives in a RAIDZ perform the same amount of writes, there is a significant chance that more than one SSD will fail at a similar time. Such an SSD pool provides a level of protection more like a single CMR HDD.
 
Each SSD has limited write endurance. They are destined to die after a certain amount of data is written. Since all drives in a RAIDZ perform the same amount of writes, there is a significant chance that more than one SSD will fail at a similar time. Such an SSD pool provides a level of protection more like a single CMR HDD.

I doubt that the SSD discussed here died from exceeding its write endurance.
 
It's worth noting that such a RAIDZ pool of SSDs provides almost no protection against drive failures. Yes, it's better than a stripe, but it will only help in the event of random media problems.
That is false. A RAID-Z will handle anything from a single "random media problem" (known as a read error) up to the complete failure of one drive. Similarly, RAID-Z2 will handle anything from two read errors on the same data (the same block) up to two complete device failures, and RAID-Z3 three of them.

Each SSD has limited write endurance. They are destined to die after a certain amount of data is written.
As cracauer already said, home users or small servers don't usually reach the write endurance limits of SSDs. Note that modern hard disk drives also have I/O endurance limits (the infamous "550 TB per year"). And the failure rates of SSDs are comparable to those of CMR and SMR drives.
 
As cracauer already said, home users or small servers don't usually reach the write endurance limits of SSDs. Note that modern hard disk drives also have I/O endurance limits (the infamous "550 TB per year"). And the failure rates of SSDs are comparable to those of CMR and SMR drives.

When they fail from the amount of data written, SSDs are also supposed to switch to read-only mode. Whether that actually works is a matter of firmware.

Still, throwing random DMA errors sounds more like a busted controller.
 
I just did a zfs scrub and everything is fine; according to SMART, the drive in question has a "Percentage Used Endurance Indicator" of 31.

I am pretty sure it was the cable. Nevertheless, I expected ZFS to make that device fail or somehow inform me via dmesg.
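
If you want more evidence that it was the cable rather than the drive itself, smartmontools plus the pool's own error counters are the quickest checks; something along these lines (the device name is just an example):

# per-device read/write/checksum error counters and the last scrub result
zpool status -v
# SMART attributes, including the wear / percentage-used indicator
smartctl -a /dev/ada4
# run the drive's short self-test, then read the log a few minutes later
smartctl -t short /dev/ada4
smartctl -l selftest /dev/ada4

As for ZFS flagging the device: OpenZFS has "deadman" tunables that control how long it tolerates hung I/O (on FreeBSD they appear under the vfs.zfs.deadman_* sysctls, depending on the version), but the default behaviour is to wait and log rather than fault a merely slow device.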
 
That is false. A RAID-Z will handle anything from a single "random media problem" (known as a read error) up to the complete failure of one drive.
Of course you are right. My point was: take two SSDs of the same model, put them in a mirror, and write to the array/pool until the first drive fails due to excessive wear. The second SSD will likely fail at a similar time, sometimes even before you realize what's happening. This isn't the case with HDDs, even the current endurance-limited drives.

Note that modern hard disk drives also have I/O endurance limits (the infamous "550 TB per year"). And the failure rates of SSDs are comparable to those of CMR and SMR drives.
While I am not a professional, I follow a local data-recovery community. Current consumer-grade SSDs have really disappointing reliability. We say about QLC models that they fail "just because you're looking at them".

In addition, SMR HDDs proved to be less reliable than earlier models, with most failures involving the second-stage translator, absent in CMR drives.
 
My point was: take two SSDs of the same model, put them in a mirror, and write to the array/pool until the first drive fails due to excessive wear. The second SSD will likely fail at a similar time, sometimes even before you realize what's happening.
That's why you never get a bunch of drives from the same batch. That's also the case with the good old fashioned spinning rust disks.
 
That's why you never get a bunch of drives from the same batch. That's also the case with the good old fashioned spinning rust disks.
Cool! That's good to know.
Until now, I simply "aged" some drives when I got a bunch of the same type before placing them into a pool, to lower the chance that they fail at the same moment.
Thanks! (It's always valuable just to read here; I learn things almost daily :cool: )
 
Wanting/Needing to know more about ZFS like how to create snapshots, boot environments in case OS upgrades break, etc., I discovered that the author of Absolute FreeBSD made a statement to the effect that a scrub in ZFS reduces drive performance.
The author, however, used HDDs.
How should I interpret that statement?
Does he mean that with every scrub the I/O capabilities of the drive worsen?

Another question as a side note.
Should or can I use ZFS tools for NVMe SSD defragmentation, or should I use the provided NVMe tools?
I read that defragmentation of SSDs leads to an earlier death ...
 
In general, and somewhat broadly speaking, ZFS by its very nature (being a COW, that is Copy On Write, filesystem) has increasing fragmentation from its creation time forwards; that may, in (very) exceptional cases, become a performance issue. ZFS does not have tools to go over vdevs and 'defrag' them.
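
If you are curious, the fragmentation figure ZFS tracks (fragmentation of free space, not of individual files) is exposed as a pool property, and for SSD pools TRIM is the knob that actually matters; for example (the pool name "tank" is a placeholder):

# free-space fragmentation and capacity per pool
zpool list -o name,size,alloc,free,fragmentation,capacity
# let ZFS send TRIM to the SSDs continuously, or kick one off by hand
zpool set autotrim=on tank
zpool trim tank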

I discovered that the author of Absolute FreeBSD made a statement to the effect that a scrub in ZFS reduces drive performance.
The author, however, used HDDs.
How should I interpret that statement?
Does he mean that with every scrub the I/O capabilities of the drive worsen?
Last question: no; in essence a scrub (zpool-scrub(8)) is a check of all ZFS data and metadata (verifying the checksums it has created); contrary to fsck(8), a scrub must be run with the pool online. From that it logically follows that if pool I/O happens while a scrub is in progress, it will be affected by the scrubbing.
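
In practice that just means scheduling scrubs for times when the pool is otherwise idle. A minimal example using the stock periodic(8) hook that FreeBSD ships ("tank" again being a placeholder pool name):

# start a scrub by hand and watch its progress
zpool scrub tank
zpool status tank
# or let periodic(8) start one automatically once the last scrub is 35 days old
echo 'daily_scrub_zfs_enable="YES"' >> /etc/periodic.conf
echo 'daily_scrub_zfs_default_threshold="35"' >> /etc/periodic.conf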

For a good background of ZFS' main design ideas and properties, such as pooled storage and secure & redundant data storage, I suggest you have a look at: The Zettabyte File System by Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum, see here.

There is a lot of information to be found here on the forums. For snapshots and boot environments, you may find this helpful.
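
To give a taste of the snapshot side, the basic loop looks like this (dataset and snapshot names are made up for the example):

# create, list and roll back a snapshot of one dataset
zfs snapshot tank/photos@2024-01-15
zfs list -t snapshot -r tank/photos
zfs rollback tank/photos@2024-01-15
# individual files can also be copied back from the hidden, read-only .zfs directory
ls /tank/photos/.zfs/snapshot/2024-01-15/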
 
That's why you never get a bunch of drives from the same batch. That's also the case with the good old fashioned spinning rust disks.
Yes, if you find a bad batch of hard drives. Otherwise, they rarely die at the same time. This is different with SSDs, where each NAND cell has limited, predetermined write endurance.

Does he mean that with every scrub the I/O capabilities of the drive worsen?
No. Until your drive's health deteriorates, even a very old HDD will perform much like it did in its youth.

I read that defragmentation of SSDs leads to an earlier death ...
Yes, it's true, because every write to flash memory is destructive. And in the case of SSDs, defragmentation is pointless, as the flash translation layer always rearranges the physical layout of data in its own way. By defragging, you will not improve the SSD's performance, but will only bring it closer to its natural death.
 
Yes, it's true, because every write to flash memory is destructive. And in the case of SSDs, defragmentation is pointless, as the flash translation layer always rearranges the physical layout of data in its own way. By defragging, you will not improve the SSD's performance, but will only bring it closer to its natural death.
Right.
Thank you for pointing that out; I remember destroying my old NVMe SSD on Arch Linux back then that way, running an autodefrag every day...
 
Wanting/Needing to know more about ZFS like [...]
Do it like I did:
Grab some scrappy spare machine out of the attic's dust; 8G of RAM is fully sufficient - you're not actually gonna use this machine - and plug a couple of old, used 0.256...1T drives into it.
If you don't have one, get one. You can get all you need at a rummage sale for 20...50 bucks; any small, smeared, dirty old monitor with false colors will do (wet wipes can do wonders); maybe an additional ridiculous graphics adapter without a fan, if there isn't one already left in. A tower would be nice: more space, less fumbling.
All it needs to do is work: be capable of running FreeBSD with ZFS, and provide at least four SATA ports and a USB port to boot the installer image from. (Almost any old PC from the last twenty years will do - speed and power are of no importance; it just has to work.)
Best fifty bucks you've ever spent!

It's gold to have such a machine at hand: to experiment and play with, to do weird stuff you'd never dare, and better never do, with your real machine, such as 'what happens if I rm -Rf / while the system runs?' (Of course, you know - but you get the picture.)
You can bang away on this machine in a relaxed way, try things out in the most brutal ways, simulate a drive failure by pulling a drive out while the machine is doing a write job to the pool... The worst thing that can happen is you have to reinstall FreeBSD. Maybe you actually kill a drive - a used, ancient, uselessly small, uselessly slow, already almost dead drive, containing nothing of value whatsoever - duh!

You not only get a real hang of what all this replace-a-drive stuff is about - resilvering, snapshots, changing a drive's SATA port with and without labels, etc. -
but furthermore - and that's the most valuable part - you become confident about handling zfs pools, 'cause you didn't only read it in text, you already actually know, because you really did it yourself.
And that's not to be underestimated when the day comes and you actually need to do things for real for the first time on one of your real machines. (No VM can prepare you for dealing with real hardware.)
Tip: Write some small documentation on what you've done and what you learned, and a small emergency overview cheat sheet - just in case; just simple notes - you're not doing it for school, you're doing it for you.
On paper. Electronic emergency cheat sheets are not of much use if they are on the machine you need them for.
(Guess why I know that 😂)

You will learn much within one afternoon/weekend,
and be happy to have such a machine in reserve to test other things later.
Enjoy!
🤓
 
Grab some scrappy spare machine out of the attic's dust; 8G of RAM is fully sufficient - you're not actually gonna use this machine - and plug a couple of old, used 0.256...1T drives into it.
Or VMware, VirtualBox, bhyve or any other virtual machine, add a bunch of virtual disks and try every combination. If you mess up, no worries.
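
Along the same lines, you do not even need a VM to rehearse pool failures: file-backed memory disks on any FreeBSD box give you a throwaway pool to abuse (sizes and names below are arbitrary):

# three 1 GB files acting as "disks"
truncate -s 1g /tmp/d0 /tmp/d1 /tmp/d2
mdconfig -a -t vnode -f /tmp/d0
mdconfig -a -t vnode -f /tmp/d1
mdconfig -a -t vnode -f /tmp/d2
# build a raidz1 out of them, then break it and fix it
zpool create playpool raidz1 md0 md1 md2
zpool offline playpool md1
zpool online playpool md1
zpool status playpool
# clean up
zpool destroy playpool
mdconfig -d -u 0; mdconfig -d -u 1; mdconfig -d -u 2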
 
Of course you are right. My point was: take two SSDs of the same model, put them in a mirror, and write to the array/pool until the first drive fails due to excessive wear. The second SSD will likely fail at a similar time, sometimes even before you realize what's happening.
Indeed. And people who build systems for data that needs to be durable know that, and take that into account. There are many possible workarounds.
This isn't the case with HDDs, even the current endurance-limited drives.
Actually, that's not clear. There is data out there on endurance / reliability / durability / failure rates of modern drives, but I have never seen that data publicly available.

Current consumer-grade SSDs have really disappointing reliability. We say about QLC models that they fail "just because you're looking at them".
Sadly, also true. But it is not the QLC aspect itself that makes them fail; it is the awful quality of their internal firmware (the FTL) and their awful build quality. QLC only makes it worse (because the individual flash chips fail more often). In a nutshell, consumer-grade X should not be used where the user expects professional-grade performance and reliability.

In addition, SMR HDDs proved to be less reliable than earlier models, with most failures involving the second-stage translator, absent in CMR drives.
Given what I know about the real drive-measurement data from large users (who have millions of SMR drives), that statement is completely incorrect. There may have been some firmware problems in the infancy of SMR disks; there may have been SMR disks sold into inappropriate applications. There certainly were consumers (amateurs) who bought shingled disks because they "looked cheap" (in terms of $/TB) and then tortured them with inappropriate workloads that worked the internal cleaner to death. But in large production environments, SMR drives are doing excellently.

But: all disks fail. The overall device reliability continues to hover around 1M hours (plus or minus a factor of 2), but disk farms are becoming bigger. And the uncorrectable read error rate is around 10^-14 per bit. If you multiply that read error rate by the capacities of the larger and larger drives you see today, you will see more errors. The big disk users see so many errors that they can do very accurate studies of cause and effect.
 
(No VM can prepare you for dealing with real hardware.)
Or VMware, VirtualBox, bhyve or any other virtual machine,
Because of your vast experience and expertise, it may have escaped your eye how to best learn the basics.

Don't get me wrong:
VMs, hypervisors, and simulations are most valuable. Unquestionably.
But I doubt you can gain the same experience of dealing with real hardware from them as from doing it for real, if you don't already bring the experience of having done it for real at least once.

Besides, in some cases doing it for real is quicker and easier (to simulate, you first need to be well versed in how to simulate, how to rate the results, and how to transfer them into the real world), and no simulation can be as revealing as real-world experiments. Anybody who has had a revealing real-world experience of the 'F#ck, I'd never dreamt of that' kind knows that.
That's the crucial core of what distinguishes real quality from crap:
always doing real-world tests and respecting the results - or underestimating that (or saving the cost of it).
The difference lies in the nature of both. And you have to be fully aware of that at all times, or simulation can itself become a source of additional problems instead of being a help.
A simulation is always a selective model. It's always a compromise of what's considered relevant and what's skipped, while the real world provides all parameters, whether wanted or thought of. Which includes yourself - don't underestimate that 'parameter'. (The old root's joke about the user being the failure.) That's what I meant with:
become confident about handling zfs pools

In my eyes, ZFS RAIDs are meant for real physical drives storing real data. Doing it with several virtual volumes within a virtual environment only makes sense for testing and experimenting, but not for learning (the fundamentals), and not for production use, because it can only be as safe as the filesystem and its safety measures it's standing on. To put it another way: having a raidz3 of five images within one ufs partition brings no safety against a failure of the ufs drive.

So, bottom line:
I doubt you can learn in the same way how to deal self-confidently with a real physical pool when a drive fails if you've never done it before except virtually. I can imagine that someone who has learned to deal with zfs pools only within virtual environments may become insecure, if not panicky, when they have to exchange or add a real physical drive in a real physical pool.
Hence my recommendation:
Do it at least once on a real, physical machine.
Once you have enough experience, and furthermore the self-confidence of knowing how it feels in the real world, you may do the 'rest' in virtual environments.
 
Do it like I did:
Grab some scrappy spare machine out of the attic's dust; 8G of RAM is fully sufficient - you're not actually gonna use this machine - and plug a couple of old, used 0.256...1T drives into it.
If you don't have one, get one. You can get all you need at a rummage sale for 20...50 bucks; any small, smeared, dirty old monitor with false colors will do (wet wipes can do wonders); maybe an additional ridiculous graphics adapter without a fan, if there isn't one already left in. A tower would be nice: more space, less fumbling.
All it needs to do is work: be capable of running FreeBSD with ZFS, and provide at least four SATA ports and a USB port to boot the installer image from. (Almost any old PC from the last twenty years will do - speed and power are of no importance; it just has to work.)
Best fifty bucks you've ever spent!
Yes, I already have a spare laptop and a spare Mac OS X PC, and on them I've trashed about 10 FreeBSD systems on purpose 😅
However, not for the sake of learning ZFS.
What I really like about ZFS in this regard is boot environments, which you can even switch at the loader prompt before booting up the kernel.
I used sysutils/beadm for creating boot environments in ZFS, but I read that this utility has been superseded by bectl, which was built during a summer bootcamp, I think?
It is very handy to destroy one boot environment and then load up the other 🤣
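
For reference, the bectl(8) workflow on a ZFS-on-root system looks roughly like this (the boot environment name is just an example):

# snapshot the running system into a new boot environment before an upgrade
bectl create before-upgrade
bectl list
# if the upgrade goes wrong, activate the old environment again (it is also selectable from the loader menu)
bectl activate before-upgrade
# remove environments you no longer need
bectl destroy before-upgrade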

During these 4 days of trial and error, and completely reading Absolute FreeBSD, I must say I learnt a lot of things, and now understand a lot more about ZFS snapshots, the hidden .zfs directory every dataset has, and how to troubleshoot in single-user mode and at the loader prompt.

That book also had me building, destroying, and rebuilding my kernel for about 2 days just to get the right combination of minimalism and functionality. By doing that I saved about 500 MB of RAM usage.

It's gold to have such a machine at hand: to experiment and play with, to do weird stuff you'd never dare, and better never do, with your real machine, such as 'what happens if I rm -Rf / while the system runs?' (Of course, you know - but you get the picture.)
You can bang away on this machine in a relaxed way, try things out in the most brutal ways, simulate a drive failure by pulling a drive out while the machine is doing a write job to the pool... The worst thing that can happen is you have to reinstall FreeBSD. Maybe you actually kill a drive - a used, ancient, uselessly small, uselessly slow, already almost dead drive, containing nothing of value whatsoever - duh!

You not only get a real hang of what all this replace-a-drive stuff is about - resilvering, snapshots, changing a drive's SATA port with and without labels, etc. -
but furthermore - and that's the most valuable part - you become confident about handling zfs pools, 'cause you didn't only read it in text, you already actually know, because you really did it yourself.
And that's not to be underestimated when the day comes and you actually need to do things for real for the first time on one of your real machines. (No VM can prepare you for dealing with real hardware.)
Tip: Write some small documentation on what you've done and what you learned, and a small emergency overview cheat sheet - just in case; just simple notes - you're not doing it for school, you're doing it for you.
On paper. Electronic emergency cheat sheets are not of much use if they are on the machine you need them for.
(Guess why I know that 😂)

You will learn much within one afternoon/weekend,
and be happy to have such a machine in reserve to test other things later.
Enjoy!
🤓
I agree, it is much more beneficial to not just read things, but actually do them.
Learning by doing, I guess, is a concept which is mostly overlooked.
Even if you make one mistake, you will start to read other parts of a book more carefully, because you have learned that fixing errors takes more time than avoiding them in the first place.

For example, I once wrote a shell script and neglected to substitute rm -rf with ls for a dry run before deleting files.
The outcome was that a directory path was spelled wrong in the script, and instead of /path/to/dir, everything in / was deleted.
The system kept running, but I could not issue any command anymore, and could not even shut down without cutting power at the power supply 🤣
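
A tiny sketch of the habit that would have caught that (the path and the DRY_RUN switch are of course made up for the example):

#!/bin/sh
# abort on errors and on undefined variables
set -eu
target="/path/to/dir"
DRY_RUN=yes
if [ "$DRY_RUN" = yes ]; then
    ls -la "$target"          # only show what would be removed
else
    rm -rf -- "${target:?}"   # ${var:?} aborts if target is empty or unset
fi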

Occasionally - and now for understanding other users' source code, getting the connections right, and new terminology I encounter - I take notes.
They are very handy, and they are written in a language I can grasp much faster than looking up 50+ manual pages. 😅

Well, from your post, I guess you are a physics or electrical engineering professor?
Since you mentioned that you got your doctorate in physics, I believe.
 
Or VMware, VirtualBox, bhyve or any other virtual machine, add a bunch of virtual disks and try every combination. If you mess up, no worries.
I did that, too.
Mainly with Oracle's VirtualBox on Windows, because setting it up was very fast and easy.
But compiling things from source crashed that VirtualBox VM in the end...
 