ZFS: Forcibly remove files/directories



VERIFY0(space_map_iterate(sm, space_map_length(sm),
    iterate_through_spacemap_logs_cb, &uic));

space_map_iterate(sm, space_map_length(sm), iterate_through_spacemap_logs_cb, &uic) == 0 (0x61 == 0)

For some strange reason there is this rather amateurish belief that ZFS is perfect, always works, never needs an fsck, and so on.

It is not so.

One of the "classic" problems is precisely its very, shall we say, imperative tendency to examine the free space before doing anything.
That is understandable (with a COW system, free space is essential).
But the result is that ZFS sometimes crashes.
Rarely, but it happens (very often with deduplicated datasets, but there I would have to explain why there is a sort of fsck and how it works).

In your case, examining the source code, you find an assert where "something" should be empty, but it isn't, and ZFS gets really pissed off.

I don't think there are any ZFS tools that let you "cure" the situation.
You could patch and recompile the binaries, but even for me that would take days, if not weeks.
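About all you can do from userland without patching is read-only inspection, e.g. with zdb. A few invocations, from memory (check zdb(8); the pool name is a placeholder, nothing here modifies the pool, and on a pool this badly damaged zdb may well trip over the same assertion):

# read-only inspection of a pool called "tank"
zdb -m tank      # metaslab / space map summary
zdb -mmm tank    # same, with per-space-map detail
zdb -b tank      # traverse all block pointers, report leaked or double-allocated space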

You can ask some ZFS guru (yes, they are kind people and sometimes answer; try Kirk McKusick or Yao).
 
I thought ZFS was rock-stable. It's not. It doesn't even have an fsck to fix problems, like UFS does.
An fsck for ZFS could be written, but it would be so huge and complex that I really do not think it would be useful.
PS: I'll take you over to the "dark side", I'm writing a special thread right now :)
 
In the end I ended up destroying the dataset with the corrupt directory.
[And backed the files up with cp.]
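Something along these lines (the names here are just placeholders):

# copy out whatever is still readable first...
mkdir -p /backup/data
cp -Rpv /tank/data/. /backup/data/ 2> /backup/cp-errors.log

# ...then destroy the damaged dataset (and its snapshots), recreate it, and restore
zfs destroy -r tank/data
zfs create tank/data
cp -Rpv /backup/data/. /tank/data/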

Now I only have to avoid the ZFS-corrupting power outages...
 
I thought ZFS was rock-stable. It's not. It doesn't even have an fsck to fix problems, like UFS does.

I suspect that these problems with small but unfixable corruption we see on this forum have to do with storage devices that lie to the user about having committed specific data to permanent storage when asked to (sync). Google investigated this and found that a majority of default firmwares lie. Google has custom firmware for most or all datacenter drives.

If the storage device didn't actually commit a specific piece of data to permanent storage when asked to, then the ordering of commits as visible on storage after a sudden reboot can be messed up, e.g. on a power failure. The result would be corruption without unstable hardware and without ZFS bugs.

Personally I am very concerned about this but there is little that I can do about it. A UPS would help, but only if it also includes a clean shutdown on power fail.
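One rough way to get a feel for whether a particular drive honors cache flushes is FreeBSD's synchronous-write test in diskinfo (from memory, see diskinfo(8); the -w tests write to the device, so only run this on a blank or spare disk):

# DANGER: writes to the device; use a disk with no data on it
diskinfo -wS /dev/ada1
# Suspiciously low latencies (tens of microseconds) on a drive without power-loss
# protection suggest the flush is being acknowledged from volatile cache, not media.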
 
Personally I am very concerned about this but there is little that I can do about it. A UPS would help, but only if it also includes a clean shutdown on power fail.
Yes, you can do something about it.
But not with zfs send|receive alone :)
 
It's up to the filesystem to handle lying hardware, e.g. data still sitting in a buffer while the commit is reported as done.
A ZFS option: "lying hardware".
Note: with UFS I had, after power outages, some files come back with size 0. But the directory was always OK after an fsck.
 
It's up to the filesystem to handle lying hardware, e.g. data still sitting in a buffer while the commit is reported as done.
A ZFS option: "lying hardware".
Note: with UFS I had, after power outages, some files come back with size 0. But the directory was always OK after an fsck.
In fact, no.
No, because sometimes even the controller does not know what is going on.
On an SSD the firmware usually "reorders" data (think "defrag" or "trim" in the SSD sense), even without any command from the PC.

"You" don't send any write commands on the SATA bus, BUT the SSD writes anyway.
And "you" (the controller, the OS) will never know.

I will turn you into my Sith adept!
:)
 
Hardware vendors like to do things in buffers to fool benchmarks with higher speeds.
But reliability is tested with hard power outages.
 
Hardware vendors like to do things in buffers to fool benchmarks with higher speeds.
But reliability is tested with hard power outages.
In fact, no.
Users try to buy the cheapest possible hardware to store as much valuable data as possible.
Would you pay $2,000 for a 1 TB SSD, or $50?
 
You should not report to the library that data is written synchronously when it is still in buffers & not on the media itself.
 
You should not report to the library that data is written synchronously when it is still in buffers & not on the media itself.
As I said, the world is very different now.
Drives use caches all the way down, with very complex (read: fragile :) firmware.
Thermal recalibration.
Bad-sector remapping.
(...) and on and on...
SSDs are even much worse, with block-level pages, trims, hidden "repositioning" so as not to wear specific cells, multi-level cells (not SLC), etc. etc.

I have about 8 SLC 32 GB Intel drives here.
They still run fine after 14 years, but at $500 for 32 GB each that is about $4,000 (of 14 years ago, maybe $8,000 today) for 256 GB.
 
I suspect that these problems with small but unfixable corruption we see on this forum have to do with storage devices that lie to the user about having committed specific data to permanent storage when asked to (sync)....
I agree. Kirk McKusick brought us vastly improved reliability with UFS/FFS soft updates, but if the storage devices lie about commits, then all bets are off.

My suspicion is that SSDs do a lot more lying on this matter than traditional spinning disks -- and the manufacturers do their very best to obfuscate this. That is why I asked above about the type of media involved (really just to test my prejudices).

I'm pretty much forced to have a UPS, and a sine wave petrol generator -- I had four unplanned power outages last week, plus a planned one that lasted 7 hours. However, Alain's woes above just reinforce my view that I want reliable power and clean shutdowns -- most especially now that SSDs have become so common.

Even though I have a UPS, I also pay the premium for SSDs with capacitors for the critical ZFS-related infrastructure (separate ZIL and special VDEVs), and always configure redundant storage.
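[As a rough sketch, attaching those two vdev types to an existing pool looks something like this; device names are placeholders. The special vdev holds pool metadata, so it needs the same redundancy as the rest of the pool, and mirroring the SLOG avoids losing in-flight sync writes.]

zpool add tank log mirror gpt/slog0 gpt/slog1               # separate ZIL (SLOG)
zpool add tank special mirror gpt/special0 gpt/special1     # special allocation class for metadata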

None of this guarantees 100% recoverability. But it sure increases the odds. I run an office with three tower servers, two small (passively cooled) servers, 12 spinning disks, 8 SSDs, two switches, a WiFi router, and a large screen (on a modest video card) all on a 300W running load. A 500W - 600W Eaton or APC UPS can be purchased for well under US$200. Bargain!

[Eaton and APC both usually seem to work well with sysutils/nut.]
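[A minimal nut setup for the clean-shutdown part might look roughly like the following. Driver choice, file locations, and the rc.conf knobs are from memory, so check the port's documentation and rc.d scripts; names and the password are placeholders.]

# /usr/local/etc/nut/ups.conf -- assuming a USB-attached unit
[ups]
        driver = usbhid-ups
        port = auto

# /usr/local/etc/nut/upsd.users
[monuser]
        password = secret
        upsmon master

# /usr/local/etc/nut/upsmon.conf -- shut down cleanly when the UPS battery runs low
MONITOR ups@localhost 1 monuser secret master
SHUTDOWNCMD "/sbin/shutdown -p now"

# /etc/rc.conf
nut_enable="YES"
nut_upsmon_enable="YES"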
 
I thought ZFS was rock-stable. It's not. It doesn't even have an fsck to fix problems, like UFS does.
Do you have ECC on your memory? Do you watch the BIOS logs for single-bit errors that were corrected?

Do you buy enterprise-grade disks? And connect them with enterprise-grade HBAs and cables? And watch the firmware versions (also of the disks) and upgrade when needed? And configure redundant disks and systems?

If the answer to one of those questions is "no", then please don't throw eggs at ZFS.

To reinforce what Cracauer and fcorbelli said above: Consumer grade disks are optimized for lowest possible cost (which is what consumers really want), and looking good on dumb benchmarks (Phoronix or Tom's Hardware come to mind). Enterprise systems (not just disks!) are optimized to make the customer long-term happy, and the single biggest reason for unhappiness is data loss. Part of that optimization is to buy high-quality components. Part of it is to actively investigate firmware versions, and make sure the optimal one is used at all times (anecdote 1 below). Part of it is to prevent unplanned power failures as much as possible (there is a reason data centers have both batteries and diesel generators). Part of it is to perform destructive testing to make sure the (disk...) vendors' promises are actually correct (anecdote 2 below). Another part is to enable hardware protection mechanisms where available, for example hardware encryption and checksums (anecdote 3 below). And a final part is to always have multiple copies of the data, typically in multiple locations (anecdote 4 below).

Anecdote 1: At a former employer, sales wanted to ship a system to a customer, using a new model of disk drive that engineering and quality control had not yet studied, so we didn't know what firmware version should be used. Another engineer and I vetoed having the system shipped out. Our veto was overridden by an executive, because (a) the customer urgently needed the system, (b) we needed the revenue, and (c) there were no other disks available. About a week after the system shipped, the disks started dying like flies. We ended up replacing several thousand disk drives in the field, and giving the customer a retroactive 100% discount on the system. All because some VP didn't want to wait for firmware testing.

Anecdote 2: A friend of mine worked for a different storage systems company. His job for a while was to write data to hundreds of SSDs in a lab system, and then cut power in the middle of the writes. This was about 15 years ago, when SSDs were the new and hot thing. He found that a surprisingly large fraction of the SSDs (a) don't actually write to media even though they replied to the host that they had, (b) reorder writes, (c) return corrupted data, or (d) even completely die and refuse to reboot after a power outage during a write. If you remember the beginning of the SSD era, there used to be a lot of small vendors that assembled SSDs from NAND chips and generic OEM controllers: Most of those small vendors went under because they couldn't get their quality under control.

Anecdote 3: Several decades ago, the T10 committee (which defines the SCSI standard, and really leads the industry in how disk interfaces work) proposed to allow applications to calculate checksums, pass those checksums all the way through the stack (OS, HBA, hardware), write them to disk, and return them to the application. This cost a lot of money, and was really complex. Why did the disk vendors push this? Because they were sick and tired of being wrongly accused of corrupting data: Anytime a database was broken, the software companies like Oracle or IBM claimed "it was the disk drive that corrupted the data", and Seagate and Hitachi wanted to be able to prove: "You calculated this checksum, you wrote it, and we're returning the data to you unmolested".

Anecdote 4: A big corporate IT customer of my previous employer had two complete data centers, with complete storage systems in both, and fast network cables to connect them. In two separate buildings. Like that, even a complete disaster that affects one building could be recovered from. Unfortunately, the second data center was in the other tower of the World Trade Center.
 
Anecdote 1: enterprise-level drives sometimes give problems too.
The specific factory test (not SMART, a 7-hour-long test for each drive) says everything is OK, all green.
After a week of digging it sometimes turns out that failures occur during writes, not reads.
And that test writes some kind of simple pattern like 000111222.
Now I use my own software to write and read back almost every free byte, with a random pattern, on all brand-new spinning drives, watching the average speed in every section, before installing them in a server.
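Roughly, the idea is something like this (a crude sh sketch, not the actual tool; the device name is a placeholder and the test destroys everything on the disk):

# DESTRUCTIVE: overwrites the whole device. Use only a brand-new / empty disk.
disk=/dev/da5
dd if=/dev/random of=/tmp/pattern bs=1m count=1024    # one 1 GiB random section

# Write the pattern over the whole disk, one section at a time, timing each one.
i=0
while dd if=/tmp/pattern of=$disk bs=1m seek=$((i * 1024)) 2>> /tmp/write.log; do
    i=$((i + 1))
done
echo "wrote $i full 1 GiB sections (speeds in /tmp/write.log)"

# Read every full section back and compare it against the pattern.
j=0
while [ $j -lt $i ]; do
    dd if=$disk of=/tmp/check bs=1m skip=$((j * 1024)) count=1024 2>> /tmp/read.log
    cmp -s /tmp/pattern /tmp/check || echo "mismatch in section $j"
    j=$((j + 1))
done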


Short version: even with ZFS or whatever, do the backups and check them :)

I am on my way to explaining a normal level of paranoia :)
 
I suspect that these problems with small but unfixable corruption we see on this forum have to do with storage devices that lie to the user about having committed specific data to permanent storage when asked to (sync). Google investigated this and found that a majority of default firmwares lie. Google has custom firmware for most or all datacenter drives.
I'd also guess part of it is drives that lie about how much space they actually have; that's why I've always preferred to put ZFS on partitions and actually size and align those partitions instead of "just use the rest of the space".
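For example, instead of feeding the whole disk to zpool (disk name, label, and size are placeholders):

gpart create -s gpt da2
# 1 MiB alignment, and deliberately a bit smaller than the advertised capacity
gpart add -t freebsd-zfs -a 1m -s 930G -l tank0 da2
zpool create tank /dev/gpt/tank0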
 
Do you have ECC on your memory? [...]
Do you buy enterprise-grade disks? [...]
If the answer to one of those questions is "no", then please don't throw eggs at ZFS. [...]
I use poor-man's hardware & SSDs...
 
I use poor-man's hardware & SSDs...
It does not matter.
"Rich-man's" HW and software fail, too.
 
I don't think that more expensive or higher grade HDs and SSDs are any better on average. Google found lying firmware in their regular stuff. They get custom firmware for enterprise drives.

It is all optimized for common benchmarks.
 
I don't think that more expensive or higher grade HDs and SSDs are any better on average. Google found lying firmware in their regular stuff. They get custom firmware for enterprise drives.

It is all optimized for common benchmarks.
There are differences, but essentially constructional ones.
For SSDs there are the "legendary" capacitors that SHOULD allow buffered data to be written out, like small backup batteries.
Obviously, personally, I don't trust anything at all, so for me the problem is invariant with respect to the hardware and software.

The "theory" goes that ZFS is completely immune to consistency problems,
BUT
practice proves the opposite,
THEREFORE
ZFS is not immune. Full stop.
 
I have found one of the bad things I was doing.
Using zfs from "openzfs" (the port) on the host.
Using zfs from "base" in the jail.
 