ZFS Confused about dirents, fstat, lstat, and inodes

I'm on a TrueNAS, so the file system is ZFS. I'm also in a "jail". The file system originates on the base TrueNAS system and is then mounted into the jail.
I have a program that runs through the file system creating a database of dirent entries and inode entries. I've assumed that st_dev and st_ino are unique and stay constant across reboots. Apparently that is not true in my setup. I was making a pass and the dev was 0x2900FF08. I rebooted the jail and now the dev is 0x2900FF0B.

I could just ignore the dev, since in my limited application all the files are on the same mount point, but I'm wondering if there is any way I can get a real constant that will be preserved across a reboot of not just the jail but also the NAS.
 
If the FS is mounted in the jail via a "nullfs" mount, that's probably what tampers with the st_dev. nullfs is designed to make it appear as if the new mount is a new filesystem, so it picks a new st_dev for itself. Doesn't seem like there's an easy way to disable it...
 
Technically, your question is not welcome here, since this forum is about FreeBSD, not about TrueNAS. But your problem is not TrueNAS specific.

The big underlying question is: What are you really trying to accomplish? You say "I've assumed that the st_dev and st_ino are unique and stay constant ...". What does "unique" and "constant" even mean?

In modern file systems (which use more than one physical disk, and can create virtual files in bulk and quickly), the 60-year-old concepts of st_dev and st_ino have become mostly meaningless. In the old days, st_dev was the device major/minor of the physical disk the file system was on. Modern file systems don't use a single device, so they make up a random number for st_dev. That random number can be constant as long as you don't perform certain operations (such as mkfs, creating snapshots, or changing the physical/logical volume mapping), but it does not have to be. If you really want to know a stable identity of the file system, call statfs() and look at the fsid field. That one should not change ... except it might when you start doing things like snapshots. And any code that uses it is not portable between Linux and *BSD (including TrueNAS).
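
Here's a minimal sketch of what that looks like on FreeBSD (Linux spells the headers and the fsid field differently, so don't expect this to be portable):

```c
/* Minimal sketch: print a file system's fsid on FreeBSD.
 * Not portable: Linux uses <sys/vfs.h> and lays out f_fsid differently. */
#include <sys/param.h>
#include <sys/mount.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct statfs sfs;

    if (argc != 2) {
        fprintf(stderr, "usage: %s path\n", argv[0]);
        return 1;
    }
    if (statfs(argv[1], &sfs) == -1) {
        perror("statfs");
        return 1;
    }
    /* On FreeBSD, f_fsid is two 32-bit words. */
    printf("fsid: %08x:%08x  (%s on %s)\n",
           (unsigned)sfs.f_fsid.val[0], (unsigned)sfs.f_fsid.val[1],
           sfs.f_fstypename, sfs.f_mntonname);
    return 0;
}
```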

In the old days, you could be sure that if two files have different st_dev, you could not hardlink them to each other, but if they had the same st_dev, you could. I don't think such a guarantee exists today; I definitely know of systems where the link() call will fail even though both places have the same st_dev (whether that's POSIX compliant is anyone's guess). In the old days, you could be sure that if two files that have been found by their name have the same st_dev and st_ino, they are the same file. Today, with concepts such as union or shadow mounts and snapshots, the concept of "same object" is so hard to define, you'll never figure it out by comparing two integers.

So let me ask you this question again: What are you really trying to accomplish? Why do you care what the numeric identity of a file is?
 
Technically, your question is not welcome here, since this forum is about FreeBSD, not about TrueNAS. But your problem is not TrueNAS specific.
... and that's why it is here. Something that merely happens on TrueNAS is one thing, but once it can also pop up on a native install (like ZFS internals being smart), it is allowable (imho).
 
call statfs() and look at the fsid field. That one should not change ...

Why do you care what the numeric identity of a file is?
I'll look at the fsid field ... thanks.

I have about 100 disks that I've collected over the years -- backups of laptops and desktops and old "data" drives used to store "precious" stuff like my photographs. I bought a big NAS and dumped all the files onto it and I'm slowly going through finding duplicate files and eliminating them -- either delete the duplicates or link them.

Eventually I match up files of the same size and compute their SHA-1. If the SHA-1s match, I compare the two files. If they match, I get excited and drink a 12-pack.

I thought rather than keeping the full path to each file, I would mimic a file system and keep dirents and inodes. But I don't want to keep multiple entries for the same file. So I need something that stays "constant", where "constant" in this context means the life of a particular file.

I thought the devno, inode number, and date of last modification would stay constant until the file changed, got deleted, replaced, etc.
 
Ah, that makes sense: You are trying to do full file dedup. I do exactly the same thing in my backup program.

It seems that the way you really use inodes is to disambiguate file names. One file can have different names over time, or it can have multiple names simultaneously, and you are using the inode number to uniquely identify the underlying file. Obviously, the file content itself can change, so you will also have to check that the file's size and mtime haven't changed over time. (Note that this check is not 100% safe: since mtime can be explicitly set, it is theoretically possible to change the content of a file and afterwards have the same inode/mtime/size, but in practice that is unlikely, and anyone who changes the mtime back is an evil adversary and can be ignored.) I think for your purpose, the inode number will mostly work. Just don't create snapshots, or if you do, don't look inside them. Using the fsid from the statfs() call should be a safe way to distinguish two file systems.
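
Concretely, the change check boils down to something like this; the struct seen record is just a stand-in for whatever your database keeps per file:

```c
/* Sketch: has a previously-recorded file changed?  "struct seen" is
 * hypothetical; it stands for whatever your database stores per inode. */
#include <sys/types.h>
#include <sys/stat.h>
#include <stdbool.h>

struct seen {
    dev_t  dev;     /* st_dev (or better: the fsid from statfs()) */
    ino_t  ino;     /* st_ino */
    off_t  size;    /* st_size at the time of the scan */
    time_t mtime;   /* st_mtime at the time of the scan */
};

/* Returns true if the path still refers to the same, unmodified file.
 * Not 100% safe: mtime can be set back explicitly, as noted above. */
static bool still_same(const char *path, const struct seen *s)
{
    struct stat st;

    if (lstat(path, &st) == -1)
        return false;                    /* gone, or unreadable */
    return st.st_dev   == s->dev  &&
           st.st_ino   == s->ino  &&
           st.st_size  == s->size &&
           st.st_mtime == s->mtime;
}
```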

But here is my suggestion: just skip the inode number completely. Instead, identify a file by its content. I do nearly the same thing you're proposing, except that I use the longest SHA that I found, namely SHA-512. The collision probability for that is so low that mis-identifying content is astronomically unlikely on a home server. So what I do is to use the pair of size and SHA to uniquely identify the content of a file: if the size is different, there isn't even a need to look at the SHA. Then to check whether a file has changed, I look at size and mtime (see above).
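
For what it's worth, here is a sketch of the hashing side using OpenSSL's EVP interface (link with -lcrypto); the pair of st_size and this digest is then the content key:

```c
/* Sketch: compute the SHA-512 of a file's content with OpenSSL's EVP API.
 * The (st_size, digest) pair then serves as the content key. */
#include <openssl/evp.h>
#include <stdio.h>

/* Fills digest[64]; returns 0 on success, -1 on error. */
int sha512_file(const char *path, unsigned char digest[64])
{
    unsigned char buf[64 * 1024];
    size_t n;
    int ok = -1;
    FILE *f = fopen(path, "rb");
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();

    if (f == NULL || ctx == NULL)
        goto out;
    if (EVP_DigestInit_ex(ctx, EVP_sha512(), NULL) != 1)
        goto out;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        if (EVP_DigestUpdate(ctx, buf, n) != 1)
            goto out;
    if (!ferror(f) && EVP_DigestFinal_ex(ctx, digest, NULL) == 1)
        ok = 0;
out:
    if (ctx) EVP_MD_CTX_free(ctx);
    if (f) fclose(f);
    return ok;
}
```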
 
So what I do is to use the pair of size and SHA to uniquely identify the content of a file: if the size is different, there isn't even a need to look at the SHA.
It's always about finding the combination to uniquely identify something.
 
It's always about finding the combination to uniquely identify something.
Yes. I didn't want to spend the time computing the SHA for large files that have a size different from all others. I'm trucking through about 30TB of junk (99% of it is worthless but I know there are bits I want). So, I was finding all files and basically just saving their size and mtime. Then I find possible matches and compute the SHAs. Then I find matches of those and do a full compare.

A thought occurred to me while reading this. Perhaps do a SHA on (for example) the first 4K and the last 4K. That would be somewhat cheap and would be a very good first approximation of a match. I still plan to do a "cmp" to make sure they are identical.
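
Something along these lines is what I have in mind for the 4K probe (just a sketch; the sampled buffer would then go through whatever digest I end up using):

```c
/* Sketch: read the first and last 4 KiB of a file into one buffer, which
 * can then be fed to a digest as a cheap pre-filter before a full compare. */
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096

/* Fills buf (at least 2*CHUNK bytes); returns bytes filled, or -1 on error. */
ssize_t head_tail_sample(const char *path, unsigned char *buf)
{
    struct stat st;
    ssize_t n = 0, r;
    int fd = open(path, O_RDONLY);

    if (fd == -1 || fstat(fd, &st) == -1)
        goto fail;
    r = pread(fd, buf, CHUNK, 0);                 /* first 4 KiB */
    if (r == -1)
        goto fail;
    n = r;
    if (st.st_size > 2 * CHUNK) {                 /* last 4 KiB, if it doesn't overlap */
        r = pread(fd, buf + n, CHUNK, st.st_size - CHUNK);
        if (r == -1)
            goto fail;
        n += r;
    }
    close(fd);
    return n;
fail:
    if (fd != -1)
        close(fd);
    return -1;
}
```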

Thank you to all...
 
That's an interesting thought. If sizes don't match, who cares about hashes. As long as sizes are true and not "rounded to the nearest 4k".
Of course, what are the probabilities that the front and back hashes match but the middle hashes don't?
False Positives and False Negatives.
I don't envy you this task.
 
I didn't want to spend the time computing the SHA for large files that have a size different from all others. ... So, I was finding all files and basically just saving their size and mtime.
That's a really nice optimization. In particular, since normal file systems typically have very few gigantic files (otherwise they'd run out of space), and gigantic files have a huge phase space of sizes, the probability of two gigantic files having the same size is really low. I like the idea so much, I just added it to my to-do list to steal and implement.

From an implementation viewpoint, this has a nasty drawback: If you suddenly find a second gigantic file that has the same size as one previously seen, you have to go back and calculate the SHA for both. From a code flow viewpoint, this sounds annoying.
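
Roughly, the bookkeeping would have to look like the sketch below; hash_and_index() is a stand-in for the real SHA-and-store step, and the fixed-size table is only for illustration:

```c
/* Sketch of lazy hashing: remember one unhashed path per size and only
 * compute digests once a second file of that size shows up. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct bysize {
    off_t  size;
    char  *pending;     /* path seen first, not yet hashed; NULL once hashed */
};

static struct bysize table[1 << 16];   /* toy fixed-size table, linear scan */
static size_t nsizes;

/* Stand-in for the real work: compute the SHA-512 and record it. */
static void hash_and_index(const char *path)
{
    printf("would hash and index: %s\n", path);
}

void saw_file(const char *path, off_t size)
{
    for (size_t i = 0; i < nsizes; i++) {
        if (table[i].size != size)
            continue;
        if (table[i].pending) {                 /* second file of this size: */
            hash_and_index(table[i].pending);   /* go back and hash the first */
            free(table[i].pending);
            table[i].pending = NULL;
        }
        hash_and_index(path);                   /* and hash the newcomer */
        return;
    }
    /* first file of this size: just remember it, no SHA yet */
    table[nsizes].size = size;
    table[nsizes].pending = strdup(path);
    nsizes++;
}
```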

A thought occurred to me while reading this. Perhaps do a SHA on (for example) the first 4K and the last 4K. That would be somewhat cheap and would be a very good first approximation of a match.
I had that idea a long time ago, except I did the SHA of the first 1MiB (not the last). Turns out to be a complete waste of time in my case. Here's why: I'm doing backups; when I find a new file (which has a SHA that doesn't match anything), I need to copy it to the backup. So I'm working through the file, and when I reach 1MiB, I complete the SHA calculation, and check whether I already have a match so far. But how does that help? If it is not a match, then I already know that this is guaranteed to be a new file, in which case I need to finish calculating the SHA anyway. And if I have a match in the first MiB, it is reasonably likely that I have two files that are identical in the first MiB but change later (for example, because both had something appended, not uncommon for logs), in which case I need to calculate the SHA to the end anyway to find out whether they're different. So the SHA of the first MiB is useless.

I like your idea of doing the beginning and end of the file; that might help performance a little bit. I'll try it.

In the meantime, I discovered a particularly annoying class of files that I need to learn to deal with, namely MP3-encoded music files. I have thousands of those. I'm now discovering that many of those are "fundamentally identical", the only difference is that the tags (composer name, play count ...) inside them have been updated. Today, I back up both versions. What I want to add is the following functionality: If the file format is MP3, then decode the audio content, and take a SHA of the audio content instead of the whole file. If the audio content is identical (to the bit), but the files are different, then always save only the later one (by mtime), not both. Implementing that is on my to-do list.
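
If and when I implement it, I imagine something roughly like the following (a sketch only: it assumes ffmpeg(1) is available to do the decoding, and the exact flags will need tuning):

```c
/* Sketch: hash only the decoded audio of an MP3 by piping it through
 * ffmpeg(1) and digesting the raw PCM, so tag-only edits produce the
 * same digest.  Assumes ffmpeg is installed; link with -lcrypto. */
#include <openssl/evp.h>
#include <stdio.h>

int sha512_audio(const char *mp3path, unsigned char digest[64])
{
    char cmd[4096];
    unsigned char buf[64 * 1024];
    size_t n;
    int ok = -1;
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    FILE *p;

    /* Decode to signed 16-bit little-endian PCM on stdout.
     * Naive quoting: fine for a sketch, not for arbitrary path names. */
    snprintf(cmd, sizeof cmd, "ffmpeg -v error -i '%s' -f s16le -", mp3path);
    p = popen(cmd, "r");
    if (p == NULL || ctx == NULL)
        goto out;
    if (EVP_DigestInit_ex(ctx, EVP_sha512(), NULL) != 1)
        goto out;
    while ((n = fread(buf, 1, sizeof buf, p)) > 0)
        if (EVP_DigestUpdate(ctx, buf, n) != 1)
            goto out;
    if (EVP_DigestFinal_ex(ctx, digest, NULL) == 1)
        ok = 0;
out:
    if (ctx) EVP_MD_CTX_free(ctx);
    if (p) pclose(p);
    return ok;
}
```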

I still plan to do a "cmp" to make sure they are identical.
I decided not to. I use a 512-bit SHA, and I have far less than a million files. The probability that two different files (with the same size) have the same SHA is astronomically small, even after taking the birthday paradox into account.
 
Yes. I didn't want to spend the time computing the SHA for large files that have a size different from all others. I'm trucking through about 30TB of junk (99% of it is worthless but I know there are bits I want). So, I was finding all files and basically just saving their size and mtime. Then I find possible matches and compute the SHAs. Then I find matches of those and do a full compare.

A thought occurred to me while reading this. Perhaps do a SHA on (for example) the first 4K and the last 4K. That would be somewhat cheap and would be a very good first approximation of a match. I still plan to do a "cmp" to make sure they are identical.

Thank you to all...
You are trying to reinvent the wheel, it appears. There's sysutils/samefile, which only compares files of the same size and is thus blazingly fast. Maybe it fits your use case. Basically you just run find . | samefile and it spits out a table of identical files, which you can then process for linking or removing.
 