Why does diff grow (huge) on identical files?

I have some large-ish text files (approx 8 GB each). I was having trouble comparing these files with diff under 9.1-RC2, so I did some testing with identical files. If I run [CMD="diff"]-q file1 file2[/CMD] everything is OK. But if I just run [CMD="diff"]file1 file2[/CMD] then I get various failures. On an i386 box diff runs out of address space (the error message is "diff: memory exhausted"). Even worse, on an amd64 box the process grows until swap fills up and the system starts killing processes (at random?). Eventually, the system kills "init" and then hangs. This can't be right. Why would diff need that much context when the contents are identical? Is it a bug? Or a horrible misfeature?

  1. Is this diff behavior necessary? Why?
  2. Must the system fail catastrophically just because some user process wants more memory?
  3. Does the system fail like this on other versions?
 
  1. To me that sounds like a bug with diff. I suggest filing a problem report.
  2. UNIX systems usually try to let the user have as many resources as they want unless you set limits. You can set quotas on user resource usage if you want to block this kind of behaviour.
  3. Probably.
 
Uniballer,

The files you're talking about are huge compared to your physical memory. I haven't looked at diff(1)'s code, so I don't know whether it could split the input into batches, keep one batch in RAM at a time, compare it, and then move on to the next one; given what diff has to compute, I'm not sure an approach like that is even feasible. So the problem you're reporting is probably just a consequence of how diff(1) has to work. This is why resource limits exist (see limits(1)), which you can configure through /etc/login.conf (login.conf(5)).
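For example, something like this (the class name "limited" and the limit values are placeholders I made up; pick whatever fits your machine):
Code:
limited:\
        :vmemoryuse=2G:\
        :datasize=1G:\
        :tc=default:
After editing /etc/login.conf, rebuild the capability database and put the user in the new class, e.g.:
Code:
cap_mkdb /etc/login.conf
pw usermod someuser -L limited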
 
I believe it's because of the way diff(1) works. You might want to try --speed-large-files.

Code:
       --speed-large-files
              Assume large files and many scattered small changes.
 
FYI --speed-large-files doesn't actually help. The system still ran out of swap, killed init, and hung.

Physical memory was 8GB, then 12GB with 4GB swap. When I went up to 16GB RAM it did complete, but not without also using a lot of swap.

I went back and checked the diff program from Unix 32V (this code is freely available due to both the SCO and Caldera Ancient Unix licenses). It only took a few minor changes to make it compile and work on FreeBSD. It still fails on i386, but it says "diff: files too big, try -h". The Unix 32V diff succeeds on my amd64 system. There is also a related program, diffh, that is more naive about context and has much smaller memory use (it handles the diff -h case). It looks like it would work on anything. This program remained apparently unchanged through the 4.4BSD release, but was not part of Net2 or 4.4-Lite.

So I guess we are down to the "Diff Denial of Service" attack on amd64 systems. I happen to think it is bad to allow a user to bring one of my systems to its knees just by typing
Code:
diff bigfile1 bigfile1
And there seems to be no conventional way to recover the system remotely.
 
All apps can do this, not just diff.

Uniballer said:
So I guess we are down to the "Diff Denial of Service" attack on amd64 systems. I happen to think it is bad to allow a user to bring one of my systems to its knees just by typing
Code:
diff bigfile1 bigfile1
And there seems to be no conventional way to recover the system remotely.

If limits are not set for users, __any__ application can bring the system to its knees, this isn't a diff-specific issue. Lots of programs, if they misbehave and hard limits are not set, will bring down a system. Especially memory intensive ones. That is why multi-user systems almost always have limits put in place to keep users in check. Single user systems rarely do because, obviously, if you do something like this you're only hurting yourself.
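If you just want to cap a single command without touching login.conf, something like this should do it; if I'm reading limits(1) right, -v sets the vmemoryuse resource, and the 2g figure here is only an example:
Code:
limits -v 2g diff bigfile1 bigfile2
The same can be done with your shell's built-in (limit in csh/tcsh, ulimit in sh).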
 
In my mind the diff issue is simply that there doesn't really seem to be a need for gigabytes of context, especially when the input files are identical. The diff from Unix 32V can solve that for me.

I find it much more disturbing that the system degrades so badly when user processes consume too much virtual memory. Look at the resource consumption parallels: a user process in a hard loop ends up at a low priority and can be killed by root (who will still get enough CPU time to do what is needed). A user process that consumes too much disk space will fail to eat the whole disk because the file system keeps 8% for root to use in cleaning up the mess. But filling up swap seems to kill the wrong processes more often than not. The quota system is not a very good solution because either you wind up limiting processes to very low levels (e.g. the really low page quotas in early VMS), or you overcommit resources. Better failure handling would help a lot more than artificially low limits. Actually, I don't recall having this problem with early Unix implementations because as I recall swap space was allocated for all virtual memory in the system (i.e. not as it overflowed physical memory). So you never had more virtual memory than could fit in swap, because the allocation attempt would fail.
 
As everyone explained to you before, use resource limits. If you do, the problem you're reporting will not happen. This is what rlimits are there for. You don't need diff(1) to bring your system to its knees, you can do it yourself by writing a program that mmap(2)s all your virtual memory.
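Something as small as this sketch will do it; it just keeps mmap(2)ing anonymous memory and touching the pages until allocation fails or the system starts thrashing. Only run it on a box you're prepared to bog down:
Code:
/* eatmem.c -- map and touch anonymous memory until it fails */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
        const size_t chunk = 64UL * 1024 * 1024;        /* 64 MB per mapping */
        size_t total = 0;

        for (;;) {
                void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE, -1, 0);
                if (p == MAP_FAILED) {
                        fprintf(stderr, "mmap failed after %zu MB\n",
                            total / (1024 * 1024));
                        return (1);
                }
                memset(p, 0xff, chunk);         /* force the pages to be backed */
                total += chunk;
        }
}
With rlimits in place it dies quickly with "mmap failed"; without them it will chew through memory and swap just like your diff run did.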

Take a look at this; it's a conversation that explains exactly what you're asking.
 
It is pretty clear that process-level limits are not enough: setting vmemoryuse to 8G (on a machine with 8G RAM) still allows a single user to break the system, except that it now takes two processes.

Thanks for the pointer to the info on sysctl vm.overcommit; that seems like the way to build a system that is more resistant to this sort of problem than process-level limits alone can achieve.
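For the archives, it is just a sysctl; as I read the description, setting the low bit makes the kernel account for swap strictly instead of overcommitting, so allocations fail instead of processes getting killed (check the current docs before relying on the exact bit meanings):
Code:
sysctl vm.overcommit=1
echo 'vm.overcommit=1' >> /etc/sysctl.conf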
 