ZFS silent file truncation on write when disk is filling up

I found errors in my Postgres database; the cause was an index file that had been written about half a gigabyte too short. Recreating the index solved the problem.

There were absolutely no errors reported when the damage happened; errors appeared only later, when the database found it could not properly read the index.
The most likely cause of the damage is that the filesystem ran over its configured quota.

Disregarding the fact that the filesystem should have been configured with enough space to allow all operations, I consider it very worrisome that there is no warning or error whatsoever when a file is not written in full and payload data is discarded. How can this be fixed?

A full description is here.
 
Read your description. It is lacking lots of detail.

Are you claiming that ZFS accepted a write() system call, returned normally, did not return an error like ENOSPC, but did not write the data (in effect silently truncating the file)? Or are you claiming that Postgres ignored an error from the file system? In the first case, ZFS has a serious bug (sounds unlikely); in the second case, Postgres has a serious bug (also sounds unlikely). You ask "how can this be fixed"; the answer to that depends completely on whether the problem is in ZFS or in Postgres (or somewhere else).
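For reference, here is roughly the contract in question: a minimal sketch in C of what a careful caller has to do (not Postgres's actual code). Both a -1 return and a short write count have to be treated as problems, and conditions like ENOSPC or EDQUOT arrive through errno.

    /* Minimal sketch, not Postgres code: write a buffer fully or fail. */
    #include <errno.h>
    #include <unistd.h>

    static int write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;         /* interrupted, just retry */
                return -1;            /* e.g. ENOSPC or EDQUOT */
            }
            buf += n;                 /* short write: advance and retry */
            len -= (size_t)n;
        }
        return 0;
    }

If the caller ignores the -1 or the short count, the file ends up truncated with no further warning; if the call reported the full count but the data was never written, that would be the file-system bug described above.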
 
I found errors in my Postgres database; the cause was an index file that had been written about half a gigabyte too short. Recreating the index solved the problem.

There were absolutely no errors reported when the damage happened; errors appeared only later, when the database found it could not properly read the index.
The most likely cause of the damage is that the filesystem ran over its configured quota.

Disregarding the fact that the filesystem should have been configured with enough space to allow all operations, I consider it very worrisome that there is no warning or error whatsoever when a file is not written in full and payload data is discarded. How can this be fixed?

A full description is here.

Did you set a quota, or a refquota?

If quota, do you have snapshots?

If quota+snapshots, assumptions like "I can rewrite a file on a full disk, so long as it is not growing" may not hold true. Compression complicates the matter further.
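To make that concrete with assumed numbers (not taken from your system): suppose a dataset has quota=10G and holds one 4G file, and a snapshot is taken. Rewriting that 4G file in place cannot free the old blocks, because the snapshot still references them, so the rewrite needs roughly 4G of new space and the dataset's charged usage heads towards 8G even though the file itself did not grow. With compression the numbers shift again, because the charged size depends on how well each version of the data compresses.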
 
Read your description. It is lacking lots of detail.

Are you claiming that ZFS accepted a write() system call, returned normally, did not return an error like ENOSPC, but did not write the data (in effect silently truncating the file)? Or are you claiming that Postgres ignored an error from the file system? In the first case, ZFS has a serious bug (sounds unlikely); in the second case, Postgres has a serious bug (also sounds unlikely). You ask "how can this be fixed"; the answer to that depends completely on whether the problem is in ZFS or in Postgres (or somewhere else).

Exactly that is the question. I am not claiming anything (claiming is a business for lawyers, and I am tired of the increasing lawyer talk in IT); I only see that something happened which quite certainly should not happen. Beyond that, I agree with your description: either ZFS did not execute the writes and also did not return an error from the syscalls, or Postgres did not honour the errors and roll back the transaction. Either of these would be wrong behaviour.
The problem is that, forensically, I know of no way to figure out which it was. Therefore I posted to both parties.

Concerning details, I could post bulk system configuration, but currently I do not see which parts of it would actually be useful.

Did you set a quota, or a refquota?

If quota, do you have snapshots?

If quota+snapshots, assumptions like "I can rewrite a file on a full disk, so long as it is not growing" may not hold true. Compression complicates the matter further.

Set quota, no refquota, no snapshots, no compression.

And, the issue is *not* successfully rewriting a file on an almost full disk. The issue is that the data probably *did* grow over the quota and *should* have run into an error. In such a case the error must be caught at the application level, which here means the database is expected to roll back the current transaction and then (more or less) stall operation. This did not happen; instead, the database closed the transaction without error and tried to continue working on a broken file, thereby losing payload data.

So, just as ralphbsz described: either ZFS did not properly report the error condition up to the application, or the database did not properly recognize and react to it.
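One more point, as an assumption on my side rather than something I can show happened here: with buffered writes, some filesystems only report an out-of-space condition when the dirty data is actually flushed, so a caller that checks write() but ignores the return value of fsync() could still lose data without any visible error. A minimal sketch of that check:

    /* Hedged sketch: treat a failed fsync() like a failed write().
     * Whether ZFS ever defers ENOSPC/EDQUOT to this point is an
     * assumption for illustration, not an established fact here. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int flush_or_fail(int fd)
    {
        if (fsync(fd) != 0) {
            fprintf(stderr, "fsync failed: %s\n", strerror(errno));
            return -1;            /* caller must roll back, not carry on */
        }
        return 0;
    }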

Allow for some 'extra' space. Remember that ZFS is a Copy-on-Write filesystem.

Yes, I know this. But sometimes data grows, some configuration is not adjusted beforehand, and things run into limits. That is why we use databases with transactional guarantees: so that in such a case operation is stopped in a defined fashion, without data loss. If our configuration were always perfect, there would probably be no need for transactional guarantees at all.
 
Given the lack of evidence, and unless someone else has experienced the same problem, I fear this one will probably remain a mystery. It is very hard to imagine that a file system that is heavily used and written by competent professionals would do something as stupid as accept a write, drop the data on the floor, and not say a word about it. ZFS doesn't do nonsense like that. Similarly, it is unimaginable that a database as old ("experienced"), well-engineered and battle-tested as Postgres would quietly ignore an error return and plow on. Debugging what really went wrong would need really good information collected at the time the problem occurred ... but that was when the write ran into the quota, and the damage was only found much later. If those arguments are correct, then (a) the problem is very unlikely to exist, and (b) it cannot be debugged. Sad.
 