Scrubbing of zfs pools: too big

dvl@

Aspiring Daemon
Developer

Reaction score: 65
Messages: 551

I noticed this in the 'daily run output' email for a FreeBSD 9.1-RELEASE-p7 system.

The part I am concerned about is the 'too big' message. Looking back, this message first appeared on 27 October 2013.

Code:
Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 5 days ago, threshold is set to 7 days
 too big
On 31 Oct, it was slightly different:

Code:
Scrubbing of zfs pools:
   starting scrub of pool 'system':
      consult 'zpool status system' for the result
0): message file too big
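For reference, this daily scrub check comes from /etc/periodic/daily/800.scrub-zfs, and the "threshold is set to 7 days" message is governed by periodic.conf(5) variables. A sketch of the relevant knobs (the values shown are illustrative, not read from this machine):

```shell
# /etc/periodic.conf -- knobs for the daily ZFS scrub check (illustrative values)
daily_scrub_zfs_enable="YES"            # run the daily scrub check at all
daily_scrub_zfs_pools=""                # empty means: check all imported pools
daily_scrub_zfs_default_threshold="7"   # days between scrubs (the "7 days" above)
```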
FYI

Code:
$ zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
system  16.2T   546G  15.7T     3%  1.00x  ONLINE  -
 

usdmatt

Daemon

Reaction score: 522
Messages: 1,402

What does the actual output of # zpool status say?

This 'too big' message doesn't really seem to make any sense to me in relation to ZFS.
Is it possible this is something to do with the daily status message itself being too big or some other issue with periodic, rather than ZFS?
 
dvl@ (OP)

Code:
$ zpool status
  pool: system
 state: ONLINE
  scan: scrub repaired 0 in 0h46m with 0 errors on Fri Nov  1 03:57:10 2013
config:

        NAME           STATE     READ WRITE CKSUM
        system         ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0

errors: No known data errors
 
dvl@ (OP)

usdmatt said:
This 'too big' message doesn't really seem to make any sense to me in relation to ZFS.
Is it possible this is something to do with the daily status message itself being too big or some other issue with periodic, rather than ZFS?
The email itself is about 32 KB and ends with:

Code:
Local system status:
 3:01AM  up 3 days,  3:44, 0 users, load averages: 1.01, 0.51, 0.43

Mail in local queue:
Mail queue is empty

Security check:
    (output mailed separately)

Checking for denied zone transfers (AXFR and IXFR):

Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 5 days ago, threshold is set to 7 days
 too big

SMART status:
Checking health of /dev/ada0: OK
Checking health of /dev/ada1: OK
Checking health of /dev/ada2: OK
Checking health of /dev/ada3: OK
Checking health of /dev/ada4: OK
Checking health of /dev/ada5: OK

-- End of daily output --
... so it all seems to be there.
 

User23

Well-Known Member

Reaction score: 67
Messages: 495

I guess the message
Code:
0): message file too big
does not come from ZFS.

It looks more like:

Code:
postfix/sendmail[30680]: fatal: root(0): Message file too big
 
dvl@ (OP)

Hmmm, could this be the list of files with negative group permissions? (I think that's the term.)
 

usdmatt

Is root mail finding its way into /var/mail/root and growing to an unmanageable size?
 
dvl@ (OP)

usdmatt said:
Is root mail finding its way into /var/mail/root and growing to an unmanageable size?
No, all files in /var/mail/root have size = 0.
 
dvl@ (OP)

I noticed that the 'daily security run output' email did not arrive today for this server. The last time it arrived was 27 October 2013. In that email, there were 19 files listed under 'Checking negative group permissions'.

That title is produced by the script /etc/periodic/security/110.neggrpperm.

Let's run that script manually:

Code:
# /etc/periodic/security/110.neggrpperm > /usr/local/NegativeGrouptPermissions
# ls -l /usr/local/NegativeGrouptPermissions
-rw-r--r--  1 root  wheel  38 Nov 10 19:23 /usr/local/NegativeGrouptPermissions
# cat /usr/local/NegativeGrouptPermissions

Checking negative group permissions:
#
Oh. Let's try another approach:

Code:
# /etc/periodic/security/110.neggrpperm > /usr/local/NegativeGrouptPermissions  2>&1
# ls -l /usr/local/NegativeGrouptPermissions
-rw-r--r--  1 root  wheel  20293230 Nov 10 19:36 /usr/local/NegativeGrouptPermissions
# wc -l /usr/local/NegativeGrouptPermissions
  113397 /usr/local/NegativeGrouptPermissions
#
That is a 19 MB output.
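So the first capture was only 38 bytes because the file list apparently goes to stderr, and a plain '>' redirects stdout only; '2>&1' is needed to catch both streams. A minimal illustration (the function is a stand-in, not the real script):

```shell
# '>' captures stdout only; stderr needs '2>&1' as well.
emit() {
    echo "header"      # written to stdout
    echo "detail" >&2  # written to stderr
}
emit > /tmp/cap1 2>/dev/null   # captures stdout only: 7 bytes ("header\n")
emit > /tmp/cap2 2>&1          # captures both streams: 14 bytes
wc -c /tmp/cap1 /tmp/cap2
```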

Looking in the mail logs I see:

Code:
Nov 10 03:14:11 slocum postfix/postdrop[29688]: warning: uid=0: File too large
Nov 10 03:14:11 slocum postfix/sendmail[29656]: fatal: root(0): message file too big
Nov 10 03:14:15 slocum postfix/pickup[67935]: 1AF0A1224D: uid=0 from=<root>
It seems to be Postfix that is rejecting the message.

Code:
# postconf  |grep message_size_limit
message_size_limit = 10240000
That's 10 MB. Let's bump that up to 32 MB, restart Postfix, and wait for tomorrow. I also bumped this value up on the mail servers between this server and my mailbox.
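Raising the limit looks something like this (a sketch; 32768000 bytes is roughly 32 MB, and the exact rc script name for Postfix can vary with how it was installed):

```shell
# Raise Postfix's per-message size cap from 10 MB to ~32 MB
postconf -e 'message_size_limit = 32768000'
# Confirm the new value took effect in main.cf
postconf message_size_limit
# Tell the running Postfix to pick up the change
service postfix reload
```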

FYI, I am positive this is the cause, and that it's related to an issue I raised in a previous blog post.

Offhand, I think this 'feature' needs a slight enhancement. Perhaps the script could return only the first 100 or so matches, along with details on how to obtain the full list.
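The kind of truncation I have in mind could be sketched like this (hypothetical; `list_matches` here just emits dummy lines as a stand-in for the real find(1) pipeline in 110.neggrpperm):

```shell
#!/bin/sh
# Sketch: cap the report at the first 100 matches and say how many were
# omitted, so the periodic security mail stays small.
MAXLINES=100
list_matches() {
    seq 1 150    # stand-in for the real check's output
}
matches=$(list_matches)
total=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
printf '%s\n' "$matches" | head -n "$MAXLINES"
if [ "$total" -gt "$MAXLINES" ]; then
    echo "(and $((total - MAXLINES)) more; run /etc/periodic/security/110.neggrpperm for the full list)"
fi
```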
 
dvl@ (OP)

The cause of the problem has been confirmed. The 'daily security run output' email arrived today for this server. Yes, it was about 19 MB in size.

Going back to the original post, the 'daily run output' email looks like this (i.e. '0): message file too big' no longer appears):

Code:
Local system status:
 3:01AM  up 2 days,  4:22, 0 users, load averages: 1.26, 0.55, 0.35

Mail in local queue:
Mail queue is empty

Security check:
    (output mailed separately)

Checking for denied zone transfers (AXFR and IXFR):

Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 2 days ago, threshold is set to 7 days

SMART status:
Checking health of /dev/ada0: OK
Checking health of /dev/ada1: OK
Checking health of /dev/ada2: OK
Checking health of /dev/ada3: OK
Checking health of /dev/ada4: OK
Checking health of /dev/ada5: OK

-- End of daily output --
FYI, the server was running 9.1 when this thread started, but was recently upgraded to 9.2 (I don't think that affected this problem).
 