Scrubbing of zfs pools: too big

dvl@

Aspiring Daemon
Developer

Reaction score: 65
Messages: 551

I noticed this in the 'daily run output' email for a FreeBSD 9.1-RELEASE-p7 system.

The part I am concerned about is the 'too big' message. Looking back, this message first appeared on 27 October 2013.

Code:
Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 5 days ago, threshold is set to 7 days
 too big
On 31 Oct, it was slightly different:

Code:
Scrubbing of zfs pools:
   starting scrub of pool 'system':
      consult 'zpool status system' for the result
0): message file too big
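For reference, this daily scrub check comes from /etc/periodic/daily/800.scrub-zfs, and the "threshold is set to 7 days" message is governed by periodic.conf(5) variables. A sketch of the relevant knobs (the values shown are illustrative, not read from this machine):

```shell
# /etc/periodic.conf -- knobs for the daily ZFS scrub check (illustrative values)
daily_scrub_zfs_enable="YES"            # run the daily scrub check at all
daily_scrub_zfs_pools=""                # empty means: check all imported pools
daily_scrub_zfs_default_threshold="7"   # days between scrubs (the "7 days" above)
```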
FYI

Code:
$ zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
system  16.2T   546G  15.7T     3%  1.00x  ONLINE  -
 

usdmatt

Daemon

Reaction score: 522
Messages: 1,402

What does the actual output of # zpool status say?

This 'too big' message doesn't really seem to make any sense to me in relation to ZFS.
Is it possible this is something to do with the daily status message itself being too big or some other issue with periodic, rather than ZFS?
 
dvl@ (OP)

Code:
$ zpool status
  pool: system
 state: ONLINE
  scan: scrub repaired 0 in 0h46m with 0 errors on Fri Nov  1 03:57:10 2013
config:

        NAME           STATE     READ WRITE CKSUM
        system         ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0

errors: No known data errors
 
dvl@ (OP)

usdmatt said:
This 'too big' message doesn't really seem to make any sense to me in relation to ZFS.
Is it possible this is something to do with the daily status message itself being too big or some other issue with periodic, rather than ZFS?
The email itself is about 32 KB and ends with:

Code:
Local system status:
 3:01AM  up 3 days,  3:44, 0 users, load averages: 1.01, 0.51, 0.43

Mail in local queue:
Mail queue is empty

Security check:
    (output mailed separately)

Checking for denied zone transfers (AXFR and IXFR):

Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 5 days ago, threshold is set to 7 days
 too big

SMART status:
Checking health of /dev/ada0: OK
Checking health of /dev/ada1: OK
Checking health of /dev/ada2: OK
Checking health of /dev/ada3: OK
Checking health of /dev/ada4: OK
Checking health of /dev/ada5: OK

-- End of daily output --
... so it all seems to be there.
 

User23

Well-Known Member

Reaction score: 67
Messages: 495

I guess the message
Code:
0): message file too big
does not come from ZFS.

It looks more like:

Code:
postfix/sendmail[30680]: fatal: root(0): Message file too big
 
dvl@ (OP)

Hmmm, could this be the list of files with negative group permissions? (I think that's the term.)
 

usdmatt

Is root mail finding its way into /var/mail/root and growing to an unmanageable size?
 
dvl@ (OP)

usdmatt said:
Is root mail finding its way into /var/mail/root and growing to an unmanageable size?
No, all files in /var/mail/root have size = 0.
 
dvl@ (OP)

I noticed that the 'daily security run output' email did not arrive today for this server. The last time it arrived was 27 October 2013. In that email, there were 19 files listed under 'Checking negative group permissions'.

That title is produced by the script /etc/periodic/security/110.neggrpperm.

Let's run that script manually:

Code:
# /etc/periodic/security/110.neggrpperm > /usr/local/NegativeGrouptPermissions
# ls -l /usr/local/NegativeGrouptPermissions
-rw-r--r--  1 root  wheel  38 Nov 10 19:23 /usr/local/NegativeGrouptPermissions
# cat /usr/local/NegativeGrouptPermissions

Checking negative group permissions:
#
Oh. Let's try another approach:

Code:
# /etc/periodic/security/110.neggrpperm > /usr/local/NegativeGrouptPermissions  2>&1
# ls -l /usr/local/NegativeGrouptPermissions
-rw-r--r--  1 root  wheel  20293230 Nov 10 19:36 /usr/local/NegativeGrouptPermissions
# wc -l /usr/local/NegativeGrouptPermissions
  113397 /usr/local/NegativeGrouptPermissions
#
That is a 19 MB output.
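So the first capture was only 38 bytes because the file list apparently goes to stderr, and a plain '>' redirects stdout only; '2>&1' is needed to catch both streams. A minimal illustration (the function is a stand-in, not the real script):

```shell
# '>' captures stdout only; stderr needs '2>&1' as well.
emit() {
    echo "header"      # written to stdout
    echo "detail" >&2  # written to stderr
}
emit > /tmp/cap1 2>/dev/null   # captures stdout only: 7 bytes ("header\n")
emit > /tmp/cap2 2>&1          # captures both streams: 14 bytes
wc -c /tmp/cap1 /tmp/cap2
```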

Looking in the mail logs I see:

Code:
Nov 10 03:14:11 slocum postfix/postdrop[29688]: warning: uid=0: File too large
Nov 10 03:14:11 slocum postfix/sendmail[29656]: fatal: root(0): message file too big
Nov 10 03:14:15 slocum postfix/pickup[67935]: 1AF0A1224D: uid=0 from=<root>
It seems to be Postfix that is rejecting the message.

Code:
# postconf  |grep message_size_limit
message_size_limit = 10240000
That's 10 MB. Let's bump that up to 32 MB, restart Postfix, and wait for tomorrow. I also bumped this value up on the mail servers between this server and my mailbox.
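Raising the limit looks something like this (a sketch; 32768000 bytes is roughly 32 MB, and the exact rc script name for Postfix can vary with how it was installed):

```shell
# Raise Postfix's per-message size cap from 10 MB to ~32 MB
postconf -e 'message_size_limit = 32768000'
# Confirm the new value took effect in main.cf
postconf message_size_limit
# Tell the running Postfix to pick up the change
service postfix reload
```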

FYI, I am positive this is the cause, and that it's related to an issue I raised in a previous blog post.

Offhand, I think this 'feature' needs a slight enhancement. Perhaps the script could return only the first 100 or so matches, along with details on how to obtain the full list.
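The kind of truncation I have in mind could be sketched like this (hypothetical; `list_matches` here just emits dummy lines as a stand-in for the real find(1) pipeline in 110.neggrpperm):

```shell
#!/bin/sh
# Sketch: cap the report at the first 100 matches and say how many were
# omitted, so the periodic security mail stays small.
MAXLINES=100
list_matches() {
    seq 1 150    # stand-in for the real check's output
}
matches=$(list_matches)
total=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
printf '%s\n' "$matches" | head -n "$MAXLINES"
if [ "$total" -gt "$MAXLINES" ]; then
    echo "(and $((total - MAXLINES)) more; run /etc/periodic/security/110.neggrpperm for the full list)"
fi
```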
 
dvl@ (OP)

The cause of the problem has been confirmed. The 'daily security run output' email arrived today for this server. Yes, it was about 19 MB in size.

Going back to the original post, the 'daily run output' email looks like this (i.e. '0): message file too big' no longer appears):

Code:
Local system status:
 3:01AM  up 2 days,  4:22, 0 users, load averages: 1.26, 0.55, 0.35

Mail in local queue:
Mail queue is empty

Security check:
    (output mailed separately)

Checking for denied zone transfers (AXFR and IXFR):

Scrubbing of zfs pools:
   skipping scrubbing of pool 'system':
      last scrubbing is 2 days ago, threshold is set to 7 days

SMART status:
Checking health of /dev/ada0: OK
Checking health of /dev/ada1: OK
Checking health of /dev/ada2: OK
Checking health of /dev/ada3: OK
Checking health of /dev/ada4: OK
Checking health of /dev/ada5: OK

-- End of daily output --
FYI, the server was running 9.1 when this thread started, but was recently upgraded to 9.2 (I don't think that affected this problem).
 