Solved ZFS logging "vdev state changed" every minute. Should I be worried?

I am seeing the following repetition (every minute) of messages in my /var/log/messages log file related to ZFS logging "vdev state changed." Is this normal or should I be worried? Otherwise I am not seeing any issues, but there's very low usage of the single zpool that I have on that machine.

Code:
Jun 17 04:23:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:23:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:24:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:24:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:25:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:25:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:26:43 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:26:43 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:27:48 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:27:48 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:28:44 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:28:44 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:29:44 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:29:44 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:30:40 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:30:40 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:31:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:31:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:32:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:32:46 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:33:42 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:33:42 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:34:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:34:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:35:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:35:47 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
Jun 17 04:36:43 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=11120040499282210423
Jun 17 04:36:43 vmname ZFS: vdev state changed, pool_guid=5479106261899104283 vdev_guid=9582517886689921256
 
I probably should be worried based on my digging in the FreeBSD source tree.

I searched "vdev state changed" in the FreeBSD source tree on GitHub (https://github.com/freebsd/freebsd/search?p=1&q=vdev+state+changed&type=&utf8=✓). Basically the only relevant search result was this one (https://github.com/freebsd/freebsd/...567604707805814a2128a57/etc/devd/zfs.conf#L74).

Search result (/etc/devd/zfs.conf):

Code:
notify 10 {
        match "system" "ZFS";
        match "type" "resource.fs.zfs.statechange";
        action "logger -p kern.notice -t ZFS 'vdev state changed, pool_guid=$pool_guid vdev_guid=$vdev_guid'";
};
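
As an aside: if these messages ever turn out to be harmless noise, I think they could be silenced by dropping a higher-priority override rule into /usr/local/etc/devd/ (as far as I understand devd.conf(5), only the highest-priority matching notify rule gets executed). Untested sketch, and the file name is made up:

Code:
# /usr/local/etc/devd/zfs-statechange-quiet.conf (hypothetical file name)
# Priority 20 outranks the stock priority-10 rule above; the action is a no-op.
notify 20 {
        match "system" "ZFS";
        match "type" "resource.fs.zfs.statechange";
        action "true";
};
# Reload devd afterwards: service devd restart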

That led me to search for "resource.fs.zfs.statechange" in the source tree (https://github.com/freebsd/freebsd/search?utf8=✓&q=resource.fs.zfs.statechange&type=). This led me to two relevant search results:

  1. https://github.com/freebsd/freebsd/...b/opensolaris/uts/common/fs/zfs/zfs_fm.c#L862
  2. https://github.com/freebsd/freebsd/...d3336f6d/cddl/usr.sbin/zfsd/case_file.cc#L418
The first result (/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c):

Code:
/*
 * The 'resource.fs.zfs.statechange' event is an internal signal that the
 * given vdev has transitioned its state to DEGRADED or HEALTHY. This will
 * cause the retire agent to repair any outstanding fault management cases
 * open because the device was not found (fault.fs.zfs.device).
 */
void
zfs_post_state_change(spa_t *spa, vdev_t *vd)
{
        zfs_post_common(spa, vd, FM_RESOURCE_STATECHANGE);
}

The second result (/cddl/usr.sbin/zfsd/case_file.cc):

Code:
} else if (event.Value("class") == "resource.fs.zfs.statechange") {
        RefreshVdevState();
        /*
         * If this vdev is DEGRADED, FAULTED, or UNAVAIL, try to
         * activate a hotspare. Otherwise, ignore the event.
         */
        if (VdevState() == VDEV_STATE_FAULTED ||
            VdevState() == VDEV_STATE_DEGRADED ||
            VdevState() == VDEV_STATE_CANT_OPEN)
                (void) ActivateSpare();
        consumed = true;
}

So basically I should be really worried. :( At this point I cannot put this VM into production.

On a plus note, I am surprised by the fact that I was actually able to narrow down what was going wrong just by searching for the error message.
 
I don't know if this is an issue to be worried about; maybe this is some sort of power-saving mechanism for the drives and ZFS is getting confused by it. One reason I think this may be the case is that while the disks are in use, those messages stop appearing every minute (I just observed a 40-minute period where the messages did not appear, so I am hopeful).

I could be wrong though; either way, I am just going to run the VM for longer and take my chances for now (with frequent backups, that is). I really do not want to configure a different OS for this personal deployment right now; that is an inconvenience I cannot handle at this point in time.
 
You might want to try and isolate the cause a little further. Is hardware triggering this or is it really a ZFS problem? What would be a cause for concern is if you start seeing "ZFS timeout" messages.
* I'd suggest using sysutils/smartmontools to monitor the HDDs.
* If the HDDs have a power-save mode and are thus dropping into standby or such, there are several good solutions for disabling that (including some posted on this forum).
* Of course, zpool scrub is always a good idea (a rough sketch of these steps follows below).
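
Untested sketch; substitute your actual device nodes and pool name (da0, da1 and "yourpool" here are only placeholders):

Code:
# Install smartmontools and inspect the drives' SMART data / state
pkg install smartmontools
smartctl -a /dev/da0
smartctl -a /dev/da1
# Check overall pool health, then start a scrub
zpool status -v
zpool scrub yourpool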
 
...Is hardware triggering this or is it really a ZFS problem? ...
* I'd suggest using sysutils/smartmontools to monitor the HDDs. ...
Are these real hard disks? Judging by user Preetpal's other posts, these are probably virtual (emulated) hard disks, provided by a VM environment. If that's the case, the problem is likely in the VM host, or the interface between that VM host and the FreeBSD guest. One possible explanation might be that the VM host is getting confused by a disk going into power-saving mode, and then wrongly allows ZFS to access the wrong disk (which would lead to the vdev_guid change message, but could also have catastrophic consequences if the next access is a write).
 
Are these real hard disks? Judging by user Preetpal's other posts, these are probably virtual (emulated) hard disks, provided by a VM environment. If that's the case, the problem is likely in the VM host, or the interface between that VM host and the FreeBSD guest. One possible explanation might be that the VM host is getting confused by a disk going into power-saving mode, and then wrongly allows ZFS to access the wrong disk (which would lead to the vdev_guid change message, but could also have catastrophic consequences if the next access is a write).

I am assuming these are virtual disks (they are "data disks" attached to a VM in Azure). The error messages have disappeared on the particular virtual machine though (I detached and reattached the disks and rebooted; I know that is not the right thing to do).
 
In terms of how exactly the virtual data disks were added to a ZFS pool, the following script that I wrote was used during setup (only the first part of the script is relevant; the rest relates to setting up PostgreSQL to use the ZFS dataset):

Code:
# Clean disks (VERY DANGEROUS OPERATION)
dd if=/dev/zero of=/dev/da2 bs=1m count=1
dd if=/dev/zero of=/dev/da3 bs=1m count=1
# OPTIONALLY CREATE PARTITIONS (skipped due to alignment fears)
# Writing GEOM labels to the disks (this relies on the fact that the disks were NOT PARTITIONED, i.e. raw disks are assumed)
# WARNING: THIS WRITES TO LAST SECTOR OF DISK!!!!!
glabel label zfsrawdisk1 /dev/da2
glabel label zfsrawdisk2 /dev/da3
# Create zpool with single vdev (mirror vdev type) containing two disks
zpool create mypool mirror label/zfsrawdisk1 label/zfsrawdisk2
# Create dataset for PostgreSQL in newly created zpool
zfs create mypool/db
# Use lz4 compression on dataset for PostgreSQL
zfs set compression=lz4 mypool/db
# Disable atime on dataset for PostgreSQL
zfs set atime=off mypool/db
# Set record size equal to 16k ((expt 2 14) 16384) on dataset for PostgreSQL
zfs set recordsize=16384 mypool/db
# Install and enable PostgreSQL
pkg install postgresql96-server postgresql96-client postgresql96-contrib
echo 'postgresql_enable="YES"' >> /etc/rc.conf
# Make "postgres" user and "postgres" group the owner of ZFS dataset
chown postgres:postgres /mypool/db
# Initialize PostgreSQL to utilize ZFS dataset as storage
su postgres -c 'initdb --no-locale -E UTF8 -D /mypool/db'
su postgres -c 'pg_ctl -D /mypool/db -l /var/db/postgres/logfile start'
su postgres -c 'ln -s /mypool/db /var/db/postgres/data96'
### Step 2 — Pointing to the New Data Location
# PostgreSQL has several ways to override configuration values. By default, the data_directory is set to /var/lib/postgresql/9.5/main in the /etc/postgresql/9.5/main/postgresql.conf file. Edit this file to reflect the new data directory:
# sudo nano /etc/postgresql/9.5/main/postgresql.conf
# Find the line that begins with data_directory and change the path which follows to reflect the new location.
# In our case, the updated file looks like the output below:
# /etc/postgresql/9.5/main/postgresql.conf
# . . .
# data_directory = '/mnt/volume-nyc1-01/postgresql/9.5/main'
# . . .
# su - postgres
# vim /mypool/db/postgresql.conf
## Change data_directory to /mypool/db
#### WARNING: Make the following changes to /var/db/postgres/data96/pg_hba.conf
## # TYPE  DATABASE        USER            ADDRESS                 METHOD
##
## # "local" is for Unix domain socket connections only
## local   all             all                                     peer
## # IPv4 local connections:
## host    all             all             127.0.0.1/32            md5
## # IPv6 local connections:
## host    all             all             ::1/128                 md5
## # Allow replication connections from localhost, by a user with the
## # replication privilege.
## #local   replication     postgres                                trust
## #host    replication     postgres        127.0.0.1/32            trust
## #host    replication     postgres        ::1/128                 trust

## The following command is not needed since server is already running.
# service postgresql start
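
For what it's worth, a quick way to sanity-check the result of the first part of the script is something like the following (pool and label names as used above):

Code:
glabel status                                   # label/zfsrawdisk1 and label/zfsrawdisk2 should both be listed
zpool status mypool                             # the mirror vdev should be ONLINE with both labels
zfs get compression,atime,recordsize mypool/db  # should show lz4, off, 16K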
 
Interesting. Your global and ZFS setup is very very boring ... so boring that there is likely nothing wrong with it.

Do you have any idea what these disks da2 and da3 really are? You say they are some sort of virtual disk, which Azure provisions for you. Are they actually individual physical devices, or are they emulated block devices that use some sort of cloud storage infrastructure within Azure? If they are emulated, do you actually know what their availability and reliability characteristics are?

I just discovered something funny: I get the same message "vdev state changed, pool_guid=... vdev_guid=..." in my log, every night at exactly 01:15 in the morning. That's exactly when a "zpool scrub" starts. Could it be that you are running a zpool scrub in the background, and the message repeats every time the scrub is interrupted by foreground workload? That should be easy to test.
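
For example, something along these lines should show whether periodic(8) is scheduling scrubs and when the last one was started (untested, and I am going from memory on the periodic.conf knob name):

Code:
# Is the nightly periodic scrub enabled, and with what interval?
grep daily_scrub_zfs /etc/defaults/periodic.conf /etc/periodic.conf 2>/dev/null
# When were scrubs started on the pool, and is one running right now?
zpool history mypool | grep scrub
zpool status mypool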
 
Interesting. Your global and ZFS setup is very very boring ... so boring that there is likely nothing wrong with it.

Sometimes it's better to start simple, and only increase complexity when needed :). I agree the configuration is boring.

Do you have any idea what these disks da2 and da3 really are? You say they are some sort of virtual disk, which Azure provisions for you. Are they actually individual physical devices, or are they emulated block devices that use some sort of cloud storage infrastructure within Azure? If they are emulated, do you actually know what their availability and reliability characteristics are?

The data disks "are registered as SCSI drives" and are "backed by HDDs" (see: https://docs.microsoft.com/en-us/azure/storage/storage-about-disks-and-vhds-linux).

I just discovered something funny: I get the same message "vdev state changed, pool_guid=... vdev_guid=..." in my log, every night at exactly 01:15 in the morning. That's exactly when a "zpool scrub" starts. Could it be that you are running a zpool scrub in the background, and the message repeats every time the scrub is interrupted by foreground workload? That should be easy to test.

It's possible; the messages have stopped. Notably, I manually ran a zpool scrub upon discovering them.
 
I have actually had the VM in question running this entire time (June 2017 - September 2017) and have not encountered this issue since I reported that the messages had stopped. I believe that the messages stopped immediately after I scrubbed my zpool (zpool scrub myzpool).

I believe ralphbsz was right about these messages being related to zpool scrubs being started but then interrupted (though I don't really have any evidence other than the fact that the messages stopped appearing after scrubbing my zpool).
 