[Solved] ZFS utilization

I have the following ZFS setup:

pool portal: 6 x 2TB drives in raidz2 + an SSD read cache
pool temple: created on top of the portal pool, encrypted.

portal was created using the following command and now has the following status:

Code:
# zpool create portal raidz2 /dev/da0.nop /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da6 cache /dev/da5

# zpool status portal
  pool: portal
 state: ONLINE
  scan: scrub repaired 0 in 25h6m with 0 errors on Sat Feb  1 01:29:35 2014
config:

	NAME        STATE     READ WRITE CKSUM
	portal      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	cache
	  da6       ONLINE       0     0     0

errors: No known data errors
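
As an aside, /dev/da0.nop in the create command is normally a gnop(8) provider set up beforehand to force a 4K sector size (so ZFS picks ashift=12). The gnop command isn't shown in the post; an assumed, typical invocation would be:

Code:
# gnop create -S 4096 /dev/da0
#
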
As mentioned above, I've created an encrypted pool on top of it using the following commands (the geli commands are omitted here):

Code:
# zfs create -V 4T portal/bolt00
# zpool create temple /dev/zvol/portal/bolt00.eli

# zpool status temple
  pool: temple
 state: ONLINE
  scan: scrub repaired 0 in 14h19m with 0 errors on Mon Feb  3 12:52:22 2014
config:

	NAME                      STATE     READ WRITE CKSUM
	temple                    ONLINE       0     0     0
	  zvol/portal/bolt00.eli  ONLINE       0     0     0

errors: No known data errors
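
For completeness, the omitted geli steps (which sit between the zfs create and the zpool create above) would typically look something like this - a sketch only, assuming a plain passphrase-based setup rather than the options actually used:

Code:
# geli init -s 4096 /dev/zvol/portal/bolt00
# geli attach /dev/zvol/portal/bolt00
#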

Now my problem is with the utilization of the portal pool. What I don't understand is why the utilization of portal increases when I write something to the temple pool. Currently I have the following utilization status:

Code:
# zfs list portal temple
NAME     USED  AVAIL  REFER  MOUNTPOINT
portal  7.13T  2.91G   288K  none
temple  3.09T   831G   144K  none
#

But I can't write anything big to temple because portal's free space drops to zero. Why? portal/bolt00 was not created as a sparse volume, so I'd expect the free space reported by temple to be the actual free space.
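
For reference, the only difference between a thick and a thin (sparse) zvol is the -s flag at creation time: a thick zvol gets refreservation set to its volsize, a sparse one doesn't. A quick way to compare would be to create a sparse sibling and look at the properties (the bolt00_sparse name is hypothetical, just for illustration):

Code:
# zfs create -s -V 4T portal/bolt00_sparse
# zfs get volsize,volblocksize,refreservation portal/bolt00 portal/bolt00_sparse
#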

What is even worse, when I noticed this issue I still had around 50GB of free space on the portal pool. During my tests I created some large files with dd on the temple filesets (a couple of 10GB files, to watch the utilization on both pools). After the tests, when I removed the test files from the temple filesets, the utilization of portal didn't come back down - it stayed full, as shown in my example.

The question is: but why?

I've noticed one more strange behavior. I have the following filesystem:

Code:
# zfs list temple/tbout
NAME           USED  AVAIL  REFER  MOUNTPOINT
temple/tbout   137G   865G   137G  /local/spool/tb/out

# mount |grep out
temple/tbout on /local/spool/tb/out (zfs, local, noatime, nosuid, nfsv4acls)
#

# df -m /local/spool/tb/out
Filesystem   1M-blocks   Used  Avail Capacity  Mounted on
temple/tbout   1026861 140725 886136    14%    /local/spool/tb/out
#
The utilization of portal depends on the files being written here. Again, why? :/
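
One way to break down where the space on the zvol side is going would be something like this (commands only, output omitted):

Code:
# zfs list -o space portal/bolt00
# zfs get used,referenced,refreservation,usedbyrefreservation portal/bolt00
#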
 
Re: ZFS utilization

Hm, interesting. Thanks for pointing that out. I'll have a look at that video to better understand what's going on. I also checked your thread. One thing still gives me a headache, though. There's a waste factor to consider, sure - but why doesn't the utilization of portal go down when I erase data on temple?

Unfortunately, yesterday one process needed space on temple and tried to push data that was several times bigger than the free space on portal. Now I have the following issue:
Code:
# zpool status temple
  pool: temple
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub canceled on Sat Apr  5 03:14:34 2014
config:

	NAME                      STATE     READ WRITE CKSUM
	temple                    ONLINE       0     0     0
	  zvol/portal/bolt00.eli  ONLINE       0     0   124

errors: No known data errors
#
Btw, checksumming is turned off on temple (as it resides on portal, where the actual checksumming is done, I thought it was redundant).
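
(For reference, the property can be checked or turned back on like this; changing it only affects blocks written afterwards:)

Code:
# zfs get checksum temple
# zfs set checksum=on temple
#
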
I've issued a scrub on portal to see if I hit an error there; so far so good:
Code:
# zpool status portal
  pool: portal
 state: ONLINE
  scan: scrub in progress since Sat Apr  5 03:17:14 2014
        7.24T scanned out of 10.7T at 119M/s, 8h26m to go
        0 repaired, 67.63% done
config:

	NAME        STATE     READ WRITE CKSUM
	portal      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	cache
	  da6       ONLINE       0     0     0

errors: No known data errors
#
I'll need to figure out where I can copy the data from temple so I can move it off. I'm still confused by the utilization reports:
Code:
# zpool list portal temple
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
portal  10.9T  10.7T   180G    98%  1.00x  ONLINE  -
temple  3.97T  2.97T  1022G    74%  1.00x  ONLINE  -
#
# zfs list portal temple
NAME     USED  AVAIL  REFER  MOUNTPOINT
portal  7.13T  4.32G   288K  none
temple  2.97T   958G   144K  none
#
When I look at portal:
Code:
# zfs list -r -d 1 portal
NAME            USED  AVAIL  REFER  MOUNTPOINT
portal         7.13T  4.32G   288K  none
portal/bolt00  7.07T  4.32G  7.07T  -
portal/jails   4.23G  4.32G   631K  /local/jails
portal/spool   12.5G  4.32G   408K  /local/spool
portal/vbox    36.7G  4.32G  3.51G  /local/vbox
#
I see that out of 12TB of raw space I'm able to use 7.13TB. If I got it right, my 4TB zvol, which actually holds only about 3TB of real data (measured with du from the top temple directory):
Code:
# du -sgx * |awk 'BEGIN{c=0;}{c+=$1;}END{print c;}'
2950
#
has already eaten 7TB of it? Phew... that's something I really didn't know.
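
The zvol's block size and reservation seem like the first properties worth checking here, something like the following (output omitted):

Code:
# zfs get volsize,volblocksize,used,referenced,refreservation portal/bolt00
#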
 
Re: ZFS utilization

Most probably it's a snapshot holding the space from being reclaimed. What's the output of:
# zfs list -t snapshot
?

/Sebulon
 
Re: ZFS utilization

Nope, no snapshot of any kind related to this issue:

Code:
# zfs list -t snapshot |grep -E 'portal|temple'
portal/jails/basejail@20130116_21:54:50   224K      -   288K  -
portal/jails/basejail@20140204_09:50:37  4.07M      -   418M  -
portal/vbox/vbsun01@freshcfg              278M      -  3.17G  -
portal/vbox/vbsus01@clonready             220M      -  6.04G  -
portal/vbox/vbsus02@migration             195M      -  6.06G  -
temple/backup@sync                           0      -   160K  -
temple/backup/archives@sync                  0      -   160K  -
temple/backup/archives/clients@sync          0      -  2.77G  -
temple/backup/archives/webs@sync             0      -   208K  -
temple/backup/clients@sync                   0      -   160K  -
temple/backup/clients/hpcoe@sync             0      -  9.94G  -
temple/backup/clients/shrek@sync             0      -  2.87G  -
temple/configs@sync                          0      -   144K  -
temple/configs/flexget@sync                 8K      -   508K  -
temple/configs/rt@sync                       0      -   144K  -
temple/dox@sync                              0      -  36.6G  -
#

The @sync snapshots were taken after I discovered this issue, when I moved that data to another system/pool. I even rebooted the machine to see if the free space would be reclaimed - nope.
The (wasted) space I'm talking about appeared when I tried to push ~50GB onto one of the temple filesets and hit 100% utilization on portal. After I removed those files, temple's utilization went down; portal's didn't.

The scrub on portal has already finished with no errors. I'm migrating the data to other storage so I can remove temple.

I had time today to watch the video you posted and went through your tests. It's really interesting - I'll run the same tests you did (e.g. 512GB zvols, each created with a different block size, the same data on each, and a comparison of the wasted space).
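
A possible layout for that test - create two zvols that differ only in block size, put the same data set on both, then compare the space consumed. Names, sizes and the 64K value below are placeholders, not tested recommendations:

Code:
# zfs create -V 512G -o volblocksize=8K portal/test8k
# zfs create -V 512G -o volblocksize=64K portal/test64k
# zfs list -o name,volsize,volblocksize,used,refer portal/test8k portal/test64k
#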

EDIT: So I finally moved all the data and was able to destroy the temple pool. It will take some time for the portal pool to reclaim all that space - it's going slowly but surely.

Code:
# zfs list -o space portal
NAME    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
portal  1.74T  5.39T         0    288K              0      5.39T
#

I've read some articles regarding the problem of that wasted space.
These two gave me the general idea of why I lost that space after I removed the files [[ in case those links stop working at some point: the problem is that the underlying pool (portal) does not know that the space was released. I guess it would require something like "claim zero pages", which is supported on some high-end storage arrays ]]

http://nex7.blogspot.sk/2013/03/zvol-used-space.html
http://nex7.blogspot.sk/2013/03/reserva ... ation.html

Once I have the free space back, I'll do some testing and share the results. So far, thanks @Sebulon for steering me in the right direction.
 
@Sebulon Thanks! :)

For anyone who might hit the same issue, here are the answers to my questions:

Q: Why did I run out of space?
A: The block size of the underlying zvol was left at the default (8K). As a result, the total space consumed by the zvol ended up being more than 2x the upper (temple) pool's utilization (see the sketch after these answers).

Q: Why did I hit 100% on the underlying pool (portal) when the portal/bolt00 zvol was given a size of only 4TB?
A: Because when a zvol is created thick (i.e. not thin/sparse), refreservation is set to volsize. But that does not mean the zvol itself can't consume more space from the pool it's created on - and it had to, once data was written to the temple datasets.

Q: After my tests, when the files were purged, why did I still see 100% utilization on the underlying pool?
A: There's no mechanism like "claim zero pages" (similar to SCSI UNMAP/TRIM) for these zvols. The zvol didn't know the space had been freed, so although temple's utilization decreased, portal's utilization stayed where it was. Combined with the 8K block size, this was a disaster waiting to happen.
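
For anyone building a similar layered setup: volblocksize has to be chosen when the zvol is created, it can't be changed afterwards. A sketch of a less wasteful creation - the 64K value is an assumption to illustrate the knob, not a tested recommendation:

Code:
# zfs create -V 4T -o volblocksize=64K portal/bolt00
# zfs get volsize,volblocksize,refreservation portal/bolt00
#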
 