ZFS Replacing entire pool?

dhw

Developer
I would like to replace all 3 spinning drives of a raidz pool with (smaller) SSDs: the smaller SSDs will still be more than adequate for my storage needs, and should be significantly faster.

The system has a limited number of SATA ports (4): 1 is in use for the system drive (which is already an SSD, and is sliced and partitioned using UFS2+SU); the other 3 ports are used for the raidz pool:

Code:
freebeast(11.0-S)[1] zpool status
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors
freebeast(11.0-S)[2]

The pool has been in use since July 2015 -- and its sole use is building packages via poudriere:

Code:
freebeast(11.0-S)[2]  df -ht zfs
Filesystem                      Size    Used   Avail Capacity  Mounted on
tank                            3.5T    3.7G    3.5T     0%    /tank
tank/poudriere                  3.5T     25G    3.5T     1%    /tank/poudriere
tank/poudriere/jails            3.5T    128K    3.5T     0%    /tank/poudriere/jails
tank/poudriere/jails/11amd64    3.5T    1.0G    3.5T     0%    /tank/poudriere/poudriere/jails/11amd64
freebeast(11.0-S)[3]

Each of the current 3 drives is a 2TB "pull" from a ReadyNAS -- I figured that in the worst case, all of the data in the pool can be re-generated. And yeah, I could try that approach, but I'd like to try being a little less destructive about this exercise.

The SSDs are also "pulls"; they're 500GB each; I rather strongly suspect that trying to persuade ZFS to accept a quarter-sized SSD as a replacement for a spinning drive with no known defects is unlikely to turn out well.

It turns out that one of the file systems on the "system" drive had >90GB free, so I created gzipped tarballs of what is in /tank, so I have a backup (of sorts).

I think that (ideally) I'd like to make the current pool go away, remove the spinning drives, swap in the SSDs, set up a shiny new pool (re-using names, as I have a fair number of config files that use those names), restore the data, then continue as if nothing happened (except that the pool isn't quite as empty, the machine is quieter, and package-building is a lot faster).

So... how do I do that (or something reasonably equivalent)? Or is this something "not recommended?" (Errr... A pointer to existing documentation I've managed to overlook would also be quite welcome.)
 
You can't replace the disks in-place with the SSDs because the SSDs are smaller. Replacing disks in-place will only work for disks that are larger or the same size.
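For illustration only -- a hypothetical sketch, assuming a spare port existed and the SSD temporarily showed up as ada4 (a made-up device name) -- an in-place replace would simply be rejected:
Code:
[hypothetical: SSD temporarily attached as ada4]
zpool replace tank ada1 ada4
[fails: the new provider is smaller than the one it would replace]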
 
As said: you can't replace providers or vdevs with smaller ones.

Create a new pool with the SSDs and zfs send -R the datasets over from your old pool (-R operates on a recursive snapshot, so take one first). The -R flag preserves all properties, snapshots, clones and descendent file systems, so sending a snapshot of "tank" should be sufficient to transfer everything over.
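Roughly like this -- a sketch, with "newtank" and "@migrate" as placeholder names for the SSD pool and the snapshot:
Code:
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F newtank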
 
Thanks... but I don't see a way to create a new pool without destroying the existing one first: there are only 4 SATA ports on the machine, and they are all in use.

Is there a (somewhat sane) way to back up the data, destroy the old pool, create a new pool, then restore the data -- that will end up with the same paths to get to the same data?

Perhaps something on the order of:
Code:
zfs send -R tank >/bkp/tank
zfs destroy -rf tank
shutdown -p now
[physically replace spinning drives with (smaller) SSDs....]
[power up]
zpool create tank raidz ada{1,2,3}
zfs receive tank </bkp/tank


?
 
You can zfs send to nearly anything file- or block-based, so using a separate disk that can hold all the data is also possible. Either you write the stream into a file on that disk or (IMHO more elegantly) create a ZFS pool on the backup drive and zfs receive the stream to that pool. As you don't have any free SATA ports you'd have to do this via USB - so prepare for _very_ long backup/restore times and double-check the integrity of the backup.
I'd destroy the old pool only after everything went fine (so step 2 of your list goes last), or just let the drives sit around as a backup until they are needed somewhere else.
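A minimal sketch of that route, assuming the USB disk attaches as da0 and the temporary pool is called usbbkp (both names are placeholders):
Code:
zpool create usbbkp da0
zfs snapshot -r tank@bkp
zfs send -R tank@bkp | zfs receive usbbkp/tank
zpool export usbbkp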


As 4 ports are very limiting (and I suspect it's a slow onboard controller), you might also think about a separate HBA with more connectors. E.g. the 'classic' IBM M1015 (or other LSI SAS2008-based HBAs) is well supported, offers decent performance, and is widely and cheaply available.
 
Seeing as you only seem to be using about 30GB, I would just use zfs send to dump the file systems onto your system drive. Then remove all the SATA disks and replace them with the SSDs. Build a new pool and recv the file systems back into it. Sending ZFS file systems to a file is usually discouraged, as a single bit error can render the whole stream useless, but it should be fine seeing as you're only doing it briefly to get data between disks, and you still have the original pool on the old disks.
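In outline -- a sketch only, using /repo/tmp/tank.bkp as a stand-in for whatever scratch path on the system drive has the space:
Code:
zfs snapshot -r tank@bkp
zfs send -R tank@bkp >/repo/tmp/tank.bkp
[verify the dump, then tear down the old pool]
zpool destroy tank
shutdown -p now
[physically swap the spinning drives for the SSDs, power up]
zpool create tank raidz ada1 ada2 ada3
zfs receive -F tank </repo/tmp/tank.bkp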
 
Yes; thanks: the path for the (temporary) target of zfs send resides on the boot drive -- it's just in a file system that has a lot of free space (in this context). Worst case, I should be able to re-generate all the data that I care about in the pool: it's for results from running poudriere.
 
OK. I think the process worked -- I haven't yet fired up poudriere, but I can "see" the data that's supposed to be on the ZFS storage (both locally and from other systems, as it's NFS-exported); here's what I did (with most of the missteps elided):
Code:
Script started on Thu Nov 24 04:55:03 2016
freebeast(11.0-S)[1] rm -fr /repo/tmp/* && df -h /repo
Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/ada0s4h    124G     16G     98G    14%    /repo
freebeast(11.0-S)[2] ...
freebeast(11.0-S)[17] zfs snapshot -r tank@bkp
freebeast(11.0-S)[18] echo $?
0
freebeast(11.0-S)[19] zfs send -R tank@bkp >/repo/tmp/tank.bkp
freebeast(11.0-S)[20] echo $?
0
freebeast(11.0-S)[22] df -ht zfs
Filesystem                      Size    Used   Avail Capacity  Mounted on
tank                            3.5T    3.7G    3.5T     0%    /tank
tank/poudriere                  3.5T     25G    3.5T     1%    /tank/poudriere
tank/poudriere/jails            3.5T    128K    3.5T     0%    /tank/poudriere/jails
tank/poudriere/jails/11amd64    3.5T    1.0G    3.5T     0%    /tank/poudriere/poudriere/jails/11amd64
freebeast(11.0-S)[23] df -h /repo
Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/ada0s4h    124G     44G     71G    38%    /repo
freebeast(11.0-S)[25] ls -lTh /repo/tmp/
total 29324480
-rw-r--r--  1 root  wheel    28G Nov 24 05:29:13 2016 tank.bkp
freebeast(11.0-S)[27] zfs destroy -r tank
freebeast(11.0-S)[28] df -ht zfs
Filesystem    Size    Used   Avail Capacity  Mounted on
tank          3.5T    3.7G    3.5T     0%    /tank
freebeast(11.0-S)[29] zfs unmount tank
freebeast(11.0-S)[30] !df
df -ht zfs
freebeast(11.0-S)[31] shutdown -p now
freebeast(11.0-S)[32]
Script started on Thu Nov 24 06:32:20 2016
freebeast(11.0-S)[1] zpool create tank raidz ada{1,2,3}
cannot create 'tank': pool already exists
freebeast(11.0-S)[2] zfs destroy tank
cannot open 'tank': dataset does not exist
freebeast(11.0-S)[3] ^fs^pool
zpool destroy tank
freebeast(11.0-S)[4] zpool create tank raidz ada{1,2,3}
freebeast(11.0-S)[5] df -ht zfs
Filesystem    Size    Used   Avail Capacity  Mounted on
tank          919G    117K    919G     0%    /tank
freebeast(11.0-S)[6] zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors
freebeast(11.0-S)[7] zfs receive tank </repo/tmp/tank.bkp
cannot receive new filesystem stream: destination 'tank' exists
must specify -F to overwrite it
freebeast(11.0-S)[9] zfs receive -F tank < /repo/tmp/tank.bkp
freebeast(11.0-S)[10] echo $?
0
freebeast(11.0-S)[11] df -ht zfs
Filesystem              Size    Used   Avail Capacity  Mounted on
tank                    893G    3.7G    889G     0%    /tank
tank/poudriere          914G     25G    889G     3%    /tank/poudriere
tank/poudriere/jails    889G    117K    889G     0%    /tank/poudriere/jails
freebeast(11.0-S)[12] service mountd reload
freebeast(11.0-S)[13] exit

Script done on Thu Nov 24 06:45:28 2016
Update: While I normally only run poudriere on weekends (shortly before I install the packages during the weekly updates of the machines that use the packages), I figured it would be best to actually do a test run -- which seems to have worked:
Code:
...
[00:00:27] ====>> Building 294 packages using 8 builders
...
[01:55:14] ====>> Stopping 8 builders
...
[11amd64-ports-home] [2016-11-24_15h03m32s] [committing:] Queued: 294 Built: 294 Failed: 0   Skipped: 0   Ignored: 0   Tobuild: 0    Time: 01:55:35
(I apologize for the abuse of the "CODE" tag -- that's the only way I found to leave the whitespace alone.)
 
Pretty much completely off-topic, but I was thinking about how sending to a file is generally a bad idea, because a single error will destroy the entire copy.

Obviously the file must contain all the file data, so it would probably be possible to create a tool to recover some data. I doubt a recovery tool will ever exist of course, because it's a lot of work for a situation that people shouldn't really get into in the first place.

It did get me thinking about how it could be possible to treat the copy as a pool, which led to the following idea:

Code:
# truncate -s xxG /path/backup.pool
# mdconfig -a -t vnode -f /path/backup.pool
# zpool create backup mdX
# zfs send pool/fs@snapshot | zfs recv backup/fs
# zpool export backup
# mdconfig -d -u X

I actually tried this with a 4G file system and corrupted the file to see what would happen. (I just picked a random location in the file, which could easily have been empty.)

Code:
# dd if=/dev/random of=/path/backup.pool bs=1m count=10 seek=120 conv=notrunc
# mdconfig -a -t vnode -f /path/backup.pool
# zpool import backup
# zpool scrub backup
# zpool status -v backup
  pool: backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 2K in 0h4m with 9 errors on Wed Nov 23 12:52:10 2016
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     9
          md0       ONLINE       0     0    20

errors: Permanent errors have been detected in the following files:

        backup/fs@22-11-2016:/app/uploads/auth/30783.pdf
        backup/fs@22-11-2016:/app/uploads/auth/30784.pdf
        backup/fs@22-11-2016:/app/uploads/auth/30785.pdf

It actually managed to repair 2K during the scrub, which I'm guessing must be metadata, as that is stored twice by default. Three files were corrupt, but the vast majority of the data was accessible.

Of course I'm not suggesting this as a real backup method. It was just an interesting way to store a copy of a ZFS file system on non-ZFS storage while giving it a chance to actually cope with some corruption.
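If you wanted the file-backed copy to have a better chance with user data as well (not just metadata), the copies property could be set when creating the pool -- just a thought, re-using the placeholder names from the sketch above:
Code:
zpool create -O copies=2 backup mdX
[then send/receive as above; each user-data block is now stored twice in the file]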
 