Solved: How to convert an existing active zpool (raidz, with root on it) to 4k alignment

With the help of several knowledgeable contributors and some googling of my own at the end, I recently resolved the by-product of a bsdinstall bug on my raidz array.

While resolving an issue where one pool member was created oddly during the OS install (freebsd-10-1-p6-with-zfs-on-root-with-raidz-one-disk-member-was-created-oddly), I discovered via gpart show, as pointed out by gkontos, that my partitions are not 4k aligned.

I thought it made more sense to make this a separate post about how to convert an existing pool to 4k alignment while it is live (hot), assuming that is even possible.

My initial guess is that maybe I can remove one pool member at a time, repartition it with proper alignment, and put it back, repeating until all of them are done.

Would this work? Is this even possible?

Any pointers would be helpful.

Upon completion, I would like to post a nice how-to on the topic.

Here are some specific details:

Code:
# zpool status
  pool: Datastore
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 28 20:33:17 2015
        63.7G scanned out of 625G at 350M/s, 0h27m to go
        12.6G resilvered, 10.18% done
config:

        NAME                        STATE     READ WRITE CKSUM
        Datastore                   DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            gpt/zfs0                ONLINE       0     0     0
            replacing-1             OFFLINE      0     0     0
              6091044272226804680   OFFLINE      0     0     0  was /dev/diskid/DISK-YFG89PPAp3
              gpt/zfs1              ONLINE       0     0     0  (resilvering)
            gpt/zfs2                ONLINE       0     0     0
            gpt/zfs3                ONLINE       0     0     0
            gpt/zfs4                ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            gpt/log0                ONLINE       0     0     0
            gpt/log1                ONLINE       0     0     0
        cache
          gpt/cache0                ONLINE       0     0     0
          gpt/cache1                ONLINE       0     0     0

errors: No known data errors

BTW, yes I know the pool is resilvering now. I will wait till that completes before tackling this.

Code:
# gpart show
=>  34  250069613  ada0  GPT  (119G)
  34  2014  - free -  (1.0M)
  2048  16777216  1  freebsd-zfs  (8.0G)
  16779264  233290376  2  freebsd-zfs  (111G)
  250069640  7  - free -  (3.5K)

=>  34  3907029101  ada1  GPT  (1.8T)
  34  1024  1  freebsd-boot  (512K)
  1058  12582912  2  freebsd-swap  (6.0G)
  12583970  3894445165  3  freebsd-zfs  (1.8T)

=>  34  3907029101  ada3  GPT  (1.8T)
  34  1024  1  freebsd-boot  (512K)
  1058  12582912  2  freebsd-swap  (6.0G)
  12583970  3894445165  3  freebsd-zfs  (1.8T)

=>  34  3907029101  ada4  GPT  (1.8T)
  34  1024  1  freebsd-boot  (512K)
  1058  12582912  2  freebsd-swap  (6.0G)
  12583970  3894445165  3  freebsd-zfs  (1.8T)

=>  34  3907029101  ada5  GPT  (1.8T)
  34  1024  1  freebsd-boot  (512K)
  1058  12582912  2  freebsd-swap  (6.0G)
  12583970  3894445165  3  freebsd-zfs  (1.8T)

=>  34  234441581  ada6  GPT  (112G)
  34  2014  - free -  (1.0M)
  2048  16777216  1  freebsd-zfs  (8.0G)
  16779264  217662344  2  freebsd-zfs  (104G)
  234441608  7  - free -  (3.5K)

=>  34  15633341  da0  GPT  (7.5G)
  34  1024  1  bios-boot  (512K)
  1058  6  - free -  (3.0K)
  1064  15632304  2  freebsd-zfs  (7.5G)
  15633368  7  - free -  (3.5K)

=>  34  15633341  diskid/DISK-4C530013510724112284  GPT  (7.5G)
  34  1024  1  bios-boot  (512K)
  1058  6  - free -  (3.0K)
  1064  15632304  2  freebsd-zfs  (7.5G)
  15633368  7  - free -  (3.5K)

=>  34  3907029097  da1  GPT  (1.8T)
  34  6  - free -  (3.0K)
  40  409600  1  efi  (200M)
  409640  3906357344  2  apple-hfs  (1.8T)
  3906766984  262147  - free -  (128M)

=>  34  3907029097  diskid/DISK-000000000024  GPT  (1.8T)
  34  6  - free -  (3.0K)
  40  409600  1  efi  (200M)
  409640  3906357344  2  apple-hfs  (1.8T)
  3906766984  262147  - free -  (128M)

=>  34  3907029101  diskid/DISK-YFG89PPA  GPT  (1.8T)
  34  1024  1  freebsd-boot  (512K)
  1058  12582912  2  freebsd-swap  (6.0G)
  12583970  3894445165  3  freebsd-zfs  (1.8T)
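For reference: assuming 512-byte logical sectors, a partition is 4k aligned when its starting sector is divisible by 8, so a quick check against the offsets above looks like this:

Code:
# ada1 partition 3 starts at sector 12583970
echo $((12583970 % 8))   # 2 -> not 4k aligned
# ada0 partition 1 starts at sector 2048
echo $((2048 % 8))       # 0 -> 4k aligned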
 
Unfortunately that won't work. To my knowledge, the only way to change the ashift value of a zpool is to completely destroy and recreate it. If you have another ZFS box you could use zfs send to send snapshots of the filesystem(s) to that box until the pool is recreated, or you'd have to restore the data from your backup.
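If you want to confirm what the pool was actually created with, something like this should show it (it relies on the pool being listed in /boot/zfs/zpool.cache, which is normally the case for an imported pool):

Code:
# ashift: 9 means 512-byte sectors, ashift: 12 means 4k sectors
zdb -C Datastore | grep ashift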
 

4k alignment can be fixed by replacing the non-aligned disks with properly aligned ones. The ashift (blocksize in other words) is a different matter and requires recreation of the pool from scratch.
 
Edit: My apologies. I completely misread the post. :(
 
Correct, you need to perform a clean install again. Since you are using ZFS on Root, I would suggest that you boot from an mfsbsd image and then use this guide to assist you with the installation.

You don't need to use the gnop trick anymore, though. You can issue sysctl vfs.zfs.min_auto_ashift=12 instead.
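The setting only affects vdevs created while it is in effect, so set it before running zpool create; a small sketch, with the persistence step being optional:

Code:
# make new vdevs default to ashift=12 (4k) for the running system
sysctl vfs.zfs.min_auto_ashift=12
# confirm the current value
sysctl vfs.zfs.min_auto_ashift
# keep it across reboots, if you want it permanent
echo 'vfs.zfs.min_auto_ashift=12' >> /etc/sysctl.conf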
 
Right, thanks :)

Will post back when I get done this evening EST.

Can I still send a snapshot to a USB disk, for example, and then restore it onto the newly created system after the install to retain my work?
 
Here is what I did to create the backup, which is still running in a screen session. Can you confirm this will create a complete backup?

I took a recursive snapshot like:

zfs snapshot -r Datastore@Backup
zfs send -R Datastore@Backup > /usb-backup/full-Datastore-backup.zfs

usb-backup is a 2TB drive I partitioned with gpart, created a freebsd-zfs part on, formatted with newfs, and then created a pool from it.
 
That is not the safest way to transfer filesystems with ZFS. There is a possibility of undetected data corruption in the stream you're now sending to /usb-backup/full-Datastore-backup.zfs, which could make restoring the backup impossible. The safer way would be to use a backup pool and pipe the stream to zfs receive:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Using_ZFS_Snapshots

If you still want to go with the method you're using, you could abort it and re-do it with gzip(1) compression; that would at least give you a way to test the resulting file for data corruption with gzip -t:

zfs send -R Datastore@Backup | gzip -2 > /usb-backup/full-Datastore-backup.zfs.gz
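If you do go that route, checking and later restoring the file could look like this (newpool here is only a placeholder for whatever the recreated pool ends up being called):

Code:
# verify the gzip CRC; a clean exit means the file itself is intact
gzip -t /usb-backup/full-Datastore-backup.zfs.gz && echo OK

# later, restore into the freshly created pool
zcat /usb-backup/full-Datastore-backup.zfs.gz | zfs receive -duvF newpool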
 
I had wanted to do it that way, but I ran into an error of my own making.

Code:
# zfs send -R Datastore@Backup | zfs receive -dvu usb-backup
cannot receive new filesystem stream: destination 'usb-backup' exists
must specify -F to overwrite it
warning: cannot send 'Datastore@Backup': Broken pipe

The pool I want to send it to is mounted locally on the server I am backing up, not on a remote machine. What am I doing wrong?

Code:
# zpool list
NAME         SIZE  ALLOC   FREE  FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
Datastore   9.06T   663G  8.41T    3%         -     7%  1.00x  ONLINE  -
usb-backup  1.81T   826K  1.81T    0%         -     0%  1.00x  ONLINE  -
 
You have to use the -F option for zfs receive so that existing datasets (usb-backup in this case) are overwritten on the receiving side. This is by design.
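So in your case the pipe would look something like this:

Code:
zfs send -R Datastore@Backup | zfs receive -dvuF usb-backup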

One more thing: for safety, you should export and re-import usb-backup with altroot set to, for example, /mnt before sending backup streams that contain datasets with mountpoints like /, /usr, etc.:

zpool export usb-backup
zpool import -R /mnt usb-backup

This prevents the unfortunate accident of mounting the datasets from the backup pool over the live system.
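After importing with -R you can double-check that the altroot is in effect and see where the datasets would end up mounted:

Code:
zpool get altroot usb-backup
zfs list -r -o name,mountpoint usb-backup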
 
Initially I tried that and it didn't work, but I see now that it failed because a screen session still had the directory open.

It's working now :)
 
I am not sure how to see what you edited, but I don't have anything I was concerned about losing on the usb disk. Was that your concern?
 
Just out of curiosity do zfs snapshots work with running but fairly idle DB instances like postgres?

As far as I know, backing up databases that way is not recommended, because you might take the snapshot mid-transaction and the resulting backup wouldn't be consistent.
 

Ah sorry, my edit was about the altroot setting on the usb-backup pool. You're using the -u option on zfs receive, so you should be safe for now, but the next time you import the usb-backup pool you should definitely use zpool import -R /mnt usb-backup.
 
OK, this was generally the case in the Linux world as well; I was just wishfully hoping ZFS would have some extra magic, LOL
:/

Also, your edit about the export and import shows up now.

What specifically does zpool export usb-backup do?

Does this just make the backup on usb-backup available to import and mount under /mnt for example?

Also, when I am ready to move it from /mnt to the root fs, is that still done with zfs receive?

If so what would that look like?
 
OK, so when you refer to the "export and import", you mean the point when I am ready to restore the backup, not when I am creating it, right?

My initial backup is still going, looks like I had 490GB or so.
 

I intended for you to do that before starting the zfs send ... command, but that's all right now. The zpool export command simply removes the pool from the system and unmounts all filesystems on it. The matching zpool import is the reverse: it adds the pool to the system and mounts all filesystems on it.

The reason I gave the zpool import -R /mnt usb-backup command is that ZFS datasets (filesystems) have mountpoint properties; these tell the system where each of them should be mounted. What you're doing now is sending a recursive snapshot with all properties intact and unchanged. That means there is a dataset with its mountpoint set to / (the root directory), since your pool is a bootable one and contains the operating system as well (unless I'm totally mistaken, but there are freebsd-boot partitions on your disks). If you export and then import the usb-backup pool (without setting altroot with -R) after the backup has been created on it, ZFS will happily mount all of its datasets, including the one with mountpoint set to /, and that one will be mounted over the current / dataset from the Datastore pool. That is very bad, because it causes a deadlock that cannot be resolved in any way other than a hard reset.

If I had constructed your ZFS pool I would have gone for a separate small OS pool and another big pool for just data, or even a UFS filesystem for the operating system.
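For what it's worth, a rough sketch of such a split; the device names, labels, and layout below are only examples, and the freebsd-boot and swap partitions are omitted for brevity:

Code:
# make new vdevs default to 4k sectors
sysctl vfs.zfs.min_auto_ashift=12

# small mirrored OS pool on two disks (assumes a GPT scheme already exists: gpart create -s gpt adaX)
gpart add -a 1m -t freebsd-zfs -l os0 ada0
gpart add -a 1m -t freebsd-zfs -l os1 ada6
zpool create -m none zroot mirror gpt/os0 gpt/os1

# big data-only pool (raidz1) on the large disks
gpart add -a 1m -t freebsd-zfs -l data0 ada1    # repeat for the other members
zpool create -m none Datastore raidz1 gpt/data0 gpt/data1 gpt/data2 gpt/data3 gpt/data4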
 
Ah, that is more clear now.

On the initial install, I just used the guided ZFS option. It turns out it didn't create it optimally for me.

I plan on reinstalling the system and making it by hand.

I planned on struggling my way through this rather than depending on all the help everyone is so quick to offer here.

Do you have a generally high level layout you would recommend?

Otherwise, not knowing any better, I was going to recreate it much as it is now, just with 4k alignment.

Even in the Linux world, many of the new OS releases are moving to btrfs, where the layouts differ greatly from the ext4/LVM methods I am accustomed to.

I recognize the amazing features in this paradigm, but I am still wrapping my brain around the new tool sets and trying to find best practices for real enterprise application and maintenance.

I manage around 2600 blades/instances professionally and am earning more and more latitude there daily. I would love to see FreeBSD/ZFS find a home on some of that steel.
 
As far as I know, backing up databases that way is not recommended, because you might take the snapshot mid-transaction and the resulting backup wouldn't be consistent.

I would speculate that starting the database after restoring from a snapshot would give a message about an unclean shutdown. I've seen that log message on systems that had a kernel panic. Between snapshots and no backups at all, snapshots are certainly better than nothing, but they're no substitute for a real backup.

If the database is spread across multiple ZFS datasets, that would almost certainly be bad, as taking a recursive snapshot is not a single atomic operation. It has to go one dataset at a time, and that can leave things in an odd state when the data is out of sync.
 
Agreed 100%. I had experienced serious corruption once but I guess it is better than having no backup. When it comes to SQL data I always use automysqlbackup. Of course, you can always stop the process before taking a snapshot.
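For completeness, wrapping the snapshot in a service stop/start would look roughly like this (the service name is an assumption; adjust it to how the database was installed):

Code:
# quiesce the database, take the snapshot, then bring it back up
service postgresql stop
zfs snapshot -r Datastore@db-backup
service postgresql start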
 