ZFS Plan and best practices to migrate from TrueNAS Core to pure FreeBSD

Hello everyone,

I am not planning to migrate to TrueNAS SCALE, as my whole environment (desktops, servers and laptops) runs FreeBSD, with the sole exception of my web browsing laptop, which is a corebooted QubesOS machine.

The current setup is two servers with 8 disks each in RAIDZ2, a solid arrangement I have run on FreeNAS/TrueNAS for 10+ years. The active server zfs send/receives to the backup server, and everything runs in jails, except my email server, which is mailcow in a bhyve VM on the active server, using mailcow's built-in replication script to replicate to a standby VM on the second server. The whole setup is backed up to Amazon Glacier (I still haven't had the cash to put a server up at one of my friends' places).

I have decided to start migrating to pure FreeBSD, so I nuked the standby server and began by trying to migrate the standby mailcow bhyve VM. With the server fresh, I zfs send/received my standby VM's dataset to the FreeBSD setup, loaded the key and spent four hours trying to get the VM to start, to no avail (bhyve error messages are really not that detailed).

So I thought it safer to ask for some advice. I am currently down to a single TrueNAS CORE box, and I would like to migrate the setup to two FreeBSD boxes with HAST. I won't fight too much with migrating the bhyve VM; I can just build a new one for the email server.

My initial plan is to:
1. zfs send | receive all the data from CORE to the pure FreeBSD box (see the sketch after this list).
2. Stop services.
3. Do an incremental resend.
4. Load keys.
5. Recreate my jails and the VM on the FreeBSD box, so the services are back online.
6. Make the new box responsible for the Amazon Glacier backups.
7. With everything running fine, nuke the TrueNAS CORE installation and install FreeBSD.
8. Load keys back (all data should be there).
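
For steps 1-4, I mean something like this (a rough sketch only; pool names are placeholders, and I am assuming raw sends of the encrypted datasets):

# Step 1: initial full replication; -w sends encrypted datasets raw, so no keys are needed in transit
zfs snapshot -r tank@migrate1
zfs send -Rw tank@migrate1 | ssh newbox zfs receive -Fu backup

# Steps 2-3: after stopping services, send only what changed since the first snapshot
zfs snapshot -r tank@migrate2
zfs send -Rw -i @migrate1 tank@migrate2 | ssh newbox zfs receive -Fu backup

# Step 4: on the new box, load the keys and mount everything
zfs load-key -r backup
zfs mount -a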

My question is... in this plan, when would I set up HAST, to minimize data transfers and reduce the risk of data loss (I'd hate to restore from Glacier)? Or should I instead just nuke the CORE box, do a fresh install and load keys? Any tips? Or is my plan awful? =)

After everything is running, I have five Dell R220 II servers that I will add to my home rack, so the storage boxes will do only storage and run no services.

Thanks in advance!
 
I'm no expert in TrueNAS (I looked into it years ago and went with FreeBSD instead).

I would recreate everything from scratch using Ansible, creating playbooks (a.k.a. documentation) along the way. Since you are running FreeBSD everywhere else, the learning curve shouldn't be that steep.

I would set up the former passive cluster node as the new active node, which gives you the opportunity to test everything against the still-running active node and then fail over gracefully to the new node once everything is working like a charm. Setting up the leftover former active node as the new passive (standby) node with Ansible should then be a piece of cake and take a fraction of the time.
 
Tip: in case you haven't tried it already, use something like sysutils/vm-bhyve to manage bhyve VMs instead of trying to run bhyve "bare".
If this is a FreeBSD bug, which is unknown at the moment, I don't see how this would resolve anything. We don't have enough information to determine that yet.

Writing terabytes to a zpool shouldn't invalidate the pool. The OP may have tickled a bug by writing a lot of output to the zpool.

The other thing we don't know is whether they are running a 64-bit version of FreeBSD or 32-bit. It seems to me that an integer overflow might be hiding in ZFS somewhere.
 
Writing terabytes to a zpool shouldn't invalidate the pool. The OP may have tickled a bug by writing a lot of output to the zpool.
I agree. But do we know that the pool is the problem?
As far as I can tell from the OP's post, the problem is with starting the VM. He doesn't say anything about the health of the pool or the replicated dataset.
Perhaps providing more info on that part would narrow down the possible problem area.
 
I agree. But do we know that the pool is the problem?
As far as I can tell from the OP's post, the problem is with starting the VM. He doesn't say anything about the health of the pool or the replicated dataset.
Perhaps providing more info on that part would narrow down the possible problem area.
We don't know. But opening a PR will get someone with enough ZFS knowledge to look at it; speculation here will not solve anything. Writing terabytes corrupting a zpool smells serious to me.
 
I kinda hoped I was doing something stupid. This escalated way beyond what I expected!

I did another test, though. I have a dataset for my Nextcloud data, which I mount into a jail where I run Nextcloud.
So I did the same power combo of zfs send | receive over SSH, accounting for encryption and so on, then took quick SHA256 checksums of a bunch of files on both sides, and everything I checked matched.

My ignorance kicks in when it comes to the bhyve image. I know it is there because it is listed as a dataset, but it does not show up in the filesystem (if I ls, for example). How can I check that the data transfer went correctly?
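
The only idea I have come up with (untested, and the device paths are guesses based on my pool layout) is that the image is a zvol, so it should show up as a device node under /dev/zvol that I can hash on both sides while the VM is shut down:

# On the source box, with the VM powered off so the volume is stable
dd if=/dev/zvol/tank/vm/mailcow-disk0 bs=1m | sha256

# On the destination box, hash the replicated zvol and compare the digests
dd if=/dev/zvol/backup/vm/mailcow-disk0 bs=1m | sha256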

With that ruled out, I (we, with your kindness) can chase down my bhyve stupidity in a different thread. I will also try vm-bhyve.
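
From a quick read of the vm-bhyve documentation, the setup looks roughly like this (untested on my end; the dataset, NIC, template and VM names are just examples):

pkg install vm-bhyve
zfs create tank/vm
sysrc vm_enable="YES"
sysrc vm_dir="zfs:tank/vm"
vm init
# copy the sample guest templates shipped with the port
cp /usr/local/share/examples/vm-bhyve/* /tank/vm/.templates/

# bridge guests onto the LAN (em0 stands in for the real NIC)
vm switch create public
vm switch add public em0

# fetch an installer image (URL omitted here), then create and install a guest
vm iso https://.../debian-netinst.iso
vm create -t debian -s 50G mailcow
vm install mailcow debian-netinst.iso
vm console mailcow
# after the installer finishes and the guest powers off:
vm start mailcow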

tanis: to address one thing, I was looking for a smooth transition into a HAST active-active kind of scenario. Is this possible? If not, then my life is easy: I will move all the data over and set up the 3 physical machines I am waiting to add to my rack (3 for compute + the 2 existing ones for storage).
 
tanis: to address one thing, I was looking for a smooth transition into a HAST active-active kind of scenario. Is this possible? If not, then my life is easy: I will move all the data over and set up the 3 physical machines I am waiting to add to my rack (3 for compute + the 2 existing ones for storage).

I guess it depends on which layer you define the active-active kind of scenario. :-/

I suggest the following links:
I'm particularly intrigued by the ZFS High-Availability NAS one, but it seems like dual-port SSD/NVMe drives are a specific HP product; anyone, please correct me if I'm wrong here.
 
I own a Sonnet SAS enclosure that supports dual-port disks, but no, I am not going to go this way - too expensive.

The first document sounds more like what I am trying to achieve. Perhaps the way to go is to prepare the new server, connect my SAS enclosure to it, import everything, then nuke the old server (the TrueNAS one), prepare it and finish the HAST configuration.
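
If I understand the HAST docs correctly, the end state would look something like this (hostnames and disk devices are made up, and in my case there would be one resource per disk; note that HAST itself is primary/secondary per resource, so the pool is only imported on the active node):

# /etc/hast.conf, identical on both nodes
resource disk0 {
        on storage1 {
                local /dev/da0
                remote storage2
        }
        on storage2 {
                local /dev/da0
                remote storage1
        }
}

# On both nodes:
hastctl create disk0
sysrc hastd_enable="YES"
service hastd start

# On the active node only (the pool lives on the /dev/hast/* providers):
hastctl role primary disk0
zpool create tank /dev/hast/disk0

# On the standby node:
hastctl role secondary disk0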

And I have a third spare enclosure that I don't use anymore (noisy fan) where I can create a backup. I hope I have enough hard drives. =)

I might - I tend to pull them out every 45000 hours regardless of health.
 