ZFS LOG device on an encrypted pool = single point of failure? Can't unlock pool anymore after losing SLOG...

Hi,

I am running TrueNAS 12 on my server and recently had some data corruption issues, which caused me to temporarily remove my Intel Optane from the server. After finding out that the issues were caused by a broken CPU, I replaced the CPU and wanted to re-add my Intel Optane.

But something went wrong and now I can't unlock my pool anymore... Here's what happened:

I actually followed the same steps that I had successfully performed on FreeNAS 11 when I initially installed my Intel Optane about 6 months ago.
  • Create a GPT partition table
    • gpart create -s gpt nvd0
  • Create 3 partitions (I didn't do anything with the swap partition yet)
    • gpart add -i 1 -b 128 -t freebsd-swap -s 16g nvd0
    • gpart add -i 2 -t freebsd-zfs -s 20g nvd0
    • gpart add -i 3 -t freebsd-zfs nvd0
  • Add the 2nd partition as SLOG to my existing password-encrypted RAIDZ2 pool
    • zpool add hgstpool log /dev/gptid/cf3dbeb8-c5dc-11ea-8b09-d05099d3fdfe
  • Create a non-redundant test pool from partition 3
    • zpool create -m /mnt/optanepool optanepool /dev/gptid/d23e29f0-c5dc-11ea-8b09-d05099d3fdfe
    • zpool export optanepool
    • Import optanepool in the GUI
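For completeness, this is roughly how the resulting layout could be double-checked from the shell afterwards (just a sketch, actual output omitted):
Code:
# show the GPT partition layout on the Optane
gpart show nvd0
# confirm the log vdev was added to the existing pool
zpool status hgstpool
# confirm the new test pool
zpool status optanepool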
Next I wanted to enable lz4 compression on optanepool in the GUI, but I got an error:
Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 137, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/service.py", line 463, in update
    rv = await self.middleware._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 972, in nf
    args, kwargs = clean_and_validate_args(args, kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 930, in clean_and_validate_args
    value = attr.clean(args[args_index + i])
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 607, in clean
    data[key] = attr.clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 167, in clean
    value = super(Str, self).clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 53, in clean
    raise Error(self.name, f'Invalid choice: {value}')
middlewared.schema.Error: [aclmode] Invalid choice: DISCARD
This step did work in FreeNAS 11, but for some reason it doesn't work anymore in TrueNAS 12.

I thought: OK, no problem. I'll just re-create optanepool and enable compression with the "zpool create" command itself. So I exported the pool in the TrueNAS GUI, but then I couldn't create a new zpool over the old one, as the old one was still there, just not visible because it was exported.
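What I had in mind was roughly the following (a sketch, using the partition-3 gptid from above; either re-create the pool with the property set, or just set it from the shell):
Code:
# re-create the test pool with lz4 enabled from the start
zpool create -m /mnt/optanepool -O compression=lz4 optanepool /dev/gptid/d23e29f0-c5dc-11ea-8b09-d05099d3fdfe
# or simply set the property on the existing pool
zfs set compression=lz4 optanepool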
Then I made the mistake of thinking it might be better to destroy optanepool from the GUI, so I re-imported optanepool and then exported it again in the GUI with the destroy checkbox selected. At that time my main RAIDZ2 pool was actually still locked after a reboot (I even thought it would be safer to keep it locked).

But, here it comes: when destroying optanepool, the TrueNAS GUI didn't just destroy the optanepool partition. No, it destroyed ALL partitions on my Intel Optane, including the partition that I was using as SLOG for my password-encrypted RAIDZ2 pool :eek::eek::eek:

When I now try to unlock my RAIDZ2 pool, it fails with the following error:
Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 91, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 371, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 365, in import_pool
    zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
  File "libzfs.pyx", line 1095, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1123, in libzfs.ZFS.__import_pool
libzfs.ZFSException: one or more devices is currently unavailable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 361, in run
    await self.future
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 397, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 973, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 290, in unlock
    raise e
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 272, in unlock
    await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1238, in call
    return await self._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1203, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1209, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1136, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1110, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('one or more devices is currently unavailable',)
In the last line, you can clearly read "one or more devices is currently unavailable". In the console, you can also see that it successfully unlocks each HDD, one by one, and after unlocking all 8 HDDs, it then destroys them again:
Code:
Feb  4 10:35:29 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:29 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:29 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:31 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:31 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:31 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:32 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:32 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:32 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:34 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:34 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:34 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:36 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:36 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:36 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:38 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:38 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:38 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:39 data GEOM_ELI: Device gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:39 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:39 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:41 data GEOM_ELI: Device gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:41 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:41 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
So apparently my accidentally destroyed SLOG device is making the pool impossible to unlock. o_O
  • I can't add a new / replacement SLOG, because I need an unlocked pool for that
  • I can't "remove" the destroyed SLOG from my pool, because I need an unlocked pool for that
  • I tried physically removing the Intel Optane, so that TrueNAS might realize the SLOG is no longer there, but I still can't unlock my pool
I have some ideas for how to go forward, but my ZFS knowledge is too limited to know whether these are safe / good ideas. I hope someone here can give some feedback...

I was thinking of unlocking my HDDs one by one using the command line. Perhaps then, even though I can't unlock the destroyed SLOG anymore, I'll be able to "see" and modify my unlocked RAIDZ2 pool and perhaps remove the SLOG, so that TrueNAS no longer expects a SLOG when unlocking. I found the following commands to do this:

geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
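(The other six members would get the same command; to save some typing it could also be wrapped in a small /bin/sh loop, something like this sketch, which still prompts for the passphrase on every attach:)
Code:
#!/bin/sh
# attach each raidz2 member with the pool's geli key (add the remaining gptids to the list)
KEY=/data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key
for id in efde27c0-cdd4-11ea-a82a-d05099d3fdfe \
          ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
do
    geli attach -k "$KEY" "/dev/gptid/$id"   # prompts for the passphrase each time
done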

Is this a good idea?
 
Hi SirDice,

I understand that you can't support TrueNAS / FreeNAS specific issues. However, doesn't the question and issue itself also apply to FreeBSD? I understand that the reason I ran into it (the reason I accidentally destroyed my SLOG) is TrueNAS / FreeNAS. But the issue itself (and possibly the solution) should apply to FreeBSD as well, no?
 
No, this is not a universal FreeBSD issue. The problem you are having is a result of you partitioning a disk and using it for two separate pools managed by the TrueNAS middleware, which runs counter to its assumptions on how disks will be used (i.e. whole disks are assigned to pools).

If you were to debug the middleware and find some underlying FreeBSD issue, you could report that here, potentially, but you'd need steps to reproduce outside of TrueNAS so that the FreeBSD developers could play with the issue.

There is literally no point in coming to the FreeBSD forum with errors from the TrueNAS GUI; you are much more likely to get support in the TrueNAS forums. I would ask there about how to manually decrypt and mount your raidz2 array.

If you need the middleware out of the way to complete your data recovery, and you end up using FreeBSD to manually decrypt and attempt to import the damaged array, then you could come back here for further troubleshooting tips on how to recover an array with a destroyed SLOG device if you need to.
 
Thanks for the reply upintheclouds.

I understand why I ran into this issue, and I also get that TrueNAS script errors perhaps weren't the best thing to post here ;) Sorry for that...
I did try to post it in the TrueNAS forums as well, but even after bumping my thread twice, I still didn't get a single response.

I was kind of hoping that someone here in the FreeBSD forum could indeed explain to me how to manually decrypt and fix my pool, so that I could try to import it again in TrueNAS.

edit:
I've put the TrueNAS script errors in <spoiler> tags. Hopefully that makes my post a bit more "bearable" 😇
 
Do you have any support from IxSystems for it? I thought this was their commercial system. Honestly, if you do, I would go to them. Anything you do yourself, or based on advice from us, could put you in a position where they can no longer help you.

Also, it's difficult to support stuff like this, as the middleware tends to have specific ways of doing things that we might screw up by messing with the system directly (hence the warning post).

Anyway, a few points.

1) I'm not really sure what you mean by unlock/lock, unless you're just referring to loading/closing the geli devices.
2) If using GELI, this is completely transparent to ZFS. It will act no differently than if it were not encrypted.
3) ZFS will not touch other partitions during a destroy. Either this is a bug in TrueNAS, or you did something else to the disk as well.

Personally I would look at loading the GELI disks manually via the CLI, then try to import the pool. The -m option to zpool import should allow importing a pool with a missing log device. Again, the fact that the pool is encrypted should be irrelevant. If only a log device is missing, it *should* be importable. It would be interesting to see the output of zpool import once all the disks are unlocked.
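Something along these lines, as a sketch (hgstpool being the pool name from your earlier commands):
Code:
# once the geli providers are attached, see what ZFS can find
zpool import
# import even though the log device is missing
zpool import -m hgstpool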
 
IxSystems is indeed a commercial company, and they sell appliances with TrueNAS pre-configured / installed. But they also offer TrueNAS freely to the community, which then comes without "official support". I'm using my own server and the free TrueNAS download, so I'm afraid I don't have "official" support.

I do understand the warning everyone is giving me here. I'm afraid this same warning is even the reason I ran into this issue :confused: But as my offline backup is older than I'd like, I wouldn't mind taking a few "well-calculated" risks to try and update my offline backup before giving up on my pool...

1) I suppose I indeed mean "loading / closing" the geli devices. The TrueNAS middleware scripts are failing to load my geli devices, because the SLOG is missing.
2) OK. I'm not entirely sure how to interpret this, though.
3) Although TrueNAS doesn't really support multi-partitioned devices, it does support importing them through the GUI. So I did report this as a TrueNAS bug as well. Either the import should fail, or destroying a "partition-pool" shouldn't accidentally render other pools unusable, I think ;)

I would indeed like to try to load my geli devices through the command line, to try and make my pool visible (importable / fixable) again.
Are the "geli attach" commands that I suggested in my opening post normally "safe" to run?
 
I don't use geli, but I see little issue with running an attach command manually. It will either work or give an error. It *shouldn't* damage anything.

As above, it would be interesting to see what a bare "zpool import" shows once all the disks are loaded.
 
Point two is referring to the title: "log on encrypted pool is a single point of failure".

Encryption is irrelevant. A log is a single point of failure on any pool, but it should be recoverable by telling ZFS you don't care about the possibly missing data.
 
Thanks!

I'll give the "geli attach" a shot later this evening then (still at work atm)
I also understand your second point now.

By "losing" the SLOG, I suppose I risk losing data that still had to be written to the disks? As I hadn't loaded the geli devices yet after a reboot when I accidentally destroyed the SLOG, I suppose there was no data waiting to be written, so I may not have lost anything, right? (And even if I did lose a bit, it's still better than losing 1 month of data by having to use an outdated backup ;))
 
Yes, the log only holds in-flight sync writes. Given that it sounds like you were just messing with the system and the pool wasn't in use, it is highly unlikely anything was in the log (in fact it should be impossible if the pool wasn't imported at the time, unless the system crashed the last time it was in use). Of course, ZFS has no way of knowing the log was empty without being able to see it...

As ZFS values data integrity over pretty much everything else, a pool missing a log will not import unless you explicitly tell it to forget about the log.
 
Code:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe
Enter passphrase:
data# zpool import
   pool: hgstpool
     id: 11186148882051621824
  state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-6X
config:

        hgstpool                                            UNAVAIL  missing device
          raidz2-0                                          ONLINE
            gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
            gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE
        logs
          gptid/f8725bbb-64c5-11eb-bbcd-d05099d3fdfe        UNAVAIL  cannot open

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.
data#
It seems like this worked!

Now I wonder what would be best to do next:
1)
* zpool remove hgstpool gptid/f8725bbb-64c5-11eb-bbcd-d05099d3fdfe
* geli detach /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe
* geli detach /dev/gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe
* then reboot TrueNAS
* And finally try to mount it again using the GUI
* Update my offline backup

2)
* zpool import -m hgstpool or zpool import -m -d /dev/disk/??? hgstpool (or something... dunno really)
* zfs mount hgstpool (not sure if I need to create or specify a mountpoint? The usual mountpoint /mnt/hgstpool currently does not exist; see the altroot sketch below this list)
* Try to get it shared somehow if TrueNAS doesn't automatically share it with my already existing share
* Update my offline backup

3)
Or something else
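For option 2, I think the mountpoint question could be handled with an altroot (a sketch; if I understand it correctly, -R temporarily mounts everything under /mnt for that import, without changing the stored mountpoints):
Code:
# import with a missing log device, mounting datasets under the /mnt altroot
zpool import -m -R /mnt hgstpool
# check where the root dataset would mount
zfs get mountpoint hgstpool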
 
I don’t think a remove command will work on an exported pool.

Likely your only option, and I would expect the best one, is an import -m. This should import and remove the log device at the same time.
 
If it works, I would export and then try to import via the GUI so TrueNAS can mount it however it wants and do anything else it would normally do during an import.
 
After already doing the steps below yesterday:
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe
  • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe
    • Which made my pool visible again in the CLI
I've just tried the following steps (which were also advised by a TrueNAS forum moderator):
  • zpool import -m hgstpool
    • This worked and the pool immediately became fully visible in the GUI as degraded, because of the missing SLOG
  • zpool remove hgstpool gptid/f8725bbb-64c5-11eb-bbcd-d05099d3fdfe
    • Also this worked very well. The pool became healthy again in the GUI
  • zpool export hgstpool
    • Since the mod stressed not doing this from the GUI, I did the above in the CLI. This however didn't work as it should have: my pool didn't disappear from the GUI (as he said it should), it only became "offline".
Next I rebooted and unlocked the pool from the GUI, which worked fine. My data is available again, so I'll start by updating my offline backup now...

Later I may still need to ensure my TrueNAS setup is still completely healthy (perhaps do a rebuild to make sure), but I'll get guidance for that part in the TrueNAS forum.
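(I guess the first step for that health check would be a scrub, something like this sketch:)
Code:
# scrub the whole pool and watch the progress / results
zpool scrub hgstpool
zpool status -v hgstpool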

Anyway, thanks a lot to everyone here for assisting me, even though it was probably a (dark) grey zone whether it was correct to post this here ;)
 