ZFS 2x3tb as one raidz member

No, I am ready to wait for your arguments :)
OK, here's the concern. When your system is at boot stage 2, there is no gmirror anywhere. All that's available at that early stage is the MBR partition table (for BIOS boot) and the information the disk hardware reports. Suppose the MBR is intact, but the primary GPT header is corrupt. Will the loader try to get the partition information from the secondary GPT header? If so, where does it get the disk size information? If it gets it from the MBR, does it report the one-sector-short gmirror information?

I found this article that shows how to dump the partition table from an MBR. Unfortunately, FreeBSD's file(1) does not report this information. I'm searching the Web looking for some code or utility that will.

Every time I touch Linux (out of necessity) I feel pain :)
I feel you, brother. I get more than enough of that at work.
 
From FreeBSD list
"Also you can't boot from a gconcat volume like you can
from a gmirror volume."
Correct, you cannot. There is afaik no concat driver compiled in the loader binary. You can boot from a mirror, because the mirror is simply ignored.
It's rather simple to understand: when booting, you can only use the rudimentary stuff the BIOS knows about, plus whatever is compiled into the loader binary. Only afterwards is the kernel loaded from the disk partition, and only once the kernel is in memory do you have the full system. (Then the devices are re-initialized in Unix fashion, and then we get to single-user.)

But I would not do that in any case. I would not make such a fat pool the boot+root device - at every OS upgrade the whole payload would be at stake, and then imagine something going wrong...
I would bite off 5 GB from every disk with gpart, make two of these bites a separate mirrored ZFS pool for the OS, and use the other four as a nice 20 GB of paging. Or something along that line.
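A rough sketch of that with gpart and zpool, assuming six data disks ada0..ada5 (device names, labels and the pool name are placeholders, and the boot/EFI partitions are left out):
Code:
        # on the two disks holding the OS mirror: a 5 GB freebsd-zfs bite
        gpart create -s gpt ada0
        gpart add -t freebsd-zfs -l os0 -s 5G ada0
        # on the other four disks: the same bite, but as swap
        gpart create -s gpt ada2
        gpart add -t freebsd-swap -l swap2 -s 5G ada2

        # two bites become a small mirrored OS pool; the rest of each disk goes into the big pool
        zpool create zroot mirror gpt/os0 gpt/os1
        swapon /dev/gpt/swap2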
 
OK, here's the concern. When your system is at boot stage 2, there is no gmirror anywhere. All that's available at that early stage is the MBR partition table (for BIOS boot) and the information the disk hardware reports. Suppose the MBR is intact, but the primary GPT header is corrupt. Will the loader try to get the partition information from the secondary GPT header? If so, where does it get the disk size information? If it gets it from the MBR, does it report the one-sector-short gmirror information?
You are right, in this scenario you will have a problem. But if you have a corrupted partition table, the entire disk is probably corrupted as well. This is exactly the situation you made the mirror for. In that case you will have to replace the damaged disk with a new one and boot from a second, good disk. gmirror will rebuild the mirror and restore the first disk.
 
But I would not do that in any case. I would not make such a fat pool the boot+root device - at every OS upgrade the whole payload would be at stake, and then imagine something going wrong...
I have already been convinced that it is better to have the system on separate 2xSSDs, I will do so.

The pool that everything works and boots on now was created about 10 years ago, when I was just starting to learn about ZFS. And beyond that, "it works - don't touch it". Now that the requirements have changed, I will of course rebuild the whole system.
 
You are right, in this scenario you will have a problem. But if you have a corrupted partition table, the entire disk is probably corrupted as well. This is exactly the situation you made the mirror for. In that case you will have to replace the damaged disk with a new one and boot from a second, good disk. gmirror will rebuild the mirror and restore the first disk.
Well, I have had disks where just the boot sector was gone. This typically happens when someone (me) decides to partition the wrong disk. It was trivially fixed with the fdisk /mbr command in the bad old DOS days.

Turns out the command you want on FreeBSD to examine the partition table in an MBR is fdisk(8). Interestingly, on my system the DOS partition in the MBR is 71 sectors shorter than the gmirror disk size (72 shorter than the physical disk). I'm guessing that yes, you will have problems in the unlikely event that your MBR is healthy but your primary GPT header is corrupt. The solution is to just switch the primary boot drive to the other member of the mirror in the BIOS, which is exactly the same recovery as for a bad MBR.
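For the record, comparing the three views is straightforward (the device name is a placeholder):
Code:
        fdisk ada0          # the MBR slice table, including its size field
        gpart show ada0     # GEOM's view of the partitioning
        diskinfo -v ada0    # what the hardware reports as the disk size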

I have already been convinced that it is better to have the system on separate 2xSSDs, I will do so.

The pool that everything works and boots on now was created about 10 years ago, when I was just starting to learn about ZFS. And beyond that, "it works - don't touch it". Now that the requirements have changed, I will of course rebuild the whole system.
This is my first ZFS system as well, now a couple of years old. I would've skipped the gmirror on the two SSDs and just done a ZFS root pool with a single mirror vdev if I was doing it now.

One thing I do like about this setup is that I have the ZFS intent log on a partition in my gmirror. This means it too is resilient to a single-disk failure.

Edit: Though with a bad MBR the recovery might happen automatically if you've added the other member of your mirror to the BIOS boot order.
 
This is my first ZFS system as well, now a couple of years old. I would've skipped the gmirror on the two SSDs and just done a ZFS root pool with a single mirror vdev if I was doing it now.

One thing I do like about this setup is that I have the ZFS intent log on a partition in my gmirror. This means it too is resilient to a single-disk failure.

Edit: Though with a bad MBR the recovery might happen automatically if you've added the other member of your mirror to the BIOS boot order.
Mirroring the whole SSD does not seem like a good solution to me, because I want to allocate all free space to L2ARC. My scheme is like this:
2x256 GB SSD:
64 GB ZFS mirror for the system
32 GB swap (on each disk)
2x4 GB ZIL (may be in a gmirror, but I don't see a big problem if the ZIL suddenly fails. It is rarely used at all and the main disk will stay consistent anyway)
All other space (156 GB x 2) is L2ARC
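Roughly, as a gpart sketch for one of the two SSDs (device, labels and sizes are placeholders, not a tested recipe, and boot partitions are omitted):
Code:
        gpart create -s gpt ada0
        gpart add -t freebsd-zfs  -l sys0   -s 64G ada0   # OS, mirrored in ZFS
        gpart add -t freebsd-swap -l swap0  -s 32G ada0   # swap on each disk
        gpart add -t freebsd-zfs  -l zil0   -s 4G  ada0   # log (ZIL) partition
        gpart add -t freebsd-zfs  -l l2arc0        ada0   # rest (~156G) for L2ARC
        # same on the second SSD (ada1), then:
        zpool create zsys mirror gpt/sys0 gpt/sys1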
 
Mirroring the whole SSD does not seem like a good solution to me, because I want to allocate all free space to L2ARC. My scheme is like this:
That's pretty much what I do, on a somewhat smaller scale. Only I do not yet have the OS on ZFS.
2x256 GB SSD:
64 GB ZFS mirror for the system
32 GB swap (on each disk)
2x4 GB ZIL (may be in a gmirror, but I don't see a big problem if the ZIL suddenly fails. It is rarely used at all and the main disk will stay consistent anyway)
The zil can get mirrored by ZFS when you have two partitions for it:
Code:
        NAME             STATE     READ WRITE CKSUM
        gr               ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            da1p2.eli    ONLINE       0     0     0
            da2p2.eli    ONLINE       0     0     0
            da0p2.eli    ONLINE       0     0     0
        logs
          mirror-1       ONLINE       0     0     0
            ada1p4.eli   ONLINE       0     0     0
            ada3p13.eli  ONLINE       0     0     0
          mirror-2       ONLINE       0     0     0
            ada3p14.eli  ONLINE       0     0     0
            ada6p4.eli   ONLINE       0     0     0
        cache
          ada3p15.eli    ONLINE       0     0     0

Here I have one big and two small SSDs, and try to distribute the zil usage evenly.
The cache (l2arc) can get destroyed at any time. With the log (zil) ZFS is a bit angry when it disappears, but it can be removed - it only contains the most recent changes to the pool, so the loss is normally minimal. Only the "special" device (if used) must never be removed. (The "special" device is for metadata and optionally for small files.)

ZIL usage depends on the application. If You happen to provide NFS, then usually everything goes thru the zil, and it becomes quite an SSD killer.
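For completeness, this is roughly how log and cache vdevs get attached and detached; pool and partition names below are placeholders, not the actual commands used for the pool above:
Code:
        zpool add tank log mirror ada1p4 ada2p4   # mirrored SLOG
        zpool add tank cache ada1p5               # L2ARC, no redundancy needed
        zpool remove tank mirror-1                # a log mirror can be removed again
        zpool remove tank ada1p5                  # so can the cache device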
 
Loss of the ZIL can corrupt the underlying pool to the point of total loss. The ZIL must be protected against power loss. There are two common options:
  1. leave the ZIL in the pool where it resides transparently on spinning disks; or
  2. move the ZIL to a separate log (SLOG) on SSDs which have "power loss protection".
My root disk layout is similar to yours, however it's constructed from two enterprise-class SSDs (on separate controllers), which have "power loss protection" (on-board capacitors). The root and SLOG are constructed from separately mirrored ZFS partitions. The L2ARC is a stripe of two ZFS partitions (it's a cache, and thus not sensitive to data loss, so a stripe is OK). The swap is a GEOM mirror (because swap on a ZFS pool can result in a deadly embrace).
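The swap part of such a layout is plain gmirror; a minimal sketch, with placeholder partition names:
Code:
        gmirror load                        # or geom_mirror_load="YES" in /boot/loader.conf
        gmirror label -v swap ada0p3 ada1p3
        swapon /dev/mirror/swap
        # /etc/fstab:
        # /dev/mirror/swap   none   swap   sw   0   0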

The Intel enterprise-class SSDs I use for the root are expensive per GB, but quite small (about 120 GB used on each drive). They also have excellent latency, which is another desirable characteristic. However, I believe the conventional advice is that the ZIL and L2ARC should not be on the same media...

For extra safety, I also have a UPS, managed by sysutils/nut, which will shut down the ZFS server cleanly in the event of a power loss (and reboot it when the power comes back).
 
Loss of the ZIL can corrupt the underlying pool to the point of total loss. The ZIL must be protected against power loss. There are two common options:
Honestly, this is the first time I've heard about it. In ZFS all operations are atomic; in the worst case the ZIL contains synchronous write operations that have not yet been written to the pool. How could the loss of the ZIL mean the loss of the pool?


For extra safety, I also have a UPS, managed by sysutils/nut, which will shut down the ZFS server cleanly in the event of a power loss (and reboot it when the power comes back).
I use apcupsd for this
 
Loss of the ZIL can corrupt the underlying pool to the point of total loss.
I usually tend to believe You, but I don't see the technical cause in here: the zil contains the write intentions from the current txg(s). When these disappear, then it should always be possible to recover the pool as of the last fully completed txg.
But then, I do have power loss about every second year, so I am not forced to do investigation in that area.
 
Does putting special on ssd give any visible performance gain? How to correctly size it?
Short answer: It probably depends. ;) To size it, take a representative subset of your payload and create a test pool with it.
Full story: I had a build pool (the source and ports trees, with git switching the branches, and creating ready-made images for upload to cloud KVMs) that resided on the SSD mirror of my desktop. I moved it to the server, and there was an icy cage with some Seagate Cheetahs that needed a retirement job (I cannot throw away good old metal), so I put it onto that - and obviously that was unusable. Adding a "special" device made it usable - it is certainly not as fast as if fully on SSD, but also not so much slower.
In this use case an l2arc does not help much, because git switches branches by replacing the content, and the data in the l2arc becomes invalid every time.
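For reference, adding a special vdev to an existing pool looks roughly like this (pool and partition names are placeholders); unlike log and cache it needs redundancy, because losing it means losing the pool:
Code:
        zpool add tank special mirror ada1p5 ada2p5
        # optionally let it also hold small file blocks:
        zfs set special_small_blocks=32K tank/builds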
 
I usually tend to believe You, but I don't see the technical cause in here: the zil contains the write intentions from the current txg(s). When these disappear, then it should always be possible to recover the pool as of the last fully completed txg.
But then, I do have power loss about every second year, so I am not forced to do investigation in that area.
So data loss but not corruption? This is still a big deal if you're using the ZIL to speed up database transactions, IMHO.

It could still corrupt an ACID database depending on the database software's implementation details. All the elements of an ACID transaction must either all succeed or all fail. If writing one to disk is spread across more than one I/O operation, you could get into a situation where some of those I/Os were in a ZIL batch that got flushed to disk, and some weren't. The database software is in the dark about these details and believes all writes succeeded (that's the point of fsync). Should the ZIL fail, the database will be corrupted.

ZIL usage depends on the application. If You happen to provide NFS, then usually everything goes thru the zil, and it becomes quite an SSD killer.
I guess -o async on the NFS client mount command would help here. Then again, one does not always control the client.
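Something like this, if the suggestion above works for your client; the server-side counterpart, with the same trade-off, would be disabling sync on the exported dataset (server, paths and dataset names are placeholders):
Code:
        # on the client, as suggested above - trades safety for speed
        mount -t nfs -o async server:/export/data /mnt/data
        # or on the NFS server, per dataset:
        zfs set sync=disabled tank/export/data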
 
ZIL is for writes, not reads. I'm not sure if it's persistent (replaying anything in the ZIL on startup), but if it is, or is planned to be in the future, you could still lose data, just less of it.
That's the problem with a write cache, which the ZIL is kind of like: yank the power and you have a probability of losing data not yet written to the device. There are lots of things you can do to minimize the amount of loss, but a UPS and graceful shutdown are about the only way to guarantee you don't lose anything.
 
ZIL is for writes, not reads. I'm not sure if it's persistent (replaying anything in the ZIL on startup)
That's the only thing the zil does: replay after a reboot following an unexpected stop. (It is not a cache.)
And with that, ZFS can acknowledge sync writes (and only sync writes are concerned here) immediately, without waiting for them to be committed to the main pool.
This always happens; but normally (without separate log devices) the zil lives alongside the other data on the pool devices, creating additional seeks etc.
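You can watch where those zil writes actually land with zpool iostat (pool name is a placeholder); with a separate log device the traffic shows up under "logs" instead of on the data vdevs:
Code:
        zpool iostat -v tank 5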
 
So data loss but not corruption? This is still a big deal if you're using the ZIL to speed up database transactions, IMHO.

It could still corrupt an ACID database depending on the database software's implementation details. All the elements of an ACID transaction must either all succeed or all fail.
Yes. This is the pain between filesystem driver and database implementation. Unjournaled ufs, for example, doesn't care at all about these issues.

Take postgres, for example: there is a switch full_page_writes, and this can be switched off with ZFS, because it is guaranteed that a database page (8 kB) is written either completely or not at all. On ufs one must not switch that off: the database must take extra precautions to detect half-written database blocks at restart, otherwise they may create silent data corruption.
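As a sketch of that pairing - the recordsize tweak is a common companion setting, not something claimed above, and the dataset name is a placeholder:
Code:
        # match the dataset recordsize to the 8 kB postgres page
        zfs set recordsize=8K tank/pgdata
        # then in postgresql.conf, relying on ZFS writing blocks atomically:
        #   full_page_writes = off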

Then, the next interesting question is, what does the database application do when it gets a commit, and later on that transaction doesn't exist in the database, because it was only in the ZIL and that got lost? This is certainly a killer when you do things like credit-card payment approval. In such a case all that gpw928 said about power loss protection is very relevant.

So there are at least three levels involved:
  1. is the filesystem consistent in itself, i.e. is the data still accessible?
  2. is the contained data consistent in itself, i.e. does the database content still make sense?
  3. is the data consistent with the state of affairs other parties may have?

I guess -o async on the NFS client mount command would help here. Then again, one does not always control the client.
Yepp, that works. But there is a reason why these activities are sync by default, so it depends if you can afford to make them async.
 
Honestly, this is the first time I've heard about it. In ZFS all operations are atomic; in the worst case the ZIL contains synchronous write operations that have not yet been written to the pool. How could the loss of the ZIL mean the loss of the pool?
To quote Klara Systems:
What happens if a failure, such as a power loss causing a system reboot, occurs? Everything in RAM, including all pending transactions and asynchronous write requests, is gone. If there was an interrupted transaction group performing a write, that transaction group is incomplete and the data on disk is now out-of-date by 5 seconds, which can be a big deal on a busy server.
However, keep in mind that pending synchronous writes are still on disk in the ZIL. On system startup, ZFS reads the ZIL and replays any pending transactions in order to commit those writes to disk. That sounds like a pretty good system: pending synchronous writes still get written and no more than 5 seconds worth of asynchronous writes are lost.
When purchasing a SLOG device, you want something that is non-volatile and battery-backed. It does not need to be large as the ZIL only needs to have enough capacity to contain all the synchronous writes that have not yet been flushed from RAM to disk.
SSDs are different to spinning disks in that cheap SSDs have write caches that are volatile. The data in these volatile caches get acknowledged to the O/S as being "on disk" when they are not. And these cached data can be lost when the power fails. Expensive "enterprise class" SSDs have capacitors that hold enough reserve power to write volatile cache contents to permanent storage when the power fails.

If an SSD containing a separate ZIL does not have "power loss protection", the synchronous writes stored in the ZIL can be completely lost when the power fails -- and any database (which relies on synchronous writes for integrity) just went down the toilet...

To be completely fair, the ZIL only matters for synchronous writes, and if you are completely sure that no application is using them, then you can be sure that the ZIL won't be used (so you have no need for a separate ZIL in the first place). Power loss may still result in lost asynchronous file system transactions, but file system transaction logs generally allow maintenance of structural integrity of the file system (you can still lose data).
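One way to at least see what a drive claims about its volatile write cache (device name is a placeholder; what the firmware actually does on a flush is another matter):
Code:
        camcontrol identify ada0 | grep -i 'write cache'
        smartctl -g wcache /dev/ada0    # from sysutils/smartmontools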
 
I usually tend to believe You, but I don't see the technical cause in here: the zil contains the write intentions from the current txg(s). When these disappear, then it should always be possible to recover the pool as of the last fully completed txg.
But then, I do have power loss about every second year, so I am not forced to do investigation in that area.
You are right. I tend to think about a pool using a separate ZIL as relying on synchronous writes (e.g. databases and NFS).
That's because you won't want or need a separate ZIL otherwise.
But it's the applications that will suffer data corruption on unprotected power loss, not the pool.
 
If an SSD containing a separate ZIL does not have "power loss protection", the synchronous writes stored in the ZIL can be completely lost when the power fails -- and any database (which relies on synchronous writes for integrity) just went down the toilet...
I prefer to disagree. In theory You are right, but in practice the database just cannot know that it went down the toilet. ;)
When a database starts recovery, it can only know the data that is on disk, it has no way to figure out what was before. So if that data is five seconds back, because the ZIL got lost and was not replayed, then the database will recover to a point-in-time five seconds earlier. (Point-in-time restore and log replay in a database is the same routine.)

This is a big issue if some other party relies on the completed transactions (as I said before, credit card approvals and such). It should be no issue when the database itself is the authoritative source for the data - e.g. the database storing this forum: you post a message, suddenly the website becomes unresponsive. After a while you can connect again, but the message isn't there. Can happen.

Partial data should not be a problem, because the database does its own recovery from the redo log. And the redo log is a sequentially written file; it cannot have partial data - it goes to a certain point, and there it ends. And everything up to that point will be recovered.
 
I prefer to disagree. In theory You are right, but in practice the database just cannot know that it went down the toilet. ;)
When a database starts recovery, it can only know the data that is on disk, it has no way to figure out what was before. So if that data is five seconds back, because the ZIL got lost and was not replayed, then the database will recover to a point-in-time five seconds earlier. (Point-in-time restore and log replay in a database is the same routine.)
Yeah, but the two logs (database write-ahead log and ZFS ZIL) are not guaranteed to be in synch. They could be in different devices altogether. Suppose the database WAL has a checkpoint at transaction X. The database on restart will rely on all transactions up to and including X having been committed to the database data files. However, if transaction X was actually only partially applied, there could be data consistency problems like missing foreign keys.

These problems might not even manifest right away. It could be that nothing bad will happen until someone actually tries to access the table with the missing foreign key, at which point the database system is likely to crash with some assertion.

Recovery from that could be very, very difficult. You'll have to hope that you can identify transaction X somehow, and that you've archived enough WAL logs to be able to re-apply X and all the transactions that followed. However, some later transactions may depend on data that did not exist because X was applied inconsistently. A more likely scenario is that you'll restore your DB from backup, losing however much data was written after transaction X.

Partial data should not be a problem, because the database does its own recovery from the redo log. And the redo log is a sequentially written file; it cannot have partial data - it goes to a certain point, and there it ends. And everything up to that point will be recovered.
Redo log is only replayed since the last checkpoint. Checkpoints typically happen after a transaction has been committed. Same problem if the checkpoint transaction got corrupted by the ZIL failure.
 
Yeah, but the two logs (database write-ahead log and ZFS ZIL) are not guaranteed to be in synch.
No, they are completely independent. One builds upon another.
They could be in different devices altogether. Suppose the database WAL has a checkpoint at transaction X. The database on restart will rely on all transactions up to and including X having been committed to the database data files. However, if transaction X was actually only partially applied, there could be data consistency problems like missing foreign keys.
Hm. One would need to look into that and understand how the checkpoints are actually done and how it is decided which were done completely.

These problems might not even manifest right away. It could be that nothing bad will happen until someone actually tries to access the table with the missing foreign key, at which point the database system is likely to crash with some assertion.

Recovery from that could be very, very difficult. You'll have to hope that you can identify transaction X somehow, and that you've archived enough WAL logs to be able to re-apply X and all the transactions that followed.
Silent corruption is an ugly thing. Otherwise I am very relaxed: if a database does not recover, I'll restore from backup and then apply all the logs that are there - and then I will see where and why that fails, and create another bug report. ;)

This is indeed another gotcha in the whole logic: it expects that all the stuff works bugfree.

And there is yet another: typical SSDs use triple-level (or more) cells, so if a cell gets erased to store another bit, there may be two more bits of old data that need to be preserved. What if a power failure comes in between? We rely on the embedded SSD controller to employ some algorithm so that this can never happen. And we don't know if that is bugfree.

Redo log is only replayed since the last checkpoint. Checkpoints typically happen after a transaction has been committed. Same problem if the checkpoint transaction got corrupted by the ZIL failure.
That is interesting. Checkpoint happens, and writes lots and lots of data. At some point it is completed, and then there must be a flush before the WAL can be dumped. What if that flush gets lost... I would assume, if the flush is lost, then the WAL dump is also lost - and how would the database at restart know that this checkpoint was ever completed?

That would still need some logical verification. But then, before I start to buy expensive enterprise SSD, I would rather get a diesel. I cannot know what's inside the SSD logic, but I can make sure that a diesel works - and I love heavy metal...
 
That is interesting. Checkpoint happens, and writes lots and lots of data. At some point it is completed, and then there must be a flush before the WAL can be dumped. What if that flush gets lost... I would assume, if the flush is lost, then the WAL dump is also lost - and how would the database at restart know that this checkpoint was ever completed?
That's an implementation detail, but I think the way it's usually implemented is as a high-water mark. You have some sequential log in which every record has a monotonically increasing numerical id. Writing a new checkpoint usually means just updating some field to the new record number. So if the old checkpoint was X, you update the checkpoint record to X+n where n is some positive whole number.

That would still need some logical verification. But then, before I start to buy expensive enterprise SSD, I would rather get a diesel. I cannot know what's inside the SSD logic, but I can make sure that a diesel works - and I love heavy metal...
Well, we've got that in common, at least.
 
This is a big issue if some other party relies on the completed transactions (as I said before, credit card approvals and such). It should be no issue when the database itself is the authoritative source for the data - e.g. the database storing this forum: you post a message, suddenly the website becomes unresponsive. After a while you can connect again, but the message isn't there. Can happen.
It seems to me that you are asserting that databases are like file systems -- and that it's OK to lose committed transactions, provided that the thing can be made to continue working.

I have spent half a lifetime working around Oracle and DB2 installations where losing a committed transaction would be cause for an inquisition that would make the Spanish one look like a picnic -- well, it would at least result in a serious non-compliance in the QA system...

SSDs that lose data which have been successfully written with the O_SYNC flag set (or where an fsync(2) system call has been issued) are simply not reliable. If they could be made reliable, there would be no need for O_SYNC. The down-sides may be mitigated to some extent by informed and clever design, but may never be fully eliminated.

We have not touched on synchronous NFS, where the applications are likely to be even more vulnerable...
 
It seems to me that you are asserting that databases are like file systems -- and that it's OK to lose committed transactions, provided that the thing can be made to continue working.
It's a money question. There is a price difference of almost 1:10, so when the question is, do I need that stuff, the imo correct answer is: you need it if there is somebody who requires that quality-of-service - and who pays for it.

There is a thing called "risk-assessment", and that means, figure out which risks you can afford to take, and which risks need remedy - and that is individual to the use-case.

I have spent half a lifetime working around Oracle and DB2 installations where losing a committed transaction would be cause for an inquisition that would make the Spanish one look like a picnic -- well, it would at least result in a serious non-compliance in the QA system...
Yeah, that's fine, but it is not a technical issue. (I always imagine what those people would do when confronted with real problems - like being stuck in the Brazilian jungle without food or water. That would be funny to watch.)
 