Solved Question about ZIL/SLOG

I have been reading about the ZFS ZIL/SLOG. I understand how it works and the benefits it provides, but I am confused about when the ZIL is used.
When an application needs a sync write, the data is written to the ZIL, which, if I have understood correctly, is located on the pool disks. So why not write to the pool directly, if it has the same write speed as the ZIL?
If the ZIL and the pool use the same disks, the write speed should be the same, so why write the data twice?

Best regards.
 
 
ZFS writes data in transaction groups, and composing a transaction group may take many seconds.

The ZIL behaves like the logs used in a variety of other file systems. It logs the intended transaction quickly to disk, so a synchronous write can be acknowledged as "on disk" quickly -- while the transaction group is still being composed.

If the system crashes after the write is acknowledged, and before the transaction group gets to permanent storage on disk, the log can be replayed after a reboot to honour any previously acknowledged write operations (which is required behaviour for synchronous write operations).

If you move your ZIL to separate, faster media, the logging happens more quickly, and thus synchronous writing proceeds more quickly.
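
To make the acknowledge-then-replay idea concrete, here is a minimal toy sketch in Python. It is not how ZFS is implemented; the class `ToyPool` and all of its method names are made up for illustration. It only models the ordering described above: log the intent to stable storage, acknowledge, commit the transaction group later, and replay the log if a crash happens in between.

```python
class ToyPool:
    """Toy model: 'disk' state survives a crash, 'ram' state does not."""

    def __init__(self):
        self.disk_data = {}    # final on-disk state (committed transaction groups)
        self.disk_log = []     # intent log records on stable storage
        self.ram_pending = {}  # open transaction group, held in RAM only

    def sync_write(self, key, value):
        # Log the intent to stable storage first, then acknowledge.
        self.disk_log.append((key, value))
        self.ram_pending[key] = value
        return "acknowledged"

    def commit_txg(self):
        # Write the whole group to its final place; the log is then obsolete.
        self.disk_data.update(self.ram_pending)
        self.ram_pending.clear()
        self.disk_log.clear()

    def crash(self):
        # Power loss: everything held in RAM is gone.
        self.ram_pending = {}

    def replay(self):
        # After reboot: re-apply every acknowledged write found in the log.
        for key, value in self.disk_log:
            self.disk_data[key] = value
        self.disk_log.clear()


pool = ToyPool()
pool.sync_write("a", 1)
pool.commit_txg()
pool.sync_write("b", 2)  # acknowledged, but its txg was never committed
pool.crash()
pool.replay()
print(pool.disk_data)
```

The write of "b" survives the crash only because its log record was on stable storage before the acknowledgement was returned.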
 
Thank you for the answer, Alain De Vos, but I have read the forum post and it doesn't answer any of my initial questions.
In the forum link I can read that what is written to the ZIL is "metadata" information, not the data to be written to the pool. Is that correct?
In any case, if the ZIL is written to the pool disks, why not write the data directly to the pool disks?

Best regards.
 
Hello gpw928, you say "logs the intended transaction quickly to disk", but it's not quick: it is the same speed as writing to the pool directly, because the ZIL is part of the pool disks if you don't use an external dedicated disk for the SLOG.
I think I am not understanding something that is obvious to you.

Best regards.
 
Consider the ZIL as a write cache which is faster than the pool. Data goes to the ZIL, and when there is time it goes from the ZIL to the pool. This two-step process is faster when the ZIL device is a faster device. Cf. SSD vs. hard disk.
 
Hello gpw928, you say "logs the intended transaction quickly to disk", but it's not quick: it is the same speed as writing to the pool directly, because the ZIL is part of the pool disks if you don't use an external dedicated disk for the SLOG.
I think I am not understanding something that is obvious to you.
Please read the first line of my post again. Transaction groups take up to 30 seconds to compose. Writing a log entry to the ZIL (on the same disks as the rest of the pool) takes milliseconds.
 
Consider the ZIL as a write cache which is faster than the pool. Data goes to the ZIL, and when there is time it goes from the ZIL to the pool. This two-step process is faster when the ZIL device is a faster device. Cf. SSD vs. hard disk.
Why is the ZIL faster than the pool if the ZIL is written in the pool?
If you move your ZIL to separate, faster media, I understand it will be faster. But if you use a ZIL in the same pool, it will have the same access time as the pool, so why not write the data directly to the pool if writing to the ZIL and writing to the pool cost the same time?
 
The ZIL exists in RAM (volatile) and on the disks of a pool (non-volatile). Instead of using the disks of a pool, a separate storage (also non-volatile) outside the pool can be used: a SLOG device. Only data in non-volatile storage is safe in case of a system crash or power loss. You ask valid questions but you'll have to dive deeper into the essence of ZIL and the details of how it works.

At its core, the ZIL is a mechanism used to guarantee that data is written relatively quickly to persistent storage in a safe manner, out of danger from any mishap such as an unexpected power loss or crash. In the view of ZFS, that persistent place is somewhere on the actual disk platters or on an SSD. On an SSD that should be either the non-volatile memory cells of that device or the volatile memory cells that are covered by the PLP (power loss protection) of that device. ZFS tries very hard to make sure that a device's acknowledgement that the data has actually been written is accurate.

Relating to your question "why write data twice?": yes, data is written twice, but there is an essential difference between ZIL data on disk and non-ZIL data on disk. ZIL data is very small and is of a transient nature; it has a temporary data storage structure and has not reached its final state. ZIL data (in the pool, on disk) is not part of the normal persistent ZFS data structures of the pool. However, when data has been written to a pool in its final state, it is part of the normal persistent ZFS data structures of that pool.

A SLOG is a device used exactly to avoid having to "write data twice" to the pool and to decouple the ZIL I/O of a pool from the non-ZIL I/O of that pool. Because a SLOG takes the ZIL I/O away from a pool, data is written to that pool only once, decreasing its I/O. It increases the I/O bandwidth of the pool for all data because that data does not have to compete with ZIL data being written: there is no ZIL data being written to that pool anymore. For the same reason, ZIL data no longer causes any head movements of the read/write disk heads of that pool. Both result in a speed increase, especially when the I/O traffic of that pool is very high. More importantly, a SLOG usually has a much higher I/O speed* and lower latency than the pool that it is coupled with; this represents the main speed gain.
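
As a rough back-of-the-envelope illustration of "written twice" versus "written once", here is a small Python sketch. The function `pool_write_ops` is hypothetical, and it assumes one log write plus one final write per synchronous write, ignoring any merging of writes inside a transaction group. It only counts write operations that land on the pool disks.

```python
def pool_write_ops(sync_writes, has_slog):
    """Count write operations hitting the pool disks (toy model).

    Assumption: every synchronous write produces one log write and one
    final write; real ZFS batches and merges work, so this is only a bound.
    """
    log_writes_on_pool = 0 if has_slog else sync_writes
    final_writes_on_pool = sync_writes
    return log_writes_on_pool + final_writes_on_pool


without_slog = pool_write_ops(1000, has_slog=False)
with_slog = pool_write_ops(1000, has_slog=True)
print(without_slog, with_slog)
```

In this model the SLOG does not remove any work overall; it moves the log writes to a different device, so the pool disks only see the final writes.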

"[...] so why don't use the pool directly if it's the same write speed as ZIL?" Not using the ZIL mechanism means not using a transient but persistent mechanism for data storage. Not using that transient mechanism means you'd have to resort to writing data from RAM and incorporating it as part of the ZFS data structures of a pool in one go before ZFS could assume that the data that have been written to disk are safe and in a persistent state. Updating the ZFS data structures on the disks of a pool includes adding new subtrees of data for every new or changed ZFS block unit (ZFS is a COW filesystem): that takes a lot of time. It takes a lot less time to write the transient data in the form of ZIL blocks (LWBs), directly from the ZIL in memory (=RAM) in a very straightforward way to the pool.

___
Edit: footnote changed
* Note that only the write speed and latency of a SLOG are really important, because these are time critical for writes that are caused by synchronous events. These synchronous events generate blocking writes and return to the caller only after the ZIL blocks (LWBs) have been written safely to disk. In case of a power loss or a system crash that deletes or invalidates data in volatile memory (=RAM), the ZIL data on disk (be it on the disks of a pool or on that pool's companion SLOG) is read and its log records of intended transactions are replayed.
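
From the application side, a "synchronous event" is typically something like an `fsync()` call. The following Python sketch shows such a blocking write using only standard library calls; the helper `durable_append` and the file name are made up for illustration. On a ZFS dataset with default sync settings, this is the kind of request that the ZIL services, but the code itself is file-system agnostic.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload and block until the OS reports it is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)  # small payloads; partial writes are not handled here
        os.fsync(fd)           # returns only after the data has been flushed
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.bin")
    durable_append(path, b"record-1\n")
    durable_append(path, b"record-2\n")
    with open(path, "rb") as f:
        contents = f.read()

print(contents)
```

The time spent inside `os.fsync` is the latency the caller actually waits for, which is why the write latency of the log device matters so much.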
 
Thank you for the answer, Alain De Vos, but I have read the forum post and it doesn't answer any of my initial questions.
In the forum link I can read that what is written to the ZIL is "metadata" information, not the data to be written to the pool. Is that correct?
In any case, if the ZIL is written to the pool disks, why not write the data directly to the pool disks?
Because then we would get many small writes and probably increase fragmentation.

For synchronous writes, speed is everything: we must get the data to disk immediately, so that the application can continue. But if we wrote every little chunk right to its final destination, we would 1) need to keep track of all these destinations all the time, 2) often write chunks much smaller than the blocksize, and 3) produce lots of disk seeks.

So instead we write some sufficient data to the ZIL immediately. This is a sequential file object, there is no allocation issue, no file-tree hierarchy and not so much additional seek. Then we collect the data for a while (specified by txg.timeout) to build a transaction group, and then finally push out that whole txg in one go. Some data may have been overwritten in the meantime, and now we don't need to write it multiple times. Some data may have accumulated to bigger chunks, and we can allocate these in one piece. And for mechanical disks we can line up the seeks in an optimal way.
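
The point about overwritten data can be shown with a tiny Python sketch. This is a toy model, not ZFS code: the function `txg_flush` and the block numbers are hypothetical. Every incoming write costs one log record, but when the group is finally flushed, only the latest version of each block has to be written.

```python
def txg_flush(writes):
    """Collapse a stream of (block_id, data) writes to the final state per block."""
    latest = {}
    for block_id, data in writes:
        latest[block_id] = data  # a later write to the same block replaces the earlier one
    return latest


# Four synchronous writes arrive while one transaction group is open.
writes = [(7, "v1"), (7, "v2"), (8, "x"), (7, "v3")]
flushed = txg_flush(writes)
print(len(writes), "log records,", len(flushed), "final block writes")
```

Block 7 was written three times by the application but reaches its final location only once.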

"logs the intended transaction quickly to disk", but it's not quick: it is the same speed as writing to the pool directly, because the ZIL is part of the pool disks if you don't use an external dedicated disk for the SLOG.
Not really. When a file gets changed, the directory entry must also be rewritten (to update the mtime and size). Since we have copy-on-write, we must in fact write that directory to a new location. That means the index pointer to that location must also be rewritten, and so on up to the uberblock. Writing to a file in ZFS involves a bit more than just writing to the file.
The ZIL, in contrast, is just a sequential journal. Append the stuff to the end of it, and that's it.
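
A small Python sketch of that difference follows. The hierarchy in `parent` is a deliberately simplified, hypothetical chain and not the real ZFS on-disk layout; the function `cow_blocks_rewritten` is made up for illustration. It counts how many blocks need a new copy when one leaf changes under copy-on-write, compared with a single journal append.

```python
def cow_blocks_rewritten(parent, leaf):
    """Walk from a changed leaf up to the root; each node on the path gets a new copy."""
    count = 0
    node = leaf
    while node is not None:
        count += 1
        node = parent.get(node)
    return count


# Simplified, hypothetical chain of pointers from a data block up to the root.
parent = {
    "file data block": "file metadata",
    "file metadata": "directory",
    "directory": "dataset root",
    "dataset root": "uberblock",
    "uberblock": None,
}

cow_cost = cow_blocks_rewritten(parent, "file data block")
log_cost = 1  # one record appended to the end of the journal
print(cow_cost, log_cost)
```

In this toy chain a single changed data block means five blocks to write at their final locations, versus one append to the journal.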

You have a valid point nevertheless: writing to a ZIL within the pool adds traffic to the pool, and it usually means seek activity between the current read position in the regular data and the write position at the end of the ZIL. This makes things slower when we have significant synchronous write activity. But if we wrote to the final destination immediately, it would be worse.
 
OK, thank you all. I think I now have a more accurate view of how ZFS synchronous writes work. The main difference between the ZIL and the pool writing process is that the ZIL is faster because it is in a fixed disk position, and ZFS doesn't have to seek for free space or any other kind of information to allocate the new data; it only writes to a fixed disk area without worrying about position, size or any previous data.
 