Implementing HSM for ZFS

I'm thinking about building a server that uses an SSD vdev as a write cache, dumping writes there before moving them to a much larger HDD vdev for long-term storage.

For implementing the HSM, wouldn't it basically be: take a ZFS snapshot of the SSD vdev, move the data to the HDD vdev, take a snapshot of the HDD vdev, and run zfs diff? If they don't match, move the non-matching parts from the SSDs to the HDDs again.
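A minimal sketch of that snapshot-and-send flow, assuming pools named `fast` (SSD) and `slow` (HDD) and a dataset called `backup` (all names are placeholders). Note two things: `zfs send | zfs receive` checksums the stream end to end, so a separate verification pass is largely redundant, and `zfs diff` compares two snapshots within one dataset's history, not datasets across different pools.

```shell
# Placeholder names: fast/backup (SSD tier), slow/backup (HDD tier).
# Snapshot the SSD dataset and replicate it to the HDD pool (-u: don't mount).
zfs snapshot fast/backup@migrate-1
zfs send fast/backup@migrate-1 | zfs receive -u slow/backup

# Later migrations only ship the delta between the last two snapshots.
zfs snapshot fast/backup@migrate-2
zfs send -i @migrate-1 fast/backup@migrate-2 | zfs receive -u slow/backup

# zfs diff works within one dataset's snapshot history, e.g.:
zfs diff fast/backup@migrate-1 fast/backup@migrate-2
```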

Years ago, I was reading Mike Acton, and these C++ devs told me to use managed pointers. But Mike Acton gave a lecture about C++ devs writing slow code. "You just need a char*. Throw the STL out," he said. So I spent three months debugging my char* class until it could ingest an entire XML file and I could manipulate it with other code without bugs. It basically called malloc once at the start of the program, read the entire file in, used memmove with start_block + (end_block - start_block) +/- 1, and deallocated before the program closed.

Shouldn't all of the data transfers be sequential reads and writes, and couldn't my existing string class potentially work for correcting differences between snapshots? Maybe I'd have to add some things to make it work with ZFS, but I believe I already have most of the pointer math and bit twiddling needed to correct errors between snapshots. If there's data corruption from sequential writes, won't it look like { good_block } { bad_block } { good_block }, so that I just do some pointer math between both good_blocks to overwrite the bad_block?
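For what it's worth, ZFS already detects and repairs bad blocks itself using checksums and redundancy (scrub/resilver), so a hand-rolled block-patching class shouldn't be needed there. But the offset arithmetic described above can be illustrated at the file level with `dd`, assuming fixed-size blocks and a known-good copy (file names and block size here are made up for the demo):

```shell
# Hypothetical layout: 3 blocks of 512 bytes; block 1 (the middle one) goes bad.
BS=512
printf 'A%.0s' $(seq $BS) >  good.img
printf 'B%.0s' $(seq $BS) >> good.img
printf 'C%.0s' $(seq $BS) >> good.img
cp good.img damaged.img
# Corrupt the middle block (seek=1 in units of bs = byte offset 512).
printf 'X%.0s' $(seq $BS) | dd of=damaged.img bs=$BS seek=1 conv=notrunc status=none
# "Pointer math" repair: copy block 1 from the known-good copy over the bad one.
dd if=good.img of=damaged.img bs=$BS skip=1 seek=1 count=1 conv=notrunc status=none
cmp -s good.img damaged.img && echo "repaired"
```

The skip/seek arithmetic is exactly the start_block/end_block math from the char* class, just expressed in block units; the point is that ZFS does this bookkeeping for you.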
 
I'm thinking about building a server that uses an SSD vdev as a write cache, dumping writes there before moving them to a much larger HDD vdev for long-term storage.
I'm not sure what you mean by HSM, but I believe what you're describing here is basically a zpool on the HDD vdev using an SSD as a SLOG device for that zpool.
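For reference, a SLOG is added like this (pool name `tank` and device paths are placeholders); note it only accelerates synchronous writes, such as NFS or database traffic, not asynchronous bulk copies:

```shell
# Attach a mirrored SLOG to an existing pool; helps only synchronous writes.
zpool add tank log mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1
zpool status tank
```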
 
I'm not sure what you mean by HSM, but I believe what you're describing here is basically a zpool on the HDD vdev using an SSD as a SLOG device for that zpool.
I mean Hierarchical Storage Management: moving data from a much smaller but extremely fast SSD zpool to a much larger but substantially slower HDD zpool. I don't think the SSD zpool will need a SLOG or any of that, though maybe. Either way, it should accept writes much faster than an HDD zpool, even one with special vdevs and the like.

I'm trying to see whether, with a 10 TB SSD write-cache zpool on the server, I can write from my desktop's SSD to the server as fast as I can read from it, minus a bit for overhead. Then below it I'd have a 400 TB zpool of HDDs for long-term storage.
 
Several questions. First, the most important one: what are you trying to accomplish? What is your goal, and what are your requirements? You are asking an XY question: "I want to do X, I think Y is the solution, so how do I do Y?" We need to know why you want this, because there are many ways to do it, with different costs and benefits.

Are you interested in write speed? In that case, if all the data written to the SSD tier also eventually migrates to HDD (meaning you have long-lived data), all the SSDs are doing is acting as a shock absorber or write cache, allowing the system to quickly ride out spikes of write traffic and then smoothly spool them to the HDD tier. On the other hand, if your data is short-lived, you may be able to write it to SSD only, with most of it deleted before the migration to HDD happens. An extreme example of such a system is the constant writing of checkpoints in HPC: they get written continuously, are usually never read, and are deleted after a few minutes; they exist only to speed up restarts of crashed processes. But note that using the SSD tier as a write buffer for data that is usually not read back may cause flash wear-out problems.

On the other hand, you might be interested in read speed, and that is an application where SSDs shine. This is particularly true if you have an expectation that recently written data is also frequently read, while older data may be archival; the extreme example is compliance data that has to be kept online for many years, but the expectation is that it is never accessed.

If your issue is neither read speed nor write speed, why are you using SSDs at all? It's obvious that HDDs are much cheaper (and tape or MAID cheaper still, but neither is really viable for individual small-business or home users today). What is less obvious is that HDD is also cheaper when measuring bandwidth ($ per GB/s); the only reason SSDs make sense is random access (they do have good $ per IOPS). And this is where access patterns come in: if all your files or objects are large, and read and written sequentially with large IOs (or good read-ahead caching), then SSDs make no sense at all. At that point, the whole concept of HSM becomes questionable.
 
I'm trying to write as fast as my network will allow, and then push the data down toward long-term storage. I also want a read cache via ARC/L2ARC for specific datasets in long-term storage. I'll be benchmarking while building this system, and I'm using overprovisioned enterprise drives. I'm backing up multiple computers, and I want the reads and writes of my backup system to be fast. I'll probably be playing around with ARC and L2ARC settings per dataset.
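The per-dataset knobs for this are the `primarycache` (ARC) and `secondarycache` (L2ARC) properties, and L2ARC itself is attached as a `cache` vdev. A sketch, with pool/dataset/device names as placeholders:

```shell
# Add an L2ARC device to the HDD pool (names are placeholders).
zpool add tank cache /dev/disk/by-id/ssd2
# Cache both data and metadata for hot datasets...
zfs set primarycache=all secondarycache=all tank/hot
# ...and only metadata for write-once, rarely-read datasets.
zfs set primarycache=metadata secondarycache=none tank/archive
```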

The one issue I might run into is that the write-cache SSDs are going to be smaller than the long-term storage. I'll be using send/receive from the SSD pool to the HDD pool. I was told both the SSD and HDD pools need to be the same size and that the HDD pool has to be mounted read-only, but from looking through the documentation that doesn't sound exactly true. So, are there switches I can flip to change how zfs send/receive behaves? I'm just interested in having something like a 10 TB write cache, while having 400 TB of long-term storage to send my data to. I'll probably also be using the Borg 2.0 transfer command and/or other means of moving data from the cache to long-term storage, depending on the program. But send/receive sounds ideal for some workloads.
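As far as I can tell, the pools do not need to be the same size; what matters is that the receiving pool has enough free space for the data being sent, and `readonly=on` on the destination is an optional property, not a requirement. The commonly used switches, sketched with placeholder names:

```shell
# Incremental replication of everything under fast/backup, preserving
# properties (-R); -F rolls the destination back to its last common
# snapshot, -u leaves it unmounted. Snapshot names are placeholders.
zfs send -R -i @yesterday fast/backup@today | zfs receive -Fu slow/backup
# Marking the destination read-only is optional, not required:
zfs set readonly=on slow/backup
# After a verified receive, old snapshots on the small SSD pool can be pruned:
zfs destroy fast/backup@yesterday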

I'm probably going to add a tape drive as well, to keep copies of data while benchmarking different disk setups for my workload. I believe that's cheaper than using online storage to hold data while I'm reconfiguring my pools. My plan is to start with just a raidz3 or draid3 vdev and add other disks after benchmarking Borg and other programs writing to this system. I'll probably let it run for a day to a week to see whether I need anything else. The four-way-mirror SSD write cache I'll add last; I understand that might change how I need to configure the raidz3/draid3.
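As a starting point, the initial HDD pool described above might look like this (pool name, disk count, and device paths are placeholders; raidz3 needs at least four disks, three of them parity):

```shell
# One raidz3 vdev (triple parity) of eight placeholder disks.
zpool create tank raidz3 /dev/disk/by-id/hdd{0..7}
# dRAID alternative with default layout:
# zpool create tank draid3 /dev/disk/by-id/hdd{0..7}
```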

I'm trying to see if I can build a high-end HSM for less than current proprietary solutions. Most of them don't even list a price and only take custom contracts, so I assume they're fairly expensive.
 