ZFS: using the same SSD for log and cache


Postby AndyUKG » 11 Jan 2011, 17:42

Hi,

It's come up quite a few times on the forum: using a pair of SSDs to provide both log and cache functions in a ZFS pool. I was wondering if anyone has actual experience of doing this and can comment on whether it worked OK? I can't see any issues with doing it, but it's always good to hear from those with experience!

thanks Andy.
AndyUKG
Member
 
Posts: 418
Joined: 13 Apr 2010, 14:17

Postby piggy » 13 Jan 2011, 20:33

AndyUKG wrote:Hi,

it's come up quite a few times on the forum: using a pair of SSDs to provide both log and cache functions in a ZFS pool. I was wondering if anyone has actual experience of doing this and can comment on whether it worked OK? I can't see any issues with doing it, but it's always good to hear from those with experience!

And then in like six months you can put your SSDs in the trash can. SSDs are not made for that.
piggy
Member
 
Posts: 167
Joined: 22 May 2010, 22:27

Postby graudeejs » 13 Jan 2011, 20:40

AFAIK, right now the problem is detaching an SSD from a ZFS pool if the SSD is used as a cache or log device (correct me if I'm wrong).
AFAIK, this might (and probably will) change in FreeBSD-9
graudeejs
Style(9) Addict
 
Posts: 4591
Joined: 16 Nov 2008, 23:23
Location: Riga, Latvia

Postby phoenix » 13 Jan 2011, 21:24

ZFSv18 and older cannot import a pool with a failed log vdev. Thus, all log vdevs must be created as mirrors: [cmd=#]zpool add poolname log mirror disk1 disk2[/cmd]

ZFSv19 and newer can import pools where the log vdev has failed, and can remove log devices from the pool. Some data may be lost if the data in the log has not yet been written out to the pool.
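As a sketch, the mirrored-log setup and the later removal look roughly like this (pool and device names are placeholders, and the removal only works on ZFSv19 or newer):

```shell
# Add a mirrored log vdev to a hypothetical pool "tank".
zpool add tank log mirror /dev/ada1 /dev/ada2

# On ZFSv19+ the log vdev can be removed again;
# the vdev name ("mirror-1" here) comes from `zpool status tank`.
zpool remove tank mirror-1
```

Check [cmd=#]zpool status tank[/cmd] afterwards to confirm the layout.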

FreeBSD 7.3 and 8.1 include ZFSv14.

FreeBSD 8.2 will include ZFSv15.

There are patches available for 8-STABLE and 9-CURRENT for ZFSv28.

Thus, if you aren't running 8-STABLE or 9-CURRENT with the ZFSv28 patches, you really should not add a single log device to your pool.
Freddie

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.
phoenix
MFC'd
 
Posts: 3349
Joined: 17 Nov 2008, 05:43
Location: Kamloops, BC, Canada

Postby phoenix » 13 Jan 2011, 21:28

To the OP: there's nothing inherently wrong with using the same SSD for both log and cache devices. log devices rarely need to be larger than 8 GB, and it's near impossible to find an SSD that small these days.

However, the usage patterns for cache and log devices are very different, and using the wrong kind of SSD for a log device can have adverse effects. Ideally, you'd use an SLC-based SSD (Intel X25-E, for example) as those are optimised for writes, and all log I/O is writes.

Unless you are exporting lots of filesystems via NFS, you probably don't even need a separate log device.

Cache devices need to be optimised for high read speeds, as the pool throttles writes to cache devices to 7 MBps. However, if you are doing lots of reads from the cache, you may impact the write performance on the log side of things if using the same SSD for both.

It all depends on your workload.

Try it without a separate log and see if you are write-limited. Try it with a single cache device. Try it with a cache and a log.
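A hedged sketch of that experiment, with placeholder pool and device names:

```shell
# Baseline: watch per-vdev throughput with no separate log or cache.
zpool iostat -v tank 5

# Add a cache (L2ARC) device and re-measure read-heavy workloads.
zpool add tank cache /dev/ada3

# Then add a log device as well and compare sync-write latency.
zpool add tank log /dev/ada4
```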
Freddie

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.
phoenix
MFC'd
 
Posts: 3349
Joined: 17 Nov 2008, 05:43
Location: Kamloops, BC, Canada

Postby AndyUKG » 14 Jan 2011, 12:14

Thanks for the comments. I don't have any SSD devices currently so can't do any testing.
Re phoenix: interesting comment about the Intel X25-E. Are you suggesting there is some fundamental difference between the Intel X25-E and Intel X25-M devices? Is it not valid to simply compare the tech specs of the devices and choose whichever is appropriate for your applications (based on I/O and bandwidth specs)?

ta Andy.

PS: The original question was: has anyone actually tried it? ;)
AndyUKG
Member
 
Posts: 418
Joined: 13 Apr 2010, 14:17

Postby sub_mesa » 14 Jan 2011, 14:52

What advantage does the X25-E have over the MLC X25-M? Yes, I know it's 50nm SLC; is that good?

We see fewer writes per cell as NAND gets smaller (25nm) - but strangely, total write endurance has increased instead! This is because, with fewer write cycles per cell, having a lot more cells can compensate. Also, the write amplification of newer SSDs is lower, which further improves write endurance.

In short, you should look at an SSD and say "this SSD has 20TB write endurance": after you have written 20TB to it, it will fail gracefully. This value usually includes write amplification and is determined using 4K random writes, not sequential writes. In other words, I don't really understand the urge for people to prefer SLC over MLC. The key to NAND performance is an intelligent controller, and write endurance is something you can control easily: simply add more SSDs for more write cycles. Doubling the number of SSDs would also double the write endurance, assuming you can spread the writes decently across all of them.

For a SLOG/ZIL device you would want:
- pool version 19, as discussed above
- an SSD with supercapacitor (safe writes: Intel G3, Sandforce SF2000, Marvell C400)
- an SSD with good sequential write abilities

Neither the X25-E nor the X25-M has a supercapacitor, but the upcoming SSDs should: the Intel G3, Sandforce SF2000 and Marvell C400 should each have a supercap, meaning they won't corrupt data on sudden power loss, as all current SSDs and NAND-based products do; this is why those are not inherently reliable. The freaky thing is that SSDs can kill old data as well, while on a HDD a power loss could only wipe the writes in the write buffer; anything already on the disk won't be harmed and will stay intact. SSDs are different, and continue writing after they lose power (when the voltage drops below the normal level), even in places where they shouldn't be writing at all. If this happens in a reserved spot where the SSD saves its HPA mapping table (the difference between logical LBA and physical LBA), you could kill/brick the entire SSD.

The L2ARC, by contrast, is completely safe. Even if your SSD corrupts data, this would never put your pool's data at risk. The L2ARC works with checksums, so corruption on the SSD would be detected and ZFS would query the real storage instead of the L2ARC. So the L2ARC is safe and can be used at any time.

As far as I understand, the performance profiles are:
- L2ARC: multi-queue random reads (benefits greatly from NCQ/AHCI; up to 10 times lower performance in IDE mode without NCQ)
- SLOG: 100% sequential writes; I'm not sure whether compression helps here, so I'm assuming Sandforce SSDs might be at a disadvantage.

And yes, I tried what you want, though only for testing; I don't use a SLOG at the moment. You can just partition your SSD and feed the partitions to ZFS for different functions. The SSD won't mind having two different duties; unlike a HDD, it has no seek penalty for switching between different I/O streams. That's why you got an SSD!
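On FreeBSD that partitioning could look roughly like this (device names, GPT labels and sizes are placeholders, not a tested recipe):

```shell
# Split one SSD into a small log partition and a larger cache partition.
gpart create -s gpt ada1
gpart add -t freebsd-zfs -l slog -s 8G ada1
gpart add -t freebsd-zfs -l l2arc ada1

# Give the partitions to the pool as separate log and cache devices.
zpool add tank log gpt/slog
zpool add tank cache gpt/l2arc
```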
Have a look at my ZFS-specific NAS distribution called ZFSguru, website at: http://zfsguru.com
sub_mesa
Junior Member
 
Posts: 97
Joined: 23 Aug 2009, 17:23

Postby danbi » 17 Jan 2011, 07:37

sub_mesa wrote:In other words, i don't really understand the urge for people to prefer SLC over MLC.


Perhaps because SLC has over 10 times the write endurance of MLC? It is typically faster, etc.
SLC memory also does not have the intrinsic MLC write failure when power is lost.

The key to NAND performance is an intelligent controller; and write endurance is something you can control easily; simply add more SSDs for more write cycles.


In fact, with current controller technology, you are safe if you never use the entire flash drive. This leaves space for the internal controller to remap flash blocks.

The freaky thing is: SSDs can kill old data as well, while HDD a power loss could only wipe the writes in the write buffer; anything already on the disk won't be harmed and will stay intact.


Unless hard disk technology has suddenly changed overnight, this has never been true! Hard disks, being mechanical devices, carry a high risk of writing anything anywhere over the platters when power fails. Sure, many enterprise-class drives spend a great deal of effort (and cost) to reduce the chance of this happening -- usually using built-in capacitors and additional electronics/mechanics. I have observed many cases of HDDs crashing their heads at power loss, sometimes physically scratching the platter surfaces, etc.

SSDs are different, and continue writing when they lost power (voltage drops below normal level), even at places where they shouldn't be writing at all.


This is true for all poorly designed, 'low cost' garbage. It has nothing to do with SSDs in particular, except that there are lots of junk flash products on the market -- but those are also designed for junk uses anyway.
It is also more of a problem with MLC than with SLC. But then... there aren't many junk SLC drives anyway :)

If this happens in a reserved spot where the SSD saves its HPA mapping table (difference between logical LBA and physical LBA) then you could kill/brick your entire SSD.


This is where it typically happens. Being a non-mechanical device, it is very hard for a flash controller to decide to write to a different place. But it is entirely possible for rewrites of the mapping tables to be insufficiently robust if the device has lost power and could not verify that it can read them back.

In any case, such 'damage' should be easily undone, because all that happened is that the mapping table got corrupted. Simply clearing the table is enough to resurrect the device -- of course, most junk flash devices do not provide you with such tools, because you are expected to just go and buy the next piece of electronic garbage.

Pity the price difference between the junk devices and quality devices is so great; otherwise, we would not be discussing this at all. When a HDD bricks itself within a month of use, we know it was bad (model, design, sample, manufacturer). When the same happens with an SSD, people say it's the technology...

Otherwise I do agree that using whatever for L2ARC is safe, as long as it has sufficient random read capacity. :)

PS: You might find this article http://www.storagesearch.com/ssd-slc-mlc-notes.html interesting to read.
danbi
Member
 
Posts: 227
Joined: 25 Apr 2010, 09:32
Location: Varna, Bulgaria

Postby sub_mesa » 17 Jan 2011, 16:40

danbi wrote:Perhaps, because SLC has over 10 times the write endurance of MLC? Is typically faster etc.

Is that really true? We are told to believe that, and it might have been true before advanced NAND controllers came out...

But times have changed. Today, write amplification is key to write endurance, and thus the controller matters much more than the raw write cycles per cell. Write endurance can be expressed with this formula:

Code:
Write Endurance = (total NAND cells * write cycles per cell) / write amplification factor
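
Plugging illustrative round numbers into that formula (these are not datasheet figures: 32 GiB of NAND, 100k cycles per cell, write amplification factor 1.1) gives a feel for the scale, here via awk:

```shell
# Endurance in TiB = (capacity_GiB * cycles_per_cell / amplification) / 1024
awk 'BEGIN { printf "%.0f TiB\n", 32 * 100000 / 1.1 / 1024 }'
# prints "2841 TiB", i.e. roughly 2.8 PiB before real-world overheads
```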


So let's compare SLC versus MLC:

Intel X25-E 50nm SLC
write cycles per cell: 100,000 (100k)
4K random write endurance: 1.0 - 2.0 PiB
Write endurance per dollar: 32GB = 1.0PiB per ~450 dollars = 2.275 TiB per dollar
Capacity per dollar: 0.071 GiB per dollar

Intel X25-E 25nm e-MLC
write cycles per cell: 25,000 (25k)
4K random write endurance: 1.0 - 2.0 PiB
Write endurance per dollar: 100GB = 1.0PiB per ~350 dollars = 2.926 TiB per dollar
Capacity per dollar: 0.286 GiB per dollar (up to 1.0 GiB per dollar expected for 25nm-generation MLC)

Ouch! A huge improvement, and no real reason to prefer Intel's older 50nm SLC version. Now of course, there is 34nm and perhaps even 25nm SLC available as well; both the Micron C400 and Sandforce SF2000 are still compatible with SLC memory. But we can also see MLC taking over SLC's traditional role for high write endurance. The key here lies not in raw write cycles per cell but in reducing write amplification and the wear-leveling factor while increasing the total number of cells, which combined give you the write endurance in (peta)bytes.

SLC memory also does not have the intrinsic MLC write failure when power is lost.

Per the X25-E datasheet, the hold-up time is 0.01 seconds for serious voltage drops; after that you can expect corruption to occur.

In fact, with current controller technology, you are safe if you never use the entire flash drive. This leaves space for the internal controller to remap flash blocks.

You are confusing spare space with power loss and an inconsistent HPA mapping table; the two have nothing to do with each other. Spare space is used for performance reasons, not to prevent corruption. In fact, the remapping that NAND controllers do only amplifies the whole 'unsafe writes' problem in the first place, since now, if power is lost, we may also be left with an outdated and inconsistent HPA mapping table.

Unless, hard disk technology has suddenly changed overnight, this has never been true! Hard disks, being mechanical devices assume high risk of writing wherever and whatever over the platters, when power fails.

Do you happen to have a link for that? When power fails, the HDD suddenly starts to write, even if no write requests were issued to that spot? This would be new information to me, and I think to a lot of other people as well.

I think you may be confusing failure (head crash) with data inconsistency after power loss.

This is true for all poorly designed, 'low cost' garbage.

I believe that every SSD that doesn't have a supercapacitor would be at risk of corruption.

In any case, such 'damage' should be easily undone, because all that happened is the mapping table got corrupted. Simply clearing the table is enough to resurrect that device

If you are still able to -- and either way, that means the total or at least partial destruction of all data on the SSD. Hence my argument that all current-generation SSDs exhibit corruption and are therefore not suitable as SLOG/ZIL drives.
Have a look at my ZFS-specific NAS distribution called ZFSguru, website at: http://zfsguru.com
sub_mesa
Junior Member
 
Posts: 97
Joined: 23 Aug 2009, 17:23

Postby phoenix » 17 Jan 2011, 20:22

danbi wrote:Perhaps, because SLC has over 10 times the write endurance of MLC? Is typically faster etc.
SLC memory also does not have the intrinsic MLC write failure when power is lost.


SLC is optimised for write speed. It's something like 10x (or higher) faster to write a single SLC cell than it is to write an MLC cell. For write-heavy workloads, like a ZFS log vdev, this makes a *HUGE* difference. Read speeds for SLC aren't that great, though, in comparison to the write speed.

MLC is optimised for read speed. For read-heavy workloads, like a ZFS cache vdev, this makes a huge difference.

IOW, each is optimised for a different workload, and you should use the one that suits your workload.

You need to step out of the "desktop" and "SOHO server" mindset, and into the "enterprise" mindset to really grasp this.

For a system with 5-10 drives in it, servicing 10-20 computers on the network, you can use pretty much any hard drive, any SSD, any CPU, any RAM, and build a ZFS system that works just fine and is plenty fast enough.

However, move up to the massive storage boxes with 24+ drives, maybe with external drive enclosures, servicing 100s of network clients via NFS and iSCSI, and you need to pick your parts carefully. Atom, Celeron, and Pentium-D CPUs aren't going to cut it. 4 GB of RAM isn't enough. MLC SSDs won't work as log vdevs, and you definitely need a log vdev in this setup. 5900 RPM drives won't cut it.
Freddie

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.
phoenix
MFC'd
 
Posts: 3349
Joined: 17 Nov 2008, 05:43
Location: Kamloops, BC, Canada

Postby sub_mesa » 18 Jan 2011, 00:34

A lot of the potential performance that MLC/SLC provides is lost in the ONFI interface. So I think we should look at performance from the point of view of the host system, rather than at internal performance.

But this document does describe a basic performance comparison of SLC versus MLC, rating writes at 7MB/s per channel for MLC and 17MB/s per channel for SLC. If you translate that to the 10 channels of an Intel SSD, these numbers are quite accurate.

But I would note that the upcoming Intel G3 (with a supercapacitor, so safe for SLOG use) has a rated sequential write speed of 170MB/s as well. The X25-E G3 25nm using e-MLC is just a tad faster at 200MB/s. So my advice is to just use conventional Intel G3s, striping them if you want to. That should get you the most sequential-write MB/s per dollar, with safe writes, so no more corruption and sudden death. I expect these to be popular for the ZFS SLOG function.
Have a look at my ZFS-specific NAS distribution called ZFSguru, website at: http://zfsguru.com
sub_mesa
Junior Member
 
Posts: 97
Joined: 23 Aug 2009, 17:23

Postby danbi » 18 Jan 2011, 12:17

To what phoenix mentioned, I would like to add that you can build a "write-optimized SSD" with MLC and a "read-optimized SSD" with SLC -- it depends on how you organize the I/O paths and priorities internally. Read/write performance is not the reason you are unlikely to see SLC in commodity USB tokens or MLC in enterprise products.
(Although greed is what moves this civilization.)

There are flash devices that contain twice or more the flash storage of their advertised capacity. This allows the device to write at 'full speed', because another thread in the controller erases blocks you have just overwritten, readying them for the next writes.

There are all sorts of 'magic' that can be done with semiconductors, but one thing is sure: we cannot (yet) override the laws of physics, which basically say that an SLC cell stores one voltage value (1 bit), while an MLC cell stores more states (two or three bits). In fact, the state in MLC flash is complex and much less stable (in the physical, or rather electrical, sense). For the same technological base, SLC flash will always be better than MLC flash in every aspect but storage density.

Back to the topic.

The SLOG is "write always, read only on recovery". From a performance perspective, this means you need write-optimized storage. As mentioned many times before, it does not have to be an SSD. Writes are pretty much sequential, so this does not rule out HDDs. The volume of data written to the SLOG is insignificant; ZFS v23 reduces the ZIL even further. The SLOG usually does not need to be larger than a few gigabytes. Only synchronous writes go through the SLOG -- it does not mean you need 900MB/s of write to the SLOG if you want to write 900MB/s to the pool. It only has to be separate from the main storage pool. This is in order not to waste IOPS in the main pool, not to move heads, and not to have to free the in-pool ZIL records (which also happen to be variable-size, leading to fragmentation). The only requirement for the SLOG is to survive a server crash while preserving what was supposedly written there. Most such crashes happen at power failure -- so it must be able to survive power failure, no matter what.

The best candidate for an SLOG is battery backed RAM. Without any doubt.

The next-best candidates, in no particular order, are a small enterprise-grade low-latency disk drive, SLC flash, or MLC flash. Of course, flash should have capacitor or battery backup. Latency is a significant factor for the SLOG (and for the L2ARC, by the way, after read performance). Write latency is not something any flash device can be proud of (but, as noted above, there are exceptions among purpose-built devices).

To the original question: is it wise to mix SLOG and L2ARC on the same device?
If the device is of a sufficiently recent generation, with sufficiently many write/read IOPS, perhaps yes. It all depends on the application. The write operations will 'choke' most cheap flash drives, leaving nothing for the read portion, so the L2ARC will suffer. Or the large writes the L2ARC does will impact the SLOG response time. This will only happen after a certain threshold is reached -- that is, when you reach a sufficient number of sync operations, from database activity, file/directory creation, etc. Unfortunately, NFS is one application where everything is a sync operation when you write from the client. In most installations this may never happen. It's best to experiment -- observe the drive load with gstat etc.
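For that experiment, one hedged way to watch a shared SSD under load (the device name is a placeholder):

```shell
# Show per-provider I/O load, filtered to the SSD and its partitions;
# sustained %busy near 100 suggests the SLOG and L2ARC are contending.
gstat -f ada1
```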

PS: sub_mesa, I do not have a pointer handy about HDDs writing garbage when power is lost. I speak from experience and a shelf full of dead drives :) But Google is our friend :) Enterprise drives take special measures against this -- providing large capacitors and special mechanics with the sole purpose of lifting the heads away from the surface should power to the drive be lost or fluctuate, to prevent the heads from emitting random garbage onto the platters. This is typically not the case with cheap desktop drives, although most do take some measures, even if not that aggressive. There are lots of stories in drive handling, such as low-level (without today's quotes) formatting the drive to create new sector marks etc... after the old marks were lost... somehow :) But luckily, this is more or less history.
danbi
Member
 
Posts: 227
Joined: 25 Apr 2010, 09:32
Location: Varna, Bulgaria

Postby atwinix » 09 Feb 2011, 04:05

Short answer: yes. I have a Kingston SSDNow V+ 100 64GB, which has been partitioned to contain the SLOG, the L2ARC and [FILE]/var/log[/FILE], to allow my mechanical drives to spin down when idle for long periods. You can find my thread somewhere around here - I did the same research before I implemented mine, and it's been almost 3 months now without any problems. ;)

Cheers.
atwinix
Junior Member
 
Posts: 65
Joined: 26 Aug 2009, 09:32
Location: Melbourne, Australia

Postby stassik » 08 Sep 2011, 14:37

phoenix wrote:ZFSv18 and older cannot import a pool with a failed log vdev. Thus, all log vdevs must be created as mirrors: [cmd=#]zpool add poolname log mirror disk1 disk2[/cmd]

ZFSv19 and newer can import pools where the log vdev has failed, and can remove log devices from the pool. Some data may be lost if the data in the log has not yet been written out to the pool.


This means that if I use mfsbsd-se-8.2-zfsv28-amd64.iso, for example, I can keep the ZIL on a RAM disk and not worry about losing the ZFS pool?
stassik
Junior Member
 
Posts: 13
Joined: 07 Sep 2011, 16:35

Postby phoenix » 08 Sep 2011, 14:50

Correct. You run the risk of losing any data in the ZIL if your system crashes or loses power, though.
Freddie

Help for FreeBSD: Handbook, FAQ, man pages, mailing lists.
phoenix
MFC'd
 
Posts: 3349
Joined: 17 Nov 2008, 05:43
Location: Kamloops, BC, Canada

Postby Goose997 » 08 Sep 2011, 14:52

phoenix wrote:Unless you are exporting lots of filesystems via NFS, you probably don't even need a separate log device.

It all depends on your workload.

Try it without a separate log and see if you are write-limited. Try it with a single cache device. Try it with a cache and a log.


Can someone explain to me the difference between exporting file systems via Samba versus NFS? I have read somewhere that having a log device makes no difference with Samba shares - or am I wrong?

Thanks
Malan
Goose997
Junior Member
 
Posts: 48
Joined: 23 Jul 2011, 13:19
Location: Dubai, UAE

Postby peetaur » 13 Sep 2011, 15:02

I tried this. I believe I need a ZIL because I am exporting a large file system via NFS. Additionally, I didn't want my root mirror to use up 2 more hard disk bays, so I got SSDs big enough for all 3 (root, cache, and log).

I ran into a serious problem where the ZFS boot loader would try to load the wrong pool. I would like to help you avoid that situation, so here is my warning: to prevent it, the root slice needs to come before any other ZFS slices on the disk. Here is the bug report I posted: http://www.freebsd.org/cgi/query-pr.cgi?pr=160706

And of course I realise there is a performance penalty to also adding root on there. But it should be very small if the OS is caching properly.


And Malan, I'm not sure about all the differences, but the problem with NFS is synchronous writes. If you can't prevent or mitigate their impact, and if CIFS (Samba) doesn't do the same thing, then Samba may be faster, but it also might mean you are more likely to lose data when there is a failure (such as a power outage).

This kernel parameter change is supposed to reduce the performance problem with synchronous writes, but I don't know much about it (such as its side effects, or whether it improves performance at all). I've also seen posts with NFS driver source code patches, perhaps implying that the setting below doesn't work.

Code:
cat >> /etc/sysctl.conf << EOF
vfs.nfsrv.async=1
EOF


[CMD="#"]sysctl -w vfs.nfsrv.async=1[/CMD]
peetaur
Member
 
Posts: 167
Joined: 02 Sep 2011, 11:02
Location: Geesthacht, Germany

