Solved: New to ZFS and have questions for my NAS

Hello,
First off I would like to say thanks to all of you for being here; I have read a lot of great answers and threads on these forums.

I decided to use FreeBSD + ZFS for my NAS storage at home.
The setup consists of a server with a Supermicro MB, an Intel CPU supporting AES-NI and 16 GB of ECC RAM, with 4 x 3TB disks and 2 x 120GB SSDs.

I installed the latest 10.1-RELEASE on ZFS using the memstick image.
So the zroot pool is a raid-z1 setup on the 4 x 3TB disks like this:

Code:
root@nas:~ # zpool list
NAME  SIZE  ALLOC  FREE  FRAG  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
zroot  10.9T  1.83G  10.9T  0%  -  0%  1.00x  ONLINE  -

Code:
root@nas:~ # zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

  NAME          STATE   READ WRITE CKSUM
  zroot         ONLINE     0     0     0
    raidz1-0    ONLINE     0     0     0
      gpt/zfs0  ONLINE     0     0     0
      gpt/zfs1  ONLINE     0     0     0
      gpt/zfs2  ONLINE     0     0     0
      gpt/zfs3  ONLINE     0     0     0

errors: No known data errors

Code:
root@nas:~ # zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
zroot  1.33G  7.65T  140K  none
zroot/ROOT  486M  7.65T  140K  none
zroot/ROOT/default  486M  7.65T  486M  /
zroot/tmp  140K  7.65T  140K  /tmp
zroot/usr  875M  7.65T  140K  /usr
zroot/usr/home  140K  7.65T  140K  /usr/home
zroot/usr/ports  874M  7.65T  874M  /usr/ports
zroot/usr/src  140K  7.65T  140K  /usr/src
zroot/var  761K  7.65T  140K  /var
zroot/var/crash  140K  7.65T  140K  /var/crash
zroot/var/log  203K  7.65T  203K  /var/log
zroot/var/mail  140K  7.65T  140K  /var/mail
zroot/var/tmp  140K  7.65T  140K  /var/tmp

Why does it say only 7.65T of free space?
I would have expected something like: 12TB (4 x 3TB) - 3TB (raidz1 parity) - 2 GB of swap per disk - 46 GB per disk (ZFS using 1/64th of each disk?) - 2-10GB at most for FreeBSD = around 8.6-8.7TB of space. Why is it so much lower?

Here are also some more questions:
  • Is it worth installing FreeBSD on ZFS, or should I only use ZFS for the storage? (For example: installing FreeBSD on a USB stick, SD card, standalone SSD, etc.)
  • My SSDs are on SATA ports 0 and 1 and the other disks come after that. Would that mean ada0 and ada1 are my SSDs? I can't really confirm it, since zpool status says the disks are named zfs0, zfs1, zfs2 and zfs3. (See the command output below; please also note that I'm new to FreeBSD, so I don't really know how to check this properly.)
Code:
root@nas:~ # egrep 'ad[0-9]|cd[0-9]' /var/run/dmesg.boot
ada0: Previously was known as ad4
ada1: Previously was known as ad6
ada2: Previously was known as ad8
ada3: Previously was known as ad10
ada4: Previously was known as ad12
ada5: Previously was known as ad14

I would also really like your input on adding the 2 x SSDs as L2ARC cache and how that would improve my performance.
I have read that some data can become corrupt if you run only one SSD and it crashes, which is why I decided to go with two SSDs. Is this correct, and how would I add them so they are redundant? Would it be the same as running "zpool add zroot cache ada0", if that is the correct syntax for adding the 1st SSD?

Thanks for your time
Best regards
 
With 3TB disks you only get about 2.72TB of usable formatted space. Therefore, for your setup:

4 x 2.72TB = 10.88TB

Then take away the RAID Z1 space:

10.88TB - 2.72TB = 8.16TB of usable space
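
The 2.72TB figure is just the vendor's decimal 3TB expressed in binary units; a quick bc check of the conversion:
Code:
$ bc
scale=2
3*(1000^4)/(1024^4)
2.72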

Not sure where the rest went but it must be OS and swap files? Maybe snapshots?

Is it worth installing FreeBSD on ZFS, or should I only use ZFS for the storage? (For example: installing FreeBSD on a USB stick, SD card, standalone SSD, etc.)

I use ZFS for everything. I have a bootable USB drive with ZFS on it that I use to boot my server, and my mirrored root SSD drives have ZFS on them. My backup SATA drive also has ZFS on it. All my drives are encrypted using GELI except the bootable USB drive, and it all works great! I've never run L2ARC so I can't comment on that, I'm afraid.

What Supermicro board do you have? I have the Supermicro X10SL7-F and it is SUPERB with FreeBSD. So stable and silly fast with SSD drives! Loving the IPMI too.
 
Hello xy16644,

Thanks for your reply!
That's cool, I'm also running the Supermicro X10SL7-F motherboard and I'm pretty happy with it but haven't used it for long yet.

So you would recommend having FreeBSD installed on ZFS and then just creating the storage on there. I also think it's the best solution, since the OS then gets some protection from ZFS as well.
I'm still concerned about where all the storage went, though; FreeBSD can't just take away 500 GB like that. I was also careful to specify only 2 GB of swap per drive, and I can't see any snapshots using "zfs list -t snapshot".

Does anybody have a clue why?

Code:
root@nas:~ # zfs list
NAME  USED  AVAIL  REFER  MOUNTPOINT
zroot  1.33G  7.65T  140K  none
zroot/ROOT  487M  7.65T  140K  none
zroot/ROOT/default  487M  7.65T  487M  /
zroot/tmp  140K  7.65T  140K  /tmp
zroot/usr  875M  7.65T  140K  /usr
zroot/usr/home  140K  7.65T  140K  /usr/home
zroot/usr/ports  874M  7.65T  874M  /usr/ports
zroot/usr/src  140K  7.65T  140K  /usr/src
zroot/var  761K  7.65T  140K  /var
zroot/var/crash  140K  7.65T  140K  /var/crash
zroot/var/log  203K  7.65T  203K  /var/log
zroot/var/mail  140K  7.65T  140K  /var/mail
zroot/var/tmp  140K  7.65T  140K  /var/tmp

Code:
root@nas:~ # zfs list -t snapshot
no datasets available

If anybody has some input on L2ARC I would be very thankful; my plan is to use the 2 SSDs as L2ARC cache.
Best regards
 
So you would recommend having FreeBSD installed on ZFS and then just creating the storage on there. I also think it's the best solution, since the OS then gets some protection from ZFS as well.
I'm still concerned about where all the storage went, though; FreeBSD can't just take away 500 GB like that. I was also careful to specify only 2 GB of swap per drive, and I can't see any snapshots using "zfs list -t snapshot".

I'd definitely want to have FreeBSD installed on a RAID setup so you can withstand a failure of a disk (hence me mirroring my root drive). You have the RAM so if it was up to me I would use ZFS for everything. I personally installed FreeBSD and have all my data on my mirrored SSD drives...the speed is awesome! Installing ports is silly fast and upgrading FreeBSD is so quick. An upgrade from 9.x to 10.x takes about 30min from source (I run STABLE)! Recompiling about 250 ports takes about 40min. On my old machine those tasks took about 2 days...

If it was my setup I would mirror the SSDs for the zroot and then use the 4 x 3TB drives for your data storage. From what I remember RAID10 (striping across mirrors) gives you the best speed with 4 drives, but then you lose half the space. Some people have said using SSD drives for FreeBSD and the root is a waste, but I absolutely love the speed and responsiveness of it. I've also had users say they notice the difference in speed compared to when I had their mail on an old school spinning SATA disk.
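
For reference, a striped-mirror ("RAID10") layout on the four data disks would be created along these lines; the pool name tank and the adaX device names here are just placeholders:
Code:
# two mirrored pairs striped together: half the raw space, best random I/O with 4 drives
zpool create tank mirror ada2 ada3 mirror ada4 ada5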

I have no idea where the 500GB went. Maybe someone else here has an idea but my entire server (excluding backup data drive) uses 17GB and that includes about 11GB of mail. Oh I also have LZ4 compression turned on since day one of the build of the server.
 
ZFS will take away a few percent of drive space for maintenance. The following shows the usable space on a 100GB pool.

Code:
zroot  18.8G  77.6G  144K  none
 
If I run gpart show:

Code:
=>        34  250069613  da0  GPT  (119G)
          34       2014       - free -  (1.0M)
        2048  232783872    1  freebsd-zfs  (111G)
   232785920   17283720    2  freebsd-zfs  (8.2G)
   250069640          7       - free -  (3.5K)

My zroot has a total of 111GB but if I run zfs list zroot:

Code:
NAME  USED  AVAIL  REFER  MOUNTPOINT
zroot  17.1G  89.5G  40.2M  /

I have a total of 106.6GB (17.1 + 89.5)

So it looks like you lose about 4% of the drive's space with ZFS: 111 - 89.5 - 17.1 = 4.4, and 4.4/111 = 3.96%.

In your example you have 7.65TB/8.16TB = 93.75%, i.e. about 6.25% of lost space, so maybe you lose more space as the disks get bigger? I'm not sure, only guessing here. I only use 128GB SSD drives!
 
As seen in your list output, the entire FreeBSD system is currently only using 1.33GB. Your disks are actually only around 2.72TB, not 3, as already mentioned. If you put that into your original calculation you lose about another 280GB per disk, or roughly 1.1TB overall. That makes up for the majority of your discrepancy.

I don't know where you got that info about corruption with L2ARC and needing two devices. When ZFS reads data it will use the cache if the data is there. It will validate the checksum and, if it's corrupt, read from the pool instead. If the L2ARC fails completely, ZFS will just ignore the device and go back to using the pool. The only benefit the second device will give you is increasing the available cache space. It's also never been possible to mirror cache devices (fairly sure about that but haven't confirmed).

Maybe you're thinking of ZIL / LOG devices, which used to be a critical component* and it was advised to mirror them. These days you can get away with one ZIL device and the pool can just be reverted to the last transaction if it fails during a crash. Probably still advised to mirror ZIL for enterprise systems though as it means the pool can automatically recover from a crash with one failed ZIL device.

You can add both devices as L2ARC which will give you all the SSD space as cache. You could also use one device as a ZIL, although that only speeds up sync writes which may not have much effect on your system unless you're doing a lot of stuff like NFS or databases.

The alternative is to mirror the SSDs for the OS, and use the SATA disks just for storage. This completely separates your data from the OS, although you no longer have the SSDs available for cache or log. That would probably be a good idea if FreeBSD hammered the OS disks as much as Windows does, but it probably isn't worth it on FreeBSD. You may as well stay as you are and use the SSDs for cache, or for cache and ZIL.

ada0 should be the first SATA port, so should be your first SSD. If you look at output from dmesg you should see all the disks listed with their name and details. You can also use the gpart command below to view the labels on each disk, which should show the ZFS label on the data disks. Also you can use diskinfo to view details about a disk, including the size and serial which can help identify disks.

Code:
# view boot output, which will include all hardware and disk devices
dmesg | grep 'ada'
# or:
grep 'ada' /var/run/dmesg.boot

# view GPT labels on a disk; should identify the ZFS disks
gpart show -l adaX

# get information about a disk
diskinfo -v adaX
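
Another quick way to match device nodes to physical drives (not in the list above) is camcontrol, which prints the model string next to each adaX node, so the SSDs stand out immediately:
Code:
# list attached CAM devices with their model strings and device nodes
camcontrol devlist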

Adding cache / log devices is as follows:
Code:
zpool add zroot cache ada0

zpool add zroot log ada0
# or, mirrored:
zpool add zroot log mirror ada0 ada1

*Well it still is a critical component. However a pool can usually be restored fairly easily after a failed ZIL these days. A few years ago it used to fault the pool.
 
Thanks for your help people,

Now that you mention it, the size makes sense; you never get a clean 3TB out of a 3TB disk. Silly me for not remembering that.
I have done it like this with my two SSDs and I think it'll be OK.

Code:
# create GPT partition tables on both SSDs
gpart create -s gpt ada0
gpart create -s gpt ada1
# first partition (60G) for L2ARC, the rest of each disk for the log, 4k-aligned
gpart add -a 4k -t freebsd-zfs -s 60G ada0
gpart add -a 4k -t freebsd-zfs ada0
gpart add -a 4k -t freebsd-zfs -s 60G ada1
gpart add -a 4k -t freebsd-zfs ada1
# attach the first partitions as cache devices and the second ones as log devices
zpool add zroot cache ada0p1
zpool add zroot cache ada1p1
zpool add zroot log ada0p2
zpool add zroot log ada1p2

This will give me two SSDs with L2ARC caching and two ZIL log devices.
Code:
root@nas:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:

  NAME          STATE   READ WRITE CKSUM
  zroot         ONLINE     0     0     0
    raidz1-0    ONLINE     0     0     0
      gpt/zfs0  ONLINE     0     0     0
      gpt/zfs1  ONLINE     0     0     0
      gpt/zfs2  ONLINE     0     0     0
      gpt/zfs3  ONLINE     0     0     0
  logs
    ada0p2      ONLINE     0     0     0
    ada1p2      ONLINE     0     0     0
  cache
    ada0p1      ONLINE     0     0     0
    ada1p1      ONLINE     0     0     0

errors: No known data errors
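
As an aside, when I redo this at reinstall time I might give the SSD partitions GPT labels so the status output reads like the data disks do; something like this (the cache0/log0 label names are just placeholders):
Code:
gpart add -a 4k -s 60G -t freebsd-zfs -l cache0 ada0
gpart add -a 4k -t freebsd-zfs -l log0 ada0
# (and likewise cache1/log1 on ada1)
zpool add zroot cache gpt/cache0
zpool add zroot log gpt/log0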

Would you recommend adding the log devices as a mirror instead of two separate devices, since the ZIL is more fragile than L2ARC, as stated in this thread? For example:
Code:
zpool add zroot log mirror ada0p2 ada1p2

I'm only messing around with this stuff right now and will completely reinstall when I'm satisfied and my small lab is done.
I have also enabled deduplication and compression on the zroot pool.
Code:
zfs set dedup=on zroot
zfs set compression=gzip zroot

However, I'm considering bumping the compression up to the below, since I have a pretty powerful CPU, so it won't really affect me noticeably.
Code:
zfs set compression=gzip-9 zroot

Thanks
Best regards
 
I'm only messing around with this stuff right now and will completely reinstall when I'm satisfied and my small lab is done.
I have also enabled deduplication and compression on the zroot pool.
Code:
zfs set dedup=on zroot
zfs set compression=gzip zroot

However, I'm considering bumping the compression up to the below, since I have a pretty powerful CPU, so it won't really affect me noticeably.
Code:
zfs set compression=gzip-9 zroot

Thanks
Best regards

I'd avoid dedup, it needs LOTS of memory to work correctly. I was dying to enable dedup on my server but after reading many posts about dedup I decided against it. To dedup the storage you have would take LOTS of RAM (I forget the exact amounts needed for dedup but it was a few GB of RAM per 1TB of space).

You should check out LZ4 compression; it gives great performance, and if it can't compress something by more than (I think) 12.5% it skips compressing that block.
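
Switching is a one-liner, and the compressratio property will show you what it's actually saving (it only applies to data written after the change):
Code:
zfs set compression=lz4 zroot
# overall compression ratio achieved so far on the dataset and its children
zfs get compressratio zroot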

Edit: This may help you re dedup:

http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

Spoiler:

RAM Rules of Thumb

  • For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K.
  • This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS.
 
Using two partitions on the SSDs seems a decent idea. I probably would mirror the ZIL, primarily because I don't think you'll get much benefit from two ZIL devices, but you will get a slight benefit from mirroring it. (If one fails and your machine crashes, you won't have to worry about manually recovering the pool.)
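
If the log partitions have already been added separately, as in the output above, converting them should just be a remove and re-add; a rough sketch against those device names (log and cache vdevs can be removed without harming the pool):
Code:
# detach the two standalone log vdevs...
zpool remove zroot ada0p2 ada1p2
# ...and re-add them as a single mirrored log
zpool add zroot log mirror ada0p2 ada1p2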

As mentioned it really isn't worth messing with dedupe. It's far more hassle than the benefit you get from it. Some people run it with success on backup servers where performance is not at all a requirement, but it can create additional issues, such as requiring a lot of RAM and time to delete data. Some of these problems may have been improved over the years but I still think it's too resource hungry to be advisable in most live systems.
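
If you ever do want to see what dedup would buy you before enabling it, zdb can simulate it against the data already in a pool; treat this as a rough sketch, since it can take a long time and a fair amount of RAM on a big pool:
Code:
# simulate dedup and print the projected DDT histogram and overall dedup ratio
zdb -S zroot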

As for compression, I would use lz4, again as already mentioned.
 
Thanks everybody,

I have decided to skip deduplication and disable it completely and instead compare the different levels of compression.
However, I have one final question: what is the benefit of having the ZIL on part of the SSD(s) instead of using both SSDs for L2ARC only? What do I gain from it, performance-, security- or stability-wise?

Best regards
 
The ZFS Intent Log (ZIL) is a small area of the pool used to back up sync writes. Adding a separate ZIL device purely moves the ZIL from the pool disks to an independent device. When writing sync data, ZFS adds it to the current transaction in RAM, but also drops a copy in the ZIL, just in case the server crashes before the in-memory transaction is completed.

So, stability wise you're not really any better off. You've just moved the ZIL from one disk to another. If the SSD has a supercap then you may be slightly more robust in the event of a crash, as most disks will lose what's in their write buffer. (It actually gets a bit more complicated than that. ZFS does issue flush commands to make sure disk caches are empty before completing sync writes, but not all disks actually do it - especially the cheaper ones most people, including me, tend to use)

Can't really see any difference in security.

Performance wise, as long as the SSD has better write performance (more importantly IOPS/latency) than the pool disks, you will see a benefit for sync writes. Of course, whether it's of much benefit to your use case depends on whether anything you plan to use the storage for does sync writes. As above, the primary candidates for that are NFS & databases. If it's all bulk file storage you may as well use both for cache.
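
A simple way to check whether your workload is even doing sync writes is to watch per-vdev I/O for a while; if the log devices stay idle, a dedicated ZIL isn't buying you anything (this assumes they're attached as in your earlier output):
Code:
# per-vdev I/O statistics, refreshed every second; watch the logs section
zpool iostat -v zroot 1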

Another alternative is to use one partition on the SSDs for the OS (you'd need to re-partition with boot code), the whole SATA disks for the storage pool, and a second SSD partition for storage cache. Not sure that gives you much benefit though, really. It would possibly make things easier if you planned to encrypt the data pool.
 
RAM Rules of Thumb
  • For every TB of pool data, you should expect 5 GB of dedup table data, assuming an average block size of 64K.
  • This means you should plan for at least 20GB of system RAM per TB of pool data, if you want to keep the dedup table in RAM, plus any extra memory for other metadata, plus an extra GB for the OS.
Our experience with dedupe and large ZFS pools comes to a different conclusion. The rule of thumb we've used is that you need:
  • approx 1 GB of ARC space (RAM plus cache) per TB of unique data in the pool.
And, you only need enough RAM to support your OS, your apps, and the ARC space for the pool (which includes only the parts of the DDT that are in use).

Our largest deduped pool has 90x 2 TB harddrives, for 163 TB of space in the pool:
Code:
$ zpool list storage
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage   163T   122T  41.5T    74%  1.78x  ONLINE  -

Going by your Rule of Thumb, we'd need 815 GB of RAM to hold the dedupe table (which may be true if we loaded the entire thing into ARC). However, you never need to load the entire table. According to top(1), ARC usage is under 30 GB (that's our steady-state; system has been up for 82 days now):
Code:
last pid: 58206;  load averages:  1.43,  1.18,  0.97                                        up 82+01:00:13  08:38:27
52 processes:  1 running, 51 sleeping
CPU:  0.0% user,  0.0% nice, 15.2% system,  0.8% interrupt, 84.0% idle
Mem: 31M Active, 54M Inact, 69G Wired, 396K Cache, 55G Free
ARC: 17G Total, 6737M MFU, 6924M MRU, 44M Anon, 2178M Header, 1734M Other
Swap: 8192M Total, 16M Used, 8176M Free
There's "only" 128 GB of RAM in the box, along with 128 GB of L2ARC space. Again, according to your rule of thumb, we'd only be able to support a pool up to 50 TB in size.
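
(If you want those ARC numbers without running top, the arcstats sysctls report the same thing on FreeBSD:)
Code:
# current ARC size and its configured ceiling, in bytes
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max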

I don't have current numbers on the total size of the DDT for this pool ( # zdb -DD storage is running, but it takes a while to complete), but it's not the total size that matters. It's the current operating size of the DDT in ARC that matters. The DDT is metadata; metadata is loaded on demand; so you only need enough ARC space to hold the DDT for the data you are currently working on.


Granted, all that said, I would caution people against turning on dedupe. It's really not worth the performance penalty, especially when deleting snapshots, resilvering devices, or scrubbing the entire pool, for the little bit of space savings. When we first started using dedupe, harddrives were expensive ($200+ CDN for a 1 TB harddrive, so we started with 0.5 TB drives) while CPU/RAM were relatively inexpensive. Now we can get 2 TB harddrives for $80 CDN (in bulk), but CPU/RAM hasn't really changed in price. It's less expensive to just add harddrives to a pool than to fight with dedupe.
 
Thank you both for your inputs, I value them a lot in my decisions.

I have come to realise that my setup is pretty much overkill for what I'm actually going to use it for.
16 GB of RAM and 120 GB of L2ARC cache (approx. 60 GB on each SSD) is something I will never fill up, and the same goes for the ZIL being around 120 GB as well. Even filling up 80% of the RAM (~12 GB) will be a huge challenge, so it basically comes down to how to use the two SSDs most efficiently.

On the bright side, being into technology with all its ups and downs, it's worth the money to have some fun things to play with.
I would like to thank you all for sharing your inputs, experience and answering my questions.

Best regards
 
Why does it say only 7.65T of free space?
I would have expected something like: 12TB (4 x 3TB) - 3TB (raidz1 parity) - 2 GB of swap per disk - 46 GB per disk (ZFS using 1/64th of each disk?) - 2-10GB at most for FreeBSD = around 8.6-8.7TB of space. Why is it so much lower?

As already pointed out, your 3TB drives are really about 2.72TB, because the vendor reports space in decimal (base-10) TB while FreeBSD/ZFS reports it in binary (base-2) units.

So it looks like you lose about 4% of the drive's space with ZFS: 111 - 89.5 - 17.1 = 4.4, and 4.4/111 = 3.96%.

If you look at this commit to base, https://lists.freebsd.org/pipermail/svn-src-head/2014-July/060276.html, you will see that 1/32 or 3.125% of the pool is reserved as "slop" space. The description in the commit does a good job of explaining what the slop space is for, so I won't duplicate the info here.
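
On FreeBSD the reservation is exposed as the vfs.zfs.spa_slop_shift sysctl (the reserved fraction is 1/2^n, so the default of 5 gives 1/32); I believe it's present on 10.1, but check on your release:
Code:
# slop space is 1/2^n of the pool; the default of 5 means 1/32 is held back
sysctl vfs.zfs.spa_slop_shift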

More number porn below:

Code:
$ bc
scale=1
3*(1000^4)/(1024^4)*4
10.8
3*(1000^4)/(1024^4)*(31/32)*3
7.8

So, if your ZFS partitions were 3000000000000 bytes each, you should have a pool of around 10.88TB (you do). The available space, after slop and redundancy are removed, would be around 7.8TB (not too far from your 7.65TB).

Tip: you can check the exact size of your ZFS partitions using gpart show and use those sizes instead of the 3*(1000^4) above for a more realistic view of the numbers.
 