ZFS using 'advanced format drives' with FreeBSD (8.2-RC3)

Bucky · Feb 8, 2011

Goal: Add a large amount of storage to my existing home server, using zfs and a bunch of big, cheap HDDs. Not interested in booting into zfs. My home server already runs 24/7 as it has a mail server and DHCP server running, too.

My H/W (you want pretty fast and mostly LOTS OF RAM)

Code:

	mobo = Asus P5WDH (only 3 'normal', on-board SATA ports)
	RAM = 6 gig ECC DDR2 (not in dual channel mode)
	CPU = C2Q6600 or C2Q9400 CPU (two otherwise identical servers)
	SATA = Supermicro AOC-USAS-L8i PCIe 8 port SATA card
	HDDs = 2 identical Brand X HDDs for mobo connectors port 0 & 1
	HDDs = 8 identical Samsung F4 HD204UI 2 TB HDDs (these are 'advanced format drives') connected to the Supermicro card.

Running FreeBSD 8.x AMD64 (64 bit version).

Kudos to sub.mesa and the many, many other folks who know so much more than I and especially to Pawel for porting zfs to FreeBSD. Sub.mesa's website has a very good how-to install FreeBSD and other stuff.

This isn't *the* way to do it, just *a* way to do it.

Install FreeBSD on one of the Brand X HDDs, then gmirror(8) that drive to the other identical drive. Edit rc.conf to enable zfs and reboot.

Identify the 8 HDDs for the zfs array watching the bootup process, doing a dmesg and/or looking in /dev/ (eg, /dev/da0-7). Make sure the drives are free of any partioning info:
# dd if=/dev/zero of=/dev/da0 bs=1m count=1

Repeat for each drive.

Force the drives to ignore the 'advanced format drive' firmware (mis)information, thus:
# gnop create -S 4096 /dev/da0

Now build your zfs array, thus:
# zpool create media raidz2 da0.nop da1 da2 da3 da4 da5 da6 da7

Save it
# zpool export media

Trash the gnop
# gnop destroy /dev/da0.nop

Rebuild it
# zpool import media

REBOOT

Shouldn't show anything like da0.nop
# ls /dev/

Now test...

Code:

# ls /media	; should be there
# zdb		; look for ashift = 12 somewhere

Run write/read speed test like this:

Code:

write:	dd if=/dev/zero of=/media/zerofile.000 bs=1m count=20000
read:	dd if=/media/zerofile.000 of=/dev/null bs=1m

My write speed showed this:

Code:

20971520000 bytes transferred in 52.457735 secs (399779365 bytes/sec)

My read speed showed this:

Code:

20971520000 bytes transferred in 46.018040 secs (455723884 bytes/sec)

I'm happy. Similar results should be had with the Western Digital advanced format drives.

Eventually, I'll put up a web site with a much more detailed How-To along with the theoretic basis as I understand it (probably wrong)

Hope this helps someone.

ian-nai · Feb 8, 2011

Interesting! That's the first time I've heard of the gnop utility. I followed all of the steps you took but that one. What was achieved by creating a 4096 byte (?) gnop provider?

AndyUKG · Feb 8, 2011

The gnop device reports its sector size to zpool as 4096 bytes, based on this the pool is created with an appropriate ashift value for 4k disks. Tnis normally wouldn't happen as the 4k disks typically emulate 512 byte sectors, so zpool pool creates a pool optimised for 512 byte sectors of course!
The value of ashift is a per pool setting that should reflect the physical sector size of the disks in order to achieve normal/optimal performance,

ta Andy.

PS nice solution Bucky!

nakal · Feb 8, 2011

I would not test with /dev/zero. ZFS seems to optimize something with such streams (I got over 500 MB/s write speed on my mirrored drive). Also ZFS has a large cache, so reading a 2 GB file twice gives me 2 different results: first about 160 MB/s, then the second time 1400 MB/s.

Just for information, this is not how you would do benchmarks on ZFS.

Bucky · Feb 8, 2011

Benchmarking..

Thank you, nakal. Shows how little I really know.

Since 2000hrs last night, my server has been copying my media collection from the [soon-to-be] old server, to the new server. Using rsync over my gigabit home network about 90% done transferring 5 TB of files (about 6000 files)from the old zfs pool to the new zfs pool. Average transfer "speed" showing is 70-80 MB/s. Been running 569 minutes as of now.

Just for reference sake... Old server has 8 WD 1TB Caviar Black drives in zfs raidz2. I'm running out of room over there. New server has 8 Samsung 2TB F4 drives in zfs raidz2.

phoenix · Feb 8, 2011

Why are you transferring the data and not simple replacing the drives in the original raidz2? Offline 1 drive, remove it from server, insert new disk, and then just # zpool replace <poolname> <olddisk> <newdisk>

Wait for that to finish. Then repeat with the rest of the disks. Once all 8 are replaced ... you have 50% free space (may need to reboot or export/import the pool).

Hrm, thinking about it, though, going from 512B disks to 4096B disks would lead to performance issues, as you need to create the pool with a 4K record size (ashift=12).

aragon · Feb 9, 2011

Bucky, it'd be nice if you could do more benchmarks.

phoenix · Feb 9, 2011

Installing security/openssh-portable with the HPN patches, and configuring /usr/local/etc/ssh/sshd_config to enable HPN with a buffersize of 8192, and using the None cipher, will increase your transfer speeds, even with rsync, a lot.

# rsync --archive --hard-links --delete-during --delete-excluded --partial --inplace --numeric-ids --rsh="/usr/local/bin/ssh -oNoneEnabled=yes -oNoneSwitch=yes -oHPNBufferSize=8192" --rsync-path="sudo rsync" [email]username@remote.host[/email]:/path/to/start/from/ /path/to/copy/to/

Be sure to disable the included SSH server (ssd_enable="NO") and enable the ports version (openssh_enable="YES") in /etc/rc.conf

Bucky · Feb 9, 2011

Ummm, now you will see just how little I know about all this.

I didn't think to do the disc replacement with a bigger disc in the existing array, but figured nothing short of the way I did it would bypass the 512b vs 4096b sector size issue. Somewhere I read that zfs uses the physical sector size reported by the *first* disc added to an array as the size for all the discs in the array, even those added or replaced later. Can't mix/match drives too much else stability becomes an issue.

I'd be happy to run proper benchmarks if someone can tell me what program(s) to use and how to run them to your specification.

I use the base ssh as it works perfectly adequately for me. I used to install the one from the ports, but it doesn't get me anything else that I need.

And Phoenix, thank you for formatting my initial message - I didn't know how to do that and it didn't occur to me either - first posting ever on this forum. It looks very pretty the way you've done it and is imminently more readable.

For now, my media collection has been transferred to the new 2TB Samsung drives and I've shut down that machine until the RELEASE version of 8.2 is out. Then I'll do an 'export' and pull out the SATA card to those drives while I build up a new server on the gmirror drives, then do an 'import' and I'll be ready to go again.

danbi · Feb 9, 2011

The ashift value is on per-vdev base, not on per-zpool base. Therefore, in an zpool you may have one vdev with 4k drives, another with 512b drives etc. Just make sure you do not mix both types, unless you use the larger sector size found in the vdev when creating it.

phoenix · Feb 9, 2011

Are you sure about that? From what I can see on the zfs-discuss mailing list, it's a per-pool setting. But, I haven't looked at the code to confirm that.

danbi · Feb 10, 2011

Look at the output of zdb -- the ashift values are listed as property of the vdev.

One could easily test this with multi-vdev zpool..

AndyUKG · Feb 10, 2011

danbi said:
One could easily test this with multi-vdev zpool..

Yeah, Id seen those comments on zfs discuss too. I've just tested it, it is indeed a vdev specific setting, which makes sense.

phoenix · Feb 10, 2011

Hrm, interesting.

Wonder how well it copes with the situation where vdev A has ashift=9 and vdev B has ashift=12. Wonder if it would impact performance at all, where you go to write 12 KB of data across the two vdevs (multiples of 4K written to vdev B and multiple of 0.5K written to vdev A).

Anyone done any benchmarking of creating a vdev using 4K gnop devices, with non-4K Advanced Format (aka 512B sectors) drives? Just wondering if for our next storage box, we should create the vdevs using gnop to set ashift=12, to allow us to migrate down the road to 4K drives without issues.

danbi · Feb 11, 2011

phoenix, indeed, this makes perfect sense. Remember that ZFS is designed touse any block device for storage, therefore it is very likely that blocksize will be different for devices and therefore ashift. I wonder if it is indeed true, that zpool create would consider only the blocksize of the first device, or, what is more reasonable to expect, will consider the largest blocksize of all devices participating in the vdev.

Funny, creating vdev out of 4k drives let's you replace these later with 512b drives without loss of performance, but the opposte is not quite true. I think, with the current multi-terrabyte drive capacities using 512b as the minimum data unit does not make much sense.

Even better would be to have the ability to drop a vdev off the zpool......

boolean · Mar 30, 2011

Thanks for this tip!

tonyalbers · Apr 7, 2011

Slightly O/T here, but wouldn't it make more sense to copy the data using NFS instead of rsync, since ZFS is pretty much born with NFS support?

/tony

dennylin93 · Apr 7, 2011

In some cases, rsync is much faster than NFS.

phoenix · Apr 7, 2011

tonyalbers said:
Slightly O/T here, but wouldn't it make more sense to copy the data using NFS instead of rsync, since ZFS is pretty much born with NFS support?

ZFS on Solaris has built-in, in-kernel servers for NFS and CIFS. However, ZFS on FreeBSD just uses the normal nfs daemons and samba daemons.

The only difference between exporting a UFS filesystem via NFS, and exporting a ZFS fileystem via NFS is that you have two different ways of configuring the NFS export line for ZFS: via the sharenfs property, or via the normal /etc/exports file. And all the sharenfs property does on FreeBSD is write a line to /etc/zfs/exports.

There's nothing magical or even new about NFS exporting a ZFS filesystem on FreeBSD.

tonyalbers · Apr 8, 2011

Thanks phoenix, I didn't know that zfs nfs shares were dealt with that way in FreeBSD.

/tony

serverhamster · May 17, 2011

How can you see if a drive has 4k sectors?

Code:

da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <ATA WDC WD30EZRS-00J 0A80> Fixed Direct Access SCSI-5 device 
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)

Can we believe the value of 512 byte sectors? These drives probably have 4k, but I want to make sure. There is a lot of talk about the EADR drives, but this one is EZRS.

aragon · May 18, 2011

You can't believe the advertised sector size. Check your drive specifications - if it is said to be "advanced format" then it is a 4k drive. You can also test by writing to it on and off 4k boundaries - it'll be reproducibly slower unaligned.

boolean · May 18, 2011

Trying to figure out why my post above was edited.

serverhamster · May 19, 2011

Thanks. It is "advanced format". I recreated the zpool.

danpritts · Jul 17, 2011

any downside to using the ashift=12 with 512b drives?

I'm about to rebuild my system, as it turns out with 512-byte sector drives.

Is there any downside, other than a slight loss of capacity with small files, in building an array assuming the 4k sector size? I'm imagining replacing a disk later might be easier this way.