Solved: Proper UFS formatting guidance

EDIT: My concerns over drive speed were misplaced; the drive was already performing at its maximum. My confusion came from the impressive speed gains I had seen from an 11-disk zpool built from the same model of drive. @wblock@ provides good guidance on formatting below.

I would like to verify that I am partitioning a basic UFS file system correctly. I have configured a solitary drive as:
Code:
gpart destroy -F da10
gpart create -s GPT da10
gpart add -t freebsd-ufs da10
newfs -S 4096 -b 32768 -f 4096 -O 2 -U -m 8 -o space -L ufs4kb /dev/da10p1
I was underwhelmed by the resulting IO performance, though:
bonnie++ [size=-2]v1.96[/size] UFS space optimized 80G N=3
Read=130 Write=120[size=-2]±3[/size] Rewrite=43[size=-2]±1[/size] [size=-2](MB/sec)[/size] Latency: 610,1200,8900 [size=-2](ms)[/size]

The drive is a 3 TB SATA 3.0 drive connected to the same HBA as 11 other identical drives, which make up a RAID-Z3 pool. That pool gets significantly (3-4 fold) better performance (more details):
bonnie++ [size=-2]v1.96[/size] RAID-Z3 11 drives 100G N=6
Read=670[size=-2]±77[/size] Write=330 Rewrite=230[size=-2]±14[/size] [size=-2](MB/sec)[/size] Latency: 260,780,2100 [size=-2](ms)[/size]

I expected the drive to perform similarly to its comrades in the pool, perhaps better. The drives are all "fake 512 byte sector" WD drives, and I have configured the pool correctly for them (ashift=12). Have I failed to do the equivalent for the UFS drive? Or missed a configuration step? Or does a large ZFS pool make such good use of parallel IO that it is simply that much more efficient than an isolated drive?

I tried to follow guidance using bsdlabel, but learned that it does not support drives over 2 TB, and was pointed back to gpart. I tried just using newfs:
Code:
gpart destroy -F da10
dd if=/dev/zero of=/dev/da10 bs=4096 count=1
newfs -U -f 4096 /dev/da10
... but ended up with a provider I could not mount (mount: /dev/da10s1: Invalid argument).

Drive characteristics:
Code:
[CMD]diskinfo -v da10[/CMD]
da10
	512         	# sectorsize
	3000592982016	# mediasize in bytes (2.7T)
	5860533168  	# mediasize in sectors
	4096        	# stripesize
	0           	# stripeoffset
	364801      	# Cylinders according to firmware.
	255         	# Heads according to firmware.
	63          	# Sectors according to firmware.
	     WD-WCC1T0860025	# Disk ident.
 
listentoreason said:
I would like to verify that I am partitioning a basic UFS file system correctly. I have configured a solitary drive as:
Code:
gpart destroy -F da10
gpart create -s GPT da10
gpart add -t freebsd-ufs da10
newfs -S 4096 -b 32768 -f 4096 -O 2 -U -m 8 -o space -L ufs4kb /dev/da10p1

The partition is almost certainly not aligned with the 4K blocks on the drive. Please show the output of gpart show da10.

To align, back up the data and delete the partition. Then add it again, but make sure it starts at an aligned block and that its size is an even multiple of the block size:
gpart add -t freebsd-ufs -b 1m -a4k da10

As for the newfs options, how sure are you about overriding the default values? The defaults have also changed over time. What version of FreeBSD are you using?
 
wblock@ said:
The partition is almost certainly not aligned with the 4K blocks on the drive. Please show the output of gpart show da10.
Code:
[CMD]gpart show da10[/CMD]
=>        34  5860533101  da10  GPT  (2.7T)
          34           6        - free -  (3.0k)
          40  5860533088     1  freebsd-ufs  (2.7T)
  5860533128           7        - free -  (3.5k)

wblock@ said:
To align, back up the data and delete the partition. Then add it again, but make sure it starts at an aligned block and is an even block multiple in size:
gpart add -t freebsd-ufs -b 1m -a4k da10
OK, done. I'm not sure I need it, but I have been using gpart destroy -F in the hope that it results in a cleaner process (I'm concerned about a prior experiment tainting later attempts). I've run:
Code:
[CMD]gpart destroy -F da10[/CMD]
[CMD]gpart create -s GPT da10[/CMD]
[CMD]gpart add -t freebsd-ufs -b 1m -a4k da10[/CMD]
[CMD]gpart show da10[/CMD]
=>        34  5860533101  da10  GPT  (2.7T)
          34        2014        - free -  (1M)
        2048  5860531080     1  freebsd-ufs  (2.7T)
  5860533128           7        - free -  (3.5k)
[CMD]newfs -S 4096 -b 32768 -f 4096 -O 2 -U -m 8 -o space -L ufs4kb /dev/da10p1[/CMD]
[CMD]gpart list da10[/CMD]
Geom name: da10
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 5860533134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da10p1
   Mediasize: 3000591912960 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   rawuuid: 1364e472-3f02-11e3-b256-002590c52018
   rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 3000591912960
   offset: 1048576
   type: freebsd-ufs
   index: 1
   end: 5860533127
   start: 2048
Consumers:
1. Name: da10
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
Unfortunately, the benchmarks are still lackluster. gpart list reports the sector size as 512 bytes, but I don't know whether that is what the file system is actually using, or whether it is just parroting the emulated value reported by the drive. Was I supposed to include a space after "-a" in "-a4k"? (I'm running benchmarks on that possibility now, but it's still reporting "Sectorsize: 512" for da10p1.)

bonnie++ [size=-2]v1.96[/size] UFS-wblockHelp 80G N=3
Read=140 Write=130[size=-2]±4[/size] Rewrite=46[size=-2]±1[/size] [size=-2](MB/sec)[/size] Latency: 710,640,8100 [size=-2](ms)[/size]

wblock@ said:
As far as the newfs, how sure are you about overriding the default values? And the defaults have changed. What version of FreeBSD are you using?
Not sure at all! I'm staggering through a couple dozen similar-but-not-quite-identical web sites and forum posts that try to provide wisdom on the 4k drive issue. As far as I can tell, the command I am using preserves most of the defaults already:
Code:
[CMD]uname -a[/CMD]
FreeBSD citadel 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     [email]root@farrell.cse.buffalo.edu[/email]:/usr/obj/usr/src/sys/GENERIC  amd64
[I]newfs defaults, via [CMD]man[/CMD]:[/I]
  -O 2 -a 16 -b 32768 -f 4096 -m 8 -r 0
Many of the other defaults are calculated dynamically based on the disk. The major deviation from the defaults is the use of -S 4096, which comes with this uncomfortable warning in the man page:
Changing these defaults is useful only when using newfs to build a file system whose raw image will eventually be used on a different type of disk than the one on which it is initially created (for example on a write-once disk). Note that changing any of these values from their defaults will make it impossible for fsck(8) to find the alternate superblocks if the standard superblock is lost.
That makes me nervous, but I figured the whole point of the exercise was to force the sector size to 4096...
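For reference, dumpfs(8) should show what block and fragment sizes the new file system actually ended up with (only the first few lines of output matter here):
Code:
[CMD]dumpfs /dev/da10p1 | head -n 20[/CMD]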
 
listentoreason said:
Code:
[CMD]gpart show da10[/CMD]
=>        34  5860533101  da10  GPT  (2.7T)
          34           6        - free -  (3.0k)
          40  5860533088     1  freebsd-ufs  (2.7T)
  5860533128           7        - free -  (3.5k)

That UFS partition is aligned.

Ok, done. I am not sure I need it, but I have been using gpart destroy -F, in the hope that it results in a cleaner process (I'm concerned about having a prior experiment taint later attempts).

-F just allows destroying without having to delete all the partitions first.

Code:
1. Name: da10p1
   Mediasize: 3000591912960 (2.7T)
   Sectorsize: 512
   [color="Red"]Stripesize: 4096[/color]
   Stripeoffset: 0
   Mode: r1w1e1
   rawuuid: 1364e472-3f02-11e3-b256-002590c52018
   rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 3000591912960
   offset: 1048576
   type: freebsd-ufs
   index: 1
   end: 5860533127
   start: 2048
Consumers:
1. Name: da10
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   [color="Red"]Stripesize: 4096[/color]
   Stripeoffset: 0
   Mode: r1w1e2

The "stripe size" is the physical sector size, so that's correct.

Unfortunately, the benchmarks are still lackluster.

It may just be as fast as that drive can go.
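One way to check what the drive itself can do, independent of the file system, is diskinfo(8)'s simple built-in benchmark or a raw sequential read, for example (device name as above, count chosen to read about 10 GB):
Code:
[CMD]diskinfo -t da10[/CMD]
[CMD]dd if=/dev/da10 of=/dev/null bs=1m count=10240[/CMD]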

Was I supposed to include a space after "-a" in " -a4k"

It's optional. Personally, I find it easier to read with spaces only between the different options.

As far as I can tell, the command I am using is preserving most of the defaults already:
Code:
[CMD]uname -a[/CMD]
FreeBSD citadel 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     [email]root@farrell.cse.buffalo.edu[/email]:/usr/obj/usr/src/sys/GENERIC  amd64
[I]newfs defaults, via [CMD]man[/CMD]:[/I]
  -O 2 -a 16 -b 32768 -f 4096 -m 8 -r 0

This falls into the category of premature optimization, or a variation of Occam's razor: don't change defaults without a reason.

Many of the other defaults are calculated dynamically based on the disk. The major deviation from the defaults is the use of -S 4096, which comes with the uncomfortable man warning of:

That makes me nervous, but I figured the whole point of the exercise was to force the sector size to 4096...

I have never changed that, and I'm not certain that what UFS means by "sector" is the physical block size. This article talks about 4K drives, but only mentions changing cluster size:
http://ivoras.sharanet.org/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html
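Roughly, I would expect something like this to be sufficient (same device name as above, newfs left at its defaults aside from soft updates):
Code:
[CMD]gpart destroy -F da10[/CMD]
[CMD]gpart create -s GPT da10[/CMD]
[CMD]gpart add -t freebsd-ufs -b 1m -a 4k da10[/CMD]
[CMD]newfs -U /dev/da10p1[/CMD]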
 
Am I the only one who thinks 130 MBps read and 120 MBps write are reasonable for a single disk?

You appear to be comparing a single disk against an 11-drive array. Yes, the RAID-Z3 will perform faster, especially for non-random reads/writes, and there's no way you'll get anywhere near 670/330 with a single SATA disk. You'd be lucky to get near those RAID-Z3 speeds with a SATA SSD, let alone a traditional disk.

... but ended up with a provider I could not mount (mount: /dev/da10s1: Invalid argument).

If you ran newfs directly on /dev/da10, I would expect you to need to use /dev/da10 in the mount command.
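That is, something along these lines, with /mnt standing in for whatever mount point you use:
Code:
[CMD]newfs -U -f 4096 /dev/da10[/CMD]
[CMD]mount /dev/da10 /mnt[/CMD]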
 
usdmatt said:
Am I the only one who thinks 130 MBps read and 120 MBps write are reasonable for a single disk?
It's probably just me that's worked up about it ;). My concern came from the massive performance difference between my zpool and the single disk. I honestly was not sure what to expect from setting up a pool. I had seen marked performance improvement while increasing pool size, but had been assuming that I was "paying less of a penalty" somewhere rather than actually gaining performance over a single drive.

While waiting for yet another bonnie++ run to finish, I pulled up the Western Digital Red drive specification sheet. It does in fact list the performance as "147 MB/s Host to/from drive (sustained)". So my results are almost exactly what I should expect, and if we assume that WD is using "drive manufacturer math" and "MB" is not "MiB", then my ~140 MiB/sec is spot-on.
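A quick sanity check of that conversion (147 decimal megabytes per second expressed in MiB/sec) with bc(1):
Code:
[CMD]echo "scale=1; 147 * 10^6 / 2^20" | bc[/CMD]
140.1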

Today's Lesson Learned: ZFS not only aggressively preserves data integrity, but can provide a massive IO improvement over isolated drives (here roughly 2.5-5 fold, depending on the workload).

This still leaves me confused about why GELI-on-top-of-ZFS is so slow. If I am building a virtual device on top of my extra-speedy zpool, I would expect to inherit the performance benefit of the pool.

Thanks @wblock@, your posts were extremely helpful. I'm going to strip out the partition, and reformat per your instructions (including not fixin' what ain't broke). This will let me gather geli benchmarks off of a "raw" drive, which will in turn help me determine if I have storage or processing bottlenecks in my encryption plans.
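Roughly what I have in mind for the encryption benchmark, with the options still subject to change (the 4k sector size for the eli provider and the mount point are just my working assumptions):
Code:
[CMD]geli init -s 4096 /dev/da10p1[/CMD]
[CMD]geli attach /dev/da10p1[/CMD]
[CMD]newfs -U /dev/da10p1.eli[/CMD]
[CMD]mount /dev/da10p1.eli /mnt[/CMD]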
 
How is your RAID-Z3 pool divided into vdevs? ZFS will provide big improvements in performance over single disks if you have multiple vdevs in a pool. A single-vdev ZFS pool may not provide that much of a performance increase over a single disk; in some cases it can even be slower than a single disk.
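The layout is easy to check; the pool name below is just a placeholder:
Code:
# each raidz3-N group listed under the pool name is one vdev
[CMD]zpool status tank[/CMD]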
 
I was under the impression that a single RAID-Z vdev could provide much more sequential throughput than a single disk, but similar IOPS. Most of the information I've seen (including this and @listentoreason's other recent thread) suggests that RAID-Z can easily outperform a single disk.

Granted, neither of these threads display the actual pool layout, but I'm under the impression it's a single vdev.

My first ZFS NAS from several years ago, with 4 disks in RAID-Z1 (not optimal), could definitely beat the performance of a single disk.
 
usdmatt said:
Granted, neither of these threads display the actual pool layout, but I'm under the impression it's a single vdev.

That is correct. I have twelve 3 TB SATA drives connected to an LSI SAS HBA. Eleven of those drives are in a RAID-Z3 pool, and the "left over" drive is what I'm doing single drive benchmark tests on.

I'll mark this solved, thanks @wblock@ and @usdmatt for the advice and input.
 