Solved: zpool on MegaRAID JBOD and 4k AF drives

Hello everyone. And sorry for my poor English - it's not my native language =)
I'm in the process of installing and testing a new server with FreeBSD 11 and ZFS. I'm trying to set up a complicated (or not so much?) config of three pools, and I'd like to ask you, fellow $username, for advice.
The chassis contains:
Code:
- LSI (Avago) 9271-8i + 16x SAS HDDs (HGST HUS726040AL5214) in JBOD mode;
- Supermicro X10DRi + 2x Intel DC SSDs (3 partitions on each - for ARC, ZIL and data);
- 128 GB DDR4 ECC RAM
ZFS pools:
Code:
- 1st: FreeBSD OS root pool @ internal 16 GB USB3.0 flash drive;
- 2nd: pool for "data" @ LSI 9271: 14 HDDs as raidz3 + 2 HDDs as spares + 2x ssdXp1 ARC stripe + 2x ssdXp2 ZIL mirror;
- 3rd: pool for fast data: ZFS mirror @ SSD partitions (2x adaXp3)
The script I use to install (questions are at the end of the post; you can skip reading the script for now. Synopsis: create boot partition and ROOT pool on the USB stick \\ create DATA pool and add SSD partitions as ARC+ZIL with 2x HDDs as spares \\ create the SSD fast pool, install the system):
Code:
#!/bin/sh
dev0="da0"  # USB3.0 flash stick @ internal X10DRi's USB3.0 port
ssd0="ada0" # 2x SSDs @ X10DRi's Intel AHCI SATA controller
ssd1="ada1"
hdd0="mfisyspd0" # 16x HDDs @ LSI 9271
<...>
hdd15="mfisyspd15"
#
gpart create -s gpt $dev0
gpart create -s gpt $ssd0
gpart create -s gpt $ssd1
gpart create -s gpt $hdd0
<...>
gpart create -s gpt $hdd15
#
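# boot + root partitions on the USB stick; L2ARC, ZIL and data partitions on the SSDs; one whole-disk partition per HDD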
gpart add -b 2048 -a 4k -s 478 -t freebsd-boot $dev0
gpart add -s 12G -a 4k -t freebsd-zfs -l zroot0 $dev0
gpart add -s 32G -a 4k -t freebsd-zfs -l l2arc0 $ssd0
gpart add -s 16G -a 4k -t freebsd-zfs -l zil0 $ssd0
gpart add -s 32G -a 4k -t freebsd-zfs -l l2arc1 $ssd1
gpart add -s 16G -a 4k -t freebsd-zfs -l zil1 $ssd1
gpart add -s 136G -a 4k -t freebsd-zfs -l zssd0 $ssd0
gpart add -s 136G -a 4k -t freebsd-zfs -l zssd1 $ssd1
gpart add -a 4k -t freebsd-zfs -l hdd0 $hdd0
<...>
gpart add -a 4k -t freebsd-zfs -l hdd15 $hdd15
#
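# install the protective MBR and the ZFS boot code into the freebsd-boot partition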
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $dev0
#
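# gnop trick: create 4096-byte-sector .nop overlays so zpool create picks ashift=12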
gnop create -S 4096 /dev/gpt/zroot0
gnop create -S 4096 /dev/gpt/l2arc0
gnop create -S 4096 /dev/gpt/l2arc1
gnop create -S 4096 /dev/gpt/zil0
gnop create -S 4096 /dev/gpt/zil1
gnop create -S 4096 /dev/gpt/zssd0
gnop create -S 4096 /dev/gpt/zssd1
gnop create -S 4096 /dev/gpt/hdd0
<...>
gnop create -S 4096 /dev/gpt/hdd15
#
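# create the pools on the .nop nodes: single-device root pool; raidz3 data pool
# with striped L2ARC, mirrored ZIL and two hot spares; mirrored SSD pool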
zpool create -f -o cachefile=/var/tmp/zpool.cache -R /mnt -m / zroot /dev/gpt/zroot0.nop
zpool create -f -o cachefile=/var/tmp/zpool.cache -R /mnt/zdata -m /zdata zdata raidz3 /dev/gpt/hdd0.nop <...> /dev/gpt/hdd13.nop \
cache /dev/gpt/l2arc0.nop /dev/gpt/l2arc1.nop \
log mirror /dev/gpt/zil0.nop /dev/gpt/zil1.nop
zpool add zdata spare /dev/gpt/hdd14.nop
zpool add zdata spare /dev/gpt/hdd15.nop
zpool create -f -o cachefile=/var/tmp/zpool.cache -R /mnt/zssd -m /zssd zssd mirror /dev/gpt/zssd0.nop /dev/gpt/zssd1.nop
#
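# export the pools and destroy the overlays so the pools can be re-imported via the real gpt labels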
zpool export zssd
zpool export zdata
zpool export zroot
#
gnop destroy /dev/gpt/zroot0.nop
gnop destroy /dev/gpt/l2arc0.nop
gnop destroy /dev/gpt/l2arc1.nop
gnop destroy /dev/gpt/zil0.nop
gnop destroy /dev/gpt/zil1.nop
gnop destroy /dev/gpt/zssd0.nop
gnop destroy /dev/gpt/zssd1.nop
gnop destroy /dev/gpt/hdd0.nop
<...>
gnop destroy /dev/gpt/hdd15.nop
#
zpool import -d /dev/gpt -o cachefile=/var/tmp/zpool.cache -R /mnt zroot
zpool import -d /dev/gpt -o cachefile=/var/tmp/zpool.cache -R /mnt/zdata zdata
zpool import -d /dev/gpt -o cachefile=/var/tmp/zpool.cache -R /mnt/zssd zssd
#
zpool set bootfs=zroot zroot
zpool set feature@lz4_compress=enabled zroot
zpool set feature@lz4_compress=enabled zdata
zpool set feature@lz4_compress=enabled zssd
zfs set checksum=fletcher4 zroot
zfs set checksum=fletcher4 zdata
zfs set checksum=fletcher4 zssd
zpool set autoexpand=on zroot
zpool set autoexpand=on zssd
zpool set autoexpand=on zdata
zfs set atime=off zroot
zfs set atime=off zssd
#
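# dataset layout: base system on zroot, bulk/user data on zdata, latency-sensitive bits on zssd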
zfs create zroot/usr
zfs create zroot/var
zfs create -o mountpoint=/usr/home -o compression=lz4 zdata/home
zfs create -o mountpoint=/usr/src -o compression=lz4 -o exec=off -o setuid=off zdata/src
zfs create -o mountpoint=/usr/ports -o compression=lz4 -o setuid=off zdata/ports
zfs create -o mountpoint=/usr/ports/distfiles -o compression=off -o exec=off -o setuid=off zdata/ports/distfiles
zfs create -o mountpoint=/usr/ports/packages -o compression=off -o exec=off -o setuid=off zdata/ports/packages
zfs create -o mountpoint=/var/crash -o compression=lz4 -o exec=off -o setuid=off zssd/crash
zfs create -o mountpoint=/var/db -o exec=off -o setuid=off zssd/db
zfs create -o mountpoint=/var/db/pkg -o compression=lz4 -o exec=on -o setuid=off zdata/pkgdb
zfs create -o mountpoint=/var/db/ports -o compression=lz4 -o setuid=off zdata/portsdb
zfs create -o mountpoint=/var/db/portsnap -o compression=lz4 zdata/portsnapdb
zfs create -o exec=off -o setuid=off zroot/var/empty
zfs create -o compression=lz4 -o exec=off -o setuid=off zdata/log
zfs create -o mountpoint=/var/mail -o compression=lz4 -o utf8only=on -o exec=off -o setuid=off zdata/mail
zfs create -o mountpoint=/var/run -o compression=lz4 -o exec=off -o setuid=off zssd/varrun
zfs create -o mountpoint=/var/tmp -o compression=lz4 -o exec=on -o setuid=off zssd/vartmp
zfs create -V 8G zdata/swap
zfs set org.freebsd:swap=on zdata/swap
zfs set checksum=off zdata/swap
#
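# unpack the distribution sets into /mnt and do minimal post-install configuration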
cd /usr/freebsd-dist
export DESTDIR=/mnt
for file in base.txz lib32.txz kernel.txz ; do (cat $file | tar --unlink -xpJf - -C ${DESTDIR:-/}) ; done
cp /var/tmp/zpool.cache /mnt/boot/zfs/zpool.cache
cat << EOF > /mnt/etc/rc.conf
zfs_enable="YES"
EOF
echo 'zfs_load="YES"' > /mnt/boot/loader.conf
touch /mnt/etc/fstab
echo "md /tmp mfs rw,-s4096m 2 0" >> /mnt/etc/fstab
This gives me the following from # zpool status:
Code:
  pool: zdata
 state: ONLINE
  scan: none requested
config:

    NAME            STATE     READ WRITE CKSUM
    zdata           ONLINE       0     0     0
      raidz3-0      ONLINE       0     0     0
        gpt/hdd0    ONLINE       0     0     0
        gpt/hdd1    ONLINE       0     0     0
        gpt/hdd2    ONLINE       0     0     0
        gpt/hdd3    ONLINE       0     0     0
        gpt/hdd4    ONLINE       0     0     0
        gpt/hdd5    ONLINE       0     0     0
        gpt/hdd6    ONLINE       0     0     0
        gpt/hdd7    ONLINE       0     0     0
        gpt/hdd8    ONLINE       0     0     0
        gpt/hdd9    ONLINE       0     0     0
        gpt/hdd10   ONLINE       0     0     0
        gpt/hdd11   ONLINE       0     0     0
        gpt/hdd12   ONLINE       0     0     0
        gpt/hdd13   ONLINE       0     0     0
    logs
      mirror-1      ONLINE       0     0     0
        gpt/zil0    ONLINE       0     0     0
        gpt/zil1    ONLINE       0     0     0
    cache
      ada0p1        ONLINE       0     0     0
      ada1p1        ONLINE       0     0     0
    spares
      mfisyspd14p1  AVAIL  
      mfisyspd15p1  AVAIL  

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: none requested
config:

    NAME          STATE     READ WRITE CKSUM
    zroot         ONLINE       0     0     0
      gpt/zroot0  ONLINE       0     0     0

errors: No known data errors

  pool: zssd
 state: ONLINE
  scan: none requested
config:

    NAME           STATE     READ WRITE CKSUM
    zssd           ONLINE       0     0     0
      mirror-0     ONLINE       0     0     0
        gpt/zssd0  ONLINE       0     0     0
        gpt/zssd1  ONLINE       0     0     0

errors: No known data errors
Seems OK and works without noticeable trouble so far, although it's just the beginning of testing =)

A few questions:
  1. 4K AF HDDs: AFAIK, all my HDDs have a 4k sector size, but the MegaCli utility shows the HDDs on the LSI 9271 controller as:
    Code:
    <...>
    Sector Size:  512
    Logical Sector Size:  512
    Physical Sector Size:  4096
    Firmware state: JBOD
    <...>
    Does anyone know whether it is necessary (and possible at all) to configure the LSI 9271 controller to use the 4k HDDs with their native 4k sector size? As far as I understand, at this point the controller just "translates" 4k to 512 for compatibility. Or am I missing something important here?
  2. Script: the installation goes OK, but I'm in doubt about the "export/import pools" part of the script - does the sequence (in my script it's 1-zssd, 2-zdata, 3-zroot) matter? Do I need to use a different zpool.cache for each pool?
  3. Config: the overall layout of this HW/SW config - what do you think of it (ZFS on HDDs in JBOD + SSD partitioning for caches)? Does this config make sense? In your opinion, does it contain "bugs", caveats or bottlenecks? The overall performance goal is to use the Intel SSDs to speed up the array of slow HDDs (CacheCade is not an option in my case). At this point I plan to test and use the server as a bhyve(8) and jail(8) host.
  4. GPT labels: what is wrong with the ARC and SPARE GPT labels of the DATA pool in the zpool status listing? The install script uses them, but in the end they are just ignored in the pool config? The ZIL mirror shows the proper GPT labels... Confusing.
 
I think that you might have used my guide for ZFS on root. It is kind of obsolete now; it was created for FreeBSD 9.

First observation, you don't need to use gnop any more. Instead you can use sysctl vfs.zfs.min_auto_ashift=12 which will align the drives to 4K.
Second observation, zpool.cache is not needed anymore.
Third observation, proper partitioning for 4K is: gpart add -s 222 -a 4k -t freebsd-boot /dev/disk
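For illustration, a minimal sketch of the gnop-free flow (using the zdata pool from the script above; the zdb check is just my usual way of verifying, so treat it as a suggestion):
Code:
# set before any pool is created: minimum ashift 12 (2^12 = 4096 bytes)
sysctl vfs.zfs.min_auto_ashift=12
# persist across reboots
echo 'vfs.zfs.min_auto_ashift=12' >> /etc/sysctl.conf
# after creating the pool, every vdev should report ashift: 12
zdb | grep ashift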

You can use this guide as a reference. Just don't bother with the gnop trick anymore.

I also place ZIL & CACHE on the same SSDs without any problems.
 
I think that you might have used my guide for ZFS on root.
Yes I have, and words can't express my thanks - it helped a lot. Even on 11-RELEASE it's not ideal, but still usable =)
you don't need to use gnop any more. Instead you can use sysctl vfs.zfs.min_auto_ashift=12 which will align the drives to 4K.
How can I check whether the drives/partitions are actually 4k-aligned?
gpart output after a new install:
Code:
# SSDs
=>       40  390721888  ada0  GPT  (186G)
         40   67108864     1  freebsd-zfs  (32G)
   67108904   33554432     2  freebsd-zfs  (16G)
  100663336  285212672     3  freebsd-zfs  (136G)
  385876008    4845920        - free -  (2.3G)

# HDDs
=>        40  7814037088  mfisyspd0  GPT  (3.6T)
          40  7814037080          1  freebsd-zfs  (3.6T)
  7814037120           8             - free -  (4.0K)

# USB stick
=>      40  30343088  da0  GPT  (14G)
        40       472    1  freebsd-boot  (236K)
       512  25165824    2  freebsd-zfs  (12G)
  25166336   5176792       - free -  (2.5G)
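As a quick sanity check (my own arithmetic, not from the guide): gpart reports offsets in 512-byte sectors, so a partition start is 4k-aligned when its start sector is divisible by 8 (8 x 512 = 4096). For example:
Code:
# start sector % 8 == 0  <=>  start * 512 is a multiple of 4096
echo $((40 % 8))         # ada0p1 -> 0, aligned
echo $((67108904 % 8))   # ada0p2 -> 0, aligned
echo $((100663336 % 8))  # ada0p3 -> 0, aligned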
Second observation, zpool.cache is not needed anymore.
Removed this and everything gnop-related, as well as the import/export parts of the script - it seems this solved the labeling problem, but the pools on the HDDs and SSDs are not mounted on boot. Will try to test variants.
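One variant I intend to try first (an assumption on my part, not something from the guide): with zfs_enable="YES" in rc.conf, a pool that is imported once on the installed system should be recorded in the default /boot/zfs/zpool.cache and come up again at boot:
Code:
# one-time import from the installed system itself
# (without the installer's -R /mnt/... altroot)
zpool import zdata
zpool import zssd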
Third observation, proper partitioning for 4K is: gpart add -s 222 -a 4k -t freebsd-boot /dev/disk
"No need to set beginning of part, -a 4k aligns automatically"... Got it. From upper gpart listing, does this looks ok?
You can use this guide as a reference. Just don't bother with the gnop trick anymore.
I'll take the time to read this guide.
I also place ZIL & CACHE on the same SSDs without any problems.
Yes, but based on some forum reading it seems it's still better to use a mirror for the ZIL, and the striped ARC isn't that much wasted space (IMO).
So far your answers have solved the described problems and shortened my install script - thanks a lot!
 
Does anyone know whether it is necessary (and possible at all) to configure the LSI 9271 controller to use the 4k HDDs with their native 4k sector size? As far as I understand, at this point the controller just "translates" 4k to 512 for compatibility. Or am I missing something important here?
Unless your drives are marked "Advanced Format 4Kn" (like in this image), all operations outside the drive platters happen in 512-byte chunks, and the drive does the work of mapping those 512-byte chunks onto the physical 4096-byte sectors. Some drives tell the operating system that they're doing this, some do not.

To use a 4Kn drive, you need to have a controller that is capable of handling them (not all are), and you usually need a UEFI BIOS if you want to boot from a 4Kn drive.

512e drives "lie" to the controller / operating system so everything Just Works, but performance can be very poor if accesses are not aligned. With 4Kn, it doesn't matter as everything happens in 4096-byte chunks - there is nothing to "align".

It is normally not possible to switch a drive between 512e and 4Kn modes (despite what the label in the above image implies).
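If you want to see what a drive reports, diskinfo(8) shows both sizes; on a 512e drive the logical sector size is 512 and the "stripesize" is 4096 (device names here are just examples, and the output shape is indicative):
Code:
diskinfo -v /dev/mfisyspd0 | egrep 'sectorsize|stripesize'
#       512             # sectorsize
#       4096            # stripesize
# for SATA disks, camcontrol prints it in one line:
camcontrol identify ada0 | grep 'sector size'
#   sector size           logical 512, physical 4096, offset 0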
 
Unless your drives are marked "Advanced Format 4Kn"
Ah, so if it's marked "Advanced Format" but not "Advanced Format 4Kn", it is designed to do the 4k > 512 mapping "trick" regardless of the RAID/HBA controller's abilities and configuration.
To use a 4Kn drive, you need to have a controller that is capable of handling them (not all are)
Well, the 9271-8i (AFAIK) has support. I had already looked into this, and it's not an issue in my case.
and you usually need a UEFI BIOS if you want to boot from a 4Kn drive.
No, it's a USB3.0 boot drive, but good to know.
It is normally not possible to switch a drive between 512e and 4Kn modes (despite what the label in the above image implies).
As I thought.
Thank you!
 