Solved ZFS and concats

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

To reorganise my zfs server, I need to vacate the tank, temporarily. It's currently configured with 5 x 3 TB disks in RAID-Z1 format. These disks are nearly 10 years old. Two of the originals have failed already, and I want to move to RAID-Z2 as soon as possible. I already have two spare disks spinning in the case, and two in anti-static bags. Most of my new and spare drives are 4 TB, but I (mostly) only expect to use 3 TB (original size), at least for the time being.

I need to safely replicate the tank for long enough to reorganise it from a 5 x spindle RAID-Z1 to a 7 x spindle RAID-Z2.
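As a rough sanity check on the two layouts, usable capacity of a single RAID-Z vdev is approximately (disks - parity) x disk size. The sketch below is illustrative only: the function name is made up, and the arithmetic ignores ZFS metadata, padding and reserved space, so real figures will be somewhat lower.

```shell
#!/bin/sh
# Rough usable capacity of a single RAID-Z vdev: (disks - parity) * disk size.
# Ignores ZFS metadata, padding and reserved space, so real figures are lower.
raidz_usable_tb() {
    disks=$1
    parity=$2
    size_tb=$3
    echo $(( (disks - parity) * size_tb ))
}

echo "5 x 3 TB RAID-Z1: $(raidz_usable_tb 5 1 3) TB"
echo "7 x 3 TB RAID-Z2: $(raidz_usable_tb 7 2 3) TB"
```

On these assumptions the move from 5 disks in RAID-Z1 to 7 disks in RAID-Z2 goes from about 12 TB to about 15 TB usable, while tolerating two disk failures instead of one.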

The root is on separate SSD mirror and won't be impacted.

I'm looking at using external disks connected to USB ports to hold the tank while the internal disks get re-organised. Efficiency and speed are not high priorities. Nor is redundancy (I'll scrub it before proceeding). But it needs to be a sensibly managed risk.

I want at least 10 TB external USB disk capacity. I already have two 4 TB spare SATA disks, with USB adaptors. So I need to buy more disk(s). For ongoing operational reasons (offsite backups) I favour purchasing two 2 TB external USB disks, but one 4 TB SATA disk with a SATA/USB converter would also work.

So the external USB disk options for storing the tank are:
  • 3 x 4 TB; or
  • 2 x 4 TB and 2 x 2 TB (best in the long term).
For odd sized disks, it seems that gconcat(8) is a good option to create a single disk device, /dev/concat/cc0, composed of a 12 TB concat from 2 x 4 TB disks and 2 x 2 TB disks. I expect that a 12 TB UFS file system would safely sit on top of that.

I wondered if I could configure ZFS to create a simple, non-redundant, pool using /dev/concat/cc0.

I'm looking to use ZFS and not UFS because zfs-send(8) has great appeal to vacate, verify, and restore the tank.

I'd appreciate your thoughts. And I'd especially like to know if anyone has ever used a concat under ZFS, or has any other solution to the problem of odd disk sizes with ZFS.
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

There are 16 internal SATA ports (8 on the motherboard, and 8 on an LSI SAS2008 card). The enclosure has 7 x 3.5" spinning disks, and 2 x SSDs. It's not realistic to add any more SATA disks internally.

To get more disks connected, I have to use the external USB ports.

Externally, there are 4 x USB 3 Gen 2 (red) ports (3 x Type A, and 1 x Type C). These are 10 Gbit/sec.

I have two StarTech USB 3.1 Gen 2 Type A to SATA adapters. These are 6 Gbit/sec. I use these for creating off-site backups on naked 3.5" SATA disks connected to the red USB ports, and they work well. I will buy another, if I buy a third spare 4 TB disk (the 3 x 4 TB option to copy the tank).

Externally, there are also 4 x USB 3.2 Gen 1 (blue) Type A ports, 5 Gbit/sec, which I don't intend to use.
 

VladiBG

Daemon

Reaction score: 585
Messages: 1,261

What is your current backup strategy? Where are you backing up the information?
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

The data I really care about are kept in separate datasets, and are backed up off site with zfs send to USB disks. However, the backup strategy is not directly relevant to the problem of temporarily vacating the tank.
 

covacat

Aspiring Daemon

Reaction score: 329
Messages: 674

afaik you can create the pool without the gconcat device just specify the disks
depending on the sizes of your datasets (no one larger than 8tb) you can create 2 intermediary pools, one of 2 disks and one of 1 disk
and you won't need a 3rd external usb drive enclosure
 

VladiBG

Daemon

Reaction score: 585
Messages: 1,261

I would suggest to add 2x12TB WD Gold disks in Raid1 (mirror) and use them as local backup.
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

covacat said: "afaik you can create the pool without the gconcat device just specify the disks"
I'm certainly not expert on the subject, but my understanding of ZFS vdevs is that they may be single-device, RAID-Z1, RAID-Z2, RAID-Z3, or mirror.

I do not believe that there is an option to use whole disks of different sizes in the same ZFS pool (well, you can, but you must either waste a lot of space or use partitions in crazy ways).

The idea of the concat is that it acts as a single-device, without any redundancy. I know that this is somewhat risky, but I have considered that (risk duration is small, disks don't fail often and I will scrub before proceeding). The other options won't provide the capacity I need.
covacat said: "depending on the sizes of your datasets (no one larger than 8tb) you can create 2 intermediary pools, one of 2 disks and one of 1 disk
and you won't need a 3rd external usb drive enclosure"
That's good thinking, except there are no internal disks not committed to the new RAID-Z2 pool. Since I have more than 8 TiB in the tank I need more than the existing 2 x 4 TB external disks to copy it (i.e. minimum three USB disks). The distinction here between TiB and TB is relevant. No matter how I sub-divide the pool, I need a third USB disk. My options are to buy two 2 TB "USB backup disks", or a conventional naked 4 TB SATA disk with a USB converter.
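To make the TB versus TiB point concrete, here is a small sketch that converts the decimal terabytes printed on a disk label (10^12 bytes) into the binary tebibytes (2^40 bytes) that ZFS tools report. The helper name tb_to_tib is made up for illustration.

```shell
#!/bin/sh
# Convert decimal terabytes (10^12 bytes) to binary tebibytes (2^40 bytes).
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1e12 / 1099511627776 }'
}

tb_to_tib 8     # two 4 TB disks
tb_to_tib 12    # three 4 TB disks, or two 4 TB plus two 2 TB
```

Two 4 TB disks come to roughly 7.28 TiB before any file system overhead, so they cannot hold more than 8 TiB of data; 12 TB of raw disk comes to roughly 10.91 TiB.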
 

covacat

Aspiring Daemon

Reaction score: 329
Messages: 674

yes but if you use zfs send / receive for backup you do it per dataset
so if you have
tank/A 6TB
tank/B 2TB
tank/C 3TB
you create a pool with 2 usb disks and create btank/A and btank/B on it
remove the disks, create ctank/C on the third backup disk
then install the 7 disks in the sata enclosures and restore in 2 steps
the idea is you don't have to backup all datasets in the same pool for transferring them
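The two-step plan above could be sketched roughly as follows. This is a dry run that only prints the commands. The pool and dataset names follow the example in this post; da0, da1 and da2 are hypothetical device names for the USB disks, and "vacate" is a made-up snapshot name.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# da0, da1 and da2 are hypothetical device names for the USB disks,
# and "vacate" is a made-up snapshot name.
SNAP=vacate

plan() {
    # One recursive snapshot so every dataset is copied from the same point in time.
    echo "zfs snapshot -r tank@${SNAP}"

    # First intermediary pool: two disks, no redundancy, holds A and B.
    echo "zpool create btank da0 da1"
    for ds in A B; do
        echo "zfs send -R tank/${ds}@${SNAP} | zfs receive -u btank/${ds}"
    done
    echo "zpool export btank"

    # Second intermediary pool: the single remaining disk, holds C.
    echo "zpool create ctank da2"
    echo "zfs send -R tank/C@${SNAP} | zfs receive -u ctank/C"
    echo "zpool export ctank"
}

plan
```

Restoring would be the same pipelines in the opposite direction, one intermediary pool at a time, once the new pool exists.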
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

VladiBG said: "I would suggest to add 2x12TB WD Gold disks in Raid1 (mirror) and use them as local backup."
Thank you. That's outside my current ambit of considerations -- but thoughtful.

The ZFS server IS the backup server. Its datasets are already partitioned into three categories, namely:
  1. stuff I don't care about, and don't back up (e.g. MythTV recordings);
  2. stuff I do care about, which exists on multiple hosts, and which can be recovered from the Internet (e.g. Unix source code for many variants -- Research, USG, and Berkeley); and
  3. stuff I would never want to lose (e.g. email archive, photos, software I wrote myself, the Calibre library, legal and financial documents).
The stuff in 3 above gets replicated off site, and there's never less than one copy off-site (also kept on my notebook, and at least one other machine). It's well under 2 TB.

But there's certainly a case to back up the whole backup server on a semi-regular basis. There would be a lot of "junk" backed up, but no scope to lose anything...
 

covacat

Aspiring Daemon

Reaction score: 329
Messages: 674

also im pretty sure zfs supports concats, you can quickly test it with 2 md devices
create 2 files 1gb and 2gb mdconfig them and create a pool out of md0 and md1 or whatever names they get
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

covacat said: "yes but if you use zfs send / receive for backup you do it per dataset"
Yes, the dataset sizes would allow that, and would save me buying a third USB/SATA converter. Thank you.
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

covacat said: "also im pretty sure zfs supports concats, you can quickly test it with 2 md devices"
I suppose it serves me right for reading the documentation that suggested concats weren't supported under ZFS, rather than actually testing it, because they clearly do work:
Code:
[strand.618] $ sudo mdconfig -s 400m 
md0

[strand.619] $ sudo mdconfig -s 800m
md1

[strand.620] $ sudo zpool create example /dev/md0 /dev/md1

[strand.621] $ sudo zfs set compression=lz4 example

[strand.622] $ df -h /example
Filesystem    Size    Used   Avail Capacity  Mounted on
example       1.0G     96K    1.0G     0%    /example

[strand.623] $ zpool status example
  pool: example
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    example     ONLINE       0     0     0
      md0       ONLINE       0     0     0
      md1       ONLINE       0     0     0

errors: No known data errors

[strand.624] $  zpool list -v example
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
example    1.12G   396K  1.12G        -         -     0%     0%  1.00x    ONLINE  -
  md0       384M   196K   384M        -         -     0%  0.04%      -  ONLINE  
  md1       768M   200K   768M        -         -     0%  0.02%      -  ONLINE 

[strand.625] $ cd /usr

[strand.626] $ du -h -s ports
805M    ports

[strand.627] $ sudo find ports -depth -print | sudo cpio -pdmu /example 
806652 blocks

[strand.628] $ zpool list -v example
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
example    1.12G   773M   379M        -         -    14%    67%  1.00x    ONLINE  -
  md0       384M   322M  61.7M        -         -    12%  83.9%      -  ONLINE  
  md1       768M   451M   317M        -         -    16%  58.7%      -  ONLINE

Thank you to all. Problem solved. I have multiple plausible options.

I also have a support request in with StarTech to see if my USB312SAT3 adapters will work with 12 TB disks (the docs say 6 TB max tested).

I'm still curious to know if a GEOM concat could be presented to ZFS as a "disk".
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

After considering the options, I decided to go with two 12 TB WD Gold (WD121KRYZ) drives. This cost a fair bit more than I was planning to spend. I did it because:
  • I got a really fast response from StarTech support (they have not tested the USB312SAT3 adapters beyond 6 TB, but "have noted customers making use of this adapter with other 12TB hard drives with no issues reported");
  • the WD121KRYZ is an enterprise class drive on run-out, and I got two at the bargain basement price of $US360 each;
  • I can save the tank with full redundancy of a 12 TB mirror on USB connected disks;
  • after the tank reorganisation is complete, I will be able to rotate the 12 TB disks off-site providing 100% backups of the ZFS server without ever having to worry if I have missed something important;
  • the existing (old but barely used) 1 TB SATA disks I use for off-site backups can now be relegated to archive duty to secure off-site the stuff I "never want to lose"; and
  • I will be well positioned if I ever need to vacate the tank in the future.
Thank you VladiBG for making me think about it some more...
 

VladiBG

Daemon

Reaction score: 585
Messages: 1,261

If possible connect them directly to the SAS/SATA port. Don't use USB adapter.
If you have free SAS ports check the WD Ultrastar DC series. Some of them can be found at the same price as WD gold.
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

The trouble with using SATA (or SAS) is that I have to open the case to get the power and data cables out.
Modern power cables are wildly insecure in their attachment, and SATA cables aren't much better.
I risk catastrophic failure of the RAID-Z1 set if more than one of the internal cables is disturbed and detaches accidentally.
Then there's the compromised cooling caused by the case opening -- with 7 spindles, the (4-pin Noctua) case fans blow hard most of the time (it's summer in Australia).
I have used the USB/SATA converters many times, and they seem reliable. Speed is not of the essence.
 
OP
gpw928

gpw928

Aspiring Daemon

Reaction score: 273
Messages: 598

Here is a proof of concept for using a three x 150 MB spindle concat as a component "disk" in a three x 450 MB "disk" RAID-Z1 set:
Code:
[f13.129] # mdconfig -s 150m        # md0
md0
[f13.130] # mdconfig -s 150m        # md1
md1
[f13.131] # mdconfig -s 150m        # md2
md2
[f13.132] # mdconfig -s 450m        # md3
md3
[f13.133] # mdconfig -s 450m        # md4
md4
[f13.134] # gconcat label -v cc0 /dev/md0 /dev/md1 /dev/md2
Metadata value stored on /dev/md0.
Metadata value stored on /dev/md1.
Metadata value stored on /dev/md2.
Done.
[f13.135] # zpool create example raidz /dev/concat/cc0 /dev/md3 /dev/md4
[f13.136] # zfs set compression=lz4 example
[f13.137] # zpool status example
  pool: example
 state: ONLINE
config:

    NAME            STATE     READ WRITE CKSUM
    example         ONLINE       0     0     0
      raidz1-0      ONLINE       0     0     0
        concat/cc0  ONLINE       0     0     0
        md3         ONLINE       0     0     0
        md4         ONLINE       0     0     0

errors: No known data errors
[f13.138] # zpool list -v example
NAME             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
example         1.25G   888K  1.25G        -         -     0%     0%  1.00x    ONLINE  -
  raidz1        1.25G   888K  1.25G        -         -     0%  0.06%      -  ONLINE  
    concat/cc0      -      -      -        -         -      -      -      -  ONLINE  
    md3             -      -      -        -         -      -      -      -  ONLINE  
    md4             -      -      -        -         -      -      -      -  ONLINE  
[f13.139] # cd /usr/local
[f13.140] # du -h -s llvm12
311M    llvm12
[f13.141] # find llvm12 -depth -print | cpio -pdmu /example
1472202 blocks
[f13.142] # zpool list -v example
NAME             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
example         1.25G   488M   792M        -         -     1%    38%  1.00x    ONLINE  -
  raidz1        1.25G   488M   792M        -         -     1%  38.1%      -  ONLINE  
    concat/cc0      -      -      -        -         -      -      -      -  ONLINE  
    md3             -      -      -        -         -      -      -      -  ONLINE  
    md4             -      -      -        -         -      -      -      -  ONLINE  
[f13.143] # zpool status -x example
pool 'example' is healthy
[f13.144] # zpool scrub example
[f13.145] # echo $?
0
This would have solved my desire to get a 3 x 4 TB RAID set from four disks ((2 TB + 2 TB) + 4 TB + 4 TB).
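The transcript above leaves the test pool, the concat and the memory disks in place. A minimal teardown sketch follows, assuming the md0 to md4 unit numbers and the cc0 label shown in the transcript. The run wrapper is a made-up helper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Teardown sketch for the test above. By default it only prints the commands;
# set DRY_RUN=0 to execute them (as root, on the machine that ran the test).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

teardown() {
    run zpool destroy example          # remove the test pool first
    run gconcat stop cc0               # detach the concat device
    run gconcat clear md0 md1 md2      # wipe gconcat metadata from its members
    for unit in 0 1 2 3 4; do
        run mdconfig -d -u "${unit}"   # release each memory disk
    done
}

teardown
```

The order matters: the pool has to go before the concat it sits on, and the concat before the memory disks underneath it.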

It would be good to hear the opinions of any opponents to the plan...
 