ZFS VMware + bareos == ZFS problems after restore

My question needs some prelude, because any of the steps leading to the final situation could be important.

We're using bareos with the VMware plugin to back up our VMs from ESXi hosts. Not long ago we completely lost a data center, and I ended up with a rather odd backup on my hands. Since the vCenter it was fetched from is gone, the only remaining way to retrieve the data was to pull the VMDK image out of the backup. We managed that, only to realize that:
  • The resulting file is neither a flat VMDK nor a monolithic one: it includes some kind of descriptor data, but in binary form instead of text, and the file doesn't start with the KDMV signature.
  • The file is smaller than it should be. The disk geometry of 26108 cylinders, 255 heads, 63 sectors suggests at least 419425020 sectors, i.e. 214745610240 bytes; but bareos came up with 213678818262 bytes, which is 417341440 sectors once the header is excluded (the arithmetic is sketched below).
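To make the size mismatch concrete, here is the arithmetic as a quick Python sketch (nothing but the numbers quoted above):

Code:
SECTOR = 512                                  # bytes per logical sector implied by the geometry
cyl, heads, sec = 26108, 255, 63              # CHS geometry of the original disk

expected_sectors = cyl * heads * sec          # -> 419425020
expected_bytes   = expected_sectors * SECTOR  # -> 214745610240
restored_sectors = 417341440                  # flat data bareos actually produced, header excluded

print("expected:", expected_sectors, "sectors,", expected_bytes, "bytes")
print("short by:", expected_sectors - restored_sectors, "sectors")   # -> 2083580
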
For the restored file I managed to locate and extract the actual flat VMDK data. With some help from VMware folks I ended up padding the extracted data with zeroes to the expected size and recovering a descriptor, which allowed me to attach the disk to a temporary FreeBSD VM. This is where the ZFS problem pops up. The temporary FreeBSD system can now see all three GPT partitions from the original server, but the ZFS pool is broken. zdb -l produces this kind of report:

Code:
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 20194798
    pool_guid: 5321062894320895639
    hostid: 2270722292
    hostname: ''
    top_guid: 2824987660710126592
    guid: 2824987660710126592
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 2824987660710126592
        path: '/dev/da0p3'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 30
        ashift: 12
        asize: 201856647168
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'zroot'
    state: 0
    txg: 20194798
    pool_guid: 5321062894320895639
    hostid: 2270722292
    hostname: ''
    top_guid: 2824987660710126592
    guid: 2824987660710126592
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 2824987660710126592
        path: '/dev/da0p3'
        whole_disk: 1
        metaslab_array: 37
        metaslab_shift: 30
        ashift: 12
        asize: 201856647168
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
------------------------------------
LABEL 2
------------------------------------
failed to unpack label 2
------------------------------------
LABEL 3
------------------------------------
failed to unpack label 3

By inspecting the flat VMDK I managed to find all four ZFS labels, and all four look intact. But apparently the two located at the end of the disk cannot be found by ZFS.
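In case it matters, this is roughly how I located them - a simplified sketch of my script, not the exact thing. It scans for the uberblock magic (0x00bab10c, assuming a little-endian pool) and groups the hits into label-sized clusters; the image file name is just a placeholder:

Code:
import struct

UB_MAGIC = struct.pack("<Q", 0x00bab10c)      # uberblock magic, little-endian pool assumed
LABEL = 256 * 1024                            # one ZFS vdev label is 256 KiB

def scan(path, chunk=16 * 1024 * 1024):
    """Return file offsets of every uberblock magic found in the image."""
    hits, tail, offset = [], b"", 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            buf = tail + block
            i = buf.find(UB_MAGIC)
            while i != -1:
                hits.append(offset + i)
                i = buf.find(UB_MAGIC, i + 1)
            tail = buf[-(len(UB_MAGIC) - 1):]  # keep overlap so a magic split across reads isn't lost
            offset += len(buf) - len(tail)
    return hits

# group hits: the uberblock array fills the second half of a label, so hits more
# than 128 KiB apart should belong to different labels
clusters = []
for off in scan("zroot-flat.vmdk"):            # placeholder file name
    if not clusters or off - clusters[-1][-1] > 128 * 1024:
        clusters.append([off])
    else:
        clusters[-1].append(off)

for c in clusters:
    # the uberblock array starts 128 KiB into the label, so the label itself
    # should begin roughly 128 KiB before the first hit of the cluster
    print(f"{len(c)} uberblocks near {c[0]}, label start ~{c[0] - 128 * 1024}, "
          f"offset % 512 = {c[0] % 512}")
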

At this point I'm stuck. What is almost clear to me is that bareos somehow skipped (419425020 - 417341440) = 2083580 sectors. Padding the flat data with zeroes did help the system handle the GPT tables, but it certainly didn't help it find the last two ZFS labels.

I'm looking for any ideas, even the craziest ones, on how I can help the system find and mount the ZFS partition. I have good grounds to expect most, if not all, of the data to be there, just waiting to be accessed.

One thing I'm thinking about is relocating the last two labels to where ZFS expects to find them. If I'm lucky, the sectors skipped by bareos lie somewhere between the actual files on the filesystem and the labels. Unfortunately, I'm not sure how to calculate the correct offsets (my best guess is sketched below). Aside from that, maybe something could be done to bareos itself to make it restore the full disk? Or something else I just can't think of at the moment.
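My best guess at how those offsets would be calculated, going by what I've read about the on-disk layout (corrections welcome), is roughly this; the partition size is just a placeholder to be replaced with the real figure from gpart list:

Code:
LABEL = 256 * 1024                 # one vdev label

part_size = 202_000_000_000        # placeholder: real byte size of da0p3 from `gpart list`

# as I understand it, ZFS rounds the device size down to a label boundary and
# then expects labels 2 and 3 in the last 512 KiB of that rounded size
rounded = (part_size // LABEL) * LABEL
print("expected label 2 at partition byte", rounded - 2 * LABEL)
print("expected label 3 at partition byte", rounded - 1 * LABEL)
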
 
That is quite interesting. Sadly, I have never worked with VMware (and didn't even know bareos has a plugin for it).
As mentioned elsewhere, I had to identify and work around or fix about 28 bugs before I got the bareos thing into a shape I can use.

One thing that comes to mind reading your story is my bug #3822: failure to properly handle the "sparse" option. It says that sparse=yes produces wrong results, at least in verification runs, and therefore must not be used (effective since about bacula 2.2.7).

Not sure whether that applies to your situation, and anyway it won't help now after the fact. :(
What I would do in your situation - but I fear you don't want to hear that either ;) - is to recreate the failed installation, or some lookalike, rerun the backup, and then look at what kind of crap actually happens there. That probably helps in understanding how the restored image was mangled, and in figuring out how to correctly de-mangle it. It might also help in understanding why it happens (and then probably identify and fix the next bug). But, I admit, that is work.

Besides that, a ZFS pool image is nothing other than a kind of file (or a couple of them). But I would assume it does not work like an msdos filesystem that is filled from beginning to end; rather, I would assume some allocation scheme that evenly fills allocation groups. So it would be highly desirable to get the whole image intact; otherwise one would need to climb into the internals of the ZFS allocator - and I would assume the bareos source code is the easier part.
 
Search your file for the first and last occurrences of "EFI PART", and note the byte offsets into the file.

Then post the two byte offsets and two dumps of the 512 bytes starting at each "EFI PART".

That would give us the GPT headers. These are self-referencing, meaning each one contains its own address (and also the address of the other copy) inside itself. We can then try to relate these self-references to offsets in the file and figure out the missing padding.

Also, while you are at it: the MBR sits just before the first GPT header and ends with 55AA hex. In a hex editor, go to the first GPT position, step 4096 bytes back (towards lower offsets), and post a dump of that position as well. This is needed to determine whether the "disk" was using 512 or 4096 bytes per sector.
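If digging through a 200 GB file in a hex editor is too awkward, the same can be scripted. Here is a rough sketch in Python (the image path is a placeholder): it finds the signatures, pulls the self-referencing LBAs out of each header, and dumps the sector sitting 4096 bytes before the first one:

Code:
import mmap, struct

IMG = "zroot-flat.vmdk"            # placeholder: path to the padded flat image
SIG = b"EFI PART"

with open(IMG, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as img:
    hits = []
    off = img.find(SIG)
    while off != -1:
        # MyLBA and AlternateLBA are little-endian uint64s at bytes 24 and 32 of the GPT header
        my_lba, alt_lba = struct.unpack_from("<QQ", img, off + 24)
        print(f"'EFI PART' at byte {off}: MyLBA={my_lba} AlternateLBA={alt_lba}")
        if my_lba:
            # if the file really starts at LBA 0 of the disk, this ratio is the sector size (512 or 4096)
            print(f"  byte offset / MyLBA = {off / my_lba:.1f}")
        hits.append(off)
        off = img.find(SIG, off + 1)

    if hits and hits[0] >= 4096:
        # the MBR lives one sector before the primary header; check for the 55 AA tail
        mbr = img[hits[0] - 4096 : hits[0] - 4096 + 512]
        print("sector at first header - 4096 ends with:", mbr[-2:].hex())
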

One more question: was it a single-disk pool, or was RAIDZ involved?
 
I'd like to say "thank you" for both your replies! Unfortunately, PMc is closer to the answer: most certainly I hit a bareos bug. Even though we don't use sparse explicitly (though I can't rule out implicit use of the option), something has definitely gone wrong. The reason I think so is that a script I wrote to find the second pair of ZFS labels came up with byte offsets that are 64 bytes past a sector boundary, which is absolutely impossible for any valid disk image.
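(The alignment check itself is trivial - every structure on a raw disk has to start on a sector boundary - so an offset like the second one below, 64 bytes off, immediately gives the game away; the numbers here are made up for illustration:)

Code:
SECTOR = 512
for off in (107374182400, 107374182464):   # made-up offsets, the second one 64 bytes off
    print(off, "OK" if off % SECTOR == 0 else f"misaligned by {off % SECTOR} bytes")
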

So far, a reinstall from scratch, with some help from a half-year-old file-level backup, is the only option left.
 