dump to Ultrium LTO5 tape

We have a brand new Dell server with an Ultrium 5 tape drive and we're having trouble using the standard dump command which has worked fine for us since the beginning of time.

Typically /sbin/dump 0Lauf /dev/nsa0 /dev/<thefilesystem> has worked fine.

With the new system, it causes the tape drive to make horrible stopping and starting sounds that ultimately make you wish the server had a face so you could punch it.

I have seen recommendations for -C or -b options, but those seem to be for optimizing performance. I couldn't find any reasonable method of determining whether you have chosen good numbers or not, so I have not been messing with those.

The problem with testing is that each time we try something, the drive will do alternately short high- and low-pitched whines, then sometimes a very long one. It sounds a bit like the tape is winding forwards and backwards, looking for something it can't find. Eventually it hits the end of the tape, rewinds to the beginning, and tries again. I have no evidence of this, it's just what it sounds like.

After a long time of making this racket, and without the usual indicator explaining how much progress is being made, we attempt to kill the processes, and it never works. I've done enough dumping in my day to know that it should be in the process of dumping data normally to tape by this time, so I'm not just being impatient. I've left it for an hour to dump a few hundred M and it never finishes.

Rebooting is then necessary.

Code:
# /sbin/dump 0Lauf /dev/nsa0 /dev/mfid0p2
  DUMP: Date of this level 0 dump: Thu Mar  5 09:40:42 2015
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/mfid0p2 (/) to /dev/nsa0
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 964590 tape blocks.

At this point it hangs, makes a bunch of noise, and never progresses.

These processes are visible:
Code:
root  838  0.0  0.0 12440 2136  0  I+  9:40AM  0:00.00 /sbin/dump 0Lauf /dev/nsa0 /dev/mfid0p2 (dump)
root  840  0.0  0.0 12440 2132  0  D+  9:40AM  0:00.00 /sbin/dump 0Lauf /dev/nsa0 /dev/mfid0p2 (dump)

The first process can be killed with a kill -9, but the second is invincible.

Any suggestions for things to experiment with? Previously, we've gotten new servers that required an extra parameter (or a lack of one we used to use) to make things magically work. Testing is taking more work than usual, since every time we try a new thing, we have to wait a while, then reboot.

We only have to do this about every 5 years, when we get a new server, so we are not dump experts, except to know that, when it's working, it's very safe and effective.

Any help and/or wild speculation would be appreciated. Thanks!
 
The problem with testing is that each time we try something, the drive will do alternately short high- and low-pitched whines, then sometimes a very long one. It sounds a bit like the tape is winding forwards and backwards, looking for something it can't find. Eventually it hits the end of the tape, rewinds to the beginning, and tries again. I have no evidence of this, it's just what it sounds like.
That sounds like "shoeshining": the drive can't keep streaming, so it repeatedly stops, backs up, and restarts.

After a long time of making this racket, and without the usual indicator explaining how much progress is being made, we attempt to kill the processes, and it never works. I've done enough dumping in my day to know that it should be in the process of dumping data normally to tape by this time, so I'm not just being impatient. I've left it for an hour to dump a few hundred M and it never finishes.
That doesn't sound normal. An un-killable process indicates an I/O that never completes.

Code:
# /sbin/dump 0Lauf /dev/nsa0 /dev/mfid0p2
Try adding -C 32 -b 32 to your dump(8) command line. Be careful with the placement, as the f argument expects to be the last one in the group, with the filename immediately following.
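For example, something along these lines (splitting the options out; the f flag has to come last in its group, with the tape device right after it, and 32/32 are just the values suggested above):
Code:
# /sbin/dump -0uaL -C 32 -b 32 -f /dev/nsa0 /dev/mfid0p2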

Another thing to do would be to make sure your tape firmware is up-to-date. Normally you can use the stand-alone Server Update Utility (SUU) to do this, but to get recent Dell SUU images to fit on a DVD, they need to be de-bloated. I use an obsolete CDU image to boot the system (the SUU is not bootable on its own).

Oh, and I should say that the system is running 10.1-RELEASE-p6 and is using ufs, not zfs.
Yup, # dump doesn't work on zfs.
 
-b 64 will increase speed, but don't go any higher. I'd cite where I found that (one of the mailing lists), but don't have the reference.
 
I've done a bit more research and still have no explanation for what's going on. I have one tape that seems to work better than the others. This one I managed to write some data to when experimenting with mt but I can't remember what parameter I sent. rewind works, but pretty much any command other than status starts the noisy back-and-forth, so the issue is not dump-specific.
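(For the record, the kind of mt(1) invocations I mean are along these lines; status and rewind are about the only ones that don't set off the racket:)
Code:
# mt -f /dev/nsa0 status
# mt -f /dev/nsa0 rewind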

For that reason, I'm going to lay off messing with the parameters for now.

I ordered new tapes from other manufacturers, to see if that might be the problem, but that seems to make no difference.

The one tape where I did get some magic to happen seems to have been "blessed", in that now it will function normally most of the time. In fact, if I only had that tape, I could have done a full battery of tests and never noticed there was a significant problem.

I will next try the firmware upgrades you recommend. If this is the problem, it would be fantastic, because presumably all the other errors would be cleaned up, all previous problems would make sense, and the tape drive would at least no longer be a force of instability on the server.

Besides the super-annoying nails-on-the-chalkboard noisiness, and the irony of the tape drive being the least stable part of the system, the next most irritating issue is the utter inconsistency of the errors. I don't know what voodoo has been applied to the one working tape, but I wouldn't want to count on being able to duplicate that trick to make a production server safe.

Thanks for your insights. I'll report back when the firmware has been updated...hoping/assuming it needs updating!
 
Thanks for your insights. I'll report back when the firmware has been updated...hoping/assuming it needs updating!
If the tape drive does this with just about any command, I think it is broken. Firmware is unlikely to fix it, as LTO drives generally work decently with whatever firmware they were built with (unlike, for example, DLT).

Are all of your cartridges LTO5? LTO drives can generally write on the immediate prior generation of tape and read from two prior generations.

Did your system come with a cleaning tape? You might want to try it. The drive will normally signal when it needs to be cleaned, but you never know.

You could just have a bad tape drive. If this is a new system, you could ask Dell to swap it and see if the replacement works better. Note that a failing tape drive can sometimes damage tapes inserted in it, so if you get a new drive, test it with a new tape as well.
 
I suppose I'll have to do the firmware thing to satisfy the Dell rep that there is actually something wrong with the drive before they'll ship out a new one. I ran the BIOS diagnostic and it didn't detect any problems, but since Dell has no idea what FreeBSD is, they'll probably get frustrated by attempts to do further tests.

If I could get a clean dump, maybe to a USB drive, I could just install Windows or whatever they want on the thing to do the testing their way.
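(Something like this is what I have in mind, with the USB drive mounted at a made-up path like /mnt/usb:)
Code:
# /sbin/dump -0Lauf /mnt/usb/root.dump /dev/mfid0p2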

All the tapes are LTO5. I got the drive to successfully read an LTO3, but it couldn't write to it, which, as you say, is expected.

I tried a couple different cleaning tapes and it didn't help.

All the tapes are new, so at least there won't be any data loss, but we'd like to put this server into production soon.

Thanks again for your help.
 
There were experimental patches to sa(4) posted on one of the mailing lists recently (maybe -current). Have you seen those?
 
Thanks for the suggestion. I contacted the author of the changes and he says they most likely will not have any effect on my particular issue, but that I need to at least upgrade the firmware before condemning the drive.

He also suggested a couple of other diagnostics to try. If I uncover anything interesting, I'll report back here so there will be a Googlable solution out there for others to find.
 
So, we haven't really made any progress.

New firmware was installed on the tape drive last night. It didn't change the behavior.

The drive was replaced by a new one today, and it also produces the same behavior.

We did try a new tape and it worked fine.

So...previous tapes which did not work properly continue to not work properly. The old tape that mostly worked still does, but with errors, and the new one seems to work cleanly. I have no idea what this means.

We'll continue experimenting with new tapes, but it all seems to be black magic at this point.

I'm still discussing the issue with the sa coder, so perhaps something will come of this. As of now, the plan seems to be to buy twice as many tapes as are necessary and just hope that half of them work. I can't help but think there's a better solution.
 
The drive was replaced by a new one today, and it also produces the same behavior.

We did try a new tape and it worked fine.
That would seem to be consistent with my post:
Note that a failing tape drive can sometimes damage tapes inserted in it, so if you get a new drive, test it with a new tape as well.
So, I'd suggest getting a few more new tapes and seeing if they all work fine in the new drive.
 
Have you verified that the drive works at all in any OS apart from the diagnostics? Might be worth booting Ubuntu or whatever you prefer and trying to make it do something, just to verify that it's not a FreeBSD-related issue?
//Danne
 
Have you verified that the drive works at all in any OS apart from the diagnostics?
That's exactly what we decided to try next. If the bad tapes fail with another OS, then we're willing to accept that we simply have terrible luck with new tapes (50% failure rate, across 4 different brands), and all this has been a wild goose chase.

Or Terry could be right in that maybe the old drive was bad and killed some fresh tapes dead, and most future new tapes will work fine.

The disappointing thing is how thoroughly a bad tape can lock up a drive, requiring a system reboot to normalize things, particularly if bad tapes are this common. These things are fairly expensive. And if they can be so bad out of the box, how long before important tapes with data on them go bad? We have LTO3 tapes from 7 years ago and LTO1 tapes from 10 years ago that still work fine.

Anyway, I will report back after trying to back up with a different OS. And when we get new tapes. Again.

Thanks, guys.
 
The disappointing thing is how thoroughly a bad tape can lock up a drive, requiring a system reboot to normalize things, particularly if bad tapes are this common.
The lockups (actually, a non-completing I/O in device wait state) are due to the drive operating under the assumption that recovering the data is vital, so it retries essentially "forever". On disk drives, this is the difference between the "RAID edition" drives with TLER (Time Limited Error Recovery) vs. normal drives. A normal drive will retry for a lot longer than a RAID drive, since the RAID drive assumes that the controller can recover the data from another member of the RAID set. The RAID controller will also mark a drive offline if the drive spends too much time doing data recovery.

If your tape drive is in an external enclosure, power cycling it will probably clear the hung I/O since the drive will generate an asynchronous attention response when it powers back up. That should satisfy the FreeBSD driver. Note that this is a last resort, as power cycling a drive with a tape loaded can damage the tape or get the tape stuck in the drive.
These things are fairly expensive. And if they can be so bad out of the box, how long before important tapes with data on them go bad? We have LTO3 tapes from 7 years ago and LTO1 tapes from 10 years ago that still work fine.
I have several hundred LTO4 tapes and perhaps 50 LTO6 tapes, and I've never had a bad one out-of-the-box. That isn't to say it can't happen, but seems to be very uncommon. This is much better than SDLT600 tapes, where a 33% DOA rate would be excellent. And Quantum repeatedly refused to honor their "lifetime media warranty", claiming the tapes were "too old". I told them to look at the number of load cycles on each tape (1) when they got them back, and it wasn't my problem if their authorized distributors had stale inventory.

LTO tapes have a bunch of factory-written data on them which cannot be recovered / replaced if lost. That's why the boxes say "do not degauss" (bulk erase) as it will render the tapes unusable. If your original drive was damaging this data, then it could render unusable any tape it loads.
 
I do not intend to go off topic, but if it's compression-friendly data, perhaps BDXL along with LZMA compression could be an option?
//Danne
 
I do not intend to go off topic, but if it's compression-friendly data, perhaps BDXL along with LZMA compression could be an option?
Just for reference and anyone else that finds this topic later on...

An LTO5 cartridge holds 15x the capacity of a BDXL disc. Currently, an LTO5 cartridge is $20 or so, while a write-once BDXL disc is approximately $25. Rewritable BDXL-RE media is pretty rare and costs about $40. [Prices based on best price for single-quantity new media from US sellers, excluding eBay.]
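(The 15x figure presumably comes from comparing LTO5's 1.5 TB native capacity with a 100 GB triple-layer BDXL disc: 1500 GB / 100 GB = 15.)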
 
That's why I suggested LZMA/LZMA2 if it's applicable, not to mention that the drives are much cheaper. I'm not sure exactly how large a FreeBSD base install is, but my images on MIPS end up at about 155 Mbyte from a fully populated 16 Gbyte USB stick using LZMA2.
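(As a rough sketch of what I mean, assuming xz(1) for the LZMA2 part and with the output path purely as an example, the dump could be piped through the compressor instead of going straight to tape:)
Code:
# /sbin/dump -0Lauf - /dev/mfid0p2 | xz -9 > /backup/root.dump.xz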
//Danne
 
This is an internal tape drive, so the only way to regain control once the drive goes insane is to press and hold the eject button. This resets the drive, and a message on the console shows up saying the drive has become "detached". Is there a way to re-attach it other than rebooting the system?

If we go with the theory that the first drive was bad and that it permanently damaged those tapes, then maybe that is indeed the root cause, and everything will be fine with this new drive. There are new tapes on the way, so if we go 5 for 5 with those, perhaps we can be assured that the storm has passed.

I do not intend to go off topic, but if it's compression-friendly data, perhaps BDXL along with LZMA compression could be an option?
That is a lot of consonants! And a vowel...

Anyway, we need bulk storage, so I don't think the Blu-Ray would cut it. It doesn't need to be fast, but it does need to be safe, which is why these bad tapes are giving us the willies.

We also back up to portable USB drives (essentially laptop drives in cases that require no external power source). They are unreliable, but painless enough that using them for a little extra comfort is a no-brainer.

Backing up to the "cloud" is not an option, for a number of reasons.

By the way, Terry, when I was talking to the Dell rep about installing the new firmware, he had a convoluted plan similar to your convoluted plan, but different. A little way in, he got confused, and then I told him, "I don't know what any of these terms mean, but..." and then I read him your blog post about how to update firmware. He thanked me and said it was a great idea, so we did that.

So it looks like you'll always have the back-up plan of a job as a Dell tech support rep, should it come to that.

Also, does anyone have a link to a primer explaining how to choose appropriate blocksize and cache values for dump(8)? I get what they mean in general terms, but it all seems like voodoo when you try to nail down values that make sense, and the only rule of thumb seems to be a cache value of 64 or less.

For instance, I get an error on the console if I choose any blocksize larger than 8, although superficially everything seems to be working. Even the default is 10, so that seems a bit small, although I admit that I have no sense for what those values mean in this real world application.
 
This is an internal tape drive, so the only way to regain control once the drive goes insane is to press and hold the eject button. This resets the drive, and a message on the console shows up saying the drive has become "detached". Is there a way to re-attach it other than rebooting the system?
Try # camcontrol rescan all
If we go with the theory that the first drive was bad and that it permanently damaged those tapes, then maybe that is indeed the root cause, and everything will be fine with this new drive. There are new tapes on the way, so if we go 5 for 5 with those, perhaps we can be assured that the storm has passed.
Keep us posted.
By the way, Terry, when I was talking to the Dell rep about installing the new firmware, he had a convoluted plan similar to your convoluted plan, but different. A little way in, he got confused, and then I told him, "I don't know what any of these terms mean, but..." and then I read him your blog post about how to update firmware. He thanked me and said it was a great idea, so we did that.

So it looks like you'll always have the back-up plan of a job as a Dell tech support rep, should it come to that.
Oh nooo!!!
Also, does anyone have a link to a primer explaining how to choose appropriate blocksize and cache values for dump(8)? I get what they mean in general terms, but it all seems like voodoo when you try to nail down values that make sense, and the only rule of thumb seems to be a cache value of 64 or less.
Tapes are weird. They come from the dawn of computing, when it was decided that tape records could contain an arbitrary number of bytes, as long as the size was between some minimum and maximum values (which were pretty far apart, 20 and 65536, IIRC).

Other tape formats such as QIC and CompacTape came along with their own rules, often fixed block sizes. But since they didn't interchange with other systems the way 1/2" 9-track tape did, this didn't matter as much. They were often used only with the backup software that came bundled with the drive and controller.

Then came tapes based on consumer audio/video mechanisms - DDS and Data8. These initially used a fixed block size (normally 512) and the driver or controller had to deal with mapping between what the user application asked for and what the drive could provide.

Eventually most formats evolved to allow variable block sizes (on the interface side of the drive; what the drive did internally could be completely different). It took another 10 years or so for the last few nits to get ironed out (odd-length record support being one of the last holdouts).

The internal block size on LTO tapes varies and isn't relevant anyway. It ranges from around 400 KB to around 1.6 MB, depending on the drive model. The drive will buffer data until it has enough to fill one of those blocks (or the file is closed).

In general, you want to use blocks (records) and buffers (a collection of blocks) that are big enough to keep the tape moving most of the time. If the tape has to stop and back up, it wastes a lot of time and sometimes tape space. You can listen to the drive "whooshing" to get a feel for how often it does this - something you seem to have a lot of experience with, from your older drive.
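As an aside, if you want to see or change the block size the drive itself is set to, mt(1) can do that; a quick sketch (0 selects variable-length blocks):
Code:
# mt -f /dev/nsa0 status
# mt -f /dev/nsa0 blocksize 0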

For instance, I get an error on the console if I choose any blocksize larger than 8, although superficially everything seems to be working. Even the default is 10, so that seems a bit small, although I admit that I have no sense for what those values mean in this real world application.
Can you post the error?
 
Try # camcontrol rescan all
No joy...
Code:
# camcontrol rescan all
Re-scan of bus 0 was successful
Re-scan of bus 1 was successful
Re-scan of bus 2 was successful
Re-scan of bus 3 was successful
Re-scan of bus 4 was successful
Re-scan of bus 5 was successful
Re-scan of bus 6 was successful
Re-scan of bus 7 was successful
# mt errstat
mt: /dev/nsa0: Device not configured
A reboot cures all.

The internal block size on LTO tapes varies and isn't relevant anyway.
Any suggestions for settings for LTO5 tapes/drives in general, or does it vary by manufacturer? Is the best way to test just listening to make sure things sound reasonable? I never got error messages when using default settings for LTO1 and LTO3 drives. And my drives have lasted for many years.

Can you post the error?
When using the default 10 blocksize, it says...
Code:
(sa0:mps0:0:4:0): 10240-byte tape record bigger than supplied buffer
Changing the blocksize to numbers larger than 8 will give an appropriately-scaled error (n x 1024). Choose 8 or lower, and there is no error.

Despite the error message, storing and restoring seems to work fine, but we haven't rigorously stress-tested this assumption.

There was a hold-up with the delivery of fresh tapes, so there's no new data on that front yet.
 
No joy...
Code:
# mt errstat
mt: /dev/nsa0: Device not configured
A reboot cures all.
When the drive is working do a # camcontrol devlist to find out where it normally shows up. Then you could try # camcontrol reset B:T:L where B, T, and L are the bus, target and LUN that you got from the previous devlist. Try this before trying to reset the tape drive via the front panel.
Any suggestions for settings for LTO5 tapes/drives in general, or does it vary by manufacturer? Is the best way to test just listening to make sure things sound reasonable? I never got error messages when using default settings for LTO1 and LTO3 drives. And my drives have lasted for many years.
It depends on the drive's physical speed (half-height drives are usually slower than full-height drives of the same generation) and the generation of tape being used. An LTO4 tape in an LTO5 drive needs less buffering because it doesn't contain as much data per unit of length as an LTO5 tape.
When using the default 10 blocksize, it says...
Code:
(sa0:mps0:0:4:0): 10240-byte tape record bigger than supplied buffer
Changing the blocksize to numbers larger than 8 will give an appropriately-scaled error (n x 1024). Choose 8 or lower, and there is no error.
That seems odd. Here's a dump(8) done with a 32 KB block size, done to an IBM LTO4 drive:
Code:
(0:1) rz1:/sysprog/terry# camcontrol inquiry /dev/sa0
pass4: <IBM ULT3580-HH4 C7QJ> Removable Sequential Access SCSI-3 device
pass4: Serial Number 1K1004xxxx
pass4: 300.000MB/s transfers, Command Queueing Enabled
(0:2) rz1:/sysprog/terry# dump -0uaL -C 32 -b 32 -f /dev/nsa0 /
  DUMP: Date of this level 0 dump: Sat Mar 21 13:50:10 2015
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/mirror/gm0s1a (/) to /dev/nsa0
  DUMP: mapping (Pass I) [regular files]
  DUMP: Cache 32 MB, blocksize = 65536
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 553328 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: DUMP: 553230 tape blocks on 1 volume
  DUMP: finished in 17 seconds, throughput 32542 KBytes/sec
  DUMP: level 0 dump on Sat Mar 21 13:50:10 2015
  DUMP: Closing /dev/nsa0
  DUMP: DUMP IS DONE
And no console error messages.

What sort of controller is this attached to? A dmesg(8) of at least the controller and tape drive might be informative.
 
When the drive is working do a # camcontrol devlist to find out where it normally shows up.

It reports...
Code:
<IBM ULTRIUM-HH5 E6Q3>  at scbus0 target 4 lun 0 (sa0,pass0)
But in order to make the drive go nuts, there needs to be a "bad" tape in there, and in case those tapes have magnetic dandruff, I'd rather not put them in there just to test at this point. If the new tapes break things, though, I will surely try this fix.

What sort of controller is this attached to?
Code:
sa0 at mps0 bus 0 scbus0 target 4 lun 0
sa0: <IBM ULTRIUM-HH5 E6Q3> Removable Sequential Access SCSI-6 device
sa0: Serial Number 9068800821
sa0: 600.000MB/s transfers
Not sure what I'm looking for when it comes to the controller.
 
It reports...
Code:
<IBM ULTRIUM-HH5 E6Q3>  at scbus0 target 4 lun 0 (sa0,pass0)
But in order to make the drive go nuts, there needs to be a "bad" tape in there, and in case those tapes have magnetic dandruff, I'd rather not put them in there just to test at this point. If the new tapes break things, though, I will surely try this fix.
Ok. That would be # camcontrol reset 0:4:0.
Not sure what I'm looking for when it comes to the controller.
You have a mps(4) controller, which could be any one of a number of things. Since this is a Dell system, I expect it is something like a SAS 6/e card. Things will get funky if it is a PERC (RAID) card, as command passthrough is somewhat iffy on those. Try # grep mps0 /var/run/dmesg.boot
 
Things will get funky if it is a PERC (RAID) card, as command passthrough is somewhat iffy on those. Try # grep mps0 /var/run/dmesg.boot
Code:
# grep mps0 /var/run/dmesg.boot
mps0: <LSI SAS2008> port 0xfc00-0xfcff mem 0xda7f0000-0xda7fffff,0xda780000-0xda7bffff irq 18 at device 0.0 on pci2
mps0: Firmware: 07.15.08.00, Driver: 19.00.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
sa0 at mps0 bus 0 scbus0 target 4 lun 0
We do have a RAID configuration for the hard drives. Not sure if that impacts the tape drive.
Code:
mfi0: <Dell PERC H710P Adapter> port 0xec00-0xecff mem 0xd8ffc000-0xd8ffffff,0xd8f80000-0xd8fbffff irq 34 at device 0.0 on pci8
 