Bacula console freezing when checking slots in tape changer

Hello,

First of all, here's the setup for my Bacula server:

  • Dell PowerEdge R320 server
  • Dell PowerVault 124T LTO5 tape changer
  • FreeBSD 9.1 OS
  • bacula-server-5.2.12

Whenever I run [CMD=]update slots[/CMD] in the Bacula command-line console, bconsole, after exchanging tapes in the tape changer, the console freezes. I have to exit the console using Ctrl+C. If I immediately go back into the Bacula console and run [CMD=]update slots[/CMD] again, the console either freezes again or I get an error message. The message (which I wish I would have written down :\) indicates that another process is currently using the tape changer. Here's the process I'm seeing:

Code:
root@aus-nix-backup:/root # ps waux | grep mtx
bacula   91498   0.0  0.1 14172  1408 ??  DL   12:01PM     0:02.35 /usr/local/sbin/mtx -f /dev/pass1 status

The process cannot be killed.

Last week when this happened, I rebooted the server to resolve this issue. I was able to label the new tapes, mount one into the drive, and run successful backups all week long. When this happened again today, I didn't want to have to reboot again; that's a workaround, not a permanent solution. I tried restarting the storage daemon, and I was able to run [CMD=]update slots[/CMD] in the console to have Bacula process the new tapes; however, the mtx process is still alive. Also, I see the following:

Code:
root@aus-nix-backup:/root # ps waux | grep bacula
bacula   91498   0.0  0.1 14172  1408 ??  DL   12:01PM     0:02.35 /usr/local/sbin/mtx -f /dev/pass1 status
bacula   91499   0.0  0.1 16280  1640 ??  I    12:01PM     0:00.00 grep  *Storage Changer
bacula   91500   0.0  0.1 12104  1536 ??  I    12:01PM     0:00.00 awk {print $5}
bacula   92252   0.0  0.3 71612  6188 ??  Ss    1:07PM     0:00.07 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c /usr/local/etc/bacula-dir.conf
bacula   92266   0.0  0.2 59184  4868 ??  Is    1:07PM     0:00.35 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c /usr/local/etc/bacula-sd.conf
root     92301   0.0  0.2 44132  3192 ??  Is    1:08PM     0:00.04 /usr/local/sbin/bacula-fd -u root -g wheel -v -c /usr/local/etc/bacula-fd.conf
root     94809   0.0  0.1 16280  1628  1  R+    4:53PM     0:00.00 grep bacula

The two processes, 91499 and 91500, below the mtx process concern me even more.

The aforementioned issues are not showstoppers, at least not for the time being. If I restart the storage daemon, as mentioned above, backups seem to be running well. I'll also note that I am unmounting the tape drive using the Bacula console before removing and inserting tapes. I'd still like to know what causes the console freeze and lingering mtx process and if those issues can be avoided.

I appreciate any advice offered.
 
kkaos said:
The aforementioned issues are not showstoppers, at least not for the time being. If I restart the storage daemon, as mentioned above, backups seem to be running well. I'll also note that I am unmounting the tape drive using the Bacula console before removing and inserting tapes. I'd still like to know what causes the console freeze and lingering mtx process and if those issues can be avoided.

I appreciate any advice offered.
The D state you see for the mtx process indicates a device wait. Those generally aren't killable - the system will wait (possibly forever) for the device driver to either successfully complete the operation or return an error condition.

The PowerVault 124T is a re-badged Quantum Superloader 3, which I've referred to as "a triumph of engineering over common sense" in the past. I've used a number of these (the SCSI versions) on FreeBSD.

The Superloader 3's changer interface actually comes from inside the tape drive mechanism itself (called the "library port"), so I wouldn't expect changer commands to fail if you can successfully read / write tapes. But strange things can happen.

The latest (generic) firmware versions for the Superloader 3 / LTO-5 is V90 (quite recent) for the library and either V31807011 or V3180A011 for the LTO-5 drive. You should check to see what versions your 124T is running, and check the Dell support site for updates.

What brand / model SAS controller are you using in the FreeBSD system?

I assume the console doesn't log any error messages while the mtx process is hung? Can you post the relevent portions of your kernel dmesg (from /var/run/dmesg.boot)? The interesting parts will be the sa0 and ch0 devices, as well as their parent entries. Also, the output from [cmd=""]camcontrol devlist[/cmd] might be helpful.
 
Terry_Kennedy said:
The latest (generic) firmware versions for the Superloader 3 / LTO-5 is V90 (quite recent) for the library and either V31807011 or V3180A011 for the LTO-5 drive. You should check to see what versions your 124T is running, and check the Dell support site for updates.

I recently purchased the 124T from Dell a little over a month ago so its firmware should be up-to-date, but I'll definitely check.

Terry_Kennedy said:
What brand / model SAS controller are you using in the FreeBSD system?

LSI SAS2008

Terry_Kennedy said:
I assume the console doesn't log any error messages while the mtx process is hung? Can you post the relevent portions of your kernel dmesg (from /var/run/dmesg.boot)? The interesting parts will be the sa0 and ch0 devices, as well as their parent entries. Also, the output from [cmd=""]camcontrol devlist[/cmd] might be helpful.

I'm seeing this message repeating incessantly in /var/log/messages:

Code:
Apr 15 08:00:01 aus-nix-backup kernel: (pass1:mps0:0:3:1): REQUEST SENSE. CDB: 3 20 0 0 14 0 length 20 SMID 678 terminated ioc 804b scsi 0 state c xfer 0

Looks like a communication issue between the 124T and the SAS controller, and I believe that's the mtx process which is stuck in the DL state.

From /var/run/dmesg.boot:

Code:
mps0: <LSI SAS2008> port 0xec00-0xecff mem 0xd9ff0000-0xd9ffffff,0xd9f80000-0xd9
fbffff irq 48 at device 0.0 on pci8
mps0: Firmware: 07.15.08.00, Driver: 14.00.00.01-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
...
sa0 at mps0 bus 0 scbus0 target 3 lun 0
sa0: <QUANTUM ULTRIUM 5 3180> Removable Sequential Access SCSI-6 device
sa0: 600.000MB/s transfers
sa0: Command Queueing enabled
ch0 at mps0 bus 0 scbus0 target 3 lun 1
ch0: <DELL PV-124T 0090> Removable Changer SCSI-2 device
ch0: 600.000MB/s transfers
ch0: Command Queueing enabled
ch0: 8 slots, 1 drive, 1 picker, 0 portals

And here's the output from [cmd=""]camcontrol devlist[/cmd]:

Code:
root@aus-nix-backup:/root # camcontrol devlist
<QUANTUM ULTRIUM 5 3180>           at scbus0 target 3 lun 0 (sa0,pass0)
<DELL PV-124T 0090>                at scbus0 target 3 lun 1 (pass1,ch0)
<PLDS DVD-ROM DS-8D9SH JD51>       at scbus5 target 0 lun 0 (pass2,cd0)

I'll also note that backups ran successfully over the weekend.
 
kkaos said:
I recently purchased the 124T from Dell a little over a month ago so its firmware should be up-to-date, but I'll definitely check.
I've received new-in-the-box stuff from Dell and have found out-of-date firmware on more than one occasion. However, your versions look like the latest.

I'm seeing this message repeating incessantly in /var/log/messages:

Code:
Apr 15 08:00:01 aus-nix-backup kernel: (pass1:mps0:0:3:1): REQUEST SENSE. CDB: 3 20 0 0 14 0 length 20 SMID 678 terminated ioc 804b scsi 0 state c xfer 0

Yup, that's your changer. Your [cmd=""]dmesg[/cmd] shows:

Code:
<DELL PV-124T 0090>                at scbus0 target 3 lun 1 (pass1,ch0)

I don't have a lot more to offer, unfortunately. The one thing you might try is using the LSI Logic driver, which is a little newer (Phase 15) than the one you've got:
Code:
mps0: Firmware: 07.15.08.00, Driver: 14.00.00.01-fbsd
Download the P15 FreeBSD driver from here (currently the top result).

Normally the mps(4) driver is built into the FreeBSD kernel, so you'd need to build a kernel without:
Code:
device          mps             # LSI-Logic MPT-Fusion 2
in the config file. Then you can follow the instructions in the LSI README (2nd file in the search results) to install the LSI driver. If this does fix the problem, I'm sure the FreeBSD maintainers would like to hear about it.
 
I built a custom kernel without the mps driver, and I configured /boot/loader.conf to use the alternate LSI P15 driver. After rebooting, though it appears the driver is loaded and being used with the changer, I see 100+ instances of these error messages that were logged in /var/log/messages during boot:

Code:
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): INQUIRY. CDB: 12 0 0 0 24 0
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): CAM status: Invalid Target ID
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): Error 22, Unretryable error

I've run some initial tests to verify that communication with the tape changer is working. I'll find out tonight if backups to tape succeed and if anymore error messages are logged. The final test will be on Friday, when I again change the tapes.
 
kkaos said:
I built a custom kernel without the mps driver, and I configured /boot/loader.conf to use the alternate LSI P15 driver. After rebooting, though it appears the driver is loaded and being used with the changer, I see 100+ instances of these error messages that were logged in /var/log/messages during boot:

Code:
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): INQUIRY. CDB: 12 0 0 0 24 0
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): CAM status: Invalid Target ID
Apr 17 17:32:18 aus-nix-backup kernel: (probe7:mpslsi0:0:7:0): Error 22, Unretryable error
Since it is logging from mpslsi, you're definitely using the LSI driver and not the native FreeBSD one.

Are all of the errors logged against the 0:0:7:0 target? I would hope that the driver would take the hint after a single "Invalid Target / Unretryable error" message.

I just tried the mpslsi Phase 15 driver on an 8.4-RC1 test system here (which has a SAS2004 controller with the latest generic LSI firmware on it) and I'm not seeing those messages here. So I suspect it has something to do with either your specific SAS devices or the firmware on the adapter. Since you have a Dell tape library, I assume you're using a Dell SAS controller? Dell tends to do some funky stuff with the firmware on their versions of the LSI cards. For example, the SAS 6/iR firmware creates a "Dell BP" pseudo-device for the hot-swap backplane, which the same controller flashed with generic firmware and BIOS won't have.

You're in an odd situation, as Dell doesn't "support" the FreeBSD operating system. LSI does, but you're (probably) using a Dell version of the card, so they may not be able to help.

Hopefully the mpslsi driver fixes the problems with misc/mtx hanging and you can live with the spew of error messages at boot. If it does fix the hang, you might open a PR to notify the FreeBSD mps developer(s) that there's a newer LSI driver that fixes one problem but also has some unwanted log messages.

There's a dev.mpslsi.0.debug_level tunable which is defaulting to 4 ("Enable prints for driver faults"). You might try adding:
Code:
dev.mpslsi.0.debug_level=0
to /boot/loader.conf to try to make the driver less chatty at boot time.

Footnote: LSI released their Phase 16 driver today. However, the release notes say "version number change only", so it probably isn't worth trying that one to see if it fixes the boot messages.
 
Unfortunately the mpslsi driver did not resolve the issue. I think there may be newer firmware for my 124T so I'll try that out.
 
kkaos said:
Unfortunately the mpslsi driver did not resolve the issue. I think there may be newer firmware for my 124T so I'll try that out.
Oh well - it was worth a try.

You might try looking for a LiveCD for some other operating system which includes a pre-built version of mtx on the CD. Boot that CD and try your changer command line with it. If you get the same error, you almost* definitely have a hardware / firmware issue. If it completes without error, you may have a driver issue.

mtx builds its own private device command blocks, since the operating system doesn't have support for them built-in. If it did, you wouldn't need the sysutils/mtx port.

* I say "almost definitely" because there's some cross-pollination of drivers between the various *BSDs. If you can find a Linux LiveCD with mtx for testing, that would be ideal.
 
I booted the server using a CentOS 6 LiveCD to troubleshoot the communication with the tape changer. After reinserting the magazine into the tape changer, I ran the mtx command that always fails for this scenario on my FreeBSD system: mtx -f /dev/sg2 status, where /dev/sg2 represents the tape changer as seen by CentOS 6. Although the command does display the contents of the LTO-5 drive and changer slots, those contents are preceded by the following:

Code:
mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=0 (Unknown?!)
mtx: Request Sense: Sense Key=No Sense
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 00
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no

The following message (paraphrased) is printed in /var/log/messages 10 times:

Code:
kernel: mpt2sas0: log_info(0x3112043a): originator(PL), code(0x12), sub_code(0x043a)

If I run the same mtx command immediately afterward, I do not see any of the above output. Also, the command does not get stuck waiting on a response from the tape changer and SAS controller like it does when run on FreeBSD. I tried the same troubleshooting procedure on my Ubuntu 12.04 Bacula server (also connected to a PowerVault 124T LTO-5 changer via SAS) and got the same results as shown above.

Given the results of my troubleshooting, it appears the problem is with the SAS drivers provided for FreeBSD. It seems neither driver can handle the blip in communication between mtx, the SAS controller, and the tape changer, as seen in my troubleshooting.

Should I configure Bacula to use chio and chio-bacula instead of mtx? Maybe I won't see that blip in communication with chio.
 
kkaos said:
Given the results of my troubleshooting, it appears the problem is with the SAS drivers provided for FreeBSD. It seems neither driver can handle the blip in communication between mtx, the SAS controller, and the tape changer, as seen in my troubleshooting.
It seems to me that there's an underlying hardware issue which the FreeBSD and CentOS drivers handle differently, since you get error messages from both the application and the kernel in CentOS.

Should I configure Bacula to use chio and chio-bacula instead of mtx? Maybe I won't see that blip in communication with chio.
You could try that and see if it helps - it will likely depend on which commands are sent to the drive by each application.

I'd suggest filing a FreeBSD bug report to get a developer to look at this. At a minimum, they should be able to tell you what the error messages mean.
 
I may have answered my own question without reconfiguring Bacula. After the tape magazine had been removed and inserted again, I ran chio -f /dev/ch0 status -S, where /dev/ch0 represents my tape changer, and there was no blip in communication with the SAS controller and the tape changer. Then, in the Bacula console, I ran update slots successfully without the console freezing.

Apparently, I can continue having Bacula use mtx to communicate with the tape changer, but I need to use chio to initialize communication with the changer after reloading the tape magazine and before having Bacula scan the newly inserted tapes via update slots. It's a hack, but it's better than rebooting once a week.

I'll double-check the effectiveness of this procedure when I exchange the tapes on Friday, and I'll be sure to post the results.

Terry, I'll also file a bug report per your suggestion. I would certainly classify this issue as a bug, though the bug appears to be related to FreeBSD's implementation of mtx, not to its native SAS driver.
 
Yes, running chio to scan the changer's slots to initialize communication with the changer before having Bacula perform its scan using mtx worked again yesterday.

I have created a new thread to mention my troubles specifically with mtx and the Dell PowerVault 124T-LTO5 changer to see if anyone else has experienced the same troubles or if I'm doing something wrong. My decision to file a bug report on the mtx port will depend on the feedback provided there.
 
Back
Top