Attaching SCSI tape library while system is running fails with "SCSI sense: NOT READY"

MrFich · Jun 25, 2023

Hello FreeBSD community, I have been troubled by an issue that I was not able to solve myself so far. I am running a small Bareos backup server with a used MSL4048 tape library with a cartridge autochanger and two connected HP Ultrium 5-SCSI I6RW tape drives on FreeBSD 13.1-p7 (issue was also present in former FreeBSD systems). The system is connected via a Qlogic ISP 2532 FC-HBA.

In order to save power, the MSL tape library is only turned on at days where a tape backup is expected. During the power up the cartridge scanner reads in all available tapes which takes about 5 minutes. This leads to the issue that the tape drives are available but the cartridge changer fails to attach:

Code:

sa0 at isp0 bus 0 scbus0 target 0 lun 0
sa0: <HP Ultrium 5-SCSI I6RW> Removable Sequential Access SPC-4 SCSI device
sa0: Serial Number HU1204LRKW
sa0: 800.000MB/s transfers WWNN 0x50014380272b30e9 WWPN 0x50014380272b30ea PortID 0x1
sa0: Command Queueing enabled
sa1 at isp1 bus 0 scbus1 target 0 lun 0
sa1: <HP Ultrium 5-SCSI I6RW> Removable Sequential Access SPC-4 SCSI device
sa1: Serial Number HU1205LTM5
sa1: 800.000MB/s transfers WWNN 0x50014380272b30ef WWPN 0x50014380272b30f0 PortID 0x1
sa1: Command Queueing enabled
(ch0:isp0:0:0:1): MODE SENSE(6). CDB: 1a 00 1d 00 20 00
(ch0:isp0:0:0:1): SCSI sense: NOT READY asc:4,1 (Logical unit is in process of becoming ready)
(ch0:isp0:0:0:1): fatal error, failed to attach to device

When the tape library is already up and running and I (re)-start the system, everything runs fine.

Code:

Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ses0 at ahciem0 bus 0 scbus11 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
sa0 at isp0 bus 0 scbus0 target 0 lun 0
sa1 at isp1 bus 0 scbus1 target 0 lun 0
ses0: pass3 in 'Slot 00', SATA Slot: scbus5 target 0
sa1: <HP Ultrium 5-SCSI I6RW> Removable Sequential Access SPC-4 SCSI device
sa1: Serial Number HU1205LTM5
sa1: 800.000MB/s transfers WWNN 0x50014380272b30ef WWPN 0x50014380272b30f0 PortID 0x1
sa1: Command Queueing enabled
sa0: ses0: pass4 in 'Slot 01', SATA Slot: scbus6 target 0
ses0: pass5 in 'Slot 02', SATA Slot: scbus7 target 0
<HP Ultrium 5-SCSI I6RW> Removable Sequential Access SPC-4 SCSI device
sa0: Serial Number HU1204LRKW
sa0: 800.000MB/s transfers WWNN 0x50014380272b30e9 WWPN 0x50014380272b30ea PortID 0x1
sa0: Command Queueing enabled
..
ch0: Serial Number DEC53500DN
ch0: 800.000MB/s transfers WWNN 0x50014380272b30e9 WWPN 0x50014380272b30ea PortID 0x1
ch0: Command Queueing enabled
ch0: 48 slots, 2 drives, 1 picker, 0 portals

Of course, one solution is to leave the tape library running. The tape library is, however, only used for archive purposes and only needed to archive backups to a 2nd medium every couple of days - so 100% on is in this case wasted power. So the current “fix” is to restart the backup server once the library is up… which is not the optional solution in my opinion. I already tried to rescan / reprobe / reset the SCSI bus via camcontrol(8) but did not have any positive outcome. Could you give me a hint how I could either re-attach the autochanger device while the server is running or give me another hint how I could solve the problem?

Your help is much appreciated, I absolutely love FreeBSD as a server system and have been a happy user for many years - thanks!
Matthias

ralphbsz · Jun 26, 2023

Try this: after turning the take infrastructure on, issue a "start unit" command to the drives. I think the camcontrol command for that is start, but look it up in the documentation.

MrFich · Jul 3, 2023

Hi!

Thanks for the tip and sorry for the late reply, just got around to test the start unit command. Unfortunately, I do not seem to get it to work:

The first time I run the command I get the following error:

Code:

camcontrol start 0:0:1 -v
Error received from start unit command
(pass9:isp0:0:0:1): OPEN/CLOSE IMPORT/EXPORT ELEMENT. CDB: 1b 00 00 00 01 00
(pass9:isp0:0:0:1): CAM status: SCSI Status Error
(pass9:isp0:0:0:1): SCSI status: Check Condition
(pass9:isp0:0:0:1): SCSI sense: UNIT ATTENTION asc:28,0 (Not ready to ready change, medium may have changed)

Running the command a second time (I was checking the device ID and ran it again) gives another error that stays the same in all subsequent tries.

Code:

camcontrol start 0:0:1 -v
Error received from start unit command
(pass9:isp0:0:0:1): OPEN/CLOSE IMPORT/EXPORT ELEMENT. CDB: 1b 00 00 00 01 00
(pass9:isp0:0:0:1): CAM status: SCSI Status Error
(pass9:isp0:0:0:1): SCSI status: Check Condition
(pass9:isp0:0:0:1): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
(pass9:isp0:0:0:1): Command byte 4 bit 0 is invalid

ralphbsz · Jul 4, 2023

Oops ... Jose (down below) is right.

Jose · Jul 4, 2023

The target at 0:0:1 is not a tape drive. It's the SCSI medium changer i.e., the robot that moves tapes around. See ch(4) and chio(1). If the ch driver is loaded as a module, maybe a kldunload/kldload cycle on it will do the trick.

Edit: Reading more, it seems camcontrol reprobe 0:0:1 might do the trick. I can't find out what the 1bh command will ask a robot to do. It's either reserved or OPEN/CLOSE IMPORT/EXPORT ELEMENT. The latter means open / close the "mailslot". That's a slot or set of slots in the door of a library that allow you to move tapes into or out of it without opening the door. Opening the door while the robot is moving is a good way to lose some fingers. It sounds like your library does not have a mailslot, though.

MrFich · Jul 7, 2023

Hey! Correct, the library does not have a mailslot, so my fingers are safe for now

. camcontrol reprobe 0:0:1 runs without error but does not change anything concerning the ch(4) situation. I'll play around in the next days with the ch(4) driver and see if I can get something there. If so, I'll post my findings here. If not, I guess it's back to regular reboots for archiving

. Thanks again for all the nice input!

Jose · Jul 7, 2023

You say you've tried camcontrol reset -v 0:0:1 without success. Mind pasting the results here?

MrFich · Jul 7, 2023

Sure! The reset and reprobe are actually succesfull - but the ch0 device doesn't get attached afterwards.

Code:

camcontrol reset 0:0:1 -v
Reset of 0:0:1 was successful

And in dmesg it adds the line:

Code:

isp0: Chan 0 Reset N-Port Handle 0x0000 @ Port 0x000001

Fun fact: When running camcontrol reprobe 0:0:1 -v I do not see any additional output in the commandline or via dmesg.

When running camcontrol reprobe 0:0:1 -v after running camcontrol reset 0:0:1 I still do not see verbal output in the commandline for the reprobe command. However, in dmesg I see a bunch of new entries:

Code:

(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 3 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 2 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 1 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 0 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Error 5, Retries exhausted

ralphbsz · Jul 7, 2023

Jose said:
Opening the door while the robot is moving is a good way to lose some fingers.

Or worse. The big tape libraries (the IBM 3850 MSS, and the StorageTek WolfCreek) contain a human-sized robotic mechanism, which moves remarkably fast. They have doors for field service to enter, and there are intricate safety mechanisms to disable the robots when the doors are open or field service personnel is inside. There are at least a handful cases where service people intentionally defeated those safety mechanisms, and were killed.

A more funny anecdote: The IBM MSS has two mail slots, one for adding new blank tapes, and one for tapes to be ejected. Both have a small feeder basket, where operators can put a few dozen tape cartridges (which in the case of the MSS look like fist-sized cylinders). Our MSS had a defective drive, which caused read errors. Unfortunately, someone had programmed it to automatically put "defective" cartridges into the output slot. When the machine was left unattended for a night, that caused it to eject nearly all cartridges: Every time it tried to read one cartridge, it would get ejected. Then all batch jobs that were waiting for that cartridge would fail with IO errors, causing other batch jobs (which had in some cases been queued for weeks) to start, read different cartridges, and the cycle repeats. When the operators came in the morning, there were several thousand cartridges on the floor of the computer room. The worst part is this: All those cartridges had to be put back, but because the MSS's metadata storage had also been corrupted (it "forgot" about defective cartridges), that had to be done by a human reading the bar code on each cartridge, and manually placing it into the correct spot. It took a week or two to recover from this disaster.

Jose · Jul 8, 2023

MrFich said:

When running camcontrol reprobe 0:0:1 -v after running camcontrol reset 0:0:1 I still do not see verbal output in the commandline for the reprobe command. However, in dmesg I see a bunch of new entries:

Code:

(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 3 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 2 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 1 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Retrying command, 0 more tries remain
(probe0:isp0:0:0:1): FCP RESPONSE, LENGTH 8: FCP_CMND fields invalid CDB0=0x12
(probe0:isp0:0:0:1): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:isp0:0:0:1): CAM status: Unrecoverable Host Bus Adapter Error
(probe0:isp0:0:0:1): Error 5, Retries exhausted

Well that's odd. The FCP_CMND thing comes from the Fibre Channel layer. Maybe the library is doing its 5-minute, check-every-slot-and-drive thing?

ralphbsz said:
Or worse. The big tape libraries (the IBM 3850 MSS, and the StorageTek WolfCreek) contain a human-sized robotic mechanism, which moves remarkably fast. They have doors for field service to enter, and there are intricate safety mechanisms to disable the robots when the doors are open or field service personnel is inside. There are at least a handful cases where service people intentionally defeated those safety mechanisms, and were killed.

We didn't have libraries that big in our lab back when I worked on this stuff. Biggest had a couple hundred slots, and those robots were plenty scary enough. I can imagine how destructive that thing could've been.

ralphbsz said:
A more funny anecdote...When the operators came in the morning, there were several thousand cartridges on the floor of the computer room. The worst part is this: All those cartridges had to be put back, but because the MSS's metadata storage had also been corrupted (it "forgot" about defective cartridges), that had to be done by a human reading the bar code on each cartridge, and manually placing it into the correct spot. It took a week or two to recover from this disaster.

Yeah, that job taught me that robots don't get tired and they don't get bored. They'll do the same stupid thing with many repetitions and in rapid succession, even when they're destroying themselves in the process.