sysutils/nut: USB communication failure

Summary:
I recently installed Network UPS Tools (sysutils/nut) and have been experiencing issues with the USB driver losing communication to the UPS, resulting in stale data and connection failure errors. On system boot, NUT will start and operate correctly, but loses communication to the UPS after a few hours. Once communication is dead, attempts to restart the NUT service fail. The only means of recovery is to reboot or unplug and reconnect the UPS USB cable.

System/Software:
FreeBSD 13.2-p4 and NUT version 2.8.0_24 installed from the package repository.

UPS:
APC Smart-UPS 750

Setup:
FreeBSD runs in a virtual machine on VMware ESXi.
UPS connects to the ESXi host by USB cable and to the FreeBSD guest by USB pass-through.
ESXi host has NUT client software installed and communicates to the FreeBSD NUT server to shutdown guests and the host on power failure.

Details:
I had this setup (same server and UPS) running NUT on a CentOS 6 virtual machine and worked perfectly for years without UPS communication issues. When I moved the NUT server to FreeBSD I copied the working configuration (settings changed from defaults) from CentOS.

Settings applied in ups.conf:
Code:
[smt750]
    driver = usbhid-ups
    port = auto
    desc = "APC Smart-UPS SMT750I"
    offdelay = 180
    ondelay = 240

Settings applied in upsd.conf:
Code:
LISTEN 127.0.0.1 3493
LISTEN <the.vm.ip.address> 3493

Initial attempts to start the NUT service after configuration failed:
Code:
# service nut start
Network UPS Tools - UPS driver controller 2.8.0
Network UPS Tools - Generic HID driver 0.47 (2.8.0)
USB communication driver (libusb 1.0) 0.43
interrupt pipe disabled (add 'pollonly' flag to 'ups.conf' to get rid of this message)
Can't claim USB device [051d:0003]@0/0: Other error
Driver failed to start (exit status=1)
Apparently, this is caused by the kernel USB HID driver not detaching from the device and preventing NUT from binding to it. To overcome this I added the following to /boot/loader.conf:
Code:
usb_quirk_load="YES"
hw.usb.quirk.0="0x051d 0x0003 0 0xffff UQ_HID_IGNORE"
The vendor and product ID come from usbconfig. After a reboot, NUT has correct permissions on the device and upsd starts successfully.
Code:
# ll /dev/usb
crw-rw----  1 root  nut       0x84 Nov  9 17:51 0.4.0

Output of usbconfig (when everything is working):
Code:
# usbconfig
  ugen0.4: <American Power Conversion Smart-UPS 750 FW:UPS 15.0 / ID18> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (2mA)

# usbconfig -d ugen0.4 dump_device_desc
  bLength = 0x0012
  bDescriptorType = 0x0001
  bcdUSB = 0x0200
  bDeviceClass = 0x0000  <Probed by interface class>
  bDeviceSubClass = 0x0000
  bDeviceProtocol = 0x0000
  bMaxPacketSize0 = 0x0040
  idVendor = 0x051d
  idProduct = 0x0003
  bcdDevice = 0x0106
  iManufacturer = 0x0001  <American Power Conversion >
  iProduct = 0x0002  <Smart-UPS 750 FW:UPS 15.0 / ID=18>
  iSerialNumber = 0x0003  <xxxxxxxxxxxx  >
  bNumConfigurations = 0x0001

When the USB driver loses communication after a few hours, upsd marks the data as stale, and the following appears in the system log:
Code:
upsd[3132]: Data for UPS [smt750] is stale - check driver
upsmon[5991]: Poll UPS [smt750@localhost] failed - Data stale
upsmon[5991]: Communications with UPS smt750@localhost lost

On loss of communication, upsmon is configured to raise a notify event (nocomm or commbad) to the scheduler, and a 30-second timer will restart the NUT service via the upssched-cmd script. This is how it was on CeotOS. But the service restart fails here because the driver refuses the connection, as shown in the system log:
Code:
upssched-cmd[34596]: Timer event comm-bad triggered: restarting NUT service
upsd[3132]: mainloop: Interrupted system call
upsmon[5991]: Poll UPS [smt750@localhost] failed - Write error: Broken pipe
upsmon[5991]: UPS [smt750@localhost]: connect failed: Connection failure: Connection refused
usbhid-ups[2912]: libusb1: Could not open any HID devices: insufficient permissions on everything
upsmon[5991]: UPS [smt750@localhost]: connect failed: Connection failure: Connection refused
root[37497]: /usr/local/etc/rc.d/nut: WARNING: failed precmd routine for nut
upsmon[5991]: UPS [smt750@localhost]: connect failed: Connection failure: Connection refused
upsmon[5991]: UPS smt750@localhost is unavailable
This goes on and on until I unplug and plug in the UPS again.

I updated the UPS firmware, but this did not help the problem.

Searching for solutions on the Internet, I initially tried increasing the timeout threshold by adding to upsd.conf:
Code:
MAXAGE 30
and to upsmon.conf:
Code:
DEADTIME 30
This made no improvement. I then additionally extended the poll interval to 10 seconds in ups.conf:
Code:
pollinterval = 10
This seems to have stopped the failures and NUT has now run continuously for over a week without a communication error. But that's only half the problem; if communication does fail for some reason, there is no automated recovery.

During my testing I discovered it was possible to reset a device with usbconfig. So, when NUT failed I could:
Code:
# service nut stop
# usbconfig -d ugen0.4 reset
# service nut start
and this worked. So I changed part of the upssched-cmd script to:
Code:
case $1 in
...
    no-comm|comm-bad)
    if [ ! -z "`pgrep upsd`" ]; then
        $LOGGER "Timer event $1 triggered: restarting NUT upsd service"
        service nut stop
        DEVICE=`usbconfig | grep $ID | sed 's/:.*//'`
        if [ ! -z $DEVICE ]; then
            usbconfig -d $DEVICE reset
            service nut start
            printf "%s\n" "Host $HOSTNAME UPS communication failure on $NOW" \
                "The service has been restarted." | \
                $MAIL -s "UPSMON: Communication Failure" $SYSADMIN
        fi
    fi
    ;;
...
esac
It's a bit ugly, and not yet tested for real by somehow creating a failure, but it's the best workaround I have.
But I still see this as an issue with the usbhid-ups driver or libusb, or both. Perhaps the NUT maintainer can look into it.

Alternatives:
I also considered switching to a serial interface port/cable and using the apcsmart driver, but the driver list suggests this has a lower support level. Without going through the hassle of changing the setup I don't know if I would see the same or less functionality and data. If anyone has knowledge of this it would be useful. I note the USB driver does not report the load or in/out voltage, which is disappointing. I don't know if that's a limitation of the UPS or the USB driver.
 
Update:
Since extending the poll interval from 2 (default) to 10 seconds the stability has improved, but it has not stopped NUT losing communication to the UPS. NUT initially ran for 16 days before losing communication, and then 3.5 days and 2.5 days after that. I have no idea why it fails and cannot find any relation to any other system event. I will report this as a bug, as I believe there is some issue with the USB driver or libusb.

The failures did, however, provide an opportunity to test the automated recovery, which did not work. This is because upssched runs as the nut user, which lacks the necessary privilege to perform some actions. To overcome this I installed doas to allow nut to run specific commands evelated. The revised upssched-cmd script is:

Code:
NAME=`basename $0`
LOGGER="/usr/bin/logger -t $NAME"
UPS="smt750"
DESC="Smart-UPS 750"
DEVICE="ugen0.4"
UPSC="/usr/local/bin/upsc $UPS@localhost"
UPSMON="/usr/local/sbin/upsmon"
HOSTNAME=`hostname -s`
SYSADMIN="root@example.com"
MAIL="/usr/bin/mailx"
NOW=`date +"%d/%m/%Y %T"`
ONBATT="/var/db/nut/onbatt"
NOCOMM="/var/db/nut/nocomm"

case $1 in
    ...
    comm-ok)
        if [ -f $NOCOMM ]; then
            printf "%s\n" "Host $HOSTNAME UPS communication established on $NOW" "Status report for UPS [$UPS]:" "`$UPSC`" | \
                $MAIL -s "UPSMON: Communication Established" $SYSADMIN
            rm -f $NOCOMM
        fi
        ;;
    no-comm|comm-bad)
        if [ -n "`pgrep upsd`" ]; then
            touch $NOCOMM
            $LOGGER "Timer event $1 triggered: restarting NUT upsd service"
            printf "%s\n" "Host $HOSTNAME UPS communication lost on $NOW" "Check the system log." | \
                $MAIL -s "UPSMON: Communication Lost" $SYSADMIN
            /usr/local/bin/doas -n -u root -- /usr/sbin/service nut stop
            sleep 5
        fi
        if [ -f $NOCOMM ]; then
            if [ $DEVICE = "`usbconfig | grep \"${DESC}\" | sed 's/:.*//'`" ]; then
                /usr/local/bin/doas -n -u root -- /usr/sbin/usbconfig -d $DEVICE reset
                sleep 10
                /usr/local/bin/doas -n -u root -- /usr/sbin/service nut start
            fi
        fi
        ;;
    ...
esac

And in /usr/local/etc/doas.conf:

Code:
permit nopass nut as root cmd /usr/sbin/usbconfig args -d ugen0.4 reset
permit nopass nut as root cmd /usr/sbin/service args nut start
permit nopass nut as root cmd /usr/sbin/service args nut stop

This appears to work in resetting the USB interface and restarting upsd.
 
Back
Top