Intro

As described in a previous post, I have a Proxmox box pulling data from three Xiaomi Mijia temperature and humidity sensors over BLE. However, after some days of working successfully, the setup stalled: it was no longer receiving any of the packets sent by the sensors.

When I connected to the server, I expected to find the BT interface borked and perhaps throwing errors. Instead, running hciconfig -a returned absolutely nothing. The adapter wasn’t “DOWN” or reporting an error; it was completely gone from the OS, as if it had been physically unplugged.

The Intel Firmware Lockup

To figure out what happened, I checked the system logs with dmesg | grep -i -E "bluetooth|btusb|hci0" | tail -30. It spat out this:

[13208393.586823] Bluetooth: hci0: command 0xfc1e tx timeout
[13208398.336401] Bluetooth: hci0: Failed usb_autopm_get_interface: -16

My BT adapter is an Intel AX210 (8087:0032). It replaced the MediaTek card that came with my mini PC, which had no proper Linux drivers; the Intel one was supposed to work and was very cheap (around 10€). I just thought it was a random error and, not without some reservations, rebooted the server. In the end, it was a good test of whether every service was deployed properly. Everything went well: all services came back up and, importantly, the sensors went back to reporting.

Locking up, again

The second time this happened I had to look the issue up, as it probably wasn’t a random occurrence. Turns out, the Intel AX210 has a known issue where running continuous passive BLE scans can make the firmware crash hard. That -16 error translates to EBUSY (device or resource busy), meaning the USB interface was completely locked up. rmmod btusb did nothing to recover it.

Since rebooting constantly wasn’t ideal (the mini PC is meant to be a server, so a reboot every few months may be fine, but not one every few days), I found that you can force a USB power cycle in software using PCI unbind/bind commands. I wrote a quick bash script to check whether the adapter was down and, if so, power-cycle the USB controller. I threw it on a 5-minute cron job and hoped it would work. That approach did nothing to stop the crashing.

This time, though, dmesg showed a new pattern: the firmware was initializing from scratch exactly every 199 seconds. It wasn’t crashing randomly; it was being put to sleep by something. Linux has a power-saving feature called USB autosuspend. Because a passive BLE scan doesn’t maintain an active connection with the sensors, the kernel assumed the Intel adapter was idle and put it to sleep. When TheengsGateway tried to read from it, the firmware woke up, panicked, and borked itself.
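
A watchdog along the lines of the one described above (check whether the adapter is gone, then rebind the USB host controllers) could look roughly like this. It is a sketch, not the original script: the function names, the SYSFS_ROOT override (only there so the logic can be dry-run against a fake directory tree) and the --run flag are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Watchdog sketch: if no Bluetooth adapter is registered, rebind the USB
# host controllers. SYSFS_ROOT is a hypothetical override for dry runs;
# on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Succeeds if at least one hciN device is registered with the kernel.
bt_adapter_present() {
    local entry
    for entry in "$SYSFS_ROOT"/class/bluetooth/hci*; do
        [ -e "$entry" ] && return 0
    done
    return 1
}

# Unbind and rebind every PCI USB host controller (uhci, ohci, ehci, xhci).
# Needs root on a real system and briefly disconnects all USB devices.
rebind_usb_controllers() {
    local dev driver_dir id
    for dev in "$SYSFS_ROOT"/bus/pci/drivers/[uoex]hci_hcd/*:*; do
        [ -e "$dev" ] || continue
        driver_dir="${dev%/*}"   # e.g. /sys/bus/pci/drivers/xhci_hcd
        id="${dev##*/}"          # PCI address, e.g. 0000:00:14.0
        printf '%s' "$id" > "$driver_dir/unbind"
        sleep 1
        printf '%s' "$id" > "$driver_dir/bind"
    done
}

bt_watchdog() {
    if bt_adapter_present; then
        echo "bluetooth adapter present, nothing to do"
        return 0
    fi
    echo "bluetooth adapter missing, rebinding USB host controllers"
    rebind_usb_controllers
}

# Only act when explicitly asked to, e.g. from cron: bt-watchdog.sh --run
if [ "${1:-}" = "--run" ]; then
    bt_watchdog
fi
```

Saved under a hypothetical path such as /usr/local/bin/bt-watchdog.sh, a root crontab entry like `*/5 * * * * /usr/local/bin/bt-watchdog.sh --run` would run it every five minutes. As noted, this only recovers the adapter after the fact; it does not prevent the lockups.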

Solving the issue

Looking around for the problem, I found this issue on GitHub and, in particular, this comment from user sashoalm, which pointed me in the right direction (or, really, gave me the solution).

sashoalm found out that, once the device disappears, you need to re-scan for it to get it back. He also notes that this may temporarily disconnect some peripherals, but for my server this does not matter, as there are no other USB devices in use. He basically ran:

for dev in /sys/bus/pci/drivers/[uoex]hci_hcd/*:*; do [ -e "$dev" ] && echo -n "${dev##*/}" | sudo tee "${dev%/*}/unbind" > /dev/null && echo -n "${dev##*/}" | sudo tee "${dev%/*}/bind" > /dev/null; done

and the adapter popped back to life. After testing this approach and finding that it did work on the Proxmox host, I decided to try to avoid the issue altogether.

For this, I just needed to tell the kernel to stop putting the adapter to sleep.

First, I found the Vendor and Product ID using lsusb (8087:0032). Then, I made a udev rule to disable power management for that specific device:

cat > /etc/udev/rules.d/99-bluetooth-power.rules << 'EOF'
ACTION=="add", SUBSYSTEM=="usb", ATTRS{idVendor}=="8087", ATTRS{idProduct}=="0032", TEST=="power/control", ATTR{power/control}="on"
EOF
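
A new rule file only takes effect once udev reloads its rules and the device sees another "add" event (a reboot does both). To apply and check it without rebooting, something like the following sketch could be used. The usb_power_control helper, the SYSFS_ROOT override (for dry runs against a fake tree) and the --apply flag are hypothetical names introduced here; the udevadm subcommands and the power/control sysfs attribute are standard.

```shell
#!/usr/bin/env bash
# Print "<device> <power/control>" for every USB device matching vendor:product.
# SYSFS_ROOT is a hypothetical override for dry runs; a real host uses /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

usb_power_control() {
    local vendor="$1" product="$2" dev
    for dev in "$SYSFS_ROOT"/bus/usb/devices/*; do
        # Interface entries (e.g. 1-10:1.0) have no idVendor file; skip them.
        [ -r "$dev/idVendor" ] && [ -r "$dev/idProduct" ] || continue
        [ "$(cat "$dev/idVendor")" = "$vendor" ] || continue
        [ "$(cat "$dev/idProduct")" = "$product" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/power/control")"
    done
}

# Only touch udev when explicitly asked to (needs root).
if [ "${1:-}" = "--apply" ]; then
    udevadm control --reload-rules
    # The rule matches ACTION=="add", so replay "add" events for USB devices.
    udevadm trigger --subsystem-match=usb --action=add
    usb_power_control 8087 0032
fi
```

If the rule matched, the adapter's line should end in "on" rather than "auto", meaning the kernel will no longer autosuspend it.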

I also edited /etc/default/grub and added usbcore.autosuspend=-1 to the GRUB_CMDLINE_LINUX_DEFAULT line, followed by an update-grub, just in case. Then, a reboot and done.
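
For anyone who prefers to script the GRUB edit rather than do it by hand, here is a minimal sketch. The add_grub_param function and the --apply flag are hypothetical names, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes on a single line and that the parameter contains no "|" or "&" characters.

```shell
#!/usr/bin/env bash
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already present, so running this twice does not duplicate it.
add_grub_param() {
    local file="$1" param="$2" tmp
    # Fixed-string match on the relevant line only.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    tmp="$(mktemp)"
    # Re-emit the line with the parameter added before the closing quote.
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Only modify the real config when explicitly asked to (needs root).
if [ "${1:-}" = "--apply" ]; then
    add_grub_param /etc/default/grub "usbcore.autosuspend=-1" && update-grub
fi
```

After the reboot, cat /sys/module/usbcore/parameters/autosuspend should print -1 if the parameter was picked up.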

As I am updating this article, 33 days have passed with no BT crashes so this seems to have done the job!