Got a spare device that can capture the wifi channel over the air rather than the endpoints?
RunningMan:
Got a spare device that can capture the wifi channel over the air rather than the endpoints?
Hmm. Not on that wifi AP, since it's my tractor shed. But the other pi that has difficulties is on an AP that is closer to lots of things - including my MacBook, which I guess could work.
In the Unifi software, click on your AP and then on its tools icon at the top (crossed screwdriver and spanner). Scroll down the tools page and at the bottom there should be a Debug Terminal option. You should be able to run tcpdump from there. Make sure that if you are storing the output, it goes to a RAM partition (eg /var/log) - you will kill the flash pretty quickly if you write the output to it. It is also possible to set up Ubiquiti APs to allow direct ssh access to their command line. When I did that, my FlexHD AP said it was running a customised version of OpenWRT.
I'm able to SSH in and run tcpdump. My problem is as described above - I see a bunch of network interfaces. I can run tcpdump on the bridge interface, but I don't know whether that shows inbound or outbound packets. There are two wifi interfaces - wifi0 and wifi1 - but when I run tcpdump on them I don't get any packets that look like real data transmissions, and definitely no DHCP. One seems to have no traffic, the other looks to have management traffic. I wonder if they're some sort of virtual interface with specific management traffic on them.
EDIT: this morning I took a closer look at the other devices present. There is traffic on ath0, ath5 and ath6, so I ran tcpdump against all of them as well. I also have addressing problems on the pi again today. I think I can see offer packets (presumably coming in) on ath0 and ath5. I see DHCP requests on ath6, but no offer packets (presumably going out). I interpret that as meaning that the AP is receiving but not transmitting the offer packets.
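For anyone trying to repeat the "which interface sees the offer" check from a saved capture, here is a minimal Python sketch that pulls the DHCP message type out of a raw BOOTP/DHCP UDP payload. It relies only on the standard layout (236-byte fixed header, magic cookie, then options, with option 53 carrying the message type). The function name `dhcp_message_type` is my own, and it assumes you have already extracted the UDP payload bytes from the capture by some other means. It ignores option overload (option 52), so treat it as a rough aid rather than a full parser.

```python
# DHCP message type codes carried in option 53 (RFC 2132).
DHCP_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}

BOOTP_FIXED_LEN = 236               # fixed BOOTP header before the options
MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options


def dhcp_message_type(payload):
    """Return the DHCP message type name for a UDP payload, or None."""
    start = BOOTP_FIXED_LEN + len(MAGIC_COOKIE)
    if len(payload) < start:
        return None
    if payload[BOOTP_FIXED_LEN:start] != MAGIC_COOKIE:
        return None
    i = start
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option: a single byte, no length field
            i += 1
            continue
        if code == 255:      # end option
            break
        if i + 1 >= len(payload):
            break            # truncated option, give up
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return DHCP_TYPES.get(value[0])
        i += 2 + length
    return None
```

Running this over the payloads captured on each interface and counting the results per interface would give the same picture as eyeballing the tcpdump output: requests present on one side, offers or ACKs missing on the other.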
Digging around some more, I find on reddit this information: https://www.reddit.com/r/Ubiquiti/comments/mlit54/problems_with_broadcastmulticast_traffic_on_uap/, which basically points the finger at rekeying on the access point, and a bug where rekeying on one SSID also changes the keying on other SSIDs, but without the clients being informed the keying has changed. Sounds very plausible.
Next step is rebooting the AP, as that would resolve the problem if it were occurring.
EDIT2: and rebooting got me a DHCP address. Looking at the rekey intervals, I have 4 SSIDs running (plus whatever unifi does for the mesh). Two have auto rekey intervals, two have 3600 second rekey intervals. So that ties to the behaviour discussed in that thread above. The main problem is that it's claimed to be fixed in the current version of firmware, and my AP is upgraded to that current version. I'll leave it running as-is for a while to see if the behaviour comes back.
And if it is a bug in the AP, then capturing on the AP itself may give erroneous results, hence the suggestion to capture the wireless link with a separate device.
RunningMan:
And if it is a bug in the AP, then capturing on the AP itself may give erroneous results, hence the suggestion to capture the wireless link with a separate device.
Aaand - power cut. So guess everything's rebooted now. :-)
The suggestion in that thread above was that it was a WPA keying issue. A couple of things about that don't quite add up:
1. If it were a key issue, then the AP would still be broadcasting the packet, and the pi just wouldn't be able to decrypt it. That's not what I'm seeing, although there are a fair few assumptions in my diagnosis, so it's possible that is what I'm seeing and I'm misinterpreting it.
2. The issue is supposedly fixed in the most recent firmware, which I'm running
3. That particular AP is a mesh AP, so two hops within a unifi AP. There's no particular reason to think the second AP would be the one dropping the packets as opposed to the first AP, I guess it's 50/50 so maybe that's just the way it worked out today. Or maybe the mesh backhaul is a bit more "unifi" and doesn't have the keying issue.
So, back to waiting for recurrence so I can get more diagnostics, using my laptop and the inside AP that also occasionally does this. I may shorten the DHCP leases down to 10 minutes to make it more likely to occur, then see what I get. On the upside, the suggestion inside all this is that the USG isn't the problem, it's the APs. I still have another line of enquiry though, which is that it didn't seem to happen when using the pihole as a DHCP server, even though that's also a dnsmasq implementation. So there's still that to look at if I don't get progress on this one.
Was the Pi in the same physical location in the network as the current DHCP server - if it is going through a different combo of APs, that may skew your result if an AP is the issue.
RunningMan:
Was the Pi in the same physical location in the network as the current DHCP server - if it is going through a different combo of APs, that may skew your result if an AP is the issue.
It may, but a couple of different pis have both exhibited the behaviour, and they are in different locations in terms of topology - one is connected to a UAP-AC-M that is mesh connected to a UAP-AC-PRO, then to the switch and the USG (DHCP server). The other is directly connected to a different UAP-AC-PRO. Given that they have both exhibited the behaviour, I don't think that the UAP-AC-M is the only piece of infrastructure doing it. It does happen more often on the pi connected to the AC-M, but that may simply be because it's two hops, and so twice the potential for there to be a problem.
It's pissing down here. I'm not going out in the rain to sit in my tractor shed and try to diagnose an intermittent problem. I'll work with the AP that's inside.
OK, I've spent a chunk of time on this this morning. In summary, I think it's the AP, and I think changing the GTK rekeying interval to 0 either fixes it or significantly mitigates it.
I set my DHCP leases and GTK rekeying to 600 seconds (10 minutes) on one of the SSIDs. This forces a DHCP request every 5 minutes. I ran tcpdump on the pi, on the AP (actually my UAP-AC-LR), and on my MacBook connected to the UAP-AC-LR.
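The "every 5 minutes" figure comes from the default DHCP timers: unless the server says otherwise, a client starts renewing at T1 = 50% of the lease and falls back to rebinding at T2 = 87.5% of the lease (RFC 2131). A tiny sketch of that arithmetic, with `dhcp_timers` being just an illustrative name and assuming the server does not override the defaults:

```python
def dhcp_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = lease_seconds * 0.5     # client starts renewing
    t2 = lease_seconds * 0.875   # client starts rebinding
    return t1, t2


# A 600 second lease renews at 300 s; a 7200 second lease at 3600 s.
for lease in (600, 7200):
    t1, t2 = dhcp_timers(lease)
    print(f"lease={lease}s renew at {t1:.0f}s rebind at {t2:.0f}s")
```

So a short lease is a cheap way to generate regular DHCP traffic for debugging without touching the clients.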
I can see that there's a 10 minute period where requests from the pi receive an ACK from the DHCP server (the USG). That is followed by a 10 minute period where requests from the pi have an ACK visible on the AP, but the ACK never reaches the pi or the MacBook. Then a 10 minute period where it doesn't work again. The thread I linked above suggested this is a defect where, if one SSID has a different rekeying interval than the other SSIDs, broadcast packets break on those other SSIDs, because internally the AP thinks it has rekeyed all the SSIDs when it has actually only rekeyed one of them.
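If you log each DHCP request along with whether an ACK actually arrived at the client, the working and broken periods can be pulled out mechanically rather than by scrolling through tcpdump output. A rough sketch, assuming you have already reduced the capture to `(timestamp_seconds, acked)` pairs sorted by time; the function name and the sample numbers are made up for illustration and are not my real capture:

```python
from itertools import groupby


def outcome_windows(events):
    """Collapse (timestamp, acked) pairs into runs of the same outcome.

    Returns a list of (first_timestamp, last_timestamp, acked, count).
    """
    windows = []
    for acked, group in groupby(events, key=lambda event: event[1]):
        items = list(group)
        windows.append((items[0][0], items[-1][0], acked, len(items)))
    return windows


# Hypothetical data: one request every 300 s, two good, two bad, one good.
sample = [(0, True), (300, True), (600, False), (900, False), (1200, True)]
for start, end, acked, count in outcome_windows(sample):
    state = "ACK seen" if acked else "no ACK"
    print(f"{start:>5}s to {end:>5}s: {state} ({count} requests)")
```

Comparing the window boundaries against the rekey interval is then a quick way to see whether the breakage lines up with rekeying.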
So my symptoms are broadly consistent with the description given in that thread https://www.reddit.com/r/Ubiquiti/comments/mlit54/problems_with_broadcastmulticast_traffic_on_uap/ but subtly different.
I then set the rekeying interval to zero on the four SSIDs, and have had no issues since. I still have an outstanding concern, as apparently the mesh backhaul runs on a different SSID, and that potentially also has a rekeying interval that can break things. I haven't found any information on what that rekeying interval is. It matters because there is a suggestion that setting all the rekeying intervals the same will also mostly work (they cycle give or take a few seconds, so you'd be unlucky to lose your broadcast DHCP packets in that time).
I'm learning that my unifi gear I've had for a few years and thought was pretty good is actually growing a stack of defects. I also see that my USG doesn't have any sensible replacement, despite starting to get a bit old. I'd say over the next couple of years I'll be moving back out of the Unifi ecosystem, but I really don't want to go to Cisco, so I'll need to think about where I go instead.
MikroTik for routing, Grandstream / Aruba / ? for WiFi?
I think final update.
The backhaul rekeying is every 3600 seconds, you can find that by logging onto the AP, and looking in /etc/*.cfg - basically there's a config for each SSID there.
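If you copy those config files off the AP, something like the following Python sketch will list every rekey-related setting per file. It assumes the files are plain `key=value` text and that the relevant keys contain the word "rekey" (hostapd, for example, uses `wpa_group_rekey`); I haven't verified the exact key names on every Unifi model, so treat both as assumptions. `find_rekey_settings` is just an illustrative name.

```python
import glob


def find_rekey_settings(pattern):
    """Return (path, key, value) for every key=value line whose key mentions 'rekey'."""
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for raw in handle:
                line = raw.strip()
                # Skip blanks, comments and anything that is not key=value.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                if "rekey" in key.lower():
                    results.append((path, key.strip(), value.strip()))
    return results


# Example: point it at a local copy of the AP's config files.
# for path, key, value in find_rekey_settings("ap-config/*.cfg"):
#     print(path, key, value)
```

That gives a quick side-by-side of the intervals for each SSID, including the backhaul one.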
I tried with all the SSIDs set to rekey every 3600 seconds, but I was finding about 5 minutes every hour where DHCP was still going wrong (this with DHCP leases set to expire every 5 minutes, so that I had enough traffic to debug it relatively quickly).
Going back to no rekeying on the SSIDs, and leaving the backhaul rekeying every 3600 seconds (don't think I can change that) I was seeing no dropouts where DHCP packets weren't acknowledged. I've now moved the DHCP leases back to 7200 seconds, and will leave it that way for a week or so just to make sure there are no glitches. So far it looks clean - no lost IP addresses.
It looks like I'm now stable, at the cost of no rekeying. Where I live, with no near neighbours (and those there are, who can still just get reception of our network, are 80 years old....unlikely to hack it I'd guess), I think I can live with the security degradation. I don't have the appetite to debug/diagnose the problem, nor to work through tickets with Unifi support to resolve it.