TL;DR: My USG believes it's still delivering DHCP offer packets to my Raspberry Pi(s), but the pis aren't receiving them. If I force reconnect a pi, it then does receive the offer packets. The problem is intermittent: it works for days, then the pi suddenly stops seeing the offer packets again. I assume some piece of networking equipment or configuration is filtering them out, but I have no idea why it would only happen intermittently. Looking for ideas on what I've missed, or on how I can diagnose it further.
THE NETWORK
Upstream I have Farmside RBI, with a Huawei modem. The Huawei is providing DHCP, but only to the USG.
The USG connects to the Huawei on its WAN port and receives its WAN-side IP address from the Huawei. The USG then provides DHCP to my network. Yes, that means double NAT. No, I haven't seen any problems from that, and Farmside/RBI is CG-NAT anyway, so it's not like I could port forward in any case.
LAN side, the USG connects to a third-party unmanaged switch, which connects to some ethernet-cabled devices and to 4 Unifi APs: two UAP-AC-PRO, one UAP-AC-LR, and one UAP-AC-M that is mesh connected to one of the PROs (it's in a shed outside the house, and it's too hard to trench an ethernet cable out to it). All the APs are PoE powered using injectors.
The APs run 3-4 different wifi networks. This is because I was having problems with my pis roaming between the APs and connecting to APs with really marginal connectivity, then basically stopping working. So I created some SSIDs that only broadcast from specified APs, to stop the fixed devices from roaming. With hindsight, I think that problem was probably the same one I still have: they were roaming because the AP they were on wasn't giving them an IP address. Then they'd lose IP connectivity on the new AP, which I attributed to that AP having poor reception and dropping packets, but it was probably just that something's funny in my DHCP.
THE DHCP CONFIGURATION
My USG is running dnsmasq. I have a pihole in the network, and dnsmasq was needed for the pihole configuration. Other than that pihole configuration, there's nothing non-standard that I can see. The USG hands out IP addresses in the range 192.168.1.100 - 192.168.1.254. The fixed equipment (including the pis) has fixed IP addresses assigned in the Unifi controller. These are not static addresses set on the pi itself, so the pis still use DHCP to obtain their fixed addresses.
THE SYMPTOMS
Every few days the pis (and some other devices) will show as having a 169.x.x.x IP address and will no longer be reachable. When I go into the Unifi controller and reconnect them, they get an IP address and start connecting again.
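A side note on that 169.x.x.x address: assuming it falls in 169.254.0.0/16, it is a link-local address that the client assigns to itself when DHCP fails, not something the USG handed out. A minimal sketch of how to check that in Python (the sample addresses are made up for illustration):

```python
import ipaddress

def is_self_assigned(addr: str) -> bool:
    """Return True if addr is a link-local address (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_address(addr).is_link_local

# Hypothetical addresses, not taken from the real network
print(is_self_assigned("169.254.12.34"))  # True
print(is_self_assigned("192.168.1.120"))  # False
```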
Checking the logs, I can see that the pis are asking for an IP lease renewal but not receiving it. In the logs on the USG I can see it was offering the address renewal.
I've gone the next step and run tcpdump on one of the pis, then waited for the problem to recur, which it did overnight. In summary, the capture tells me that everything works fine for a while, then the pi starts asking for a lease renewal and never sees a reply from the USG. Eventually it loses the IP address and shifts to a 169.x.x.x address, but keeps asking for a DHCP address every couple of minutes. It never sees a DHCP offer in response. At the USG end, the USG is consistently sending DHCP offer packets, but not getting a DHCP request back from the pi.
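For reference, the sequence described above matches the standard DHCP timers. By default (RFC 2131) a client first tries to renew at 50% of the lease (T1, unicast to the server), falls back to broadcasting at 87.5% (T2), and gives up the address when the lease expires; a server can override these fractions with options 58 and 59. A minimal sketch, assuming the default fractions and a hypothetical 24-hour lease (the actual lease time on the USG isn't stated in this post):

```python
from datetime import datetime, timedelta

def dhcp_timers(lease_start: datetime, lease_seconds: int) -> dict:
    """Compute the default RFC 2131 renewal milestones for one lease."""
    return {
        "T1_renew_unicast": lease_start + timedelta(seconds=lease_seconds * 0.5),
        "T2_rebind_broadcast": lease_start + timedelta(seconds=lease_seconds * 0.875),
        "expiry": lease_start + timedelta(seconds=lease_seconds),
    }

# Hypothetical values, for illustration only
start = datetime(2024, 1, 1, 0, 0, 0)
for name, when in dhcp_timers(start, 86400).items():
    print(f"{name}: {when}")
```

Lining up the capture timestamps against these milestones should show whether the unanswered packets are unicast renewals, broadcast rebinds, or fresh discovers after expiry, which narrows down which kind of reply is going missing.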
THE LOGS
On the pi, I run:
tcpdump -v -n -e -i wlx001986410bb7 port 67 or port 68
Hmm. I have no good way to paste in the logs. So I'm going to link to the thread on the Unifi community, which has nicer formatting:
https://community.ui.com/questions/USG-responding-to-DHCP-but-clients-not-receiving/f56af957-5e20-4115-9536-ac268501f2ba (last post has the relevant logs).
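Since a capture like this gets long, one way to boil it down is to count DHCP message types per sender. The sketch below assumes the tcpdump output was saved as text and that the verbose output contains lines of the form "DHCP-Message Option 53, length 1: Discover". The exact wording differs between tcpdump versions, so the regular expressions are an assumption and may need adjusting. The sample text, including the 192.168.1.1 server address and the MAC, is invented for illustration and is not taken from the real logs.

```python
import re
from collections import Counter

# Assumed line shapes in `tcpdump -v` text output; adjust for your tcpdump version.
ADDR_LINE = re.compile(r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): BOOTP/DHCP")
TYPE_LINE = re.compile(r"DHCP-Message.*:\s*(\w+)")

def summarise(capture_text: str) -> Counter:
    """Count (source IP, DHCP message type) pairs in tcpdump text output."""
    counts = Counter()
    src = None
    for line in capture_text.splitlines():
        m = ADDR_LINE.search(line)
        if m:
            src = m.group(1)  # remember the sender until its message type appears
            continue
        m = TYPE_LINE.search(line)
        if m and src is not None:
            counts[(src, m.group(1))] += 1
            src = None
    return counts

# Invented sample, illustrative only
sample = """\
    0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from aa:bb:cc:dd:ee:ff, length 300
        DHCP-Message Option 53, length 1: Discover
    192.168.1.1.67 > 192.168.1.120.68: BOOTP/DHCP, Reply, length 300
        DHCP-Message Option 53, length 1: Offer
    0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from aa:bb:cc:dd:ee:ff, length 300
        DHCP-Message Option 53, length 1: Discover
"""
for key, n in summarise(sample).items():
    print(key, n)
```

Run against a capture taken on the pi, a healthy period should show entries from the USG's address alongside the client's own; in the failing window described above, only the client-side entries would be expected.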
HYPOTHESES
I have a few thoughts on what could be going on, but I haven't found an easy way to validate any of them:
- I've seen reports that the pi is fussy about packet checksums and discards packets it decides are malformed. But I have no idea why it would do that intermittently, and in such an oddly consistent way: the packets are all good for a while, then all bad for a while, until a force reconnect
- The USG tends to forget things. For example, it often can't reliably tell me which devices are connected to it (for ethernet-connected clients), even though it has given them all IP addresses. This is particularly true of my virtual servers. So I could imagine that the router has "misplaced" some devices and is somehow routing to them incorrectly. But if that were true, it would logically be the switch discarding the packets, as it's the device that decides which port to send things down. If the Unifi gear is losing the device, then I could see why a force reconnect would fix it, since that would refresh the routing information. But I don't see how that would affect the switch. And it wouldn't explain why the controller can still see the pi and knows it has a 169.x.x.x address; surely that means it can see the pi and knows where it is
- I do have two virtual networks running (one without the pihole configured, one with). None of the devices in question are on the second network, but again, the Unifi stuff can be flaky, so perhaps that's upsetting it somehow?
- I have configured the pihole to act as my DHCP server before. I prefer not to, because after a power cut things don't start cleanly: the pihole is a virtual server and doesn't come up until the physical server gets an IP address, which it won't if the pihole is the DHCP server. I can get around that by giving the server a static IP address, but I don't like that. There's no good reason, it's just preference. I've had it running that way and it works, but I like my DHCP server to have allocated all the addresses in my network
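
On the checksum hypothesis in the first bullet: IP and UDP both use the 16-bit one's-complement "Internet checksum" (RFC 1071), and a receiver is entitled to drop packets that fail it. A minimal sketch of the check, using a textbook IPv4 header rather than bytes from the actual capture (UDP additionally covers a pseudo-header, which is left out here):

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's complement of the one's-complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Illustrative IPv4 header (not from the real capture); its checksum field is b861
header = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(internet_checksum(header) == 0)  # a header with a valid checksum sums to zero
```

One caveat when reading checksum warnings in a capture: packets sent by the capturing machine itself can be reported as bad when the network hardware fills in the checksum after the capture point, so warnings on received packets are likely the more meaningful ones.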