Intermittent connector offline following a single DNS timeout < 1 sec

Several times a day, I get an email that my AWS AMI / EC2 connector has gone offline, typically followed by a “back online” email several minutes later. The net effect is that the connector is “online” and working for ~23.75 hours per day.

When I look at the connector logs for an alleged offline period (below), it appears that a single failed DNS lookup, with a tight timeout of a fraction of a second, is leading the connector to declare itself offline. (I’ve omitted many irrelevant preceding log lines from the same second, which is why I believe the timeout happened not just within the same log second but within a small fraction of it.)

As a workaround, I have just reconfigured my DNS resolver in the hope of more consistent performance. We’ll see if that fixes the issue. (The previous DNS setup was a private resolver handling just the private subnet and forwarding all other queries to the AWS-supplied resolver on my VPC. I have no other evidence that this DNS setup was broken in any way.)
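For context, the old configuration looked roughly like this (an Unbound-style sketch; the zone name, subnet, and resolver address are placeholders I’ve made up for illustration, though 169.254.169.253 is AWS’s published link-local address for the VPC-provided resolver):

# unbound.conf — sketch of the previous, forwarding setup
server:
    interface: 10.0.0.2                        # hypothetical resolver address
    access-control: 10.0.0.0/16 allow          # hypothetical private subnet
    local-zone: "corp.internal." static        # hypothetical private zone, answered locally
    local-data: "host1.corp.internal. A 10.0.0.11"

forward-zone:
    name: "."                                  # everything else…
    forward-addr: 169.254.169.253              # …forwarded to the AWS VPC resolver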

If that doesn’t do it, can I configure the connector to use a less aggressive DNS timeout, to not declare itself offline after a single failure, or to otherwise mitigate the issue?

This is not an academic problem. I am not currently routing actual user traffic through this connector, but back when I had several dozen users going through a similar setup, it produced not-infrequent complaints of Twingate instability that persisted for weeks, until I finally switched all users to a single hardcoded IP, which resolved the complaints.

Oct 10 02:09:36 ip-* twingate-connector[1238]: [DEBUG] [libsdwan] heartbeat payload: {}
Oct 10 02:09:36 ip-* twingate-connector[1238]: [DEBUG] [libsdwan] send: sending HTTP request 3296
Oct 10 02:09:36 ip-* twingate-connector[1238]: [DEBUG] [libsdwan] http::request::send_request: POST "https://*.twingate.com/api/v4/connector/heartbeat" application/json
Oct 10 02:09:36 ip-* twingate-connector[1238]: [msg] Nameserver ...:53 has failed: request timed out.
Oct 10 02:09:36 ip-* twingate-connector[1238]: [msg] All nameservers have failed
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] http::request::handle_response: POST "https://*.twingate.com/api/v5/connector/refresh" failed - dns error: -4 (non-recoverable failure in name resolution), socket error: 11 (Resource temporarily unavailable), tls error: 0 ((null))
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] operator(): failed HTTP request 1690 -1 dns error: non-recoverable failure in name resolution
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] http::request::handle_response: POST "https://*.twingate.com/api/v4/connector/heartbeat" failed - dns error: -4 (non-recoverable failure in name resolution), socket error: 11 (Resource temporarily unavailable), tls error: 0 ((null))
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] operator(): failed HTTP request 3296 -1 dns error: non-recoverable failure in name resolution
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] [controller] operator(): heartbeat failed "{*}": dns error: non-recoverable failure in name resolution

It’s only been three days, but switching my DNS resolver out of forwarding mode and enabling the prefetch, prefetch DNSKEY, and serve-expired options appears to have stopped the connector / Twingate connectivity symptoms.
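For anyone else hitting this, the change amounts to something like the following (again an Unbound-style sketch; these option names are Unbound’s, so adapt them to your resolver software):

# unbound.conf — sketch of the new, full-recursion setup
server:
    # No forward-zone stanza: recurse directly instead of forwarding
    prefetch: yes        # refresh popular cache entries before they expire
    prefetch-key: yes    # fetch DNSKEY records early during validation
    serve-expired: yes   # answer from expired cache while refreshing in the background

The serve-expired option in particular should mask exactly this failure mode: a slow upstream lookup gets answered from cache immediately instead of tripping the connector’s sub-second timeout.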

While this treats the symptom, I still feel it’s a bug that the previous configuration led to connectivity failures. That configuration had been in use for several years for every other purpose and had never drawn complaints about reliability or performance.

Hey Dave,

Thanks for sharing. I do find it odd that a normally working DNS setup like that would not play nice with the connector. Just to confirm: in the log lines within, say, three minutes on either side of what you pasted, is that the only block of DNS failures? I believe the normal “window” of failure that triggers the “connector down” notification is five minutes with no connectivity.

Is that EC2 instance doing anything else or is it just hosting the connector?

Thanks again for sharing your findings!