Several times a day, I get an email saying that my AWS AMI / EC2 connector has gone offline. It is typically followed by a “back online” email several minutes later, so the connector is online and working for ~23.75 hours per day.
When I look at the connector logs for one of these alleged offline periods (below), it appears that a single failed DNS lookup, with a tight timeout of a fraction of a second, leads the connector to declare itself offline. (I’ve omitted many irrelevant preceding log lines from the same second, which is why I believe the timeout elapsed within a small fraction of that second, not merely within the same logged second.)
As a workaround, I have just reconfigured my DNS resolver, hopefully for more consistent performance. We’ll see if that fixes the issue. (The previous DNS setup was a private resolver handling only the private subnet and forwarding all other queries to the AWS-supplied resolver on my VPC. I have no other evidence that the DNS setup was broken in any way.)
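To check whether the new resolver setup really behaves more consistently, something like the following could be run on the connector host. It is only a minimal sketch: the function names `probe_dns` and `summarize` and the 0.5 second "slow" threshold are my own assumptions, and it goes through the operating system's resolver via Python's `socket.getaddrinfo`, which may be cached and is not necessarily the same code path or timeout the connector itself uses.

```python
import socket
import time


def probe_dns(hostname, attempts=20, pause_s=0.0, resolve=socket.getaddrinfo):
    """Resolve hostname repeatedly.

    Returns a list of (elapsed_seconds, error_message_or_None) tuples.
    The resolve parameter is injectable so the probe can be exercised
    without touching the network.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolve(hostname, 443)
            error = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc)
        results.append((time.monotonic() - start, error))
        time.sleep(pause_s)
    return results


def summarize(results, slow_threshold_s=0.5):
    """Count failed and slow lookups; the threshold is an arbitrary assumption."""
    failures = sum(1 for _, err in results if err is not None)
    slow = sum(
        1 for elapsed, err in results
        if err is None and elapsed > slow_threshold_s
    )
    worst = max((elapsed for elapsed, _ in results), default=0.0)
    return {
        "attempts": len(results),
        "failures": failures,
        "slow": slow,
        "worst_s": worst,
    }
```

Calling `summarize(probe_dns(hostname, attempts=60, pause_s=1.0))` with the controller hostname from the connector logs, before and after a resolver change, would give a rough comparison. A clean result here would not prove the connector's own lookups are fine, but any failures or slow lookups would point at the resolver rather than at the connector.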
If that doesn’t do it, can I configure the connector to use a less aggressive DNS timeout, to not go offline after a single failure, or to otherwise mitigate the issue?
This is not an academic problem. I am not currently routing real user traffic through this connector, but back when I had several dozen users going through a similar setup, it led to fairly frequent complaints of Twingate instability over a period of weeks, until I finally switched all users to a single hardcoded IP, which resolved the complaints.
Oct 10 02:09:36 ip-* twingate-connector[1238]: [DEBUG] [libsdwan] heartbeat payload: {}
Oct 10 02:09:36 ip- twingate-connector[1238]: [DEBUG] [libsdwan] send: sending HTTP request 3296
Oct 10 02:09:36 ip- twingate-connector[1238]: [DEBUG] [libsdwan] http::request::send_request: POST "https://.twingate.com/api/v4/connector/heartbeat" application/json
Oct 10 02:09:36 ip- twingate-connector[1238]: [msg] Nameserver ...:53 has failed: request timed out.
Oct 10 02:09:36 ip-* twingate-connector[1238]: [msg] All nameservers have failed
Oct 10 02:09:36 ip-* twingate-connector[1238]: [WARN] [libsdwan] http::request::handle_response: POST "https://.twingate.com/api/v5/connector/refresh" failed - dns error: -4 (non-recoverable failure in name resolution), socket error: 11 (Resource temporarily unavailable), tls error: 0 ((null))
Oct 10 02:09:36 ip- twingate-connector[1238]: [WARN] [libsdwan] operator(): failed HTTP request 1690 -1 dns error: non-recoverable failure in name resolution
Oct 10 02:09:36 ip- twingate-connector[1238]: [WARN] [libsdwan] http::request::handle_response: POST "https://.twingate.com/api/v4/connector/heartbeat" failed - dns error: -4 (non-recoverable failure in name resolution), socket error: 11 (Resource temporarily unavailable), tls error: 0 ((null))
Oct 10 02:09:36 ip- twingate-connector[1238]: [WARN] [libsdwan] operator(): failed HTTP request 3296 -1 dns error: non-recoverable failure in name resolution
Oct 10 02:09:36 ip- twingate-connector[1238]: [WARN] [libsdwan] [controller] operator(): heartbeat failed “{*}”: dns error: non-recoverable failure in name resolution
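To confirm that every offline email lines up with one of these DNS failures, the log could be scanned for the "All nameservers have failed" message and the matching timestamps compared against the email times. The following is a rough sketch that assumes the syslog-style timestamp prefix shown in the excerpt above; the function name `dns_failure_times` is mine.

```python
import re
from collections import Counter

# Matches the syslog-style prefix seen in the excerpt, e.g. "Oct 10 02:09:36".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})")
# Message text copied from the log excerpt.
MARKER = "All nameservers have failed"


def dns_failure_times(lines):
    """Count lines containing MARKER, keyed by their timestamp string."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            match = TIMESTAMP.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Passing an open file of saved connector logs to `dns_failure_times` would give one entry per second in which all nameservers failed. If some offline emails have no matching entry, something other than DNS is also involved.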