DNS problem: query timed out looking up CAA (using Netregistry)

The automatic update failed, and we are seeing the following errors when we try to update the certificates manually by running sudo ./encrypt.sh:

Failed authorization procedure. www.guidedogswa.com.au (tls-sni-01): urn:acme:error:connection :: The server could not connect to the client to verify the domain :: DNS problem: query timed out looking up CAA for www.guidedogswa.com.au

And the same for the foundation domain:
Failed authorization procedure. www.guidedogfoundation.com.au (tls-sni-01): urn:acme:error:connection :: The server could not connect to the client to verify the domain :: DNS problem: query timed out looking up CAA for www.guidedogfoundation.com.au

These are the commands:

./letsencrypt/letsencrypt-auto -q certonly --standalone --email technical@longtail.com.au --agree-tos -d www.guidedogswa.com.au
./letsencrypt/letsencrypt-auto -q certonly --standalone --email technical@longtail.com.au --agree-tos -d guidedogswa.com.au

Yeah, your DNS provider seems to have issues with CAA queries.

You may have to switch to a different DNS provider.

That's not practical. There also seems to be little definitive proof of what is causing this issue. I'm trying to find anyone who has actually resolved it.

Client sends a CAA query to Netregistry. Netregistry never replies.

$ digr guidedogswa.com.au @ns2.netregistry.net.

; <<>> DiG 9.10.3-P4-Ubuntu <<>> +norecurse guidedogswa.com.au @ns2.netregistry.net.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18625
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;guidedogswa.com.au.            IN      A

;; ANSWER SECTION:
guidedogswa.com.au.     3600    IN      A       52.63.85.161

;; Query time: 227 msec
;; SERVER: 203.55.143.100#53(203.55.143.100)
;; WHEN: Wed Apr 26 03:04:48 UTC 2017
;; MSG SIZE  rcvd: 52

$ digr guidedogswa.com.au caa @ns2.netregistry.net.

; <<>> DiG 9.10.3-P4-Ubuntu <<>> +norecurse guidedogswa.com.au caa @ns2.netregistry.net.
;; global options: +cmd
;; connection timed out; no servers could be reached

There’s nothing anyone without operational control over Netregistry’s DNS infrastructure can do to solve the problem.


I'm having the exact same error, and the only other references to it that I've seen are Australian websites hosted on NetRegistry's DNS.

We're completely stuck at the moment; we might need to take over DNS control from the client.

Hi, could this be related to my problem? My main domain is with Netregistry.

Well, we had Netregistry refresh the domain so the SOA records show up fine, and everything else seems good, but we still get the CAA record error.

Have logged a support case with them.


This problem has affected a lot of our clients. I’m fairly sure NetRegistry is the number one domain registrar in Australia, so a huge number of people with .au domains use their DNS services as the default. But it’s not solely NetRegistry; I’ve also seen the problem on a domain using VentraIP (ventraip.net.au) which I don’t believe is a related company.

I know this isn’t the fault of Let’s Encrypt, but I’m not really sure how to proceed here. Forcing our customers to change DNS providers is not really a practical option. I can complain to NetRegistry et al., but in the meantime a bunch of websites will go down if we can’t renew their certs in the next few days. It’s a difficult situation.


We’re also seeing many of our customer domains that use Netregistry (or one of their resellers) fail DV because of DNS CAA query timeouts over UDP. Yes, the same query does work and does not time out when done over TCP.

Something appears to have recently changed at LE, since the domains in my case did initially pass DV and had their respective certs issued.

Here is what I know about my cases so far…

  • All domains have their authoritative DNS provided by Australian based Netregistry or one of their resellers.
  • All are (or were) on active LE issued certs and all passed DV in the past.
  • None have changed their DNS provider since initially passing DV.
  • The CAA related failures I’m observing are all for DV renewals.
  • None of the domains actually have any CAA records present.
  • The CAA queries all return successfully when done via TCP; they all fail (with timeout) when done via UDP.
  • In support calls, Netregistry reseller TPPWholesale acknowledged that there may be an issue with the BIND version they run and CAA queries on UDP.

Here’s an example (using the domain posted by cbertozz at the start of this thread) that shows the CAA UDP vs TCP check…

This CAA check via TCP works:
dig CAA guidedogswa.com.au. @ns1.netregistry.net. +tcp
; <<>> DiG 9.10.2 <<>> CAA guidedogswa.com.au. @ns1.netregistry.net. +tcp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38281
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;guidedogswa.com.au. IN CAA

;; AUTHORITY SECTION:
guidedogswa.com.au. 3600 IN SOA ns1.netregistry.net. dmain.netregistry.net. 2017030818 86400 7200 3600000 172800

;; Query time: 320 msec
;; SERVER: 203.55.143.10#53(203.55.143.10)
;; WHEN: Wed Apr 26 13:42:10 Eastern Daylight Time 2017
;; MSG SIZE rcvd: 97

This CAA check via UDP fails with a timeout:
dig CAA guidedogswa.com.au. @ns1.netregistry.net. +notcp
; <<>> DiG 9.10.2 <<>> CAA guidedogswa.com.au. @ns1.netregistry.net. +notcp
;; global options: +cmd
;; connection timed out; no servers could be reached
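
For anyone who wants to reproduce the UDP vs TCP comparison without dig, here is a minimal sketch using only the Python standard library. It is an illustration, not what Let's Encrypt runs: the function names (build_query, query_udp, query_tcp, rcode), the 5-second timeout and the fixed query ID are all assumptions made for this example. It hand-encodes a single CAA question (record type 257) and sends the same bytes over each transport.

```python
import socket
import struct

CAA_TYPE = 257  # DNS resource record type number for CAA
IN_CLASS = 1    # Internet class


def build_query(name, qtype=CAA_TYPE, query_id=0x1234):
    """Encode a one-question DNS query with recursion not requested."""
    # Header: id, flags (all zero), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, IN_CLASS)


def query_udp(server, name, timeout=5.0):
    """Send the CAA query over UDP; raises socket.timeout if no reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection early")
        buf += chunk
    return buf


def query_tcp(server, name, timeout=5.0):
    """Send the same query over TCP, which adds a 2-byte length prefix."""
    msg = build_query(name)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(msg)) + msg)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def rcode(response):
    """Extract the 4-bit response code (0 = NOERROR, 2 = SERVFAIL)."""
    return response[3] & 0x0F
```

Calling query_tcp("ns1.netregistry.net", "guidedogswa.com.au") and then query_udp with the same arguments should mirror the two dig runs above: if the nameserver behaves as described in this thread, the TCP call returns a response whose rcode is 0, while the UDP call raises socket.timeout.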

Here are some specific tactical questions to LE regarding this issue…

  1. Why did LE’s CAA check succeed in the past?
  2. Why is LE only doing the “critical” CAA query via UDP and not via TCP too?
  3. What has changed in LE’s systems recently that may be causing new CAA failures?
  4. It should be noted that LE had significant issues with DNS CAA queries via UDP when they first went GA (Dec 2015). Is the case we’re seeing now a regression?

And the strategic ask…
I’d also like to ask LE to improve the robustness of their CAA check by including a check via TCP as well.
If LE is going to absolutely rely on CAA checks, then the onus should be on them and their systems to ensure they are absolutely correct and they have exhausted all possible ways for checking. Their present implementation for this appears to be flawed in this regard. That is, they can get the response for CAA if they tried it via TCP.

It should be noted that I’m not necessarily asking to relax the CAB’s CAA requirements; though that would also solve this.


Curiously, I was able to create certs a few weeks ago for a com.au domain and the www subdomain, but recently I have been unable to create a cert for another subdomain. I have posted on Stack Exchange.
I wonder if this has any relation to the recent DoS attacks on Netregistry's DNS servers, which have since been resolved?

Thanks for the detailed post!

Our past investigations into reachability issues with CAA and NetRegistry have indicated that they are most likely routing-related. I.e., some hop along the path from us to them seems to drop UDP DNS packets if they are of type CAA. Querying from some parts of the Internet results in a timeout; querying from other parts succeeds. Previously we were able to get around this with a big hack, routing DNS traffic to NetRegistry through one of our datacenters that seemed to be able to reach them reliably. It's possible routing tables have changed in such a way that that hack no longer works. We'll look into it.

Our Unbound is configured with the default behavior, to attempt TCP if UDP fails. Our past investigations showed TCP failing in the same way. However, it's possible the TCP fallback is not happening fast enough for the timeouts we have configured inside Boulder. We'll dig into this too. We've also been meaning to explore a "TCP first" lookup methodology, which might mitigate the CAA timeouts if it's currently true that TCP queries reliably succeed. (see this post)
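
To make the timing concern concrete, here is a rough back-of-the-envelope sketch. Every number and name in it is hypothetical (the real Unbound retry schedule and Boulder timeouts are not stated in this thread); it only shows why a fallback that works in principle can still miss an outer deadline, and why a "TCP first" order would avoid that when UDP never replies.

```python
def time_to_answer(strategy, udp_timeout, udp_attempts, tcp_time):
    """Seconds until an answer arrives, assuming UDP never replies and TCP works.

    All parameters are illustrative assumptions, not real resolver settings.
    """
    if strategy == "tcp_first":
        # TCP is tried immediately, so no time is lost waiting on UDP.
        return tcp_time
    if strategy == "udp_then_tcp":
        # Every UDP attempt must time out before TCP is even tried.
        return udp_timeout * udp_attempts + tcp_time
    raise ValueError("unknown strategy: " + strategy)


def answers_in_time(strategy, deadline, udp_timeout, udp_attempts, tcp_time):
    """True if the answer would arrive before the caller's own deadline."""
    return time_to_answer(strategy, udp_timeout, udp_attempts, tcp_time) <= deadline
```

With made-up values of a 3-second UDP timeout, 2 UDP attempts and 0.5 seconds for the TCP exchange, the fallback path needs 6.5 seconds. A caller that gives up after 5 seconds would report a timeout even though TCP would have succeeded, whereas the TCP-first order answers in 0.5 seconds.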

Looking at our logs, we do see an increase in CAA-related timeouts on April 13. That happens to be the same day that NetRegistry had a major outage (1 2 3). It's possible that as part of their recovery from that outage, some routing properties changed that are causing this recurrence of timeout problems.

Thanks for bringing this to our attention; we'll work on getting it fixed.


I'd also like to ask LE to improve the robustness of their CAA check by including a check via TCP as well.
If LE is going to absolutely rely on CAA checks, then the onus should be on them and their systems to ensure they are absolutely correct and they have exhausted all possible ways for checking. Their present implementation for this appears to be flawed in this regard. That is, they can get the response for CAA if they tried it via TCP.

I'm quite curious about the tone here.

Are you benefiting from LetsEncrypt?

CAA records are also an optional step, so why not remove them?

Why do you feel the onus should be on LetsEncrypt (especially if they are offering the service at no cost)?

I think the onus is on you:

  • You are responsible for obtaining SSL certificates for your organisation
  • You have chosen to use LetsEncrypt
  • You have chosen to use NetRegistry
  • You have chosen to enable CAA (or maybe not) and this is what is failing

Staff at LetsEncrypt have always worked to help out, but I feel there should be a shared responsibility to get outcomes rather than palming it off.

Andrei


@ahaw021: FWIW, I thought that @LeonA's tone was fine. He's describing a valid technical problem, and politely asking for a solution.

This issue affects even people who haven't enabled CAA (most people). Basically, Let's Encrypt has to look up CAA all the time, just to find out whether a record is present or not.


Yes, we use LE extensively. But @ahaw021's assumptions are incorrect.

  • We have thousands of customer domains that we certify and service via Akamai with LE.
  • The service is not free.
  • Some customer domains happen to be with Netregistry.
  • None use CAA.
  • In this case all had successfully DV’d in the past but are now failing DV renewal due to CAA query timeouts on UDP.
  • The stakes are high in that previously certified customer domains may lose their certificates. This is causing operational and support issues for us.

The issue is not with proving domain ownership per se. The issue here is that regardless of DV method, LE always checks CAA as well. The CAA check must succeed (i.e. NS returns a result) even when no CAA record exists. The logic on the CAA check is one of “failsafe”; that is, if the authoritative NS cannot be reached (e.g. times out) for the CAA check then the entire DV is failed. In the case we have on this thread, the failure is with the check on UDP but TCP succeeds. This behavior is new and has not been seen since LE last made a fix for it when they went GA in late 2015.
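
The "failsafe" logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Boulder's actual code: the function name may_issue and its argument shape are made up for the example, and it ignores parts of real CAA processing such as climbing to parent domains, issuewild and critical flags.

```python
OUR_CA = "letsencrypt.org"  # CAA issuer identifier used by Let's Encrypt


def may_issue(issuers, ca=OUR_CA):
    """Decide whether issuance may proceed from the outcome of a CAA lookup.

    issuers is None when the lookup itself failed (timeout, SERVFAIL),
    an empty list when the nameserver answered but no CAA records exist,
    and otherwise the list of CA identifiers named in "issue" records.
    """
    if issuers is None:
        # Fail closed: no answer means we cannot rule out a restriction.
        return False
    if not issuers:
        # The server answered and there is no CAA policy: any CA may issue.
        return True
    # A CAA policy exists: issue only if this CA is listed.
    return ca in issuers
```

In this model the domains in this thread hit the first branch: they have no CAA records at all, but because the UDP query never gets an answer, the lookup outcome is None rather than an empty list, and issuance is refused.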

My general assertion is that if a CAA check is a must for DV and therefore cert issuance, then the implementation of the check must be robust and exhaustive. I think LE engineering recognizes this and they have taken it into consideration with their system design. However, it’s not working as designed now.


@jsha

FWIW, the first recent case I have of a failed CAA query via UDP to Netregistry occurred in early Feb 2017, well before the 13 Apr event.

For that domain, I ended up having to move it to a Symantec OV cert to avoid losing TLS.

Another user here with the same problem: DNS problem: SERVFAIL looking up CAA for www.imolaenergy.hu

@LeonA as of last night, I was able to reproduce timeouts on this query from multiple locations. However, as of today, all locations I've tested, except for our production datacenter, succeed on this query. To gather a bit more data: Do you still see timeouts for this query?

Thanks,
Jacob


@jsha

Wow. Something has finally changed for the better. Assuming you (LE) made no changes, it looks like Netregistry may have actually made a fix.

All of our previously failing CAA checks are now returning successfully on both UDP and TCP.

Next, I will re-initialize our provisioning jobs to see if the failed LE certificate enrollments and their respective DVs succeed.
I’ll post my results here when I know more…


I wouldn’t get your hopes up too much. Our internal tests so far are showing that we still get timeouts from our datacenter. My guess is that something changed routing-wise, and this went from an “everyone” problem to a “sometimes” problem. But do let us know if you see a success from your provisioning jobs.
