DNS problem: query timed out looking up CAA (using Netregistry)

Curiously I was able to create certs a few weeks ago for a com.au domain and the www subdomain but recently have been unable to create a cert for another subdomain. Have posted on Stack Exchange
I wonder if this is related to the recent DoS attacks on Netregistry's DNS servers, which have since been resolved?

Thanks for the detailed post!

Our past investigations into reachability issues with CAA and NetRegistry have indicated that they are most likely routing-related. That is, some hop along the path from us to them seems to drop UDP DNS packets if they are of type CAA. Querying from some parts of the Internet results in a timeout; querying from other parts succeeds. Previously we were able to get around this with a big hack: routing DNS traffic to NetRegistry through one of our datacenters that seemed to be able to reach them reliably. It's possible routing tables have changed in such a way that the hack no longer works. We'll look into it.

Our Unbound is configured with the default behavior, to attempt TCP if UDP fails. Our past investigations showed TCP failing in the same way. However, it's possible the TCP fallback is not happening fast enough for the timeouts we have configured inside Boulder. We'll dig into this too. We've also been meaning to explore a "TCP first" lookup methodology, which might mitigate the CAA timeouts if it's currently true that TCP queries reliably succeed. (see this post)
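To make the "TCP first" idea concrete, here is a minimal Python sketch of a CAA lookup that tries a list of transports in order and moves on when one times out. It is only an illustration under stated assumptions, not Boulder's or Unbound's actual code: the function names (`build_query`, `query_udp`, `query_tcp`, `lookup`) and the 2-second timeout are hypothetical, and the sketch sends a bare query with no EDNS or DNSSEC.

```python
import socket
import struct

CAA_TYPE = 257  # DNS RR type code assigned to CAA


def build_query(name, qtype=CAA_TYPE, query_id=0x1234):
    """Encode a minimal DNS query (one question, class IN, no recursion)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def query_udp(server, packet, timeout):
    """Send the query over UDP; raises socket.timeout if no reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def _recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_tcp(server, packet, timeout):
    """Send the query over TCP, where messages carry a 2-byte length prefix."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(packet)) + packet)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def lookup(server, packet, transports, timeout=2.0):
    """Try each (label, transport) pair in order; return the first reply."""
    last_error = None
    for label, transport in transports:
        try:
            return label, transport(server, packet, timeout)
        except OSError as exc:  # socket.timeout is a subclass of OSError
            last_error = exc
    raise TimeoutError("all transports failed") from last_error
```

A "TCP first" lookup would then be `lookup(server, build_query(name), [("tcp", query_tcp), ("udp", query_udp)])`, and swapping the order of the list gives UDP with an explicit fallback on timeout.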

Looking at our logs, we do see an increase in CAA-related timeouts on April 13. That happens to be the same day that NetRegistry had a major outage (1 2 3). It's possible that as part of their recovery from that outage, some routing properties changed that are causing this recurrence of timeout problems.

Thanks for bringing this to our attention; we'll work on getting it fixed.

I'd also like to ask LE to improve the robustness of their CAA check by including a check via TCP as well.
If LE is going to absolutely rely on CAA checks, then the onus should be on them and their systems to ensure the checks are absolutely correct and that they have exhausted all possible ways of checking. Their present implementation appears to be flawed in this regard: they could get the CAA response if they tried it via TCP.

I'm quite curious about the tone here.

Are you benefiting from LetsEncrypt?

CAA records are also an optional step, so why not remove these?

Why do you feel the onus should be on LetsEncrypt (especially if they are offering the service at no cost)?

I think the onus is on you:

  • You are responsible for obtaining SSL certificates for your organisation
  • You have chosen to use LetsEncrypt
  • You have chosen to use NetRegistry
  • You have chosen to enable CAA (or maybe not) and this is what is failing

Staff at LetsEncrypt have always worked to help out, but I feel there should be a shared responsibility to get outcomes rather than palming it off.

Andrei

@ahaw021: FWIW, I thought that @LeonA's tone was fine. He's describing a valid technical problem, and politely asking for a solution.

This issue affects even people who haven't enabled CAA (most people). Basically, Let's Encrypt has to look up CAA all the time, just to find out whether a record is present or not.

Yes, we use LE extensively. But @ahaw021's assumptions are incorrect.

  • We have thousands of customer domains that we certify and service via Akamai with LE.
  • The service is not free.
  • Some customer domains happen to be with Netregistry.
  • None use CAA.
  • In this case all had successfully DV’d in the past but are now failing DV renewal due to CAA query timeouts on UDP.
  • The stakes are high in that previously certified customer domains may lose their certificates. This is causing operational and support issues for us.

The issue is not with proving domain ownership per se. The issue here is that regardless of DV method, LE always checks CAA as well. The CAA check must succeed (i.e. NS returns a result) even when no CAA record exists. The logic on the CAA check is one of “failsafe”; that is, if the authoritative NS cannot be reached (e.g. times out) for the CAA check then the entire DV is failed. In the case we have on this thread, the failure is with the check on UDP but TCP succeeds. This behavior is new and has not been seen since LE last made a fix for it when they went GA in late 2015.
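The "failsafe" behaviour described above can be sketched as a small decision function. This is a simplified illustration, not Boulder's implementation: the names `CaaOutcome` and `may_issue` are hypothetical, and the sketch ignores details of real CAA processing such as climbing to parent domains, `issuewild`, and record parameters.

```python
from enum import Enum


class CaaOutcome(Enum):
    NO_RECORDS = "no_records"        # the nameserver answered; no CAA record exists
    RECORDS = "records"              # the nameserver answered with CAA records
    LOOKUP_FAILED = "lookup_failed"  # timeout, SERVFAIL, or other failure


def may_issue(outcome, issuers=(), ca_identifier="letsencrypt.org"):
    """Return True only if the CAA check positively permits issuance."""
    if outcome is CaaOutcome.LOOKUP_FAILED:
        # Fail closed: an unanswered query is treated as a refusal,
        # even though most domains have no CAA record at all.
        return False
    if outcome is CaaOutcome.NO_RECORDS:
        # An explicit empty answer means issuance is unrestricted.
        return True
    # Records exist: the CA must be named in one of the issue values.
    return ca_identifier in issuers
```

Under this model, a domain with no CAA record passes only when the nameserver actually answers, which is why a dropped UDP query fails the whole validation.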

My general assertion is that if a CAA check is a must for DV and therefore cert issuance, then the implementation of the check must be robust and exhaustive. I think LE engineering recognizes this and they have taken it into consideration with their system design. However, it's not working as designed now.

@jsha

FWIW - The first recent case I have of failed CAA via UDP to Netregistry occurred in early Feb 2017. Well before the 13 Apr event.

For that domain, I ended up having to move it to a Symantec OV cert to avoid losing TLS.

Another user here with the same problem: DNS problem: SERVFAIL looking up CAA for www.imolaenergy.hu

@LeonA as of last night, I was able to reproduce timeouts on this query from multiple locations. However, as of today, all locations I've tested, except for our production datacenter, succeed on this query. To gather a bit more data: Do you still see timeouts for this query?

Thanks,
Jacob

@jsha

Wow. Something has finally changed for the better. Assuming you (LE) made no changes, it looks like Netregistry may have actually made a fix.

All of our previously failing CAA checks are now returning successfully on both UDP and TCP.

Next I will re-initialize our provisioning jobs to see if the failed LE certificate enrollments and respective DVs succeed.
I’ll post my results here when I know more…

I wouldn’t get your hopes up too much. Our internal tests so far are showing that we still get timeouts from our datacenter. My guess is that something changed routing-wise, and this went from an “everyone” problem to a “sometimes” problem. But do let us know if you see a success from your provisioning jobs.

Good morning (well, for me) @jsha & @LeonA. It was a mystery to me how I had 2 other sites with Netregistry that had worked fine about 2 weeks ago, and then this week my main domain would not! But I would say your guess, “My guess is that something changed routing-wise, and this went from an “everyone” problem to a “sometimes” problem.”, is spot on, as my main domain vpscloud.biz has gone through fine this morning! Let's hope it will renew fine in the future!

Regards to all,

Mark.

Has anyone else tried it recently and had success?

Yes I just renewed our main domain. Now to try the others. I was about to switch my domains to another domain host so it has happened just in time…

I’ve now renewed 4. The mobile one (m.domainname.com.au) initially failed just now, but worked when I tried again. The other three worked straight up today.

The problems I was seeing seem to be resolved as of this morning, and I can successfully receive UDP responses from the Netregistry DNS servers.

dig CAA rest-stops.computerpros.com.au. @ns1.netregistry.net. +notcp

The cert is now being successfully issued.

Seems all good here as well

I can also confirm that all of the domains that were failing DV renewal due to CAA check via UDP have successfully passed and their respective certs renewed.

@jsha

Looks like we may not be completely out of the woods yet.
I have a new case that popped up today for domains hosted by ezyreg.com, another Netregistry reseller.
The same CAA via UDP timeout issue remains there.
Here’s the example…

This CAA check via UDP fails with timeout
dig CAA drumdigital.com.au. @ns-1.ezyreg.com. +notcp
; <<>> DiG 9.10.2 <<>> CAA drumdigital.com.au. @ns-1.ezyreg.com. +notcp
;; global options: +cmd
;; connection timed out; no servers could be reached

This CAA check via TCP works
dig CAA drumdigital.com.au. @ns-1.ezyreg.com. +tcp
; <<>> DiG 9.10.2 <<>> CAA drumdigital.com.au. @ns-1.ezyreg.com. +tcp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33240
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;drumdigital.com.au. IN CAA

;; AUTHORITY SECTION:
drumdigital.com.au. 86400 IN SOA ns-1.ezyreg.com. cpanel.netregistry.com.au. 2017050100 86400 7200 3600000 86400

;; Query time: 315 msec
;; SERVER: 180.235.128.119#53(180.235.128.119)
;; WHEN: Wed May 03 15:12:49 Eastern Daylight Time 2017
;; MSG SIZE rcvd: 117
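For anyone scripting this comparison rather than reading dig output by eye, the fields that matter in the TCP reply above (status NOERROR, ANSWER: 0, the aa flag) all live in the fixed 12-byte DNS header. Below is a minimal Python sketch of decoding that header from a raw response. The helper names `parse_header` and `describe` are hypothetical, and the classification is a simplification for illustration.

```python
import struct


def parse_header(message):
    """Decode the fixed 12-byte DNS header from a raw response."""
    if len(message) < 12:
        raise ValueError("message shorter than a DNS header")
    query_id, flags, _qd, answers, _ns, _ar = struct.unpack(
        "!HHHHHH", message[:12]
    )
    return {
        "id": query_id,
        "is_response": bool(flags & 0x8000),    # qr bit
        "authoritative": bool(flags & 0x0400),  # aa bit
        "truncated": bool(flags & 0x0200),      # tc bit
        "rcode": flags & 0x000F,                # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
        "answers": answers,
    }


def describe(header):
    """Summarise what a CAA reply header means for the check."""
    if header["truncated"]:
        return "truncated: retry over TCP"
    if header["rcode"] == 3:
        return "name does not exist, so no CAA records"
    if header["rcode"] != 0:
        return "error rcode %d: treat as lookup failure" % header["rcode"]
    if header["answers"] == 0:
        return "no CAA records at this name"
    return "CAA records present: inspect the answer section"
```

Fed a header with the same values as the TCP response above (id 33240, flags qr aa rd, zero answers), `describe` reports that there are no CAA records at this name, which is the result the UDP query never gets far enough to deliver.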

Revisiting this thread, I wanted to correct my earlier statement about Unbound attempting TCP if UDP fails, which I discovered is inaccurate. Unbound never falls back to TCP due to timeouts, only when it receives a truncated ANSWER. See this reply I got on the Unbound mailing list: Trust rules and DNSSEC signatures