I've been experiencing difficulties obtaining certificates for multiple domains, approximately 50 of them, for the past two days. One of the affected domains is zajazdrowerowy.pl.
Usually, I obtain certificates this way: certbot certonly --webroot -w /var/www/html -d zajazdrowerowy.pl -d www.zajazdrowerowy.pl
but that was failing with an error:
DNS problem: query timed out looking up CAA for www.zajazdrowerowy.pl
so I separated the subdomain from the domain, and that worked: certbot certonly --webroot -w /var/www/html -d zajazdrowerowy.pl
Then I tried to add the subdomain to the certificate: certbot certonly --webroot -w /var/www/html --expand -d zajazdrowerowy.pl,www.zajazdrowerowy.pl
and of course, it failed with the same error:
Certbot failed to authenticate some domains (authenticator: webroot). The Certificate Authority reported these problems:
Domain: www.zajazdrowerowy.pl
Type: dns
Detail: DNS problem: query timed out looking up CAA for www.zajazdrowerowy.pl
DNS CAA records (by default added by the domain registrar) for each of the failing domains are the same, e.g.:
$ dig +short -t CAA zajazdrowerowy.pl
0 issue "certum.pl"
0 issue "letsencrypt.org"
0 issuewild "certum.pl"
0 issuewild "letsencrypt.org"
and empty for www.zajazdrowerowy.pl, which seems to be fine.
The common factor among the failing certbot requests is the domain registrar (nazwa.pl).
It's also worth mentioning that yesterday I noticed timeouts from the DNS servers of some cloud/hosting providers, e.g. OVH, when running: dig +short -t CAA www.zajazdrowerowy.pl @213.186.33.99
but the responses were instant while requesting Google DNS 8.8.8.8.
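To make that comparison repeatable, here is a minimal sketch that runs the same CAA lookup against several resolvers and reports which ones reply. The function name check_caa and the 5-second timeout with a single try are illustrative choices, not anything certbot or Let's Encrypt uses. It relies on dig exiting with a non-zero status when it gets no reply from the server.

```shell
#!/bin/sh
# check_caa NAME RESOLVER...
# Ask each resolver for the CAA record of NAME and report whether it replied.
# An empty answer still counts as a reply; only a missing reply is flagged.
# The timeout and retry values below are arbitrary illustration values.
check_caa() {
    name="$1"
    shift
    for resolver in "$@"; do
        if dig +short +time=5 +tries=1 -t CAA "$name" "@$resolver" >/dev/null 2>&1; then
            echo "$resolver: answered"
        else
            echo "$resolver: no reply"
        fi
    done
}
```

For example, `check_caa www.zajazdrowerowy.pl 8.8.8.8 213.186.33.99` would compare the two resolvers mentioned above.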
A better DNS expert than me may have other suggestions, but I think you need to ask your DNS provider about this.
The unboundtest.com site uses a DNS lookup similar to Let's Encrypt. It succeeds with your www subdomain for the A record but not CAA or AAAA record. So, there is some sort of problem looking up records that don't exist. It's fine to not have them but the DNS server must still say "not found" in the right way.
Also, DNSViz reports various errors and warnings for both your root domain and www subdomain.
@MikeMcQ @_az Thank you for your interest in this topic and for the hints!
The nameservers are out of my control; they belong to the DNS registrar. I only run a service, and my clients set up their domains to point to the IP of my server (an A record on the domain and on the www subdomain).
The interesting thing is that, at the same time, there are domains from the same DNS registrar that work just fine, e.g. jojosushi.pl: I was able to obtain an SSL certificate, and I also checked https://unboundtest.com/m/CAA/www.jojosushi.pl/HN35GAJ7 and received the correct response.
Well, if you're paying for them, then you should be able to contact their support and tell them that their nameservers aren't working. Let's Encrypt isn't the only system that will have trouble using your domain name if it has malfunctioning DNS servers.
Same here, 30+ clients with domains registered at nazwa.pl. We should join forces, blog/tweet about the issue, and put pressure on nazwa.pl to admit fault and fix it.
They deliberately make it difficult. They outright won't talk to you if you aren't the client, which means we must ask 30+ clients each to contact nazwa.pl individually and present the issue as that particular client's issue. The support personnel is apparently incapable of noticing the pattern, and they keep shifting the blame on letsencrypt.org. They simply don't have procedures for reporting issues that affect multiple clients.
If you own the domain, the workaround is to add explicit CAA records and an explicit A record for the subdomain - each subdomain you want a certificate for. This scales poorly if you have dozens of clients each with a domain registered @nazwa, but we are preparing for the operation nonetheless because there seems to be no other way.
How do I prove this to a nazwa.pl support person? I will need a command that will produce different output when used with ns*.nazwa.pl vs. a well-behaved DNS server.
@_az There is a workaround however - if I add explicit A record and CAA records for a subdomain then it suddenly works, and it also works fine when generating a certificate for the root domain (one for which both A and CAA records already exist). So there seem to be two possibilities:
TCP support / lack thereof is in fact irrelevant.
Something along the lines of "letsencrypt tries UDP first, and if it receives a NOERROR response with no records, it doesn't take no for an answer and retries over TCP".
I think it's fair to describe Unbound's behavior as "complex". There are a variety of circumstances where it will do retries and query multiple nameservers to prevent response spoofing. So it's a bit hard to give a definite and deterministic answer when one of these threads crops up.
The most obvious possibility to me that points towards this being triggered by a lack of TCP responses is:
In the past, this has been the culprit with similar timeout issues.
That adding CAA records fixes the issue does make sense. If you make a query that receives an answer, you get a relatively small response (~291 bytes):
$ dig +dnssec @ns1.nazwa.pl zAjazdrowerowy.pL caa
; <<>> DiG 9.10.6 <<>> +dnssec @ns1.nazwa.pl zAjazdrowerowy.pL caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57485
;; flags: qr aa rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1680
;; QUESTION SECTION:
;zAjazdrowerowy.pL. IN CAA
;; ANSWER SECTION:
zAjazdrowerowy.pL. 3600 IN CAA 0 issue "letsencrypt.org"
zAjazdrowerowy.pL. 3600 IN RRSIG CAA 13 2 3600 20230511000000 20230420000000 32190 zajazdrowerowy.pl. Wfm1AX80BiaYdMuMchweVvinixBHOiSs46h3Xk1qZaeDVH0yCXCZ5S4B ajZL99ZIkKnvvPHOZixB4Gf5YR7KBQ==
zAjazdrowerowy.pL. 3600 IN CAA 0 issuewild "letsencrypt.org"
zAjazdrowerowy.pL. 3600 IN CAA 0 issuewild "certum.pl"
zAjazdrowerowy.pL. 3600 IN CAA 0 issue "certum.pl"
;; Query time: 209 msec
;; SERVER: 77.55.125.10#53(77.55.125.10)
;; WHEN: Tue May 02 18:17:56 AEST 2023
;; MSG SIZE rcvd: 291
If you query for a non-existent record, it's actually much bigger due to the way that the nameserver is answering with an NSEC3 RRSet (807 bytes). This query does time out with Unbound:
$ dig +dnssec @ns1.nazwa.pl www.zAjazdrowerowy.pL caa
; <<>> DiG 9.10.6 <<>> +dnssec @ns1.nazwa.pl www.zAjazdrowerowy.pL caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50916
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 8, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1680
;; QUESTION SECTION:
;www.zAjazdrowerowy.pL. IN CAA
;; AUTHORITY SECTION:
zAjazdrowerowy.pL. 3600 IN SOA ns1.nazwa.pL. biuro.nazwa.pL. 2008133200 28800 7200 604800 86400
zAjazdrowerowy.pL. 3600 IN RRSIG SOA 13 2 3600 20230511000000 20230420000000 32190 zajazdrowerowy.pl. 3WLOR8RYUzMlywXN4XA+rAgcPO77iBu/NCu43x6p3TmOz3qV9eBqZZ6n cTpqYQ4tyFKzs00Inus6SSIfsBXVSA==
4i6e5me652d8mokrr84va4jtka16ipj5.zAjazdrowerowy.pL. 86400 IN NSEC3 1 0 12 5BBF3690 5U50O8J5F249AKQENIRM2JHVKPPLOFPL A NS SOA MX TXT RRSIG DNSKEY NSEC3PARAM CAA
4i6e5me652d8mokrr84va4jtka16ipj5.zAjazdrowerowy.pL. 86400 IN RRSIG NSEC3 13 3 86400 20230511000000 20230420000000 32190 zajazdrowerowy.pl. VV7eewKiOcfIuRlwcXmt19iaPi5ehruM5CdqPu6ixWm6cRnIrhNwzgBe l5zgvgNjFEzFT1pWIDWypyX8cIyV8A==
5u50o8j5f249akqenirm2jhvkpplofpl.zAjazdrowerowy.pL. 86400 IN NSEC3 1 0 12 5BBF3690 T7O2TG2EO22IS5ET70SN7AL62LTSHR66 A RRSIG
5u50o8j5f249akqenirm2jhvkpplofpl.zAjazdrowerowy.pL. 86400 IN RRSIG NSEC3 13 3 86400 20230511000000 20230420000000 32190 zajazdrowerowy.pl. Hjbw2F96bjKKQNYl993d/IGHHdNZr2eIp/FIhSrHLbrta0HSyzQf0aa5 KTDVaqyJuqCohC5EJYW9B1yhGImHZg==
t7o2tg2eo22is5et70sn7al62ltshr66.zAjazdrowerowy.pL. 86400 IN NSEC3 1 0 12 5BBF3690 4I6E5ME652D8MOKRR84VA4JTKA16IPJ5 A RRSIG
t7o2tg2eo22is5et70sn7al62ltshr66.zAjazdrowerowy.pL. 86400 IN RRSIG NSEC3 13 3 86400 20230511000000 20230420000000 32190 zajazdrowerowy.pl. ZhoQZToi3OUMGfEkHuW48CW3YYjhGWuTicvZ9TR843/kLJjvwZDEwVWb V9eY9+DAMJOcGLXp6NdDrL2jQiLfKw==
;; Query time: 203 msec
;; SERVER: 77.55.125.10#53(77.55.125.10)
;; WHEN: Tue May 02 18:19:30 AEST 2023
;; MSG SIZE rcvd: 807
The larger the response, the higher the likelihood that it will be truncated (depending on the parameters of the client, server, and query) and that the client must fall back to TCP. I think, circumstantially, it's really likely that these DNS conversations are hitting the fragmentation threshold, which causes Unbound to retry the queries over TCP and ultimately hit a timeout.
One last potential piece of supporting evidence is that Let's Encrypt reduced their EDNS Buffer Size (the threshold at which queries will fall back to TCP) to 512 bytes, back in 2018. I'm not sure if it's still the production setting in 2023, but it would explain why the second query in my last post times out when Let's Encrypt sends it, because 807 bytes > 512 bytes.
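As a rough illustration of that size argument, the sketch below pulls the "MSG SIZE rcvd" value out of saved dig output and compares it with an EDNS buffer size. The helper names msg_size and fits_in_udp are made up for this example, and the 512-byte default is only the 2018 setting mentioned above, assumed rather than confirmed for production. The sizes dig reports here were measured with a 1680-byte buffer, so this is an approximation of what a resolver advertising 512 bytes would see.

```shell
#!/bin/sh
# Read dig output on stdin and print the value of ";; MSG SIZE rcvd: N".
msg_size() {
    sed -n 's/^;; MSG SIZE *rcvd: *\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# fits_in_udp SIZE [BUFSIZE]
# Say whether a response of SIZE bytes fits in the advertised UDP buffer.
# BUFSIZE defaults to 512, which is an assumption taken from the 2018 change.
fits_in_udp() {
    size="$1"
    bufsize="${2:-512}"
    if [ "$size" -le "$bufsize" ]; then
        echo "fits in UDP ($size <= $bufsize)"
    else
        echo "truncated, TCP retry needed ($size > $bufsize)"
    fi
}
```

Feeding it the two responses above, `fits_in_udp 291` reports that the answer fits, while `fits_in_udp 807` reports that a TCP retry is needed.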
I'm in favor of doing it. I already reported the issue on April 30th on behalf of my clients through their official email support, and I ran into their internal miscommunication: one support member replied that they are verifying the issue while, at the same time, another told me that I have to go through the verification process. Anyway, that was 3 days ago with no follow-ups since, so my faith in their willingness to take it seriously is dropping.
As nazwa.pl is one of the main DNS providers in the country, I think they should follow the RFCs. A workaround may be fine for now, but that is not how it should be at this level. If I have to take per-client action anyway, then given my poor experience with their support I'm willing to recommend that my clients move away from them, probably not only the affected ones but all of them.
I believe @jsha tries to keep the Unboundtest.com configuration as close to Boulder's Unbound settings as possible, and currently Unboundtest indeed has edns-buffer-size set to 512.
I think you can get dig closer to the behavior of Let's Encrypt's resolvers by setting +bufsize=512 instead of +tcp directly, so that it has to retry over TCP for responses over 512 bytes.
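For the question of what to show a support person, one possible approach is a pair of dig commands: one over UDP with a 512-byte buffer and no TCP retry, and one forced over TCP. The sketch below only prints the commands so they can be pasted into a ticket; the function name caa_probe_cmd and the timeout values are illustrative assumptions. If the TCP theory is right, the UDP command should return a reply with the tc (truncated) flag set and the TCP command should time out against ns*.nazwa.pl, while a well-behaved nameserver answers both. That outcome has not been verified here.

```shell
#!/bin/sh
# caa_probe_cmd SERVER NAME MODE
# Print (do not run) a dig command that queries SERVER for the CAA record of
# NAME. MODE is "udp" or "tcp".
caa_probe_cmd() {
    server="$1"
    name="$2"
    mode="$3"
    case "$mode" in
        # Advertise a 512-byte buffer; +ignore shows the truncated reply
        # instead of silently retrying over TCP.
        udp) opts="+bufsize=512 +ignore" ;;
        # Force TCP, which is what a resolver does after a truncated reply.
        tcp) opts="+tcp" ;;
        *)   echo "mode must be udp or tcp" >&2; return 1 ;;
    esac
    echo "dig +dnssec +norecurse +time=5 +tries=1 $opts @$server $name CAA"
}
```

Running `caa_probe_cmd ns1.nazwa.pl www.zajazdrowerowy.pl udp` and then the same with `tcp` gives the two commands; repeating them with a nameserver that handles TCP correctly gives the comparison.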