My domain is: _acme-challenge.abhtest.cloudengine.mercedes-benz.com
It produced this output: can't renew certificates
My hosting provider, if applicable, is: AWS ROUTE 53
Hello everyone, we recently debugged an issue with one of our domain names not being able to renew a certificate. Let's Encrypt is not able to fetch the TXT records for the DNS-01 challenge.
We then went to check unboundtest and found a difference between Unbound 1.16 and 1.18.
This particular certificate is made up of 21 domain names. We have no issue renewing other certificates, only this one. My assumption is that it has something to do with UDP size?
Looking at the difference between the working logs and the failing logs, we can see the TC flag in the response from Unbound 1.18:
Agreed, the TXT record should be removed after issuance. Having that many TXT records isn't helping things, though I don't think it's your only problem.
_acme-challenge.abhtest.cloudengine.mercedes-benz.com/TXT: A query for _acme-challenge.abhtest.cloudengine.mercedes-benz.com results in a NOERROR response, while a query for its ancestor, abhtest.cloudengine.mercedes-benz.com, returns a name error (NXDOMAIN), which indicates that subdomains of abhtest.cloudengine.mercedes-benz.com, including _acme-challenge.abhtest.cloudengine.mercedes-benz.com, don't exist.
I'd be kind of surprised if this is stock Route 53 behavior, and if so it may be tough for you to fix. But NXDOMAIN means There Really Is Nothing Underneath, so any DNS system querying for abhtest.cloudengine.mercedes-benz.com and thus getting the NXDOMAIN would be justified in not separately querying _acme-challenge.abhtest.cloudengine.mercedes-benz.com. I don't know whether Unbound specifically does it that way, so it may not be related to this particular problem, but I think it's something you should look at fixing.
mercedes-benz.com zone: The server(s) were not responsive to queries over UDP. (2a03:9e42:e201:1001::53)
This server is ns1.corpinter.net. (serving cloudengine.mercedes-benz.com), and seems to just not be working.
_acme-challenge.abhtest.cloudengine.mercedes-benz.com/TXT: No response was received until the UDP payload size was decreased, indicating that the server might be attempting to send a payload that exceeds the path maximum transmission unit (PMTU) size. (2600:9000:5300:3f00::1, UDP_-_EDNS0_4096_D_KN)
And that indicates that there's something wrong with UDP connectivity to ns-63.awsdns-07.com. You might want to contact AWS about that.
Thank you for your prompt reply, I'll try and clarify part of our infrastructure.
The domain _acme-challenge.abhtest.cloudengine.mercedes-benz.com is just a standalone domain created to replicate and keep the TXT records of the failing domain. The production one is _acme-challenge.abh.cloudengine.mercedes-benz.com (hence the NXDOMAIN error).
The alternative domains all have CNAME pointing to _acme-challenge.abh.cloudengine.mercedes-benz.com:
example:
$ dig _acme-challenge.preprod.toolbox.onl
; <<>> DiG 9.10.6 <<>> _acme-challenge.preprod.toolbox.onl
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52441
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_acme-challenge.preprod.toolbox.onl. IN A
;; ANSWER SECTION:
_acme-challenge.preprod.toolbox.onl. 4502 IN CNAME _acme-challenge.abh.cloudengine.mercedes-benz.com.
_acme-challenge.abh.cloudengine.mercedes-benz.com. 77 IN A 52.57.19.202
_acme-challenge.abh.cloudengine.mercedes-benz.com. 77 IN A 35.156.216.252
_acme-challenge.abh.cloudengine.mercedes-benz.com. 77 IN A 18.194.7.84
;; Query time: 344 msec
;; SERVER: fe80::8820:dff:fe4b:a264%22#53(fe80::8820:dff:fe4b:a264%22)
;; WHEN: Tue Dec 26 22:20:53 IST 2023
;; MSG SIZE rcvd: 175
Again, there was no issue creating this cert before with 24 SANs. The only difference I could find is the new version of Unbound and how 1.18+ doesn't retrieve the TXT records.
I'll poke around regarding ns1.corpinter.net and ns-63.awsdns-07.com.
To recap, we have no issues issuing other certificates using the same hosted zone on Route 53. Only this certificate, which is the biggest in terms of SANs, is failing. When we try to create a certificate with 21 SANs it works.
Hmm. That's a bit unusual, as you're expecting a 24-TXT-record result, and I don't think that's a scenario that many people have tried or tested. But it certainly should work.
I'm guessing that you're running into sending a response that's so big that something doesn't like it. (Whether that's something on your DNS server end not working with the right size, or Unbound's resolver, or whatnot I'm not sure.)
But as a workaround, might you be able to have each domain name aliased to a different record, rather than them all going to the same record? That is, _acme-challenge.domain1.abhtest.cloudengine.mercedes-benz.com through _acme-challenge.domain24.abhtest.cloudengine.mercedes-benz.com or something along those lines. You'd need to change all your CNAMEs (which I'm sure is a pain), and then change your acme.sh command to specify a different alias for each domain.
You shouldn't need to go that route, having them all in one record should work. I'm just proposing a workaround that I think might get you going again. Or maybe someone else has a better idea of what's going on with the large DNS responses.
If you have support from AWS, you might want to bring them in if you haven't already. You can probably come up with a test case to show them that isn't related to certificates at all, just having a couple dozen TXT records on one name and showing how resolving it isn't working in current unbound.
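For what it's worth, a certificate-free test case along those lines could look roughly like the Python sketch below. It hand-builds a TXT query with an EDNS0 OPT record and checks the TC (truncated) bit of whatever comes back. The function names, the query ID, and the 5-second timeout are illustrative assumptions, not anything taken from Boulder or unboundtest.

```python
import socket
import struct
from typing import Optional

TYPE_TXT = 16    # RFC 1035
TYPE_OPT = 41    # RFC 6891 (EDNS0)
CLASS_IN = 1
FLAG_RD = 0x0100  # recursion desired
FLAG_TC = 0x0200  # truncated


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in the root label."""
    wire = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label length in {name!r}")
        wire += bytes([len(raw)]) + raw
    return wire + b"\x00"


def build_txt_query(name: str, query_id: int,
                    udp_payload_size: Optional[int] = None) -> bytes:
    """Build a TXT query; add an OPT record when a UDP payload size is given."""
    arcount = 0 if udp_payload_size is None else 1
    header = struct.pack("!HHHHHH", query_id, FLAG_RD, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", TYPE_TXT, CLASS_IN)
    opt = b""
    if udp_payload_size is not None:
        # OPT RR: root owner name, TYPE 41, CLASS = advertised UDP size,
        # TTL = extended rcode/version/flags (all zero), empty RDATA.
        opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, udp_payload_size, 0, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """Return True when the TC bit is set in a DNS response header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & FLAG_TC)


def ask_over_udp(server_ip: str, name: str, udp_payload_size: int = 512,
                 timeout: float = 5.0) -> bytes:
    """Send the query over UDP to server_ip:53 and return the raw reply."""
    query = build_txt_query(name, 0x4242, udp_payload_size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server_ip, 53))
        reply, _ = sock.recvfrom(65535)
    return reply
```

Calling is_truncated(ask_over_udp(resolver_ip, name)) against the resolver under test, once per advertised payload size, would show at which size the reply starts coming back truncated.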
OR
Since [most] validations are cached for up to 30 days... [per account].
You could simply run the same request a few times in a row.
Each time anything is authenticated, it will be removed from the next request.
Eventually all will have been authenticated and the cert will be issued.
That said, there are rate limits that must be adhered to.
So, just play within the set limits and you can game a win.
It's not clear to me that any of the authorizations are succeeding. It sounds like the TXT record that they're all being CNAME'd to is so large (or something along those lines) that Unbound isn't able to resolve the name at all. But I guess we haven't seen the actual Boulder error messages in this thread, just the results of their own testing with Unbound.
So...
The "normal network configuration" does not allow large[r than "normal"] UDP packets?
Sounds like they may need to read a bit on EDNS0 and DNS over TCP.
Route 53 claims EDNS0 support with UDP packets up to 4096 bytes. And Let's Encrypt's configuration sets edns-buffer-size to 512, so it should be switching to TCP for these responses anyway. It's certainly possible that Route 53 is doing something weird or offbeat here (like I said, this probably isn't a situation that gets a lot of test coverage), but in general AWS seems to know what they're doing, and it may be that Unbound is the one doing something weird (or needs some different configuration for this use case or the like).
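To make the "switch to TCP" step concrete, here is a minimal Python sketch of what a well-behaved client is expected to do when a UDP reply has the TC bit set: resend the same query over TCP, where each message is prefixed with a two-byte length (RFC 1035, section 4.2.2). The transports are passed in as plain callables so the logic can be read on its own. All names here are hypothetical and this is not how Unbound or Boulder are actually structured.

```python
import struct
from typing import Callable

Transport = Callable[[bytes], bytes]


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 16-bit big-endian length for TCP."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS message too large for TCP framing")
    return struct.pack("!H", len(message)) + message


def unframe_from_tcp(data: bytes) -> bytes:
    """Strip the two-byte length prefix and return exactly one DNS message."""
    if len(data) < 2:
        raise ValueError("missing length prefix")
    (length,) = struct.unpack("!H", data[:2])
    body = data[2:2 + length]
    if len(body) != length:
        raise ValueError("short read on TCP stream")
    return body


def resolve_with_fallback(query: bytes, send_udp: Transport,
                          send_tcp: Transport) -> bytes:
    """Try UDP first; if the reply is truncated, retry the query over TCP."""
    reply = send_udp(query)
    # TC is bit 0x02 of the first flags byte (offset 2 in the header).
    if len(reply) >= 12 and reply[2] & 0x02:
        return unframe_from_tcp(send_tcp(frame_for_tcp(query)))
    return reply
```

If any hop in the chain skips the retry branch, the caller only ever sees the truncated, answer-less reply.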
Thank you, that's what we ended up doing manually:
request a certificate with 15 domains:
$ ./acme.sh --issue -d *.abh.cloudengine.mercedes-benz.com -d domain1.com ..... -d domain15.com --challenge-alias abh.cloudengine.mercedes-benz.com --dns dns_aws --server letsencrypt
...
...
[Wed Dec 27 10:49:12 UTC 2023] Verifying: *.abh.cloudengine.mercedes-benz.com
[Wed Dec 27 10:49:13 UTC 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Wed Dec 27 10:49:17 UTC 2023] Success
[Wed Dec 27 10:49:17 UTC 2023] Verifying: domain1.com
[Wed Dec 27 10:49:18 UTC 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Wed Dec 27 10:49:23 UTC 2023] Success
...
...
[Wed Dec 27 10:53:03 UTC 2023] Cert success.
request a certificate with same wildcard + 10 domains:
$ ./acme.sh --issue -d *.abh.cloudengine.mercedes-benz.com -d domain16.com ..... -d domain25.com --challenge-alias abh.cloudengine.mercedes-benz.com --dns dns_aws --server letsencrypt
.....
.....
[Wed Dec 27 10:57:13 UTC 2023] *.abh.cloudengine.mercedes-benz.com is already verified, skip dns-01.
[Wed Dec 27 10:57:13 UTC 2023] Verifying: domain16.com
[Wed Dec 27 10:57:14 UTC 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Wed Dec 27 10:57:19 UTC 2023] Success
[Wed Dec 27 10:57:19 UTC 2023] Verifying: domain17.com
[Wed Dec 27 10:57:20 UTC 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Wed Dec 27 10:57:24 UTC 2023] Pending, The CA is processing your order, please just wait. (2/30)
[Wed Dec 27 10:57:29 UTC 2023] Success
....
[Wed Dec 27 10:59:56 UTC 2023] Cert success.
request the full certificate with 25 domains that have already been verified above:
$ ./acme.sh --issue -d *.abh.cloudengine.mercedes-benz.com -d domain2.com ..... -d domain25.com --challenge-alias abh.cloudengine.mercedes-benz.com --dns dns_aws --server letsencrypt
...
...
[Wed Dec 27 11:57:21 UTC 2023] *.abh.cloudengine.mercedes-benz.com is already verified, skip dns-01.
[Wed Dec 27 11:57:21 UTC 2023] domain2.com is already verified, skip dns-01.
...
...
[Wed Dec 27 11:57:22 UTC 2023] domain25.com is already verified, skip dns-01.
...
...
[Wed Dec 27 11:57:25 UTC 2023] Cert success.
All good, the certificate has been created with 25 domains.
As a workaround this is fine and unblocks us for now. We still need to figure out why it fails when directly requesting 25 domains. I will create a separate post below for that.
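In case it helps anyone else script the same workaround, here is a rough Python sketch that splits a long SAN list into smaller orders (each led by the wildcard, as in the runs above) followed by one final order with every name. It only builds the argument lists, using the same acme.sh flags shown in this thread. The function names and the batch size are made up for illustration, and the Let's Encrypt rate limits mentioned earlier still apply.

```python
from typing import List


def plan_orders(wildcard: str, domains: List[str],
                batch_size: int) -> List[List[str]]:
    """Return the name list for each order: small batches, then the full set."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    full = [wildcard] + domains
    batches = [[wildcard] + domains[i:i + batch_size]
               for i in range(0, len(domains), batch_size)]
    if len(batches) <= 1:
        # Everything already fits in a single order.
        return [full]
    return batches + [full]


def acme_command(names: List[str], alias: str) -> List[str]:
    """Build an acme.sh argument list using only flags seen in this thread."""
    args = ["./acme.sh", "--issue"]
    for name in names:
        args += ["-d", name]
    args += ["--challenge-alias", alias,
             "--dns", "dns_aws", "--server", "letsencrypt"]
    return args
```

Passing each list to subprocess.run as an argument list (rather than a shell string) also avoids the shell expanding the `*` in the wildcard name.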
Same for us. It looks like the TXT record is so large that Unbound isn't able to resolve it, and the order fails at the first verification when requesting a cert with 25 domains:
[adding txt record to route53 one by one]
...
...
...
[Wed Dec 27 12:24:34 UTC 2023] All success, let's return
[Wed Dec 27 12:24:34 UTC 2023] Verifying: *.abh.cloudengine.mercedes-benz.com
[Wed Dec 27 12:24:34 UTC 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Wed Dec 27 12:24:37 UTC 2023] Invalid status, *.abh.cloudengine.mercedes-benz.com:Verify error detail:No TXT record found at _acme-challenge.abh.cloudengine.mercedes-benz.com
[Wed Dec 27 12:24:37 UTC 2023] Removing DNS records.
[Wed Dec 27 12:24:37 UTC 2023] Removing txt: lhnUPlA9u22VhTR2t22WL1B6XxVJM3O745H-mhUKboE for domain: _acme-challenge.abh.cloudengine.mercedes-benz.com
...
...
...
failure
This indicates that Unbound is not able to get the TXT record. But it could in the past, since we successfully created this certificate in one command, with 25 domains, 3 months ago.
Now, to troubleshoot this issue we created a large fake TXT record: _acme-challenge.abhtest.cloudengine.mercedes-benz.com
Unbound 1.16 successfully returns the TXT records.
Unbound 1.18 and 1.19 don't.
We then ran unbound 1.19 locally via docker using unboundtest Dockerfile and the same unbound.conf:
Unbound 1.19 running locally also fails to return the TXT record. We can see the TXT record in the debug log, but not in the answer.
Query results for TXT _acme-challenge.abhtest.cloudengine.mercedes-benz.com
Response:
;; opcode: QUERY, status: NOERROR, id: 46640
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: do; udp: 512
;; QUESTION SECTION:
;_acme-challenge.abhtest.cloudengine.mercedes-benz.com. IN TXT
----- Unbound logs -----
...
...
Dec 27 13:39:13 unbound[13:0] debug: process_response: new external response event
Dec 27 13:39:13 unbound[13:0] info: scrub for cloudengine.mercedes-benz.com. NS IN
Dec 27 13:39:13 unbound[13:0] info: response for _acme-challenge.abhtest.cloudengine.mercedes-benz.com. TXT IN
Dec 27 13:39:13 unbound[13:0] info: reply from <cloudengine.mercedes-benz.com.> 205.251.196.169#53
Dec 27 13:39:13 unbound[13:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 21, AUTHORITY: 4, ADDITIONAL: 0
;; QUESTION SECTION:
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. IN TXT
;; ANSWER SECTION:
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "EHQYOqAeQgjOEpYljeyOvHTTKFc2XvLRxz3L3t0GpJg"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "Mo4Jj06GlYbnTpSoGrW9hfUxQmwaACajaAjVQNoDBq0"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "NxTMWNG8vB3_sqBFl-hYLAqNfgYF-CG3EsOXYtNYiTY"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "PbJXwoJr8Xmneky7VBgOf-WfVaQFuu50AaJB-R-u4PY"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "PeTIYmnDU2cyq-l_VljNIYO7tRjdWI5yzpexoDP3Z1U"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "Q1QrrSEEInPrag2g7y4_EH"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "Q1QrrSEEInPrag2g7y4_EH-GTcUmL8XlcWv6SqDdCsE"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "QSadJSxPUioThP2XHNH1aXvJKEjyPbkttdINObZZGfA"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "YTJxX-cdB5bXJQ2oR03rhN1Au1BZFZS955DrnhDbOBI"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "ex76rM--NrtcwTlx1rqpxtsk_0fv4oSEVcfxjiqm8VQ"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "jakxcJHi_sAFnE64fjyVh1fhPk3SLOLfIrNssr5YX6Q"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "qAacYHuljj2mkA82MkYEUbACRVcWkLYNkUU8lwrLHAY"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "rbV3iZeujOvzfsl7Vpj9vM0L0CMoPaPLzXHb0yM3DB4"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "yfu8J7zX1TqTtBau9Mdm7aBui3Sba8BlG5XYCjMOWkw"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "3O6-HP0qVo03wno7w7dLPuSCDfZKBXJM1nNbQYY-Y1Q"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "6dWY3tFbyPWIgkL46ok1TE63UFqGnfQzaRdd7a1YPUI"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "A4EGy0uFH_79sdkVkMJhT9_U4Ltkf0-6Uoqup-SLI70"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "AYuYR7_0pTjkDaHa5-vjhsMDWDvGp7ZgJP2HqzWzXSA"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "C6CtkcfTd21q5m2FB-QH0vbArg_g0QNQaeIUHIc91vI"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "CbwkUnUZz6-plPNxhpkCSXbCZDt87gMf-JED4pIUv0E"
_acme-challenge.abhtest.cloudengine.mercedes-benz.com. 0 IN TXT "DeJjJ9cImQwPPFUMsI39UBFVtj2Fsn6L9uGx4m8qV5A"
;; AUTHORITY SECTION:
cloudengine.mercedes-benz.com. 0 IN NS ns-1814.awsdns-34.co.uk.
cloudengine.mercedes-benz.com. 0 IN NS ns-603.awsdns-11.net.
cloudengine.mercedes-benz.com. 0 IN NS ns-63.awsdns-07.com.
cloudengine.mercedes-benz.com. 0 IN NS ns-1193.awsdns-21.org.
;; ADDITIONAL SECTION:
;; MSG SIZE rcvd: 1362
Dec 27 13:39:13 unbound[13:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
Dec 27 13:39:13 unbound[13:0] info: query response was ANSWER
Dec 27 13:39:13 unbound[13:0] debug: TTL 0: dropped msg from cache
Dec 27 13:39:13 unbound[13:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
Dec 27 13:39:13 unbound[13:0] info: finishing processing for _acme-challenge.abhtest.cloudengine.mercedes-benz.com. TXT IN
Dec 27 13:39:13 unbound[13:0] debug: mesh_run: iterator module exit state is module_finished
Dec 27 13:39:13 unbound[13:0] debug: validator[module 0] operate: extstate:module_wait_module event:module_event_moddone
Dec 27 13:39:13 unbound[13:0] info: validator operate: query _acme-challenge.abhtest.cloudengine.mercedes-benz.com. TXT IN
Dec 27 13:39:13 unbound[13:0] debug: validator: nextmodule returned
Dec 27 13:39:13 unbound[13:0] debug: val handle processing q with state VAL_INIT_STATE
Dec 27 13:39:13 unbound[13:0] debug: validator classification positive
Dec 27 13:39:13 unbound[13:0] info: no signer, using _acme-challenge.abhtest.cloudengine.mercedes-benz.com. TYPE0 CLASS0
Dec 27 13:39:13 unbound[13:0] debug: val handle processing q with state VAL_FINISHED_STATE
Dec 27 13:39:13 unbound[13:0] debug: TTL 0: dropped msg from cache
...
...
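As a rough cross-check on sizes, the Python sketch below estimates how large a TXT answer gets on the wire. It assumes the owner name of every answer is compressed to a 2-byte pointer, that the authority section is left out, and that an 11-byte OPT record is optionally appended. Those are simplifying assumptions of mine, so real packets (such as the 1362-byte one logged above, which also carries four NS records) will differ somewhat.

```python
from typing import Iterable

HEADER_SIZE = 12        # fixed DNS header
QUESTION_FIXED = 4      # QTYPE + QCLASS
RR_FIXED = 10           # TYPE + CLASS + TTL + RDLENGTH
NAME_POINTER = 2        # compressed owner name (assumption)
OPT_SIZE = 11           # root name + fixed fields, empty RDATA


def wire_name_length(name: str) -> int:
    """Length of an uncompressed domain name on the wire."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def txt_answer_size(name: str, texts: Iterable[str],
                    with_opt: bool = True) -> int:
    """Estimate the size of a response carrying one TXT record per text."""
    size = HEADER_SIZE + wire_name_length(name) + QUESTION_FIXED
    for text in texts:
        n = len(text.encode("ascii"))
        # Each character-string holds at most 255 bytes plus a length byte.
        chunks = max(1, -(-n // 255))
        size += NAME_POINTER + RR_FIXED + n + chunks
    if with_opt:
        size += OPT_SIZE
    return size
```

Under this model each 43-character dns-01 token costs 56 bytes, so two dozen of them come to 24 × 56 = 1344 bytes of answer records alone, which is already above 1232 before the header and question are counted.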
We then modified the unboundtest configuration unbound.conf to add the following line:
max-udp-size: 4096
Unbound 1.19 running locally successfully returned the TXT record.
Indeed, the default max-udp-size in Unbound 1.16 was 4096. It was changed to 1232 in this commit, and that is the default used by 1.18 and 1.19:
19 January 2023: Wouter
- Set max-udp-size default to 1232. This is the same default value as
the default value for edns-buffer-size. It restricts client edns
buffer size choices, and makes unbound behave similar to other DNS
resolvers. The new choice, down from 4096 means it is harder to get
large responses from Unbound. Thanks to Xiang Li, from NISL Lab,
Tsinghua University.
Is it possible that Let's Encrypt's Unbound configuration is misbehaving when:
max-udp-size: 1232
edns-buffer-size: 512
It is working with 1.16:
max-udp-size: 4096
edns-buffer-size: 512
Now that max-udp-size is 1232, does that mean the response is discarded and the TCP fallback is not being used?
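As I understand the unbound.conf documentation (so treat this as an assumption, not a statement about Unbound's internals), edns-buffer-size is the size Unbound advertises to upstream servers, while max-udp-size caps the UDP replies it sends back to its own clients. A toy Python model of the client-facing side, with made-up function names, would be:

```python
from typing import Optional

PLAIN_DNS_LIMIT = 512  # UDP limit when the client sends no EDNS0 OPT record


def effective_udp_limit(client_edns_size: Optional[int],
                        max_udp_size: int) -> int:
    """Largest UDP reply the resolver would send to this client (toy model)."""
    if client_edns_size is None:
        client_limit = PLAIN_DNS_LIMIT
    else:
        # Advertised sizes below 512 are treated as 512.
        client_limit = max(PLAIN_DNS_LIMIT, client_edns_size)
    return min(client_limit, max_udp_size)


def reply_is_truncated(reply_size: int, client_edns_size: Optional[int],
                       max_udp_size: int) -> bool:
    """True when the reply would go out with TC set instead of the answers."""
    return reply_size > effective_udp_limit(client_edns_size, max_udp_size)
```

In this model a reply of about 1300 bytes to a client advertising 4096 fits when max-udp-size is 4096 and is truncated when it is 1232. The answer is not lost, but the client only gets it if it retries over TCP.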
Locally I have no issue using dig to query the TXT records without edns:
I have been thinking about this for the last 4 days and my sanity is slowly going away. At this stage I am not sure whether the issue is between Let's Encrypt's Unbound and Route 53, or whether there is an issue with some of the ns1-4.corpinter.net name servers.
With more and more domains adding permanent TXT records to verify their ownership of a domain name, it is not uncommon to see bigger and bigger TXT record sets. Just look at the google.com or mercedes-benz.com TXT records to see all the something-verification=123456789 entries.
So we got to the bottom of it. This is a musl bug that's fixed in newer Alpine images. When Unbound sends a truncated DNS reply, the musl resolver does not retry over TCP like it ought to.
Uh, maybe. Are you suggesting that your internal Unbound testing might have been affected by it, or do you think that Let's Encrypt's Validation Unbound servers might be hitting this too? I'd be surprised if so since I'm not aware of them using Alpine, though I suppose I wouldn't know what they are using so I probably just shouldn't say anything.
It's certainly possible something weird is happening, but I don't think that particular one would be the problem: They have the edns buffer size set to 512, so anything over that size would be retried over TCP, and I think there are enough normal responses that would be over that size (especially with DNSSEC) that if that were the problem then more people would be complaining (and Let's Encrypt's monitoring would be alerting them).
But maybe there's some much-bigger size, where even TCP isn't working right for that size of response?
@lestaff Hoping that it's okay to start bugging you and you're back from the holidays.
Between this thread, and another similar recent report, it looks like there was some sort of regression in Unbound (either the updated version or in configuration) for the use case of multiple (20+) domains using DNS-01 where the challenge record for all of those domains is CNAME'd to one single record which is populated with all the TXT entries for all of them.
Not a particularly common configuration, no, but it is described as a standard way for acme.sh's alias mode for a multiple-SAN certificate, so it might be something that others are trying too. (And I think it should be working.)
I think this is the reason why Unbound 1.18 is showing this behavior:
"The new default for the maximum UDP response size is 1232, with max-udp-size: 1232. This is similar to other resolvers. The new default is smaller and that makes it harder to get large responses. Thanks to Xiang Li, from NISL Lab, Tsinghua University."
Is there any chance that Let's Encrypt will change this behavior?