DNS problem: SERVFAIL looking up CAA for solarexchange.agl.com.au - the domain's nameservers may be malfunctioning

We're trying to renew this certificate (https://crt.sh/?q=agldevops.digital.agl.com.au), but validation by Let's Encrypt fails with the following error, for a different domain on every run:

"error": {
    "type": "urn:ietf:params:acme:error:caa",
    "detail": "Error finalizing order :: While processing CAA for test.solarexchange.agl.com.au: DNS problem: SERVFAIL looking up CAA for solarexchange.agl.com.au - the domain's nameservers may be malfunctioning",
    "status": 403
}

And another example from a different attempt:

"error": {
    "type": "urn:ietf:params:acme:error:caa",
    "detail": "Error finalizing order :: While processing CAA for testapi.platform.agl.com.au: DNS problem: SERVFAIL looking up CAA for testapi.platform.agl.com.au - the domain's nameservers may be malfunctioning",
    "status": 403
}

We first checked the failing domains with Let's Debug and found no issues (e.g. https://letsdebug.net/test.solarexchange.agl.com.au/366910 ). We then checked the CAA records manually using dig and also found no issues:
$ dig +dnssec powerdirectsapphiretest.digital.agl.com.au CAA

; <<>> DiG 9.10.6 <<>> +dnssec powerdirectsapphiretest.digital.agl.com.au CAA
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13157
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 2

; EDNS: version: 0, flags: do; udp: 1220
;powerdirectsapphiretest.digital.agl.com.au. IN CAA

powerdirectsapphiretest.digital.agl.com.au. 10800 IN CNAME test.agl.edgekey.net.
test.agl.edgekey.net.	300	IN	CNAME	e24020.x.akamaiedge.net.

akamaiedge.net.		31	IN	SOA	internal.akamaiedge.net. hostmaster.akamai.com. 1559044082 90000 90000 90000 180

powerdirectsapphiretest.digital.agl.com.au. 1 IN TXT "ETPA"

;; Query time: 440 msec
;; WHEN: Tue Nov 24 10:55:46 CET 2020
;; MSG SIZE  rcvd: 264

(We also tested digital.agl.com.au and agl.com.au, with similar results; see https://dnsviz.net/d/test.solarexchange.agl.com.au/dnssec/ )
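
For anyone repeating these checks: the first error above mentions solarexchange.agl.com.au while processing test.solarexchange.agl.com.au, which fits the CAA tree-climbing described in RFC 8659, where the CA starts at the requested name and moves up one label at a time. Below is a minimal sketch that lists every name worth checking for a given hostname. The function name `caa_lookup_chain` is made up for illustration, the sketch ignores CNAME handling, and a real CA stops climbing at the first name that returns a CAA record set.

```python
from __future__ import annotations


def caa_lookup_chain(fqdn: str) -> list[str]:
    """Return the names to check for CAA, from the hostname up to the TLD.

    Hypothetical helper: it only splits labels and does no DNS queries.
    """
    labels = fqdn.rstrip(".").lower().split(".")
    # Drop one leading label per step: a.b.c -> a.b.c, b.c, c
    return [".".join(labels[i:]) for i in range(len(labels))]


if __name__ == "__main__":
    # Print one dig command per name so each level can be checked by hand.
    for name in caa_lookup_chain("test.solarexchange.agl.com.au"):
        print(f"dig +dnssec {name} CAA")
```

A SERVFAIL at any level of that chain is enough to fail the check, so it is worth querying each name and not only the hostname in the certificate.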

For the last run, where test.solarexchange.agl.com.au failed, we also checked the DNS server logs and found the expected NXDOMAIN response (and no errors).

So far we haven't seen any issues during our tests, and the domains were revalidated with the exact same setup several times. Can you please help us understand the exact reason why the validation by Let's Encrypt fails?

Thank you in advance,

PS: We're using Akamai's Certificate Provisioning System to set up the challenge responses and to submit the order, but that shouldn't really matter in this case, since according to the error message the issue is with the DNS setup.

Is it legal to respond with both of these RRs? Being in the additional section, it probably doesn't matter. But weird nonetheless.

Is the CAA error you're getting totally reliable? i.e. if you repeat the order a few times, does it eventually succeed?

Totally reliable in the sense that it failed for one of the >90 hostnames in the certificate on each of the last 4 tries. As previously mentioned, the error message always names a different hostname. This is really odd, because the issue seems to happen after all hostnames have been validated, when we try to finalize the order.

I retried a couple of times and the cert got issued now. I still don't have an explanation why this failed previously...

The issue happened again today at 2020-12-01 20:36 GMT: "Let’s Encrypt: Error finalizing order :: While processing CAA ... DNS problem: SERVFAIL looking up CAA for powerdirectdeadpool.digital.agl.com.au, the domain nameservers may be malfunctioning". We have verified that the authoritative nameservers are set up correctly and return NOERROR for a CAA query. The enrollment was then cancelled and resubmitted, and the second attempt went through smoothly. Perhaps the Let's Encrypt CAA check needs redundancy across multiple network locations (plus retries), so that transient errors in reaching the authoritative nameservers don't stall an enrollment. It looks like that isn't the case today?
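
To illustrate the retry idea, here is a rough sketch of wrapping a flaky step so that a transient failure is retried a few times before giving up. It does not show how Let's Encrypt or Akamai CPS actually work. `TransientError`, `retry_transient` and the `operation` callback are hypothetical names invented for this example, and the caller decides which failures (for instance a SERVFAIL) count as transient.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. a SERVFAIL."""


def retry_transient(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call operation() until it succeeds or attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[TransientError] = None
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error
```

In our case the retry was done by hand (cancel and resubmit the enrollment), which amounts to the same loop with a human in it.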
