SERVFAIL looking up CAA for some host names but not others in the same zone

My domain is: stibo.dk

We're using cert-manager v1.11.1 to issue certificates via DNS-01 for a number of hosts in the stibo.dk domain. Reissuing existing certificates and issuing new ones works perfectly for most hosts, but for a select few issuance fails with:

E0525 11:42:54.803799 1 sync.go:379] cert-manager/challenges/acceptChallenge "msg"="error waiting for authorization" "error"="acme: authorization error for asciinema.stibo.dk: 400 urn:ietf:params:acme:error:dns: DNS problem: SERVFAIL looking up CAA for asciinema.stibo.dk - the domain's nameservers may be malfunctioning" "dnsName"="asciinema.stibo.dk" "resource_kind"="Challenge" "resource_name"="asciinema.stibo.dk-tmv5j-3528835467-1941497609" "resource_namespace"="asciinema" "resource_version"="v1" "type"="DNS-01"

The above error doesn't show up for other similar (and working) certificates in the same zone.

The CAA record is fine:

❯ dig @8.8.8.8 -tCAA stibo.dk
stibo.dk.               3600    IN      CAA     0 issue "letsencrypt.org"

The certificate was first created on May 19, 2022, and was issued and renewed several times until renewal stopped working about a month ago.

I can't figure out any difference between the certificates that work and those that don't.

I don't think anything has changed on our side.

Can anyone spare a clue?

You're looking at the CAA record for stibo.dk, but the error is for the full name (which Let's Encrypt has to check first): SERVFAIL looking up CAA for asciinema.stibo.dk

And that request doesn't work:

$ dig -tCAA asciinema.stibo.dk

; <<>> DiG 9.16.38-RH <<>> -tCAA asciinema.stibo.dk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13970
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
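
When checking many host names this way, the status can be pulled out of dig's header line instead of read by eye. A minimal Python sketch, assuming the `->>HEADER<<-` line format shown in the output above; `dig_status` is a hypothetical helper name, not part of any tool:

```python
import re
from typing import Optional

# Matches the rcode in a dig header line such as:
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13970
HEADER_RE = re.compile(r"->>HEADER<<-.*?\bstatus:\s*([A-Z]+)")


def dig_status(dig_output: str) -> Optional[str]:
    """Return the response status (e.g. NOERROR, SERVFAIL) from dig output,
    or None if no header line is present."""
    match = HEADER_RE.search(dig_output)
    return match.group(1) if match else None


sample = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13970"
print(dig_status(sample))
```

Anything other than NOERROR or NXDOMAIN for a host name is worth investigating before retrying issuance.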

https://dnsviz.net/d/asciinema.stibo.dk/dnssec/?rr=257&a=all&ds=all&ta=.&tk=

https://unboundtest.com/m/CAA/asciinema.stibo.dk/CLISN7WE

You don't need a CAA record for the full name, but if you don't have one, the DNS server still needs to respond NOERROR with an empty answer (meaning there are no records) instead of returning an error.
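
To illustrate why the lookup for the full name matters: under RFC 8659 a CA starts at the requested name and climbs toward the root, stopping at the first name that has CAA records. An empty answer lets the climb continue, while an error such as SERVFAIL stops it. The following is a rough Python sketch of that logic, not Let's Encrypt's actual code. The `lookup` callback and the two fake zones are assumptions modelled on this thread, and CNAME handling is left out.

```python
from typing import Callable, List, Tuple

# (rcode, records) as a resolver might report them; illustrative only.
LookupResult = Tuple[str, List[str]]


def caa_candidates(fqdn: str) -> List[str]:
    """Names checked for CAA, from the full name up to the TLD."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def find_relevant_caa(
    fqdn: str, lookup: Callable[[str], LookupResult]
) -> Tuple[str, List[str]]:
    """Climb the tree and return (name, records) for the first non-empty set.

    Any rcode other than NOERROR or NXDOMAIN aborts the climb, because the
    CA cannot tell whether a restricting record exists at that name.
    """
    for name in caa_candidates(fqdn):
        rcode, records = lookup(name)
        if rcode not in ("NOERROR", "NXDOMAIN"):
            raise RuntimeError(f"{rcode} looking up CAA for {name}")
        if records:
            return name, records
    return "", []  # no CAA anywhere: issuance is not restricted


def healthy_zone(name: str) -> LookupResult:
    # Hypothetical zone: CAA only at the apex, empty answers elsewhere.
    if name == "stibo.dk":
        return "NOERROR", ['0 issue "letsencrypt.org"']
    return "NOERROR", []


def broken_zone(name: str) -> LookupResult:
    # Same zone, but the host name itself fails to resolve.
    if name == "asciinema.stibo.dk":
        return "SERVFAIL", []
    return healthy_zone(name)


print(find_relevant_caa("asciinema.stibo.dk", healthy_zone))
```

With `broken_zone` the same call raises on the very first name, so the valid record at stibo.dk is never reached. That matches the error in the log.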

For what it's worth, I see a SERVFAIL trying to request an A or AAAA record for the asciinema.stibo.dk name as well.

Thank you, it seems I had a bad NS record in the zone, which caused recursors to fail.

I think I've fixed the problem by nuking the extra NS record, but I'm waiting for TTLs to expire.

That may not be necessary, as LE only queries the authoritative DNS servers.
That said, I would recommend testing this against the staging system first.

As it happened, it wasn't DNS caches that needed to time out, but simply cert-manager's own back-off. Once that expired, it renewed all the problematic certificates.

All's well again, thank you for reading the log file for me :)
