False CAA failure when issuing certs

Hi all, we’ve recently seen many certs fail to issue due to SERVFAIL. However, when we try to verify manually, we cannot reproduce the SERVFAIL using either dig or unboundtest.

For example, we just attempted to issue a cert with the domain “www.ndcpartnership.org” on it and received the SERVFAIL error on the CAA check. However, when we check with unboundtest, it appears fine:

https://unboundtest.com/m/CAA/www.ndcpartnership.org/2V365RUD

Another example, very recently (last 24 hours): www.rule.com

https://unboundtest.com/m/CAA/www.rule.com/W6LMXXEQ

To clarify, the errors are intermittent?

I’m not sure whether Let’s Encrypt records unbound’s logs but you may be able to ask jsha or cpu to check the logs for when the CAA lookups did fail.

Well, unfortunately we have a system that stops trying to process the certificate when it encounters the CAA error message, and automatically reaches out to the domain name owner so they can fix their CAA records or DNS server. So we don’t know if it’s intermittent or not since we don’t immediately retry.

Hopefully @cpu or @jsha can help.

When I look up those domains in our logs, I see “query timed out” rather than SERVFAIL. It’s a subtle distinction since they can often both result from the same cause, but figured it’s worth mentioning.

I don’t see any reason why things would be failing right now. The number of “query timed out” over the past 30 days across all users is pretty steady.

Can you quantify how recently you noticed the problem, and how many domains have the problem (and how many total domains you renew per day)?

BTW, while I’ve got your attention: It looks like you’re sending the User-Agent “Go-http-client/1.1”. Most likely that’s Go’s ACME module, which used to send that by default. Could you upgrade to the latest version of the module, which will send a more meaningful User-Agent?

One more thing: DNSViz shows a couple of errors for www.rule.com:

pantheon.io to edge.pantheon.io: No SOA RR was returned with the NODATA response. (205.251.192.148, 205.251.195.156, 205.251.196.72, 205.251.199.65, 2600:9000:5300:9400::1, 2600:9000:5303:9c00::1, 2600:9000:5304:4800::1, 2600:9000:5307:4100::1, UDP_0_EDNS0_32768_4096)
pantheon.io to edge.pantheon.io: The Authoritative Answer (AA) flag was not set in the response. (205.251.192.148, 205.251.195.156, 205.251.196.72, 205.251.199.65, 2600:9000:5300:9400::1, 2600:9000:5303:9c00::1, 2600:9000:5304:4800::1, 2600:9000:5307:4100::1, UDP_0_EDNS0_32768_4096)

I am afraid I'm not well-versed enough in DNS to tell whether those could result in the query timeouts we are seeing, but that might be worth looking into. What software do you run for your authoritative nameserver?

Those two issues should be harmless.

Pantheon's using Amazon Route 53. DNSViz does the equivalent of "dig @ns-924.awsdns-51.net +norecurse edge.pantheon.io ds". (Where ns-924.awsdns-51.net is one of pantheon.io's nameservers and edge.pantheon.io is a zone cut.)

Route 53 design predates the DNSSEC RFCs, and it returns a referral to the edge.pantheon.io nameservers instead of a NODATA NOERROR like a modern nameserver. But since Route 53 doesn't support DNSSEC, the zones aren't signed in the first place, and recursive nameservers won't send queries like that. Besides, they had to be designed to interoperate with older authoritative nameservers anyway.
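
To make the DNSViz probe above a bit more concrete, here is a rough Python sketch of the question that dig sends with +norecurse: a DS query for edge.pantheon.io with the RD (recursion desired) flag cleared. The function name `build_query` and its parameters are made up for illustration, and real dig adds extras (EDNS options, a random query ID) that are left out here.

```python
import struct

DS = 43  # DNS record type number for DS


def build_query(name: str, qtype: int, recurse: bool, query_id: int = 0) -> bytes:
    """Build a minimal DNS query packet (header plus one question)."""
    flags = 0x0100 if recurse else 0x0000  # 0x0100 is the RD bit
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", qtype, 1)


packet = build_query("edge.pantheon.io", DS, recurse=False)
```

Sending these bytes over UDP port 53 to one of the pantheon.io nameservers and reading the reply is left out; the point is only that +norecurse comes down to clearing one header bit.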

@jsha hey there, I’ll make a note to update our ACME client.

Recall that we are parsing the error messages that we get back from Let’s Encrypt to see what kind of failure it was. If we see the words “query timed out” in the error, then we do retry, but it looks like in this case, the SERVFAIL failure message is actually being delivered instead of “query timed out”. I see some issues / comments on GitHub that indicate that CAA checking has been changed recently.

Is it possible that Boulder is falling back to the CAA SERVFAIL error instead of delivering the “timed out” error message?

It looks like I was looking too fast and misinterpreted. You're right that there was a SERVFAIL (and there was also a timeout for a related domain):

403 :: caa :: Error creating new cert :: Rechecking CAA: While processing CAA for www.rule.com: DNS problem: SERVFAIL looking up CAA for www.rule.com, While processing CAA for www2.rule.com: DNS problem: query timed out looking up CAA for www2.rule.com

However, www.ndcpartnership.org has just the timeout message:

403 :: caa :: Error creating new cert :: Rechecking CAA: While processing CAA for www.ndcpartnership.org: DNS problem: query timed out looking up CAA for www.ndcpartnership.org
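
Since the client in this thread decides what to do by parsing these strings, here is a small Python sketch of pulling the per-name results out of an error detail like the two above. The regex is inferred only from the messages quoted in this thread; Boulder's wording is not a stable interface and may change, so treat the pattern and the name `parse_caa_problems` as assumptions.

```python
import re

# Matches fragments such as
#   "DNS problem: SERVFAIL looking up CAA for www.rule.com"
# The hostname stops at whitespace or at the comma separating sub-problems.
_CAA_PROBLEM = re.compile(
    r"DNS problem: (SERVFAIL|query timed out) looking up CAA for ([^\s,]+)"
)


def parse_caa_problems(detail: str) -> dict[str, str]:
    """Map each hostname named in the error to its reported DNS problem."""
    return {name: kind for kind, name in _CAA_PROBLEM.findall(detail)}


detail = (
    "403 :: caa :: Error creating new cert :: Rechecking CAA: "
    "While processing CAA for www.rule.com: DNS problem: SERVFAIL "
    "looking up CAA for www.rule.com, While processing CAA for "
    "www2.rule.com: DNS problem: query timed out looking up CAA "
    "for www2.rule.com"
)
problems = parse_caa_problems(detail)
```

Splitting the result per name makes it visible that a single failed order can mix a SERVFAIL for one name with a timeout for another.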

I do notice that you're requesting large multi-SAN certificates. Sometimes checking rate limits for such large certificates can be a bit slow. Perhaps that's taking away time from the overall deadline allowable for looking up CAA, resulting in a timeout for an otherwise performant DNS server.

Which DNS server software are you using? Do you have stats on response times? Can you set it to log queries and double check the performance of the next CAA query that you see timing out?

Maybe there’s just some moderately bad issue and it’s hitting a timeout due to the complexity of the situation?

pantheonsite.io, pantheon.io and edge.pantheon.io are 3 different zones on 3 different sets of nameservers. Plus there’s rule.com and ndcpartnership.org and their nameservers.

The unboundtest.com links show how much indirection there is and how many queries it takes a resolver to figure all that out.

A routing issue causing, say, 1/4 of Route 53 nameservers to get routed to the opposite side of the world, or be entirely inaccessible, could slow down resolution pretty badly, and perhaps cause Boulder to give up.

We don’t control DNS at all - it’s in the hands of our customers, and they may be using any nameservers, anywhere in the world. Thus, I can’t provide any logs.

The main issue is that there is a SERVFAIL being reported by Boulder, but this seems false or cannot be reproduced. We can handle timeouts by retrying.

Regarding www.ndcpartnership.org, apologies, that’s my mistake: I copy-pasted the wrong example. For rule.com, we get:

403 urn:acme:error:caa: Error creating new cert :: Rechecking CAA: While processing CAA for www.rule.com: DNS problem: SERVFAIL looking up CAA for www.rule.com, While processing CAA for www2.rule.com: DNS problem: query timed out looking up CAA for www2.rule.com

It’s a timeout for www2.rule.com and a SERVFAIL for www.rule.com, so we report the CAA failure to the user. However, we cannot reproduce it manually (https://unboundtest.com/m/CAA/www.rule.com/W6LMXXEQ).
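
One possible way to handle this on the client side, sketched in Python under stated assumptions: keep retrying timeouts as described above, but also retry a SERVFAIL a small, bounded number of times before contacting the domain owner, since the failures in this thread could not be reproduced afterwards. The function name, the return values, and the default of two SERVFAIL retries are all hypothetical, not something Boulder or an ACME library prescribes.

```python
import re

# Pattern inferred from the error messages quoted in this thread only.
_PROBLEM = re.compile(r"DNS problem: (SERVFAIL|query timed out) looking up CAA")


def next_action(detail: str, servfail_attempts: int,
                max_servfail_retries: int = 2) -> str:
    """Return "retry", "notify" (contact the domain owner), or "other"."""
    kinds = set(_PROBLEM.findall(detail))
    if not kinds:
        return "other"  # not a CAA DNS problem this sketch recognizes
    if "SERVFAIL" in kinds and servfail_attempts >= max_servfail_retries:
        return "notify"  # SERVFAIL kept recurring, so escalate
    return "retry"  # timeouts, or a SERVFAIL seen too few times so far


mixed = ("While processing CAA for www.rule.com: DNS problem: SERVFAIL "
         "looking up CAA for www.rule.com, While processing CAA for "
         "www2.rule.com: DNS problem: query timed out looking up CAA "
         "for www2.rule.com")
first_try = next_action(mixed, servfail_attempts=0)
third_try = next_action(mixed, servfail_attempts=2)
```

The caller would be responsible for counting SERVFAIL attempts per order and for spacing out the retries.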

Hm, are you not pantheon.io? I just noticed both of those domains CNAME to pantheon.io, which is why I assumed you had a role in operating the relevant DNS.

Yeah we are, so in those cases we have customers CNAME to our platform, but in other cases our customers use an A record directly to our IP addresses.

Have you checked if there’s a correlation between the customers using CNAME and the ones getting this error?

Will take a look.

I’ve seen missing DNS responses caused by IPv6 MTU limits that are set too low, or by overly strict ICMP blocking rules. Either can add to the overall delay in getting a response, or prevent a response altogether.
EDNS lets a resolver advertise UDP payload sizes above the original 512-byte limit; 4096 bytes is a common value.
Sadly, it is still not properly implemented across the Internet, even though it has been out since 1999: https://www.ietf.org/rfc/rfc2671.txt

from: https://tools.ietf.org/html/rfc6891
EDNS provides a mechanism to improve the scalability of DNS as its
uses get more diverse on the Internet. It does this by enabling the
use of UDP transport for DNS messages with sizes beyond the limits
specified in RFC 1035 …

https://tools.ietf.org/id/draft-andrews-dnsext-udp-fragmentation-01.html
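
For reference, a resolver advertises a larger UDP payload size by adding an OPT pseudo-record to the additional section of its query (RFC 6891). Below is a minimal Python sketch, standard library only, of a CAA query carrying such a record with a 4096-byte size. `build_edns_query` is a hypothetical helper name; this only illustrates the wire layout and is not how Unbound or Boulder construct their queries.

```python
import struct

CAA = 257  # DNS record type number for CAA
OPT = 41   # DNS record type number for the EDNS OPT pseudo-record


def build_edns_query(name: str, qtype: int, udp_size: int = 4096,
                     query_id: int = 0) -> bytes:
    """Build a recursive DNS query with an EDNS0 OPT record attached."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS=1 (IN)
    # OPT record: root name, TYPE=41, CLASS carries the advertised UDP size,
    # TTL carries extended RCODE/version/flags (all zero here), RDLENGTH=0
    opt = b"\x00" + struct.pack("!HHIH", OPT, udp_size, 0, 0)
    return header + question + opt


packet = build_edns_query("www.rule.com", CAA)
```

If a path between resolver and nameserver drops large or fragmented UDP packets, a response sized for that 4096-byte advertisement may never arrive, which shows up on the resolver side as a timeout.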

@jsha here’s one that just occurred, which is not CNAME to us, they are using their own DNS:

www.altamidtown.com
https://unboundtest.com/m/CAA/www.altamidtown.com/KEQ5TQL6
dig CAA www.altamidtown.com = NOERROR
dig +short www.altamidtown.com = 205.178.189.131

but Boulder says: “SERVFAIL checking CAA on www.altamidtown.com for www.altamidtown.com”

Both unboundtest and local dig to Google and their resolvers are NOERROR

So far all these domains are delegated via worldnic.com nameservers.

Have you seen the errors for domains delegated anywhere else?

I wonder if Network Solutions are doing something like rate-limiting Let’s Encrypt’s resolvers.

Which ones? www.rule.com is CNAME to Pantheon, and the Pantheon domains are AWS Route53