SERVFAIL causing issuance failures, unable to reproduce in Unbound or locally

Hi team, we’re seeing issues and we’re not sure if it’s related to the incident or not. The status page says “monitoring” and that the root cause has been fixed.

LE is giving us:

    403 urn:acme:error:caa: Error creating new cert :: Rechecking CAA: While processing CAA for brightgen.com: 
    DNS problem: SERVFAIL looking up CAA for brightgen.com

We see NOERROR locally:

    $ dig CAA brightgen.com @8.8.8.8

    ; <<>> DiG 9.12.1-P2 <<>> CAA brightgen.com @8.8.8.8
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7437

And we get NOERROR from unboundtest: https://unboundtest.com/m/CAA/brightgen.com/MZZR5RHO
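For anyone who wants to repeat the same check against several resolvers or authoritative servers in one go, here is a minimal sketch using only the Python standard library. It sends one CAA query over UDP and reports the status from the reply header. The function names and the 5-second timeout are my own choices for illustration, and a NOERROR here only describes what that one server returned to you, not what Let's Encrypt's resolver sees.

```python
import socket
import struct

# Status names for the response codes most relevant to this thread.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}
CAA = 257  # RR type number assigned to CAA


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Encode a one-question DNS query with the recursion-desired flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(response: bytes) -> str:
    """Return the status name held in the low 4 bits of the header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def lookup_status(name: str, server: str, qtype: int = CAA,
                  timeout: float = 5.0) -> str:
    """Query one server over UDP and return its status, or 'timeout'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, qtype), (server, 53))
            # Sketch only: the reply's ID and source are not verified here.
            response, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
    return rcode_of(response)
```

Calling `lookup_status("brightgen.com", "8.8.8.8")` mirrors the dig command above; passing the IP address of one of the domain's own nameservers instead checks the authoritative side directly.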

This is not an isolated case. Other domains with the same or a similar problem (sometimes the A record lookup fails instead) are:

DNS problem: SERVFAIL looking up CAA for brightgen.com
SERVFAIL looking up CAA for brightgen.co.uk
DNS problem: query timed out looking up A for www.faerykisses.co.uk
SERVFAIL looking up A for www.aclu-nca.org

We’re seeing a very high error rate when trying to issue certs and wonder whether there is still some fallout from the DNS issue. Could Unbound require a restart or something like that?

This is an ongoing issue. Here is another domain that has no apparent problem yet is being rejected:

expresstrailer.net
www.expresstrailer.net

DNS problem: query timed out looking up CAA for expresstrailer.net

Also having the problem:

www.thechartstore.biz
www.thechartstore.com

@jsha hope you don’t mind the mention on this one, we’re running into it a lot and it doesn’t appear that the status page has any info about this specific issue.

Jacob is on vacation. I’m not sure the Let’s Encrypt server logs are very helpful with these kinds of issues, so I would suggest that you let our resident DNS guru @mnordhoff have a look first. He has a knack for figuring out wacky DNS issues like these.


I’m stumped. :confounded:

There have been an oddly large number of reports of DNS issues today.

One of them was due to a misconfiguration with the domain.

With another, 1 of 2 nameservers used by the domain was partly broken due to a misconfiguration, but the domain ought to have worked anyway thanks to the other nameserver.

Most of the domains seem to have nothing obviously wrong.

I’m wondering if Let’s Encrypt really is having an issue. Maybe a routing issue affecting a small percentage of traffic.

The letsencrypt.org issue on the status page was with the authoritative DNS. It ought to be more or less impossible for it to have had any impact on the resolvers, but strange things can happen when you have a severe outage.


Something that I have noticed is that each domain you reported has errors according to dnsviz.

:open_mouth: Still, Let’s Encrypt won’t make SOA queries, and shouldn’t be using TCP often. If those are the only issues with those domains, they should be harmless. (Edit: Harmless to Let’s Encrypt’s resolver. They’re still bad, and need to be fixed, in general.)

The www.aclu-nca.org issues are mostly the TLD and ordinary Amazon Route 53 stuff.

Adding another to the list of domains we’re seeing LE have difficulty with: staffs-wildlife.org.uk


This could potentially be an IPv4 vs. IPv6 problem; the www subdomain has an IPv6 entry but the base domain doesn't, and the IPv6 www subdomain and the IPv4 base domain return different content in HTTP.

Production Let's Encrypt legitimately can't resolve it: https://acme-v02.api.letsencrypt.org/acme/authz/twN2s7iCs5AUlnEsr7mEb__1bBGfvVLhzkYvzhJfX3c

Neither can staging.

The last couple of days seem to have introduced some invisible hoop that nameservers need to jump through but nobody can identify :confused:

That said, OP did post about this a few weeks ago, so it could just be their specific nameservers still suffering from the same problems they did in the past.


Thanks for the reply @schoen. I can’t reproduce different content being served based upon IPv4 vs IPv6. Test I used:

    ~ curl staffs-wildlife.org.uk/.well-known/acme-challenge/randomtext && echo
    randomtext.vKGSnNTMm-njyWJQYjhmPuIovGcwxiduMtzbURl4_Yc
    ~ curl -4 www.staffs-wildlife.org.uk/.well-known/acme-challenge/randomtext && echo
    randomtext.vKGSnNTMm-njyWJQYjhmPuIovGcwxiduMtzbURl4_Yc
    ~ curl -6 www.staffs-wildlife.org.uk/.well-known/acme-challenge/randomtext && echo
    randomtext.vKGSnNTMm-njyWJQYjhmPuIovGcwxiduMtzbURl4_Yc

This matches our experience as well.

Another one:

"detail": "DNS problem: SERVFAIL looking up A for www.northarrowpartners.com",

2 posts were split to a new topic: SERVFAIL from authoritative DNS server (0x20 case randomization issue)

It's unlikely, because we don't control our customers' nameservers or DNS settings, and a wide range of domains are failing on completely different servers.

Another:

SERVFAIL looking up CAA for jrudman.com

Another:

403 urn:acme:error:caa: Error creating new cert :: Rechecking CAA: While processing CAA for awhitepondparadise.com: DNS problem: query timed out looking up CAA for awhitepondparadise.com, While processing CAA for www.awhitepondparadise.com: DNS problem: query timed out looking up CAA for

This time we let our service continue attempting to issue a cert for it.

After about 90 retries, LE was finally able to resolve the domain correctly.

@marktheunissen @kf6nux I went through all of the domain names shared in this thread. The majority of them displayed problems with 0x20 case randomization. Further, the problematic domains overlap in the authoritative nameservers they use. The only one I’m currently stumped by is www.aclu-nca.org.
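For anyone who wants to test their own nameservers, here is a rough sketch of a 0x20 probe in Python (standard library only). With 0x20 randomization the resolver mixes the case of the query name, and a well-behaved authoritative server answers normally and repeats the name with the same case. The sketch sends a few mixed-case CAA queries to one server and reports whether each reply was NOERROR with the question echoed byte for byte. The function names, attempt count, and timeout are assumptions of mine, and this is not how Let's Encrypt's resolver is implemented; it only approximates the behavior.

```python
import random
import socket
import struct


def randomize_case(name: str, rng: random.Random) -> str:
    """Flip each letter to upper or lower case at random (the 0x20 trick)."""
    return "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower() for ch in name
    )


def build_query(name: str, qtype: int, query_id: int) -> bytes:
    """Encode a one-question DNS query, keeping the name's case as given."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def question_echoed(query: bytes, response: bytes) -> bool:
    """True if the response repeats the question section byte for byte."""
    question = query[12:]
    return response[12:12 + len(question)] == question


def probe(name: str, server: str, qtype: int = 257,
          attempts: int = 5, timeout: float = 5.0) -> list:
    """Send mixed-case queries to one server; return (name, result) pairs."""
    rng = random.Random()
    results = []
    for _ in range(attempts):
        mixed = randomize_case(name, rng)
        query = build_query(mixed, qtype, rng.randrange(0x10000))
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(query, (server, 53))
                response, _ = sock.recvfrom(4096)
            except socket.timeout:
                results.append((mixed, "timeout"))
                continue
        if len(response) < 12:
            results.append((mixed, "short reply"))
            continue
        rcode = response[3] & 0x0F  # low 4 bits of the second flags byte
        echoed = question_echoed(query, response)
        status = "ok" if rcode == 0 and echoed else f"rcode={rcode} echoed={echoed}"
        results.append((mixed, status))
    return results
```

Point `probe()` at the IP address of an authoritative nameserver for the domain rather than at a public resolver, since the resolver would hide the authoritative server's behavior. A server that answers all-lowercase queries fine but times out or fails on mixed-case ones fits the pattern described above.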

All of the following use meganameservers.eu for their authoritative DNS and fail to handle 0x20 randomization properly:

The following use att-websites.com for their authoritative DNS and fail to handle 0x20 randomization properly:

The following use aplus.net for their authoritative DNS and fail to handle 0x20 randomization

The following use earthlink.net for their authoritative DNS and fail to handle 0x20 randomization

Did you check the authoritative DNS servers? To me it looks like there is non-trivial overlap between failing domains and their DNS providers...
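One way to look for that overlap is to group the failing domains by the provider behind their NS records. Below is a small Python sketch of that idea. It assumes you have already collected each domain's NS hostnames yourself (for example with `dig NS`), and it guesses the provider naively from the last two labels of the NS hostname, which will be wrong for hosts under suffixes such as co.uk. The function names and the sample input in the usage note are hypothetical.

```python
from collections import defaultdict


def provider_of(ns_host: str) -> str:
    """Naive guess at the provider: the last two labels of the NS hostname."""
    labels = ns_host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers: dict) -> dict:
    """Map each provider to the sorted list of domains that delegate to it.

    `nameservers` maps a domain name to the list of its NS hostnames.
    """
    groups = defaultdict(set)
    for domain, hosts in nameservers.items():
        for host in hosts:
            groups[provider_of(host)].add(domain)
    return {provider: sorted(domains)
            for provider, domains in sorted(groups.items())}
```

For example, `group_by_provider({"a.example": ["ns1.dns-host.example."], "b.example": ["ns2.dns-host.example."]})` puts both placeholder domains under `dns-host.example`. Providers that collect many failing domains are the ones worth testing first.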