Intermittent SERVFAIL for DNS Validation in staging and always fails in production

My domain is: letsencrypttest.agd.gov.au

I ran this command:

acme.sh -d letsencrypttest.agd.gov.au --server letsencrypt --dns --yes-I-know-dns-manual-mode-enough-go-ahead-please --issue

I then had the TXT records created manually through a support ticket (purely for debugging; the automated setup uses a subdomain delegated to Azure DNS) and ran:

acme.sh -d letsencrypttest.agd.gov.au --server letsencrypt --dns --yes-I-know-dns-manual-mode-enough-go-ahead-please --renew

It produced this output:

[Mon Oct 23 14:15:35 AEDT 2023] The domain 'letsencrypttest.agd.gov.au' seems to have a ECC cert already, lets use ecc cert.
[Mon Oct 23 14:15:35 AEDT 2023] Renew: 'letsencrypttest.agd.gov.au'
[Mon Oct 23 14:15:35 AEDT 2023] Renew to Le_API=https://acme-v02.api.letsencrypt.org/directory
[Mon Oct 23 14:15:36 AEDT 2023] Using CA: https://acme-v02.api.letsencrypt.org/directory
[Mon Oct 23 14:15:36 AEDT 2023] Single domain='letsencrypttest.agd.gov.au'
[Mon Oct 23 14:15:36 AEDT 2023] Getting domain auth token for each domain
[Mon Oct 23 14:15:36 AEDT 2023] Verifying: letsencrypttest.agd.gov.au
[Mon Oct 23 14:15:38 AEDT 2023] Pending, The CA is processing your order, please just wait. (1/30)
[Mon Oct 23 14:15:43 AEDT 2023] Invalid status, letsencrypttest.agd.gov.au:Verify error detail:DNS problem: SERVFAIL looking up CAA for agd.gov.au - the domain's nameservers may be malfunctioning
[Mon Oct 23 14:15:43 AEDT 2023] Please add '--debug' or '--log' to check more details.
[Mon Oct 23 14:15:43 AEDT 2023] See: https://github.com/acmesh-official/acme.sh/wiki/How-to-debug-acme.sh
[Mon Oct 23 14:15:45 AEDT 2023] The dns manual mode can not renew automatically, you must issue it again manually. You'd better use the other modes instead.

My web server is (include version): N/A (I'm using DNS verification)

The operating system my web server runs on is (include version): Ubuntu 20.04

My hosting provider, if applicable, is: DNS is hosted by Telstra

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):

# acme.sh version
https://github.com/acmesh-official/acme.sh
v3.0.7

Something weird is going on between the Let's Encrypt servers and Telstra's DNS servers. With the Let's Encrypt staging servers, validation sometimes works and sometimes fails; with the production servers it fails every time. This used to work, but our scheduled renewals started failing on the 23rd of September and it has been broken since then.

The SERVFAIL sometimes happens on the TXT record or sometimes on the CAA record. We're not using DNSSEC so it's unrelated to that.

While debugging I decided to try Google Trust Services and it worked fine, so it seems to be a transient issue affecting only the network path between Let's Encrypt production and Telstra. I raised a support request with Telstra; they looked through their DNS logs and didn't find any errors during the window. They saw the requests from Let's Encrypt, but no errors were logged.

I tried using letsdebug.net and it shows the failure.

Then running it again, it succeeded.

Is it possible to get more detail on why Let's Encrypt DNS resolvers think it is a SERVFAIL? Is it connectivity problems? Why does staging sometimes work and production never? Why does it work via Google Trust Services in place of Let's Encrypt but using the same DNS provider?

I've been trying to diagnose this for a few weeks now and it seems a really curly problem. At the moment it seems like it's a case of "The Internet says No".

Unboundtest might help:

https://unboundtest.com/m/CAA/agd.gov.au/34SUKG5O


Oh thanks, I had looked at unboundtest beforehand and I noticed there were a lot of referrals, but this one succeeded:

https://unboundtest.com/m/CAA/agd.gov.au/IK65JM63

So I guess sometimes that referral pathway is too long?

Well at least there is something I can point Telstra at, it looks like their DNS setup is a bit wonky.

When it errors out, the message was:

exceeded the maximum nameserver nxdomains


Ok thanks.

So it worked with Google, because Google is more tolerant of these nxdomains? But Let's Encrypt DNS resolvers don't allow as many non-existent name servers?


Looking through the unboundtest results.

What is the actual problem? I.e. which nameserver didn't respond? The last line mentions too many nxdomains.

Are the problems the "nodata" responses?

Oct 23 06:05:35 unbound[246337:0] info: response for adns01.bigpond.com. AAAA IN
Oct 23 06:05:35 unbound[246337:0] info: reply from <bigpond.com.> 2600:1408:1c::41#53
Oct 23 06:05:35 unbound[246337:0] info: query response was nodata ANSWER
Oct 23 06:05:36 unbound[246337:0] info: response for adns03.bigpond.com. AAAA IN
Oct 23 06:05:36 unbound[246337:0] info: reply from <bigpond.com.> 2600:1480:d800::42#53
Oct 23 06:05:36 unbound[246337:0] info: query response was nodata ANSWER

Or is unboundtest not showing us which nameserver was problematic?

I'm not really a DNS expert, but it looks like the referral chain was too deep and unbound gave up?


I would start here: letsencrypttest.agd.gov.au | DNSViz

Then, (especially) since it is a gov.au domain:
[have your IT folks turn on DNSSEC]


To detail the above:

Following the DNS tree path, you see only two authoritative DNS servers:

agd.gov.au      nameserver = ns13.msng.telstra.com.au
agd.gov.au      nameserver = ns23.msng.telstra.com.au

But when they are asked, they return four authoritative DNS servers:

agd.gov.au      nameserver = ns1.msng.telstra.com.au
agd.gov.au      nameserver = ns2.msng.telstra.com.au
agd.gov.au      nameserver = ns13.msng.telstra.com.au
agd.gov.au      nameserver = ns23.msng.telstra.com.au

Since all four seem to be in sync and authoritative, the logical fix is to list all four at the domain registrar.
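The mismatch described above can be checked mechanically. Here is a minimal Python sketch that compares the NS set seen in the parent's delegation with the NS set the zone's own servers return. The two lists are copied from the output above; the function name `compare_ns_sets` is just an illustrative choice, and in practice the lists would come from real lookups rather than being hard-coded.

```python
def compare_ns_sets(parent, child):
    """Report name servers that appear on only one side of a delegation."""
    def normalize(names):
        # Host names are case-insensitive and may carry a trailing dot.
        return {name.rstrip(".").lower() for name in names}

    parent_set, child_set = normalize(parent), normalize(child)
    return {
        "only_in_parent": sorted(parent_set - child_set),
        "only_in_child": sorted(child_set - parent_set),
    }


# NS records from the delegation (following the DNS tree path).
parent_ns = [
    "ns13.msng.telstra.com.au",
    "ns23.msng.telstra.com.au",
]
# NS records returned by the authoritative servers themselves.
child_ns = [
    "ns1.msng.telstra.com.au",
    "ns2.msng.telstra.com.au",
    "ns13.msng.telstra.com.au",
    "ns23.msng.telstra.com.au",
]

diff = compare_ns_sets(parent_ns, child_ns)
print(diff)
```

An empty result on both sides would mean the delegation and the zone agree; here the two extra servers show up under `only_in_child`.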


They have removed ns1 and ns2, as that is the simplest option at present (I'm not sure of the process for changing records at the .gov.au registrar). But that hasn't made a difference.

I have run DNSViz against both nameservers.

There is an error retrieving a DNS key:

  • telstra.com.au/DNSKEY: No response was received from the server over UDP (tried 4 times). (203.50.248.6, UDP_-_EDNS0_512_D_KN)

Could that impact the Unbound Let's Encrypt DNS validations?


Where?
Did you confirm the existence of such a record?
I get no TXT record from either:
nslookup -q=txt letsencrypttest.agd.gov.au. ns13.msng.telstra.com.au
nslookup -q=txt letsencrypttest.agd.gov.au. ns23.msng.telstra.com.au

[edit: I see it now - just forgot to include _acme-challenge. at the start]


Yes, the TXT record was created on _acme-challenge.letsencrypttest.agd.gov.au.

# dig _acme-challenge.letsencrypttest.agd.gov.au TXT @168.63.129.16

; <<>> DiG 9.16.1-Ubuntu <<>> _acme-challenge.letsencrypttest.agd.gov.au TXT @168.63.129.16
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59425
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1224
;; QUESTION SECTION:
;_acme-challenge.letsencrypttest.agd.gov.au. IN TXT

;; ANSWER SECTION:
_acme-challenge.letsencrypttest.agd.gov.au. 1800 IN TXT "zFYVV-8JUYqdSb4mHTKu_SSIG_0ofsFl4NocR8huaIE"

;; Query time: 339 msec
;; SERVER: 168.63.129.16#53(168.63.129.16)
;; WHEN: Tue Oct 24 11:59:41 AEDT 2023
;; MSG SIZE  rcvd: 127

We need to find out who is doing this validation? And from where?
[which DNS servers are being used to validate it?]

Using these may give us more insight:


unboundtest readily reproduces this. My first two worked but I got a SERVFAIL on my 3rd try (and osiris has one earlier)
https://unboundtest.com/m/TXT/_acme-challenge.letsencrypttest.agd.gov.au/6J7EZF3I


OK
These entries are troubling:

Oct 24 02:29:56 unbound[253473:0] info: reply from <au.> 2a01:8840:c0::1#53
Oct 24 02:29:56 unbound[253473:0] info: query response was REFERRAL
Oct 24 02:29:56 unbound[253473:0] info: resolving adns01.bigpond.com. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: resolving adns04.bigpond.com. A IN
Oct 24 02:29:56 unbound[253473:0] info: resolving adns04.bigpond.com. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: resolving adns01.bigpond.com. A IN

Oct 24 02:29:56 unbound[253473:0] info: response for ns13.msng.telstra.com.au. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: reply from <au.> 65.22.196.1#53
Oct 24 02:29:56 unbound[253473:0] info: query response was REFERRAL
Oct 24 02:29:56 unbound[253473:0] info: resolving a11-66.akam.net. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: resolving adns02.bigpond.com. A IN
Oct 24 02:29:56 unbound[253473:0] info: resolving adns02.bigpond.com. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: resolving a11-66.akam.net. A IN

Oct 24 02:29:56 unbound[253473:0] info: response for ns23.msng.telstra.com.au. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: reply from <au.> 65.22.196.1#53
Oct 24 02:29:56 unbound[253473:0] info: query response was REFERRAL
Oct 24 02:29:56 unbound[253473:0] info: resolving a1-185.akam.net. AAAA IN
Oct 24 02:29:56 unbound[253473:0] info: resolving a1-185.akam.net. A IN

Why/how are BigPond and Akamai (akam.net) getting involved?
Seems like it gets derailed early on.


Yeah, oddly my failing example and osiris's had very different flows. That's the edge of my skills. I'm sure you can make better sense of it. You should be able to recreate it too.


I had also wondered why some of it is Akamai and some of it is BigPond; I would have thought they would just pick one or the other. For reference, Telstra owns BigPond (it's basically one of their brands). Maybe Telstra is migrating their DNS to Akamai and hasn't finished the migration?

What do you find troubling about those records?

I did try that, but there is no extra detail: it just shows the API call made to the Let's Encrypt servers and the JSON response. There is nothing beyond the error message: letsencrypttest.agd.gov.au:Verify error detail:DNS problem: SERVFAIL looking up CAA for agd.gov.au - the domain's nameservers may be malfunctioning

I thought that maybe Let's Encrypt must be running their own unbound resolvers given the comment at https://unboundtest.com/ says:

The Unbound instance is configured very similarly to Let's Encrypt's production servers

Or do you mean which of the Telstra/Bigpond/Akamai DNS servers are responding with NXDOMAIN?

I have been considering running my own test unbound server with verbose logging, because looking at the unbound code there are two places (see also line 2440) that can produce the "exceeded the maximum nameserver nxdomains" error.


They seem to lead down a dead-end path.

They do.
I was just not sure if the ACME client in use [didn't see which one is being used] is also doing some sort of pre-validation [via some third-party/public DNS].
If so, that might be able to be turned off.
[seems like this is not the case]

That is "the problem".
It is a cumulative response - a few may be tolerable, but with too many, it chokes.
My questions are:

  • Why are there any?
  • What is their cause?
  • How can we reduce/eliminate them?

Additional Diagnostics

I figured I should post some updates, so this topic doesn't close itself. I've been working with Telstra on a resolution.

I installed my own unbound DNS resolver so that I could ramp up the log level and figure out where the NXDOMAINs are coming from. I have it running on Windows (which is perhaps not ideal) because I was struggling to get internet-routable IPv6 working inside WSL 2.

I saw that the messages we needed were logged with the VERB_ALGO verbosity level which is "4".

It's really noisy at level 4, and it produced 5 MB of logs for just the CAA query.

27/10/2023 12:06:13 AM unbound[29856:0] info: iterator operate: query ns13.msng.telstra.com.au. A IN
27/10/2023 12:06:13 AM unbound[29856:0] debug: iter_handle processing q with state QUERY TARGETS STATE
27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: ns13.msng.telstra.com.au. A IN
27/10/2023 12:06:13 AM unbound[29856:0] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 6
27/10/2023 12:06:13 AM unbound[29856:0] debug: parent-side information is already present for the delegation point, no fallback possible
27/10/2023 12:06:13 AM unbound[29856:0] debug: return error response SERVFAIL
27/10/2023 12:06:13 AM unbound[29856:0] debug: mesh_run: iterator module exit state is module_finished

So it looks like we are hitting line 2397, which is the error: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 6.

I also noticed 24 errors like the ones below in my logs. I wondered whether they were an artifact of running on Windows, with the UDP ports unbound was picking clashing with something on my system. I couldn't figure out if these permission denied errors were counting as nxdomain errors. However, 24 doesn't seem to match the number of nxdomain errors, so I'm not sure.

27/10/2023 12:06:09 AM unbound[29856:0] error: can't bind socket: Permission denied. for 0.0.0.0 port 65434 (len 16)
27/10/2023 12:06:10 AM unbound[29856:0] error: can't bind socket: Permission denied. for :: port 49615 (len 28)

In total there were 6 of the "request has exceeded the maximum number of nxdomain nameserver lookups" errors:

61951 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: ns13.msng.telstra.com.au. A IN
61954: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 6
61965 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: ns13.msng.telstra.com.au. AAAA IN
61968: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 6
62200 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: a18-64.akam.net. AAAA IN
62203: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 7
62412 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: a18-64.akam.net. A IN
62415: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 7
62426 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: ns23.msng.telstra.com.au. AAAA IN
62429: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 8
62440 27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: _acme-challenge.letsencrypttest.agd.gov.au. CAA IN
62443: 27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 9
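Pulling these pairs out of a 5 MB log by hand is tedious, so here is a minimal Python sketch of how the same summary could be extracted. It assumes the message wording shown in the excerpts above (other Unbound versions may phrase it differently), and the function name `nx_limit_hits` is purely illustrative. The sample lines are copied from this post.

```python
import re

# Patterns based on the log excerpts in this thread; the exact wording
# may differ between Unbound versions (assumption).
TARGET_RE = re.compile(r"processQueryTargets: (\S+) ([A-Z]+) IN\b")
NX_LIMIT_RE = re.compile(
    r"exceeded the maximum number of nxdomain nameserver lookups "
    r"\((\d+)\) with (\d+)"
)


def nx_limit_hits(lines):
    """Return (name, rtype, limit, count) for each nxdomain-limit message,
    paired with the most recent processQueryTargets query line before it."""
    hits = []
    current = (None, None)
    for line in lines:
        target = TARGET_RE.search(line)
        if target:
            current = (target.group(1), target.group(2))
            continue
        limit = NX_LIMIT_RE.search(line)
        if limit:
            hits.append((*current, int(limit.group(1)), int(limit.group(2))))
    return hits


sample = """\
27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: ns13.msng.telstra.com.au. A IN
27/10/2023 12:06:13 AM unbound[29856:0] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 6
27/10/2023 12:06:13 AM unbound[29856:0] info: processQueryTargets: a18-64.akam.net. AAAA IN
27/10/2023 12:06:13 AM unbound[29856:0] debug: request has exceeded the maximum number of nxdomain nameserver lookups (5) with 7
""".splitlines()

hits = nx_limit_hits(sample)
for name, rtype, limit, count in hits:
    print(f"{name} {rtype}: limit {limit}, count {count}")
```

To run it over a real log, the `sample` list could be replaced with the lines of the log file. The counter lines ("targetqueries 0, ...") are skipped because they do not end in a record type followed by IN.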

What I don't understand is what counts as an NXDOMAIN. From the logs, I don't believe any of the DNS servers actually responded with NXDOMAIN (at verbosity level 4 the actual DNS responses are logged in full). It looks like there were 35 responses from various DNS servers in total.

One interesting thing is that there were 10 errors of the form: request has exceeded the maximum number of sends with 33. This error is logged very close to the NXDOMAIN code, so I'm wondering if there are just too many nameservers involved, or whether it gets stuck in some sort of loop.

I found that the maximum number of sends is set by the max-sent-count option in unbound.conf, which defaults to 32, and that the count resets for each CNAME and referral.

I also noticed 16 Capsforid-related messages, in combinations of these:

27/10/2023 12:06:08 AM unbound[29856:0] info: Capsforid: timeouts, starting fallback
27/10/2023 12:06:08 AM unbound[29856:0] info: Capsforid: reply is equal. go to next fallback

But I'm really not sure how much was just my computer and how much is the actual cause.

Pending Changes with Telstra

Telstra noticed that BigPond serves a large number of NS records for msng.telstra.com.au which, together with the Akamai nameservers in the ADDITIONAL section for bigpond.com, may be making the response too large.

They pointed to the difference between

dig @adns01.bigpond.com msng.telstra.com.au NS +noedns

vs

dig @adns01.bigpond.com msng.telstra.com.au NS

With the ‘noedns’ flag, the response is actually truncated to 509 bytes instead of the UDP packet being fragmented.

I did some digging into what it means to disable EDNS. I noticed that the “+noedns” version cut the AAAA records out of the ADDITIONAL section of the response. Without EDNS, the maximum size of a DNS message over UDP is 512 bytes, and there is a TC flag which signals that the UDP response has been truncated, so the client can retry over TCP and get the full response. It doesn’t sound like we should expect DNS UDP packets to get fragmented. I also note that the BigPond server didn’t set the TC flag; perhaps because the ADDITIONAL section is a convenience, you’d only see the TC flag if the ANSWER section was too long.
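To make the TC flag concrete, here is a minimal Python sketch that decodes the flag bits from the 12-byte header at the start of a DNS message. The bit positions follow the standard header layout; the function name `header_flags` and the hand-built sample headers are illustrative only and are not captured from the BigPond servers.

```python
import struct

# Without EDNS, a DNS message over UDP is limited to 512 bytes.
CLASSIC_UDP_LIMIT = 512


def header_flags(message: bytes) -> dict:
    """Decode selected fields from the 12-byte DNS message header."""
    if len(message) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    # ID, flags, then the question/answer/authority/additional counts.
    _msg_id, flags, _qd, _an, ns_count, ar_count = struct.unpack("!6H", message[:12])
    return {
        "qr": bool(flags & 0x8000),   # this message is a response
        "aa": bool(flags & 0x0400),   # authoritative answer
        "tc": bool(flags & 0x0200),   # truncated: retry over TCP
        "rcode": flags & 0x000F,      # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
        "authority": ns_count,
        "additional": ar_count,
    }


# Hand-built example headers (not real captures): an authoritative
# response with 14 NS records and 13 additional records, with and
# without the TC bit set.
truncated = struct.pack("!6H", 0x1234, 0x8600, 1, 0, 14, 13)
complete = struct.pack("!6H", 0x1234, 0x8400, 1, 0, 14, 13)

print(header_flags(truncated)["tc"], header_flags(complete)["tc"])
```

A resolver that sees `tc` set on a UDP reply is expected to repeat the query over TCP; a reply that silently drops records without setting it gives the client no such hint.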

Telstra also pointed out to me that one of the A records was missing in the ADDITIONAL SECTION too. 13 records instead of 14.

However, surely unbound, the DNS server, wouldn't disable EDNS?

Although maybe reducing the number of nameservers will make a difference for other reasons, Telstra have said this:

We’re currently in the process of working with BigPond to trim the number of nameservers for msng.telstra.com.au from 14 down to 4.

Fingers crossed it resolves the issue. Otherwise maybe I'll need to post on the unbound DNS server mailing list so I can understand what is causing unbound to report NXDOMAIN/SERVFAIL.


Telstra pushed the change through and I managed to issue a certificate via Let's Encrypt Production, hooray!

I guess 14 nameservers is just too many and unbound doesn't like it.
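As a rough back-of-the-envelope illustration (not a model of how Unbound actually counts its lookups), the Python sketch below shows how the number of name server address lookups could scale with the size of the NS set, assuming the worst case where every name server name needs both an A and an AAAA query. The function name is hypothetical.

```python
def worst_case_address_lookups(nameserver_count, address_types=("A", "AAAA")):
    """Upper bound on address lookups if every name server name needed
    one query per address type (i.e. no usable glue at all)."""
    return nameserver_count * len(address_types)


before = worst_case_address_lookups(14)  # NS set before Telstra's change
after = worst_case_address_lookups(4)    # NS set after trimming

print(before, after)
```

For comparison, the logs earlier in the thread reported a limit of 5 on failed name server lookups and "maximum number of sends" errors at 33, so a smaller NS set leaves considerably more headroom under both.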
