Was there some DNS lookup failure in recent days?

We have a CI pipeline that runs once a day to check whether it needs to renew any of our certificates. In the last couple of days, it did want to renew at least one, a cert with several Subject Alternative Names which were wildcards. However, the challenges kept failing. We tried to diagnose whether something was wrong on our end, but didn't really find anything. For example, I set up the same TXT record by hand and didn't see any issues looking it up from various public DNS servers.

The job ran again this morning, and now it went fine; nothing changed in our setup since the last failure.

So I was wondering if perhaps there was some known failure in looking up DNS records from the Let's Encrypt side, that affected us in the last few days.

My domain is: [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de]

I ran this command:
["lego", "--accept-tos", "--dns", "designate", "--path", "/tmp/lego", "--dns.resolvers", "8.8.8.8", "--dns.resolvers", "1.1.1.1", "--server=https://acme-v02.api.letsencrypt.org/directory", "--email", "noreply@syseleven.de", "--key-type", "rsa4096", "-d", "*.cloud.syseleven.net", "-d", "*.infra.sys11cloud.net", "-d", "*.infrabk.sys11cloud.net", "-d", "*.infrabl.sys11cloud.net", "-d", "*.infrafe.sys11cloud.net", "-d", "cloud.syseleven.de", "renew", "--preferred-chain", "ISRG Root X1"]

It produced this output:

(output from ansible; line breaks added for better readability)

fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["lego", "--accept-tos", "--dns", "designate", "--path", "/tmp/lego", "--dns.resolvers", "8.8.8.8", "--dns.resolvers", "1.1.1.1", "--server=https://acme-v02.api.letsencrypt.org/directory", "--email", "noreply@syseleven.de", "--key-type", "rsa4096", "-d", "*.cloud.syseleven.net", "-d", "*.infra.sys11cloud.net", "-d", "*.infrabk.sys11cloud.net", "-d", "*.infrabl.sys11cloud.net", "-d", "*.infrafe.sys11cloud.net", "-d", "cloud.syseleven.de", "renew", "--preferred-chain", "ISRG Root X1"], "delta": "0:08:24.702166", "end": "2022-11-30 16:16:11.198601", "msg": "non-zero return code", "rc": 1, "start": "2022-11-30 16:07:46.496435", "stderr": "2022/11/30 16:07:47 [INFO] [*.cloud.syseleven.net] acme: Trying renewal with 731 hours remaining
2022/11/30 16:07:47 [INFO] renewal: random delay of 7m36.394249225s
2022/11/30 16:15:24 [INFO] [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de] acme: Obtaining bundled SAN certificate
2022/11/30 16:15:25 [INFO] [*.cloud.syseleven.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365647
2022/11/30 16:15:25 [INFO] [*.infra.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182218481187
2022/11/30 16:15:25 [INFO] [*.infrabk.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570677
2022/11/30 16:15:25 [INFO] [*.infrabl.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365677
2022/11/30 16:15:25 [INFO] [*.infrafe.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570687
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182316532597
2022/11/30 16:15:25 [INFO] [*.cloud.syseleven.net] acme: authorization already valid; skipping challenge
2022/11/30 16:15:25 [INFO] [*.infrabl.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/11/30 16:15:25 [INFO] [*.infra.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/11/30 16:15:25 [INFO] [*.infrabk.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/11/30 16:15:25 [INFO] [*.infrafe.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] acme: Could not find solver for: tls-alpn-01
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] acme: Could not find solver for: http-01
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] acme: use dns-01 solver
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] acme: Preparing to solve DNS-01
2022/11/30 16:15:25 [INFO] [cloud.syseleven.de] acme: Trying to solve DNS-01
2022/11/30 16:15:26 [INFO] [cloud.syseleven.de] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/11/30 16:15:36 [INFO] Wait for propagation [timeout: 10m0s, interval: 10s]
2022/11/30 16:15:36 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 16:15:46 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 16:15:56 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 16:16:09 [INFO] [cloud.syseleven.de] acme: Cleaning DNS-01 challenge
2022/11/30 16:16:10 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365647
2022/11/30 16:16:10 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182218481187
2022/11/30 16:16:10 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570677
2022/11/30 16:16:10 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365677
2022/11/30 16:16:10 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570687
2022/11/30 16:16:11 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182316532597
2022/11/30 16:16:11 error: one or more domains had a problem:
[cloud.syseleven.de] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up TXT for _acme-challenge.cloud.syseleven.de - the domain's nameservers may be malfunctioning", "stderr_lines":
    ["2022/11/30 16:07:47 

(and the same output basically again)

My web server is (include version): N/A, since we do DNS challenges

The operating system my web server runs on is (include version):

My hosting provider, if applicable, is: we are the hosting provider

I can login to a root shell on my machine (yes or no, or I don't know): yes, in theory we can ssh into the CI job but it's a pain

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): We don't use certbot but LEGO and we used v4.9.1.

Here is output from today, which worked:

TASK [lego : Renew certificate for global-wildcard] ****************************
task path: /builds/openstack/cert-automation/ansible/roles/lego/tasks/renew-certificates.yml:76
changed: [localhost] => {"changed": true, "cmd": ["lego", "--accept-tos", "--dns", "designate", "--path", "/tmp/lego", "--dns.resolvers", "8.8.8.8", "--dns.resolvers", "1.1.1.1", "--server=https://acme-v02.api.letsencrypt.org/directory", "--email", "noreply@syseleven.de", "--key-type", "rsa4096", "-d", "*.cloud.syseleven.net", "-d", "*.infra.sys11cloud.net", "-d", "*.infrabk.sys11cloud.net", "-d", "*.infrabl.sys11cloud.net", "-d", "*.infrafe.sys11cloud.net", "-d", "cloud.syseleven.de", "renew", "--preferred-chain", "ISRG Root X1"], "delta": "0:02:37.323749", "end": "2022-12-01 04:09:11.576960", "rc": 0, "start": "2022-12-01 04:06:34.253211", "stderr": "2022/12/01 04:06:35 [INFO] [*.cloud.syseleven.net] acme: Trying renewal with 719 hours remaining
2022/12/01 04:06:35 [INFO] renewal: random delay of 1m45.643985623s
2022/12/01 04:08:21 [INFO] [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de] acme: Obtaining bundled SAN certificate
2022/12/01 04:08:22 [INFO] [*.cloud.syseleven.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365647
2022/12/01 04:08:22 [INFO] [*.infra.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182218481187
2022/12/01 04:08:22 [INFO] [*.infrabk.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570677
2022/12/01 04:08:22 [INFO] [*.infrabl.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182124365677
2022/12/01 04:08:22 [INFO] [*.infrafe.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182182570687
2022/12/01 04:08:22 [INFO] [cloud.syseleven.de] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/182511576557
2022/12/01 04:08:22 [INFO] [*.cloud.syseleven.net] acme: authorization already valid; skipping challenge
2022/12/01 04:08:22 [INFO] [*.infrabl.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/12/01 04:08:22 [INFO] [*.infra.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/12/01 04:08:22 [INFO] [*.infrabk.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/12/01 04:08:22 [INFO] [*.infrafe.sys11cloud.net] acme: authorization already valid; skipping challenge
2022/12/01 04:08:22 [INFO] [cloud.syseleven.de] acme: Could not find solver for: tls-alpn-01
2022/12/01 04:08:22 [INFO] [cloud.syseleven.de] acme: Could not find solver for: http-01
2022/12/01 04:08:22 [INFO] [cloud.syseleven.de] acme: use dns-01 solver
2022/12/01 04:08:22 [INFO] [cloud.syseleven.de] acme: Preparing to solve DNS-01
2022/12/01 04:08:23 [INFO] [cloud.syseleven.de] acme: Trying to solve DNS-01
2022/12/01 04:08:23 [INFO] [cloud.syseleven.de] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/01 04:08:33 [INFO] Wait for propagation [timeout: 10m0s, interval: 10s]
2022/12/01 04:08:33 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/12/01 04:08:43 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/12/01 04:08:53 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/12/01 04:09:09 [INFO] [cloud.syseleven.de] The server validated our request
2022/12/01 04:09:09 [INFO] [cloud.syseleven.de] acme: Cleaning DNS-01 challenge
2022/12/01 04:09:09 [INFO] [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de] acme: Validations succeeded; requesting certificates
2022/12/01 04:09:11 [INFO] [*.cloud.syseleven.net] Server responded with a certificate for the preferred certificate chains \"ISRG Root X1\".", "stderr_lines": ["2022/12/01 04:06:35

There are a couple recent issues where the Let's Encrypt API might have given you errors, but it's highly unlikely that would manifest itself as the API telling you your DNS was malfunctioning. And it looks like your failure run was after the most recent incident was resolved.

The error means what it says: DNS resolution of that record returned a SERVFAIL at that time, meaning that your DNS was responding incorrectly to something. If it's working now, though, it's going to be hard to diagnose much further than that.

Yeah, I was kind of expecting there wouldn't be much to find out about it. I guess we'll just have to observe what happens next time.

A potential pitfall here is that it isn't necessarily a SERVFAIL from the upstream DNS servers. Let's Encrypt's local resolving infrastructure can also generate this error message in certain cases, including but not limited to:

  • Certain DNS lookup timeouts/lost packets/high latency
  • I also recall an issue where the global processing timeout could result in this specific error message, if the DNS lookup was still pending at the time of the timeout.
  • Other invalid replies (failing DNSSEC, violation of standards) can also result in a response getting "converted" into a SERVFAIL.

So it could have been related to load issues at Let's Encrypt. If response times were already high, an increase due to load might have pushed them over the limit. It is, however, perfectly possible that it was an issue at the servers named in the error message - we don't know.

I've reviewed our metrics at the 16:15 timestamps you gave, and there was a somewhat unusual increase in DNS latency at the 99th percentile for a few minutes around then. That can be indicative of some sort of network disruption. The 95th percentile was unchanged, so it wasn't widespread.

It's possible that some queries timed out, which I think can sometimes lead to a SERVFAIL response -- we hope to get better DNS diagnostics in error messages in the future to distinguish these cases, with better support for RFC 8914 extended DNS errors.

Since I'm trying to approach this matter from various angles, I'm also looking at the "lego" client we are using. It seems that, given the way it pre-checks if the challenge's TXT record is already available, it may cause negative cache poisoning. (It is asking a recursive name server for it, and if this happens "too early", the negative response may be cached).

However, this would not explain SERVFAIL errors.
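
To gauge how long such a cached negative answer could linger: per RFC 2308, a resolver may cache a negative answer for at most the smaller of the SOA record's own TTL and the SOA MINIMUM field (resolvers may also apply their own caps). The following is only a rough sketch. The function names are hypothetical, the SOA values in the usage note are made up rather than taken from any zone in this thread, and it assumes every authoritative server has the record from `record_created_at` onwards.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Upper bound, in seconds, for caching a negative answer (RFC 2308)."""
    return min(soa_ttl, soa_minimum)


def seconds_until_record_visible(queried_at: float, record_created_at: float,
                                 soa_ttl: int, soa_minimum: int) -> float:
    """Worst-case wait, counted from record creation, until a recursive
    resolver that was asked too early stops serving its cached negative answer.

    Both timestamps are seconds on the same clock. Returns 0 when the query
    happened after the record existed or the cache entry expired before then.
    """
    if queried_at >= record_created_at:
        return 0.0  # assumption: no negative answer was cached in this case
    expires_at = queried_at + negative_cache_ttl(soa_ttl, soa_minimum)
    return max(0.0, expires_at - record_created_at)
```

For example, with a hypothetical SOA TTL of 3600 and MINIMUM of 300, a query sent 5 seconds before the record was created could leave that resolver answering negatively for up to 295 seconds afterwards.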

We have another certificate to renew ( *.service.overlay.sys11cloud.net) and so far, it failed several times again with a SERVFAIL.

All in all, we haven't found the root cause of our issue yet.

I see from the log that the DNS precheck is done via Google's (8.8.8.8) and Cloudflare's (1.1.1.1) caching recursive DNS servers. That is not ideal. The precheck may come back positive because the caching server happened to query an authoritative server where the record is already up to date, while Let's Encrypt's DNS resolver may end up querying a different authoritative server. The appropriate way to do the precheck is to query each authoritative server directly for the challenge DNS record.

Yes I also thought first that that is how lego checks, but it is more extensive.

It first checks at the given name server for the TXT record. If this results in a CNAME, it uses the target name instead for the rest.

Then it finds the authoritative name servers, and queries all of them. At least that's what the comments say...
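
The check described above (treat the record as propagated only once every authoritative server returns the expected value) can be sketched as follows. This is not lego's code. `lookup_txt` is a hypothetical callable supplied by the caller, for example a wrapper around `dig +short TXT <name> @<server>`, and all names in the sketch are illustrative.

```python
from typing import Callable, Iterable


def servers_missing_record(
    name: str,
    expected: str,
    nameservers: Iterable[str],
    lookup_txt: Callable[[str, str], Iterable[str]],
) -> list[str]:
    """Return the authoritative servers that do not (yet) serve `expected`.

    `lookup_txt(name, server)` must return the TXT strings that `server`
    answers for `name`. A lookup that raises is counted as "missing".
    """
    missing = []
    for server in nameservers:
        try:
            values = set(lookup_txt(name, server))
        except Exception:
            values = set()  # timeout or error: treat as not propagated
        if expected not in values:
            missing.append(server)
    return missing
```

An empty result means all listed servers agree, which is the point at which it is reasonable to tell the ACME server to validate. A non-empty result names the servers that are lagging or unreachable.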

In the meantime I've made a few more attempts. One used a much longer minimum wait to allow for DNS propagation: more than 5 minutes. That was probably too long. It resulted in

"2022/12/07 13:51:32 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/chall-v3/185020353697/IPdtBQ :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: \"5CA2kavYhd1YWvGGL2eAfq3rpqD4t-UjpyxauWnUX6OERG4\""

The next attempt with halved waiting time got me back to

"2022/12/07 14:18:08 [INFO] [*.service.overlay.sys11cloud.net] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/185028823067"
"2022/12/07 14:20:46 error: one or more domains had a problem:", "[*.service.overlay.sys11cloud.net] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.service.overlay.sys11cloud.net - the domain's nameservers may be malfunctioning"

A difference is "During secondary validation". I don't know if this represents progress.
If only the message told me which server was failing, then we could at least see if it's ours or not...

Yes, that means the primary validation passed.

Only authoritative DNS servers are checked.

There are two possibilities for that.

  1. The ACME client is using both IPv4 and IPv6. The two protocols may end up in two different Let's Encrypt data centers. The nonces are not shared among data centers, so you may encounter that issue.
  2. Otherwise it is an ACME client issue. (In my ACME client, I unconditionally fetch a new nonce after calling external procedures. It is not predictable how long the external procedure will take to run, so the current nonce may simply expire.)

But all the above may not be important, because there is a retry:

This was true in the past but isn’t any longer. Nonces are shared cross-dc.
However they aren’t persistent: if a nonce server restarts then outstanding nonces are no longer valid. So clients should handle nonce failures by retrying with a new nonce.
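
A minimal sketch of that retry behaviour, not taken from any particular ACME client: `post` and `fetch_nonce` are hypothetical callables standing in for the signed POST and for a request to the server's new-nonce endpoint, and `BadNonce` is a made-up exception type for the `urn:ietf:params:acme:error:badNonce` error.

```python
from typing import Callable


class BadNonce(Exception):
    """Raised by `post` when the server rejects the anti-replay nonce."""


def post_with_nonce_retry(
    post: Callable[[str], dict],
    fetch_nonce: Callable[[], str],
    max_attempts: int = 3,
) -> dict:
    """Call `post(nonce)`, getting a fresh nonce after every badNonce error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        nonce = fetch_nonce()
        try:
            return post(nonce)
        except BadNonce as err:
            last_error = err  # nonce rejected: loop and try with a new one
    raise last_error
```

RFC 8555 also has the server return a fresh `Replay-Nonce` header with a badNonce error, so a client can reuse that value instead of making a separate request. The sketch fetches a new one for simplicity.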

We had another failing daily attempt, so I set up some extra logging and packet tracing for a further attempt, which also failed:

2022/12/08 09:29:37 error: one or more domains had a problem: [*.service.overlay.sys11cloud.net] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up TXT for _acme-challenge.service.overlay.sys11cloud.net - the domain's nameservers may be malfunctioning",

I found that all 4 authoritative name servers got requests for the TXT record, and they responded.
2 out of 4 received a query for a CAA record, and they responded.

I don't see any sign of server failure here. All 4 servers responded to the TXT query. If all 4 servers should have received a query for the CAA record, well, UDP is a lossy protocol, and there should be retries. The more data I collect, the less I think that the problem is on our side.

They are numbered 0, 1, 2 and 4; 3 is being phased out.

Dec 08 09:29:31 designate-publicdns-0 pdns_server[10548]: Remote 195.192.159.88 wants '_acme-challenge.service.overlay.sys11cloud.net|TXT', do = 0, bufsize = 1680: packetcache MISS
Dec 08 09:29:31 designate-publicdns-1 pdns_server[7285]: Remote 195.192.159.88 wants '_acme-challenge.service.overlay.sys11cloud.net|TXT', do = 0, bufsize = 1680: packetcache MISS
Dec 08 09:29:31 designate-publicdns-2 pdns_server[7095]: Remote 195.192.159.88 wants '_acme-challenge.service.overlay.sys11cloud.net|TXT', do = 0, bufsize = 1680: packetcache MISS
Dec 08 09:29:31 designate-publicdns-4 pdns_server[7137]: Remote 195.192.159.88 wants '_acme-challenge.service.overlay.sys11cloud.net|TXT', do = 0, bufsize = 1680: packetcache MISS
Dec 08 09:29:31 designate-publicdns-4 pdns_server[7137]: Remote 3.135.62.88 wants '_aCME-ChALLENgE.SerVicE.oVeRLaY.SyS11cLOuD.nEt|TXT', do = 1, bufsize = 512: packetcache MISS


Dec 08 09:29:31 designate-publicdns-0 pdns_server[10548]: Remote 18.222.40.126 wants 'OvERLAy.SYS11ClouD.NET|CAA', do = 1, bufsize = 512: packetcache MISS
Dec 08 09:29:31 designate-publicdns-0 pdns_server[10548]: Remote 18.118.215.15 wants 'SErvIce.overLAY.SyS11CLOUD.net|CAA', do = 1, bufsize = 512: packetcache MISS
Dec 08 09:29:31 designate-publicdns-2 pdns_server[7095]: Remote 23.178.112.109 wants 'serVIce.OVerLAY.sYs11clOud.nEt|CAA', do = 1, bufsize = 512: packetcache MISS

You should be seeing requests from at least 3 different IPs, and the requests should all be in random case. If they're not in random case, then Let's Encrypt's Unbound server has already tried a few times and not received a response and is trying to fall back to not using case randomization.
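
To count how many distinct sources sent case-randomized queries, log lines like the pdns ones above can be filtered with a short script. This is a rough sketch: the regular expression assumes the `Remote <ip> wants '<name>|<type>'` layout seen in those lines, and the function names are hypothetical.

```python
import re

# Matches e.g.: Remote 3.135.62.88 wants '_aCME-ChALLENgE.example.nEt|TXT'
QUERY_RE = re.compile(r"Remote (\S+) wants '([^|']+)\|(\w+)'")


def is_case_randomized(qname: str) -> bool:
    """True if the query name mixes upper and lower case letters."""
    return any(c.isupper() for c in qname) and any(c.islower() for c in qname)


def randomized_sources(log_lines, qtype="TXT"):
    """Return the set of remote IPs that sent a mixed-case query of `qtype`."""
    sources = set()
    for line in log_lines:
        match = QUERY_RE.search(line)
        if not match:
            continue  # not a query log line
        remote, qname, line_qtype = match.groups()
        if line_qtype == qtype and is_case_randomized(qname):
            sources.add(remote)
    return sources
```

Applied to the TXT lines in the log excerpt, only 3.135.62.88 qualifies, since the queries from 195.192.159.88 are all lowercase. The sketch cannot tell whether an all-lowercase query is a fallback or comes from a different client altogether.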

Your problem reminds me of this thread from a year ago, where there was some sort of packet loss between Let's Encrypt's servers and the authoritative DNS servers. Unbound retried, but enough packets were lost to sometimes cause a SERVFAIL.

Thanks, the point about the random case is interesting. I had noticed that it happens a lot. And if there should be multiple source addresses, then it does seem that some queries don't arrive.

I will look into the UDP rate limiter that's in front of the cloud, but given the number of packets that reach the name servers, I don't think it would kick in at the current traffic levels.

The link to unboundtest https://unboundtest.com/m/TXT/_acme-challenge.service.overlay.sys11cloud.net/SYJX24S2 is interesting (and it shows no problems; I tried several times). Likewise, Let's Debug shows no issues. (Of course it currently gets an incorrect value for the TXT record, but I don't see a way to tell it which string to expect.) Yet I reran the renewal after Let's Debug showed it was OK, and it failed. The above link is from just after that, and it is successful. This is very confusing.

Walking root to FQDN: [FAILS]
[abbreviated]
nslookup -q=ns net. 8.8.8.8
net     nameserver = a.gtld-servers.net

nslookup -q=ns sys11cloud.net a.gtld-servers.net
(root)  nameserver = a.root-servers.net

nslookup -q=ns sys11cloud.net a.root-servers.net
in-addr.arpa    nameserver = a.in-addr-servers.arpa

nslookup -q=ns sys11cloud.net a.in-addr-servers.arpa
[Address:  199.180.182.53]
*** UnKnown can't find sys11cloud.net: Query refused

There exist root servers that don't know where sys11cloud.net can be resolved.

Are you sure you're not getting confused by nslookup trying to reverse-DNS the nameserver IP using that nameserver? I wouldn't expect anything to use in-addr-servers.arpa for a normal forward-DNS request.

Here is the full response:

nslookup -q=ns sys11cloud.net a.root-servers.net
in-addr.arpa    nameserver = e.in-addr-servers.arpa
in-addr.arpa    nameserver = f.in-addr-servers.arpa
in-addr.arpa    nameserver = d.in-addr-servers.arpa
in-addr.arpa    nameserver = c.in-addr-servers.arpa
in-addr.arpa    nameserver = b.in-addr-servers.arpa
in-addr.arpa    nameserver = a.in-addr-servers.arpa
e.in-addr-servers.arpa  internet address = 203.119.86.101
e.in-addr-servers.arpa  AAAA IPv6 address = 2001:dd8:6::101
f.in-addr-servers.arpa  internet address = 193.0.9.1
f.in-addr-servers.arpa  AAAA IPv6 address = 2001:67c:e0::1
d.in-addr-servers.arpa  internet address = 200.10.60.53
d.in-addr-servers.arpa  AAAA IPv6 address = 2001:13c7:7010::53
c.in-addr-servers.arpa  internet address = 196.216.169.10
c.in-addr-servers.arpa  AAAA IPv6 address = 2001:43f8:110::10
b.in-addr-servers.arpa  internet address = 199.253.183.183
b.in-addr-servers.arpa  AAAA IPv6 address = 2001:500:87::87
a.in-addr-servers.arpa  internet address = 199.180.182.53
a.in-addr-servers.arpa  AAAA IPv6 address = 2620:37:e000::53

Where:

Name:    a.root-servers.net
Address: 198.41.0.4

Just for informational purposes, from here (.ARPA Domain):

in-addr-servers.arpa: For hosting authoritative name servers for the in-addr.arpa domain (RFC 5855)

The domain syseleven.net is part of the domain resolution chain of sys11cloud.net. Way too many name servers are involved in the name resolution, and that increases the chance of failure.

$ dig +short NS syseleven.net. @dns5.syseleven.de.
dns5.syseleven.net.
dns1.syseleven.net.
dns2.syseleven.de.
dns4.syseleven.de.
dns4.syseleven.net.
dns3.syseleven.net.
dns3.syseleven.de.
dns1.syseleven.de.
dns2.syseleven.net.
dns5.syseleven.de.

Going further, it is nonsense to define multiple name server names that resolve to the same IP addresses:

$ for d in $(dig +short NS syseleven.net. @dns5.syseleven.de.); do host $d;done | sort -k 4n
dns1.syseleven.de has address 37.49.156.94
dns1.syseleven.net has address 37.49.156.94
dns2.syseleven.de has address 151.252.44.17
dns2.syseleven.net has address 151.252.44.17
dns3.syseleven.de has address 195.192.142.74
dns3.syseleven.net has address 195.192.142.74
dns4.syseleven.de has address 195.192.142.233
dns4.syseleven.net has address 195.192.142.233
dns5.syseleven.de has address 195.192.146.36
dns5.syseleven.net has address 195.192.146.36

Please drastically simplify your DNS configuration.
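
The duplicate check above can also be scripted. The sketch below groups name server names by address from the `host` output format shown (`<name> has address <ip>`). The function names are hypothetical, and only IPv4 lines of that exact form are handled.

```python
from collections import defaultdict


def names_by_address(host_output: str) -> dict[str, list[str]]:
    """Group host names by IPv4 address from `host` output lines
    of the form '<name> has address <ip>'."""
    groups = defaultdict(list)
    for line in host_output.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[1:3] == ["has", "address"]:
            groups[parts[3]].append(parts[0])
    return dict(groups)


def duplicated_addresses(host_output: str) -> dict[str, list[str]]:
    """Keep only the addresses that more than one name resolves to."""
    return {ip: names
            for ip, names in names_by_address(host_output).items()
            if len(names) > 1}
```

Fed the ten lines of output above, `duplicated_addresses` would report all five addresses, each with its `.de` and `.net` name.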

I think somebody is going to look at the DNS configuration. Note that the data on all those servers is static; the data only changes on the final authoritative servers.

Also I asked our NOC and they very much doubt that any of our DDoS rate limiting would kick in here.

In any case, I tried some of the same, and some different, certificates with the LE staging server. The first attempt failed, but the ones after that were successful, including one that involved several DNS-01 challenges.

bash-5.0# lego --accept-tos --dns designate --path /tmp/lego --dns.resolvers 8.8.8.8 --dns.resolvers 1.1.1.1 --server=https://acme-staging-v02.api.letsencrypt.org/directory --email noreply@syseleven.de --key-type rsa4096 -d "*.cloud.syseleven.net" renew --preferred-chain "(STAGING) Pretend Pear X1" --days 90
2022/12/09 15:23:43 [INFO] [*.cloud.syseleven.net] acme: Trying renewal with 884 hours remaining
2022/12/09 15:23:43 [INFO] [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de] acme: Obtaining bundled SAN certificate
2022/12/09 15:23:45 [INFO] [*.cloud.syseleven.net] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041434
2022/12/09 15:23:45 [INFO] [*.infra.sys11cloud.net] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041444
2022/12/09 15:23:45 [INFO] [*.infrabk.sys11cloud.net] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041454
2022/12/09 15:23:45 [INFO] [*.infrabl.sys11cloud.net] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041464
2022/12/09 15:23:45 [INFO] [*.infrafe.sys11cloud.net] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041474
2022/12/09 15:23:45 [INFO] [cloud.syseleven.de] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/4580041484
2022/12/09 15:23:45 [INFO] [*.cloud.syseleven.net] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [*.infrabl.sys11cloud.net] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [*.infra.sys11cloud.net] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [*.infrabk.sys11cloud.net] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [*.infrafe.sys11cloud.net] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [cloud.syseleven.de] acme: Could not find solver for: tls-alpn-01
2022/12/09 15:23:45 [INFO] [cloud.syseleven.de] acme: Could not find solver for: http-01
2022/12/09 15:23:45 [INFO] [cloud.syseleven.de] acme: use dns-01 solver
2022/12/09 15:23:45 [INFO] [*.cloud.syseleven.net] acme: Preparing to solve DNS-01
2022/12/09 15:23:46 [INFO] [*.infrabl.sys11cloud.net] acme: Preparing to solve DNS-01
2022/12/09 15:23:46 [INFO] [*.infra.sys11cloud.net] acme: Preparing to solve DNS-01
2022/12/09 15:23:47 [INFO] [*.infrabk.sys11cloud.net] acme: Preparing to solve DNS-01
2022/12/09 15:23:48 [INFO] [*.infrafe.sys11cloud.net] acme: Preparing to solve DNS-01
2022/12/09 15:23:49 [INFO] [cloud.syseleven.de] acme: Preparing to solve DNS-01
2022/12/09 15:23:49 [INFO] [*.cloud.syseleven.net] acme: Trying to solve DNS-01
2022/12/09 15:23:50 [INFO] [*.cloud.syseleven.net] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:24:00 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:24:06 [INFO] [*.cloud.syseleven.net] The server validated our request
2022/12/09 15:24:06 [INFO] [*.infrabl.sys11cloud.net] acme: Trying to solve DNS-01
2022/12/09 15:24:06 [INFO] [*.infrabl.sys11cloud.net] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:24:16 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:24:22 [INFO] [*.infrabl.sys11cloud.net] The server validated our request
2022/12/09 15:24:22 [INFO] [*.infra.sys11cloud.net] acme: Trying to solve DNS-01
2022/12/09 15:24:22 [INFO] [*.infra.sys11cloud.net] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:24:32 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:24:39 [INFO] [*.infra.sys11cloud.net] The server validated our request
2022/12/09 15:24:39 [INFO] [*.infrabk.sys11cloud.net] acme: Trying to solve DNS-01
2022/12/09 15:24:40 [INFO] [*.infrabk.sys11cloud.net] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:24:50 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:24:57 [INFO] [*.infrabk.sys11cloud.net] The server validated our request
2022/12/09 15:24:57 [INFO] [*.infrafe.sys11cloud.net] acme: Trying to solve DNS-01
2022/12/09 15:24:57 [INFO] [*.infrafe.sys11cloud.net] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:25:07 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:25:13 [INFO] [*.infrafe.sys11cloud.net] The server validated our request
2022/12/09 15:25:13 [INFO] [cloud.syseleven.de] acme: Trying to solve DNS-01
2022/12/09 15:25:13 [INFO] [cloud.syseleven.de] acme: Checking DNS record propagation using [8.8.8.8:53 1.1.1.1:53]
2022/12/09 15:25:23 [INFO] Wait for propagation [timeout: 15m0s, interval: 10s]
2022/12/09 15:25:30 [INFO] [cloud.syseleven.de] The server validated our request
2022/12/09 15:25:30 [INFO] [*.cloud.syseleven.net] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [*.infrabl.sys11cloud.net] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [*.infra.sys11cloud.net] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [*.infrabk.sys11cloud.net] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [*.infrafe.sys11cloud.net] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [cloud.syseleven.de] acme: Cleaning DNS-01 challenge
2022/12/09 15:25:30 [INFO] [*.cloud.syseleven.net, *.infra.sys11cloud.net, *.infrabk.sys11cloud.net, *.infrabl.sys11cloud.net, *.infrafe.sys11cloud.net, cloud.syseleven.de] acme: Validations succeeded; requesting certificates
2022/12/09 15:25:33 [INFO] [*.cloud.syseleven.net] Server responded with a certificate for the preferred certificate chains "(STAGING) Pretend Pear X1".

I noticed that the links to the AuthURLs show a weird error (urn:ietf:params:acme:error:malformed Method not allowed), which is likely incorrect given that the process worked.

This test run uses exactly the same name servers as the real run, so if this can pass, so can the real certificate. So I tried the real one again, twice, and both failed. But I think this hints again that our side works...

Addendum: another try, and it mysteriously worked. We have more to renew in 8 days, so I will certainly keep an eye on it...
