According to the debug output, it is unable to resolve an A record. However, I have checked our DNS server configuration, and each of our DNS servers appears to be returning the correct IP address.
In fact, the debug output lists the DNS records for the domain in the block above where it shows the error, suggesting it must have done the lookup. I thought maybe it was a DNSSEC issue, but according to DNSViz the record should validate.
Let’s Encrypt uses the Amazon cloud (AWS) for secondary validation. If you mass-block AWS IPs in your firewall for some reason (there’s a long list of good reasons), then unblock all AWS IPs to test whether this is your problem.
We had previously tried without the IPv6 records, as I was trying to eliminate possibilities, and that didn’t seem to make a difference, so I added the IPv6 records back in. I’ve also spun up a test system which only has an IPv6 record, and I get a DNS error there as well.
No, we don’t block AWS. In fact, two of our DNS servers are hosted on AWS EC2 instances so that we have DNS off site in the event our network provider goes down.
In this case, primary validation – which is not hosted in AWS – is what’s failing. Secondary validation might be failing too; there’s no way to know from the error message.
I agree that it doesn’t look like anything is wrong, but your DNS setup is a lot. 10 DNSSEC keys, almost all of them active KSKs or ZSKs, non-minimal responses, and 1 minute TTLs… That’s a lot of data. Mostly over TCP.
I don’t suppose Let’s Encrypt’s IPs could have been blocked?
I don’t think it would be blocked. It was my understanding that Let’s Encrypt’s infrastructure is hosted in multiple locations and that they don’t publish a list of IPs so that they can remain flexible. I seem to remember looking for a list of IPs a while back so I could set up firewall rules to whitelist them, but instead I set up my system to limit access to the ACME directory in the web server software. If you know where there is an IP list, I can check the firewall logs and configuration for those addresses to be sure.
The weakest link in the HTTP authentication process is knowing the specific source IPs used to validate the authentication requests.
They can’t publish that list.
They should NEVER publish such a list.
Furthermore, they should validate from multiple points and change their IPs frequently.
True - and that's the reason they have no business validating from any bad neighbourhoods such as AWS. They should be validating from 1000% clean and obscure server locations, not prone to potential IP range blacklisting in user firewalls.
I understand your logic.
But blocking “bad” IPs from HTTP access gives a false sense of security.
By that I mean, all HTTP traffic should be sent to a quarantined area (not on any sensitive production servers).
There you can do very basic things.
Like…
[no actual programming language used below - just an English text example]
when servername = {your external IP}
return 200 "you have no business here - shoo fly don't bother me!"
when servername = * #everything else
##deal with any cert auth requests here - choose a path that makes sense to you##
#1 redirect them to another HTTP(S) site
#2 proxy them inward to a common "share" (accessible to all servers that need certs)
#3 proxy them inward to the actual server that made the request
#4 handle it locally (on this system)
#then redirect ALL other requests to HTTPS
redirect / permanent https://{servername} + {URL path}
Otherwise, you are allowing some HTTP connections from IPs you presume are good (enough), simply because they are not (yet) included in some bad list.
Trust no IP (until you don’t have to).
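The routing outlined in the plain-English example above could be sketched in Python roughly as follows. This is only an illustration of option #4 (handling the challenge locally), not anyone's actual setup: EXTERNAL_IP, CHALLENGES and route_http_request are hypothetical names, the address is a documentation placeholder, and the Host header is assumed to carry no port.

```python
from typing import Dict, Tuple

# Hypothetical values for illustration only; none of these come from the thread.
EXTERNAL_IP = "203.0.113.10"                  # placeholder documentation address
ACME_PREFIX = "/.well-known/acme-challenge/"  # HTTP-01 challenge path
CHALLENGES: Dict[str, str] = {}               # token -> key authorization


def route_http_request(host: str, path: str) -> Tuple[int, str]:
    """Return (status, body or redirect target) for a plain-HTTP request."""
    # Requests addressed to the bare external IP get a fixed reply.
    if host == EXTERNAL_IP:
        return 200, "you have no business here"

    # Certificate validation requests are answered locally (option #4).
    if path.startswith(ACME_PREFIX):
        token = path[len(ACME_PREFIX):]
        if token in CHALLENGES:
            return 200, CHALLENGES[token]
        return 404, "unknown challenge token"

    # Everything else is permanently redirected to HTTPS.
    return 301, f"https://{host}{path}"
```

Note that nothing in this decision depends on the source IP of the request, which is the point being made: only the requested host and path matter.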
Well, I guess I can’t be certain we don’t block any of the hundreds of thousands of IPs, but I’m pretty sure we didn’t add 1250 ranges to our firewall configuration. I’ve also been thinking that 4 of the 6 DNS servers actually reside outside of our firewall: two hosted in AWS and two hosted by our network provider. Would a failure at any one of the DNS servers be enough to prevent resolution?
Maybe. The DNS resolver is normally resilient. Let's Encrypt turns on its random capitalization feature. That's irrelevant, but if it thinks there's a problem, it has a fallback mode that can fail if some of the authoritative nameservers have subtle differences in behavior or are down.
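For anyone unfamiliar with the random capitalization mentioned here, the general technique (often called 0x20 encoding) looks roughly like the sketch below: the resolver randomizes the case of the letters in the query name and expects the authoritative server to echo the name back byte for byte. The function names are hypothetical, and this is not Let's Encrypt's resolver code.

```python
import random


def randomize_case(name: str, rng: random.Random) -> str:
    """Randomly upper- or lower-case each character of a DNS name."""
    # Digits, dots and hyphens are unchanged by upper()/lower().
    return "".join(rng.choice((c.lower(), c.upper())) for c in name)


def case_matches(sent_name: str, echoed_name: str) -> bool:
    """True only if the response echoed the query name with identical case."""
    return sent_name == echoed_name


if __name__ == "__main__":
    rng = random.Random()
    sent = randomize_case("example.com", rng)
    # A server that lower-cases the question section would usually fail this check.
    print(sent, case_matches(sent, sent.lower()))
```

An authoritative server that normalizes the case of the name it echoes back would fail the comparison, which is the kind of subtle difference in behavior that can push a resolver into a fallback path.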
Looks like you were on the right track. I set the DNS servers to do minimal responses and now it looks like we may be back in business. So my guess is the size of the DNS responses was causing a problem. If a response is larger than the UDP limit shouldn’t it use EDNS or fall back to TCP?
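For reference, the mechanism the question refers to works like this in general: with EDNS the client advertises a UDP payload size in an OPT record, and if the answer does not fit, the server sets the TC (truncated) bit so the client can retry over TCP. Below is a minimal sketch of both halves using only the Python standard library. The 1232-byte size, the query ID and the function names are assumptions for illustration, and a DNSSEC-validating resolver would also set the DO bit, which this sketch omits.

```python
import struct

TC_FLAG = 0x0200  # truncation bit in the DNS header flags field


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = 1, udp_size: int = 1232,
                query_id: int = 0x1234) -> bytes:
    """Build a DNS query (default type A) carrying an EDNS0 OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, TYPE=41, CLASS carries the UDP payload
    # size, TTL=0 (extended rcode, version, flags), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set, meaning a retry over TCP is expected."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_FLAG)
```

Whether that retry succeeds depends on TCP port 53 being reachable on every authoritative server and on the servers setting the TC bit rather than dropping or fragmenting oversized UDP answers.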