DNS Manual Validation Failure, Going on 4.75 Hours So Far

Very interesting and well written article. Thanks for posting.

I do use geo-blocking for SH, ZA, CN, IR, MY, KP, SA, SY, YE, & RU. But I don't think LE has servers there.

Additionally, I subscribe to a number of IP blacklist feeds, and have 15,000 or so IPs that my servers have logged suspect traffic from.

I am aware that LE does not publish a list of IPs/FQDNs that I can whitelist.

1 Like

That's about five thousand people in the middle of the Atlantic (and they don't have the best connectivity; the first submarine cable only got there in 2023).

They have access to the Internet?

Blocking IR, SA, ZA sounds unusual. You see threats from there?

Neither the number of people theoretically represented by a country code nor the relative recency of a physical connection stops that country code from being used. It was, and still is, very advantageous for a country to license out its country code. The practice has been widespread since the late 90s, especially before ICANN expanded the number of TLDs available.

Yes. Based on spam results and attack attempts (not blocked by subscription blocklists), it has been very advantageous to block all of the country codes I listed (via geo-blocking lists).

I'm thinking of adding BR, IN, and VM. Lots of junk coming to my servers from there recently.

Interesting, you block TLDs, not geoIP. But once you block CN, IN, BR you might want to switch to an allowlist model -- that's a majority of the world population.

Also: VPSs in EU & US are cheap for attackers too.

I do not block TLDs. I use a geo-blocklist. I select the countries to be used in the geo-blocklist by their country code.

In this instance, I'm not interested in the population of a country. I am interested in blocking spam and attacks that have FQDNs which end in a particular country code that can be geo-blocked.

I am unfortunately very aware of US/EU-located VPSs. That's where many of the 15,000 or so IPs on my list come from.

In case you haven't seen it yet, below is an excellent article about LE's multi-perspective validation.

It sounds like your problem is related more to your rate limit, but IP-based firewall rules on DNS could be contributors.

As of now, and as the above article explains, LE checks from 5 locations, of which 4 must succeed. The number and quorum can change at any time, so they should not be designed around.

But if one of those 5 locations is blocked by, say, an IP firewall, that leaves no room for error from the other 4.

The primary center must succeed first. Then the secondaries are dispatched at the same time. This is when you'll see the larger burst of queries to your system.
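
For anyone who wants to sanity-check this from the outside, here is a rough pre-flight sketch using dnspython. The resolver IPs, domain, and token value are placeholders, and a handful of public resolvers only approximates LE's actual vantage points:

# Rough pre-flight check: is the _acme-challenge TXT record visible through
# several independent public resolvers? This only approximates multi-perspective
# validation; LE uses its own resolvers and vantage points.
import dns.resolver  # pip install dnspython

PUBLIC_RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]    # Google, Cloudflare, Quad9
NAME = "_acme-challenge.example.com"                    # placeholder
EXPECTED = "TOKEN_VALUE_FROM_YOUR_ACME_CLIENT"          # placeholder

def visible_from(resolver_ip):
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [resolver_ip]
    r.lifetime = 5
    try:
        answer = r.resolve(NAME, "TXT")
    except Exception as exc:
        print(f"{resolver_ip}: lookup failed ({exc})")
        return False
    values = {b"".join(rr.strings).decode() for rr in answer}
    ok = EXPECTED in values
    print(f"{resolver_ip}: {'OK' if ok else 'token not found'} {sorted(values)}")
    return ok

results = [visible_from(ip) for ip in PUBLIC_RESOLVERS]
print(f"{sum(results)}/{len(results)} resolvers see the expected token")

If one of these lookups fails or returns a stale value, the validation burst described above is likely to hit the same problem.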

3 Likes

And you get attackers with IP addresses from North Korea or Saint Helena? That sounds like an issue with the IP geolocation.

1 Like

OK, so the standard operating procedure should not just include temporarily dropping all DNS connection limits, but also temporarily removing any IP restrictions on port 53 traffic in the firewall.

I do implement DNSSEC and CAA records, as the linked article suggests.
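
As a side note, the CAA part can be sanity-checked the same way. A minimal dnspython sketch, assuming example.com as a placeholder (a complete CAA check also walks up the DNS tree to the registered domain, which is omitted here for brevity):

# Minimal CAA sanity check: do the published CAA records permit Let's Encrypt?
import dns.resolver  # pip install dnspython

def caa_allows_letsencrypt(domain):
    try:
        answer = dns.resolver.resolve(domain, "CAA")
    except dns.resolver.NoAnswer:
        return True   # no CAA at this label; parent domains would still need checking
    issuers = [rr.value.decode() for rr in answer if rr.tag.decode() == "issue"]
    return any("letsencrypt.org" in issuer for issuer in issuers)

print(caa_allows_letsencrypt("example.com"))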

You should be able to see traffic from Let's Encrypt in your DNS logs and plan accordingly. Errors related to validation from secondary centers will say so in their error message. If you post the exact error message reported by the LE server, we can help interpret it.

If the rate limit adjustment doesn't resolve your issue, look at your IP-based firewall. Perhaps it is catching out one of the secondary centers. A fully successful challenge should see 5 requests to the _acme-challenge endpoint. As noted, a challenge will succeed today with 4, but seeing fewer than 5 means a secondary is getting blocked.
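
One rough way to count how many validation perspectives actually reached you is to tally the distinct client IPs asking for _acme-challenge around renewal time. A Python sketch, assuming a BIND-style querylog line format and a placeholder log path (adjust both for your setup):

# Tally distinct client IPs that queried for _acme-challenge TXT records in a
# BIND query log. The querylog line format varies by BIND version and logging
# channel, so the regex is an assumption; the log path is a placeholder.
import re
from collections import Counter

LOG_PATH = "/var/log/named/query.log"    # placeholder
QUERY = re.compile(r"client\s+(?:@\S+\s+)?([0-9a-fA-F.:]+)#\d+.*_acme-challenge\..*\bIN TXT\b")

clients = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = QUERY.search(line)
        if match:
            clients[match.group(1)] += 1

print(f"{len(clients)} distinct client IPs asked for _acme-challenge TXT records")
for ip, count in clients.most_common():
    print(f"  {ip}: {count} queries")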

If you can't open up your DNS servers sufficiently, you could delegate _acme-challenge to a different DNS system (for example, via a CNAME). Your ACME client needs to support placing the TXT records at the delegated location; not all do.
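
If you go the delegation route, a quick dnspython sketch to confirm the CNAME is in place and the token is visible at the delegated target (hostnames are placeholders):

# Check that _acme-challenge.example.com is delegated via CNAME and that the
# TXT token is visible at the delegation target. Hostnames are placeholders.
import dns.resolver  # pip install dnspython

NAME = "_acme-challenge.example.com"

try:
    cname = dns.resolver.resolve(NAME, "CNAME")
    target = next(iter(cname)).target.to_text()
    print(f"{NAME} is delegated to {target}")
except dns.resolver.NoAnswer:
    target = NAME
    print(f"{NAME} has no CNAME; TXT records are served from the zone itself")

for rr in dns.resolver.resolve(target, "TXT"):
    print("TXT:", b"".join(rr.strings).decode())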

5 Likes

From: https://postmaster.google.com/managedomains?pli=1

Verify your ownership of example.com

  1. Add the TXT record to the DNS configuration for example.com

  2. Click Verify

When Google finds the DNS record that you have added, we will verify your ownership of the domain. To stay verified, do not remove the DNS record, even after the verification succeeds. (DNS changes may take some time and if we do not find the record immediately, we will check for it periodically.)

(DNS changes may take some time and if we do not find the record immediately, we will check for it periodically.)

The Google verification was pretty quick, and I didn't have to modify my DNS server config (rate limiting) or my firewall rules.

I was surprised it verified so quickly; none of my third-party, external secondary DNS zones have updated yet.

I guess my setup isn't as screwed up as some of you have been telling me after all. I guess that maybe after managing DNS servers for 30+ years, I may accidentally do something right from time to time. Nah, that can't possibly be true. :slight_smile:

In case anyone is confused, I am posting this example because it demonstrates the request / suggestion that I made earlier (automatic retries), and because the Google DNS record verification here is very similar to the LE DNS record verification.

It worked without me having to temporarily modify my DNS server rate limiting and / or firewall IP restrictions.

I get it, team LE isn't going to implement something like this because their existing system works near perfectly (for everyone but me).

But it's kind of nice when something that should work, works like it should, and even if it doesn't work immediately, there's a way of automatically checking back later without any additional effort.

This didn't take 5 hours of active effort to accomplish. It took 5 minutes. That's what I expected from the LE verification process. That is why I was so frustrated in my initial post.

The validation done by postmaster.google.com is not quite the same as the validation done by Let's Encrypt, nor the validation done by Google's own certificate authority, GTS. The postmaster system is not subject to the same set of requirements that the CAs are: running its own recursive resolver, respecting CAA records, validating from 5 perspectives around the world, etc. The lack of multi-perspective validation could definitely make a difference with regards to rate limits.

Also, based on what you just posted, it seems like you just went through the postmaster flow for one domain, not fifty like you were attempting to validate with Let's Encrypt. I may be wrong, but that may also make a difference when it comes to rate limits.

At the end of the day, the ACME protocol is designed from the ground up to be automated. While I'm sorry that you're having difficulty getting your domain validation to work by hand, it's going to be hard to get any advice other than "automate it". If you can explain a little more about why you can't automate it, some of us here may be able to help devise technical solutions that achieve both automation and your security/privacy requirements. Many very rigorous, locked-down, security-conscious organizations have automated their certificate renewals and deployments already.

5 Likes

Thank you for your reply and your great work. I really appreciate it.

The issue unfortunately isn't with automation, but with failed validation due to connectivity.

As I mentioned several times above (you're forgiven for not reading the whole long thread), after I received each new token, I updated the zone file and reloaded the server. Then I verified via a third-party, remote server that the record was accessible and the token was correct. Only then did I renew.

:white_check_mark: Token received
:white_check_mark: Zone record created
:white_check_mark: Server reloaded
:white_check_mark: Token viewed from remote, third party server: connectivity and accuracy confirmed
:white_check_mark: Certificate renewal requested
:cross_mark: Token failed to be verified by LE due to connection issues

Automation does not solve the underlying problem.

Like I said earlier, I will temporarily remove all connection limitations, for both source IPs and connection frequency, the next time I renew. I have to believe that will solve the issue.

I'm just sorry that there is not currently an easier solution, like:

(DNS changes may take some time and if we do not find the record immediately, we will check for it periodically.)

Which would solve my issue because, after all, I was able to get every token validated eventually. So it wasn't a permanent issue. It just stinks that I had to repeat the process every time a token was not validated, when in a perfect world…

(…if we do not find the record immediately, we will check for it periodically.)

Sorry, I think my broken record is beating a dead horse.

Thank you again for your consideration and your great work. I am a LE fan.

Edit: I set up nsupdate on my DNS server and changed the authentication method for the ACME client to DNS_NSUPDATE.

Now it's your turn to add periodic checks if you don't find the record immediately. :slight_smile:
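
For anyone curious what the nsupdate-based flow does under the hood, here is a minimal RFC 2136 sketch in Python (dnspython). The key name, secret, zone, token, and server address are all placeholders, and the server must be configured to allow updates for that key:

# Minimal RFC 2136 (nsupdate-style) sketch: push the ACME TXT token into the
# zone with a TSIG-signed dynamic update.
import dns.query
import dns.tsig
import dns.tsigkeyring
import dns.update

keyring = dns.tsigkeyring.from_text({
    "acme-update-key.": "BASE64_TSIG_SECRET==",   # placeholder key name/secret
})

update = dns.update.Update(
    "example.com",                                # zone (placeholder)
    keyring=keyring,
    keyname="acme-update-key.",
    keyalgorithm=dns.tsig.HMAC_SHA256,
)
update.replace("_acme-challenge", 60, "TXT", "TOKEN_FROM_ACME_CLIENT")

response = dns.query.tcp(update, "203.0.113.53", timeout=10)   # primary's IP (placeholder)
print("update rcode:", response.rcode())

ACME clients that support RFC 2136 (including the DNS_NSUPDATE method mentioned above) do essentially this, and then remove the record again after validation.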

For what it's worth, I have read the whole thread, and it's actually this assertion that I disagree with. The issue, as you've described it, is one of reliability: LE can find and verify your DNS TXT records 95% of the time, but sometimes fails for reasons that are somewhat unclear but may be related to rate limits imposed by your authoritative nameservers.

Automation is the solution to reliability problems. It allows the failed validations to be immediately retried, completely transparently to you. It means that, even if it does take five tries -- heck, even if it takes five hours like it did for you -- your time, effort, and happiness are unaffected.

While I appreciate the effort you put in to confirm that the record is in place, this method can't guarantee that Let's Encrypt will see the record. The validation recursive resolvers pick one authoritative nameserver at random and check whether the record is served there. If your record has not propagated to every nameserver, validation may still fail.
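
If you want a pre-check that accounts for this, here is a rough dnspython sketch that asks every authoritative nameserver directly. The zone and token are placeholders, and it only does IPv4 (A) lookups for brevity:

# Pre-check that the token is served by *every* authoritative nameserver for
# the zone, not just whichever one a third-party resolver happened to query.
import dns.resolver  # pip install dnspython

ZONE = "example.com"
NAME = "_acme-challenge." + ZONE
EXPECTED = "TOKEN_FROM_ACME_CLIENT"

def authoritative_ips(zone):
    ips = []
    for ns in dns.resolver.resolve(zone, "NS"):
        for a in dns.resolver.resolve(ns.target.to_text(), "A"):
            ips.append(a.address)
    return ips

all_good = True
for ip in authoritative_ips(ZONE):
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    r.lifetime = 5
    try:
        values = {b"".join(rr.strings).decode() for rr in r.resolve(NAME, "TXT")}
        ok = EXPECTED in values
    except Exception as exc:
        ok, values = False, {f"error: {exc}"}
    print(f"{ip}: {'OK' if ok else 'MISSING'} {sorted(values)}")
    all_good = all_good and ok

print("safe to request validation" if all_good else "wait: not every nameserver has the token yet")

If any server comes back MISSING, waiting before triggering the challenge is cheaper than burning a validation attempt.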

And of course, this isn't taking into account the rate limits. Let's Encrypt will fire off 5 nearly-simultaneous requests for the TXT record, followed immediately by 5 CAA requests per subdomain. If you're requesting a cert for beta.blog.example.com, that's 15 requests to example.com's nameservers.

Automation can also help solve this, by automating the rate limit changes you've discussed making part of your SOP, and then automating putting the rate limits back.

Or automation can help by spreading your renewals out across the month -- if your automation is only renewing one certificate each day, rather than fifty at a time, rate limit changes may not be necessary at all.
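
One simple way to do that spreading, sketched in Python. The hashing scheme, 28-day window, and domain list are just an illustration, not how any particular ACME client schedules renewals:

# Spread renewals across the month: derive a stable renewal day from each
# domain name so only a certificate or two renews on any given day instead of
# fifty at once.
import datetime
import hashlib

def renewal_day(domain):
    digest = hashlib.sha256(domain.lower().encode()).hexdigest()
    return int(digest, 16) % 28 + 1    # days 1-28 exist in every month

def due_today(domain):
    return datetime.date.today().day == renewal_day(domain)

for domain in ["www.example.com", "blog.example.com", "shop.example.net"]:  # placeholders
    print(domain, "renews on day", renewal_day(domain), "- due today:", due_today(domain))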

So again, while I understand your frustration at something that feels outside your control, I truly believe that automation can and will make this problem entirely vanish.

8 Likes

I don't mean this in a mean-spirited way, but I hope you can see that it is a little ironic for you to tell me that automation is the solution.

Automation could absolutely help solve this if your servers automatically and periodically rechecked for the record with the token - the token which you've already issued, the token which is already published, waiting to be verified. Automating the repetition of the last step in the process is a great idea.

Instead, the token is instantly canceled, and the process starts again. My ACME client has to ask for new tokens again, your servers have to issue new tokens again, my servers need to publish the new tokens again, and your servers need to attempt to verify the tokens again. It doesn't matter if I automate that on my end; it is still more steps and more computational work for both parties.

If you drive to the grocery store, pull groceries from the shelf, and go to the checkout, and someone is already at the register: please don't tell me that it makes more sense to put all of your groceries back on the shelves, walk out the door of the market, and then turn around and start from the beginning - as opposed to waiting for the person in front to finish and then checking yourself out.

I understand if you don't have the money for it, or if you've got more pressing bug fixes and upgrades to implement. That makes sense. Just don't tell me that starting the process from the beginning and requiring extra engineering and resources on both of our ends is the easier and better way.

I'm just trying to make a suggestion which I think would be beneficial for all parties, because I care. You don't have to do what I suggest, I'm just a nobody with a humble suggestion.

Either way, I'll do what I have to do to make things work on my end, because it is what it is.

Again, thank you truly for your time and all of your great work. I'm very sorry to have taken up so much of your time.

Those 5 nearly-simultaneous requests come from different IP addresses, right? If so, per-IP rate limiting should only count 1 request for each IP address the queries came from.

If your global rate limit is 5 requests per second, IMHO that is a bad configuration, because it means a very small amount of traffic (~5 Kbit/s) is enough to DoS your DNS server.

// Options

match-clients { any; };
recursion no;
allow-recursion { none; };
notify yes;
rate-limit {
    slip 2;                      // Every other response truncated
    window 15;                   // Seconds to bucket
    responses-per-second 5;      // # of good responses per prefix-length/sec
    referrals-per-second 5;      // referral responses
    nodata-per-second 5;         // nodata responses
    nxdomains-per-second 5;      // nxdomain responses
    errors-per-second 5;         // error responses
    all-per-second 20;           // When we drop all
    log-only no;                 // Debugging mode
    qps-scale 250;               // x / query rate * per-second = new drop limit
    exempt-clients { 127.0.0.1; };
    ipv4-prefix-length 24;       // Define the IPv4 block size
    ipv6-prefix-length 56;       // Define the IPv6 block size
    max-table-size 20000;        // 40 bytes * this number = max memory
    min-table-size 500;          // pre-allocate to speed startup
};

@doom 666, Here's my rate limiting config.

The comment I made was directed at @aarongable, not you (@dLubxiWGGVB8vXyLzUM3). The point was that, in terms of per-IP/subnet rate limiting, the 5 requests are effectively one request each, if they are coming from completely different IP addresses.

I think that part of the problem with implementing retry logic on LE's end is zombie systems. Sure, there are systems that demonstrate some inconsistencies and hiccups (like yours), but there have also historically been a large number of systems that are systemically broken and simply fail forever. The more that LE attempts to provide leniency for failures in the process, the more burden is placed upon LE to accommodate those broken systems. One might argue that the lack of (or seemingly insufficient) automatic retries initiated by LE will just result in (potentially more expensive) additional attempts initiated by clients. To this I would counter-argue that externally-initiated attempts can be monitored and limited more efficiently than internal retries, which create an extra burden of state tracking and monitoring. Making a system more tolerant of failure also makes it more tolerant of abuse.

7 Likes

Griffin is spot on -- we've run the numbers in the past, and the vast, vast majority of validation failures are not transient. Adding two retries of each failed validation on our end would more than double our total validation bandwidth usage and cost. Adding two retries on your end has a negligible impact on your server's bandwidth (and has nearly the same impact as if we implemented the retries on our end), and no impact on the wider internet or our ability to offer our services for free.

I truly appreciate the thought and effort that you've put into this thread, and I am taking your suggestion of adding server-side retries in good faith. Unfortunately, our data shows that adding such retries would cause more harm than good. I'm sorry that you're caught in the middle here, but adding automation and retries on your end is the lowest-cost, most widely-supported path forward here.

9 Likes

My friend, you are a persistent and imaginative debater. Have you trained to be a lawyer?

If you have internal data that shows a significantly larger number of abusive or functionally broken clients versus the number of clients that are temporarily unreachable but eventually successful after some number of rounds of server-client tag, then the data speaks for itself.

I can see that @aarongable has just confirmed your assertion.

Well, I'm in no position to judge. I have only the data from my situation.

Again, thank you for your time and efforts. I appreciate it.

1 Like