Flooding of .well-known/acme-challenge GET requests

I can login to a root shell on my machine (yes or no, or I don't know):
Yes

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
certbot 2.5.0

A server became unresponsive. Checking the nginx logs, it became apparent it was being flooded with .well-known/acme-challenge requests.
The initial responses were 301, then 404, and finally the client abandoned the connection (error 499):

"GET /.well-known/acme-challenge/ukzVX1ep-U02oFuMvbMukR0SWswl9OJd-vtiduzI7AI HTTP/1.1" 499 0 "http://saltalafila.online/.well-known/acme-challenge/ukzVX1ep-U02oFuMvbMukR0SWswl9OJd-vtiduzI7AI" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)"

Passenger sits between nginx and the application(s) and is limited to 100 processes.

These were coming from at least 4 different IP addresses.

Now, admittedly, there may have been certificates issued for 3rd-level domains that were later moved to other servers, which would explain this overflow of requests; I cannot say for sure, as a lot of re-organisation was conducted at some point. Still, having the server hit 100 Passenger processes and then freeze is neither expected nor desirable.

a) I surmise the best course of action after a re-organisation is to clear and re-issue all certificates. How should this be done?
b) Or is there a way to remove individual 3rd-level domains that may be combined with others in the same certificate?
c) Are there ways to mitigate this high-intensity querying?

Maybe try making the request cheap: don't route /.well-known/acme-challenge to Passenger.

Something like this will be cheap, and nginx should be able to handle many thousands of requests per second in its sleep:

location /.well-known/acme-challenge/ {
  return 404 "";
}

I don't think that trying to prevent ACME orders against your server is going to be fruitful as a mitigation strategy.


Then I may be misunderstanding some of the logic.

I thought the above nginx setup would give the Let's Encrypt validation server the impression that the domain was not hosted on this machine... or does the 404 signify that yes, the 3rd-level domain is hosted at that IP address, but does not respond with content?

That's correct. A location block like that will cause any ACME HTTP Challenge to fail with a 404. It will be very fast, since it avoids Passenger, but it also blocks any valid HTTP Challenges you might want to process on that server.

So you need to be careful about which server blocks you place that location statement in.
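For example, a minimal sketch (old.example.com and app.example.com below are placeholder names, not taken from your setup) would keep the hard 404 confined to server blocks for names you no longer serve here:

server {
    listen 80;
    # Names that were moved away from this machine: reject their challenges cheaply.
    server_name old.example.com;

    location /.well-known/acme-challenge/ {
        return 404;
    }
}

server {
    listen 80;
    # Names still hosted (and renewed) here: leave challenge handling untouched.
    server_name app.example.com;

    # existing locations and Passenger configuration stay as they are
}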

Also, I'm not clear why requests for certs on another server would get directed to this one. The Let's Encrypt server uses public DNS to find the IP to use for the HTTP Challenge. Wouldn't your third-level domains have a different IP?


Some common scenarios that cause this are:

  • Virtual servers that were not shut down and were mistakenly left running, with their former domains pointed elsewhere. This is very common in buggy auto-scaling setups.
  • Buggy ACME clients
  • Anti-Patterns used to deploy ACME clients in a clustered or scalable environment (nodes and/or domains)

I suggest auditing your network to see whether these requests originate from ACME orders placed by your own systems, and fixing the problems if they do. This is in your best interest, as you will otherwise have to invest a lot of effort to deploy a solution that blocks unwanted ACME traffic while allowing wanted ACME traffic.

If you have some sort of white-label service that others point domains to, these requests could be coming from present/former business clients who are messing up their integrations.


Simply put, 3rd-level domains were shifted among existing VPSes.

you need to be careful about which server blocks you place that location statement in.
Yes, I fear this.

Maybe, then, an alternative would be to simply point the request at a static object?
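Something along these lines, perhaps (just a rough sketch; /var/www/acme is a hypothetical webroot, not a path from my actual configuration):

location /.well-known/acme-challenge/ {
    # Serve challenge files straight from disk so the request never reaches Passenger.
    root /var/www/acme;
    default_type text/plain;
    try_files $uri =404;
}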

Can you give specific examples? It is not possible to give specific advice with such a general problem description.

For example, let's start with:

Give domain names for 2nd and 3rd level domains that appear in the nginx log.

Describe which domain name you don't think should be in the nginx log.

Show the nginx server block that is listening on port 80 and receiving these ACME Challenge requests. You say you first see a 301, then a 404, and then the requests flood Passenger. This can be avoided.
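For example, if the 301 comes from a blanket HTTP-to-HTTPS redirect, a sketch along these lines (example.com and /var/www/acme are placeholders, not taken from your config) answers the challenge directly on port 80 and never hands it to Passenger:

server {
    listen 80;
    server_name example.com;

    # Handle ACME HTTP-01 challenges here, before any redirect.
    location /.well-known/acme-challenge/ {
        root /var/www/acme;
        try_files $uri =404;
    }

    # Everything else is still redirected to HTTPS.
    location / {
        return 301 https://$host$request_uri;
    }
}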


Apologies if some of my statements/questions have been misleading.

The problem is the spawning of processes in Passenger, which freezes at 100 concurrent processes. Digging through the log was not undertaken, because to reach that figure one has to track back quite a bit.

Although I have no proof, it seems some nginx configuration file might have been altered, but the certs were NOT renewed for the entire new set of 3rd-level domains.
Thus challenges were still being issued; the cert serving that specific conf file was limping along with partial failures... and generating new challenges. These were all routed to Passenger, which had no destination to route them to, leaving a hanging process. Plus one more, and so on.

So the main lesson is: 'if you change your 3rd-level domains, (delete the existing cert?) and regenerate the cert for the desired domains.'

One lesson I'm still dubious about: would it not make sense to nonetheless direct anything under /.well-known/* to a static acme_challenge.html file, for safety purposes?

I usually do something like:

location /.well-known/acme-challenge/ {
    if (-f /etc/nginx/flags/certbot_running) {
        # Proxy to the higher port that certbot's standalone server listens on
        # (8888 is only an example port).
        proxy_pass http://127.0.0.1:8888;
        # Stop rewrite processing here so the "return 403" below is skipped.
        break;
    }
    return 403;
}

nginx basically uses the operating system's file cache to check for the existence of the semaphore file, so it's usually all done in memory. You can just touch /path/to/flag to enable it, or delete it to disable.

You should find all of the processes generating these requests, though. Otherwise you're just shifting your problem and bad architecture onto Let's Encrypt instead of stopping these requests from being made.

