Certbot renew fails

I got a failure with certbot renew --dry-run for a subdomain: auth.notes.byped.fr.
It fails with:

Domain: auth.notes.byped.fr
  Type:   connection
  Detail: 79.90.190.23: Fetching http://auth.notes.byped.fr/.well-known/acme-challenge/-some-hash: Timeout during connect (likely firewall problem)

I'm running nginx and the server is listening on both port 80 and 443. Both ports are reachable from the internet (tested from another host):

$ ip a
[...]
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
[...]
    altname enp3s0
    inet 188.X.Y.Z/24 metric 100 [...]

$ telnet 79.90.190.23 80
Trying 79.90.190.23...
Connected to 79.90.190.23.
Escape character is '^]'.
GET / HTTP/1.1
Host: auth.notes.byped.fr

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 30 Apr 2025 10:20:09 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Location: https://auth.notes.byped.fr/

<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

I can access both URLs (HTTP and HTTPS) from a browser, the former redirecting to the latter. I've tried with and without a special nginx configuration. The special configuration contains:

server {
    listen 80;
    server_name auth.notes.byped.fr;

    location ^~ /.well-known/acme-challenge/ {
        default_type "text/plain";
        root /var/www/letsencrypt;
    }

    location = /.well-known/acme-challenge/ {
        return 404;
    }

    location / {
        return 301 https://$host$request_uri;
    }
}
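A quick way to sanity-check the webroot wiring is to drop a test token where the config points and fetch it back. The sketch below uses a temporary directory so it runs anywhere; against the live server you'd place the file under /var/www/letsencrypt instead and fetch it over HTTP (the token name here is made up):

```shell
# Create a throwaway webroot mirroring the layout nginx serves from.
webroot=$(mktemp -d)
mkdir -p "$webroot/.well-known/acme-challenge"
printf 'ok\n' > "$webroot/.well-known/acme-challenge/test-token"

# Against the real server you would then run (hypothetical token name):
#   curl -i http://auth.notes.byped.fr/.well-known/acme-challenge/test-token
cat "$webroot/.well-known/acme-challenge/test-token"
```

If the live curl times out while this local check works, the problem is in front of nginx, not in the webroot config.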

As I understand it, Certbot overrides this configuration anyway, so I don't see how it could interfere. Other domains on the same machine renew correctly; only this one, which is a reverse proxy, fails.

Notice that the proxied server answers any standard HTTP GET with a 404, but the error reported by Certbot is a timeout, and I don't understand how that could happen. The error is systematic on this SNI host; all other hosts served by this machine (more than 10) don't fail.

Looking at https://check-host.net/check-http?host=auth.notes.byped.fr, it looks like connections are only accepted from within Europe (and Tokyo?), although some endpoints within Europe also failed.

Looks like some geoblocking going on.

1 Like

Why does it work for other hosts on the same machine, then?

I don't know :man_shrugging:

1 Like

I've rerun the check multiple times, each time getting a different set of failed servers (servers that previously worked now fail, and those that failed now work). Maybe it's a transient error.

Try some of the other hosts from various places around the world. Maybe whoever's blocking started recently, and the other hosts had gotten their certificates already? Or maybe the system has more than one IP (more common now as many systems have both IPv4 and IPv6 addresses, or at least should) and different hosts are using different ones?

For more information on how geoblocking prevents Let's Encrypt from validating control over your domain, including some links to some test sites, you may want to read through this:

But nobody here is likely to know why your system's connectivity isn't working as well as you think it should.

4 Likes

I've no idea what my ISP is doing. I don't have any geo-specific rules on the firewall. I was able to generate the certificate this morning, maybe by pure luck. I can ping the host from anywhere in the world with no issue; it's only TCP and UDP connections that seem to be slowed. I do have multiple IP addresses (IPv4 + IPv6), but the error happens only on IPv4, and all hosts on the same machine reply only on IPv4 (I haven't set up the IPv6 DNS records anyway). Why some domains work and some don't, I don't know; they all resolve to the same IP address, either via CNAME or directly via an A record. I'll wait for a day or two and check again.

Certbot only overrides your nginx config when using the --nginx option. If that is what you use, you don't need the location blocks for /.well-known in your server block; the --nginx option adds the statements needed to reply to the Let's Encrypt server.

None of that affects timeouts. Just noting for reference.

The connection problem is interesting. The https://letsdebug.net test site never gets a timeout. It tries from its own server (I forget where it's hosted, but somewhere in or near the EU). It also tests using the Let's Encrypt staging system, which always gets the expected 404 for the token. That staging test uses a server in the USA, yet the check-host test site had all its USA tests failing.

I also connect just fine from an AWS-based server in the USA.

Could there be some server or connectivity problem in your system? You mention proxying and multiple servers, so could whatever you use for that be having an intermittent problem? Is this a residential system? Maybe just rebooting your router, switches and such would help. I know it's an old joke ('switch it off and back on'), but it sometimes helps.

What kind of rules do you have? By origin IP?

4 Likes

Yes, it's a residential system. The network architecture is:
ISP > ISP's router at home (with a DMZ pointing to) > my router/firewall > HTTPS reverse-proxy machine (nginx) > Docker containers

I've rebooted what I could, but to no avail. I have no control over the ISP side, nor over the ISP router firmware. I have a public IPv4 address and I've set the domain's DNS to point to it. There's no geo rule anywhere in my control chain. Notice that it used to work perfectly for many years; just this morning, when I added a new subdomain, I got a timeout from Certbot, but it eventually issued the certificate, and now I get a timeout every time I try to renew it.

The only rules I have are port-based; nothing selects on IP address, and I don't have fail2ban or any other such rules. I have a tarpit rule for SSH (port 22), but it doesn't apply to the HTTP(S) ports.

Since ICMP works fine, I guess some ISP filtering is going on at the IP level. I just don't know how to check this.
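One way to see whether something on the path treats TCP differently from ICMP is to traceroute with TCP SYNs to port 80 and compare with a plain ICMP run; the hop where the two diverge is where filtering likely starts. A sketch, assuming mtr is installed:

```shell
# ICMP view of the path (what ping exercises):
mtr -rwnc 20 auth.notes.byped.fr
# Same path probed with TCP SYNs to port 80 (what HTTP connections exercise):
mtr -rwnc 20 --tcp --port 80 auth.notes.byped.fr
```

If the ICMP run completes but the TCP run starts losing packets at a particular hop, that hop (or the device behind it) is the place to investigate.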

You'd have to ask your ISP

Are you sure your nginx->containers are routing/working reliably?

The failures from various locations are peculiar; not quite what we normally see with routine geo-blocking. It looks more like a firewall based on IP or similar.

Or, inconsistent behavior in your own servers / network.

It could also just be some routine comms issue related to your location. Sometimes the "backbone" network has problems that can look like this. If so these usually resolve on their own in a day or two as the network provider's own diagnostics identify and resolve it.

In any case, this isn't really a Let's Encrypt problem. If you want to avoid having to serve HTTP requests for a cert, you could use a DNS challenge. That wouldn't help anyone else having connectivity problems, but at least the cert requests would work reliably. Automating it needs a DNS provider with an API that Certbot supports; if yours doesn't have one, it's more tedious unless you switch DNS providers (such as to Cloudflare).

See: Challenge Types - Let's Encrypt
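For reference, with a supported provider the DNS challenge is a one-liner. This sketch assumes Cloudflare DNS and the certbot-dns-cloudflare plugin; the credentials path is a made-up example:

```shell
# /root/.secrets/cloudflare.ini (chmod 600) would contain an API token line:
#   dns_cloudflare_api_token = <your-token>
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /root/.secrets/cloudflare.ini \
  -d auth.notes.byped.fr
```

The plugin creates and removes the _acme-challenge TXT record itself, so no inbound connection to your network is needed at all.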

3 Likes

This doesn't look like geo-blocking, but rather like a very aggressive Anti-DDoS firewall blocking and/or throttling the connection.

I can easily reproduce this by just connecting frequently to your host, like this:

for i in $(seq 1 30); do time $(curl --silent auth.notes.byped.fr > /dev/null); done

The first 5 connections are always very fast, less than 100 milliseconds each, but on the sixth connection throttling kicks in and limits me to 1 request/second. After about 10 connections the throttling tightens to 0.3 req/s, then lifts and the cycle repeats.

And this is only from one single IP address. I imagine that connecting from multiple IPs at once results in more severe penalties (I haven't verified this, though), which is what causes the timeout issues for Let's Encrypt and the check-host.net website. (Check-host, by the way, tends to show a flip-flopping set of blocked servers, likely because the requests race each other and the DDoS blocking lets X servers through before blocking; the servers with the best timing/lowest ping usually win.)

for i in $(seq 1 30); do time $(curl --silent auth.notes.byped.fr > /dev/null); done

real    0m0.082s
user    0m0.004s
sys     0m0.000s

real    0m0.081s
user    0m0.003s
sys     0m0.000s

real    0m0.081s
user    0m0.001s
sys     0m0.003s

real    0m0.081s
user    0m0.003s
sys     0m0.000s

real    0m0.080s
user    0m0.003s
sys     0m0.000s

real    0m1.107s
user    0m0.003s
sys     0m0.000s

real    0m3.199s
user    0m0.005s
sys     0m0.000s
[...]
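To quantify this without eyeballing `time` output, you can feed curl's machine-readable timings through a small filter that flags anything well above the warm-up baseline. Below it runs over the sample numbers from the transcript above; in a live run you'd replace the heredoc with something like `for i in $(seq 1 30); do curl -so /dev/null -w '%{time_total}\n' http://auth.notes.byped.fr/; done`:

```shell
# Average the first five requests as a baseline, then flag any later request
# that takes more than 3x that long (a crude throttling detector).
awk '
  NR <= 5 { base += $1; next }
  NR == 6 { base /= 5 }
  $1 > 3 * base { printf "request %d throttled: %.3fs\n", NR, $1 }
' <<'EOF'
0.082
0.081
0.081
0.081
0.080
1.107
3.199
EOF
```

The 3x threshold is arbitrary; tune it to your normal response-time spread.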
5 Likes

Good catch. I tried your script from my AWS East Coast server, though, and don't see throttling as reliable as you describe. To me it looks more like a bogged-down system than a well-programmed DDoS protection.

My series of "real" times look like this:

real    0m0.508s
real    0m0.189s
real    0m0.200s
real    0m0.200s
real    0m0.188s
real    0m1.257s
real    0m0.199s
real    0m1.209s
real    0m1.216s
real    0m1.204s
real    0m1.223s
real    0m0.198s
real    0m1.215s
real    0m1.216s
real    0m1.204s
real    0m1.216s
real    0m0.189s
real    0m1.229s
real    0m2.230s
real    0m7.367s
real    0m0.200s
real    0m0.200s
real    0m1.199s
real    0m1.220s
real    0m1.217s
real    0m0.188s
real    0m1.221s
real    0m1.214s
real    0m2.229s
real    0m0.200s
4 Likes

Very interesting hypothesis. I've run these tests to verify it:

  1. Checked whether hosts proxied by nginx are at fault: there's throttling for all of them.
  2. Checked whether hosts without nginx proxying are at fault (plain HTTP file serving, no PHP, nothing): there's throttling for all of them too.
  3. Disabled the firewall (temporarily): there's still throttling.
  4. Removed my router/firewall machine and set the HTTP reverse proxy as the DMZ target. This solved the throttling issue, and now all checks on the check-host website pass.

This means it isn't an ISP issue but something on my side: the router is throttling incoming connections.

So it's actually very good news, as I can now try to sort out the issue with the router I'm using. Thanks!

4 Likes

Indeed, the router I'm using (Asus RT-ACXXU) has DoS protection in its firewall package that stays active even when the firewall is disabled; it must be disabled manually. Its default setting (not configurable) limits incoming connections to 1 TCP SYN per second, which clearly breaks with Let's Encrypt's new "Multi-Perspective Validation" feature.
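For anyone curious what such a rule amounts to, here is a rough nftables equivalent of a 1-SYN/s limit (illustration only; the Asus firmware uses its own implementation, and the table/chain names here are assumptions):

```shell
# Accept at most one new TCP connection attempt per second, drop the rest.
nft add rule inet filter input tcp flags syn limit rate 1/second accept
nft add rule inet filter input tcp flags syn drop
```

With multi-perspective validation sending several near-simultaneous SYNs from different vantage points, a limit like this all but guarantees that some of them get dropped.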

I'd like to thank both @petercooperjr for explaining the feature and @Nummer378 for thinking about the throttling issue.

So now, I'm a bit stuck here:

  1. Either I disable the router's DoS protection so Certbot can renew my certificate (and since I don't actually know when renewal happens, I must keep it disabled permanently). I think that's a bad move. Many people use Asus routers at home, so they're likely to hit this issue (and since there's no more email sent on failure, there's going to be ranting in the future).
  2. Or I change the challenge mode to DNS (but since my domain provider doesn't have an API that's available in Certbot, it's a pain). There's an old ACME tool for this, but I haven't tried it yet.
  3. Or I hope that Certbot renews when there's no activity on my website, so it doesn't trigger the DoS protection. However, if it fails, I won't know, since no email is sent anymore.

For the Let's Encrypt developers: would it be possible to rate-limit the multi-perspective validation?
I don't see why it needs to be instantaneous.

2 Likes

I'm not sure about systemd timers, but I'm pretty sure cron sends an email with Certbot's output when it fails.
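Concretely, cron mails a job's stdout/stderr to MAILTO whenever there is any output, so a quiet Certbot run produces mail mostly on failure. A crontab sketch (the address is a placeholder, and a working local MTA is required):

```
MAILTO=you@example.com
# Run twice daily; --quiet suppresses output on success, so mail means trouble.
0 3,15 * * * certbot renew --quiet
```

Packaged Certbot installs usually ship a systemd timer instead, which does not send mail by default; there you'd need an OnFailure= hook or external monitoring.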

1 Like

I believe the timeout for the HTTP challenge is 10 s. I could easily be wrong, as I'm not the best at reading Go (see: GitHub - letsencrypt/boulder: An ACME-based certificate authority, written in Go). Of course, even if that's correct, it could change.

The current auth sequence has one request going out from the primary Let's Encrypt server. If that succeeds, the secondary vantage points are engaged (currently 4).

Usually, if a secondary auth server fails, the error message says "Secondary". Yours doesn't, so it probably comes from the primary. There isn't much LE could do to slow that request, and 10 s is plenty long to wait. And yes, some resources are tied up on the LE server while it waits.

The 4 secondary requests may arrive at your server at about the same time, but asking a server to reply to just 4 requests is not a large burden. The better question is why your system takes so long to reply; I don't think that has been fully explained yet.

It may be less time-consuming for you to switch DNS providers than to sort that out. You shouldn't have to move your registrar; just update the DNS servers at your registrar to point somewhere else. Cloudflare is popular, free for many uses, and well supported by its docs and community, as well as by various ACME clients, including Certbot.

You are not obligated to use the pre-installed systemd timer or cron job for Certbot renewal. You could replace them with your own, or adjust the pre-installed ones to run in a window of your choosing. That said, this may be a game of whack-a-mole, as you can't control when bots crawl your domains (or how many). Something was causing long delays: I saw a 7 s delay right after a 2 s delay in my series. Nummer saw a different pattern, but also fairly long response times. Even now, when (I assume) that firewall is not active, about 7% of requests from my server to yours take over 1 s.
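On systemd-based installs, shifting the renewal window is a drop-in override away (`systemctl edit certbot.timer`); the times below are placeholders to adapt:

```ini
[Timer]
# The empty line clears the packaged schedule; then pick a quiet window.
OnCalendar=
OnCalendar=*-*-* 04:30:00
RandomizedDelaySec=30m
```

After saving, `systemctl daemon-reload && systemctl restart certbot.timer` applies the new schedule.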

If you modify your renewal timing please keep this FAQ about that in mind: FAQ - Let's Encrypt

4 Likes