1 of 15 domains failed certbot renew of cert (no issues since 2019)

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. https://crt.sh/?q=example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is: 3111skyline.com

I ran this command: certbot renew

It produced this output:

[23:13 valkyrie:/srv/http/tmp] # certbot renew
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/3111skyline.com.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
/usr/lib/python3.12/site-packages/certbot/ocsp.py:238: CryptographyDeprecationWarning: Properties that return a naïve datetime object have been deprecated. Please switch to this_update_utc.
  if not response_ocsp.this_update:
/usr/lib/python3.12/site-packages/certbot/ocsp.py:240: CryptographyDeprecationWarning: Properties that return a naïve datetime object have been deprecated. Please switch to this_update_utc.
  if response_ocsp.this_update > now + timedelta(minutes=5):
/usr/lib/python3.12/site-packages/certbot/ocsp.py:242: CryptographyDeprecationWarning: Properties that return a naïve datetime object have been deprecated. Please switch to next_update_utc.
  if response_ocsp.next_update and response_ocsp.next_update < now - timedelta(minutes=5):
Renewing an existing certificate for 3111skyline.com and 15 more domains

Certbot failed to authenticate some domains (authenticator: webroot). The Certificate Authority reported these problems:
  Domain: drrankin.com
  Type:   connection
  Detail: During secondary validation: 66.76.46.195: Fetching http://drrankin.com/.well-known/acme-challenge/CYwoluh_N-pwMztXv9Wsqp4qTIca8lVFDajberP7V2w: Timeout during connect (likely firewall problem)

Hint: The Certificate Authority failed to download the temporary challenge files created by Certbot. Ensure that the listed domains serve their content from the provided --webroot-path/-w and that files created there can be downloaded from the internet.

Failed to renew certificate 3111skyline.com with error: Some challenges have failed.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
All renewals failed. The following certificates could not be renewed:
  /etc/letsencrypt/live/3111skyline.com/fullchain.pem (failure)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 renew failure(s), 0 parse failure(s)
Ask for help or search for solutions at https://community.letsencrypt.org. See the logfile /var/log/letsencrypt/let

My web server is (include version): apache 2.4.62-1

The operating system my web server runs on is (include version): Arch Linux

My hosting provider, if applicable, is: N/A

I can login to a root shell on my machine (yes or no, or I don't know): Yep

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): none, command-line only

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 3.0.0

What is odd is that 3111skyline.com is the primary, but it is only 1 of the 15 domains in the certificate, and the domain that actually failed was drrankin.com. I have renewed these domains some 20 times since 2019 and never experienced any issue. The error said "likely firewall", so I disabled iptables and fail2ban, re-ran certbot renew, and it completed fine.

My firewall has not changed (other than perhaps different IPs in the various blocklists) since the last time the certificates were renewed. I have never had this happen before, and I do not understand what part of my "firewall" made the renewal fail. I have always renewed without disabling the firewall in the past. Why would it fail this time and then succeed immediately after taking the firewall down (something I am loath to do in this polluted internet environment we have allowed to be created)?

I've reviewed letsencrypt.log.1 and found this entry:

"During secondary validation: 66.76.46.195: Fetching http://drrankin.com/.well-known/acme-challen
ge/CYwoluh_N-pwMztXv9Wsqp4qTIca8lVFDajberP7V2w: Timeout during connect (likely firewall problem)"

But I have no clue how that failure is related to my firewall. However, the empirical evidence of certbot succeeding right after I took the firewall down suggests it is. I need to understand why so I can fix the firewall, if needed, so this doesn't repeat on the next renewal.

What to check?

1 Like

Do you have firewall logs that might describe why the Let's Encrypt authorization server could not connect?

Here is some background. LE first does the HTTP Challenge from its Primary center. If that succeeds, it tries from its Secondary centers; currently there are 4, dispersed around the globe. When "secondary validation" appears in the error message, it means one of those 4 Secondary centers failed. In fact it means more than 1 failed, as 1 failure is currently tolerated.

A successful challenge should show 5 successful access-log records in your Apache log (1 Primary, 4 Secondary). It might show 4 (because one Secondary is allowed to fail), but you will usually see 5.

I am not a fail2ban expert, so I am not sure how "tight" your settings might be. Perhaps the timing from the LE centers is slightly different now, enough to get blocked by your firewall.

Did this "timeout" failure occur repeatedly before you disabled the firewall? I wonder if there was just a stray, temporary, ISP related comms problem that just happened to resolve before you re-tried with firewall disabled.

One last note ... do any firewall testing using sudo certbot renew --dry-run

The dry run will not affect your production certs, and --dry-run also ensures that each test has LE sending fresh HTTP Challenges. Without it, LE uses its cache of prior successes and won't re-send the Challenge; that can look like something worked when it was really just the cache "working" :slight_smile: (this particular cache is 30 days).

2 Likes

Unfortunately, there is no firewall log. The packets are just silently dropped. What is odd is that there has been no change since the last renewal, or the renewal before that, etc. The failure came as quite a surprise.

Here is some background. LE first does the HTTP Challenge from its Primary center. If that succeeds, it tries from its Secondary centers; currently there are 4, dispersed around the globe.

Bingo! That is the likely cause, and the change. But then it should have failed for all domains, not just drrankin.com. What has changed (and the only thing that ever does) is the list of IPs in the various blocklists. How can I get the IPs for each of the 4 centers dispersed around the globe?

Given the dramatic increase in distributed brute-force attempts to compromise mail and web hosts, I have had to be much more aggressive in blocking IPs from netblocks that are repeat offenders. If the Let's Encrypt centers fall into one of those netblocks, I need their IPs so I can whitelist them and avoid any issue with the firewall.

I can simply check the center IPs against the current setup (iptables, ipset, fail2ban) to see if any of them falls within a blocked range in the firewall. Is there a way I can get a list of the IPs? I checked the IP I got for letsencrypt.org and it was fine, but I have no way of knowing about the secondary or other centers.
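The check itself is simple enough; something along these lines is what I have in mind (a sketch only: 203.0.113.50 is a placeholder for a real validation IP, "blacklist" is whatever my ipset is actually named, and the jail name is just an example):

  # is a given validation IP caught by the current firewall setup?
  ipset test blacklist 203.0.113.50          # membership in the ipset blocklist
  iptables -L INPUT -n -v | grep 203.0.113   # any matching DROP rules
  fail2ban-client status apache-auth         # IPs currently banned by that jail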

Would the --dry-run be helpful in this case, or would checking the center IPs be the better approach?

You cannot. LE does not publish such a list. Each center has a pool of IPs, and even those pools change regularly. See this FAQ answer: FAQ - Let's Encrypt

Maybe, or at least more than one. As I just noted, the source IPs change.

Can you check your Apache access logs and see the successful requests for each domain? Perhaps a pattern will emerge.
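For example, something like this (assuming Arch's default httpd log location and the combined log format; adjust the path to wherever your vhost logs actually live):

  # list the ACME challenge requests and the source IPs that made them
  grep acme-challenge /var/log/httpd/access_log | awk '{print $1, $7, $9}' | sort | uniq -c

A 200 status from five different source IPs per domain is what a fully successful validation round looks like.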

That is not an IP for any of the authorization centers. (You would connect to that outbound; the auth centers connect inbound to you.)

The advice from Let's Encrypt is that if you cannot keep HTTP open to the entire internet you should use the DNS Challenge.
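If you ever go that route, a one-off manual run looks roughly like this (a sketch only; in practice you would use a DNS plugin for your DNS provider so renewals stay automated, and the domain list here is just illustrative):

  # dns-01: prove control by publishing a TXT record instead of serving a file over port 80
  certbot certonly --manual --preferred-challenges dns -d 3111skyline.com -d drrankin.com

With dns-01 nothing has to reach your web server at all, so the firewall never enters into it.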

This is a terrific article on the overall strategy. Perhaps start with this section: Multi-Perspective Validation & Geoblocking FAQ

2 Likes

Alright. I'm not sure the "we hide the IPs from you" approach is all that helpful, but given all the crap I have to try to keep out of my servers, not having a published "hit list" is understandable.

I can do the log thing, but I wonder: would publishing the pools really be a bad thing? I can whitelist by CIDR (as can most firewalls), and if the pools are known, that provides a solution. If the pools are small 8-IP blocks, then publishing them wouldn't do much good, but if they are full /24 pools or larger, I'm not sure I see the harm.

At least I understand how this is related to the firewall now. Not being able to know which IPs the challenges come from is a bit at odds with being able to affirmatively manage the firewall to the best extent possible. I'll look at the access logs, but with one renewal every 90 days, I won't hold my breath waiting for a pattern to emerge. I suspect the logs from the renewal 3 months ago have long since gone the way of the Dodo...

Thanks for the information though. It did answer the question and clear up the mystery :)

One idea for you that isn't mentioned in the Multi-Perspective article:

Your cert has roughly 15 domains, and each will get 5 HTTP challenge requests. Some of the source IPs will repeat. Is getting this many requests in a short time enough to trigger a block by your firewall? If so, consider breaking up your cert into smaller groups, or "loosen up" your firewall settings.
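Splitting is just a matter of requesting a separate certificate for a subset of the names, roughly like this (a sketch; the webroot path, certificate name and domain list are only placeholders for your real ones):

  # issue a separate, smaller certificate for a subset of the names
  certbot certonly --webroot -w /srv/http/drrankin.com \
      --cert-name drrankin.com -d drrankin.com -d www.drrankin.com

Each smaller certificate then renews on its own schedule, so a single blocked validation only affects that group.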

1 Like

Oh yes, 3 strikes and you are out with fail2ban if an authentication attempt is made. Without authentication (which I suspect is how the challenges go), the fail2ban block isn't implicated at all.

What I suspect has happened is that one of the challenge servers is in an IP block from Latin America or APNIC, and I have large blocks from those areas completely banned. I don't do business there, and I don't need the 1000 intrusion attempts per day I see from there. So likely one of the challenge-server pool IPs is within an IP block I have added to the firewall due to repeated intrusion attempts from that same block.

This will likely become a larger issue going forward as the various providers in those locations start selling virtual servers. The problem is that they do no due diligence to prevent their services from being used by miscreants, which leads to the whole IP block being banned.

At least I have a good idea now of what the landscape looks like, and I can check and whitelist the IPs I can identify in the logs. Sooner or later I will amass enough to leave the firewall up. Until then I'll just take it down to renew and then bring it back up (quickly).

It's largely academic given LE has never done it and has not indicated they will.

Are you able to exempt any HTTP request with a URI starting with /.well-known/acme-challenge? That was suggested in the Multi-Perspective article, but I'm repeating it here for others who may just skim this thread :slight_smile:
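If the blocking in question happens at the Apache layer, that exemption can be as small as this, placed ahead of any redirect or blocking rules (a sketch; adjust to wherever your rewrite rules actually live):

  # let ACME challenge requests through untouched, before any other rewrites
  RewriteEngine On
  RewriteRule ^/?\.well-known/acme-challenge/ - [L]

A pure iptables/ipset block can't see the URI, of course, so there the exemption has to be by source address instead.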

No, none are currently. FWIW the secondary LE auth centers are all AWS (today).

That's likely a perpetual whack-a-mole.

2 Likes

Yes, I do some of that for Nextcloud and EGroupware, but I'll have to check on how to make that server-wide. I'll also look at whether the DNS challenge provides a better solution.

That's likely a perpetual whack-a-mole.

Hey, I have public-facing internet hardware, "whack-a-mole" is my specialty :slight_smile:

Thank you for all your help. The veil of confusion has been lifted. Just what the permanent solution will be is still a bit hazy, but I can at least see the target I am shooting at.

AWS, unfortunately, is one of the worst offenders in the "does no due diligence" and "doesn't respond to abuse reports" categories. I started collecting failed auth attempts several months ago to do identity and frequency analysis on them. AWS is a repeat offender.

While I generally do not block US- or CA-situated IPs, I do block individual repeat-offender IPs. If one of those AWS IPs cycles into the LE pool and ends up behind a challenge request, I can see exactly how this would happen again (one reason behind the "this will only become a larger problem" comment). I liked the internet better when it was ftp:// only...

2 Likes

Found It!

The offending IP originates from:

inetnum:        94.156.64.0 - 94.156.71.254
netname:        Fuse_Hostingweb-NET
descr:          Fuse Hosting Web
org:            ORG-FA1221-RIPE
country:        US
admin-c:        NA6844-RIPE
tech-c:         NA6844-RIPE
mnt-domains:    fusehosting-mnt
mnt-routes:     fusehosting-mnt
status:         ASSIGNED PA
mnt-by:         MNT-NETERRA
created:        2024-10-31T11:31:22Z
last-modified:  2024-10-31T11:31:22Z
source:         RIPE

And as surmised, it is part of a net-block the firewall has excluded since the last renewal, due to repeated distributed attacks originating within that net-block over the past 30 days. I'll whitelist the individual address used by the challenge and suspect all will go well on the next renewal (unless, of course, another offending net-block between now and then also contains one of your challenge IPs :)
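The whitelist itself is just an accept rule inserted ahead of the blocklist drops, along these lines (94.156.64.10 is a placeholder; I'll use whichever address actually shows up in the access log):

  # accept the validation IP on port 80 before the blocklist DROP rules match
  iptables -I INPUT 1 -s 94.156.64.10 -p tcp --dport 80 -j ACCEPT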

Thank you again for your help!

(now as to why a US listed net-block is being returned through a RIPE query... because inquiring minds just gotta know...)

  1. Leave port 80 unrestricted
  2. Exempt /.well-known/acme-challenge from redirects
  3. Redirect everything else to port 443
  4. Keep the desired geo-blocking on port 443
  5. Profit!
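In Apache terms, that works out to roughly this per port-80 vhost (a sketch assuming mod_rewrite and the webroot method; the geo-blocking on the 443 vhost stays whatever mechanism you use today):

  <VirtualHost *:80>
      ServerName 3111skyline.com
      # steps 1+2: port 80 stays open and ACME challenges are served as-is from the webroot
      RewriteEngine On
      RewriteRule ^/?\.well-known/acme-challenge/ - [L]
      # step 3: everything else goes to HTTPS, where the geo-blocking (step 4) applies
      RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
  </VirtualHost>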
2 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.