One of your helpful tech persons (@rg350) suggested I post a summary of my help request (Certificate renewals fail on all mail and web servers) here as it raises an issue that needs to be addressed by Let's Encrypt ("LE") urgently.
I run a small server farm (primarily email, web sites and social media hubs) housed in a major French rack-hosting data centre and have used LE to obtain HTTPS certificates almost since LE's inception.
The entire farm sits behind an "enterprise grade" firewall of my choice and under my control - the significance of which will become clear.
For years all servers and services reliant on LE have hummed along just fine - certificate renewals "just happened" automatically (cron jobs) and new web sites or services obtained their certificates with a simple, concise command line entry.
Just over a month ago, as the then-current certificates began to enter their expiry / renewal phase (i.e. 60 days into the 90-day validity period), certificate renewal stopped working. Every attempt (in short, running certbot renew) produced nothing but the same meaningless errors - stating that (mostly) the http-01 challenge had failed - followed by a claim that my firewall was at fault and, to add to the 'fun', an occasional statement that a dns-01 challenge had failed, accompanied by the helpful suggestion that I should place the correct DNS A records in the correct place.
Extracts from relevant logs can be found in the original help request.
- All of the web sites and services were fully and correctly published in DNS and could be resolved with any tool against any public (or our internal private) DNS service right up to the TLDs (trivially verifiable - see the dig sketch after this list)
- There had been NO changes to the configurations of any web or other service since the previous renewal cycle - for some web sites no change in a decade
- There had been no relevant change in firewall configuration since installation in 2014 and none at all across the past three certificate renewal cycles (more than 6 months)
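To head off the obvious question, all of this was checked repeatedly along the lines below (example.com stands in for the real hostnames):

```
# Confirm public resolution against independent resolvers (example.com is a placeholder)
dig +short A example.com @8.8.8.8    # Google public DNS
dig +short A example.com @9.9.9.9    # Quad9 - a second opinion
dig +trace example.com               # walk the delegation down from the root / TLD
```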
... and yet ...
certbot continued to fail to renew any certificate, throwing out the same meaningless and obviously false errors every time.
What started as a small campfire grew into a forest fire as the month progressed, and finally an inferno as the certificate used by our email server actually expired - cutting off all access to email for our users across the world.
When a certificate enters its renewal window, cron jobs here attempt to renew it up to four times per day. Normally the renewal succeeds on the first attempt and we bother your servers no more for another 60 days. This month, renewal of the email server certificate had been attempted 30 x 4 = 120 times automatically, plus as many times manually as we could within the rate limits imposed by LE. Then it expired.
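For context, the automation here is nothing exotic - something on the order of this crontab entry (the times and binary path are illustrative, not our exact configuration):

```
# Illustrative /etc/crontab entry: attempt renewal four times a day.
# certbot only contacts LE for certificates already inside their renewal window.
17 2,8,14,20 * * *  root  /usr/bin/certbot renew --quiet
```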
Exactly the same was happening with all other certificates - running in differing operating environments, different operating systems and different use cases. All were failing to renew four times each and every day, issuing essentially identical false error messages each time. Attempts to renew certificates manually produced identical results.
Trying to debug and correct the problem
Very little is written or available on the way that certbot operates its challenges. Reverse engineering the code got me only so far - and not far enough.
The combination of web server logfiles (which seemed to show a confusing pattern of successful and failed responses to the key challenges issued by the http-01 process) and what I can only describe as complete gobbledy-gook in the LE logs (I am very technically proficient, so use the term correctly) just caused greater confusion.
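For anyone else stuck at this point, the essence of http-01 is simple: LE's validation servers fetch a token file over plain HTTP on port 80 at a well-known path. A minimal self-test sketch (example.com and the webroot are placeholders for your own values):

```bash
#!/usr/bin/env bash
# Minimal http-01 self-test - example.com and the webroot are placeholders.
DOMAIN="example.com"
WEBROOT="/var/www/html"

# Put a dummy token where certbot's webroot plugin would put the real one...
mkdir -p "$WEBROOT/.well-known/acme-challenge"
echo "self-test" > "$WEBROOT/.well-known/acme-challenge/selftest"

# ...then fetch it the way a validation server would: plain HTTP, port 80.
curl -fsS "http://$DOMAIN/.well-known/acme-challenge/selftest" && echo " <- reachable"
```

The sting, of course, is that this only proves reachability from wherever you run curl - LE's validators arrive from their own addresses, which is precisely where a DNSBL-driven firewall bites.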
Hundreds of hours were expended just proving what had been the number one suspicion from the beginning - the cause of the problem lay outside our systems and completely outside our control.
Hence - the call on your help team.
They quickly came back with a reply ... that the problem was probably "geofencing" (geographic blocking) of the servers LE was using, "spread around the globe", to issue multiple http-01 challenges before a certificate is issued or renewed.
Now recall that our firewall config has been essentially static for six years during which time it has passed every LE certificate issue and renewal without problem.
Then understand that we employ no "geofencing" (e.g. we do not block the whole of North America or Asia from accessing our systems - nor any other geographic region).
BUT ... the firewall does (as is very common) use DNS blacklists (DNSBLs) to block traffic from IP addresses that reputable blacklist services have flagged.
Disabling the DNS blacklist functions momentarily was all it took for all certificate renewals to proceed successfully to completion in a few seconds - as would normally be expected.
During the ~half-hour our firewall was open we saw several thousand new fail2ban hits - we usually see about a dozen per day. THAT is the effect of running an open firewall.
So, the problem is caused by ...
Examination of HTTP and LE logs following the renewals revealed that:
- LE has only two servers of its own that issue http-01 challenges (outbound1.letsencrypt.org and outbound2.letsencrypt.org)
- All other challenges were issued by AWS cloud instances using widely varying IP addresses
The great majority of these AWS-issued IP addresses appear in DNS blacklists - presumably because their previous users used them to spray out exactly the sort of traffic nobody wants passing a firewall protecting Internet-facing systems.
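For anyone wishing to repeat the exercise, a rough sketch of the check (the log path assumes an nginx / Apache combined-format access log - adjust for your own setup):

```bash
# List the unique source IPs that fetched http-01 challenge files,
# then reverse-resolve each one to see who it belongs to.
grep '/.well-known/acme-challenge/' /var/log/nginx/access.log \
  | awk '{print $1}' | sort -u \
  | while read -r ip; do echo "$ip $(dig +short -x "$ip")"; done
```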
Your help team confirmed that it is LE's practice now to spread the load on its servers by spinning up AWS server instances on whatever IP is issued by AWS as needed.
Impact of this practice
The stated mission objective of the LE project is to encourage the spread of HTTPS use by simplifying the technicalities required to "HTTPSize" a web server and reduce the cost of obtaining and maintaining the necessary certificates. In this it has been highly successful - with 225 million certificates issued according to the LE web site.
HOWEVER, that progress is about to come to a grinding halt. I strongly doubt that I am alone in running my Internet-facing services behind a strong firewall - one that sensibly uses the best "intelligence" it can obtain on which IP addresses it should allow no traffic from, as those IP addresses have shown themselves to be major spam / malware / phishing / DoS trouble-makers.
Cloud providers like AWS and ISPs commonly choose to recycle "damaged" IP addresses (i.e. addresses whose previous use has caused them to be placed on DNS blacklists) to new users - in the hope that the new user will "clean" the IP by conducting no illicit activity for long enough for the blacklist entry to expire (usually between one and three months, but sometimes as long as a year).
LE have blindly accepted and put into live production use "dirty" IP addresses in a critical infrastructure role.
WORSE, LE has implemented the MAJORITY of its http-01 challenge servers in this way - meaning that any LE user's server sitting behind a half-decent firewall will inevitably fail to receive sufficient http-01 challenges ever to pass LE's http-01 test.
Am I alone in being affected by this?
I strongly doubt it.
The only significant difference between my environment and that of a typical small or medium-sized business (I assume the prime target for LE) is that I have control of my firewall - so I can turn off the (entirely proper) blocking mechanisms long enough for a certificate issue or renewal to succeed before "slamming the door shut" again. A typical business is likely to just sign up online for a virtual or rack server or cluster (as scale demands) and, when asked "Do you want a firewall with that?", just tick the box ... thus becoming "protected" by a corporate, data-centre-managed firewall appliance over which they have no control.
What choice would they have faced with the problem I have faced for the entirety of this past month?
Even given the advice from your helpline colleagues to, in essence, "open your firewall to the four winds", they have no way of doing so - and no data centre is going to open the flood gates of a firewall protecting who-knows-how-many client servers.
They have a choice of:
- giving up HTTPS (contrary to LE's mission statement) and all the SEO and e-commerce that goes with it
- finding expensive professional help (contrary to LE's mission statement) to implement a different ACME client that doesn't rely on correctly blocked servers to authenticate a certificate-requesting server.
LE must realise the harm it is doing.
The cost of the time alone spent here trying to resolve this problem - entirely of LE's making - would have paid for a substantial donation to the EFF. And that's just one client. Assume a ridiculously low figure of 1% of LE certificate users having servers behind a half-decent firewall (1% of 225 million = 2.25 million) and you have just cost the world 2.25 million times the cost I have just incurred.
That is 2.25 million extremely irate LE users whom I wouldn't blame for a second if they sought to recover their costs and losses from the EFF.
Bear in mind this is a problem likely to impact far more than 1% of your users.
The solution is staggeringly simple and quick to implement
The KEY thing is to ensure that only servers with squeaky clean IP addresses are put into production as acme-challenge servers.
The options for achieving this are several:
- Insist when opening a new AWS (or whomever) cloud instance that the IP address issued is "clean" and appears on no known DNS blacklists - it is so trivially easy to check any IP address that I can't bring myself to give you a few web URLs that do the job for you (a command-line version is sketched after this list) ... OR,
- open a tiny cloud instance, obtaining a (probably dirty) IP address as you do so - now either wait long enough for the IP to expire from (at least) all the major DNS blacklists, OR accelerate that process by using the IP to host some of your own web traffic to "enhance" its reputation, OR write to all the DNSBL operators saying "We're the EFF - please remove IP xxx.xxx.xxx.xxx from your list - Thanks!" As soon as the IP address is clean, enlarge the instance to whatever scale is appropriate and put it into production use.
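To make the point about how easy the check is, here is a sketch of a single-IP lookup against one well-known DNSBL (Spamhaus ZEN here - substitute whichever lists matter to you; the convention is the same everywhere: reverse the octets, query under the list's zone, and any 127.x.x.x answer means "listed"):

```bash
#!/usr/bin/env bash
# Sketch: check one IPv4 address against one DNSBL (zen.spamhaus.org here).
# Usage: ./dnsbl-check.sh 203.0.113.7
IP="$1"

# DNSBL convention: 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
REV=$(echo "$IP" | awk -F. '{print $4"."$3"."$2"."$1}')

if dig +short "$REV.zen.spamhaus.org" A | grep -q '^127\.'; then
  echo "$IP IS LISTED in zen.spamhaus.org"
  exit 1
else
  echo "$IP is not listed in zen.spamhaus.org"
fi
```

A serious audit would of course consult several lists, and note that Spamhaus's free query service carries its own usage limits.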
It bears saying once again ...
NO SERVER SHOULD BE RELEASED FOR LIVE PRODUCTION USE UNLESS ITS IP ADDRESS IS SQUEAKY CLEAN
or put another way ...
NEVER DEPLOY A SERVER TO LIVE PRODUCTION USE (IN A CRITICAL INFRASTRUCTURE ROLE) UNLESS ITS IP ADDRESS IS SQUEAKY CLEAN
What do you need to do now?
- Conduct an audit of all AWS (and any other cloud or externally hosted) acme-challenge (http-01) servers - using any of the readily available lookup sites, or a script like the one sketched after this list - to verify that they appear on NO DNS BLs
- IMMEDIATELY REMOVE from production use ANY server that appears in any DNS BL. **Do not put it back into production before ensuring that its IP is squeaky clean.**
- To the extent that you need additional servers to carry the load of 225 million certificates renewing on a 60-day cycle (225,000,000 / 60 ≈ 3.75 million renewals per day), contract only for servers with squeaky clean IP addresses.
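The audit in the first point is just the single-IP check run in a loop - a sketch, assuming a plain-text inventory of challenge-server addresses (the filename and the single list are my assumptions):

```bash
#!/usr/bin/env bash
# Sketch: audit an inventory of challenge-server IPs (one IPv4 per line)
# against a single DNSBL; challenge-servers.txt is a hypothetical inventory.
DNSBL="zen.spamhaus.org"
dirty=0

while read -r ip; do
  rev=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1}')
  if dig +short "$rev.$DNSBL" A | grep -q '^127\.'; then
    echo "LISTED: $ip ($DNSBL)"
    dirty=$((dirty + 1))
  fi
done < challenge-servers.txt

echo "$dirty listed address(es) found"
exit "$((dirty > 0))"   # non-zero exit if anything is dirty - cron/CI friendly
```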
In closing, I trust:
- That this has been helpful
- That you will put the recommended actions into effect within the next 60 days (i.e. before my certificates are due for renewal again)
- The personal cost of analysing, debugging, researching - and contributing - this information has been substantial; please do not squander it.
George Perfect FBCS, FIoD