One of your helpful tech persons (@rg350) suggested I post a summary of my help request (Certificate renewals fail on all mail and web servers) here as it raises an issue that needs to be addressed by Let's Encrypt ("LE") urgently.
I run a small server farm (primarily email, web sites and social media hubs) housed in a major French rack-hosting data centre and have used LE to obtain HTTPS certificates almost since LE's inception.
The entire farm sits behind an "enterprise grade" firewall of my choice and under my control - the significance of which will become clear.
For years all servers and services reliant on LE have hummed along just fine - certificate renewals "just happened" automatically (cron jobs) and new web sites or services obtained their certificates with a simple, concise command line entry.
Just over a month ago, as the then-current certificates began to enter their expiry / renewal phase (i.e. 60 days into the 90-day validity period), certificate renewal stopped working. Every attempt (in short, running certbot renew) produced nothing but the same meaningless errors - stating that (mostly) the http-01 challenge had failed - followed by a claim that my firewall was at fault and, to add to the 'fun', an occasional statement that a dns-01 challenge had failed, accompanied by the helpful suggestion that I should place the correct DNS A records in the correct place.
Extracts from relevant logs can be found in the original help request.
- All of the web sites and services were fully and correctly published in DNS and could be resolved with any tool against any public (or our internal private) DNS service right up to the TLDs (trivially verifiable - see the dig sketch after this list)
- There had been NO changes to the configurations of any web or other service since the previous renewal cycle - for some web sites no change in a decade
- There had been no relevant change in firewall configuration since installation in 2014 and none at all across the past three certificate renewal cycles (more than 6 months)
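To head off the obvious question, all of this was checked repeatedly along the lines below (example.com stands in for the real hostnames):

```
# Confirm public resolution against independent resolvers (example.com is a placeholder)
dig +short A example.com @8.8.8.8    # Google public DNS
dig +short A example.com @9.9.9.9    # Quad9 - a second opinion
dig +trace example.com               # walk the delegation down from the root / TLD
```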
... and yet ...
certbot continued to fail to renew any certificate, throwing out the same meaningless and obviously false errors every time.
What started as a small campfire grew into a forest fire as the month progressed, and finally an inferno as the certificate used by our email server actually expired - cutting off all access to email for our users across the world.
When a certificate enters its renewal window, cron jobs here attempt to renew it up to four times per day. Normally the renewal succeeds on the first attempt and we bother your servers no more for another 60 days. This month, renewal of the email server certificate had been attempted 30 x 4 = 120 times automatically, plus as many times manually as we could within the rate limits imposed by LE. Then it expired.
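For context, the automation here is nothing exotic - something on the order of this crontab entry (the times and binary path are illustrative, not our exact configuration):

```
# Illustrative /etc/crontab entry: attempt renewal four times a day.
# certbot only contacts LE for certificates already inside their renewal window.
17 2,8,14,20 * * *  root  /usr/bin/certbot renew --quiet
```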
Exactly the same was happening with all other certificates - running in differing operating environments, different operating systems and different use cases. All were failing to renew four times each and every day, issuing essentially identical false error messages each time. Attempts to renew certificates manually produced identical results.
Trying to debug and correct the problem
Very little is written or available on the way that certbot operates its challenges. Reverse engineering the code got me only so far - and not far enough.
The combination of web server logfiles (which seemed to show a confusing pattern of successful and failed responses to the key challenges issued by the http-01 process) and what I can only describe as complete gobbledy-gook in the LE logs (I am very technically proficient, so use the term correctly) just caused greater confusion.
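For anyone else stuck at this point, the essence of http-01 is simple: LE's validation servers fetch a token file over plain HTTP on port 80 at a well-known path. A minimal self-test sketch (example.com and the webroot are placeholders for your own values):

```bash
#!/usr/bin/env bash
# Minimal http-01 self-test - example.com and the webroot are placeholders.
DOMAIN="example.com"
WEBROOT="/var/www/html"

# Put a dummy token where certbot's webroot plugin would put the real one...
mkdir -p "$WEBROOT/.well-known/acme-challenge"
echo "self-test" > "$WEBROOT/.well-known/acme-challenge/selftest"

# ...then fetch it the way a validation server would: plain HTTP, port 80.
curl -fsS "http://$DOMAIN/.well-known/acme-challenge/selftest" && echo " <- reachable"
```

The sting, of course, is that this only proves reachability from wherever you run curl - LE's validators arrive from their own addresses, which is precisely where a DNSBL-driven firewall bites.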
Hundreds of hours were expended just proving what had been the number one suspicion from the beginning - the cause of the problem lay outside our systems and completely outside our control.
Hence - the call on your help team.
They quickly came back with a reply ... that the problem was probably "geofencing" (geographic blocking) of the servers LE was using, "spread around the globe", to issue multiple http-01 challenges before a certificate is issued or renewed.
Now recall that our firewall config has been essentially static for six years during which time it has passed every LE certificate issue and renewal without problem.
Then understand that we employ no "geofencing" (e.g. we do not block the whole of North America or Asia from accessing our systems - nor any other geographic region).
BUT ... the firewall does (as is very common) use DNS blacklists (DNSBLs) to block traffic from IP addresses that reputable blacklist services have flagged.
Disabling the DNS blacklist functions momentarily was all it took for all certificate renewals to proceed successfully to completion in a few seconds - as would normally be expected.
During the ~half-hour our firewall was open we saw several thousand new fail2ban hits - we usually see about a dozen per day. THAT is the effect of running an open firewall.
So, the problem is caused by ...
Examination of HTTP and LE logs following the renewals revealed that:
- LE has only two servers of its own that issue http-01 challenges (outbound1.letsencrypt.org and outbound2.letsencrypt.org)
- All other challenges were issued by AWS cloud instances using widely varying IP addresses
The great majority of these AWS-issued IP addresses appear in DNS blacklists - presumably because their previous users used them to spray out exactly the sort of traffic nobody wants passing a firewall protecting Internet-facing systems.
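For anyone wishing to repeat the exercise, a rough sketch of the check (the log path assumes an nginx / Apache combined-format access log - adjust for your own setup):

```bash
# List the unique source IPs that fetched http-01 challenge files,
# then reverse-resolve each one to see who it belongs to.
grep '/.well-known/acme-challenge/' /var/log/nginx/access.log \
  | awk '{print $1}' | sort -u \
  | while read -r ip; do echo "$ip $(dig +short -x "$ip")"; done
```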
Your help team confirmed that it is LE's practice now to spread the load on its servers by spinning up AWS server instances on whatever IP is issued by AWS as needed.
Impact of this practice
The stated mission objective of the LE project is to encourage the spread of HTTPS use by simplifying the technicalities required to "HTTPSize" a web server and reduce the cost of obtaining and maintaining the necessary certificates. In this it has been highly successful - with 225 million certificates issued according to the LE web site.
HOWEVER, that progress is about to come to a grinding halt. I strongly doubt that I am alone in running my Internet-facing services behind a strong firewall - one that sensibly uses the best "intelligence" it can obtain on which IP addresses it should allow no traffic from, as those IP addresses have shown themselves to be major spam / malware / phishing / DoS trouble-makers.
Cloud providers like AWS and ISPs commonly choose to recycle "damaged" IP addresses (i.e. addresses whose previous use has caused them to be placed on DNS blacklists) to new users - in the hope that the new user will "clean" the IP by conducting no illicit activity for long enough for the blacklist entry to expire (usually between one and three months, but sometimes as long as a year).
LE have blindly accepted and put into live production use "dirty" IP addresses in a critical infrastructure role.
WORSE, LE has implemented the MAJORITY of its http-01 challenge servers in this way - meaning that any LE user's server sitting behind a half-decent firewall will inevitably fail to receive sufficient http-01 challenges ever to pass LE's http-01 test.
Am I alone in being affected by this?
I strongly doubt it.
The only significant difference between my environment and that of a typical small or medium-sized business (I assume the prime target for LE) is that I have control of my firewall - so I can turn off the (entirely proper) blocking mechanisms long enough for a certificate issue or renewal to succeed before "slamming the door shut" again. A typical business is likely to just sign up online for a virtual or rack server or cluster (as scale demands) and, when asked "Do you want a firewall with that?", just tick the box ... thus becoming "protected" by a corporate, data-centre-managed firewall appliance over which they have no control.
What choice would they have faced with the problem I have faced for the entirety of this past month?
Even given the advice from your helpline colleagues to, in essence, "open your firewall to the four winds", they have no way of doing so - and no data centre is going to open the flood gates of a firewall protecting who-knows-how-many client servers.
They have a choice of:
- giving up HTTPS (contrary to LE's mission statement) and all the SEO and e-commerce that goes with it
- finding expensive professional help (contrary to LE's mission statement) to implement a different ACME client that doesn't rely on correctly blocked servers to authenticate a certificate-requesting server.
LE must realise the harm it is doing.
The cost of the time alone spent here trying to resolve this problem - entirely of LE's making - would have paid for a substantial donation to the EFF. And that's just one client. Assume a ridiculously low figure of 1% of LE certificate users having servers behind a half-decent firewall (1% of 225 million = 2.25 million) and you have just cost the world 2.25 million times the cost I have just incurred.
That is 2.25 million extremely irate LE users whom I wouldn't blame for a second if they sought to recover their costs and losses from the EFF.
Bear in mind this is a problem likely to impact far more than 1% of your users.
The solution is staggeringly simple and quick to implement
The KEY thing is to ensure that only servers with squeaky clean IP addresses are put into production as acme-challenge servers.
The options for achieving this are several:
- Insist when opening a new AWS (or whomever) cloud instance that the IP address issued is "clean" and appears on no known DNS blacklists - it is so trivially easy to check any IP address that I can't bring myself to give you a few web URLs that do the job for you (a command-line version is sketched after this list) ... OR,
- open a tiny cloud instance, obtaining a (probably dirty) IP address as you do so - now either wait long enough for the IP to expire from (at least) all the major DNS blacklists, OR accelerate that process by using the IP to host some of your own web traffic to "enhance" its reputation, OR write to all the DNSBL operators saying "We're the EFF - please remove IP xxx.xxx.xxx.xxx from your list - Thanks!" As soon as the IP address is clean, enlarge the instance to whatever scale is appropriate and put it into production use.
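To make the point about how easy the check is, here is a sketch of a single-IP lookup against one well-known DNSBL (Spamhaus ZEN here - substitute whichever lists matter to you; the convention is the same everywhere: reverse the octets, query under the list's zone, and any 127.x.x.x answer means "listed"):

```bash
#!/usr/bin/env bash
# Sketch: check one IPv4 address against one DNSBL (zen.spamhaus.org here).
# Usage: ./dnsbl-check.sh 203.0.113.7
IP="$1"

# DNSBL convention: 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
REV=$(echo "$IP" | awk -F. '{print $4"."$3"."$2"."$1}')

if dig +short "$REV.zen.spamhaus.org" A | grep -q '^127\.'; then
  echo "$IP IS LISTED in zen.spamhaus.org"
  exit 1
else
  echo "$IP is not listed in zen.spamhaus.org"
fi
```

A serious audit would of course consult several lists, and note that Spamhaus's free query service carries its own usage limits.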
It bears saying once again ...
NO SERVER SHOULD BE RELEASED FOR LIVE PRODUCTION USE UNLESS ITS IP ADDRESS IS SQUEAKY CLEAN
or put another way ...
NEVER DEPLOY A SERVER TO LIVE PRODUCTION USE (IN A CRITICAL INFRASTRUCTURE ROLE) UNLESS ITS IP ADDRESS IS SQUEAKY CLEAN
What do you need to do now?
- Conduct an audit of all AWS (and any other cloud or externally hosted) acme-challenge (http-01) servers - using any of the readily available lookup sites, or a script like the one sketched after this list - to verify that they appear on NO DNS BLs
- IMMEDIATELY REMOVE from production use ANY server that appears in any DNS BL. **Do not put it back into production before ensuring that its IP is squeaky clean.**
- To the extent that you need additional servers to carry the load of 225 million certificates renewing on a 60-day cycle (225,000,000 / 60 ≈ 3.75 million renewals per day), contract only for servers with squeaky clean IP addresses.
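The audit in the first point is just the single-IP check run in a loop - a sketch, assuming a plain-text inventory of challenge-server addresses (the filename and the single list are my assumptions):

```bash
#!/usr/bin/env bash
# Sketch: audit an inventory of challenge-server IPs (one IPv4 per line)
# against a single DNSBL; challenge-servers.txt is a hypothetical inventory.
DNSBL="zen.spamhaus.org"
dirty=0

while read -r ip; do
  rev=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1}')
  if dig +short "$rev.$DNSBL" A | grep -q '^127\.'; then
    echo "LISTED: $ip ($DNSBL)"
    dirty=$((dirty + 1))
  fi
done < challenge-servers.txt

echo "$dirty listed address(es) found"
exit "$((dirty > 0))"   # non-zero exit if anything is dirty - cron/CI friendly
```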
In closing, I trust:
- That this has been helpful
- That you will put the recommended actions into effect within the next 60 days (i.e. before my certificates are due for renewal again)
- The personal cost of analysing, debugging, researching - and contributing - this information has been substantial; please do not squander it.
George Perfect FBCS, FIoD