The Curious Incident of cPanel services CA change

(At its core this post is a rant about cPanel, and it was also posted on their community forum. It is posted here for a reason: Let's Encrypt also plays a prominent role in it, and posting it here can help.)


TL;DR:

For cPanel: you do not change the CA that the checkallsslcerts script uses to issue the cPanel services SSL certificate without proactively letting customers know about it, preferably in advance. This changes the CA of the cPanel services SSL certificates! Such a change has consequences…

For Let's Encrypt: I don't know why the IPs that r3.o.lencr.org resolves to change so constantly and frequently, but if you can, please avoid it, or at least space the IP changes over longer intervals, so that this FQDN does not become a lightspeed moving target. It breaks things (as you can read below).

The long version, my journey:

I have a very simple installation: cPanel on one Ubuntu backend server, behind a pfSense firewall that handles the connection to the Internet.

Recently I began receiving emails with a subject line like:
"
[backend-server-fqdn] The SSL (Secure Sockets Layer) certificate for “cpanel” on “backend-server-fqdn” will expire in less than 30 days.
"

Multiple emails like this hinted that the renewal of the cPanel services SSL certificate was not succeeding.

These services include the web server for the cPanel admin web sites and the email server that handles sending and receiving emails.

So I SSHed to the server and ran the checkallsslcerts script to learn what was wrong with the services cert renewal process (the script is explained here - The checkallsslcerts Script | cPanel & WHM Documentation).

The first meaningful error was:
"
warn [checkallsslcerts] Failed to fetch CA bundle information from certificate’s “authorityInfoAccess” extension:cpanel::Exception::HTTP::Network/(XID xxxxx) The system failed to send an HTTP (Hypertext Transfer Protocol) “GET” request to http://r3.i.lencr.org/ because of an error: Could not connect to 'r3.i.lencr.org:80': Connection timed out
"

This was a bit strange to me: I do use "Let's Encrypt" as the certificate authority (CA) for the web sites hosted by my cPanel installation, but for cPanel's own services cert the CA had so far been cPanel's own. I am OK with Let's Encrypt, though, so I went along with the change for now, making a note to look into it later, once this issue was solved.

Because my FW also restricts outgoing sessions, I have a specific rule that allows outgoing sessions to port 80 (HTTP) on the Internet. I added r3.i.lencr.org to the group object of destinations for this rule (an "Alias", in pfSense terms). This FQDN is used to fetch the CA cert of Let's Encrypt, the organization that runs the CA that checkallsslcerts now uses.

(pfSense also allows FQDNs, not only numeric IPs, as FW objects. It resolves them at an admin-set interval and saves the resulting numeric IPs in "Table" objects (a kind of cache), which back the alias objects used in the FW rules. So when a new connection reaches the FW, it can do a numeric IP match for FQDN-based objects too.)

This change made the above error message go away, but a new error message appeared:
"
warn [checkallsslcerts] Retrying after network failure: (XID xxxxxx) The system failed to send an HTTP (Hypertext Transfer Protocol) “POST” request to http://r3.o.lencr.org because of an error: Could not connect to 'r3.o.lencr.org:80': Network is unreachable
"

Smarter now, I ran to add this new value, r3.o.lencr.org (the OCSP server address of Let's Encrypt; OCSP is queried over TCP port 80, hence HTTP), to the same FW group of destinations that are allowed on TCP port 80.
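The two warnings above each name one destination that had to be allowed. As a minimal sketch, assuming the script output has been saved to a text file or string, the snippet below pulls every `'host:port'` pair out of "Could not connect to" messages so they can be reviewed together. The function name `blocked_destinations` is hypothetical, and the pattern only covers the message shape quoted in this post; other cPanel error formats may differ.

```python
import re
from typing import List, Tuple

# Matches the "Could not connect to 'host:port'" tail of the warnings quoted above.
_CONNECT_RE = re.compile(r"Could not connect to '([^':]+):(\d+)'")


def blocked_destinations(log_text: str) -> List[Tuple[str, int]]:
    """Return unique (host, port) pairs in order of first appearance."""
    seen: List[Tuple[str, int]] = []
    for host, port in _CONNECT_RE.findall(log_text):
        pair = (host, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen


if __name__ == "__main__":
    sample = (
        "warn [checkallsslcerts] ... Could not connect to "
        "'r3.i.lencr.org:80': Connection timed out\n"
        "warn [checkallsslcerts] ... Could not connect to "
        "'r3.o.lencr.org:80': Network is unreachable\n"
    )
    for host, port in blocked_destinations(sample):
        print(f"{host} port {port}")
```

Note that a single run may stop at the first unreachable host, so the list can still grow over several runs, as it did here.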

This solved it, the new certificate for the cPanel Services SSL certificate was installed successfully, and I was happy!

But something else came now to hit me…

I have a monitoring system that directly probes my backend server over HTTPS, using its bare host name, the one covered by the above services cert, with a simple https://fqdn request (where "fqdn" is of course replaced with the actual server name).

The health check runs every minute with a five-second timeout for the target server to respond, and if two consecutive checks fail (hence no server reply within the five-second timeout on two probes in a row), it sends me an email that the probe has failed and that the system may be down.

I began to steadily get such emails, about one per hour, though not on a regular schedule. So most of the checks passed and all was OK, but from time to time the check failed.
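The alert rule described above (alert once two checks in a row have failed) can be sketched as follows. This is only an illustration of the rule, not the monitoring product's actual code; the function name `alert_indices` and the `threshold` parameter are made up for the example.

```python
from typing import Iterable, List


def alert_indices(results: Iterable[bool], threshold: int = 2) -> List[int]:
    """Return the indices of the checks at which an alert email would be sent.

    Each item in `results` is True if the probe got a reply within the
    timeout and False otherwise. An alert fires at the moment the run of
    consecutive failures first reaches `threshold`.
    """
    alerts: List[int] = []
    streak = 0
    for index, ok in enumerate(results):
        if ok:
            streak = 0  # any success resets the failure run
        else:
            streak += 1
            if streak == threshold:
                alerts.append(index)
    return alerts
```

With this rule a single slow reply never triggers an email, which is why the alerts seen here imply that the server failed to answer for at least two probes in a row.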

I SSHed to the server again.

This time I looked in the web server's error log, at /var/log/apache2/error_log, and found the following suspicious error message:

"
[Sun Mar 31 20:28:07.069878 2024] [ssl:error] [pid 13026] (101)Network is unreachable: [client Numeric-IP-Address:Source-Port-of-the-monitoring-system] AH01974: could not connect to OCSP responder 'r3.o.lencr.org'

[Sun Mar 31 20:28:07.069916 2024] [ssl:error] [pid 13026] AH01941: stapling_renew_response: responder error
"

But, hey, wait a minute, I just allowed it, r3.o.lencr.org, in the FW, so why is it blocked again??

So I went back to the FW and raised the logging level to show me both allowed and blocked requests from the backend server to any IP on the Internet on port 80, then zoomed into the log output to look for events that happened right around the times of the monitoring alerts.

Strangely enough, some requests were allowed, exactly by the FW rule that allows this access, and some were blocked by the rule I made to block outgoing port 80 attempts to Internet targets that I did not approve in the "Allow" rule. Huh??

OK, I got it. There is possibly a mismatch between what the backend server "knows" to be the IP addresses of r3.o.lencr.org and what the FW "knows" they are.

And this is although both systems use the same DNS servers as resolvers.

First, I wanted to see how much diversity of IPs this DNS translation gives around the Internet, so I accessed this nice website that performs a DNS query across many DNS servers around the world.

It turned out that Let's Encrypt indeed puts a lot of effort into distributing this server across many unique IPs around the world.

Next I ran the following commands, in several recurring loops, at the terminal of the backend server, to see how fixed or varied the IPs are that I get in reply to the DNS query for this FQDN:

  1. To run the DNS lookup

nslookup r3.o.lencr.org

  2. To clear the DNS cache

systemd-resolve --flush-caches

As suspected, and as the DNS check website above had shown, the answers turned out to be very diverse.
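The same experiment can be scripted. The sketch below is one hypothetical way to do it in Python: it resolves a name several times and records how many previously unseen IPv4 addresses each round returns. It goes through the system resolver, so a local DNS cache will hide changes unless it is flushed between rounds or `delay` is longer than the record's TTL. The names `resolve_ipv4` and `sample_addresses` are invented for this example.

```python
import socket
import time
from typing import Callable, List, Set, Tuple


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve hostname to its IPv4 addresses via the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def sample_addresses(
    hostname: str,
    rounds: int = 10,
    delay: float = 0.0,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> Tuple[Set[str], List[int]]:
    """Resolve `hostname` `rounds` times.

    Returns the set of all addresses seen and, per round, how many
    addresses were new compared with all earlier rounds.
    """
    seen: Set[str] = set()
    new_per_round: List[int] = []
    for _ in range(rounds):
        answer = resolver(hostname)
        new_per_round.append(len(answer - seen))
        seen |= answer
        if delay:
            time.sleep(delay)
    return seen, new_per_round


if __name__ == "__main__":
    addresses, new_counts = sample_addresses("r3.o.lencr.org", rounds=5, delay=30)
    print(sorted(addresses))
    print(new_counts)
```

A `new_per_round` list that keeps showing non-zero values after the first round is the "moving target" behaviour in numbers.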

Hence, when the cPanel services web server on the backend server wants to reply to a client's HTTPS request for its service host name, it needs to do OCSP stapling (see here how OCSP works to understand the communication flow - OCSP stapling - Wikipedia). For that it queries r3.o.lencr.org over http:// (TCP port 80), but first it has to learn which IPs serve this FQDN, and each time (after the local DNS cache expires) it probably gets a totally new set of IPs in reply. For the FW, which has to match them, it is a moving target!!! The two systems may at any moment hold different IPs for this FQDN, and only when they match is access to the OCSP server allowed.

This is probably the cause of the FW blocks on the requests triggered by the monitoring system: a mismatch between what the backend server knows as the IPs for r3.o.lencr.org and what the FW knows as the IPs for this FQDN. Each system does its DNS querying and caching at its own intervals and possibly gets different results, although both use the same DNS resolvers.

This mismatch causes the FW to block the OCSP stapling request, so the web server doesn't get an OCSP reply and doesn't answer the monitoring system before the monitoring request times out, which fails the monitoring probe.

I need to get both systems, the backend server and the FW, on the same page as much as possible, so that both have the same list of IPs for r3.o.lencr.org.

The first, dumb, ancient security instinct was to use fixed IP objects, which led me to collect all the numeric IPs linked to this FQDN and add them to the relevant object in the allow rule on the FW.

Quite quickly I learned that this is not efficient: there are too many of these IPs, and the approach is not future-proof, as these IPs can, and probably will, be changed, removed, or added in the future.

So I moved on to the FW's DNS lookup interval, the one that governs the recurring DNS queries for FQDN alias objects.

I found that all FQDN-based Alias objects in pfSense are resolved every 300 seconds (5 minutes) by default. The setting is in the admin web GUI under System > Advanced, Firewall & NAT, in the "Aliases Hostnames Resolve Interval" field, which is empty by default (meaning the default of 300 seconds applies).
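To make the suspected mechanism concrete, here is a toy model, under simplifying assumptions that are not taken from pfSense or Apache: time moves in whole seconds, the firewall replaces its table with a fresh DNS answer every `fw_interval` seconds, the server re-resolves every `server_ttl` seconds and connects to one address from its own answer. The function `count_blocked` and the `rotating` DNS model are both invented for illustration.

```python
from typing import Callable, FrozenSet


def count_blocked(
    answer_at: Callable[[int], FrozenSet[str]],
    fw_interval: int,
    server_ttl: int,
    duration: int,
) -> int:
    """Count the seconds in [0, duration) during which the address the
    server would connect to is missing from the firewall's table.

    answer_at(t) is the (hypothetical) set of IPs DNS returns at second t.
    """
    blocked = 0
    for t in range(duration):
        fw_table = answer_at(t - t % fw_interval)    # firewall's last refresh
        server_ips = answer_at(t - t % server_ttl)   # server's last lookup
        target = min(server_ips)  # pick one address deterministically
        if target not in fw_table:
            blocked += 1
    return blocked


def rotating(t: int) -> FrozenSet[str]:
    """Toy DNS: a single address that changes every 60 seconds."""
    return frozenset({f"192.0.2.{(t // 60) % 250 + 1}"})


if __name__ == "__main__":
    print(count_blocked(rotating, fw_interval=300, server_ttl=60, duration=600))
    print(count_blocked(rotating, fw_interval=1, server_ttl=60, duration=600))
```

In this toy setup a 300-second firewall refresh leaves the server's target outside the table for most of each five-minute window, while a 1-second refresh keeps the two in step. Real DNS answers contain several addresses and the two systems' timers are not aligned, so real numbers will differ.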

https://docs.netgate.com/pfsense/en/latest/firewall/aliases.html#using-hostnames-in-aliases

HAAA! Eureka! I shouted… and set it, in my rage, to a value of 1 second, of course, and applied the change.

Yes, that was it. It solved the problem. Not entirely: I still get those monitoring failure emails here and there, but in the big picture it is solved.

Yes, I pay for it with a few more percent of CPU utilization on the FW, and possibly a few more megabytes of memory, but hey, I am now up to date with the Internet DNS accuracy, up to date to the second!!

I guess pfSense stores the results of these frequent DNS lookups in its Alias tables, and that table entries are kept for longer than the interval between lookups. That would build up a larger list of possible IPs for the FQDN, raising the chance that the IP the server picks when accessing r3.o.lencr.org is also in the FW's list for this FQDN, so the traffic is allowed and the monitoring check completes successfully.
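The guess above can be illustrated with a small model. This is not pfSense's documented behaviour, only a sketch of the idea that a table which keeps each address for a while after it was last seen ends up holding the union of several recent DNS answers. The class `AccumulatingTable` and its `hold_seconds` parameter are hypothetical.

```python
from typing import Dict, Iterable, Set


class AccumulatingTable:
    """Keeps every address for `hold_seconds` after it was last returned."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._last_seen: Dict[str, float] = {}

    def update(self, now: float, answer: Iterable[str]) -> None:
        """Record a DNS answer received at time `now` and drop stale entries."""
        for ip in answer:
            self._last_seen[ip] = now
        cutoff = now - self.hold_seconds
        self._last_seen = {
            ip: seen for ip, seen in self._last_seen.items() if seen >= cutoff
        }

    def addresses(self) -> Set[str]:
        """Return the addresses currently held in the table."""
        return set(self._last_seen)
```

The more often `update` is called relative to `hold_seconds`, the more distinct answers overlap in the table, which would explain why a very short resolve interval makes a match with the server's current address more likely.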

Now, to cPanel, the firm…

To verify, I went to https://crt.sh/, a web site which is like a search engine for historical issuance of public certificates, based on "Certificate Transparency (CT)" (Certificate Transparency - Wikipedia).
I searched in it for my cPanel server's FQDN, and indeed, the most recent cert was the first to use Let's Encrypt as CA. Most of the former ones were issued by the CA of cPanel.
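The same lookup can be done programmatically. The sketch below assumes crt.sh's JSON output (`output=json` query parameter) and its `issuer_name` and `not_before` fields as they appeared at the time of writing; the service may change or rate-limit this interface. The helper names `fetch_entries` and `issuer_timeline` are made up for the example.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple


def fetch_entries(hostname: str) -> List[Dict]:
    """Download the certificate entries crt.sh lists for `hostname`."""
    query = urllib.parse.urlencode({"q": hostname, "output": "json"})
    with urllib.request.urlopen("https://crt.sh/?" + query, timeout=30) as resp:
        return json.load(resp)


def issuer_timeline(entries: List[Dict]) -> List[Tuple[str, str]]:
    """Return (not_before, issuer_name) for each point where the issuer
    changes, oldest first. Consecutive certs from the same issuer collapse
    into the first one.
    """
    timeline: List[Tuple[str, str]] = []
    # not_before is an ISO-style timestamp string, so sorting the strings
    # sorts the entries chronologically.
    for entry in sorted(entries, key=lambda e: e["not_before"]):
        issuer = entry["issuer_name"]
        if not timeline or timeline[-1][1] != issuer:
            timeline.append((entry["not_before"], issuer))
    return timeline


if __name__ == "__main__":
    # Replace with the real server FQDN.
    for started, issuer in issuer_timeline(fetch_entries("backend-server-fqdn")):
        print(started, issuer)
```

A change of CA shows up as a new line in the printed timeline, with the date of the first certificate from the new issuer.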

So, a change was made here.

I believe I am quite on top of meaningful changes at cPanel; I do get their emails about prominent changes.

Assuming I had missed this one, I went on to search for any announcement by cPanel about it.

I tried the following relevant cPanel documentation articles.

The very relevant following article does not even mention "Let's Encrypt"

The checkallsslcerts Script - The checkallsslcerts Script | cPanel & WHM Documentation

The following cPanel support article pointed me in the general direction of a solution, but it does not mention "Let's Encrypt" at all.

OCSP responder errors - https://support.cpanel.net/hc/en-us/articles/360036533894-OCSP-responder-errors

The following article does mention "Let's Encrypt" as the source of the services cert but does not mention since when and for which cPanel version(s).

Manage Service SSL Certificates - Manage Service SSL Certificates | cPanel & WHM Documentation

The most explicit mentions of this change were found in the "Change Log" page for my cPanel version branch, 118.

The changes are listed under version 117.9999.78, which I guess is a pre-118 version, dated 2024-01-18:

"
Fixed case EK-24: Convert checkallsslcerts to use Let's Encrypt for hostname certificates.

Fixed case EK-45: Set the AutoSSL provider to Let's Encrypt on updates to 118.

Fixed case EK-46: Add a deprecation warning to the AutoSSL UI for the Sectigo provider.

Fixed case EK-47: Add a feature showcase for the Let's Encrypt changes.

Fixed case EK-58: Update the current provider headings on the AutoSSL UI.

Fixed case EK-70: Install the Let's Encrypt plugin before running checkallsslcerts during initial setup.
"

But I don't follow change logs closely, as they mostly contain fixes, not new features, or changes in behavior.

In the main, friendlier "Release Notes" page for the 118 version branch, "Let's Encrypt" is mentioned, but not in reference to the CA provider for cPanel's own application services.

All in all, from where I stand, cPanel failed here, and failed me as a customer.

You do not do such a prominent system change (even if it is at the lower levels of the system, without a GUI noticeable change) before alerting your customers first, to let them know about the change in advance, so they can prepare their environment for that change, and hopefully avoid issues.

Thank you.

The OCSP and AIA URLs are hosted by the Akamai CDN, which runs many servers to distribute the (very high) amount of read traffic.

We don't have much control over what IPs Akamai returns, but as you've discovered, they do have a lot of them and they can change (eg, if your traffic is redirected to another location due to Akamai maintenance, or network routing changes)


@eitanc this seems more like an issue appropriate for the cPanel forum; as they are higher up in the stack and their choices in managing usage of Let's Encrypt. I don't believe that it is Let's Encrypt's responsibility to cater to cPanel's choices.


Thank you @mcpherrinm.

Thank you for the info.

These too-frequently changing DNS replies cause an issue for me as the end user, and if Let's Encrypt itself has no need for these constant and frequent IP changes, please ask Akamai, as their customer, to slow down the rate of IP changes, so everyone will be happy.
Thank you.


This seems like the real issue. For the domain name lencr.org and its subdomains, turn off caching.
