Hi all! My company has a platform where customers can purchase custom domains for their shop. We utilize a "Cert Service" with the AcmePHP client to request certificates. This process is triggered at the time of domain purchase or from a cronjob to update expiring certs.
About a week and a half ago, we noticed an uptick in failed authorizations (see example output below). Our service is not in active development and has been working properly since the ACME v1 deprecation. I'm curious whether anyone else has experienced this type of issue before.
My domain is: woollastudio.com but we have many others that are also getting an invalid authorization.
I ran this command: An automated task that uses AcmePHP to order, authorize, and issue certs.
It produced this output: Invalid response from https://woollastudio.com:443/.well-known/acme-challenge/dgSHTK5GET4_5S3VY9k_bBbPGs3mBbh4HNycLH_dxTs: 400
Note: our cronjob is still active and may result in this link becoming invalid. I can always provide an updated acme-challenge URL as needed.
My web server is (include version): Apache 2.4.6
The operating system my web server runs on is (include version): CentOS 7
I can login to a root shell on my machine (yes or no, or I don't know): yes
I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no
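For anyone who wants to reproduce this outside the ACME client, the validation request can be approximated with curl, starting on port 80 the way Let's Encrypt does. This is just a sketch; the token below is the one from the error output above, and real tokens are single-use, so it will have expired by the time you try it:

```shell
# Build the challenge URL the validation server would fetch (token taken
# from the error output above; it will no longer be live).
TOKEN="dgSHTK5GET4_5S3VY9k_bBbPGs3mBbh4HNycLH_dxTs"
URL="http://woollastudio.com/.well-known/acme-challenge/${TOKEN}"
echo "$URL"

# Uncomment to actually fetch it, following the HTTP->HTTPS redirect the
# way the validation server does:
# curl -sSIL "$URL"
```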
Yes, but they do acquire a Let's Encrypt cert under that domain name, and one is even valid for another 10 days. It is odd they don't send it out. I should have explained my last sentence more fully, so I am glad you brought this up.
You are probably right that a firewall problem is more likely. But the collection of odd things I saw pointed at their "custom domain platform" as a possible culprit.
Q1. Has the redirection changed recently?
Q2. Do you really need to redirect to HTTPS (and add :443 to the URL)?
Q3. Since it is Apache, have you verified there is no name:port overlap?
```
apachectl -t -D DUMP_VHOSTS
```
We've been digging into the security rules and haven't noticed anything that correlates with the number of errors we've been encountering, but we are still digging deeper.
Here is an example of one of the logs we're seeing:
If the cert issuance fails, the default wildcard cert is apparently used. I can only assume the original devs thought the default cert was better than no cert at all.
Thanks for the logs! These look like ACME client logs. I'm also particularly interested in logs (both error and access) from your Apache instance.
Being slightly pedantic: If it were a traditional firewall, we'd probably see a "Timeout during connect" problem. I suspect this is at the web server level (Apache); though products that operate on the web server level are often called "Web Application Firewalls (WAF)".
A1. Redirection has not changed recently.
A2. It is our security policy to always redirect to HTTPS.
A3. There doesn't appear to be any overlap. This particular server is only used for this Cert Service to make the calls to Let's Encrypt.
Have you been able to inspect your Apache access and error logs? It is an important clue whether the acme-challenge requests appear in them. For example, if they don't appear there, then something in front of that server is blocking the requests.
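A quick way to check is to grep the access log for the challenge path. Here's a minimal sketch with made-up log lines in Apache's combined format (the real validator's source IPs and user agent will differ):

```shell
# Two hypothetical access-log lines; the second is roughly what an HTTP-01
# validation request looks like when it reaches Apache.
cat > /tmp/sample_access.log <<'EOF'
203.0.113.5 - - [10/May/2020:12:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
198.51.100.7 - - [10/May/2020:12:00:02 +0000] "GET /.well-known/acme-challenge/dgSHTK5GET4_5S3VY9k_bBbPGs3mBbh4HNycLH_dxTs HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server)"
EOF

# Count validation attempts that reached Apache; zero here (against your
# real log) would mean something in front of the server is swallowing them.
grep -c "acme-challenge" /tmp/sample_access.log   # → 1
```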
Also, a Let's Debug test just now revealed a wrongly formatted CAA record. Is that new? The wrong format will prevent LE from issuing a cert for that domain (I think so, anyway). Do you have this CAA record on the other domains?
Then there really seems like even less reason to redirect the HTTP challenge requests to HTTPS.
It only delays what could have been dealt with then.
Further, you overlooked the "and add :443 to the URL" part: https://any.site/ is technically equal to https://any.site:443/, but they may not be handled exactly equally (especially by systems outside your control).
I checked with curl -v, and the client's headers don't mention the port number on which the connection took place (this information also isn't expressed inside the TLS protocol). I think it would be challenging to construct a situation in which https://example.com/ and https://example.com:443/ produce detectably different behavior on the wire. (Maybe if a client is using an application-layer proxy and the user agent passes the complete URL to that proxy. A reverse proxy or WAF on the server side will apparently not be able to detect this distinction, though.)
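To illustrate the default-port point: RFC 3986 makes :443 the default port for the https scheme, so a normalizing client may strip it before the URL ever reaches the wire. A minimal sketch of such a normalizer (this sed-based function is my own illustration, not anything an ACME client actually does):

```shell
# Strip an explicit default port from an https URL. A client that
# normalizes this way makes the two forms indistinguishable on the wire.
normalize() { printf '%s\n' "$1" | sed 's#^\(https://[^/:]*\):443/#\1/#'; }

normalize "https://any.site/"       # → https://any.site/
normalize "https://any.site:443/"   # → https://any.site/
```

Of course, nothing guarantees every proxy or WAF in the path normalizes this way, which is exactly why leaving the :443 off in the first place is safer.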
If I were to make a program (like a proxy or browser) where HTTPS defaults to use some other port...
Then any use of HTTPS forced to port 443 would fail.
Unless :443 was explicitly stripped off or ignored.
Which raises the question: why have it there at all?
Thank you all for the messages so far. We are working to add logging in the critical path to see what else we can glean from the failing process.
Don't forget your faulty CAA record. Let's Debug reports it as invalid which would block cert issuance. You have extra \" characters surrounding the value.
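For reference, the difference is roughly this, in zone-file syntax (the value is illustrative, since I can only see the rendered Let's Debug output):

```
; Broken: the escaped quotes become literal " characters inside the value
woollastudio.com.  300  IN  CAA  0 issue "\"letsencrypt.org\""

; Valid
woollastudio.com.  300  IN  CAA  0 issue "letsencrypt.org"
```

You can confirm what resolvers actually see with `dig CAA woollastudio.com +short`.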
Does anyone know for sure that those extra characters would block issuance? I am not confident enough in my reading of the Boulder code to say for sure. Could it just be a bug in Let's Debug? @Osiris
You're correct. The :443 was causing an issue on our end. I'm curious where the port is being added in; our codebase doesn't appear to add it anywhere. Could something have changed in the requests Let's Encrypt makes for the acme-challenges?
Ahh yes. That was a mistake on my end as I was attempting different solutions in my initial investigation. I fixed that in the DNS record and the cert was issued properly.