Timeout when submitting CSR although crt.sh says a cert is generated

My domain is: www.

We are using Acme4j.
We got a "Service busy" error on August 11th at 00:07 for the domain, and since then we get a timeout error when Acme4j tries to send the CSR:

org.shredzone.acme4j.exception.AcmeNetworkException: Network error
	at org.shredzone.acme4j.connector.DefaultConnection.performRequest(DefaultConnection.java:450)
	at org.shredzone.acme4j.connector.DefaultConnection.sendSignedRequest(DefaultConnection.java:383)
	at org.shredzone.acme4j.connector.DefaultConnection.sendSignedRequest(DefaultConnection.java:200)
	at org.shredzone.acme4j.Order.execute(Order.java:276)
Caused by: java.net.http.HttpTimeoutException: request timed out
	at java.net.http/jdk.internal.net.http.HttpClientImpl.send(Unknown Source)
	at java.net.http/jdk.internal.net.http.HttpClientFacade.send(Unknown Source)
	at org.shredzone.acme4j.connector.DefaultConnection.sendRequest(DefaultConnection.java:347)
	at org.shredzone.acme4j.connector.DefaultConnection.performRequest(DefaultConnection.java:434)

Other domains have no problem.
And the strange thing is that crt.sh shows a certificate being issued each time, even though we get a timeout error.

Does anybody have any idea what's going on ?

Forgot to paste the domain : www.nouvelle-aquitaine.fr

We've seen this sort of thing a small handful of times around here: the "finalize" call doesn't get a response even though the other ACME API calls work, and in some of those cases Let's Encrypt seems to actually be issuing the certificate fine.

There hasn't been a single easy-to-find explanation across all of them. Generally, because the finalize call includes the CSR, it has larger packets than the other requests, which can help explain why a network misconfiguration might affect that request but not others.

Some things to try include

  1. Double-checking MTU/MSS settings, and that ICMP messages for path MTU discovery aren't being dropped. A first step might be trying Cloudflare's ICMP IPv4 Blackhole Check and the corresponding IPv6 check from your server, but this gets esoteric quickly and is beyond my expertise, and it depends a lot on how your network is set up.
  2. Getting an actual packet dump (from Wireshark and/or similar tools) to see if that sheds any light.
  3. Using an ECDSA key for the certificate CSR instead of an RSA one, just because it might be small enough to work even though there's something affecting larger requests. (That's, of course, assuming that your server and expected clients support ECDSA, which is usually the case but not always.)
  4. Trying a different CA, at least to see if you get different behavior or more helpful packet captures. (And depending on your requirements, it might be a good enough workaround if another CA happens to work.)
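
To give a rough idea of the size difference behind suggestion 3, here is a minimal sketch using only the JDK. It compares the DER-encoded public keys of an RSA and an ECDSA key pair. The class and method names are made up for illustration, and RSA 2048 and P-256 are assumed as the key types since the thread doesn't say which ones are in use. The CSR also carries a signature, which is likewise smaller with ECDSA, so the gap in the actual finalize request would be somewhat larger than what this shows.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.spec.ECGenParameterSpec;

// Hypothetical helper, not part of Acme4j: compares encoded public key sizes.
public class KeySizeComparison {

    /** Length in bytes of the DER-encoded public key (X.509 SubjectPublicKeyInfo). */
    public static int encodedPublicKeyLength(KeyPair pair) {
        return pair.getPublic().getEncoded().length;
    }

    /** Generates an RSA key pair of the given modulus size (2048 is an assumption). */
    public static KeyPair generateRsa(int bits) {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(bits);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("RSA key generation failed", e);
        }
    }

    /** Generates an ECDSA key pair on the P-256 curve (secp256r1). */
    public static KeyPair generateEcP256() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("EC");
            gen.initialize(new ECGenParameterSpec("secp256r1"));
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("EC key generation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("RSA 2048 public key bytes: "
                + encodedPublicKeyLength(generateRsa(2048)));
        System.out.println("EC P-256 public key bytes: "
                + encodedPublicKeyLength(generateEcP256()));
    }
}
```

If the finalize request only starts working once the key is switched, that would point toward a size-related network problem rather than something on the CA side.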

Some past threads with possibly-similar issues, just to reference:

And, are you sure it is the /finalize URL that fails?

Because right after that you should poll the order until status is valid. These are fairly small request and response packets similar to earlier API requests.

But then you retrieve the certificate which will be a large response compared to other API calls.

I agree with Peter: we have seen network errors with /finalize before, and he offers good debugging suggestions. These can be puzzling because the /finalize must succeed before the cert is issued. So, the request must reach LE in good order for it to issue the cert. Although perhaps LE gets "stuck" formulating the response, or something else happens in the response (or with the API requests right after).

Adding to the comments above:

1- As a potential hotfix: if you can extract the private key from the Acme4j client logs or datastore, you can download the certificate from crt.sh to use with that private key.

2- In an attempt to clarify what @MikeMcQ is stating above, in case you are not familiar with the ACME spec in detail (using some pseudocode here):

  • When a client POSTs to Directory.newOrder: a new ACME Order is created, and is assigned a unique acme_order__url that is identified in the response's Location header.
  • Retrieving the acme_order__url returns an AcmeOrderObject, which contains the Authorization urls, the finalize url, the certificate url (when ready), and status.
  • A Client POSTs the CSR to AcmeOrderObject.finalize_url, then
  • The Client polls the acme_order__url to detect a status change
  • When the status becomes valid, the Client retrieves the Certificate from the certificate url.

Based on what you shared (logs and behavior), your issue could have happened on any of the last 3 steps: finalize, poll, download.

3- Absent the ability to determine exactly where the failure occurred in your logs, I would try running an identical request with Certbot - same account key, same domain(s), same profile, same everything. That should essentially recreate the same data flow between your machine and Let's Encrypt – and also leverage the validated authorizations, so this should immediately go to finalize.

Why?

  • If this passes, it would suggest the issue may be in the underlying Java networking library (the errors do not suggest Acme4j is the issue).
  • If this fails, Certbot has great logging. You'll be able to see exactly where this failed. That can help us pinpoint the issue better.

I am inclined to agree the issue is probably with posting the CSR to /finalize, but the logs you shared are just for the generic POST code in your client (see acme4j-client/src/main/java/org/shredzone/acme4j/connector/DefaultConnection.java in the shred/acme4j repository on GitHub). This forum also has a history of people strongly believing an issue is on one endpoint, only for a deeper dive into the logs to show it was on another.
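
To make the order flow from point 2 a bit more concrete, here is a small sketch of the order statuses as I understand them from RFC 8555 (pending, ready, processing, valid, invalid) and the step a client would take for each. The class, enum, and method names are invented for illustration and are not Acme4j's API. Per the RFC, an order is "ready" before finalize and becomes "valid" once the certificate has been issued.

```java
// Hypothetical model of the ACME order lifecycle; names are illustrative only.
public class OrderFlowSketch {

    /** Order statuses as defined by RFC 8555. */
    public enum OrderStatus { PENDING, READY, PROCESSING, VALID, INVALID }

    /** What a client would typically do next for a given order status. */
    public enum ClientStep {
        COMPLETE_AUTHORIZATIONS,
        POST_CSR_TO_FINALIZE,
        POLL_ORDER,
        DOWNLOAD_CERTIFICATE,
        GIVE_UP
    }

    /** Maps the status returned when fetching the order URL to the next client step. */
    public static ClientStep nextStep(OrderStatus status) {
        switch (status) {
            case PENDING:
                return ClientStep.COMPLETE_AUTHORIZATIONS; // challenges not yet validated
            case READY:
                return ClientStep.POST_CSR_TO_FINALIZE;    // authorizations done, CSR not sent
            case PROCESSING:
                return ClientStep.POLL_ORDER;              // CA is issuing, keep polling
            case VALID:
                return ClientStep.DOWNLOAD_CERTIFICATE;    // certificate URL is available
            default:
                return ClientStep.GIVE_UP;                 // INVALID: start a new order
        }
    }

    public static void main(String[] args) {
        for (OrderStatus status : OrderStatus.values()) {
            System.out.println(status + " -> " + nextStep(status));
        }
    }
}
```

One consequence worth testing: if the finalize POST times out on the client side but the certificate was in fact issued, fetching the order URL again should report "valid", and the certificate could be downloaded without sending the CSR a second time.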

I've enabled debug logs in Acme4j and can confirm it is indeed /finalize, for instance: POST https://acme-v02.api.letsencrypt.org/acme/finalize/378217970/417742745706.
The timeout occurs 10s after that, which is Acme4j's default timeout value.

My guess would be a delay performing some sort of pre-issuance check like CAA. What happens when you increase the acme4j timeout?
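
For reference, the HttpTimeoutException in the stack trace comes from the JDK's java.net.http client, where the timeout is set per request. The sketch below shows that mechanism with the plain JDK API only. It is not Acme4j's own configuration (Acme4j has its own setting for this, so check its documentation for the exact call in your version), and the class name, URL, and 30-second value are placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Illustrative only: how a per-request timeout is set on a JDK HttpRequest.
public class TimeoutSketch {

    /** Builds a POST request that fails with HttpTimeoutException if no response arrives in time. */
    public static HttpRequest buildRequest(URI uri, Duration timeout) {
        return HttpRequest.newBuilder(uri)
                .timeout(timeout) // applies to this request only
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL; nothing is sent, the request is only constructed.
        HttpRequest request = buildRequest(
                URI.create("https://example.invalid/finalize"),
                Duration.ofSeconds(30));
        System.out.println("Configured timeout: " + request.timeout().orElseThrow());
    }
}
```

If a longer timeout lets the finalize call return, the time it actually takes would be a useful data point to report back.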

I've just tried that, but we now unfortunately hit a rate limit. Will have to wait until tomorrow 17:11 UTC to try again.

I agree the finalize step should probably complete sooner than it does for you, but as issuance is happening, it seems the client just has to wait a bit longer. I assume there is no security product in the way (on your machine/network) that could be inspecting or proxying traffic and slowing the request down.