We have been hitting this issue frequently for almost two weeks (or more); it happens on roughly 1 in 5 requests.
Has something changed?
{
"type":"urn:ietf:params:acme:error:serverInternal",
"detail":"Error creating new order",
"status":500
}
Client: acme4j
Sample domain for which the request failed: 98e4b25b2f3ba887.dim-s9m3.svbr-nqvp.int.cldr.work
Any suggestions?
Essentially, the POST request to create the order is failing with a 500 response. Below is the trace from acme4j:
Exception from the ACME server while executing the order. Problem : Error creating new order Exception: {} org.shredzone.acme4j.exception.AcmeServerException: Error creating new order
at org.shredzone.acme4j.connector.DefaultConnection.throwAcmeException(DefaultConnection.java:548)
at org.shredzone.acme4j.connector.DefaultConnection.performRequest(DefaultConnection.java:479)
at org.shredzone.acme4j.connector.DefaultConnection.sendSignedRequest(DefaultConnection.java:407)
at org.shredzone.acme4j.connector.DefaultConnection.sendSignedRequest(DefaultConnection.java:161)
at org.shredzone.acme4j.OrderBuilder.create(OrderBuilder.java:314)
Since no one else has posted...
Let's try solving this generically.
Presuming the problem started recently and you haven't made any change to warrant this error...
As far as I know, internal server errors are not something the user can cause or fix. Maybe there's something going on with the servers, although I don't currently see an active incident.
The only thing the spec says for "serverInternal" is that it means "The server experienced an internal error". Generally retrying should work. Are these "complicated" certificates in any way, like having lots of domain names on them that would need validation? When you say it fails roughly 1/5 times, is that with the same certificate or domain list? How big of a sample size of failures are we talking about? Does retrying the same order usually work?
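If you want retries to happen automatically, here is a minimal sketch of retrying with exponential backoff. It uses only the Java standard library; the class name `RetryWithBackoff`, the method `retry`, and all of its parameters are hypothetical names I made up for illustration, not part of acme4j. Which exceptions count as retryable is left to the caller through the `isRetryable` predicate.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of acme4j: retries an operation with
// exponential backoff when the failure is judged to be transient.
final class RetryWithBackoff {
    private RetryWithBackoff() {}

    static <T> T retry(Callable<T> operation,
                       int maxAttempts,
                       long initialDelayMillis,
                       Predicate<Exception> isRetryable) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on the last attempt, or if the caller says the
                // error is not worth retrying (e.g. a malformed request).
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2; // double the wait before the next attempt
            }
        }
    }
}
```

In your case the operation would be the call that shows up in your trace, `OrderBuilder.create()`, and the predicate would accept only the `AcmeServerException` you are seeing for `serverInternal`. Keep the attempt count small so a real outage doesn't turn into a flood of requests.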
There isn't anything special about the certificates or domains. I say this because some of them have succeeded on retry. There are at most 2 domains in each request.
It fails for different certificates and domain lists, so I don't think this is specific to particular domain names.
There were around 30 such failures yesterday.
I mentioned one sample domain in the description for which the issue happened. I can add more if that helps.
So, you had roughly 30 failures and (extrapolating from you saying 1/5 of your requests fail) roughly 120 successful requests yesterday, all to the staging environment, all for certificates with just 1 or 2 domains? That does sound like something odd is going on. While I hate to suggest any testing in production, do you make a similar number of requests to the production environment? If so, what portion of requests to production work? And you've been making roughly this number of requests per day for weeks, and noticed something change a couple of weeks ago? Can you narrow down more specifically when it started?
Yes, we are making changes to our staging environment that we hope will bring better quality of service and stability. However, the current change needs some fine tuning and is causing a little more impact on the new-order endpoint for some use cases. In general, we've noticed the endpoint has a better success rate but it's still not where we want it to be.
In production we make significantly fewer requests, and thankfully we have not noticed this issue there. Unfortunately, I don't have older logs to pin down exactly when we started seeing this.