Acme: error 500 serverInternal :: Error creating new order

fstelzer · November 26, 2019, 9:25am

My domain is: SAN Batch: aische-pervers.tv, api.vxpages.com, blondy93.com, candyxs.com, chantalsweet.com, fitness-maus.net, herrin-jessy.com, inkedvanessa.net, julia-jones.com, julia-jones.net, kathirocks.com, lillygirly.com, marywet.com, mimisweet.net, sexy-auswanderer.com, sexy-auswanderer.de, sexy-auswanderer.net, staging.vxinsta.net, and more (up to 100, but I’m only allowed to put 20 into a post)

I ran this command: We use the golang lego API to issue certs. Usually without any problems.

It produced this output: error=“acme: error: 500 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:serverInternal :: Error creating new order”

My web server is (include version): NA

The operating system my web server runs on is (include version): CentOS 7

My hosting provider, if applicable, is: NA

I can login to a root shell on my machine (yes or no, or I don’t know): yes

I’m using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you’re using Certbot): latest lego lib

We issue certificates for a large number of domains. Many of them are batched into single certs with up to 100 SANs. The whole process is fully automated and used to run hourly (issuing only new certs or certs with changed names). Since migrating to the ACME v2 API recently, we have been seeing 500 errors from the https://acme-v02.api.letsencrypt.org/acme/new-order endpoint. Because we retry hourly, this would not be a big issue by itself. However, the order seems to have been partially created, because we run into the “too many pending authorizations” rate limit after a few 500 errors. As we only get the 500, and not the authorization URLs or any other response, we have no way of clearing these authorizations ourselves.
For now I’ve switched to issuing certs only twice a day so we can react and check manually when a 500 appears. But AFAIK the only way to get rid of these authorizations now is to wait a week or rotate the account key.

mnordhoff · November 26, 2019, 3:35pm

Are you sure that the request contains no more than 100 SANs? Never 101 or a thousand or something?

Due to a bug that should be fixed ~~this~~ next week, Boulder currently swallows the ‘too many SANs’ error message and just returns the generic error you reported.

But that’s only a guess. It is a generic error that can have other causes.

Do you know how you get into a situation with too many pending authorizations? If you just retry the same order, that shouldn’t happen: Even if it goes wrong, if you repeatedly (try to) create orders for the same <= 100 names, you should never wind up with more than <= 100 pending authorizations.

If a new order fails, does the client give up and move on and create orders for different names? That may have some advantages, but can result in too many pending authzs. If so, can you (temporarily) change it to stubbornly retry the same names, or at least keep fewer than 300 names up in the air at once?
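The "keep fewer than 300 names up in the air" idea could be sketched on the client side roughly as below. This is only an illustration: `pendingBudget`, `tryStart` and `finish` are hypothetical names, not part of lego, and the 300 figure is hard-coded from the number mentioned above rather than read from the server.

```go
package main

import "fmt"

// pendingBudget tracks how many names the client currently has in flight,
// i.e. submitted in orders that have not yet been finished or cleaned up.
// All names here are hypothetical; nothing below talks to an ACME server.
type pendingBudget struct {
	limit    int
	inFlight map[string]int // batch ID -> number of names in that batch
}

func newPendingBudget(limit int) *pendingBudget {
	return &pendingBudget{limit: limit, inFlight: make(map[string]int)}
}

// total returns the number of names across all in-flight batches.
func (b *pendingBudget) total() int {
	n := 0
	for _, c := range b.inFlight {
		n += c
	}
	return n
}

// tryStart reports whether a batch may be submitted now. A batch that is
// already in flight may always be retried, because retrying the same names
// is not expected to create additional pending authorizations. A new batch
// is refused if it would bring the total to the limit or beyond.
func (b *pendingBudget) tryStart(batchID string, names int) bool {
	if _, ok := b.inFlight[batchID]; ok {
		return true
	}
	if b.total()+names >= b.limit {
		return false
	}
	b.inFlight[batchID] = names
	return true
}

// finish releases a batch, e.g. after the certificate was issued.
func (b *pendingBudget) finish(batchID string) {
	delete(b.inFlight, batchID)
}

func main() {
	b := newPendingBudget(300)
	fmt.Println(b.tryStart("a", 100)) // true
	fmt.Println(b.tryStart("b", 100)) // true
	fmt.Println(b.tryStart("c", 100)) // false: would reach 300 names
	fmt.Println(b.tryStart("a", 100)) // true: retry of an in-flight batch
	b.finish("a")
	fmt.Println(b.tryStart("c", 100)) // true: budget was freed
}
```

With a gate like this, a failed batch stays "in flight" and gets retried instead of being silently replaced by a different batch.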

cpu · November 26, 2019, 4:00pm

Hi @fstelzer,

I took a quick look at the logs and only saw 2 newOrder 500s in the past 7d for what I think is your ACME client's egress IP. They were both caused by RPC timeouts on our side during periods of higher than average load. Since your requests are bundling 100 names in one order the processing takes longer and so these timeouts will happen to you more frequently than for other users.

Can you share timestamped request logs from your client? Knowing your ACME account ID would also be helpful.

Hmm. I would suspect you have another bug that is leaking pending authorizations somehow. We reuse pending orders and pending authorizations. Even if you made the same newOrder request multiple times what I would expect would happen is:

newOrder for names X, Y, Z received. We'd create pending authzs for X,Y,Z and oops, a 500 occurs creating the order. We return the 500.
newOrder for names X, Y, Z received as a retry. We'd find the existing pending authzs for X, Y, Z for your account and use them with a new order that is returned.
newOrder for names X, Y, Z received as a retry. We'd find the existing pending order and its existing pending authorizations and return it.

You wouldn't see 3x pending authorization quota consumption even if you weren't able to get the first order.

Since you're using Lego as a library can you try adding explicit logging of every received authorization ID and a log of every authorization challenge that is POSTed and the response code? If you're leaking pending authz's I'd expect to see IDs that aren't ever POSTed to initiate a challenge, or challenge IDs that are POSTed but receive a failure response code.

I would also recommend you break your domain names into smaller certificates. Using 100 names in one certificate will exacerbate timeout issues (though we do our best to optimize our service when corner cases arise). Managing 100 domains per cert also decreases your agility w.r.t revocation as well as customer off-boarding.

cpu · November 26, 2019, 4:03pm

That's a good point. This would cause 429's from pending authorizations over time.

fstelzer · November 26, 2019, 4:12pm

We actually already log all the received authorizations and used a script to check them all (last 10d) against the API to see if any of them were pending, and none of them were. So we never received the pending ones. That’s why it’s probably connected to the 500 errors.

mnordhoff wrote: “If a new order fails, does the client give up and move on and create orders for different names? That may have some advantages, but can result in too many pending authzs.”

cpu wrote: “That’s a good point. This would cause 429’s from pending authorizations over time.”

I think this is what might happen. Since we manage ~7500 domains in ~900 batches (we only batch domains per client, as per Let’s Encrypt policy), we move on to other batches if one fails (which happens from time to time when customers change DNS after requesting a cert).
Decreasing the batch size is possible, but more certs result in higher load and longer reload times for our proxies, where all of them are installed. How many names per cert would you suggest?

We used to run our client once every hour. It would always get the TODO batches and try to issue them one by one, moving on if one failed. We have now switched it to twice per day to catch errors before running into trouble, but we hit “too many pending auths” again this morning (and customers have to wait longer for their certs).

With the v1 API we used pre-authorization before actually issuing the certs (and only put domains into batches once they had authorized successfully). But I think this is no longer possible with the v2 API.

cpu · November 26, 2019, 4:23pm

Hey again @fstelzer, thanks for the added information.

When you're working with this volume of domain names I definitely recommend that you check whether domain names are still associated with your service in DNS or otherwise before you make an attempt to validate the name with Let's Encrypt.
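A pre-flight check along these lines might look like the sketch below. It assumes that "associated with your service" means the name resolves to at least one of your own proxy addresses; that is only one possible definition, and a CNAME or TXT based check may fit other setups better. `pointsAtUs`, `stillOurs` and the `192.0.2.10` address are placeholders for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// pointsAtUs reports whether at least one resolved address belongs to the
// set of addresses we serve from.
func pointsAtUs(resolved []string, ours map[string]bool) bool {
	for _, addr := range resolved {
		if ours[addr] {
			return true
		}
	}
	return false
}

// stillOurs resolves name and checks the result against our own addresses.
// A lookup error is returned to the caller, who can decide whether to skip
// the name or retry later.
func stillOurs(ctx context.Context, r *net.Resolver, name string, ours map[string]bool) (bool, error) {
	addrs, err := r.LookupHost(ctx, name)
	if err != nil {
		return false, err
	}
	return pointsAtUs(addrs, ours), nil
}

func main() {
	// Placeholder address from the documentation range; replace with the
	// real proxy addresses.
	ours := map[string]bool{"192.0.2.10": true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	ok, err := stillOurs(ctx, net.DefaultResolver, "example.com", ours)
	fmt.Println(ok, err)
}
```

Running this for every name right before creating the order lets you drop names that have moved away, instead of leaving a pending authorization behind for them.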

I think taking @mnordhoff's advice and being more conservative about when you give up on an order for a set of names would help your situation.

Since you're a larger integrator we would also be happy to process a request via our rate limiting adjustment request form to give you an adjusted pending authorizations rate limit. Please try to answer the questions with as much information as you can; we can't always find the time to follow up on incomplete requests.

Can you share more information about the proxy software you use? Maybe someone will have advice that could help with this side effect. I can understand why you'd want to avoid smaller certs if the software involved isn't able to handle a lot of certificates with acceptable performance.

Cutting the number of names in half seems like a good place to start.
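Splitting one customer's name list into smaller certificates is a plain chunking step. A minimal sketch, assuming the names for one customer are already collected in a slice (`splitBatches` is a made-up helper, and 50 is simply half of the current 100):

```go
package main

import "fmt"

// splitBatches cuts names into consecutive batches of at most size entries.
// The returned batches share the backing array of names, so the input should
// not be modified while the batches are in use.
func splitBatches(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		batches = append(batches, names[start:end])
	}
	return batches
}

func main() {
	// 120 generated placeholder names for one customer.
	names := make([]string, 0, 120)
	for i := 0; i < 120; i++ {
		names = append(names, fmt.Sprintf("site%d.example.com", i))
	}

	for i, batch := range splitBatches(names, 50) {
		fmt.Printf("cert %d: %d names\n", i+1, len(batch))
	}
	// Prints:
	// cert 1: 50 names
	// cert 2: 50 names
	// cert 3: 20 names
}
```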

That's correct. RFC 8555 / ACME maintained the notion of pre-authorization but we have not implemented it and don't plan to at this time.

Hope that information helps!

fstelzer · November 26, 2019, 4:32pm

Thanks for your help.
I’ll decrease the batch size to 50. Not that many customers actually have that many certs, so the overall number shouldn’t grow too much.
We use https://hitch-tls.org to terminate TLS. They actually claim it performs with 500,000 certs, but we could never verify this number in our tests. It still performed better than any other software. The reloads with new certs are relatively graceful but still take a while, and during them we always see small hiccups with client connections and quite a bit of memory usage.

We validate all domains hourly (until successfully validated, then daily) and batch them only on success. Maybe I can add a forced extra check just before cert issuance.

If the smaller batch size doesn’t help, I will try the rate limit adjustment request.

system · December 26, 2019, 4:32pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
