500s from staging server, now acme:error:rateLimited

jillvogel · February 23, 2017, 2:46am

Hello!

We use the letsencrypt staging server for our server deployment integration testing, which can result in 50+ new certificates requested per day.

A couple of days ago, we got a number of 500 errors for what looked like valid requests:

Feb 21 18:55:42 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    a1102cd9.integration.plebia.net#012    preview-a1102cd9.integration.plebia.net#012    studio-a1102cd9.integration.plebia.net#012b'An unexpected error occurred:\nThe server experienced an internal error :: Error creating new authz\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
Feb 21 18:55:49 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThe server experienced an internal error :: Error creating new authz\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
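
A note for anyone reading log lines like these: the `#012` sequences are how rsyslog escapes an embedded newline (octal 012) by default. Below is a minimal sketch for turning them back into readable multi-line text. The helper name `decode_syslog_escapes` is made up for illustration and is not part of manage_certs.py.

```python
import re

def decode_syslog_escapes(line):
    """Replace rsyslog-style #ooo octal escapes (e.g. #012) with the real character."""
    # Each escape is '#' followed by exactly three octal digits.
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)

sample = "Failed to obtain a new certificate for these domains:#012    a1102cd9.integration.plebia.net"
print(decode_syslog_escapes(sample))
```

This assumes the log text itself never contains a literal `#` followed by three octal digits.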

The full server response didn’t provide much more information:

2017-02-21 18:56:09,616:DEBUG:acme.client:Received response <Response [500]> (headers: {'Content-Length': '102', 'Boulder-Request-Id': 'iVYoa7hsZYp1aF475dIWUqxOxGHGExcD83GQAJc6DDM', 'Expires': 'Tue, 21 Feb 2017 18:56:09 GMT', 'Server': 'nginx', 'Connection': 'close', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Boulder-Requester': '455874', 'Date': 'Tue, 21 Feb 2017 18:56:09 GMT', 'Content-Type': 'application/problem+json', 'Replay-Nonce': 'XCyd_1t7p4tNzPF7PvaLDY7uPAm7aM_4QsjBZkvhqXs'}): '{\n  "type": "urn:acme:error:serverInternal",\n  "detail": "Error creating new authz",\n  "status": 500\n}'

And now, we’re over the pending requests quota:

Feb 22 10:30:39 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThere were too many requests of a given type :: Error creating new authz :: Too many currently pending authorizations.\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
Feb 23 02:05:18 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThere were too many requests of a given type :: Error creating new authz :: Too many currently pending authorizations.\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
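
The two kinds of failure above carry different ACME error types in the problem document (`serverInternal` for the 500s, `rateLimited` for the quota errors), so a wrapper script could in principle treat them differently. The following is only a rough sketch of that idea, not what manage_certs.py does. The function name and the three action labels are hypothetical.

```python
import json

# Hypothetical action labels, chosen for this illustration only.
RETRY_LATER = "retry_later"   # server-side problem, likely transient
BACK_OFF = "back_off"         # rate limited: retrying at once will not help
GIVE_UP = "give_up"           # anything else: needs a human to look at it

def classify_problem(body):
    """Map an ACME problem+json body to a coarse action."""
    problem = json.loads(body)
    # Types look like "urn:acme:error:serverInternal"; keep the last segment.
    short_type = problem.get("type", "").rsplit(":", 1)[-1]
    if short_type == "serverInternal":
        return RETRY_LATER
    if short_type == "rateLimited":
        return BACK_OFF
    return GIVE_UP

body = '{\n  "type": "urn:acme:error:serverInternal",\n  "detail": "Error creating new authz",\n  "status": 500\n}'
print(classify_problem(body))
```

The point of separating the two cases is that blindly retrying after a rate limit error does not help, while a 500 from the server may clear up on its own.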

We’ve tried waiting for the rate limit to expire, but that’s not working. Existing certificates are renewing fine.

Using letsencrypt 0.4.1, with https://acme-staging.api.letsencrypt.org/acme/reg/455874

Can anyone help?

Thanks!
Jill

jillvogel · February 23, 2017, 4:32am

Have also seen some errors like this:

Feb 20 07:15:29 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    f748845a.integration.plebia.net#012    preview-f748845a.integration.plebia.net#012    studio-f748845a.integration.plebia.net#012b'At least one of the (possibly) required ports is already taken.\n'

And found this post which suggested:

This was a problem with my script not stopping nginx for some reason (service nginx stop). When I stopped it manually and ran the renew command, it renewed ok.

But we're not running nginx or apache; just serving letsencrypt (and proxying our load balanced services) directly from haproxy.

Any ideas how/if this is related to the errors shown above?

serverco · February 23, 2017, 7:52am

Hi Jill,

It could also be related to the recent Boulder changes on the staging server; see the announcement "Changing our RPC system in staging".

Was there any additional info in /var/log/letsencrypt ?

jillvogel · February 23, 2017, 11:37am

Unfortunately, we’ve lost the last few days’ letsencrypt logs in the rotation. The logs we still have contain only rate limit errors, which aren’t very helpful, I’m afraid.

The response given above was captured prior to the log rotation.

serverco · February 23, 2017, 1:01pm

@cpu or @roland will hopefully be around shortly and may be able to help more than myself.

jillvogel · February 23, 2017, 1:03pm

Thanks @serverco!

But I think we’ve found the culprit… there was a hung letsencrypt process occupying port 8080. So this post was indeed relevant: we needed to kill that process and to let our cron’d certificate creation process restart it. No clues as to why that process hung though. It had been running since Feb 20.
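
In case it helps anyone with a similar setup, here is a minimal sketch of a pre-flight check a cron'd script could run before starting the client, so that a leftover process holding the port is noticed early instead of producing failed requests. The function name `port_is_free` is made up for this example, and port 8080 is simply the port from our setup.

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # bind() raises OSError (address in use) if another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
    return True

if not port_is_free(8080):
    print("Port 8080 is already taken; check for a hung letsencrypt process.")
```

This only detects the problem. Finding and killing the stale process still has to be done separately.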

And since we were hitting the rate limit after so many retries, and because it’s just our integration test server, we ended up just clearing out the old letsencrypt account and keys, and letting it create a new account to request new certificates.

Whew!

serverco · February 23, 2017, 1:04pm

Glad you found the problem, and thanks for letting us know in case others have a similar issue.

jillvogel · February 23, 2017, 1:06pm

No worries… thank you for supporting such a fantastic service!

cpu · February 23, 2017, 2:20pm

I can confirm that on Tuesday we had a failed staging update that for a short time produced 500 errors. That's likely the cause for this portion of the report.

In this case I'm fairly confident this is unrelated to the gRPC changes and came from the brief window when a failed deploy was generating 500s in staging.

Great! Glad to hear you sorted that out.

From the rate limit errors you were seeing about pending authorizations, it definitely sounded like something was leaking incomplete authorizations. This can bite you in production as well, so I'm glad you caught it before then.

Take care,

system · March 25, 2017, 2:20pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
