500s from staging server, now acme:error:rateLimited


#1

Hello!

We use the Let’s Encrypt staging server for our server deployment integration testing, which can result in 50+ new certificate requests per day.

A couple of days ago, we got a number of 500 errors for what looked like valid requests:

Feb 21 18:55:42 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    a1102cd9.integration.plebia.net#012    preview-a1102cd9.integration.plebia.net#012    studio-a1102cd9.integration.plebia.net#012b'An unexpected error occurred:\nThe server experienced an internal error :: Error creating new authz\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
Feb 21 18:55:49 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThe server experienced an internal error :: Error creating new authz\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'

The full server response didn’t provide much more information:

2017-02-21 18:56:09,616:DEBUG:acme.client:Received response <Response [500]> (headers: {'Content-Length': '102', 'Boulder-Request-Id': 'iVYoa7hsZYp1aF475dIWUqxOxGHGExcD83GQAJc6DDM', 'Expires': 'Tue, 21 Feb 2017 18:56:09 GMT', 'Server': 'nginx', 'Connection': 'close', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Boulder-Requester': '455874', 'Date': 'Tue, 21 Feb 2017 18:56:09 GMT', 'Content-Type': 'application/problem+json', 'Replay-Nonce': 'XCyd_1t7p4tNzPF7PvaLDY7uPAm7aM_4QsjBZkvhqXs'}): '{\n  "type": "urn:acme:error:serverInternal",\n  "detail": "Error creating new authz",\n  "status": 500\n}'

And now, we’re over the pending requests quota:

Feb 22 10:30:39 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThere were too many requests of a given type :: Error creating new authz :: Too many currently pending authorizations.\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'
Feb 23 02:05:18 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    8936c961.integration.plebia.net#012    preview-8936c961.integration.plebia.net#012    studio-8936c961.integration.plebia.net#012b'An unexpected error occurred:\nThere were too many requests of a given type :: Error creating new authz :: Too many currently pending authorizations.\nPlease see the logfiles in /var/log/letsencrypt for more details.\n'

We’ve tried waiting for the rate limit to expire, but that’s not working. Existing certificates are renewing fine.

Using letsencrypt 0.4.1, with https://acme-staging.api.letsencrypt.org/acme/reg/455874

Can anyone help?

Thanks!
Jill


#2

We have also seen some errors like this:

Feb 20 07:15:29 haproxy-stage manage_certs.py: Failed to obtain a new certificate for these domains:#012    f748845a.integration.plebia.net#012    preview-f748845a.integration.plebia.net#012    studio-f748845a.integration.plebia.net#012b'At least one of the (possibly) required ports is already taken.\n'

And found this post which suggested:

This was a problem with my script not stopping nginx for some reason (service nginx stop). When I stopped it manually and ran the renew command, it renewed ok.

But we’re not running nginx or apache; just serving letsencrypt (and proxying our load balanced services) directly from haproxy.

Any ideas how/if this is related to the errors shown above?


#3

Hi Jill,

It could also be related to recent changes to Boulder on the staging server; see the topic “Changing our RPC system in staging”.

Was there any additional info in /var/log/letsencrypt?


#4

Unfortunately, we’ve lost the last few days’ letsencrypt logs in the rotation. What’s left is just rate limit errors, which aren’t very helpful, I’m afraid.

The response given above was captured prior to the log rotation.


#5

@cpu or @roland will hopefully be around shortly and may be able to help more than I can.


#6

Thanks @serverco!

But I think we’ve found the culprit… there was a hung letsencrypt process occupying port 8080. So the post I quoted above was indeed relevant: we needed to kill that process and let our cron’d certificate creation process restart it. No clues as to why that process hung, though. It had been running since Feb 20.
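For anyone who lands here with the same symptom, a pre-flight check along these lines might catch a stale listener before the client is launched. This is only a minimal sketch: `port_in_use` is a hypothetical helper name, and it assumes the hung process is still listening on localhost port 8080 over TCP, as it was in our case.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it only detects a process that
    is actively listening, not one that merely holds the port bound.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8080 is the port from this thread; adjust for your own setup.
    if port_in_use(8080):
        print("Port 8080 is already taken; look for a stale letsencrypt process.")
    else:
        print("Port 8080 looks free.")
```

A cron job could run a check like this first and skip (or alert) rather than start another request that is bound to fail.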

And since we were hitting the rate limit after so many retries, and because it’s just our integration test server, we ended up just clearing out the old letsencrypt account and keys, and letting it create a new account to request new certificates.

Whew!


#7

Glad you found the problem, and thanks for letting us know in case others have a similar issue 🙂


#8

No worries… thank you for supporting such a fantastic service!


#9

I can confirm that on Tuesday we had a failed staging update that for a short time produced 500 errors. That’s likely the cause of this portion of the report.

In this case I’m fairly confident this is unrelated to the gRPC changes and instead came from the brief window when a failed deploy had staging generating 500s.

Great! Glad to hear you sorted that out.

From the rate limit errors you were seeing about pending authorizations, it definitely sounded like something was leaking incomplete authorizations. This can definitely bite you in production as well, so I’m glad you caught it before then 🙂
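One way a wrapper script might avoid piling up pending authorizations is to look at the `type` field of the problem document (as in the 500 response quoted earlier in the thread) before deciding whether to try again. The sketch below is illustrative only: `classify_problem` and the retry policy are assumptions, not something Boulder or the client prescribes.

```python
import json

# Assumed policy for illustration; the error types are the ones seen in this thread.
RETRY_LATER = {"urn:acme:error:serverInternal"}  # transient server-side failure
BACK_OFF = {"urn:acme:error:rateLimited"}        # retrying creates more pending authz


def classify_problem(body):
    """Map an application/problem+json body to 'retry', 'back_off' or 'fail'."""
    try:
        problem = json.loads(body)
    except ValueError:
        return "fail"  # not JSON at all
    if not isinstance(problem, dict):
        return "fail"
    kind = problem.get("type")
    if not isinstance(kind, str):
        return "fail"
    if kind in BACK_OFF:
        return "back_off"
    if kind in RETRY_LATER:
        return "retry"
    return "fail"


if __name__ == "__main__":
    sample = '{"type": "urn:acme:error:rateLimited", "status": 429}'
    print(classify_problem(sample))
```

The point of the design is that a rate-limit response stops the retry loop instead of feeding it, since each new attempt can leave another authorization pending.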

Take care,

