Intermittent SSLError

Our Client
We issue hundreds of challenges per day on hundreds of domains, using the ancient version of the certbot client still named letsencrypt, which speaks ACME v1. We typically put 70 to 100 domains in a single SAN certificate.

We are beginning our process of upgrading to ACME V2 (latest certbot) right now.

Quick Context
Given our extreme number of domains, networking or DNS issues are inevitable. Our system watches for error messages from letsencrypt like the following:

urn:acme:error:dns :: DNS problem: NXDOMAIN looking up A for my.domain.com - check that a DNS record exists for this domain

This format of colon-separated messages allows our system to reliably parse the cause of failure and take highly specialized error handling action based on various failures, such as removing a problematic domain from the attempt.
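For readers who want to do something similar, here is a minimal sketch of that kind of parsing. It assumes the message shape `<urn> :: <summary>: <detail>` seen in the one sample above; the `AcmeError` type, the `parse_acme_error` function and the "hostname follows `for`" heuristic are hypothetical names and guesses for illustration, not part of letsencrypt or certbot.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, inferred from the single sample message in this post:
#   "<urn> :: <summary>: <detail>"
_ACME_ERROR = re.compile(
    r"^(?P<urn>urn:acme:error:\w+)\s+::\s+(?P<summary>[^:]+):\s+(?P<detail>.*)$"
)
# Heuristic (an assumption): the failing hostname follows the word "for".
_DOMAIN = re.compile(r"\bfor\s+(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})")


class AcmeError(NamedTuple):
    kind: str               # last URN component, e.g. "dns"
    summary: str            # e.g. "DNS problem"
    detail: str             # remaining free text
    domain: Optional[str]   # hostname found in the detail, if any


def parse_acme_error(line: str) -> Optional[AcmeError]:
    """Return the parsed error, or None if the line is not in the assumed shape."""
    match = _ACME_ERROR.match(line.strip())
    if match is None:
        return None
    detail = match.group("detail")
    domain_match = _DOMAIN.search(detail)
    return AcmeError(
        kind=match.group("urn").rsplit(":", 1)[-1],
        summary=match.group("summary").strip(),
        detail=detail,
        domain=domain_match.group("domain") if domain_match else None,
    )
```

A line such as the SSLError shown later in this post does not match the pattern and comes back as `None`, which is exactly the case we cannot act on per domain.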

The Problem

Within the last couple of months, we began getting this error intermittently, roughly daily, on arbitrary cert attempts:

SSLError: EOF occurred in violation of protocol (_ssl.c:590)

Simply retrying resolves it: running the exact same command with the same domains succeeds. It would seem this problem is not within our client.

It doesn’t give information we can write useful error handling for (such as removing the problematic domain from the domain list). This makes me wonder if the issue is a deeper, uncaught/unhandled error in your system.

The Questions

  • What does this error mean? What might it indicate?
  • Should we expect to continue getting this after upgrading to ACME v2 certbot? We won’t be using DNS challenges.

Special note: We have the exact same problem with another intermittent error: The server experienced an internal error :: Failed to get registration by key. Again, retrying solves it.
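Since both errors go away on retry, the workaround amounts to something like the sketch below. The marker strings are copied from the two errors quoted in this post; treating them as transient is our own assumption, and `attempt` is a hypothetical callable that runs one cert attempt and returns `(succeeded, output)`.

```python
import time
from typing import Callable, Tuple

# Substrings from the two errors quoted above. Classifying them as
# transient is an assumption of this sketch, not documented behaviour.
TRANSIENT_MARKERS = (
    "EOF occurred in violation of protocol",
    "Failed to get registration by key",
)


def is_transient(output: str) -> bool:
    """True if the output contains one of the retry-worthy markers."""
    return any(marker in output for marker in TRANSIENT_MARKERS)


def run_with_retry(
    attempt: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Call attempt() until it succeeds, fails non-transiently, or runs out of tries."""
    output = ""
    for n in range(max_attempts):
        ok, output = attempt()
        # Success, or a failure that needs real handling (e.g. a DNS
        # error for one domain): hand it back to the caller unchanged.
        if ok or not is_transient(output):
            return ok, output
        if n < max_attempts - 1:
            sleep(base_delay * 2 ** n)  # exponential backoff: 5s, 10s, ...
    return False, output
```

Anything that is not on the transient list is returned after the first attempt, so the existing per-domain handling still gets to see it.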

To me, it doesn't look like an error generated by certbot itself. Could you provide more context around the error? Preferably the whole log.

This is a server-side error at Let's Encrypt. It looks like the account key used by the client wasn't recognised by Boulder (the Let's Encrypt ACME server). I assume every certificate uses just one account? Or are multiple accounts chosen at random? That would be odd, given that you're using the official certbot. It could be a server-side error due to a bug, but I'm not sure.

That’s what I assumed as well: the message we normally read and parse is generated by the LE system and simply passed along by certbot. That’s why I think it might be an issue in your system, even if only in the error handling logic, where a more helpful error message could be given.

I’ll try to find something relevant in the underlying letsencrypt client logs for you.

2020-01-27 16:00:11,398:DEBUG:root:Sending POST request to https://acme-v01.api.letsencrypt.org/acme/chall-v3/2415508072/_JSNmw. args: (), kwargs: {'data': '{"header": {"alg": "RS256", "jwk": {"e": "AQAB", "kty": "RSA", "n": "<redacted>"}}, "protected": "<redacted>", "payload": "<redacted>", "signature": "<redacted>"}'}
2020-01-27 16:00:11,398:INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): acme-v01.api.letsencrypt.org
2020-01-27 16:02:14,635:DEBUG:letsencrypt.cli:Exiting abnormally:
Traceback (most recent call last):
  File "/usr/bin/letsencrypt", line 9, in <module>
    load_entry_point('letsencrypt==0.4.1', 'console_scripts', 'letsencrypt')()
  File "/usr/lib/python2.7/dist-packages/letsencrypt/cli.py", line 1986, in main
    return config.func(config, plugins)
  File "/usr/lib/python2.7/dist-packages/letsencrypt/cli.py", line 696, in obtain_cert
    certr, chain = le_client.obtain_certificate_from_csr(config.domains, csr, typ)
  File "/usr/lib/python2.7/dist-packages/letsencrypt/client.py", line 225, in obtain_certificate_from_csr
    authzr = self.auth_handler.get_authorizations(domains)
  File "/usr/lib/python2.7/dist-packages/letsencrypt/auth_handler.py", line 84, in get_authorizations
    self._respond(cont_resp, dv_resp, best_effort)
  File "/usr/lib/python2.7/dist-packages/letsencrypt/auth_handler.py", line 136, in _respond
    self._send_responses(self.dv_c, dv_resp, chall_update))
  File "/usr/lib/python2.7/dist-packages/letsencrypt/auth_handler.py", line 162, in _send_responses
    self.acme.answer_challenge(achall.challb, resp)
  File "/usr/lib/python2.7/dist-packages/acme/client.py", line 234, in answer_challenge
    response = self.net.post(challb.uri, response)
  File "/usr/lib/python2.7/dist-packages/acme/client.py", line 650, in post
    response = self._send_request('POST', url, data=data, **kwargs)
  File "/usr/lib/python2.7/dist-packages/acme/client.py", line 609, in _send_request
    response = requests.request(method, url, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 480, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 588, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
SSLError: EOF occurred in violation of protocol (_ssl.c:590)

I’m not familiar with the underlying HTTP API being used or with what data is being logged, so I replaced anything that looked cryptographic with <redacted>.

Beats me what the cause of the error is, but I think there’s a good chance it’s fixed when certbot is upgraded.

How did you even manage to keep using such an ancient version of the client? :astonished: I’d think it would have failed way back already.

Also, I think Python 2.7 support is being deprecated by the certbot team? I'm not entirely sure; I may be confusing it with the deprecation of Python 2.7 itself.

I don't think you can do anything about either of these.

Most likely a database timeout on the ACME server. Can't really see another way for it to occur.

The error message isn't too revealing - it's basically an opaque networking error. Since we have the timestamp of the request (what timezone is this machine configured for?), maybe the ops people can look it up and see if they can track it down on their CDN or their web servers.

If it happens semi-reliably ... more timestamps could help.
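As a rough sketch of how those timestamps could be collected, the snippet below scans log lines for the `Exiting abnormally` marker seen in the log above and converts each timestamp to UTC. The function name and the `utc_offset_hours` parameter are hypothetical, and the machine's real offset has to be supplied by whoever runs it, since the log lines themselves carry no timezone.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def failure_times_utc(
    lines: Iterable[str],
    utc_offset_hours: float,
    marker: str = "Exiting abnormally",
) -> List[str]:
    """Return ISO 8601 UTC timestamps of log lines containing the marker."""
    # Fixed offset of the logging machine (an input, not detected here).
    local = timezone(timedelta(hours=utc_offset_hours))
    found = []
    for line in lines:
        if marker not in line:
            continue
        stamp = line[:23]  # "YYYY-MM-DD HH:MM:SS,mmm" as in the log above
        try:
            naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # line does not start with a timestamp; skip it
        found.append(naive.replace(tzinfo=local).astimezone(timezone.utc).isoformat())
    return found
```

A fixed offset is used instead of a named timezone to keep the sketch dependency-free; it will be off by an hour across a daylight saving change.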

Edit: I agree with @Osiris that upgrading Certbot may fix the latter error. At least, trying to debug it from an ancient version of Certbot might not be the most productive thing to do right now.

This one is a server-side error related to database load, as folks on this thread have said.

This is interesting. I'm not sure why Python would treat an EOF as a "violation of protocol," since it seems like it could result from a regular connection reset at the TCP layer, which can be caused by connection timeouts, network issues, exceeding the max requests per connection configured at the remote HTTPS server, and so on.

I think the best thing here is to upgrade to a newer Python and a newer Certbot and see if the issue reproduces, and if it does, whether it has a more useful error message.

September, perhaps? The API changed CDNs:

The old and new CDNs were both supposed to work, of course, but they're bound to have subtle behavior differences.

It's less surprising for random issues to happen at times of peak load like that, FWIW.
