TLS: internal error when provisioning certificate

ukutaht · October 9, 2020, 9:11am

Hello, thanks for this awesome project.

I'm the founder of a startup where we offer a custom domain feature. The customer creates a CNAME from their subdomain to custom.plausible.io. On the custom.plausible.io server we manage the SSL certificates with certbot and proxy_pass the traffic to our backend servers with nginx.

We have over 800 certificates already issued, but this time I got a weird error I haven't seen before.

My domain is:

stats.elixir-lang.org CNAME -> custom.plausible.io.

I ran this command:

sudo certbot certonly --nginx -n -d stats.elixir-lang.org

It produced this output:

2020-10-09 09:01:14,061:DEBUG:urllib3.connectionpool:https://acme-v02.api.letsencrypt.org:443 "POST /acme/authz-v3/7770321146 HTTP/1.1" 200 1032
2020-10-09 09:01:14,063:DEBUG:acme.client:Received response:
HTTP 200
Server: nginx
Date: Fri, 09 Oct 2020 09:01:13 GMT
Content-Type: application/json
Content-Length: 1032
Connection: keep-alive
Boulder-Requester: 78463456
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
Replay-Nonce: 0103_DqV7eizlb1HYeTIrgyPibwdjg9IuYfFGqepqoEa5vs
X-Frame-Options: DENY
Strict-Transport-Security: max-age=604800

{
  "identifier": {
    "type": "dns",
    "value": "stats.elixir-lang.org"
  },
  "status": "invalid",
  "expires": "2020-10-16T09:01:07Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:tls",
        "detail": "During secondary validation: Fetching https://stats.elixir-lang.org/.well-known/acme-challenge/Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I: remote error: tls: internal error",
        "status": 400
      },
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/7770321146/s3Dh5A",
      "token": "Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I",
      "validationRecord": [
        {
          "url": "http://stats.elixir-lang.org/.well-known/acme-challenge/Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I",
          "hostname": "stats.elixir-lang.org",
          "port": "80",
          "addressesResolved": [
            "46.101.161.209"
          ],
          "addressUsed": "46.101.161.209"
        }
      ]
    }
  ]
}
2020-10-09 09:01:14,064:DEBUG:acme.client:Storing nonce: 0103_DqV7eizlb1HYeTIrgyPibwdjg9IuYfFGqepqoEa5vs
2020-10-09 09:01:14,067:DEBUG:certbot.reporter:Reporting to user: The following errors were reported by the server:

Domain: stats.elixir-lang.org
Type:   tls
Detail: During secondary validation: Fetching https://stats.elixir-lang.org/.well-known/acme-challenge/Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I: remote error: tls: internal error

My web server is (include version):

nginx version: nginx/1.19.0

The operating system my web server runs on is (include version):

Ubuntu 18.04.3 (LTS) x64

My hosting provider, if applicable, is:

Digital Ocean

I can login to a root shell on my machine (yes or no, or I don't know):

yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):

no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):

certbot 0.31.0

Osiris · October 9, 2020, 9:14am

You've got a redirect from HTTP to HTTPS in place, which by itself is fine, but your HTTPS site is broken:

osiris@desktop ~ $ curl -LIv https://stats.elixir-lang.org/.well-known/acme-challenge/Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I
*   Trying 46.101.161.209:443...
* Connected to stats.elixir-lang.org (46.101.161.209) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS alert, internal error (592):
* error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
* Closing connection 0
curl: (35) error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
osiris@desktop ~ $

Also, my Google Chrome won't connect to it, nor will SSLLabs.

No idea why though. It isn't the common "Speaking HTTP on port 443" error, but something else.

Please check your webserver logs, your general server logs or perhaps even dmesg.

_az · October 9, 2020, 9:23am

If we take a look at the actual validation failure:

During secondary validation: Fetching https://stats.elixir-lang.org/.well-known/acme-challenge/Hj-nIHKHENtSC3nyVrqpKSa4ceE_5d81etZ1TxcpV1I: remote error: tls: internal error

That's a weird one. Typically, when using Certbot's nginx plugin, it should serve the challenge response over port 80.

In this case, it's followed a redirect to port 443. That seems like pretty damning evidence that the nginx plugin is not configuring your webserver properly.

The number of virtual hosts involved makes me suspect that perhaps some of your nginx workers were not fully reloaded, which is why we only see a secondary validation fail, rather than the primary validation.

If you are on Certbot 1.7.0 or newer, you could try adding --nginx-sleep-seconds 30 or something similar to try to eliminate this as a cause.

If you can't get a version of Certbot that recent, you may need to give up using --nginx in favor of something like --webroot (with a matching nginx location that avoids any HTTPS redirect for that URL), which avoids modifying your server configuration. I think that is a wise thing to do anyway if you have a large nginx configuration: it saves a significant number of CPU cycles.

Assuming, that is, that this diagnosis is correct, which is not a sure thing at all.

Osiris · October 9, 2020, 9:29am

Not sure if this is an nginx plugin issue: the whole site isn't accessible.

Also: a redirect from HTTP to HTTPS by certbot would only be there if there was actually a certificate issued. However, there is none: https://crt.sh/?q=stats.elixir-lang.org&deduplicate=Y

So I'm guessing HTTPS was set up manually somehow, not by the nginx plugin of certbot.

Another reason to believe it's an nginx misconfiguration: there's actually something replying that speaks TLS. The "internal error" isn't from curl or Boulder, it's literally in the TLS reply from the server (according to the packet capture in Wireshark).

_az · October 9, 2020, 9:41am

Yes indeed, OP's command was certonly --nginx - only the authenticator.

I had given them the benefit of the doubt that they were doing something like:

ssl_certificate /etc/letsencrypt/live/$ssl_server_name/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/$ssl_server_name/privkey.pem;

which produces an SSL alert if the certificate doesn't exist, but works fine if it does. It's a trick to force a valid configuration, even if the certificate doesn't exist yet.
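
To make the failure mode concrete: with variables in the certificate paths, nginx loads the files at handshake time, so a hostname with no files under the live directory gets a TLS alert instead of a certificate. A minimal sketch for checking this ahead of time, assuming the standard Certbot directory layout; has_cert_for is a hypothetical helper name, not part of Certbot or nginx:

```shell
#!/bin/sh
# has_cert_for LIVE_DIR HOST
# Succeeds only if both files referenced by the ssl_certificate and
# ssl_certificate_key lines above exist for HOST.
# (has_cert_for is a hypothetical helper, named here for illustration.)
has_cert_for() {
    live_dir="$1"
    host="$2"
    [ -f "$live_dir/$host/fullchain.pem" ] && [ -f "$live_dir/$host/privkey.pem" ]
}

# Example use (the live directory usually requires root to read):
#   has_cert_for /etc/letsencrypt/live stats.elixir-lang.org && echo present || echo missing
```

If this reports "missing" for a hostname, a TLS alert on port 443 for that name is the expected result of the configuration rather than a sign of a broken TLS stack.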

Osiris · October 9, 2020, 9:42am

Would that trick result in a working nginx plugin authentication with a redirect in place, @_az? Let's Encrypt doesn't actually verify the certificate, but a working TLS connection is of course required. I doubt nginx can do that without a certificate.

_az · October 9, 2020, 9:46am

The validation method is HTTP.

So --nginx temporarily inserts a rule into the port 80 nginx virtualhost to respond to /.well-known/acme-challenge/xyz without redirecting to HTTPS.

A TLS connection shouldn't be happening at all. In my first reply, I suspected that it is happening because of a delay in reloading the above rule into every nginx worker.

I think we can assume that there's nothing wrong with the nginx TLS stack besides a missing certificate for that particular domain - https://custom.plausible.io works just fine.
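
One way to see from the outside whether port 80 answers a challenge URL itself or hands the request over to HTTPS is to request it without following redirects. A rough sketch, assuming curl is installed; the function names and output labels are made up for illustration, and the token argument is whatever challenge token you want to probe:

```shell
#!/bin/sh
# classify_probe STATUS REDIRECT_URL
# Turns an HTTP status code and redirect target into a short label.
# (Hypothetical helper; the labels are arbitrary.)
classify_probe() {
    case "$1" in
        30[12378])
            case "$2" in
                https://*) echo "redirects-to-https" ;;
                *)         echo "redirects" ;;
            esac ;;
        200|404) echo "served-on-port-80" ;;
        *)       echo "unexpected" ;;
    esac
}

# probe_challenge_path HOST TOKEN
# Requests the challenge URL over plain HTTP. No -L is passed, so curl
# reports the redirect target instead of following it.
probe_challenge_path() {
    out=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' \
        "http://$1/.well-known/acme-challenge/$2") || return 1
    classify_probe "${out%% *}" "${out#* }"
}

# Example use:
#   probe_challenge_path stats.elixir-lang.org some-token
```

A 404 still counts as "served on port 80" here, because it shows the request was not redirected. Keep in mind that the rule added by --nginx is temporary, so a probe made outside the validation window says nothing about it; the check is mainly useful for a static setup.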

Osiris · October 9, 2020, 9:50am

https://stats.elixir-lang.org is also working now, but it didn't work before. I'm assuming something has changed on the server. It also has a valid LE certificate now.

That might be the case if you put it like that, makes sense. The only thing I saw was the https:// protocol in the failed authz reply from Boulder.

ukutaht · October 9, 2020, 10:49am

I went for lunch, and during that time our background worker finally succeeded after dozens of failures. I changed nothing in the configuration. To me this suggests a potential race condition in reloading nginx virtual host configurations, as proposed by @_az. Does nginx reload configs for all virtual hosts during each HTTP challenge?

I will paste my nginx config so you can maybe spot issues here:

server {
    server_name _;

    # ... proxy_pass configuration

    listen 443 ssl;
    ssl_certificate /etc/letsencrypt/live/$ssl_server_name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$ssl_server_name/privkey.pem;
    include /etc/letsencrypt/options-ssl-nginx.conf;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
}

server {
    server_name  _;
    listen       80;
    return 301 https://$host$request_uri;
}

Osiris · October 9, 2020, 10:52am

As far as I know, there's only one reload command for nginx, and it reloads the entire configuration. That is: the master process starts new worker processes with the new configuration and signals the old worker processes to shut down gracefully.

_az · October 9, 2020, 10:55am

I absolutely think you should just add this to your port 80 server:

location /.well-known/acme-challenge {
    root /some/directory/somewhere;
}

and change Certbot to use --webroot -w /some/directory/somewhere rather than --nginx.

For a high virtual host count, it's just too wasteful to keep reloading nginx when you can avoid it.
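
A sketch of how the port 80 server from the config posted above might look with this change. One detail worth noting: a return directive at server level is evaluated before nginx selects a location, so the redirect is moved into location / here; otherwise the challenge location would never be reached. /some/directory/somewhere is the same placeholder path as above, not a recommendation:

```nginx
server {
    server_name  _;
    listen       80;

    # Serve ACME HTTP-01 challenge files written by certbot --webroot.
    # With "root", the request URI is appended to the directory, which
    # matches where --webroot places the token files.
    location /.well-known/acme-challenge {
        root /some/directory/somewhere;
    }

    # Everything else is still redirected to HTTPS.
    location / {
        return 301 https://$host$request_uri;
    }
}
```

The matching Certbot invocation would then be along the lines of sudo certbot certonly --webroot -w /some/directory/somewhere -n -d stats.elixir-lang.org, mirroring the command from the first post.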

ukutaht · October 9, 2020, 10:58am

I will give it a go.

Thanks @_az and @Osiris, you've been incredibly helpful.
