Increasing tls-sni-01 server timeout

Hi -

I have implemented an ACME plugin in libwebsockets, which works well with both OpenSSL and mbedTLS backends, using tls-sni-01. And Let’s Encrypt is very nice once it’s working.

libwebsockets is very lightweight and works on many platforms, including ESP32. On normal platforms we use 4096-bit RSA keys, but on ESP32 we can only use 2048-bit.

On ESP32 platform, I met a problem using the staging server… ESP32 (which uses mbedTLS backend) is too weak to complete the TLS action in the ~1.6s that the staging server (connecting to us as a client) seems to allow before timing it out and making a new connection.

For comparison of the timing, here is a local browser making an http/2 connection to the ESP32…

[2017/11/28 11:34:34:2111] NOTICE: _realloc: size 544: new server wsi (free heap 172708)
ssl_pm_handshake
ssl ret 0 state 1
ssl ret 0 state 2
ssl ret 0 state 3
ssl ret 0 state 4
ssl ret 0 state 5
ssl ret 0 state 6
ssl ret 0 state 7
ssl ret 0 state 8
ssl ret -26880 state 8   (this is WANT_READ in MBEDTLS_SSL_CLIENT_KEY_EXCHANGE state)
[2017/11/28 11:34:35:2663] NOTICE: lws_tls_server_accept: accept SSL_get_error 5 errno 11
[2017/11/28 11:34:35:2697] NOTICE: _realloc: size 996: ah struct (free heap 145280)
[2017/11/28 11:34:35:2778] NOTICE: _realloc: size 1024: ah data (free heap 144252)
[2017/11/28 11:34:35:2866] NOTICE:  heap :144248 (-2684)
ssl_pm_handshake
ssl ret 0 state 9
ssl ret 0 state 10
ssl ret 0 state 11
ssl ret 0 state 12
ssl ret 0 state 13
ssl ret 0 state 14
ssl ret 0 state 15
ssl ret 0 state 16
[2017/11/28 11:34:36:0596] NOTICE: _realloc: size 208: h2n (free heap 146856)

You can see the connection is noticed at 11:34:34:2111 but it does not complete SNI and the TLS handshake until 11:34:36:0596, which is like 1.7s later. What we see when we are getting the SNI challenge from your server is it gives up and connects three more times at ~1.6s intervals; that is too fast for us to complete the handshake.

The timeout situation may also be exacerbated by our being in Taiwan, so there is a significant RTT on top.

Can I gently request that the challenge timeout be extended to 3s or so, making Let’s Encrypt compatible with a new class of weak devices? Thanks for your consideration.

1 Like

Hi @warmcat, thanks for posting.

It sounds like you have a unique ACME plugin & use-case. Cool :slight_smile:

Hmm. Can you share some way for me to identify these validation requests from the staging validation authority logs? Source IP, ACME account ID, or the domain(s) in question would be helpful.

I'm a little bit surprised because my understanding was that the timeout was a static 5s for HTTP-01 and TLS-SNI-01 for both staging and prod. A response in ~1.7s should have worked unless I'm overlooking/misunderstanding something.

Hmm. Can you share some way for me to identify these validation requests from the staging validation authority logs? Source IP, ACME account ID, or the domain(s) in question would be helpful.

Sure, the domain I'm testing with is home.warmcat.com. These logs should be in UTC and from the last test I ran today (well, yesterday for me now).

I’m a little bit surprised because my understanding was that the timeout was a static 5s for HTTP-01 and TLS-SNI-01 for both staging and prod. A response in ~1.7s should have worked unless I’m overlooking/misunderstanding something.

Thanks for the info... the logs I have look like this:

...
{
  "type": "tls-sni-01",
  "status": "pending",
  "uri": "https://acme-staging.api.letsencrypt.org/acme/challenge/usixEGWuCtSLs34WDmdga_FDV2j6HvBa9eJ-NMs5RgY/79732693",
  "token": "ACsPE9iM5POafXzM61qpVL9nFixiUNy3u0HzdHTmMYk",
  "keyAuthorization": "ACsPE9iM5POafXzM61qpVL9nFixiUNy3u0HzdHTmMYk.SvP97mm_rw60iWUtRVr1L-374tebODwrrJoOu1kb5XQ"
}
...
    
[2017/11/28 11:26:05:1521] NOTICE: _realloc: size 544: new server wsi (free heap 93200)  (first connect)
ssl_pm_handshake
ssl ret 0 state 1
[2017/11/28 11:26:05:1626] NOTICE: lws_mbedtls_sni_cb: e2e2cf3a09271ce6833ba0b63a7c7dac.31548fd9254ca30f93e7f98913542965.acme.invalid
[2017/11/28 11:26:05:1730] NOTICE: SNI: Found: e2e2cf3a09271ce6833ba0b63a7c7dac.31548fd9254ca30f93e7f98913542965.acme.invalid:443 at vhost 'e2e2cf3a09271ce6833ba0b63a7c7dac.31548fd9254ca30f93e7f98913542965.acme.invalid'
ssl ret 0 state 2
ssl ret 0 state 3
ssl ret 0 state 4
ssl ret 0 state 5
ssl ret 0 state 6
ssl ret 0 state 7
ssl ret 0 state 8
ssl ret -26880 state 8
[2017/11/28 11:26:06:6950] NOTICE: lws_tls_server_accept: accept SSL_get_error 5 errno 11  (this is SSL_WANT_READ)
[2017/11/28 11:26:06:6975] NOTICE: _realloc: size 996: ah struct (free heap 65180)
[2017/11/28 11:26:06:7055] NOTICE: _realloc: size 1024: ah data (free heap 64152)
[2017/11/28 11:26:06:7154] NOTICE: _realloc: size 544: new server wsi (free heap 63520)  (second connect)
ssl_pm_handshake
... (after 2 more retries)
[2017/11/28 11:26:10:9197] NOTICE: callback_acme_client: LWS_CALLBACK_RECEIVE_CLIENT_HTTP
{
  "type": "tls-sni-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:connection",
    "detail": "Timeout",
    "status": 400
  },
  ...

So the first and second connect come at 11:26:05:1521 vs 11:26:06:7154, around 1.6s, and it asserts "timeout". However it's true the last connection attempt causes us to run out of heap (there is only 350KB of RAM on the whole chip :slight_smile: ) making that fail on our side; the first three don't though. So there may have been some other handshake error on the first three tries we can't see, because it reports what happened on the last one?

Thanks for any advice.

1 Like

I was able to find these requests and piece together a timeline from what the validation authority on our side logged. As you noticed the staging env performs domain validation from multiple network perspectives. There are 3 "remote" VAs and one primary VA. For the sake of making this timeline easier to understand lets call them RVA_1, RVA_2, RVA_3 and PVA_1.

  • 26/11/17 11:26:03.000 R_VA1 and R_VA2, start a TLS-SNI-01 validation attempt to 1.163.189.31:443 with SNI value e2e2cf3a09<snipped>.acme.invalid
  • 26/11/17 11:26:03.899674 P_VA1 starts a validation attempt to the same address/SNI value.
  • 26/11/17 11:26:05.000 R_VA3 starts a validation attempt to the same address/SNI value.
  • 26/11/17 11:26:08.000 R_VA1 reports a timeout. 11:26:08 minus 11:26:03 gives us the expected 5s timeout.
  • 26/11/17 11:26:08.000 R_VA2 reports a EOF error (Maybe the heap was blown on your end at this point?)
  • 26/11/17 11:26:08.899836 P_VA1 reports a timeout
  • 26/11/17 11:26:09.000 R_VA3 reports an EOF

At face value this seems to match up to my expectation of how the timeouts are enforced at 5s.

Does the above timeline help at all with explaining the failure conditions on your end? It's possible the EOF being read on the alternating connection failures is a consequence of the outstanding validation requests being cancelled on our end after the first timeout but I'd have to take a closer look at the code to know for sure.

I can do a little bit more digging tomorrow if there's still a mystery at hand.

(I'm impressed that you hack around on chips with so little heap space! Sounds like an interesting challenge. I'll keep you in my thoughts the next time some of my sloppier code mallocs a larger than needed buffer without a care in the world :laughing: )

I was able to find these requests and piece together a timeline from what the validation authority on our side logged.

Thanks a lot for the effort.

As you noticed the staging env performs domain validation from multiple network perspectives.

Ah... it rings a bell now you say it, but looking at my logs alone I thought it was one thing retrying.

That makes a lot of sense why my heap is going down then, the TLS connections are expensive at mbedTLS level and the attempts are coming in parallel; the first ones really are still ongoing when the later ones come. I was wondering why the heap was not recovering sequentially.

That makes it sound like something simply broke my end and we are timing out each parallel attempt in turn (except the last one, which doesn't really get very far due to OOM). I confirmed this works on a PC using mbedtls, but that was a few days ago and several changes were needed to adapt it for ESP32. I will check we're still working on a PC and study the ESP32 behaviour closer today.

Lws already has a concept of restricting the amount of simultaneous TLS by deferring additional connection accept until we are below the TLS limit again, but I didn't think to enable it on the temporary sni vhost.

The 1.6s delay in the requests seen my side is reflecting the singlethreaded event loop in lws in blocked for about that time by the mbedTLS actions. So as soon as we finished the work in mbedTLS for the first connection, we return to the event loop and saw the next connection, which had been waiting. Even if we have multiple TLS in play, this serialization effect will still be there on our side since we can only do the mbedTLS work one at a time. Your logs show two of the four starting together and the remaining two at +300ms and +2s... if the attempts can be cleared in 1.7s and we restrict it to one at a time on our side, ignoring RTTs it should be like

source  start        connect  done  (delay)
1       0            0        1.7    1.7
2       0            1.7      3.4    3.4
3       0.3          3.4      5.1    4.8
4       2.0          5.1      6.8    4.8

Considering roundtrip times on top, and there's no guarantee about the test starts being staggered, it's maybe too close for comfort to 5s. Anyway before worrying about that I will try to find out why we are timing out on the first two at least, which is going to be something my side, and reply again later. Thanks again for the insight.

so little heap space

We're inheriting the advantages from a lot of other projects underneath that were also modest with space, like lwip and mbedtls; ESP32 has half the wireless-related pieces and libc in ROM as well. But even so it's pretty wild the entire image I am working on is 920KB in flash including the OS, wlan stack, all pictures, JS, http/2 + websocket server etc, but once it gets to the browser, the TLS setup delay was the only sign you're not talking to a normal server...

After:

  • restricting it to 2 tls connections at a time (one ws one is open back to the browser, so effectively one at a time for the challenge),

  • deferring the client tls connection back to the ACME server to monitor the status (this also eats time)

  • disabling background WLAN scanning

I think it is able to complete the first challenge OK

[2017/11/29 06:42:46:2796] NOTICE: _realloc: size 544: new server wsi (free heap 133612)
[2017/11/29 06:42:46:2839] NOTICE: lws_gate_accepts: on = 0
[2017/11/29 06:42:46:2845] NOTICE: lws_server_socket_service_ssl: ssl restr 2, simul 2
ssl_pm_handshake
ssl ret 0 state 1
[2017/11/29 06:42:46:2982] NOTICE: lws_mbedtls_sni_cb: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid
[2017/11/29 06:42:46:3085] NOTICE: SNI: Found: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid:443 at vhost '1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid'
ssl ret 0 state 2
ssl ret 0 state 3
ssl ret 0 state 4
ssl ret 0 state 5
ssl ret 0 state 6
ssl ret 0 state 7
ssl ret 0 state 8
ssl ret -26880 state 8
[2017/11/29 06:42:47:8029] NOTICE: lws_tls_server_accept: accept SSL_get_error 5 errno 11
[2017/11/29 06:42:47:8064] NOTICE: _realloc: size 996: ah struct (free heap 99588)
[2017/11/29 06:42:47:8148] NOTICE: _realloc: size 1024: ah data (free heap 99148)
[2017/11/29 06:42:47:8233] NOTICE:  heap :99340 (-3896)
[2017/11/29 06:42:48:0058] NOTICE:  heap :102460 (+3120)
ssl_pm_handshake
ssl ret 0 state 9
ssl ret 0 state 10
ssl ret 0 state 11
ssl ret 0 state 12
ssl ret 0 state 13
ssl ret 0 state 14
ssl ret 0 state 15
ssl ret 0 state 16
[2017/11/29 06:42:49:0059] NOTICE:  heap :105920 (+3460)
[2017/11/29 06:42:49:1072] NOTICE: lws_gate_accepts: on = 1
[2017/11/29 06:42:49:1080] NOTICE: lws_ssl_close: ssl restr 2, simul 1

Basically it takes it from 06:42:46:2796 - 06:42:49:1080 to do the first one, so a bit less than 3s. The second one also looks like it completes

[2017/11/29 06:42:49:1126] NOTICE: _realloc: size 544: new server wsi (free heap 129496)
[2017/11/29 06:42:49:1217] NOTICE: lws_gate_accepts: on = 0
[2017/11/29 06:42:49:1255] NOTICE: lws_server_socket_service_ssl: ssl restr 2, simul 2
ssl_pm_handshake
ssl ret 0 state 1
[2017/11/29 06:42:49:1383] NOTICE: lws_mbedtls_sni_cb: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid
[2017/11/29 06:42:49:1497] NOTICE: SNI: Found: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid:443 at vhost '1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid'
ssl ret 0 state 2
ssl ret 0 state 3
ssl ret 0 state 4
ssl ret 0 state 5
ssl ret 0 state 6
ssl ret 0 state 7
ssl ret 0 state 8
ssl ret -26880 state 8
[2017/11/29 06:42:50:2904] NOTICE: lws_tls_server_accept: accept SSL_get_error 5 errno 11
[2017/11/29 06:42:50:2939] NOTICE: _realloc: size 996: ah struct (free heap 100352)
[2017/11/29 06:42:50:3019] NOTICE: _realloc: size 1024: ah data (free heap 99512)
[2017/11/29 06:42:50:3106] NOTICE:  heap :99508 (-6412)
ssl_pm_handshake
ssl ret 0 state 9
ssl ret 0 state 10
ssl ret 0 state 11
ssl ret 0 state 12
ssl ret 0 state 13
ssl ret 0 state 14
ssl ret 0 state 15
ssl ret 0 state 16
[2017/11/29 06:42:51:4067] NOTICE:  heap :105660 (+6152)
[2017/11/29 06:42:51:4130] NOTICE: lws_gate_accepts: on = 1
[2017/11/29 06:42:51:4136] NOTICE: lws_ssl_close: ssl restr 2, simul 1

Actually we don’t know when this one originated, we only know we could start to handle it at 06:42:49:1126. Assuming it started at the same time as the first one, actually it has to wait from 06:42:46:2796 - 06:42:51:4136, it’s slightly over 5s.

The third one (in the other test, +300ms) fails out with the far side having hung up on them.

[2017/11/29 06:42:52:7505] NOTICE: _realloc: size 544: new server wsi (free heap 128976)
[2017/11/29 06:42:52:7634] NOTICE: lws_gate_accepts: on = 0
[2017/11/29 06:42:52:7643] NOTICE: lws_server_socket_service_ssl: ssl restr 2, simul 2
ssl_pm_handshake
ssl ret 0 state 1
[2017/11/29 06:42:52:7772] NOTICE: lws_mbedtls_sni_cb: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid
[2017/11/29 06:42:52:7887] NOTICE: SNI: Found: 1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid:443 at vhost '1b8216e54128c30bb8da736836cd8546.9a000f1ecace5a701521975f6c5758d6.acme.invalid'
ssl ret 0 state 2
ssl ret 0 state 3
ssl ret 0 state 4
ssl ret -80 state 5 (MBEDTLS_ERR_NET_CONN_RESET)
ssl_pm_handshake: mbedtls_ssl_handshake() returned -0x50
[2017/11/29 06:42:53:9253] NOTICE: lws_tls_server_accept: accept SSL_get_error 5 errno 104

If he was +300ms again, it is 06:42:46:5796 - 06:42:53:9253, which is way over 5s.

The fourth one (+2s on the first test) fails the same way.

So unless I continue to misunderstand it, it seems to still boil down to needing a timeout extension.

Would it be practical to use HTTP-01 validation? That way you’d only have to do the crypto stuff once, and the validation requests would only be an HTTP connection downloading a little bit of text.

Would it be practical to use HTTP-01 validation?

Well... I can't say "no"... it is practical for some values of "practical'. But IIUI that needs a second port-forward at :80. Using tls-sni-01, the user just needs the portforward he needed anyway, which doesn't have to externally be :443 (although I did not test that yet). SNI takes care of managing multiple vhosts on the same port, which we anyway support along with dynamic vhosts creation and destruction. The story for setup is very simple, set it up how you wanted to access it anyway with regards to external access, and then click the button.

We generate our own selfsigned cert locally on the device by default, so it's intended to be always providing some kind of tls. So port 80 at all with no tls, and the portforward on external port 80 only, would be an ACME-only thing.

In the case there are several devices on site that want to use this, with HTTP-01 you would have to fiddle around pointing the magic external port 80 at the router at each in turn. That also means they cannot self-manage their cert lifetime, you have to run around manually for each device updating them and again pointing external port 80 to each in turn. By contrast, with tls-sni-01, they guys that have let'sencrypt certs with external access on some port can just notice themselves they are running out of time with their cert and auto-renew only using their existing external portforward (again, AIUI / didn't test that yet). So it's a significant deficit AIUI.

Another possibility that would solve it is explicitly serialize the probes to be 5s apart, or be able to request that. So letsencrypt keeps the 5s timeout but can be requested to do the four probes at 5s intervals.

It does. The tls-sni-01 challenge can only be performed over port 443.

1 Like

It does. The tls-sni-01 challenge can only be performed over port 443.

Where is that specified to be so?

Section 7.3 merely says

type (required, string): The string "tls-sni-01"

...

Open a TLS connection to the domain name being validated on the
requested port, presenting the value
"<Zi[0:32]>.<Zi[32:64]>.acme.invalid" in the SNI field (where the
comparison is case-insensitive).

I don't know if it's in the spec, but Let's Encrypt definitely only uses port 443 for tls-sni-01.

EDIT: ah, there it is: not in the ACME spec but in LE's CPS.

TLS Using a Random Number: Confirming the Applicant’s control over the requested FQDN by confirming the presence of a random value (with at least 128 bits entropy) within a Certificate on the requested FQDN which is accessible to the CA via TLS over port 443. (BR Section 3.2.2.4.10)

1 Like

You can have it listen on any port, but Let’s Encrypt will only send it over 443. So, if you have a port forwarding scenario set up, your router can listen on public port 443, and forward to internal port, say, 4000. You’d set Certbot/whatever ACME client to listen on 4000.

You can have it listen on any port, but Let’s Encrypt will only send it over 443

The scenario I am talking about has n individual devices wanting to be externally accessible under their own DNS.

So for example, a device behind NAT, "group1.home.warmcat.com", might have his own :443 port-forwarded to be externally accessible at :1443. He wants to acquire and manage his own cert for "group1.home.warmcat.com" all from external port :1443, without involving whatever might get the external portforward for :443.

draft-ietf-acme-acme-01 doesn't seem to have a problem with that.

The reason for this is to provide an alternative to "all your IoT devices send all your data to someone else's server so you can access it with a valid tls cert".

Certificate on the requested FQDN which is accessible to the CA via TLS over port 443.

Thanks... right now with ACME and LE and boulder implementation variations it is difficult to know what applies...

If the IETF versions will allow other ports, hopefully LE will also allow it. I can get where they're coming from that :443 is the only guy that is authoritative to bless the request for a cert that will be used on :443. But there is nothing about certificates that lock them to any specific port.

Let’s Encrypt also has to comply with the CA/Browser Forum’s Baseline Requirements (note the BR reference in the CPS). That specifies that it must use an “Authorized Port” which is defined elsewhere in the document to mean 80, 443, 115, 25 or 22. So the options available are quite limited anyway, even if LE were to allow all of them, and that still wouldn’t really help for your use case.

it must use an “Authorized Port” which is defined elsewhere in the document to mean 80, 443, 115, 25 or 22.

Thanks.

I can think of some ideas to workaround that but nothing really useful atm.

Another possible (?) solution to this problem might be to provide a parent domain on the PSL and allow the devices to register subdomains and use the DNS-01 challenge to obtain their certificates. That's likely to be less resource-hungry too, I'd imagine.

1 Like

Could you describe how these devices will be used in practice? Here’s a great blog post on how Plex deployed HTTPS for their IoT-like servers, there’s a good chance their approach could work for your use-case: https://blog.filippo.io/how-plex-is-doing-https-for-all-its-users/

Something like this could be built on top of Let’s Encrypt. I’m not aware of any existing projects that solve this exact problem, but acme-dns might be a good starting point.

(Side note: Later ACME drafts (such as acme-08) clarified that the ports used for the HTTP and TLS-SNI challenges are 80 and 443, to avoid this type of confusion.)

1 Like

provide a parent domain on the PSL

Thanks, I will take a look.

Could you describe how these devices will be used in practice?

These ESP32 modules are very cheap (US$3.50 ESP-WROOM-32, ESP32 WIFI+BT+BLE Module [D/U] – ElectroDragon ) and very capable.

I am providing generic LGPL2 + SLE support for them in libwebsockets, I have two use-cases myself, one design is an isolated UART <-> WIFI supporting 1.8V and 3.3V, and the other is an isolated switch with voltage and current monitoring, again over WIFI.

The UART design is accessible over ssh, telnet and browser; for ssh you set it up with the browser and paste your public ssh key into a form.

The modules bind together by using a "group" name with mdns locally, so if you access one in a browser you get a list of devices in the same group with their UI in individual iframes.

The devices produce a selfsigned cert automatically when first booted. So if you have several devices in your group, each browser that wants to access them faces a bunch of selfsigned cert browser UI noise... browsers like Firefox for Android just ban selfsigned as far as I can tell with no accept UI presented. So this is very messy.

If they can each acquire and maintain a valid cert replacing the initial selfsigned one, and behind the NAT the cert DNS name resolves to the correct local IP for the device, that will remove all the problems. Implementing that the way I was following until now also requires individual portforwards to the devices, even if that is only used to reply to the tls-sni-01 challenges.

Better ideas are welcome :slight_smile: I think it's a general problem that before Let's Encrypt had no way for the certificate infrastructure to solve neatly.

Hi again @warmcat,

Apologies on dropping off this thread. It's been a busy week :slight_smile:

So unless I continue to misunderstand it, it seems to still boil down to needing a timeout extension.

I'll discuss with the other two Boulder engineers whether an adjustment to this timeout is reasonable. It might require tweaking a few other timeouts down the callpath to ensure that the RPCs involved in performing validation don't time out before the validation request.

Do you think that a 10s timeout would give your devices enough headroom to complete the required handshakes?

I don't think its likely we would relax this restriction on our end. In practice there's all manner of strange configurations in the wild and historically its hard to make assumptions about which ports are safe to assume can authoritatively speak for a domain in a way that won't result in any misissuance. Between HTTP-01, TLS-SNI-01 and DNS-01 there's a pretty good selection of options available that should work for most cases without needing additional challenge ports.