Multi-Perspective Validation, simultaneous requests, and Apache caching

If LetsEncrypt eventually ends up with 20 validation locations, will it really use all 20 on every HTTP-01 validation, or will it pick a randomized subset of a more reasonable size, like 5?

Currently, all 5 validation connections hit my server almost simultaneously. Was any consideration given to spacing them out maybe 1 second apart, or even half a second, for the sake of the servers?

After the number of validation connections was increased to 3, then 4, then 5, I ran into an apparently unreported bug in Apache's mod_cache_socache module (RAM caching of HTTP responses). When such caching is enabled and Apache is hit with many near-simultaneous requests for the same uncached file, it responds to at least one of them (usually the 3rd) with a zero-byte response body. It looks like a race condition: identical requests keep arriving while Apache is still trying to place the first response into the cache. And so the whole HTTP-01 challenge fails because of one bad response.
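For anyone who wants to try reproducing this outside of ACME, something along these lines should do it (hypothetical hostname and path; point it at a small static file that isn't in the cache yet):

# Fire 5 near-simultaneous requests at an uncached static file and print
# each response body size; with the socache layer enabled, one of them
# may come back as 0 bytes
seq 1 5 | xargs -P 5 -I{} curl -s -o /dev/null -w '%{size_download}\n' \
  http://test.example.com/some-uncached-file.txt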

I've already found a decent fix/workaround, placing CacheDisable /.well-known/acme-challenge/ in my configuration, and I'm fine with leaving it in place permanently. But I'm concerned this might trip up others, so I'm posting it here in case anyone else runs into the same issue.

I know this is an Apache problem and I'll attempt to raise a bug with them. But staggering the validation connections even by half a second would probably avoid this and similar issues that could be lurking out there. Also of concern are low-end servers that might not be able to handle many simultaneous connections in general, especially if the current "5" continues growing.

8 Likes

Sounds like even a single retry/confirmation on failure may be in order [if a delay can't be coded] to overcome such a bug.

Do you pass?

  • yes: Great!
  • no: How about now?
    • yes: Great!
    • no: So, you really do fail :frowning:
5 Likes

It's too early to tell, but that's the approach we're thinking about. If the draft ballot passes as is, then the subset would likely be either 6 (1 primary + 5 remote) or 7 (1 primary + 6 remote).

7 Likes

I appreciate the information. Has there ever been discussion regarding spacing out the validation connections a bit so they don't all hit the server almost-simultaneously?

2 Likes

It's just a text/plain file; I'm sure even the smallest microcontroller (e.g. an ESP32) could handle 5 simultaneous requests for the token without any issue.

4 Likes

That's my question, too: is it a good thing to slow down issuance like that? The validation won't proceed until the client indicates it's ready, so the server should be able to service that tiny amount of traffic. Artificial delays seem like they should be unnecessary.

7 Likes

What about a single retry on an initial failure?
[erring on the side of caution - maybe there is such a bug]

4 Likes

By contrast, the associated DNS queries (e.g. for CAA) have apparently triggered some anti-DDoS code, if I'm remembering those threads correctly! (Although I don't think that the servers literally had any trouble with capacity to answer the requests.)

8 Likes

This sounds like a decent idea.

That is commonly known as the dogpile or cache stampede effect. A popular workaround is for the first request to take a cache lock, and have subsequent requests either serve the old value or block and poll until the new value is populated.
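Apache's mod_cache has its own flavor of this ("thundering herd" protection) via the CacheLock directives. It's primarily aimed at stale-entry refreshes, so I'm not certain it covers the cold-cache race you're hitting, but it may be worth testing alongside the CacheDisable workaround:

# Thundering herd protection: while one request refreshes a cache entry,
# other requests are told not to attempt the refresh themselves
CacheLock on
CacheLockPath "/tmp/mod_cache-lock"
CacheLockMaxAge 5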

While I do think ISRG/LetsEncrypt should consider spacing requests to help deal with misconfigured servers like this, I think there are two paths you should take:

1- File a bug report against that Apache module, and cite this as a use-case and reproduction method. The read-through cache layer should not be acting like this on a static file.

2- What ACME client are you using? If Certbot, which methods/plugins are you using? Some clients/plugins/methods integrate a self-check before triggering a validation. IIRC, Certbot had that in the webroot code. A self-check should prime the cache before LetsEncrypt is triggered.
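As a rough sketch of what such a self-check amounts to (placeholder domain and token), fetching the token once yourself right after it's written primes the cache before the validators' near-simultaneous requests arrive:

# One warm-up request so the response is already cached when the real
# validation requests come in
curl -fsS http://www.example.com/.well-known/acme-challenge/TOKEN > /dev/null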

5 Likes

Thanks for your input.

I'm currently working on filing an Apache bug, but I know they're going to want logs from the latest Apache version so it's going to take a bit longer to set up an additional test environment.

Here's what I was using for testing this yesterday (using a temporary duckdns subdomain which I've since deleted):

certbot certonly --staging --dry-run --cert-name test --apache -d certificate-test.duckdns.org

Apache logs follow. Note: the final field on each line (e.g. 87) is the response body size in bytes, and the next-to-last field (e.g. 5656) is the response processing time in microseconds.
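(For reference, a LogFormat along these lines would produce that layout; this is illustrative rather than the exact directive in use. %D is the request duration in microseconds and %b is the response body size in bytes, logged as "-" when no body bytes were sent.)

LogFormat "[%{%Y-%m-%d/%H:%M:%S}t] %v %a \"%r\" %>s \"%{Referer}i\" \"%{User-Agent}i\" %D %b" acme_debug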

With caching enabled (note no response size on 3rd and 4th hit):

[2024-05-12/16:20:29] certificate-test.duckdns.org 3.137.149.208 "GET /.well-known/acme-challenge/PyNXZDa2-gMGP6yFx8PWvgo8dbTkrtDxIgwd8ZqJduw HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 5656 87
[2024-05-12/16:20:29] certificate-test.duckdns.org 13.50.109.217 "GET /.well-known/acme-challenge/PyNXZDa2-gMGP6yFx8PWvgo8dbTkrtDxIgwd8ZqJduw HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 1752 87
[2024-05-12/16:20:30] certificate-test.duckdns.org 54.255.229.102 "GET /.well-known/acme-challenge/PyNXZDa2-gMGP6yFx8PWvgo8dbTkrtDxIgwd8ZqJduw HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 2291 -
[2024-05-12/16:20:30] certificate-test.duckdns.org 66.133.109.36 "GET /.well-known/acme-challenge/PyNXZDa2-gMGP6yFx8PWvgo8dbTkrtDxIgwd8ZqJduw HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 1775 -

With caching of /.well-known/acme-challenge/ disabled (all good):

[2024-05-12/16:16:01] certificate-test.duckdns.org 13.50.109.217 "GET /.well-known/acme-challenge/BtV0g096O8yER6Cc9XINYvu7xNSFDEyniujVRVcq4TI HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 6226 87
[2024-05-12/16:16:01] certificate-test.duckdns.org 3.137.149.208 "GET /.well-known/acme-challenge/BtV0g096O8yER6Cc9XINYvu7xNSFDEyniujVRVcq4TI HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 1531 87
[2024-05-12/16:16:02] certificate-test.duckdns.org 35.89.137.209 "GET /.well-known/acme-challenge/BtV0g096O8yER6Cc9XINYvu7xNSFDEyniujVRVcq4TI HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 1876 87
[2024-05-12/16:16:02] certificate-test.duckdns.org 66.133.109.36 "GET /.well-known/acme-challenge/BtV0g096O8yER6Cc9XINYvu7xNSFDEyniujVRVcq4TI HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 2121 87
[2024-05-12/16:16:02] certificate-test.duckdns.org 54.255.229.102 "GET /.well-known/acme-challenge/BtV0g096O8yER6Cc9XINYvu7xNSFDEyniujVRVcq4TI HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 1677 87

My cache configuration (the CacheDisable line is commented in or out as needed for testing):

# Shared-memory (shmcb) cache of roughly 4 GB for HTTP responses
CacheSocache "shmcb:mycache(4294967294)"
CacheEnable socache /
# Workaround: uncomment to exclude ACME challenge responses from the cache
#CacheDisable /.well-known/acme-challenge/
# Add X-Cache / X-Cache-Detail headers to responses for debugging
CacheHeader on
CacheDetailHeader on
# Run the cache in the normal handler phase instead of the quick handler
CacheQuickHandler off
# Only cache objects up to ~1 MB
CacheSocacheMaxSize 1000000

1 Like

Have you tried using a smaller number?

3 Likes

This is annoying....

I was going to suggest that you debug the simple_verify command, which should perform a local self-check of the challenge URL before the real validation.

But then I found this ticket - acme library HTTP01Response.simple_verify should not verify host certificate · Issue #6614 · certbot/certbot · GitHub - which suggests simple_verify is no longer used by Certbot and just remains as a legacy option for other clients that use the codebase.

That would explain why the cache isn't being primed before validation. The only hook that could be used is --manual-auth-hook, but that is not really compatible with this setup.

Maybe someone more familiar with Certbot can chime in here. The former pre-flight validation check would have primed your cache and avoided the stampede/dogpile effect in your caching layer.

4 Likes

I'd like to suggest moving these quite specific implementation issues/bugs into their own separate thread instead of "polluting" this Wiki.

5 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

We have just made a change that will do primary validation first, and only do secondary validation once it (and CAA checking) has succeeded. While we did this for other reasons, it might help in this case too. (I'll post an API announcement shortly, once proofread by a coworker)

7 Likes

This topic was automatically closed after 17 days. New replies are no longer allowed.