If Let's Encrypt eventually ends up with 20 validation locations, will it really use all 20 on every HTTP-01 request, or will it use a randomized subset with a more reasonable value, like 5?
Currently, all 5 validation connections hit my server almost simultaneously. Was any consideration given to spacing them out maybe 1 second apart, or even half a second, for the sake of the servers?
After the number of validation connections was increased to 3 then 4 then 5, I encountered an apparently undiscovered bug with Apache's mod_cache_socache module (RAM caching of HTTP responses). It seems that when such caching is enabled, and Apache is slammed with many almost-simultaneous requests for the same uncached file, it will respond to at least one of them (usually the 3rd) with a zero-byte response body. Seems like a race condition due to identical requests continuing to come in while it's still attempting to place the first response into cache. And thus the whole HTTP-01 challenge is failed because of one bad response.
I've already found a decent fix/workaround, placing CacheDisable /.well-known/acme-challenge/ in my configuration, and I'm okay with leaving this in place permanently, but I'm concerned this might trip up others so I'm posting about it here in case anyone else encounters this issue.
I know this is an Apache problem and I'll attempt to raise a bug with them. But staggering the validation connections even by half a second would probably avoid this & similar issues that could be lurking out there. Also of concern are low-end servers that might not be able to handle many simultaneous connections in general, especially if the current "5" continues growing.
It's too early to tell, but that's the approach we're thinking about. If the draft ballot passes as is, then the subset would likely be either 6 (1 primary + 5 remote) or 7 (1 primary + 6 remote).
I appreciate the information. Has there ever been discussion regarding spacing out the validation connections a bit so they don't all hit the server almost-simultaneously?
That's my curiosity, too: is it a good thing to slow down issuance like that? The validation won't proceed until the client indicates it is ready, so it seems like it should be able to service that tiny amount of traffic. Artificial delays seem like they should be unnecessary.
By contrast, the associated DNS queries (e.g. for CAA) have apparently triggered some anti-DDoS code, if I'm remembering those threads correctly! (Although I don't think that the servers literally had any trouble with capacity to answer the requests.)
That is commonly known as a dogpile or cache stampede effect. A popular workaround is for the first response to generate a cache lock, and have the subsequent requests either serve the old value or block & poll until the new value populates.
While I do think ISRG/LetsEncrypt should consider spacing requests to help deal with misconfigured servers like this, I think there are two paths you should take:
1- File a bug report against that Apache module, and cite this as a use-case and reproduction method. The read-through cache layer should not be acting like this on a static file.
2- What ACME client are you using? If Certbot, which methods/plugins are you using? Some clients/plugins/methods integrate a self-check before triggering a validation. IIRC, Certbot had that in the webroot code. A self-check should prime the cache before LetsEncrypt is triggered.
I'm currently working on filing an Apache bug, but I know they're going to want logs from the latest Apache version so it's going to take a bit longer to set up an additional test environment.
Here's what I was using for testing this yesterday (using a temporary duckdns subdomain which I've since deleted):
certbot certonly --staging --dry-run --cert-name test --apache -d certificate-test.duckdns.org
Apache logs -- note that the final field on each line (e.g. 87) is the response body size in bytes, and the next-to-last field (e.g. 5656) is the response processing time in microseconds.
With caching enabled (note no response size on 3rd and 4th hit):
My cache configuration (CacheDisable is toggled on or off as needed for testing):
CacheSocache "shmcb:mycache(4294967294)"
CacheEnable socache /
#CacheDisable /.well-known/acme-challenge/
CacheHeader on
CacheDetailHeader on
CacheQuickHandler off
CacheSocacheMaxSize 1000000
That would explain why your cache isn't being primed. The only hook that could be used is --manual-auth-hook, but that is not really compatible with this setup.
Maybe someone more familiar with Certbot can chime in here. The former pre-flight validation check would have primed your cache and avoided the stampede/dogpile effect in your caching layer.
We have just made a change that will do primary validation first, and only do secondary validation once it (and CAA checking) has succeeded. While we did this for other reasons, it might help in this case too. (I'll post an API announcement shortly, once proofread by a coworker)