If Let's Encrypt eventually ends up with 20 validation locations, will it really use all 20 on every HTTP-01 request, or will it use a randomized subset with a more reasonable value, like 5?
Currently, all 5 validation connections hit my server almost simultaneously. Was any consideration given to spacing them out maybe 1 second apart, or even half a second, for the sake of the servers?
After the number of validation connections was increased to 3 then 4 then 5, I encountered an apparently undiscovered bug with Apache's mod_cache_socache module (RAM caching of HTTP responses). It seems that when such caching is enabled, and Apache is slammed with many almost-simultaneous requests for the same uncached file, it will respond to at least one of them (usually the 3rd) with a zero-byte response body. Seems like a race condition due to identical requests continuing to come in while it's still attempting to place the first response into cache. And thus the whole HTTP-01 challenge is failed because of one bad response.
I've already found a decent fix/workaround, placing CacheDisable /.well-known/acme-challenge/ in my configuration, and I'm okay with leaving this in place permanently, but I'm concerned this might trip up others so I'm posting about it here in case anyone else encounters this issue.
I know this is an Apache problem and I'll attempt to raise a bug with them. But staggering the validation connections even by half a second would probably avoid this & similar issues that could be lurking out there. Also of concern are low-end servers that might not be able to handle many simultaneous connections in general, especially if the current "5" continues growing.
It's too early to tell, but that's the approach we're thinking about. If the draft ballot passes as is, then the subset would likely be either 6 (1 primary + 5 remote) or 7 (1 primary + 6 remote).
I appreciate the information. Has there ever been discussion regarding spacing out the validation connections a bit so they don't all hit the server almost-simultaneously?
That's my curiosity, too: is it a good thing to slow down issuance like that? The validation won't proceed until the client indicates it is ready, so it seems like it should be able to service that tiny amount of traffic. Artificial delays seem like they should be unnecessary.
By contrast, the associated DNS queries (e.g. for CAA) have apparently triggered some anti-DDoS code, if I'm remembering those threads correctly! (Although I don't think that the servers literally had any trouble with capacity to answer the requests.)
That is commonly known as a dogpile or cache stampede effect. A popular workaround is for the first response to generate a cache lock, and have the subsequent requests either serve the old value or block & poll until the new value populates.
While I do think ISRG/LetsEncrypt should consider spacing requests to help deal with misconfigured servers like this, I think there are two paths you should take:
1- File a bug report against that Apache module, and cite this as a use-case and reproduction method. The read-through cache layer should not be acting like this on a static file.
2- What ACME client are you using? If Certbot, which methods/plugins are you using? Some clients/plugins/methods integrate a self-check before triggering a validation. IIRC, Certbot had that in the webroot code. A self-check should prime the cache before LetsEncrypt is triggered.
I'm currently working on filing an Apache bug, but I know they're going to want logs from the latest Apache version so it's going to take a bit longer to set up an additional test environment.
Here's what I was using for testing this yesterday (using a temporary duckdns subdomain which I've since deleted):
certbot certonly --staging --dry-run --cert-name test --apache -d certificate-test.duckdns.org
Apache logs -- note that the final field on each line (e.g. 87) is the response body size in bytes, and the next-to-last field (e.g. 5656) is the response processing time in microseconds.
With caching enabled (note no response size on 3rd and 4th hit):
My cache configuration (CacheDisable is toggled on or off as needed for testing):
CacheSocache "shmcb:mycache(4294967294)"
CacheEnable socache /
#CacheDisable /.well-known/acme-challenge/
CacheHeader on
CacheDetailHeader on
CacheQuickHandler off
CacheSocacheMaxSize 1000000
That would explain why your cache isn't being primed. The only hook that could be used is --manual-auth-hook, but that is not really compatible with this setup.
Maybe someone more familiar with Certbot can chime in here. The former pre-flight validation check would have primed your cache and avoided the stampede/dogpile effect in your caching layer.
We have just made a change that will do primary validation first, and only do secondary validation once it (and CAA checking) has succeeded. While we did this for other reasons, it might help in this case too. (I'll post an API announcement shortly, once proofread by a coworker)