Here’s a curiosity I stumbled upon today while doing the following:
1. Request a challenge.
2. Trigger the DNS-01 challenge.
3. Provide a DNS record with a 10s TTL.
4. Request the certificate.
5. Deactivate the authorization.
6. Before the TTL expires, request a new challenge and trigger it.
The server will then complain that the DNS data doesn’t match, because it still has the old record cached.
Is this intentional? Shouldn’t the cache be cleared when deactivating an authorization? Should DNS-01 requests be cached at all? What is the purpose of caching a one-time challenge-response?
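To make the suspected race concrete, here is a toy simulation of a resolver that caches TXT answers for their full TTL. It is only a sketch of the behaviour described above, not the CA's actual resolver; the class, the function, and the example.com name are all made up for illustration.

```python
class CachingResolver:
    """Toy resolver that caches TXT answers until their TTL expires (hypothetical)."""

    def __init__(self, authoritative, clock):
        self.authoritative = authoritative  # name -> (txt_value, ttl_seconds)
        self.clock = clock                  # callable returning the current time in seconds
        self.cache = {}                     # name -> (txt_value, expires_at)

    def lookup_txt(self, name):
        cached = self.cache.get(name)
        if cached is not None and self.clock() < cached[1]:
            return cached[0]                # still fresh: the zone is not consulted
        value, ttl = self.authoritative[name]
        self.cache[name] = (value, self.clock() + ttl)
        return value


def simulate(seconds_between_challenges, ttl=10):
    """Return the TXT value the resolver sees when validating the second challenge."""
    now = [0.0]
    name = "_acme-challenge.example.com"    # placeholder domain
    zone = {name: ("token-1", ttl)}
    resolver = CachingResolver(zone, clock=lambda: now[0])

    resolver.lookup_txt(name)               # first validation populates the cache
    now[0] += seconds_between_challenges
    zone[name] = ("token-2", ttl)           # record updated for the new challenge
    return resolver.lookup_txt(name)        # second validation
```

With these assumptions, `simulate(5)` returns the stale `"token-1"`, while `simulate(15)` returns `"token-2"` because the cached entry has expired by then.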
I have lowered my TTL to 1s now (I’m hesitant to use 0s).
Yes, staging server and the same domain names (actually two: mytestdomain and www.mytestdomain). I have managed to get a couple of failures, though most attempts work fine.
Can you reproduce it if you stay under 10s reliably? Maybe the server uses a max TTL. I never had any success with a 10s TTL and two successive triggers. It reliably fails with:
"error" => {
"detail" => "Correct value not found for DNS challenge",
"status" => 403,
"type" => "urn:acme:error:unauthorized"
},
I can get the error if I go for very short timescales. It doesn’t seem to be the TTL though: I get the same error message if I just re-request within a very short period (less than 10 seconds).
If I’m slower (as in the runs below), I’m always successful:
2016-11-15 18:16:29 Registering account
2016-11-15 18:16:31 Verify each domain
2016-11-15 18:16:31 Verifying mytestdomain.com
2016-11-15 18:16:34 Verifying www.mytestdomain.com
2016-11-15 18:16:37 checking DNS at mimi.ns.cloudflare.com for www.mytestdomain.com. Attempt 1/100 gave wrong result, waiting 5 secs before checking again
2016-11-15 18:16:44 Verified mytestdomain.com
2016-11-15 18:16:48 Verified www.mytestdomain.com
2016-11-15 18:16:49 Verification completed, obtaining certificate.
2016-11-15 18:16:51 Certificate saved in /home/andy/.getssl/mytestdomain.com/mytestdomain.com.crt
2016-11-15 18:16:52 The intermediate CA cert is in /home/andy/.getssl/mytestdomain.com/chain.crt
2016-11-15 18:16:53 deactivating domain mytestdomain.com
2016-11-15 18:16:55 deactivating domain www.mytestdomain.com
getssl: mytestdomain.com - certificate obtained but certificate on server is different from the new certificate
$ getssl mytestdomain.com -f
2016-11-15 18:17:42 Registering account
2016-11-15 18:17:44 Verify each domain
2016-11-15 18:17:44 Verifying mytestdomain.com
2016-11-15 18:17:46 Verifying www.mytestdomain.com
2016-11-15 18:17:49 checking DNS at mimi.ns.cloudflare.com for mytestdomain.com. Attempt 1/100 gave wrong result, waiting 5 secs before checking again
2016-11-15 18:17:54 checking DNS at mimi.ns.cloudflare.com for mytestdomain.com. Attempt 2/100 gave wrong result, waiting 5 secs before checking again
2016-11-15 18:18:02 Verified mytestdomain.com
2016-11-15 18:18:06 Verified www.mytestdomain.com
2016-11-15 18:18:08 Verification completed, obtaining certificate.
2016-11-15 18:18:10 Certificate saved in /home/andy/.getssl/mytestdomain.com/mytestdomain.com.crt
2016-11-15 18:18:10 The intermediate CA cert is in /home/andy/.getssl/mytestdomain.com/chain.crt
2016-11-15 18:18:11 deactivating domain mytestdomain.com
2016-11-15 18:18:14 deactivating domain www.mytestdomain.com
getssl: mytestdomain.com - certificate obtained but certificate on server is different from the new certificate
Mine is slightly slower anyway, since I have to wait for the Cloudflare DNS servers to update (hence the 5-second pauses). I’m using a TTL of 300 seconds though, which is why I don’t think it’s TTL related.
If I try to complete the second request within 10 seconds (and the Cloudflare servers have responded quickly with the correct result), then I do get the same error as you.
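The "Attempt 1/100 gave wrong result, waiting 5 secs" lines in the logs above come from the client polling the authoritative nameserver before it triggers validation. A rough sketch of that kind of loop follows; this is not getssl's actual code, and `wait_for_txt` and the injected `lookup` callable are hypothetical names. `lookup(name)` is assumed to return the list of TXT strings currently served for `name`.

```python
import time


def wait_for_txt(lookup, name, expected, attempts=100, delay=5.0, sleep=time.sleep):
    """Poll until `expected` appears among the TXT values for `name`.

    Returns the attempt number that succeeded and raises TimeoutError
    if the value never shows up within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if expected in lookup(name):
            return attempt
        if attempt < attempts:
            sleep(delay)                    # give the DNS provider time to publish
    raise TimeoutError(f"{name} never returned the expected TXT value")
```

Note that a check like this only confirms what the authoritative server publishes. It says nothing about what a caching resolver on the validating side still holds, which would fit the observation that the error can appear even when Cloudflare already returns the correct value.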
I was under the impression that Let’s Encrypt’s Unbound instance doesn’t do any caching, but it’s possible there’s a short minimum TTL (maybe 60s?), which is often used as a defense-in-depth measure against rebinding attacks (boulder generally pins IPs once they’re resolved, but there’s always the chance you forget that somewhere).
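For illustration, this is how a resolver-side TTL floor or ceiling would change the lifetime of a cached answer. The function and its defaults are hypothetical, and the 60s floor used in the examples is only the guess from the paragraph above, not a known production setting.

```python
def effective_ttl(record_ttl, min_ttl=0, max_ttl=86400):
    """Clamp the TTL published in the zone to the resolver's configured bounds."""
    return max(min_ttl, min(record_ttl, max_ttl))
```

Under a 60s floor, `effective_ttl(10, min_ttl=60)`, `effective_ttl(1, min_ttl=60)` and `effective_ttl(0, min_ttl=60)` all come out as 60, so lowering the record's TTL would not help. Under a 10s ceiling, `effective_ttl(300, max_ttl=10)` is 10, so a 300s record would be cached no longer than a 10s one.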
Due to #2326 it’s hard to say whether the issue here is boulder actually getting the wrong TXT record or no record at all, but the fact that tcpdump doesn’t show any requests suggests it’s the former (due to caching).
That’s interesting. The only explanation I can think of (if we assume these observations are correct) would be a maximum TTL of something like 10 seconds. Not sure why that value would be used though.
Found this comment suggesting the max TTL is 5 minutes:
Slightly confused about why things worked with a 300s TTL in that case. Maybe there are multiple resolvers that don’t share their cache and you got lucky, or maybe the value has been lowered since.
I was puzzled why it works with a much higher TTL, but I suspect this is because of different resolvers? Anyway, it doesn’t really change anything, since I can reliably reproduce the “issue”.
My main point is, if a new challenge comes into existence for a name that previously had a challenge, it should flush all caches for that name. This would solve any race conditions regardless of the TTL used.
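A minimal sketch of what I mean, assuming the validating side keeps some kind of per-name cache. `ChallengeAwareCache` and `new_challenge` are made-up names, and none of this reflects how boulder is actually structured.

```python
class ChallengeAwareCache:
    """Toy TXT cache that can be flushed per name (hypothetical)."""

    def __init__(self, clock):
        self.clock = clock                  # callable returning the current time in seconds
        self.entries = {}                   # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self.entries[name] = (value, self.clock() + ttl)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None or self.clock() >= entry[1]:
            self.entries.pop(name, None)    # drop expired entries
            return None
        return entry[0]

    def flush(self, name):
        self.entries.pop(name, None)


def new_challenge(cache, domain):
    """Issue a new DNS-01 challenge: forget any cached answer for the name first."""
    cache.flush("_acme-challenge." + domain)
```

With the flush in place, the first lookup after a new challenge always goes back to the zone, whatever TTL the previous record carried.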
It isn’t really an issue anyway because I can just use a very low TTL which LE seems to obey. On that note, is it safe to use a TTL of 0? The standards say 0 should mean “never cache”, but is this something that’s on your radar and is it safe to rely on? Or should I use 1s?
Edit: While I don’t expect to handle multiple challenges for the same name in a short time when switching to production, I think it all should work correctly in any case.