Temporary CAA failures in top-level domain DNS

I recently saw an authorization rejected by Boulder with an error while checking CAA for “me” (i.e. the top-level domain for Montenegro).
I think it would be great if Let’s Encrypt retried things like this, which are firmly outside the user’s control, a bit more before giving up. As far as I understand, best practice is not to automatically retry authorizations on challenge failure, especially now that ACMEv2 makes it impossible to simply retry an authorization and requires a whole new certificate order.

Is this something that can be improved in boulder, or are clients supposed to be able to detect specific errors and automatically retry? Having that logic in clients seems more likely to go wrong in ways that create unneeded load on the API.

For reference the exact error message was:

{
  "identifier": {
    "type": "dns",
    "value": "www.XXXX.me"
  },
  "status": "invalid",
  "expires": "2020-08-20T05:04:28Z",
  "challenges": [
    {
      "type": "dns-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:dns",
        "detail": "During secondary validation: DNS problem: query timed out looking up CAA for me",
        "status": 400
      },
      "url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/6502528531/pjFM2Q",
      "token": "XXXXXX",
      "validationRecord": [
        {
          "hostname": "www.XXXX.me"
        }
      ]
    }
  ]
}
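For what it’s worth, the error object above is structured JSON, so a client can at least read the machine-readable type field without matching on the free text. A minimal sketch in Python (the helper name is my own, just for illustration):

```python
def challenge_errors(authz: dict):
    """Yield (error type, detail) for every failed challenge in an authorization object."""
    for chal in authz.get("challenges", []):
        err = chal.get("error")
        if err is not None:
            yield err.get("type", ""), err.get("detail", "")

# The failed authorization from above, trimmed to the relevant fields:
authz = {
    "status": "invalid",
    "challenges": [
        {
            "type": "dns-01",
            "status": "invalid",
            "error": {
                "type": "urn:ietf:params:acme:error:dns",
                "detail": "During secondary validation: DNS problem: "
                          "query timed out looking up CAA for me",
                "status": 400,
            },
        }
    ],
}

for etype, detail in challenge_errors(authz):
    print(etype)  # urn:ietf:params:acme:error:dns
```

The trouble, of course, is that `urn:ietf:params:acme:error:dns` covers both a one-off timeout and a permanently broken zone; the type alone doesn’t say whether retrying makes sense.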

In one real respect, it does get retried: not inside Boulder, but inside the DNS recursor itself, which is resilient to some level of “internet weather”. (Boulder does also retry some DNS lookups itself if the failure is detected as a “temporary” network error.)

I guess the question there is: within the 10 seconds available to make the lookup, how likely is it that another 10 seconds would make any difference? I don’t think there’s a great way to differentiate a fluke from an issue that will persist.

I believe that Sectigo’s ACME server actually does retry DNS challenges for some duration of time; you can observe a very protracted processing state. So it is doable at some scale. It would be good to hear Let’s Encrypt’s thoughts on that.

From what I’ve seen, TLD issues like this have been pretty rare. When you also consider that renewal attempts usually begin quite far ahead of the expiry date, the overall danger seems low. Has this been causing problems for you?

Hi @Martin2

This looks like a temporary problem. “check your website” has checked a lot of .me domains without problems checking CAA for .me (last sample: https://check-your-website.server-daten.de/?q=ironman.merali.me#caa ).

Temporary problem -> your client tries again some hours later, job done.

You can also create a CAA record on your own domain name; the CAA tree-climb then stops at your zone, so the .me servers don’t need to be queried.

PS: Conclusion: not really a problem.

You know, this looks like a good reason why the common advice on renewal cron and timers is to run them twice a day and not once a week :wink:


Yep, with that, such a temporary failure isn’t a problem. It may happen sometimes.

I’m not convinced that this is not a problem. It means that even if the client is careful to call the API only when it expects everything to be properly set up, it still needs to implement automatic retries with complicated rules.

So far my expectation was that retries for internal errors (urn:ietf:params:acme:error:serverInternal) are really all that should be needed when things are properly set up, leaving other situations for human intervention.
But if this is considered not to be a problem, the client needs either to parse the free-text parts of the error message or to retry blindly.
For renewals, trying a few hours later is OK, but for new websites from our customers that’s not the time frame we have, so we need to retry more rapidly. Given how ACMEv2 is implemented in Let’s Encrypt, without the possibility of easy retries, that seems like unneeded strain on community resources and on our quotas.
Of course, if Let’s Encrypt prefers automatic retries (likely also in cases where there is an actual reason the first try didn’t work) over spending developer time (always too little of that, I understand) on things like this, that is a possible position as well. In that case the client implementation would need to prefer a few(?) retries over escalating the problem for human intervention. Of course, if something is actually broken at some scale, that could increase load on the ACME API if every automated client worked like that.


If this is a complication, your setup is wrong.

See Certbot: run it twice per day. If the first try fails because of such a random error, the next try will work (OK, 99.9 %). Or run it 4-6 times per day; that’s not a problem.

You must always check whether the certificate was renewed. But that’s possible by checking the running webserver; it’s not required to check / parse the output of your client.

That’s always wrong. A new customer can wait some hours. If not, he should have come three days earlier.

In this situation, the problem is that the upstream DNS (the TLD’s nameservers) has a hard failure and is either offline or broken. This type of problem should not be automatically retried, and it is very rare.

Consider the implications of automatic retries for this problem with a Certbot installation that is scheduled to renew multiple certificates on a given day. A single 10-second retry could cause massive bottlenecks if the affected TLD appeared in multiple certificates. If a certificate had multiple domains within this TLD, that would cause further delays. If there were multiple retries per outage, the bottlenecks would explode.

There is no guarantee, or even reason to believe, that an error like this would be resolved within seconds. While it would be more convenient for a small minority of use cases to change this behavior, that change has the capacity to create massive inconveniences for all users.

The Certbot client could definitely use better explanations of the various errors that it presents to users - but this situation appears to be handled correctly by Certbot and Boulder. When there is an outage like this:

  1. the order should fail;
  2. a human or program should note the failure and address it;
  3. when the outage has resolved, a new order should be created and validated.
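Those steps amount to: never reuse a failed order, note the failure, and create a brand-new order once the outage is over. A rough sketch of that loop (all names here are illustrative, not any real client’s API):

```python
import time

def obtain_certificate(create_order, max_attempts=3, delay_seconds=3600):
    """Run the fail -> note -> re-order loop; each attempt is a fresh order."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        # Hypothetical helper: performs a complete new ACME order and
        # returns (success, certificate-or-error-description).
        ok, result = create_order()
        if ok:
            return result
        failures.append(result)  # step 2: record the failure for a human or a log
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # step 3: wait out the outage, then re-order
    raise RuntimeError(f"all {max_attempts} orders failed: {failures}")
```

With `delay_seconds` on the order of hours, this matches the “run your client a few times per day” advice earlier in the thread.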

At first, I also did not like the ACME v2 choice of making all errors fatal and having those failures “bubble up” from challenges to authorizations to orders. The more I used the ACME API, the more I appreciated these design choices.

The one concern I have here, though, mostly for @_az: shouldn’t there be a test case for this in Boulder, just to ensure the right message is being generated for the right scenario? This scenario and message aren’t tested yet, and it seems worth a quick mocked test case.

To be honest, I think that interpreting error types, while appealing on the surface, is a bit of a misstep. My own experience is that it inevitably leads to clients overfitting to idiosyncrasies of Boulder. That is a very fragile practice; things change all the time. There are also one or two error classifications in Boulder today which (IMO) don’t make sense and are only that way for reasons relating to how things were coded, basically.

There are some words about it under https://letsencrypt.org/docs/integration-guide/#retrying-failures, but more generally, I think a reliable pattern is to implement back-off based on the certificate FQDN set, and always leave interpretation up for humans.
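The back-off pattern from that guide, keyed by the order’s FQDN set, could look roughly like this (the class and its parameters are my own sketch, not from the guide):

```python
import time
from collections import defaultdict

class FqdnSetBackoff:
    """Exponential back-off keyed by the sorted set of names on an order."""

    def __init__(self, base=60, cap=86400):
        self.base = base            # first delay after a failure, in seconds
        self.cap = cap              # never back off longer than this
        self.failures = defaultdict(int)
        self.next_try = {}

    def _key(self, names):
        # Order and case of the names don't matter; the set does.
        return tuple(sorted(n.lower() for n in names))

    def may_attempt(self, names, now=None):
        now = time.time() if now is None else now
        return now >= self.next_try.get(self._key(names), 0)

    def record_failure(self, names, now=None):
        now = time.time() if now is None else now
        k = self._key(names)
        self.failures[k] += 1
        delay = min(self.base * 2 ** (self.failures[k] - 1), self.cap)
        self.next_try[k] = now + delay

    def record_success(self, names):
        k = self._key(names)
        self.failures.pop(k, None)
        self.next_try.pop(k, None)
```

The delay doubles per consecutive failure of the same FQDN set and resets on success; the human stays responsible for interpreting *why* it failed.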

What do you mean? For an opaque DNS error, or for the public suffix failing a CAA lookup?

We use a custom client and, during the ACMEv2 transition, have focused on making sure everything is properly set up and working before letting Boulder start running its checks, in the hope of avoiding trouble and unneeded resource usage. Thus we expected to get away with little to no outright restarting of whole orders.

And having smarter retries that automatically guess, based on detailed rules, whether a failure is likely to fix itself or is something we need to investigate has to be more complicated than just retrying on “error:serverInternal”.

Sure, checking whether the client successfully got a certificate needs only parsing of well-specified parts of ACME. Interpreting the error messages from Boulder is what I’ve come to conclude needs heuristic text matching, which I’d rather not do; but we do have to resort to that with lots of other suppliers too.

Yes, that’s why I prefer suppliers that are resilient against transient hiccups. Boulder has mostly been quite sensible software to interface with. But whenever standards are involved, it seems error handling is too diverse to be reasonably captured by them (though EPP is much, much worse than ACME).

That obviously works way better if you are the only CA that offers free certificates than it does for us. Also, offering the best service possible is generally something we deeply care about. The whole point of this discussion is how to do that while being respectful of the Let’s Encrypt resources we use to accomplish it. (While our volume is not huge, it’s already hundreds of certificates a day, and we expect to grow over time, as everybody does.)

We’re currently leaving all retry decisions to humans after our internal prevalidation. But that does not work well with hundreds of orders a day, so we are working on a sensible retry strategy. I’m coming to the conclusion that the general sentiment seems to be that “sensible” means blindly retrying a few times and only then escalating to human intervention, because the error messages are only intended to be consumed by humans.

Yes, trying to work around something like this in Certbot would be the wrong place, because Certbot is too synchronous. Boulder itself could retry more efficiently, though, and fortunately our custom client can also retry without slowing down other certificates too much.

From my understanding of Boulder, the latter; the CAA lookup appears to just surface a generic opaque DNS error from the third-party library. Shouldn’t there be a test case to ensure we continue to generate this same error during these outages? The concern is that a version bump of the third-party library could handle this error differently, and the test suite wouldn’t pick that up.
