IETF ACME: DNS challenge fallback

Hello,

I’m working on an application that needs to try HTTP-based authz first; then, if that fails, it’ll try DNS-based authz. (This would be after having locally confirmed that HTTP-based authz should work; the problem is that there are cases where local HTTP authz passes but a remote HTTP authz check still fails.)

What I’m seeing (against the v2 API), though, is that once the HTTP authz fails, the request to accept the DNS challenge fails because “Unable to update challenge :: authorization must be pending”.

I don’t see anything in RFC 8555 concerning switching challenge methods after one fails; there’s a section on retrying a challenge, but it doesn’t seem to envision (or to proscribe) trying a different challenge method after one has already failed.

One way around this would be to create a new certificate order just to get a new authz. But that seems pretty inefficient; I’m hoping there’s a way to reuse the existing authz (and order).
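
For readers following along, the workaround described above might look roughly like the sketch below. The two callbacks, `new_order` and `attempt`, are hypothetical stand-ins for whatever the ACME client library provides; they are not taken from any real library.

```python
from typing import Callable, Optional, Sequence


def authorize_with_fallback(
    new_order: Callable[[str], str],
    attempt: Callable[[str, str], bool],
    domain: str,
    methods: Sequence[str] = ("http-01", "dns-01"),
) -> Optional[str]:
    """Try each challenge type in turn, using a fresh order per attempt.

    new_order(domain) -> order id                  (hypothetical callback)
    attempt(order_id, method) -> True if validated (hypothetical callback)

    Returns the challenge type that succeeded, or None if all failed.
    """
    for method in methods:
        # A failed challenge leaves the old authz invalid, so each attempt
        # starts from a new order, which is expected to yield a pending authz.
        order_id = new_order(domain)
        if attempt(order_id, method):
            return method
    return None
```

Each failed attempt here costs one extra order, which is the inefficiency mentioned above.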

Thank you!

I think that pre-authorization (https://tools.ietf.org/html/rfc8555#section-7.4.1) can enable this kind of thing, but it’s not implemented by Let’s Encrypt.

In the real world, I have seen some applications (like CertifyTheWeb IIRC? maybe it was another client) host an external service to check the accessibility of the challenge. But there are privacy/data-sharing and operational consequences to doing that.

Pre-authz would be fantastic, but it shouldn’t be necessary for this, I wouldn’t think; I just need to use a different challenge from one that’s already failed.

Given that LE does allow reuse of an already-attempted challenge, I don’t see why it would be problematic to accept a DNS challenge after an HTTP challenge for the same authz has failed. (Or vice-versa, for that matter.)

Authz re-use is still subject to “State Transitions for Authorization Objects”.

You can re-use a valid authz because it does not require a state transition (it’s in a useful final state).

You can re-use a pending authz because it is not in a final state.

You can’t re-use an invalid authz because it is in an unusable final state.

You can’t re-use a failed authz because it is in an unusable final state.

Reuse of a failed challenge constitutes reuse of a failed authz, though, and that’s allowed.

I’m not sure I understand.

From what I can tell, invalid is a final state for every object type (order, authz, challenge).
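
To make the state discussion concrete, here is a toy model (not Boulder's code) of the rule in RFC 8555 section 7.1.6: a failed challenge attempt moves the whole authorization to invalid, after which no other challenge on it can be accepted. The class and method names are made up for illustration, and the error string simply mirrors the message quoted earlier in the thread.

```python
class Authz:
    """Toy authorization object; only the pending -> valid/invalid step is modeled."""

    def __init__(self, challenge_types):
        self.status = "pending"
        self.challenges = {t: "pending" for t in challenge_types}

    def respond(self, challenge_type, validation_ok):
        # A challenge response is only accepted while the authz is pending.
        if self.status != "pending":
            raise ValueError("authorization must be pending")
        result = "valid" if validation_ok else "invalid"
        self.challenges[challenge_type] = result
        # The authz takes on the outcome of the attempted challenge.
        self.status = result


authz = Authz(["http-01", "dns-01"])
authz.respond("http-01", validation_ok=False)  # authz is now invalid
try:
    authz.respond("dns-01", validation_ok=True)
except ValueError as err:
    print(err)
```

Under this model the DNS challenge is rejected regardless of whether it would have validated, which matches the behavior reported in the first post.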

From the RFC (8.2, last paragraph):

Clients can explicitly request a retry by re-sending their response
to a challenge in a new POST request (with a new nonce, etc.). This
allows clients to request a retry when the state has changed (e.g.,
after firewall rules have been updated). Servers SHOULD retry a
request immediately on receiving such a POST request. In order to
avoid denial-of-service attacks via client-initiated retries, servers
SHOULD rate-limit such requests.

Ah, true (though also not implemented by Let’s Encrypt).

I guess you are right - if retries were implemented, the authz would need to “stay open”, which would enable switching between challenges.

Would be interesting to hear what the Boulder developers think.

(though also not implemented by Let’s Encrypt)

Doh! Well, at least that’s consistent.

(For anyone watching, citation: https://github.com/letsencrypt/boulder/blob/master/docs/acme-divergences.md)

We have no plans to implement retrying failed authorizations.

Can you expand on this requirement? If DNS-01 is available and you can't verify HTTP-01 will succeed reliably before POSTing the challenge, why not just use DNS-01 by default?

You could also try a proper dry-run with the staging environment. If HTTP-01 fails in staging your code can use the DNS-01 challenge immediately when it requests a prod certificate.

If DNS-01 is available and you can’t verify HTTP-01 will succeed reliably before POSTing the challenge, why not just use DNS-01 by default?

At least two reasons:

  • DNS changes are much slower in our environment than HTTP changes.
  • Not all of our users host their DNS such that the ACME-client server can make the necessary changes to DNS.

You could also try a proper dry-run with the staging environment. If HTTP-01 fails in staging your code can use the DNS-01 challenge immediately when it requests a prod certificate.

That’s what we do, but there are still cases where HTTP works in the staging phase but breaks when talking to LE. (mod_rewrite is a particular nuisance in this regard.) Rather than just failing those requests, it’s more helpful to the user to retry the authz using DNS.

Have you considered having customers delegate the DNS-01 challenges to a DNS zone (perhaps run by your employer) that supports the necessary changes and provisioning the TXT records for the DNS-01 challenges there? All that requires on the user side is a (one-time) CNAME (I believe adding CNAMEs is widely available in DNS hosts) and the changes could happen as fast as you make them.
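
As a rough illustration of what the delegated setup involves: the TXT value for DNS-01 is the unpadded base64url encoding of the SHA-256 digest of the key authorization (RFC 8555 section 8.4), and the record lives at `_acme-challenge.<domain>`. The CNAME target naming scheme and the `acme-validation.example.net` zone below are assumptions made up for this sketch, and wildcard names are not handled.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url without padding, as ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """TXT record value for a DNS-01 challenge."""
    key_authorization = f"{token}.{account_thumbprint}"
    return b64url(hashlib.sha256(key_authorization.encode("ascii")).digest())


def challenge_record_name(domain: str) -> str:
    """Name at which the CA looks up the TXT record."""
    return f"_acme-challenge.{domain}"


def delegation_cname(domain: str, validation_zone: str) -> tuple:
    """One-time CNAME the customer adds: (record name, target).

    The target layout is hypothetical; any unique name in a zone
    the ACME client can update would do.
    """
    return (challenge_record_name(domain), f"{domain}.{validation_zone}")


name, target = delegation_cname("shop.example.com", "acme-validation.example.net")
print(f"{name} CNAME {target}")
```

After the CNAME exists, the client only ever writes TXT records under the target name in its own zone.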

I'm not sure I understand. You're saying you create an order with https://acme-staging-v02.api.letsencrypt.org/directory, POST the HTTP-01 challenge, it succeeds, but changing to the production endpoint doesn't succeed and the root cause is something related to mod_rewrite on the user side?

All that requires on the user side is a (one-time) CNAME (I believe adding CNAMEs is widely available in DNS hosts) and the changes could happen as fast as you make them.

This sounds like every single LE authz would require a “phone-home” RPC call (possibly batched)? That would be a significant infrastructural undertaking.

POST the HTTP-01 challenge, it succeeds, but changing to the production endpoint doesn’t succeed

I misunderstood your meaning; by “staging” I thought you meant a generic local pre-authz check. We don’t reach out to LE’s staging server in production, no. Is LE’s staging environment intended for production use in this way?

I obviously don't know what your customer base looks like but I don't think it's a particularly sophisticated engineering problem, just one that requires an investment of time/energy. Processing a single "add TXT" request per-identifier per-customer every 30-60 days and managing a DNS zone doesn't seem insurmountable. It would also fully resolve problems related to HTTP-01 in distinct customer environments without imposing on their existing DNS hosts. (It would also be a significant infrastructural undertaking to implement retries for already failed authorizations in Boulder based on its current architecture.)

That's up to you. We don't publish an SLA for either of our environments. We do experiment in staging and it does occasionally result in bugs that are not present in production. It's also where we roll out new features like multiple perspective validation before moving them to prod. My recommended approach (if you're unwilling/unable to use DNS-01 with some sort of delegation) would be to use an HTTP-01 validation in staging as a heuristic only. If it fails because the environment is unresponsive or returns an error outside of typical validation failures you could ignore the result and choose to either do an HTTP-01 in prod anyway or use DNS-01.
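
One possible reading of that heuristic, as a sketch only: the three outcome labels and the choice to return `None` when HTTP-01 failed in staging and DNS-01 is unavailable are assumptions for illustration, not anything Let's Encrypt prescribes.

```python
from typing import Optional


def choose_prod_challenge(staging_outcome: str, dns_available: bool) -> Optional[str]:
    """Pick the challenge type to use against production.

    staging_outcome (hypothetical labels):
      "valid"             - HTTP-01 validated in staging
      "validation_failed" - staging reported an ordinary validation failure
      "inconclusive"      - staging was unresponsive or returned an
                            unrelated error, so the result is ignored
    """
    if staging_outcome == "valid":
        return "http-01"
    if staging_outcome == "validation_failed":
        # HTTP-01 will probably fail in production too; skip straight to DNS-01.
        return "dns-01" if dns_available else None
    if staging_outcome == "inconclusive":
        # Staging told us nothing useful; fall back to the default choice.
        return "http-01"
    raise ValueError(f"unknown staging outcome: {staging_outcome!r}")
```

Because staging has no published SLA, the "inconclusive" branch matters: a staging outage should not block production issuance.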

Alternatively you could invest in running your own Boulder stack for pre-flight validation checks in place of our staging environment and own the availability of that environment. It's an open source codebase and other large integrators have taken this route in the past. Of course that's also an infrastructural undertaking.


There are internal organizational hurdles that a “phone-home” workflow would entail. I suspect that’d be our biggest impediment there. But I’ll mention the idea and see what happens.

I definitely appreciate that making Boulder implement retries would be a significant undertaking on LE’s end, and I hope I didn’t come across as suggesting otherwise. Until last night I just didn’t know about this behavior.

I’ll see what others on my team think about using the staging environment as a heuristic.

