How can we fix a bug in the ACME protocol?

I wish that were true:

The vast majority of those 500 LoC are there to account for the fact that we need to start over after 1 of N challenges fails. The reality is that you CAN'T drop all state; otherwise you'd end up retrying the same challenge that already failed. If port 80 is blocked, for example, you have to remember not to try the http-01 challenge again.

It's a LOT of state management. I am not sure why you think you can just "drop all state" when actually you have to remember everything from the previous order so that you don't encounter the same problem next time.
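To illustrate (a minimal sketch only; the names and structure here are made up, not ACMEz's or Caddy's actual code), the kind of per-identifier memory a client ends up keeping looks roughly like this:

```go
package acmestate

import (
	"sync"
	"time"
)

// challengeMemory remembers, per identifier, which challenge types recently
// failed, so the next attempt doesn't repeat the same failure (e.g. don't try
// http-01 again right away if port 80 turned out to be blocked).
// Hypothetical sketch; not any real client's API.
type challengeMemory struct {
	mu     sync.Mutex
	failed map[string]map[string]time.Time // identifier -> challenge type -> last failure
}

func (m *challengeMemory) markFailed(identifier, challengeType string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.failed == nil {
		m.failed = make(map[string]map[string]time.Time)
	}
	if m.failed[identifier] == nil {
		m.failed[identifier] = make(map[string]time.Time)
	}
	m.failed[identifier][challengeType] = time.Now()
}

// pick returns the first offered challenge type that hasn't failed recently
// for this identifier, or false if they all have.
func (m *challengeMemory) pick(identifier string, offered []string, cooldown time.Duration) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, t := range offered {
		last, tried := m.failed[identifier][t]
		if !tried || time.Since(last) > cooldown {
			return t, true
		}
	}
	return "", false
}
```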

We're not asking to retry failed challenges though. We're asking to try the OTHER challenges that are offered, before the order is closed.

As an ACME client developer who is trying to make the client ecosystem more robust across the board, I would definitely like to have this discussion.

Why open 3 orders when you can open just 1?

Because you don't. The vast majority of clients get by without retaining information from a previous run. In fact, many don't even specifically retry failed orders: they just log, exit 1, and wait for their next cron trigger to try again. It's very cool that Caddy goes to these lengths, but it is not necessary. As I said, most failures are either fundamental (falling back to other methods won't help) or transient (retrying the same method will help). The case you're solving for is a slim minority as far as I'm aware.

Yes, which is why I addressed that in the second half of my message. The spec does not allow for it, the vast majority of clients would break if we tried, and I don't believe this use-case is prevalent enough to be worth updating the spec to account for it. In fact, many clients can't fall back anyway! Many can only do HTTP-01: they aren't configured with credentials to update DNS, and they're not integrated into the server so they can't respond to a TLS-ALPN-01 request.


Thanks for your reply.

Ok, and maybe this emphasizes what the actual goal is here: Can we elevate our vision a little bit? Raise the bar a little higher? The goal for all of us is a robust and resilient ecosystem, not one where we rely on brittle and naive cron jobs for ACME clients that work only in 1/3 use cases. We should not be optimizing for that scenario. That is not the Internet we want to live in.

There's a need here for clients and servers to work together, to cooperate on improving the status quo, so I am hoping there is a way forward instead of a response that tells everyone, essentially, "What we have is good enough," and, "It's too hard," and "It's not worth it."

Maybe the reason more clients don't already do this is that it's currently too complicated. I'm suggesting we lower that barrier. It will promote more integrated, fully-native ACME clients.

Anyway, this fix should be trivial compared to starting the world's first automated and non-profit CA from scratch.

It's OK if it's not an erratum. But we should find a way to make it happen nonetheless.

I disagree with this reading, and believe this is where Errata is needed. The Errata would likely end up striking this section, as making it workable would require significant changes that would necessitate a new RFC.

Why? If a failed challenge were left in the "processing" state, under the current framework there would be no mechanism to notify the Client of the CA's validation attempt and subsequent failure. This notification is only made through polling for status. In certain situations a user may be able to detect the specific network traffic, but this is an onerous task.
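For context, the polling in question looks roughly like the sketch below (the Challenge struct and fetch parameter are simplified placeholders, not a real library's API):

```go
package acmepoll

import (
	"context"
	"time"
)

// Challenge is a trimmed-down view of an RFC 8555 challenge object.
type Challenge struct {
	Status string `json:"status"` // "pending", "processing", "valid", "invalid"
}

// pollChallenge keeps re-fetching a challenge until its status settles.
// This polling loop is the only way a client learns about a failed validation.
func pollChallenge(ctx context.Context, fetch func(context.Context) (Challenge, error)) (Challenge, error) {
	for {
		chal, err := fetch(ctx)
		if err != nil {
			return Challenge{}, err
		}
		if chal.Status == "valid" || chal.Status == "invalid" {
			return chal, nil // terminal states; "invalid" is the failure notification
		}
		select {
		case <-ctx.Done():
			return Challenge{}, ctx.Err()
		case <-time.After(2 * time.Second): // a real client would honor Retry-After
		}
	}
}
```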

The only ways I can imagine this section working are either unworkable:

  • The status would need to change to something else, but that would break most clients.
  • A challenge failure would not trigger an Authorization Failure, which then triggers an Order Failure. This would likely break existing usage patterns of Clients and CAs.

Or they would require more work on the spec:

  • A new field is added to the payload stating that there was a failure, and the transition to "invalid" is given a window that would allow an immediate or deliberate retry.
  • The CA would offer mechanisms to "revive" the Challenge, Authorization, and Order.
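Purely to illustrate that second pair of options, here is what such an extension might look like; none of these fields exist in RFC 8555 or in any CA's implementation, they are invented for this sketch:

```go
package acmeext

import (
	"encoding/json"
	"time"
)

// extendedChallenge sketches a hypothetical spec extension: the challenge
// stays "processing" after a failed attempt, reports what went wrong, and
// advertises a window (and endpoint) for retrying before it turns "invalid".
// The three extra fields are invented for illustration only.
type extendedChallenge struct {
	Type   string `json:"type"` // e.g. "http-01"
	URL    string `json:"url"`
	Status string `json:"status"` // remains "processing" during the retry window

	// Hypothetical additions:
	LastError   json.RawMessage `json:"lastError,omitempty"`   // problem document from the failed attempt
	RetryWindow time.Time       `json:"retryWindow,omitempty"` // transition to "invalid" deferred until then
	RetryURL    string          `json:"retryUrl,omitempty"`    // endpoint to "revive"/re-trigger the challenge
}
```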

Our internal client does this, and it has been massively helpful when troubleshooting issues.

An issue I've had is ISRG's decision to re-use certain objects under certain circumstances. This makes it slightly more difficult to persist information and analyze things.

Edit: Sorry if I'm complaining a lot about this. I know the odds of change are small, but I do believe (i) the spec is incredibly shortsighted here; (ii) some of ISRG's decisions make it a bit harder to do widespread metrics and analysis; and (iii) complaining is part of my healing process in accepting that this won't change even though it should.


@mholt, can you describe a scenario where you think a cleverer client will be able to benefit from falling back to a different challenge type?

My intuitions may have been narrowed by years of trying to help people use Certbot, which would generally not be able to fall back this way (basically no Certbot authenticators can actually natively successfully complete more than one challenge type, while --preferred-challenges is basically only used with -a manual and does not actually imply that automation would be able to solve multiple different challenges¹). So I think my intuitions developed this way match up well with @aarongable's intuitions.

But I did just read a thread (and a blog post by you) where you and @bmw were talking about how, indeed, non-tightly-server-integrated clients like Certbot are not that great at automation and reliability compared to integrated ones like Caddy. Which I think most of us have agreed on for years. :slight_smile:

Still, I don't immediately see how a more sophisticated or merely more integrated ACME client will commonly be able to benefit from falling back to a different challenge type, even given that it can potentially solve more than one (at least the ALPN challenge in addition to the HTTP challenge). Is there a likely case for that? A server firewall that blocks port 80 and not port 443, or blocks port 443 and not port 80?

¹ Edit: well, there is an experimental Certbot plugin written by @_az that does a super-magic thing with Linux networking to intercept challenge requests before they even reach the web server, and I suggested that as an April Fool's joke this could be proposed to implement TLS-ALPN-01 too, which @Osiris seemed to think would be genuinely worthwhile and doable. If certbot-standalone-nfq did implement both HTTP-01 and TLS-ALPN-01 then I guess it would be a novel example of a Certbot authenticator that could natively and automatically solve both of these challenges without additional user scripting.


Is it possible to modify the ACME server logic to fulfill the ACME client's requirement described in this topic without breaking any existing stateless ACME client?

I imagine doing it the following way.

The ACME client prepares to satisfy every challenge type it is capable of using for a given identifier. This only makes sense if there are at least two, or at most three with the current specs. The client then triggers verification of those challenges nearly simultaneously, minimizing the delay between them.
Here is the slightly modified logic of the ACME server. A trigger-time attribute is needed if the challenge object does not have one yet; the special value 0 signifies that no trigger has been received for the challenge. There is also a new internal-state attribute.
When a trigger arrives for a given challenge, the ACME server sets both the internal and the public state to "processing", notes the trigger time, and then runs the challenge verification.
When a challenge verification terminates, the server immediately updates only the internal state according to the verification result; it leaves the public state at "processing".
Then it waits until one second has passed since the trigger time. Note that no wait may be needed at all, if the verification itself took longer than a second.
After that one-second grace period, it checks whether any challenge for the same identifier still has internal state "processing". If so, it stops there for now.
If no challenge with internal state "processing" is found, it copies the internal state to the public state for each challenge type, and then processes the state transition of the authorization object based on the status of all challenges.
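A rough sketch of that server-side flow, under the stated assumptions of a fixed one-second grace window and new trigger-time and internal-state attributes (the types and names are invented; this is not how any real ACME server is structured):

```go
package gracewindow

import (
	"sync"
	"time"
)

// chal carries the extra bookkeeping the proposal needs: the public state
// clients see when polling, a separate internal state, and the trigger time
// (zero value = no trigger received yet).
type chal struct {
	publicState   string // what polling clients observe
	internalState string // validation result, hidden until the window closes
	triggeredAt   time.Time
}

// authz groups the challenges offered for one identifier.
type authz struct {
	mu         sync.Mutex
	challenges map[string]*chal // keyed by challenge type, e.g. "http-01"
}

const grace = time.Second // the one-second window from the proposal

// trigger is called when the client POSTs to a challenge URL; validate is
// whatever performs the actual http-01/tls-alpn-01/dns-01 check.
func (a *authz) trigger(chalType string, validate func() error) {
	a.mu.Lock()
	c := a.challenges[chalType]
	c.publicState, c.internalState = "processing", "processing"
	c.triggeredAt = time.Now()
	deadline := c.triggeredAt.Add(grace)
	a.mu.Unlock()

	go func() {
		err := validate()

		a.mu.Lock()
		if err != nil {
			c.internalState = "invalid" // public state stays "processing" for now
		} else {
			c.internalState = "valid"
		}
		a.mu.Unlock()

		// Wait out whatever remains of the grace window, measured from the
		// trigger time (no-op if validation already took longer than a second).
		time.Sleep(time.Until(deadline))

		a.mu.Lock()
		defer a.mu.Unlock()
		for _, other := range a.challenges {
			if other.internalState == "processing" {
				return // a sibling challenge is still validating; it will finish the job
			}
		}
		// Publish results and settle the authorization based on all challenges.
		authzValid := false
		for _, other := range a.challenges {
			if other.internalState == "valid" || other.internalState == "invalid" {
				other.publicState = other.internalState
			}
			if other.internalState == "valid" {
				authzValid = true
			}
		}
		_ = authzValid // here the authorization would transition to "valid" or "invalid"
	}()
}
```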

Tradeoffs relative to the current situation:

  • Stateless ACME clients will see at least a one-second delay between triggering a challenge and learning the final state of the challenge and authorization.
  • A multi-challenge ACME client gets a one-second window in which to trigger multiple challenges for the same identifier. If there is any delay due to network or ACME server load conditions, the client must be ready to fall back as it does today.


Sure, I thought it would be obvious, but:

Port 80 is blocked, port 443 is not. Thus retrying http-01 will not work, so the client must use tls-alpn-01. We see this frequently. That's why these two different challenges exist.

ACMEz will quickly "learn" that port 80 is blocked in this scenario and use tls-alpn first in the future. I don't expect clients to even be that clever, but there is state whether we like to admit it or not.

Another very common scenario: CDN fronting. If you have Cloudflare in front of your site, HTTP is forwarded but TLS is terminated.

If we want HTTPS to "just work" robustly and reliably (i.e. 0-config), we need to accept that retrying without state will not work.
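As an illustration of what that fallback amounts to on the client side, a small sketch (the Solver interface is hypothetical, not ACMEz's actual API, and under RFC 8555 as written each failed attempt costs a fresh order rather than a retry within the same one):

```go
package fallback

import (
	"context"
	"errors"
	"fmt"
)

// Solver is a hypothetical interface: one implementation per challenge type.
type Solver interface {
	Type() string // "http-01", "tls-alpn-01", ...
	Solve(ctx context.Context, identifier string) error
}

// obtainAuthz tries the offered challenge types in preference order and falls
// back to the next one when a challenge fails (e.g. http-01 failing because
// port 80 is blocked, then succeeding with tls-alpn-01 on port 443).
func obtainAuthz(ctx context.Context, identifier string, preferred []Solver) error {
	var errs []error
	for _, s := range preferred {
		if err := s.Solve(ctx, identifier); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.Type(), err))
			continue // fall back to the next offered challenge type
		}
		return nil
	}
	return fmt.Errorf("all challenge types failed for %s: %w", identifier, errors.Join(errs...))
}
```

A client that also remembers which type failed last time can reorder `preferred` on the next run, which is the "learning" described above.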

Thanks!

I think I slightly disagree as a historical matter that the existence of the ALPN challenge derives from port blocking. I think it derives from some hosting providers or CDN providers telling the Mozilla people that it would actually be easier for them to do bulk issuance via a port 443 challenge involving proxy or load balancer configuration (not that port 80 was blocked for them!). That then led to the TLS-SNI challenge method and then the ALPN method to replace it.

Your point is indeed a ways outside of my intuition because I'm used to telling people here on the forum over and over again that having port 80 blocked is a misconfiguration that they must fix. But

I guess the "stop blocking port 80" advice is really outsourcing the "state" to the human user!

And so on reflection, I think you're right to say that we would get better, easier, and more universal HTTPS deployment if we didn't do so.

(Also, analogously, with your CDN example I'm used to telling people "turn off your CDN for your certificate renewal" or "change your certificate deployment model" or something. Which is also outsourcing state to the human user.)

5 Likes

I appreciate your insights, @schoen. I didn't know that about the history! Nor did I think of "outsourcing" the state to the user. :thinking: That resonates deeply! :bulb:
