How can we fix a bug in the ACME protocol?

I wish that were true:

The vast majority of those 500 LoC are there to account for the fact that we need to start over after 1 of N challenges fails. The reality is that you CAN'T drop all state; otherwise you'd end up retrying the same challenge that already failed. If port 80 is blocked, for example, you have to remember not to try the http-01 challenge again.

It's a LOT of state management. I am not sure why you think you can just "drop all state" when actually you have to remember everything from the previous order so that you don't encounter the same problem next time.
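To illustrate (a minimal sketch only; the names and structure here are made up, not ACMEz's or Caddy's actual code), the kind of per-identifier memory a client ends up keeping looks roughly like this:

```go
package acmestate

import (
	"sync"
	"time"
)

// challengeMemory remembers, per identifier, which challenge types recently
// failed, so the next attempt doesn't repeat the same failure (e.g. don't try
// http-01 again right away if port 80 turned out to be blocked).
// Hypothetical sketch; not any real client's API.
type challengeMemory struct {
	mu     sync.Mutex
	failed map[string]map[string]time.Time // identifier -> challenge type -> last failure
}

func (m *challengeMemory) markFailed(identifier, challengeType string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.failed == nil {
		m.failed = make(map[string]map[string]time.Time)
	}
	if m.failed[identifier] == nil {
		m.failed[identifier] = make(map[string]time.Time)
	}
	m.failed[identifier][challengeType] = time.Now()
}

// pick returns the first offered challenge type that hasn't failed recently
// for this identifier, or false if they all have.
func (m *challengeMemory) pick(identifier string, offered []string, cooldown time.Duration) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, t := range offered {
		last, tried := m.failed[identifier][t]
		if !tried || time.Since(last) > cooldown {
			return t, true
		}
	}
	return "", false
}
```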

We're not asking to retry failed challenges though. We're asking to try the OTHER challenges that are offered, before the order is closed.

As an ACME client developer who is trying to make the client ecosystem more robust across the board, I would definitely like to have this discussion.

Why open 3 orders when you can open just 1?

Because you don't. The vast majority of clients get by without retaining information from a previous run. In fact, many don't even specifically retry failed orders: they just log, exit 1, and wait for their next cron trigger to try again. It's very cool that Caddy goes to these lengths, but it is not necessary. As I said, most failures are either fundamental (falling back to other methods won't help) or transient (retrying the same method will help). The case you're solving for is a slim minority as far as I'm aware.

Yes, which is why I addressed that in the second half of my message. The spec does not allow for it, the vast majority of clients would break if we tried, and I don't believe this use-case is prevalent enough to be worth updating the spec to account for it. In fact, many clients can't fall back anyway! Many can only do HTTP-01: they aren't configured with credentials to update DNS, and they're not integrated into the server so they can't respond to a TLS-ALPN-01 request.


Thanks for your reply.

Ok, and maybe this emphasizes what the actual goal is here: Can we elevate our vision a little bit? Raise the bar a little higher? The goal for all of us is a robust and resilient ecosystem, not one where we rely on brittle and naive cron jobs for ACME clients that work only in 1/3 use cases. We should not be optimizing for that scenario. That is not the Internet we want to live in.

There's a need here for clients and servers to work together, to cooperate on improving the status quo, so I am hoping there is a way forward instead of a response that tells everyone, essentially, "What we have is good enough," and, "It's too hard," and "It's not worth it."

Maybe the reason more clients don't already do this is that it's currently too complicated. I'm suggesting we lower that barrier. It will promote more integrated, fully-native ACME clients.

Anyway, this fix should be trivial compared to starting the world's first automated and non-profit CA from scratch.

It's OK if it's not an erratum. But we should find a way to make it happen nonetheless.

I disagree with this reading, and believe this is where Errata is needed. The Errata would likely end up striking this section, as making it workable would require significant changes that would necessitate a new RFC.

Why? If a failed challenge were left in the "processing" state, under the current framework there would be no mechanism to notify the Client of the CA's validation attempt and subsequent failure. This notification is only made through polling for status. In certain situations a user may be able to detect the specific network traffic, but this is an onerous task.
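For context, the polling in question looks roughly like the sketch below (the Challenge struct and fetch parameter are simplified placeholders, not a real library's API):

```go
package acmepoll

import (
	"context"
	"time"
)

// Challenge is a trimmed-down view of an RFC 8555 challenge object.
type Challenge struct {
	Status string `json:"status"` // "pending", "processing", "valid", "invalid"
}

// pollChallenge keeps re-fetching a challenge until its status settles.
// This polling loop is the only way a client learns about a failed validation.
func pollChallenge(ctx context.Context, fetch func(context.Context) (Challenge, error)) (Challenge, error) {
	for {
		chal, err := fetch(ctx)
		if err != nil {
			return Challenge{}, err
		}
		if chal.Status == "valid" || chal.Status == "invalid" {
			return chal, nil // terminal states; "invalid" is the failure notification
		}
		select {
		case <-ctx.Done():
			return Challenge{}, ctx.Err()
		case <-time.After(2 * time.Second): // a real client would honor Retry-After
		}
	}
}
```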

The only ways I can imagine this section working are either unworkable:

  • The status would need to change to something else, but that would break most clients.
  • A challenge failure would not trigger an Authorization Failure, which then triggers an Order Failure. This would likely break existing usage patterns of Clients and CAs.

Or they would require more work on the spec:

  • A new field is added to the payload stating that there was a failure, and the transition to "invalid" is given a window that would allow an immediate or deliberate retry.
  • The CA would offer mechanisms to "revive" the Challenge, Authorization, and Order.
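Purely to illustrate that second pair of options, here is what such an extension might look like; none of these fields exist in RFC 8555 or in any CA's implementation, they are invented for this sketch:

```go
package acmeext

import (
	"encoding/json"
	"time"
)

// extendedChallenge sketches a hypothetical spec extension: the challenge
// stays "processing" after a failed attempt, reports what went wrong, and
// advertises a window (and endpoint) for retrying before it turns "invalid".
// The three extra fields are invented for illustration only.
type extendedChallenge struct {
	Type   string `json:"type"` // e.g. "http-01"
	URL    string `json:"url"`
	Status string `json:"status"` // remains "processing" during the retry window

	// Hypothetical additions:
	LastError   json.RawMessage `json:"lastError,omitempty"`   // problem document from the failed attempt
	RetryWindow time.Time       `json:"retryWindow,omitempty"` // transition to "invalid" deferred until then
	RetryURL    string          `json:"retryUrl,omitempty"`    // endpoint to "revive"/re-trigger the challenge
}
```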

Our internal client does this, and it has been massively helpful when troubleshooting issues.

An issue I've had is ISRG's decision to re-use certain objects under certain circumstances. This makes it slightly more difficult to persist information and analyze things.

Edit: Sorry if I'm complaining a lot about this. I know the odds of change are small, but I do believe (i) the spec is incredibly shortsighted here; (ii) some of ISRG's decisions make it a bit harder to do widespread metrics and analysis; and (iii) complaining is part of my healing process in accepting that this won't change even though it should.


@mholt, can you describe a scenario where you think a cleverer client will be able to benefit from falling back to a different challenge type?

My intuitions may have been narrowed by years of trying to help people use Certbot, which would generally not be able to fall back this way (basically no Certbot authenticators can actually natively successfully complete more than one challenge type, while --preferred-challenges is basically only used with -a manual and does not actually imply that automation would be able to solve multiple different challenges¹). So I think my intuitions developed this way match up well with @aarongable's intuitions.

But I did just read a thread (and a blog post by you) where you and @bmw were talking about how, indeed, non-tightly-server-integrated clients like Certbot are not that great at automation and reliability compared to integrated ones like Caddy. Which I think most of us have agreed on for years. :slight_smile:

Still, I don't immediately see how a more sophisticated or merely more integrated ACME client will commonly be able to benefit from falling back to a different challenge type, even given that it can potentially solve more than one (at least the ALPN challenge in addition to the HTTP challenge). Is there a likely case for that? A server firewall that blocks port 80 and not port 443, or blocks port 443 and not port 80?

¹ Edit: well, there is an experimental Certbot plugin written by @_az that does a super-magic thing with Linux networking to intercept challenge requests before they even reach the web server, and I suggested that as an April Fool's joke this could be proposed to implement TLS-ALPN-01 too, which @Osiris seemed to think would be genuinely worthwhile and doable. If certbot-standalone-nfq did implement both HTTP-01 and TLS-ALPN-01 then I guess it would be a novel example of a Certbot authenticator that could natively and automatically solve both of these challenges without additional user scripting.


Is it possible to modify the ACME server logic to fulfill the ACME client's requirement described in this topic without breaking any existing stateless ACME client?

I imagine doing it the following way.

The ACME client prepares to satisfy every challenge type it is capable of using for a given identifier. This only makes sense if there are at least two, or at most three with the current specs. The client then triggers verification of those challenges nearly simultaneously, minimizing the delay between them.
Here is the slightly modified logic of the ACME server. A trigger-time attribute is needed if the challenge object does not have one yet; the special value 0 signifies that no trigger has been received for the challenge. There is also a new internal-state attribute.
When a trigger arrives for a given challenge, the ACME server sets both the internal and the public state to "processing", notes the trigger time, and then runs the challenge verification.
When a challenge verification terminates, the server immediately updates only the internal state according to the verification result; it leaves the public state at "processing".
Then it waits until one second has passed since the trigger time. Note that no wait may be needed at all, if the verification itself took longer than a second.
After that one-second grace period, it checks whether any challenge for the same identifier still has internal state "processing". If so, it stops there for now.
If no challenge with internal state "processing" is found, it copies the internal state to the public state for each challenge type, and then processes the state transition of the authorization object based on the status of all challenges.
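A rough sketch of that server-side flow, under the stated assumptions of a fixed one-second grace window and new trigger-time and internal-state attributes (the types and names are invented; this is not how any real ACME server is structured):

```go
package gracewindow

import (
	"sync"
	"time"
)

// chal carries the extra bookkeeping the proposal needs: the public state
// clients see when polling, a separate internal state, and the trigger time
// (zero value = no trigger received yet).
type chal struct {
	publicState   string // what polling clients observe
	internalState string // validation result, hidden until the window closes
	triggeredAt   time.Time
}

// authz groups the challenges offered for one identifier.
type authz struct {
	mu         sync.Mutex
	challenges map[string]*chal // keyed by challenge type, e.g. "http-01"
}

const grace = time.Second // the one-second window from the proposal

// trigger is called when the client POSTs to a challenge URL; validate is
// whatever performs the actual http-01/tls-alpn-01/dns-01 check.
func (a *authz) trigger(chalType string, validate func() error) {
	a.mu.Lock()
	c := a.challenges[chalType]
	c.publicState, c.internalState = "processing", "processing"
	c.triggeredAt = time.Now()
	deadline := c.triggeredAt.Add(grace)
	a.mu.Unlock()

	go func() {
		err := validate()

		a.mu.Lock()
		if err != nil {
			c.internalState = "invalid" // public state stays "processing" for now
		} else {
			c.internalState = "valid"
		}
		a.mu.Unlock()

		// Wait out whatever remains of the grace window, measured from the
		// trigger time (no-op if validation already took longer than a second).
		time.Sleep(time.Until(deadline))

		a.mu.Lock()
		defer a.mu.Unlock()
		for _, other := range a.challenges {
			if other.internalState == "processing" {
				return // a sibling challenge is still validating; it will finish the job
			}
		}
		// Publish results and settle the authorization based on all challenges.
		authzValid := false
		for _, other := range a.challenges {
			if other.internalState == "valid" || other.internalState == "invalid" {
				other.publicState = other.internalState
			}
			if other.internalState == "valid" {
				authzValid = true
			}
		}
		_ = authzValid // here the authorization would transition to "valid" or "invalid"
	}()
}
```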

Tradeoffs relative to the current situation:

  • Stateless ACME clients will see at least a one-second delay between triggering a challenge and learning the final state of the challenge and authorization.
  • A multi-challenge ACME client gets a one-second window in which to trigger multiple challenges for the same identifier. If there is any delay due to network or ACME server load conditions, the client must be ready to fall back as it does today.


Sure, I thought it would be obvious, but:

Port 80 is blocked, port 443 is not. Thus retrying http-01 will not work, so the client must use tls-alpn-01. We see this frequently. That's why these two different challenges exist.

ACMEz will quickly "learn" that port 80 is blocked in this scenario and use tls-alpn first in the future. I don't expect clients to even be that clever, but there is state whether we like to admit it or not.

Another very common scenario: CDN fronting. If you have Cloudflare in front of your site, HTTP is forwarded but TLS is terminated.

If we want HTTPS to "just work" robustly and reliably (i.e. 0-config), we need to accept that retrying without state will not work.
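As an illustration of what that fallback amounts to on the client side, a small sketch (the Solver interface is hypothetical, not ACMEz's actual API, and under RFC 8555 as written each failed attempt costs a fresh order rather than a retry within the same one):

```go
package fallback

import (
	"context"
	"errors"
	"fmt"
)

// Solver is a hypothetical interface: one implementation per challenge type.
type Solver interface {
	Type() string // "http-01", "tls-alpn-01", ...
	Solve(ctx context.Context, identifier string) error
}

// obtainAuthz tries the offered challenge types in preference order and falls
// back to the next one when a challenge fails (e.g. http-01 failing because
// port 80 is blocked, then succeeding with tls-alpn-01 on port 443).
func obtainAuthz(ctx context.Context, identifier string, preferred []Solver) error {
	var errs []error
	for _, s := range preferred {
		if err := s.Solve(ctx, identifier); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.Type(), err))
			continue // fall back to the next offered challenge type
		}
		return nil
	}
	return fmt.Errorf("all challenge types failed for %s: %w", identifier, errors.Join(errs...))
}
```

A client that also remembers which type failed last time can reorder `preferred` on the next run, which is the "learning" described above.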

Thanks!

I think I slightly disagree as a historical matter that the existence of the ALPN challenge derives from port blocking. I think it derives from some hosting providers or CDN providers telling the Mozilla people that it would actually be easier for them to do bulk issuance via a port 443 challenge involving proxy or load balancer configuration (not that port 80 was blocked for them!). That then led to the TLS-SNI challenge method and then the ALPN method to replace it.

Your point is indeed a ways outside of my intuition because I'm used to telling people here on the forum over and over again that having port 80 blocked is a misconfiguration that they must fix. But

I guess the "stop blocking port 80" advice is really outsourcing the "state" to the human user!

And so on reflection, I think you're right to say that we would get better, easier, and more universal HTTPS deployment if we didn't do so.

(Also, analogously, with your CDN example I'm used to telling people "turn off your CDN for your certificate renewal" or "change your certificate deployment model" or something. Which is also outsourcing state to the human user.)

5 Likes

I appreciate your insights, @schoen. I didn't know that about the history! Nor did I think of "outsourcing" the state to the user. :thinking: That resonates deeply! :bulb:
