How can we fix a bug in the ACME protocol?

mholt · June 13, 2023, 10:38pm

A few years ago, I opened a discussion on the IETF mailing list about what I think is a bug in RFC 8555:

I later filed an errata here:

https://www.rfc-editor.org/errata/eid6317

Basically:

1. An order is created, and 3 challenge types are offered (http, tls-alpn, and dns)
2. One challenge type fails
3. The entire order is closed
4. In order to try the next challenge type, a new order has to be created, and we have to carry the state to a higher scope to remember which challenge type to NOT use. Repeat steps 1-4 until all challenge types are spent.

This results in more complex client code and higher costs for ACME servers because of all the opened orders. It also burns through rate limits a lot quicker. (ACMEz, the client I was developing, remembers which challenge types succeed most and prefers those, but that's fairly unique.)
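The flow described above can be sketched as a small simulation. This is only an illustration: `FakeCA`, `obtain_with_new_order_per_type`, and the dictionary-based order are hypothetical stand-ins, not part of ACMEz, Boulder, or any real ACME library. It assumes a server that invalidates the whole order on the first failed challenge.

```python
from dataclasses import dataclass

CHALLENGE_TYPES = ["http-01", "tls-alpn-01", "dns-01"]


@dataclass
class FakeCA:
    """Hypothetical server: one failed challenge invalidates the whole order."""
    working_types: set
    orders_created: int = 0

    def new_order(self):
        self.orders_created += 1
        return {"id": self.orders_created, "status": "pending"}

    def attempt(self, order, challenge_type):
        ok = challenge_type in self.working_types
        order["status"] = "ready" if ok else "invalid"
        return ok


def obtain_with_new_order_per_type(ca):
    # State that must live outside any single order: which types already failed.
    failed = set()
    while len(failed) < len(CHALLENGE_TYPES):
        order = ca.new_order()  # a fresh order for every attempt
        chal = next(t for t in CHALLENGE_TYPES if t not in failed)
        if ca.attempt(order, chal):
            return chal
        failed.add(chal)
    return None
```

In this sketch, a client for which only `dns-01` works opens three orders before succeeding, which is the cost the post is pointing at.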

I suggest if CAs want to lower their operating costs and/or reduce their DB load, a simple change should be made to ACME whereby this flow works instead:

1. An order is created, and 3 challenge types are offered (http, tls-alpn, and dns)
2. One challenge type fails
3. The next challenge is attempted until one succeeds or challenges run out
4. The order is closed

This makes for much cleaner client code and drastically reduces load/traffic for the CA.
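A minimal sketch of the proposed flow, for comparison. It models behavior that this post is asking for, not behavior the thread establishes as permitted by RFC 8555; `FallbackCA` and `obtain_with_fallback` are hypothetical names, and the order is a plain dictionary rather than a real ACME object.

```python
CHALLENGE_TYPES = ["http-01", "tls-alpn-01", "dns-01"]


class FallbackCA:
    """Hypothetical server: the order stays open until every challenge has failed."""

    def __init__(self, working_types):
        self.working_types = set(working_types)
        self.orders_created = 0

    def new_order(self):
        self.orders_created += 1
        return {
            "status": "pending",
            "challenges": {t: "pending" for t in CHALLENGE_TYPES},
        }

    def attempt(self, order, challenge_type):
        ok = challenge_type in self.working_types
        order["challenges"][challenge_type] = "valid" if ok else "invalid"
        if ok:
            order["status"] = "ready"
        elif all(s == "invalid" for s in order["challenges"].values()):
            # Only now, with every challenge spent, is the order closed.
            order["status"] = "invalid"
        return ok


def obtain_with_fallback(ca):
    order = ca.new_order()  # exactly one order
    for chal in CHALLENGE_TYPES:
        if ca.attempt(order, chal):
            return chal, order
    return None, order
```

Under these assumptions the client needs no state beyond the loop variable, and the server sees one order regardless of how many challenge types fail.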

But the errata and ML discussion seem to have been forgotten.

Should I open an issue with Boulder, for starters? I know Boulder already diverges from RFC 8555 in several ways.

_az · June 13, 2023, 11:09pm

Is it a protocol bug? Or a choice of Let's Encrypt?

ISTM that allowing clients to drive challenge retries would enable this workflow. The protocol already makes room for this:

Clients can explicitly request a retry by re-sending their response
to a challenge in a new POST request (with a new nonce, etc.). This
allows clients to request a retry when the state has changed (e.g.,
after firewall rules have been updated). Servers SHOULD retry a
request immediately on receiving such a POST request. In order to
avoid denial-of-service attacks via client-initiated retries, servers
SHOULD rate-limit such requests.

I don't know why a failed challenge results in a failed authorization with Boulder. The protocol doesn't say to do that. I can only assume it is done for some intentional operational reasons and new orders are less painful for them than challenge retries.

I don't much like Sectigo's implementation either, where they will do server-initiated retries without exposing any information on the challenge or authorization resources. It's just a black box that eventually fails (sometimes with a 24-hour Retry-After ...).

Edit: regarding client-initiated retries though, the challenge state would have to stay "processing", right? Clients would have to dramatically adjust their challenge polling logic to look at the error field of a processing challenge ... seems messy.
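To make the "messy" part concrete, here is a rough sketch of the decision a client would have to make on each poll if servers kept failed attempts in the "processing" state. The `status` and `error` fields are the ones RFC 8555 defines on challenge objects; the function name, the action strings, and the `max_attempts` cutoff are assumptions made up for this illustration.

```python
def next_action(challenge, attempts_seen, max_attempts=3):
    """Decide what to do after polling a challenge object (a parsed JSON dict)."""
    status = challenge.get("status")
    if status == "valid":
        return "done"
    if status == "invalid":
        return "failed"
    if status == "processing" and "error" in challenge:
        # The last validation attempt failed, but the challenge is still open.
        if attempts_seen >= max_attempts:
            return "give_up"  # client-side cutoff; nothing in the spec sets this
        return "retry"  # e.g. re-POST the challenge response
    return "wait"  # pending, or processing with no error reported yet
```

Today most clients only need the first two branches; the third is the extra logic being discussed.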

Osiris · June 14, 2023, 5:42am

I believe it does. While you're correct that the challenge can linger in its processing state, being retried by either the client or server, for the authz object, RFC 8555 (section 7.1.6) is more clear:

Authorization objects are created in the "pending" state. If one of
the challenges listed in the authorization transitions to the "valid"
state, then the authorization also changes to the "valid" state. If
the client attempts to fulfill a challenge and fails, or if there is
an error while the authorization is still pending, then the
authorization transitions to the "invalid" state.

So to keep an authz from being invalidated on a failed challenge, a failed challenge should be kept in its processing state. Once the challenge goes to the invalid state, the authz should be invalidated too.

So you're correct that an ACME server could use challenge retry to get around the issue mentioned in this thread, but it shouldn't invalidate the challenge.
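The quoted passage can be read as a simple rule for deriving an authorization's status from its challenges. The sketch below encodes that reading only; it deliberately ignores the "deactivated", "expired", and "revoked" states, and `authorization_status` is a name invented for this example.

```python
def authorization_status(challenge_statuses):
    """Derive the authz status from the statuses of its challenges."""
    if "valid" in challenge_statuses:
        return "valid"
    if "invalid" in challenge_statuses:
        return "invalid"
    # All challenges are still "pending" or "processing".
    return "pending"
```

Note that a failed attempt which leaves its challenge in "processing" keeps the authorization "pending", which is the escape hatch described in this post.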

_az · June 14, 2023, 5:44am

Yes, you're right. I should have said "a failed challenge attempt". I am still confused about how a server and client are meant to collaborate to decide that they're "giving up" on a challenge.

Osiris · June 14, 2023, 6:51am

Probably just giving up after some time, I'd say at the server's initiative.

jvanasco · June 14, 2023, 3:50pm

This doesn't sound appropriate for Errata as it changes the functionality.

I don't think boulder could implement this as a divergence, because it looks like it would violate the spec. I don't think any of the divergences violate the spec.

The retry mechanism seems like it applies to challenges specifically, not authorizations.

It reads like this would be most appropriate in the next spec that obsoletes the current one.

I do like these ideas, I've just dealt with the formalities process with RFCs before.

mholt · June 14, 2023, 7:11pm

This is a great question, now that you mention it. It's been too long and I don't recall if I was using pebble to test, or LE Staging. Maybe both?

But it also says:

Note that within the "processing" state, the server may attempt to validate the challenge multiple times (see Section 8.2). Likewise, client requests for retries do not cause a state change.

So I feel like the client should be allowed to retry the validation with any of the challenges before the authz is finalized as "invalid". It doesn't even make sense to offer multiple challenges if only 1 can be used.

I think there's an ambiguity in the spec, though. So it's equally hard to say whether there'd be a spec violation.

Would it have to change functionality though? If the server makes one authz per challenge, then the client logic could remain the same as long as it's conforming, AFAIK.

(I do appreciate your perspective, as I don't have RFC experience.)

Osiris · June 14, 2023, 7:14pm

Sure, but also with that the challenge should be kept in the processing state. Let's Encrypt does not do that, so the authz fails immediately too. This is a LE "issue", not an RFC issue methinks.

mholt · June 14, 2023, 7:15pm

If that's the case, then maybe I will open a bug on their issue tracker.

Even if so, I think it'd be helpful for the spec to narrow down the exact behavior here, rather than leaving it ambiguous.

Osiris · June 14, 2023, 7:16pm

I haven't seen the ambiguity yet though? But I'm probably missing it.

mholt · June 14, 2023, 7:17pm

There must be an ambiguity if the server has a "choice" whether to do it one way or another way, as has been stated above.

Osiris · June 14, 2023, 7:19pm

Ah yes. English isn't my primary language so I misinterpreted the use of "ambiguity". One has options indeed. Not sure if that's bad though.

mholt · June 14, 2023, 7:20pm

Generally, it's bad for technical specifications to have ambiguities, since the whole point of a standard is to eliminate different ways of doing things.

Osiris · June 14, 2023, 7:22pm

I fully agree, but also emphasis on the "generally" part. Personally, but not experienced in RFCs to be honest, I don't like how RFC 8555 is written generally (too many words, too little "syntax"). But this ambiguity I can live with.

petercooperjr · June 14, 2023, 7:22pm

Let's Encrypt already has difficulty making changes that are in full compliance with the ACME RFC; making further changes to the actual spec at this point would cause even more pain. (And it may be that the pain would be worth it, don't get me wrong, but I think it'd have to be a pretty high bar to be worth standardizing and going through the effort to get servers and clients to change their behaviors.)

mholt · June 14, 2023, 7:29pm

True, but: I don't think clients will have to change their behaviors (assuming they are already in conformance with the spec and not just accidentally with LE's implementation). Servers that do change could reduce their operational costs and complexity significantly. And clients that take advantage of this can also greatly reduce their complexity.

Seems like a win-win.

petercooperjr · June 14, 2023, 7:31pm

I am in fact saying that it looks like a lot of clients are accidentally only in conformance with LE's implementation, if the issues with enabling Asynchronous Order Finalization (which other CAs already do) are any indication.

jvanasco · June 14, 2023, 8:36pm

The two things IETF really cares about are:

Upholding the ALLCAPS IMPERATIVES
Backwards Compatibility / Maintaining functionality

Changing either of those will generally require a new RFC that obsoletes the existing one.
Clarifying issues or correcting mistakes will generally be done in an Errata.
New functionality that can fit within the existing RFC can generally be done in a standalone RFC that describes the extension.

So, I'll backtrack on my previous comment after re-reading the spec. I'm not sure if it could be an Errata or not. Some aspects seem like they should be clarified in the existing RFC, others seem like they should be a new RFC, and others seem like they could be legal divergences.

Again, I generally support this idea. I'm just reading this from the point of someone trying to oppose it on technicalities, which is how RFC stuff typically goes.

I couldn't find any "MUST" clauses, and section 8.2 "Retrying Challenges" ( @_az included an excerpt above; the full section is here: RFC 8555 - Automatic Certificate Management Environment (ACME) ) supports this general idea, with one key distinction: that section provides for a retry of a specific Challenge, but does not provide for a retry of the Authorization.

Looking elsewhere on the spec, @Osiris referenced this passage above, but I'll point to some specific language bits in it:

"State Transitions for Authorization Objects"
...
If an error occurs at any of these stages, the order
moves to the "invalid" state. The order also moves to the "invalid"
state if it expires or one of its authorizations enters a final state
other than "valid" ("expired", "revoked", or "deactivated").

Note the lack of "MUST" in that passage. Without the capitalized "MUST", the RFC does not require it.

Also note the phrase "final state". The finalization of an Authorization is only referenced one other time:

Responding to Challenges
...
The server is said to "finalize" the authorization when it has
completed one of the validations. This is done by assigning the
authorization a status of "valid" or "invalid", corresponding to
whether it considers the account authorized for the identifier. If
the final state is "valid", then the server MUST include an "expires"
field. When finalizing an authorization, the server MAY remove
challenges other than the one that was completed, and it may modify
the "expires" field. The server SHOULD NOT remove challenges with
status "invalid".

Again, there is a lack of "MUST" on the core logic here.

In terms of Errata, I think the spec either somewhat contradicts itself or is unclear on these two points:

Section 7.5.1 defines the final state as completing one of the authorizations based on the result of a challenge.
Section 8.2 defines a method to explicitly retry the challenge

The 8.2 retry makes no sense, as the status would need to remain in "processing". For this reason, I think the topic needs to be clarified in an Errata.

Let's look at how the status options are defined to see if there is anything to utilize there:

Note the valid options for Challenge objects: (RFC 8555 - Automatic Certificate Management Environment (ACME))

status (required, string): The status of this challenge. Possible values are "pending", "processing", "valid", and "invalid" (see Section 7.1.6).

Note the valid options for Authorization objects: (RFC 8555 - Automatic Certificate Management Environment (ACME))

status (required, string): The status of this authorization. Possible values are "pending", "valid", "invalid", "deactivated", "expired", and "revoked". See Section 7.1.6.

The options are limited, but both are defined with "Possible values". I am not sure whether this phrasing limits the allowed values to this selection or not, but I think far too many clients could break if arbitrary status identifiers were used. So I don't think we could use a "pending-invalid" marker or anything similar; clients not expecting it are likely to break.
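As an illustration of why a new status string is risky, here is how a strictly typed client might model the two status lists quoted above. The enum values come from the RFC text; the class and function names are invented for this sketch, and real clients may of course parse statuses differently.

```python
from enum import Enum


class ChallengeStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    VALID = "valid"
    INVALID = "invalid"


class AuthorizationStatus(Enum):
    PENDING = "pending"
    VALID = "valid"
    INVALID = "invalid"
    DEACTIVATED = "deactivated"
    EXPIRED = "expired"
    REVOKED = "revoked"


def parse_authorization_status(raw):
    """Map a raw status string to the enum, rejecting anything unlisted."""
    try:
        return AuthorizationStatus(raw)
    except ValueError:
        raise ValueError(f"unknown authorization status: {raw!r}") from None
```

A client written this way raises on `"pending-invalid"`, and also on `"processing"`, which is listed for challenges but not for authorizations.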

So how could one try to jam this functionality in?

IMHO, I think the spec's wording allows for the various objects to be extended with new fields. I don't see anything that could be construed to ban additional fields in the objects, I just see a listing of mandated fields and the optional "meta" on accounts. (Perhaps I missed something?). I recall pushing for a new field once before, but I forgot for what. I don't recall getting pushback on the concept of adding a field, just the utility of what I wanted to do. [It's been many years, I've wanted to change many things]

So, IMHO, I think one could implement the desired workflow through a new RFC that extends ACME by adding additional fields to the Challenge and Authorization objects. Perhaps these fields are named "retry-options" or similar, and offer URLs that would re-activate the objects when possible.

The ACME flow for existing clients would not be changed, unless they throw errors if extraneous fields show up. (I do not know of any clients that do this).
There does not seem to be anything in the current RFC that REQUIRES an action to be fatal to the entire chain upwards. The fatal behavior is described and inherent to the design, but there does not seem to be an IMPERATIVE for it.
A client conforming to the extension could simply hit a URL offered in the payload to re-enable the object.
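A rough sketch of how such an extension could coexist with existing clients. Everything about the extension here is hypothetical: "retry-options" is only the name floated in this post, the `reactivate` key and the example URLs are invented for illustration, and no such field exists in RFC 8555. The known field names are the ones the RFC lists for authorization objects.

```python
import json

# Fields RFC 8555 defines on authorization objects.
KNOWN_AUTHZ_FIELDS = {"identifier", "status", "expires", "challenges", "wildcard"}


def parse_authorization(body):
    """Return (known fields, retry URL or None) from an authorization JSON body."""
    doc = json.loads(body)
    # An existing client would simply keep the fields it knows and ignore the rest.
    known = {k: v for k, v in doc.items() if k in KNOWN_AUTHZ_FIELDS}
    # An extension-aware client would additionally look for the hypothetical field.
    retry_url = doc.get("retry-options", {}).get("reactivate")
    return known, retry_url
```

With this shape, a server that omits the field and a client that ignores it both behave exactly as they do today.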

aarongable · June 14, 2023, 9:06pm

As written, I think that RFC8555 is fairly clear:

Challenges can be retried: if a challenge validation fails, the ACME server may choose to leave that challenge in the "processing" state rather than moving it to the "invalid" state. The ACME server may choose to re-attempt validation on its own. The ACME client may choose to re-request validation as well.
Authorizations cannot: as soon as any challenge associated with an authorization is "invalid", then the whole authorization (and the whole order!) is also "invalid".

Let's Encrypt does not allow Challenges to be retried. As soon as a single challenge validation attempt fails, the challenge (and therefore the authorization, and the order) is marked as "invalid". This is not a bug, it is a deliberate choice which simplifies the code paths through one of the most critical pieces of code we maintain: the domain control validation code. This is not a divergence from the spec, as nothing in RFC8555 says that ACME servers must allow challenges to be retried, it simply allows for the option and gives the server full control over when the challenge moves into the "invalid" state. This decision is one which could be revisited and changed. At this time, we don't have a lot of evidence that doing so would meaningfully reduce our traffic or improve our traffic patterns. I'm happy to be shown otherwise.

But regarding the original request at the top of this thread:

If Let's Encrypt were to allow clients to try to fulfill other challenges on a single authorization after one had transitioned into the "invalid" state, that would be a divergence from the spec. I don't think we are willing to diverge from the spec in this way.

If Let's Encrypt were to leave failed challenges in the "processing" state, it would break a huge population of clients. The vast majority would continue polling that challenge, waiting for it to transition into either "valid" or "invalid", as the spec says they should. Only a tiny fraction of clients (perhaps ACMEz) would "know" that a long-processing challenge means they should attempt other challenges instead. This widespread breakage is not worth the slight optimization for a few clients.
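The typical client polling loop described here looks roughly like the following sketch. `fetch` stands in for whatever function retrieves the challenge object, and `max_polls` is an arbitrary illustrative limit; real clients would also sleep between polls and honor Retry-After, which is omitted to keep the example short.

```python
def poll_until_final(fetch, max_polls=10):
    """Poll a challenge until it is "valid" or "invalid", or the poll budget runs out."""
    for n in range(1, max_polls + 1):
        status = fetch()["status"]
        if status in ("valid", "invalid"):
            return status, n
    # A challenge left in "processing" forever ends up here.
    return "timeout", max_polls
```

A client like this has no branch for "processing, but the last attempt failed", so a server that never leaves "processing" simply runs it into its timeout.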

It's totally possible that an RFC could define new behavior which makes fallback between different challenges for the same authorization the Right Thing To Do. But I think the situation presented at the top of this thread, where one challenge type fails but the client is smart enough to fall back to a different challenge type which subsequently succeeds, is vanishingly rare. Most validations fail for one of two reasons: the client doesn't actually control the name in question (e.g. because someone left an acme client running long after their domain registration expired), or there was a transient failure (e.g. their DNS provider was slow to propagate the new TXT record). In some cases, these failures would be resolved by trying a different method. But I believe that in the vast majority of those cases, the failure would also be resolved simply by retrying the same method.

As far as ACME clients are concerned, orders are cheap. If something goes wrong, drop all state, and retry from the top. I don't think there's a large appetite for adopting more complexity than that -- after all, we couldn't even get clients to poll for order finalization; how many do you think are going to poll and retry challenge validation?

jvanasco · June 14, 2023, 9:11pm

I forgot to mention above, Boulder will reuse Order and Authorization objects when possible.
