400s on challenge ready

voutasaurus · September 21, 2017, 8:50pm

I have a custom Let’s Encrypt client. I’m getting 400s from Let’s Encrypt when submitting a message that says the challenge response is ready. This isn’t consistent though. It works almost all the time, but sometimes I’m getting 400s. Has anybody seen this kind of thing before?

I have a hunch it could be to do with the nonce because that’s the only factor that’s really changing.

The presigned message looks like this: {"Resource": "challenge", "Type": <type>, "KeyAuth": <keyAuth>}

I am not checking the response body for this error yet, that will be my next step though.

voutasaurus · September 21, 2017, 11:16pm

Here’s the error with the details and response headers:

Sep 21 16:07:59 unexpected error generating cert for [001999.example.com], got failed challenge for host "001999.example.com" with error error reporting ready to LE: LE initial response failure with status code: 400, error: Unable to update challenge :: The challenge is not pending., response headers: map[Content-Length:[132] Expires:[Thu, 21 Sep 2017 23:07:58 GMT] Cache-Control:[max-age=0, no-cache, no-store] Date:[Thu, 21 Sep 2017 23:07:58 GMT] Server:[nginx] Content-Type:[application/problem+json] Boulder-Requester:[1832031] Replay-Nonce:[_ON0DgsfZEnNuj9lYIxhMeZfsvfiToOSfNMqISEIr-0] Pragma:[no-cache]]

voutasaurus · September 21, 2017, 11:20pm

Has anybody seen this error before?

Given that the challenge is "not pending" is there any way to know what status the challenge is actually in at this point?

The other consideration is that I eventually got the cert for this challenge, maybe as a result of a different challenge.

cpu · September 22, 2017, 5:24pm

Hi @voutasaurus,

I think we can discount the nonce hunch because the error message for a bad nonce is pretty distinct. The layer of the codebase that would raise a 400 for a challenge being in the wrong state sits beyond the WFE's check of the nonce.

You can send a GET request to the challenge URI that you are POSTing and it will have the status in the returned challenge object. It will also return a Link header with a rel="up" relation that will point to the URI of the Authorization object the particular challenge is associated with. Sending a GET to the Authorization URI will also include its state in the response.

Yup! That sounds likely. I think you're seeing the result of authorization reuse.
We just recently enabled this for pending authorizations. This was something we previously only did for valid authorizations.

I suspect that your system made two new-authz requests for the same identifier (let's say example.com). Maybe one request was made in Thread A and one in Thread B. If A is the first to POST new-authz for example.com it would get back a brand new pending authorization (let's say ID=abcd) for the identifier with a set of challenges. If B then subsequently POSTed new-authz for example.com it would get back pending authz ID=abcd. Now if A POSTs on the pending authz abcd challenges, and the Boulder VA is happy, the authz will get switched to state valid. If B comes along at this point and tries to do the same thing, POST one of the challenges, it will run into an error because the challenge is no longer pending. In fact the authorization the challenge is associated with is already valid and so none of the challenges are required.

Does that make sense? It may also explain why the issue isn't consistent. There are likely factors at work in your system that prevent this from happening in all cases.

voutasaurus · September 22, 2017, 6:02pm

Thanks @cpu

This sounds likely. We have logic to lock workers per domain. If the lock expires another worker will try. It’s possible that the locking logic is broken.

As a thought experiment: what if A makes a new-authz request, completes the challenge, but times out before getting the cert? Then B makes a new-authz request, tries to do the challenge and then fails because it’s already valid. This would block us from getting the cert at all, unless our client is capable of skipping the challenge if it’s in a valid state.

Given that, should I also update our client to skip the challenge if it’s already valid?

cpu · September 22, 2017, 6:16pm

n/p @voutasaurus, happy to help!

To make sure I understand, you mean A would POST the new-cert endpoint with a CSR, but time out, not getting the certificate that is returned from the API as a result?

If I'm understanding right that shouldn't influence the problem at hand. The primary issue with that scenario is that you can't recover the certificate from the API and have eaten the rate limit cost of issuing it without having the resulting certificate.

Yes - but to be even more explicit it would be best if your client skipped challenges if the authorization the challenge is associated with is already valid. The reason the distinction matters is a scenario where a pending authorization "abcd" is shared between two workers but one worker prefers TLS-SNI-01 challenges and one worker prefers HTTP-01 challenges.

In that case Worker A might complete a TLS-SNI-01 challenge for authz "abcd", making the pending authorization switch to a valid authorization with a valid TLS-SNI-01 challenge. The DNS-01 and HTTP-01 challenges associated with that valid authorization will still be "pending". So if Worker B were to only look at the HTTP-01 challenge it wanted to POST, it would see a "pending" challenge and decide that the authorization must be pending too, when in fact it is already "valid" by way of the TLS-SNI-01 challenge.

Does that make sense? Challenges and authorizations both have a "status" field. When the status of a specific challenge for an authorization changes to valid, the status of the authorization changes to valid. The other unsolved challenges associated with the authorization are locked in the state they were in before the associated authorization was switched to valid.

voutasaurus · September 22, 2017, 6:19pm

Thanks @cpu that makes sense. I will skip the challenge if the authorization is valid.
