JWS has invalid anti-replay nonce

We use the Google DNS API to create DNS records, and Let's Encrypt with the ACMESharp client to automate certificate generation.
We have not changed this part of our application, but since 1 May we have been getting a lot of "JWS has invalid anti-replay nonce" errors from SubmitChallengeAnswer calls.
We had an 80-second delay between creating the DNS record and sending the certificate request.
After reducing this delay to 30 seconds, the error rate dropped from 60% to 7%, but that is still too high.
Lower delay values lead to a different error:

Lets Encrypt authorization status 'invalid'. Authorization not completed. type : urn:acme:error:dns; detail : DNS problem: NXDOMAIN looking up TXT for _acme-challenge.topas.3cx.eu; status : 400

Before 1 May, fewer than 1% of our certificate requests failed.
Retrying SubmitChallengeAnswer did not help much.

My domain is: activation.3cx.com

I ran this command:

ACMESharp.AcmeClient.SubmitChallengeAnswer(AuthorizationState authzState, String type, Boolean useRootUrl)

It produced this output:
Error status code: BadRequest, Error detail: {
"Type": "urn:acme:error:badNonce",
"Title": null,
"Status": 400,
"Detail": "JWS has invalid anti-replay nonce 4AYTyD85MM0vHP8QfVAVmQjgh0XpVm7KdZ16vPKY_eQ",
"Instance": null,
"OrignalContent": "{\n "type": "urn:acme:error:badNonce",\n "detail": "JWS has invalid anti-replay nonce 4AYTyD85MM0vHP8QfVAVmQjgh0XpVm7KdZ16vPKY_eQ",\n "status": 400\n}"
}

My web server is (include version): IIS (Version 8.5.9600.16384)

The operating system my web server runs on is (include version): Windows Server 2012 R2 (Version 6.2. Build 9200)

My hosting provider, if applicable, is: OVH

I can login to a root shell on my machine (yes or no, or I don't know): no

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): yes (MS IIS snap-in)

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): ACMESharp (version 0.8.1.0)


badNonce errors seem to have become more frequent recently, but they are not unexpected and not fatal. Ideally the ACMESharp library should handle them transparently by retrying any such request a limited number of times with fresh nonces.

If ACMESharp doesn't do this for you, it's up to your implementing code to handle that scenario.

RFC 8555 says:

When a server rejects a request because its nonce value was
unacceptable (or not present), it MUST provide HTTP status code 400
(Bad Request), and indicate the ACME error type
"urn:ietf:params:acme:error:badNonce". An error response with the
"badNonce" error type MUST include a Replay-Nonce header field with a
fresh nonce that the server will accept in a retry of the original
query (and possibly in other requests, according to the server's
nonce scoping policy). On receiving such a response, a client SHOULD
retry the request using the new nonce.
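A minimal Python sketch of the retry the RFC describes, for illustration only; it is not ACMESharp's API. `Response`, `send`, and `fake_send` are hypothetical stand-ins for whatever signs and POSTs the JWS in your client. It treats both the ACME v1 error type from the output above and the RFC 8555 type as badNonce, and gives up after a fixed number of retries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# badNonce error types: ACME v1 (as in the error output above) and RFC 8555.
BAD_NONCE_TYPES = {
    "urn:acme:error:badNonce",
    "urn:ietf:params:acme:error:badNonce",
}


@dataclass
class Response:
    """Hypothetical minimal stand-in for an HTTP response."""
    status: int
    # Real HTTP clients expose case-insensitive header maps; a plain dict is
    # used here for simplicity.
    headers: Dict[str, str] = field(default_factory=dict)
    body: Dict[str, object] = field(default_factory=dict)


def is_bad_nonce(resp: Response) -> bool:
    return resp.status == 400 and resp.body.get("type") in BAD_NONCE_TYPES


def post_with_nonce_retry(
    send: Callable[[str], Response],
    nonce: str,
    max_retries: int = 3,
) -> Response:
    """Call send(nonce); on badNonce, retry with the error's Replay-Nonce.

    `send` is a hypothetical callable that signs the JWS with the given
    nonce, POSTs it, and returns the server's Response.
    """
    resp = send(nonce)
    for _ in range(max_retries):
        if not is_bad_nonce(resp):
            break
        fresh = resp.headers.get("Replay-Nonce")
        if not fresh:
            # The RFC says this header MUST be present; without it, stop.
            break
        resp = send(fresh)
    return resp


# Simulated server: only "n3" is accepted; each rejection hands out the next nonce.
def fake_send(nonce: str) -> Response:
    next_nonce = {"n1": "n2", "n2": "n3"}
    if nonce in next_nonce:
        return Response(400, {"Replay-Nonce": next_nonce[nonce]},
                        {"type": "urn:acme:error:badNonce"})
    return Response(200, {}, {"status": "valid"})


print(post_with_nonce_retry(fake_send, "n1").status)  # 200
```

Capping the retries matters: if the server keeps rejecting fresh nonces, the caller gets the last error back instead of looping forever.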

Really? Submitting the same request with the Replay-Nonce from the error response led to another badNonce error?


Thank you, that helped.


Now we occasionally get “Lets Encrypt authorization status ‘invalid’. Authorization not completed. type : urn:acme:error:dns; detail : DNS problem: NXDOMAIN looking up TXT for _acme-challenge.oslhk.3cx.hk; status : 400”
The wait time between DNS record creation and certificate generation is now 80 seconds.
How should we deal with this error?

Are you waiting a flat 80 seconds, or is it 80 seconds after the change status goes from pending to done?

If you look at the way Certbot’s Google plugin does it, it appears to wait 60 seconds (default_propagation_seconds) after the change leaves the pending state.
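A rough Python sketch of that approach, assuming a hypothetical `get_change_status` callable that wraps the Cloud DNS changes.get request and returns the change's status string ("pending" or "done"). It polls until the change is done, then sleeps an extra propagation period; `sleep` and `clock` are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, List


def wait_for_dns_change(
    get_change_status: Callable[[], str],
    propagation_seconds: float = 60.0,
    poll_interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll until the DNS change reports "done", then wait propagation_seconds.

    get_change_status is hypothetical: it should wrap the Cloud DNS
    changes.get call and return the change's status string.
    """
    deadline = clock() + timeout
    while get_change_status() != "done":
        if clock() >= deadline:
            raise TimeoutError("DNS change did not reach 'done' in time")
        sleep(poll_interval)
    # Extra grace period after "done", similar in spirit to
    # Certbot's default_propagation_seconds.
    sleep(propagation_seconds)


# Demo with a fake status sequence and a recording "sleep" (no real waiting).
statuses = iter(["pending", "pending", "done"])
slept: List[float] = []
wait_for_dns_change(lambda: next(statuses), sleep=slept.append, clock=lambda: 0.0)
print(slept)  # [1.0, 1.0, 60.0]
```

Only after this returns would the client ask Let's Encrypt to validate the challenge.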

As far as I can see, we are not waiting for the DNS change's “done” status; we resolve the DNS entry from our side and then wait 80 seconds.
OK, noted.

We also get some errors like “type : urn:acme:error:dns; detail : DNS problem: query timed out looking up TXT for _acme-challenge.rewind.3cx.co.uk; status : 400”

and

Error status code: InternalServerError, Error detail: {
"Type": "urn:acme:error:serverInternal",
"Title": null,
"Status": 500,
"Detail": "Problem getting authorization",
"Instance": null,
"OrignalContent": "{\n "type": "urn:acme:error:serverInternal",\n "detail": "Problem getting authorization",\n "status": 500\n}"
}

Are these all of the same nature, and could they be fixed by waiting for the “done” status?

Waiting for the done status (and then sleeping for e.g. 60s) is definitely worthwhile, because there might be things like clustering and anycast at play which can make a DNS resolution test (from your ACME client's perspective) unreliable.

But the “query timed out” DNS error and the serverInternal error are quite worrying.

I wouldn't expect Google's Cloud DNS service to time out randomly, so if it is happening on a regular basis, an issue on Let's Encrypt's side seems more likely.

The serverInternal error appears to be a database issue on Let's Encrypt's side that you can do nothing about. If you have some order URLs related to these errors, you might want to tag some of the Let's Encrypt staff to look into them. There have been a couple of other threads recently with similar errors, so far without any explanation.


We’ll take a look at the load issues / serverInternal errors.
