Getting The client sent an unacceptable anti-replay nonce

williamsdb · January 20, 2016, 12:16pm

I am running the following command:

sudo ./letsencrypt-auto certonly --standalone -d secure.domain.com --debug

This is for the second cert on this box, the first worked fine. However, the second throws the following error:

Version: 1.1-20080819
Version: 1.1-20080819
Traceback (most recent call last):
  File "/root/.local/share/letsencrypt/bin/letsencrypt", line 11, in <module>
    sys.exit(main())
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/cli.py", line 1398, in main
    return args.func(args, config, plugins)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/cli.py", line 600, in obtain_cert
    _auth_from_domains(le_client, config, domains)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/cli.py", line 404, in _auth_from_domains
    lineage = le_client.obtain_and_enroll_certificate(domains)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/client.py", line 283, in obtain_and_enroll_certificate
    certr, chain, key, _ = self.obtain_certificate(domains)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/client.py", line 266, in obtain_certificate
    return self._obtain_certificate(domains, csr) + (key, csr)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/client.py", line 224, in _obtain_certificate
    authzr = self.auth_handler.get_authorizations(domains)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/letsencrypt/auth_handler.py", line 74, in get_authorizations
    domain, self.account.regr.new_authzr_uri)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/acme/client.py", line 215, in request_domain_challenges
    typ=messages.IDENTIFIER_FQDN, value=domain), new_authz_uri)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/acme/client.py", line 195, in request_challenges
    response = self.net.post(new_authzr_uri, new_authz)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/acme/client.py", line 634, in post
    return self._check_response(response, content_type=content_type)
  File "/root/.local/share/letsencrypt/local/lib/python2.7/site-packages/acme/client.py", line 550, in _check_response
    raise messages.Error.from_json(jobj)
Error: urn:acme:error:badNonce :: The client sent an unacceptable anti-replay nonce :: JWS has invalid anti-replay nonce

I have searched both here and generally on the web but cannot find anything that helps.

Any ideas?

kelunik · January 20, 2016, 12:19pm

Just retry it. It may be the same issue as https://github.com/letsencrypt/boulder/issues/1217.

williamsdb · January 20, 2016, 12:43pm

Yes that was it, thanks!

chriswheeler · January 20, 2016, 4:59pm

I got this too a few minutes ago…

jsha · January 20, 2016, 7:52pm

Thanks for the report! In general, clients should retry a reasonable number of times if they get a badNonce error (a fresh nonce is included in the reply). However, it sounds like we are serving these more often than we expect. We’ll look into whether there’s something amiss.
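
A minimal sketch of that retry behaviour, assuming a hypothetical `send(nonce)` callable that signs and POSTs the request with the given nonce and returns `(status, headers, body)`. The function name, the response shape, and the default of three attempts are assumptions for illustration, not the letsencrypt client's actual code:

```python
from typing import Callable, Dict, Tuple

# Hypothetical response shape for this sketch: (status, headers, JSON body).
Response = Tuple[int, Dict[str, str], dict]

# Error type string as it appears in the traceback earlier in this thread.
BAD_NONCE = "urn:acme:error:badNonce"


def post_with_nonce_retry(
    send: Callable[[str], Response],
    nonce: str,
    max_attempts: int = 3,
) -> Response:
    """Call send(nonce); on badNonce, retry with the nonce from the error reply."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send(nonce)
    for _ in range(max_attempts - 1):
        _status, headers, body = response
        # Real HTTP header names are case-insensitive; a plain dict is assumed here.
        fresh = headers.get("Replay-Nonce")
        if body.get("type") != BAD_NONCE or fresh is None:
            break  # success, a different error, or no fresh nonce to retry with
        response = send(fresh)
    return response
```

Because the nonce is carried inside the signed JWS, `send` would need to re-sign the request for each new nonce rather than replay the same bytes.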

Sid · January 22, 2016, 6:58pm

I had the same issue. Had to run the command 4 or 5 times before it finally worked.

eva2000 · January 22, 2016, 8:14pm

Same, I hit this yesterday!

shazde · January 24, 2016, 1:39am

Same problem for me. Have tried it from different IPs and for different domains, they all fail.
No workaround either.

tomC · January 24, 2016, 11:19am

Got this overnight. Retrying by hand works fine. Sounds like the LE client doesn’t do what it should?

shazde · January 24, 2016, 12:20pm

The workaround for me is to resend the last request using the nonce that was sent back in the response to the failed attempt.
I am using the Ruby client and I have traced all the packets.
The nonce used in the last phase is valid but gets rejected, while the second attempt with a new nonce works fine.

There is obviously a newly introduced bug on the server side.
Was there a recent code deploy to production in the past 3 days?

sahsanu · January 24, 2016, 12:32pm

Yes, the last Boulder update succeeded on January 21, 2016 at 6:50PM UTC.

These are the changes introduced by the update of Boulder to +77d5114

unixcharles · January 25, 2016, 7:49pm

Quoting jsha: "In general, clients should retry a reasonable number of times if they get a badNonce error (a fresh nonce is included in the reply)."

I find that to be a surprising behaviour. I don't think anybody would implement something like this looking at the spec.

My expectation when implementing ACME in acme-client was that, if I got a badNonce error, crashing or raising an error would be the sensible behaviour, and that retrying would result in the request failing in the same way.

Not too sure what the proper way to document that would be.

jsha · January 25, 2016, 8:45pm

That's good feedback, thanks! The best way to improve the documentation would be to send a PR to the ietf-wg-acme/acme repository on GitHub, and then email acme@ietf.org with a link to the PR and details for discussion.

jcjones · January 28, 2016, 7:15am

@jsha found the nonce problem tonight; it is my fault.

The last thing I did before I moved back to Mozilla projects was re-enable caching at our CDN; we’re setting cache headers in Boulder for cacheable things like the terms of service. That change rolled out to staging in December and went to production a few weeks ago… when this all started.

With the traffic levels in staging we didn’t see any problems from the change, but for whatever reason, things that shouldn’t be cached are caching, and it’s possible right now to request a resource twice in a row and to get the same anti-replay nonce each time due to an unexpected cache hit.

Ops will be undoing my final contribution soon, and in the meantime: sorry!

kikinovak · January 28, 2016, 8:29am

I’m running letsencrypt on two of my servers. Version 0.1.1 worked fine. Yesterday I upgraded to 0.2.0 and got the same errors described in this thread.

An unexpected error occurred:
The client sent an unacceptable anti-replay nonce :: JWS has invalid anti-replay nonce
Please see the logfiles in /var/log/letsencrypt for more details.

This morning I upgraded to the freshly released 0.3.0, but got the same result. I have a script that generates/refreshes certificates for a given domain and subdomains (www.example.com, example.com, mail.example.com, cloud.example.com). The script succeeded for the first two and last domain, but failed mysteriously on the third one.

I’m not sure I understand the technical descriptions above, but here’s my question: does it make sense to downgrade the letsencrypt client to version 0.1.1, which worked fine?

Cheers,

Niki

thefalken · January 28, 2016, 2:52pm

So how does the client recover, then?

jsha · January 28, 2016, 5:24pm

The bug is on the server side. It's possible there were changes between 0.1.1 and 0.2.0 that would trigger the bug more frequently. We haven't found anything to indicate that. Most likely your previous successful issuances happened before we introduced the server-side bug.

So, downgrading is unlikely to fix the issue, and we will be rolling out a server-side fix very soon. Sorry for the trouble!

kikinovak · January 28, 2016, 6:32pm

OK thanks very much! Just upgraded letsencrypt to 0.3.0 on my servers.

Cheers,

Niki

jsha · January 28, 2016, 11:34pm

We’ve deployed the server-side fix for this. Please post if you still get this problem.

jsha · January 29, 2016, 9:07pm

Here’s a summary of the problem:

On January 20, 2:32AM UTC, we rolled out a change to our Akamai configs. The goal of the change was to disable HTTP/2, but it accidentally included a change to our caching setup that we made back in December but had not yet pushed to our production Akamai configs.

Previously, no ACME API requests were ever cached. After the change, Akamai would cache API requests that looked cacheable. In particular, HEAD /acme/new-authz started to be cached, along with its headers, which include the Replay-Nonce header. The Python client uses that request to fetch its initial nonce, to use in a subsequent POST request.

The main property required of a nonce is that it never be reused, so you can see why it’s a big problem to return a cached Replay-Nonce header. Whenever someone tried to submit a POST with a nonce that had already been used, either by themselves or another person, their POST would be rejected with badNonce. We didn’t have a 100% failure rate, because not everyone received a cached response to that HEAD request.
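
To illustrate why a cached nonce fails, here is a toy model of single-use nonce tracking. It is only a sketch of the general idea: the `NonceService` class and its method names are hypothetical and do not reflect how Boulder actually stores or validates nonces.

```python
import secrets


class NonceService:
    """Toy single-use nonce tracker (hypothetical, not Boulder's implementation)."""

    def __init__(self) -> None:
        self._outstanding = set()

    def issue(self) -> str:
        # Each call returns a new random value and remembers it as unused.
        nonce = secrets.token_urlsafe(16)
        self._outstanding.add(nonce)
        return nonce

    def redeem(self, nonce: str) -> bool:
        # A nonce is accepted once, then forgotten; any repeat is rejected.
        if nonce in self._outstanding:
            self._outstanding.remove(nonce)
            return True
        return False


service = NonceService()
cached_nonce = service.issue()  # stands in for the cached HEAD response

# Two clients receive the same cached Replay-Nonce value.
first_post_ok = service.redeem(cached_nonce)
second_post_ok = service.redeem(cached_nonce)
print(first_post_ok, second_post_ok)
```

In this model the first POST is accepted and the second is rejected, which matches the intermittent badNonce failures described above.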

We fixed the issue by rolling back the caching changes in our Akamai config. We’re also going to update Boulder to set explicit caching headers indicating that no requests should be cached except those we explicitly specify (/acme/cert and /acme/issuer-cert).
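
As a rough sketch of what such an explicit policy could look like: the function name, the `no-store` directive, and the one-hour `max-age` below are assumptions chosen for illustration, since the post only names the two cacheable paths and Boulder's real header values may differ.

```python
# Paths named in the post as the only ones that should be cacheable.
CACHEABLE_PATHS = ("/acme/cert", "/acme/issuer-cert")

# Assumed values for illustration only; the post does not specify them.
CACHEABLE_HEADER = "public, max-age=3600"
DEFAULT_HEADER = "no-store"


def cache_control_for(path: str) -> str:
    """Return a Cache-Control value: cache only explicitly listed paths."""
    for prefix in CACHEABLE_PATHS:
        # Match the path itself or anything beneath it, e.g. /acme/cert/<id>.
        if path == prefix or path.startswith(prefix + "/"):
            return CACHEABLE_HEADER
    return DEFAULT_HEADER
```

Defaulting to the uncacheable value means a newly added endpoint that returns a Replay-Nonce header stays uncached unless someone lists it deliberately.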