I randomly get badNonce errors when placing orders. The system I’ve built to place orders has retry logic for when unexpected errors are raised (the error is caught and the operation is retried a certain number of times before being allowed to fail). The operation completes successfully on the second attempt every time, so it appears that simply retrying “fixes” the issue. However, I’d like to figure out whether there is something I can do to avoid this.
I received a badNonce error three times this morning for a single order, and the error appeared at all three points of communication with Let’s Encrypt. Below is a list of the operations and the nonces that were rejected.
request authorization: B2sDFNf5FesNVowkqdbwZMQzxa9J1b6Mad99WDjAnKA
request verification: 7qJoQuc-yTNqzXMM1gNvaR-kJ_rLqHa8JA-WBU5-TmM
get new certificate: pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM
Again, each operation was automatically retried and completed successfully on the second attempt. Ultimately the order went through fine and the cert was obtained.
Thanks.
Is this built on top of an existing ACME client, or is this a from-scratch ACME implementation? Is the source available to view anywhere?
It appears as though there is a bug in the code being used to select the correct RequestNonce to send back to the server.
As an example, looking at the request with the nonce pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM (received at 15:32:08.956 UTC), I see that there was a request immediately preceding it (at 15:32:08.146 UTC) that received a response including a ResponseNonce value of RDY_wHFdPwhDKDnFn0_tbLKbvnjPY1ZtbuNaL1AcyOc. I would expect to see this nonce sent as the RequestNonce in the next request, but instead I see the rejected RequestNonce of pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM. Searching for the rejected nonce value, I see that Let's Encrypt returned it as a ResponseNonce at 12:22:52 UTC, much earlier than the 15:32:08.956 UTC time at which it was sent as a RequestNonce.
Thanks for the explanation of when the nonces were originally received. It has been pretty helpful in establishing a timeline. I’m going to add some logging for nonces so I have a bit more visibility into what is going on. Did you not see any requests on your end with the other two nonces (around 15:30 UTC or earlier)?
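For what it’s worth, the logging I have in mind is roughly this (a minimal sketch; the function names and hook points are just illustrative of where my client receives and sends nonces):

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("acme.nonce")

def log_response_nonce(nonce):
    # Record every ResponseNonce received from Let's Encrypt with a UTC
    # timestamp, so it can later be matched to the request it gets used in.
    log.info("received ResponseNonce %s at %s", nonce,
             datetime.now(timezone.utc).isoformat())

def log_request_nonce(nonce):
    # Record the nonce we're about to send back, so reuse or long delays
    # between receiving and sending show up clearly in the logs.
    log.info("sending RequestNonce %s at %s", nonce,
             datetime.now(timezone.utc).isoformat())
```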
I see 7qJoQuc-yTNqzXMM1gNvaR-kJ_rLqHa8JA-WBU5-TmM first as a ResponseNonce delivered to the client on 09/08/2016 12:22:11.122 UTC. It was sent back by the client as a RequestNonce on 09/08/2016 15:31:47.388 UTC.
For B2sDFNf5FesNVowkqdbwZMQzxa9J1b6Mad99WDjAnKA, it was first sent as a ResponseNonce on 09/08/2016 12:21:05.980 UTC and was received again as a RequestNonce at 09/08/2016 15:30:35.410 UTC.
I think in order to understand what’s going on here it’s important to understand how this system I’m working on processes requests for certificates. Essentially the entire thing is done asynchronously with message queues. The process has eight or nine queues; however, only three of them deliver messages to pollers that actually communicate with Let’s Encrypt. Those three pollers each call one Let’s Encrypt API endpoint: new-authz, authz verification, and new-cert.
The way AcmeClient works is that when it goes to send a request to Let’s Encrypt, it checks its in-memory storage of nonces for a nonce and uses it. If there aren’t any nonces, it makes a request to Let’s Encrypt essentially just to grab a nonce out of the response that can then be used for the real request. Let’s Encrypt gives an appropriate response and the nonce received in that response is stored.
Because the request happens a single time, and because a message is then passed to the next poller to handle the next step, the nonce that was received in the response from Let’s Encrypt does not get used until a new message is received by that poller. [I hope I’m making sense here.]
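To make that concrete, here is a rough sketch of the nonce handling I’m describing (not my actual AcmeClient code; the HEAD request to the directory is just a stand-in for however the first nonce gets fetched):

```python
import requests

# Illustrative only: this mirrors the nonce bookkeeping described above.
ACME_DIRECTORY = "https://acme-v01.api.letsencrypt.org/directory"

class NonceCache:
    def __init__(self):
        self._nonces = []  # in-memory nonce storage

    def get(self):
        # Use a previously stored nonce if one is available...
        if self._nonces:
            return self._nonces.pop()
        # ...otherwise make a request purely to harvest a nonce from the
        # Replay-Nonce response header.
        resp = requests.head(ACME_DIRECTORY)
        return resp.headers["Replay-Nonce"]

    def store(self, response):
        # Remember the nonce from this response. Because each poller handles
        # one message and then hands off to the next queue, this stored nonce
        # can sit here for hours before the poller receives another message.
        self._nonces.append(response.headers["Replay-Nonce"])
```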
All of this led me to believe that the nonces I wrote about yesterday must have expired over the three hours between when they were received and when they were used again. Then I saw a comment on another thread from jsha where he clearly states that nonces do not expire due to the passage of time. He says that nonces can go “bad” because they’ve already been used, or because “any time Boulder’s frontend gets restarted, existing nonces are no longer valid.”
So this leaves me wondering: were those nonces that I got errors for used in a previous request (it doesn’t look like it based on the info you provided, cpu), or was the Boulder “frontend” (not sure what that refers to) restarted yesterday between 12:22 UTC and 15:30 UTC?
It looks like the last restart of the WFE (web front end, the part of Boulder that handles API requests) in production was on 01/08/2016 and wouldn't have been responsible for the bad nonce errors.
I'll re-read your system description (Thank you for providing that!) and see if I can ascertain any other hints from the logs. I probably won't have a chance to look until tomorrow AM - thanks for your patience!
Thanks. Hopefully the logging I’m putting in place will clue me in to whether or not the nonces have already been used (it seems they must have been). Thanks for all your help on this.
Hi @nickmcdonnough - apologies on the delay getting back to you. I can’t find any evidence that the nonces were previously used, or that the WFE was restarted. Very puzzling. I’ve tapped some folks with more experience than myself to see if they can shed some light on this. I’ll let you know what I find out.
Well, I’ve learned a few things about how Boulder’s nonce generation is implemented. Apologies for leading you down the wrong path initially.
The short answer here is that the nonces you submitted that generated a bad nonce error were expired.
Like @jsha mentions in the thread you linked, the nonces do not have a fixed validity period (e.g. good for 1d, 4h, etc.). Rather, Boulder keeps a fixed number of nonces in play, and if more than that number are requested, some of the existing nonces fall out of validity and are in effect expired.
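To illustrate the idea (a toy model only, not Boulder’s actual code, and MAX_NONCES is deliberately tiny here):

```python
from collections import OrderedDict
import secrets

MAX_NONCES = 4  # tiny on purpose; the real number in play is much larger

class NoncePool:
    """Toy model: only a fixed number of nonces are valid at any given time."""

    def __init__(self):
        self._valid = OrderedDict()

    def issue(self):
        nonce = secrets.token_urlsafe(32)
        self._valid[nonce] = True
        # Once the pool is full, the oldest outstanding nonce is dropped, so it
        # is effectively expired even though no wall-clock lifetime is involved.
        if len(self._valid) > MAX_NONCES:
            self._valid.popitem(last=False)
        return nonce

    def redeem(self, nonce):
        # A nonce is good exactly once, and only if it hasn't been pushed out.
        return self._valid.pop(nonce, None) is not None
```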
Based on current usage volume I’d estimate we start rolling nonces out of validity inside of ~1-2 hours for production. Since your requests were ~4 hours apart the nonces had expired. I think the ultimate take-away should be to view the bad nonce errors as non-fatal and have your systems recover by refreshing the nonce used.
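In client terms the recovery might look something like this sketch, where sign_and_post and fetch_fresh_nonce are placeholders for whatever your client already does:

```python
def post_with_nonce_retry(sign_and_post, fetch_fresh_nonce, payload, max_attempts=3):
    """Treat badNonce as non-fatal: refresh the nonce and retry the request."""
    nonce = fetch_fresh_nonce()
    for _ in range(max_attempts):
        resp = sign_and_post(payload, nonce)
        body = resp.json() if resp.status_code == 400 else {}
        if body.get("type") != "urn:acme:error:badNonce":
            return resp
        # The cached nonce had rolled out of validity; grab a fresh one and retry.
        nonce = fetch_fresh_nonce()
    return resp
```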
That is awesome! Thanks for the detailed explanation. This is a relief because the behavior I was experiencing led me to believe exactly what you described.
Thanks a ton for taking the time to look into this. FYI, when my team and I finally flip the switch on this service we’ll be placing lots of orders and likely won’t see this problem again.