I randomly get badNonce errors when placing orders. The system I’ve built to place orders has retry logic for when unexpected errors are raised (the error is caught and the operation is retried a certain number of times before being allowed to fail). The operation completes successfully on the second attempt every time, so it appears that simply retrying “fixes” the issue. However, I’d like to figure out whether there is something I can do to avoid this.
I received a badNonce error three times this morning for a single order, and the error appeared at all three points of communication with Let’s Encrypt. Below is a list of the operations and the nonces that were rejected.
request authorization: B2sDFNf5FesNVowkqdbwZMQzxa9J1b6Mad99WDjAnKA
request verification: 7qJoQuc-yTNqzXMM1gNvaR-kJ_rLqHa8JA-WBU5-TmM
get new certificate: pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM
Again, each operation was automatically retried and completed successfully on the second attempt. Ultimately the order went through fine and the cert was obtained.
Thanks.
Is this built on top of an existing ACME client, or is this a from-scratch ACME implementation? Is the source available to view anywhere?
It appears as though there is a bug in the code being used to select the correct RequestNonce to send back to the server.
As an example, looking at the request with the nonce pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM (received at 15:32:08.956 UTC), I see that there was a request immediately preceding it (at 15:32:08.146 UTC) that received a response including a ResponseNonce value of RDY_wHFdPwhDKDnFn0_tbLKbvnjPY1ZtbuNaL1AcyOc. I would expect to see this nonce sent as the RequestNonce in the next request, but instead I see the rejected RequestNonce of pwkVxyknk2qkSs5gRpY-tIO7bG2ByYWx5QPp8d4spXM. Searching for the rejected nonce value, I see that Let's Encrypt returned it as a ResponseNonce at 12:22:52 UTC, much earlier than the 15:32:08.956 UTC time at which it was sent as a RequestNonce.
Thanks for the explanation of when the nonces were originally received. It has been pretty helpful in establishing a timeline. I’m going to add some logging for nonces so I have a bit more visibility into what is going on. Did you not see any requests on your end with the other two nonces (around 15:30 UTC or earlier)?
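For what it’s worth, the logging I have in mind is roughly this (a minimal sketch; the function names and hook points are just illustrative of where my client receives and sends nonces):

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("acme.nonce")

def log_response_nonce(nonce):
    # Record every ResponseNonce received from Let's Encrypt with a UTC
    # timestamp, so it can later be matched to the request it gets used in.
    log.info("received ResponseNonce %s at %s", nonce,
             datetime.now(timezone.utc).isoformat())

def log_request_nonce(nonce):
    # Record the nonce we're about to send back, so reuse or long delays
    # between receiving and sending show up clearly in the logs.
    log.info("sending RequestNonce %s at %s", nonce,
             datetime.now(timezone.utc).isoformat())
```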
I see 7qJoQuc-yTNqzXMM1gNvaR-kJ_rLqHa8JA-WBU5-TmM first as a ResponseNonce delivered to the client on 09/08/2016 12:22:11.122 UTC. It was sent back by the client as a RequestNonce on 09/08/2016 15:31:47.388 UTC.
For B2sDFNf5FesNVowkqdbwZMQzxa9J1b6Mad99WDjAnKA, it was first sent as a ResponseNonce on 09/08/2016 12:21:05.980 UTC and was received again as a RequestNonce at 09/08/2016 15:30:35.410 UTC.
I think in order to understand what’s going on here it’s important to understand how this system I’m working on processes requests for certificates. Essentially the entire thing is done asynchronously with message queues. The process has eight or nine queues; however, only three of them deliver messages to pollers that actually communicate with Let’s Encrypt. Those three pollers each call one Let’s Encrypt API endpoint: new-authz, authz verification, and new-cert.
The way AcmeClient works is that when it goes to send a request to Let’s Encrypt, it checks its in-memory storage of nonces for a nonce and uses it. If there aren’t any nonces, it makes a request to Let’s Encrypt essentially just to grab a nonce out of the response that can then be used for the real request. Let’s Encrypt gives an appropriate response and the nonce received in that response is stored.
Because the request happens a single time, and because a message is then passed to the next poller to handle the next step, the nonce that was received in the response from Let’s Encrypt does not get used until a new message is received by that poller. [I hope I’m making sense here.]
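To make that concrete, here is a rough sketch of the nonce handling I’m describing (not my actual AcmeClient code; the HEAD request to the directory is just a stand-in for however the first nonce gets fetched):

```python
import requests

# Illustrative only: this mirrors the nonce bookkeeping described above.
ACME_DIRECTORY = "https://acme-v01.api.letsencrypt.org/directory"

class NonceCache:
    def __init__(self):
        self._nonces = []  # in-memory nonce storage

    def get(self):
        # Use a previously stored nonce if one is available...
        if self._nonces:
            return self._nonces.pop()
        # ...otherwise make a request purely to harvest a nonce from the
        # Replay-Nonce response header.
        resp = requests.head(ACME_DIRECTORY)
        return resp.headers["Replay-Nonce"]

    def store(self, response):
        # Remember the nonce from this response. Because each poller handles
        # one message and then hands off to the next queue, this stored nonce
        # can sit here for hours before the poller receives another message.
        self._nonces.append(response.headers["Replay-Nonce"])
```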
All of this led me to believe that the nonces I wrote about yesterday must have expired over the three hours between when they were received and when they were used again. Then I saw a comment on another thread from jsha where he clearly states that nonces do not expire due to the passage of time. He says that nonces can go “bad” because they’ve already been used, or because “any time Boulder’s frontend gets restarted, existing nonces are no longer valid.”
So this leaves me wondering: were those nonces that I got errors for used in a previous request (it doesn’t look like it based on the info you provided, cpu), or was the Boulder “frontend” (not sure what that refers to) restarted yesterday between 12:22 UTC and 15:30 UTC?
It looks like the last restart of the WFE (web front end, the part of Boulder that handles API requests) in production was on 01/08/2016 and wouldn't have been responsible for the bad nonce errors.
I'll re-read your system description (Thank you for providing that!) and see if I can ascertain any other hints from the logs. I probably won't have a chance to look until tomorrow AM - thanks for your patience!
Thanks. Hopefully the logging I’m putting in place will clue me in to whether or not the nonces have already been used (it seems they must have been). Thanks for all your help on this.
Hi @nickmcdonnough - apologies on the delay getting back to you. I can’t find any evidence that the nonces were previously used, or that the WFE was restarted. Very puzzling. I’ve tapped some folks with more experience than myself to see if they can shed some light on this. I’ll let you know what I find out.
Well, I’ve learned a few things about how Boulder’s nonce generation is implemented. Apologies for leading you down the wrong path initially.
The short answer here is that the nonces you submitted that generated a bad nonce error were expired.
Like @jsha mentions in the thread you linked, the nonces do not have a fixed validity period (e.g. good for 1d, 4h, etc.). Rather, Boulder keeps a fixed number of nonces in play, and if more than that number are requested, some of the existing nonces fall out of validity and are in effect expired.
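To illustrate the idea (a toy model only, not Boulder’s actual code, and MAX_NONCES is deliberately tiny here):

```python
from collections import OrderedDict
import secrets

MAX_NONCES = 4  # tiny on purpose; the real number in play is much larger

class NoncePool:
    """Toy model: only a fixed number of nonces are valid at any given time."""

    def __init__(self):
        self._valid = OrderedDict()

    def issue(self):
        nonce = secrets.token_urlsafe(32)
        self._valid[nonce] = True
        # Once the pool is full, the oldest outstanding nonce is dropped, so it
        # is effectively expired even though no wall-clock lifetime is involved.
        if len(self._valid) > MAX_NONCES:
            self._valid.popitem(last=False)
        return nonce

    def redeem(self, nonce):
        # A nonce is good exactly once, and only if it hasn't been pushed out.
        return self._valid.pop(nonce, None) is not None
```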
Based on current usage volume I’d estimate we start rolling nonces out of validity inside of ~1-2 hours for production. Since your requests were ~4 hours apart the nonces had expired. I think the ultimate take-away should be to view the bad nonce errors as non-fatal and have your systems recover by refreshing the nonce used.
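In client terms the recovery might look something like this sketch, where sign_and_post and fetch_fresh_nonce are placeholders for whatever your client already does:

```python
def post_with_nonce_retry(sign_and_post, fetch_fresh_nonce, payload, max_attempts=3):
    """Treat badNonce as non-fatal: refresh the nonce and retry the request."""
    nonce = fetch_fresh_nonce()
    for _ in range(max_attempts):
        resp = sign_and_post(payload, nonce)
        body = resp.json() if resp.status_code == 400 else {}
        if body.get("type") != "urn:acme:error:badNonce":
            return resp
        # The cached nonce had rolled out of validity; grab a fresh one and retry.
        nonce = fetch_fresh_nonce()
    return resp
```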
That is awesome! Thanks for the detailed explanation. This is a relief because the behavior I was experiencing led me to believe exactly what you described.
Thanks a ton for taking the time to look into this. FYI, when my team and I finally flip the switch on this service we’ll be placing lots of orders and likely won’t see this problem again.