One user regularly gets badNonce errors; for details, please see the issue:
No other user has reported this error. Might it be specific to the user's environment?
I went through the code and found no place where a nonce could be reused. The debug output even shows that the rejected nonce matched a nonce recently issued by Boulder. I wonder what the real cause of the reported badNonce error might be. Is there a way for you to check the Boulder server logs?
Thanks for reporting this. I haven’t seen any other complaints about nonce errors recently. Pretty strange!
Is there any chance the user in question is accessing the API from multiple egress public IP addresses?
Unfortunately, since https://github.com/letsencrypt/boulder/pull/3421 landed I'm no longer able to look at the nonces that Boulder returned to the client. I can only see the bad nonce errors in the logs, but those alone aren't helpful in this case.
Could you update your debugging code to log the ACME request & response during issuance? From the debug output in the linked issue it was hard to tell which endpoint returned the nonce and which request subsequently used it.
One other suggestion: your client code should treat badNonce errors as recoverable. The badNonce error response itself carries a fresh nonce that the client can use to retry the request immediately. I recommend adding the ability to retry requests that encounter a badNonce error up to a fixed maximum number of attempts, perhaps with exponential backoff as well.
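To sketch what I mean, here is a minimal retry loop in Go. The `signBody` callback, the `postWithRetry` helper, and the retry/backoff limits are illustrative placeholders, not any particular client's API:

```go
// Package acmeclient sketches badNonce retry handling; signBody and the
// retry limits are placeholders, not any particular client's API.
package acmeclient

import (
	"bytes"
	"encoding/json"
	"errors"
	"io"
	"net/http"
	"time"
)

// signBody stands in for the client's JWS signing routine: it embeds the
// given nonce in the protected header and signs the payload with the
// account key.
type signBody func(nonce string, payload []byte) ([]byte, error)

// postWithRetry sends a signed ACME request and retries on badNonce errors,
// re-signing with the fresh nonce from the error response each time.
func postWithRetry(client *http.Client, url string, payload []byte, nonce string, sign signBody, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		body, err := sign(nonce, payload)
		if err != nil {
			return nil, err
		}
		resp, err := client.Post(url, "application/jose+json", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode < 400 {
			return resp, nil
		}
		// Read the problem document to see whether this is a badNonce error.
		raw, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		var problem struct {
			Type string `json:"type"`
		}
		_ = json.Unmarshal(raw, &problem)
		if problem.Type != "urn:ietf:params:acme:error:badNonce" || attempt >= maxRetries {
			return nil, errors.New("request failed: " + string(raw))
		}
		// The error response carries a fresh nonce in its Replay-Nonce header;
		// use it for the next attempt instead of the rejected one.
		nonce = resp.Header.Get("Replay-Nonce")
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff between retries
	}
}
```

The important part is that each retry re-signs the request with the nonce taken from the Replay-Nonce header of the error response rather than reusing the nonce that was just rejected.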
The user wrote: “I use only one IP to access the API, but domains are set up on various IPs.”
I wonder why that matters. A nonce should be accepted exactly once, independently of client IP addresses; the client IP to which the nonce was issued may legitimately differ from the client IP address that uses it.
I asked the user whether it is a problem if domain names are revealed in the debug output. He prefers a private channel to disclose that information. I will still try to improve the debug output without disclosing domain names.
That is a feasible workaround for the problem. On the other hand, I would prefer to postpone it unless we can find a solution to the underlying problem. For the time being I cannot even exclude the possibility that an actual man-in-the-middle attack is going on between the user's ACME client and the Boulder server. With that workaround the client-side protection is weakened, because the attacker could force a series of invalid transactions that help it collect information to improve its attack.
Hi @bruncsak, thanks for the follow-up information!
The nonce pool is per-environment (staging vs prod) and per-datacentre. If the egress IP changes between requests, our load balancing may route a request that uses a nonce retrieved from one DC to another DC that doesn't know about it.
Since the user is only using one IP we can rule this out as a problem.
Great, thanks! I provided my e-mail to the user on the Github ticket so they can send the domains to me that way.
This isn't a workaround per se: even if we resolve this particular problem, it's a best practice that will help in the future if (for example) a nonce expires out of the pool between when you get it and when you use it.
I think you are misunderstanding the purpose of the nonce. I recommend you review Section 6.4 and the ACME threat model. The nonce is strictly to prevent replay attacks from a middle party that terminates the TLS connection (e.g. a CDN). The goal is to prevent the CDN from replaying a request it processed and forwarded to the ACME server previously. A man-in-the-middle attacker is unable to modify the ACME request without breaking the JWS that authenticates it with the user’s account key. Retrying requests on nonce-failure will not give a MITM any information they couldn’t already observe.
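To make that concrete, here is a minimal sketch of how the nonce is bound inside the signed protected header of an ACME-style JWS. The account URL, nonce, target URL, and payload below are placeholders for illustration only:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// buildJWS shows how the nonce and target URL are bound into the signed
// protected header: a middle party cannot change either without invalidating
// the signature made with the account key.
func buildJWS(key *ecdsa.PrivateKey, kid, nonce, url string, payload []byte) (string, error) {
	b64 := base64.RawURLEncoding
	protected, _ := json.Marshal(map[string]string{
		"alg":   "ES256",
		"kid":   kid,   // account URL identifying the account key
		"nonce": nonce, // anti-replay nonce, covered by the signature
		"url":   url,   // target endpoint, also covered by the signature
	})
	signingInput := b64.EncodeToString(protected) + "." + b64.EncodeToString(payload)
	digest := sha256.Sum256([]byte(signingInput))
	r, s, err := ecdsa.Sign(rand.Reader, key, digest[:])
	if err != nil {
		return "", err
	}
	// ES256 signatures are the fixed-width concatenation of r and s.
	sig := make([]byte, 64)
	r.FillBytes(sig[:32])
	s.FillBytes(sig[32:])
	jws, _ := json.Marshal(map[string]string{
		"protected": b64.EncodeToString(protected),
		"payload":   b64.EncodeToString(payload),
		"signature": b64.EncodeToString(sig),
	})
	return string(jws), nil
}

func main() {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	jws, _ := buildJWS(key, "https://example.com/acme/acct/1", "example-nonce",
		"https://example.com/acme/new-order", []byte(`{"identifiers":[]}`))
	fmt.Println(jws)
}
```

Because the nonce and the target URL sit inside the protected header, tampering with either one invalidates the account-key signature.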
Let me describe a theoretical attack scenario in detail. I know that the nonce feature is there to protect against replay attacks; in my scenario there is not even an attempt at replay. The attacker's goal is the client's account key, and to achieve it the attacker misuses the nonce feature. Let's assume I have a naive client implementation of badNonce error recovery: unlimited retries, without exponential backoff. The attacker intercepts the traffic from the client and does not even forward it to the Boulder server; all interaction stays between the attacker and the client. The attacker always answers the client with a badNonce error and selects whatever new nonce it feels is appropriate to gather the most information from the client. The client keeps signing the same message with different nonces using the same key. This repetition may leak enough information for the attacker to reconstruct the client's account key or, at minimum, to get a different message of the attacker's choice signed.
You're talking about trying to learn the ACME account key by forcing the client to sign requests that contain nonces of the attacker's choosing? What you're describing is akin to an adaptive chosen-message attack on a signature algorithm. Any signature algorithm that reveals private key data through its signatures is completely broken.
We’re muddying the waters of debugging this badNonce problem so I think we should table this sub-discussion. You’re free to choose not to implement retries if that’s your decision but the attack you’re describing presumes faults in internet standard cryptography that are unrealistic in this context.
I had the code for the retry algorithm ready even before I opened that issue; I just did not push it to GitHub. I was weighing the trade-off between client security and script usability, since there is already a primitive retry mechanism in place: the user can re-run the script, and it seems to be slowly converging to the point where all the certificates can be created.
I find the problem we are facing interesting and I do not want to hide it with a workaround. Does Boulder frequently answer with badNonce errors in general?
It's not something that we expect frequently. Cases like this (consistent badNonce errors affecting the same user, with no obvious explanation) are rare, and I agree we should keep looking into it.
There are cases that can cause occasional badNonce errors during normal operation: for instance, if your client gets a nonce, then sets up a bunch of DNS TXT records for a DNS-01 challenge, waits around for them to become ready, and only then uses the nonce. If enough time elapses between getting the nonce and using it, you can get a badNonce error that would be fixed by retrying with a fresher nonce. How long the wait between getting the nonce and using it needs to be to provoke an error is not constant; it depends on the overall load on Let's Encrypt, so retrying is a sensible way to recover gracefully when it happens.
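One way for a client to shrink that window is to fetch the nonce from the newNonce endpoint immediately before signing and sending the request, instead of holding onto one across the DNS propagation wait. A rough sketch follows; the staging URL is only an example, and in practice the URL should come from the directory object:

```go
package main

import (
	"fmt"
	"net/http"
)

// freshNonce does a HEAD request against the newNonce endpoint immediately
// before the signed POST, so the nonce has little time to expire while the
// client waits for DNS records to propagate.
func freshNonce(client *http.Client, newNonceURL string) (string, error) {
	resp, err := client.Head(newNonceURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	nonce := resp.Header.Get("Replay-Nonce")
	if nonce == "" {
		return "", fmt.Errorf("no Replay-Nonce header in response from %s", newNonceURL)
	}
	return nonce, nil
}

func main() {
	// Illustrative endpoint; the real URL comes from the ACME directory object.
	nonce, err := freshNonce(http.DefaultClient, "https://acme-staging-v02.api.letsencrypt.org/acme/new-nonce")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("fresh nonce:", nonce)
}
```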
Keeping this thread up to date with the Github thread.
We've ruled out the proxy angle I suggested above. I believe I've identified the root cause: the client is sending most of its requests from an IPv6 address and being load balanced to one DC, while a very small number of requests are being sent from a different IPv4 address and load balanced to a different DC.
I feel that restricting the validity of an issued nonce to a single datacenter is a violation of the protocol. On the other hand, implementing Boulder that way is entirely understandable performance-wise.
My idea for an improvement: nothing in the standard says that a nonce cannot have internal structure, so the datacenter ID could be embedded in the nonce. When a nonce is used and the datacenter ID embedded in it matches the local datacenter ID, the algorithm proceeds as before; that covers the vast majority of connections, with no compromise on performance. In the rare case where the embedded datacenter ID does not match the local one, Boulder contacts the other datacenter to arbitrate the validity of the nonce.
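To illustrate the idea, here is a rough sketch in Go. The datacenter prefix format, the NonceArbiter type, and the local/remote lookup hooks are all hypothetical and are not Boulder's actual nonce implementation:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NonceArbiter validates nonces that carry a datacenter ID prefix such as
// "dc1-<random>". The type and both lookup hooks are hypothetical; they only
// illustrate the proposed scheme.
type NonceArbiter struct {
	localDC     string
	validLocal  func(nonce string) bool              // stand-in for the local nonce pool lookup
	validRemote func(dc, nonce string) (bool, error) // stand-in for a cross-datacenter arbitration call
}

// Redeem accepts a nonce at most once, asking the issuing datacenter to
// arbitrate when the embedded datacenter ID is not the local one.
func (a *NonceArbiter) Redeem(nonce string) (bool, error) {
	dc, _, found := strings.Cut(nonce, "-")
	if !found {
		return false, errors.New("nonce has no datacenter prefix")
	}
	if dc == a.localDC {
		// Common case: the nonce was issued here, so validate against the local pool.
		return a.validLocal(nonce), nil
	}
	// Rare case: ask the datacenter that issued the nonce whether it is still valid.
	return a.validRemote(dc, nonce)
}

func main() {
	a := &NonceArbiter{
		localDC:    "dc1",
		validLocal: func(nonce string) bool { return true }, // dummy local pool for the sketch
		validRemote: func(dc, nonce string) (bool, error) {
			return false, fmt.Errorf("cross-DC arbitration with %s not wired up in this sketch", dc)
		},
	}
	ok, err := a.Redeem("dc1-3jgGm1Qw")
	fmt.Println(ok, err)
}
```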
Hi @cpu,
I restrict the egress IP address by using curl's IPv6-only option (-6). It has been working really well ever since!
But it seems you have had a problem with your IPv6 addresses for some days now: