We're running our ACME client on a public cloud network and regularly getting the badNonce error for some reason. Most requests succeed, but a small number fail even after retrying, and this happens daily.
I understand that the nonce pool is shared per-datacenter, and that if the egress IP changes between requests, a request may fail with this error. But we're using neither multiple egress IPs nor dual-stack (IPv4/IPv6).
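For context on our retry handling: RFC 8555 (§6.5.1) says a badNonce error response must carry a fresh Replay-Nonce header, so our retries reuse that nonce rather than fetching a new one first. A simplified sketch of that logic (not our actual client code; the function name is ours):

```javascript
// Given the HTTP status, the parsed application/problem+json body, and the
// response headers (lower-cased keys, as Node.js provides them), return the
// fresh nonce to retry with, or null if this isn't a retryable badNonce.
function nonceForRetry(status, problem, headers) {
  const isBadNonce =
    status === 400 &&
    problem &&
    problem.type === 'urn:ietf:params:acme:error:badNonce';
  if (!isBadNonce) return null;
  // RFC 8555 requires a fresh Replay-Nonce on badNonce error responses.
  return headers['replay-nonce'] || null;
}
```

Even with this in place, some requests exhaust their retries, which is what prompted the investigation below.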
During the investigation, we tried disabling the connection pool our client was using (agentkeepalive on npm), and this surprisingly fixed the issue; we haven't seen the error since. But we'd like to understand why that could be the case.
Let's assume an ACME client is running on a public cloud (e.g., AWS), and in a given region...
Is it possible that requests from a single AWS region to Boulder (https://acme-v02.api.letsencrypt.org/) may hit different datacenters because of anycast routing?
For instance, I see you're using Cloudflare Spectrum (possibly with Argo?), and Cloudflare presumably has private peering with AWS (e.g., per their blog post). So if multiple Cloudflare regions are connected to that AWS region via private peering, and the network cost to each datacenter is roughly equivalent (assuming the same anycast IP is announced from multiple regions), could requests from the AWS region to Boulder land on different Cloudflare regions, and hence different Boulder datacenters? How does that play out?
In our case, assuming the above, a few out of the hundreds of connections in our pool might have been routed to a different Cloudflare region, so each request was effectively rolling the dice. Does this theory sound reasonable to you?
Or, is there some mechanism that would make your origin think requests came from different actors even though they originate from the same source IP/network and hit the same datacenter?
Thanks in advance!