Intermittent badNonce error when using connection pool (with fixed source IP)

Hi team,

We're running our ACME client on a public cloud network and regularly see the badNonce error. Most requests succeed, but a small fraction fail even after retrying, and this happens on a daily basis.
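For context on what "retrying" means here, below is a minimal sketch of the retry pattern RFC 8555 describes: a badNonce error response carries a fresh Replay-Nonce header, and the client re-sends the request with that nonce. This is not our production code; the `Send` transport, the `AcmeResponse` shape, and the default of three attempts are all hypothetical names and values chosen for illustration.

```typescript
// Only the parts of an ACME response that this sketch looks at.
interface AcmeResponse {
  status: number;
  replayNonce: string | null; // value of the Replay-Nonce response header
  errorType?: string;         // "type" field of the problem document, if any
}

// Hypothetical transport: signs the payload with the given nonce and POSTs it.
type Send = (url: string, payload: unknown, nonce: string) => Promise<AcmeResponse>;

// ACME error type defined in RFC 8555.
const BAD_NONCE = "urn:ietf:params:acme:error:badNonce";

// Returns the nonce to retry with, or null if the response does not call for a retry.
function retryNonce(res: AcmeResponse): string | null {
  return res.errorType === BAD_NONCE ? res.replayNonce : null;
}

// Sends the request, retrying with the server-provided nonce on badNonce.
async function postWithNonceRetry(
  send: Send,
  url: string,
  payload: unknown,
  firstNonce: string,
  maxAttempts = 3, // assumed limit, not taken from any spec
): Promise<AcmeResponse> {
  let res = await send(url, payload, firstNonce);
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    const fresh = retryNonce(res);
    if (fresh === null) break; // success, or an error that a new nonce will not fix
    res = await send(url, payload, fresh);
  }
  return res;
}
```

In the failures described above, the final response would still be a badNonce error after the last attempt.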

I understand that the nonce pool is shared per-datacentre, and if the egress IP changes between requests, requests may fail with such an error. But we're using neither multiple egress IPs nor dualstack (IPv4/IPv6).

During the investigation, we tried disabling the connection pool that our client was using (e.g., agentkeepalive - npm). Surprisingly, this improved the situation, and we have not seen the error since. But we need to better understand why that would be the case.
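To make "disabling the connection pool" concrete, here is a rough sketch using Node's built-in `https.Agent` rather than agentkeepalive, so it is an approximation of the change and not our exact configuration. The `maxSockets` value and the `agentFor` helper are hypothetical.

```typescript
import * as https from "node:https";

// Pooled: idle sockets are kept open and reused by later requests.
// maxSockets is an arbitrary value picked for this sketch.
const pooledAgent = new https.Agent({ keepAlive: true, maxSockets: 16 });

// Unpooled: idle sockets are not kept around for reuse, so requests
// generally end up on freshly opened connections.
const unpooledAgent = new https.Agent({ keepAlive: false });

// Hypothetical helper: pick the agent to pass to https.request().
function agentFor(usePool: boolean): https.Agent {
  return usePool ? pooledAgent : unpooledAgent;
}

// Usage (not executed here):
//   https.request(url, { method: "HEAD", agent: agentFor(false) });
```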

Let's assume an ACME client is running on a public cloud (e.g., AWS), and in a given region...

Is it possible that requests from a single AWS region to Boulder (https://acme-v02.api.letsencrypt.org/) may hit different data centers because of the anycast effect?

For instance, I see you're using Cloudflare Spectrum (and possibly Argo?), and Cloudflare probably has private peering with AWS (e.g., their blog post). So if multiple Cloudflare regions are connected to the AWS region via private peering, and the network cost to each DC is roughly equivalent (assuming they're announcing the same anycast IP from multiple regions), could requests from the AWS region to Boulder land on different Cloudflare regions and hence on different Boulder datacenters? How does that play out?

In our case, assuming the above, the connection pool might have held a few connections, out of hundreds, that were routed to a different Cloudflare region, so each request was effectively rolling the dice (?). Does this theory sound reasonable to you?

Or, is there any special algorithm that would make your origin think requests were made from different actors even though they're from the same source IP/network and to the same DC?

Thanks in advance!

Nonce redemption should work no matter which of our DCs you hit: we look at the prefix on the nonce to locate which nonce server to redeem it from.

There’s only one case I know of where nonce redemption should fail, which is if a nonce server restarts. We only store nonces ephemerally in memory, so if a nonce is from an instance that has since restarted, we can’t redeem it anymore. We’re working on making this better, but it’s not all the way done yet.
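To illustrate the two points above, here is a toy model, not Boulder's actual code: the nonce prefix selects which nonce server to ask, and because each server only keeps nonces in memory, a restart invalidates everything it had issued. The prefix length, class name, and counter-based nonce values are assumptions made for the sketch; real nonces would be unpredictable.

```typescript
// Assumed prefix length, chosen only for this sketch.
const PREFIX_LEN = 4;

class NonceServer {
  readonly prefix: string;
  private issued = new Set<string>(); // in-memory only, lost on restart
  private counter = 0;

  constructor(prefix: string) {
    this.prefix = prefix;
  }

  // Issues a nonce that starts with this server's prefix.
  // A counter keeps the sketch deterministic; it is not how real nonces are made.
  issue(): string {
    const nonce = this.prefix + (this.counter++).toString(36);
    this.issued.add(nonce);
    return nonce;
  }

  // Set.delete returns true only if the nonce was present,
  // so each nonce can be redeemed at most once.
  redeem(nonce: string): boolean {
    return this.issued.delete(nonce);
  }

  // Models a process restart: all outstanding nonces are forgotten.
  restart(): void {
    this.issued.clear();
  }
}

// Routes a redemption to the server named by the nonce's prefix,
// regardless of which frontend or datacenter received the request.
function redeem(servers: Map<string, NonceServer>, nonce: string): boolean {
  const server = servers.get(nonce.slice(0, PREFIX_LEN));
  return server !== undefined && server.redeem(nonce);
}
```

In this model, a nonce issued by one server redeems successfully through the shared lookup, a second redemption of the same nonce fails, and any nonce issued before `restart()` fails afterwards.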

Now, it’s possible there are reasons for nonce redemption to fail that I don’t know about! Can you characterize what “intermittent” means any better? If we can show it happened outside of times we restarted nonce servers, then perhaps we need to investigate further.

@mcpherrinm Thanks for getting back on this!

That's interesting! I was referring to these old comments earlier:

So is this no longer the case today? That would make this problem even more mysterious, though...

There’s only one case I know of where nonce redemption should fail, which is if a nonce server restarts.

Based on the fact that just disabling the connection pool made a significant difference, I doubt that's the case here. But that's good to know, thanks for sharing that. Appreciate your transparency.

Can you characterize what “intermittent” means any better? If we can show it happened outside of times we restarted nonce servers, then perhaps we need to investigate further.

As an example, requests to /acme/new-order were failing at a ~30% rate (UPDATE: This number was not accurate; the real rate must have been lower. Because we call multiple ACME endpoints throughout the new-order process, we saw the badNonce error across endpoints, and that rolled up to a total of ~30% failures in the overall process), with no clear characteristic that would indicate server restarts or anything similar. This went down to 0% after disabling the connection pool.
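To show how a lower per-request failure rate can roll up to roughly 30% per order, here is a small calculation. It assumes each request fails independently with the same probability, which may not hold in practice, and the 7% and 5-request figures are made-up illustrative values, not measurements from our system.

```typescript
// Probability that at least one of `requestsPerOrder` requests fails,
// assuming independent failures with the same per-request probability.
function overallFailureRate(perRequestFailure: number, requestsPerOrder: number): number {
  return 1 - Math.pow(1 - perRequestFailure, requestsPerOrder);
}

// Hypothetical inputs: 7% per-request failure, 5 requests per order.
const illustrative = overallFailureRate(0.07, 5); // roughly 0.30
```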

We'd love to provide further data if we can discuss this privately over email or something.

Thanks!

Sure. The most helpful things to know would be the account IDs, the timestamps at which you saw nonce errors, and the values of the nonces. You can send me a direct message on this forum with that if you don’t want to post it.

2018-era comments are definitely outdated; nonces have improved since then.

@mcpherrinm Thanks! Just sent you a message
