On the host that renews poly.tique.info there is only one lego process at a time.
And lego only keeps the nonce in memory, so there's no shared cache here.
I have installed certbot via apt. It's version 4.0.0.
I requested a certificate with:
certbot certonly --manual --preferred-challenges dns -d poly.tique.info,www.tique.info
It was successful.
This command:
I noticed certbot specifies a different algorithm in the JWS protected field: 'RS256', versus 'ES256' for lego.
I'm waiting for advice on what I should do next to narrow down the source of the problem.
When you configured the systemd service for Lego, did you set Type=oneshot?
If not, then I would think systemd attempts to spawn it again immediately when it fails. In that case, thousands of retries would not be surprising.
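For reference, a oneshot unit might look like this (unit name, paths, and flags are all illustrative, not taken from the actual setup):

```ini
# /etc/systemd/system/lego-renew.service (illustrative)
[Unit]
Description=Renew certificate with lego
After=network-online.target

[Service]
# oneshot: systemd treats the process as a single task that runs to
# completion, and does not respawn it on failure unless Restart= says so.
Type=oneshot
# flags omitted for brevity
ExecStart=/usr/local/bin/lego renew
```

Such a unit is typically driven by a matching .timer rather than enabled directly.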
Not currently, though I am in the process of adding some. It's not high priority because this is a very isolated incident, but I am very curious to understand more about what's gone wrong here.
Seems like Lego will use an HTTP forward proxy if the HTTP(S)_PROXY environment variable is set (it also checks the lower-case name). If that's the case on your machine, then that would be worth checking out.
There are two Connection headers in the newNonce response, which I don't think makes sense and can't reproduce locally with curl.
I also see a different header order, although perhaps that just varies between LE servers (or Go/curl reorders them).
$ curl --http1.1 -I https://acme-v02.api.letsencrypt.org/acme/new-nonce
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 17 Oct 2025 19:20:12 GMT
Connection: keep-alive
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
Replay-Nonce: 9FCfmSBhCF4-AYi3KE0cDNHBaPT2_BT3ZsYgIi84kE9wbUBow2Q
X-Frame-Options: DENY
Strict-Transport-Security: max-age=604800
A malfunctioning proxy is my best (well, only) guess.
Still confusing since it looks like certbot also checks these env vars for a forward proxy.
I have not configured an HTTP forward proxy on my servers.
I did not spot that, but you are right. And I've got another log with the same double Connection header.
I cannot say whether it's a logging error or whether the response really has two Connection headers.
It doesn't show on all operations, only in the sequence after the newNonce response.
Do you mean only on the HEAD request to new-nonce and not for any of the GET or POST requests?
And, what about the Certbot new-nonce request? Did its log show the two 'connection' headers?
Yes.
And, what about the Certbot new-nonce request? Did its log show the two 'connection' headers?
No, the Certbot log doesn't show those two 'Connection' headers.
I cannot say whether it's a logging error or whether the response really has two Connection headers.
I looked into it some more and it's almost certainly an issue in the logging.
I tried to add logging based on this blog post, and it logs the requests as HTTP/1.1 but the responses as HTTP/2.0, which is strange. The protocol on the request object is probably not yet set correctly at the time the logging occurs.
If I disable HTTP/2 support in Go/Lego via GODEBUG=http2client=0 then I also see two Connection response headers with close and keep-alive.
For one attempt, I captured the traffic to look at in Wireshark, and I only see one Connection header with keep-alive set. That pretty much confirms that the Go logging isn't entirely accurate.
Importantly, all my attempts to get certificates with Lego succeeded, with and without the logging patch and with both HTTP/1.1 and HTTP/2. I didn't see any badNonce errors.
I'm at a loss. Hopefully Aaron's logging changes will help narrow down what the issue is.
Thanks for taking some time to try to reproduce my configuration.
It's a good point that you did reproduce the problem with logging.
I'm curious how you managed to analyse the exchange you captured with Wireshark. Did you use an HTTP proxy to get rid of TLS?
Since you've already touched the source code, I think you can add an SSL key log file too:
So far so good, but Go does not support it out of the box. But it does support (at least as of Go 1.20.4) the KeyLogWriter option in the tls.Config struct (docs). It accepts an io.Writer, which is handily returned by os.Create.
Yep, this is pretty much what I did. I also set the max TLS version to 1.2 because I read somewhere that Wireshark doesn't support all TLS 1.3 cipher suites.
It might be interesting to have a capture of your traffic, @Romuald.
This was my patch:
diff --git a/lego/client_config.go b/lego/client_config.go
index 969135a1..9d2ff509 100644
--- a/lego/client_config.go
+++ b/lego/client_config.go
@@ -71,6 +71,14 @@ type CertificateConfig struct {
// and potentially a custom *x509.CertPool
// based on the caCertificatesEnvVar environment variable (see the `initCertPool` function).
func createDefaultHTTPClient() *http.Client {
+ ssl_log_file, _ := os.Create("/tmp/go-ssl-log")
+ tlsClientConfig := &tls.Config{
+ ServerName: os.Getenv(caServerNameEnvVar),
+ RootCAs: initCertPool(),
+ KeyLogWriter: ssl_log_file,
+ MinVersion: tls.VersionTLS12,
+ MaxVersion: tls.VersionTLS12,
+ }
return &http.Client{
Timeout: 2 * time.Minute,
Transport: &http.Transport{
@@ -81,10 +89,7 @@ func createDefaultHTTPClient() *http.Client {
}).DialContext,
TLSHandshakeTimeout: 30 * time.Second,
ResponseHeaderTimeout: 30 * time.Second,
- TLSClientConfig: &tls.Config{
- ServerName: os.Getenv(caServerNameEnvVar),
- RootCAs: initCertPool(),
- },
+ TLSClientConfig: tlsClientConfig,
},
}
}
This is how I captured the traffic. I ran this inside a Debian 13 Docker container without IPv6 support. If you do have IPv6 connectivity then you'll have to adjust this some.
tcpdump "ip and tcp port 443 and host $(dig +short a acme-staging-v02.api.letsencrypt.org | tail -1)" -w /tmp/get-cert.pcap
By the way, I've been doing most of my testing on the staging environment. I suggest you do the same unless it's not possible to reproduce the problem there (which would be an interesting data point if that were the case).
Ok, I've built a patched version of lego with the patch provided by @dextercd.
I ran it to request a certificate for poly.tique.info against the Let's Encrypt staging servers.
It succeeded.
I was able to analyse the pcap in Wireshark with the content provided by the KeyLogWriter.
So I tried on production and it failed as usual, which is good for debugging!
As I had captured the exchange with tcpdump, I could analyse it too.
It's a request for renewal, so the process is different:
This last request is answered with HTTP/1.1 429 Too Many Requests, and the detail field explains: "Your account is temporarily prevented from requesting certificates for poly.tique.info and possibly others. Please visit: Let's Encrypt - Portal".
But lego doesn't show this, it only logs:
2025/10/20 19:03:29 [INFO] [poly.tique.info] acme: renewalInfo endpoint indicates that renewal is needed
2025/10/20 19:03:29 [INFO] [poly.tique.info] acme: Trying renewal with -244 hours remaining
2025/10/20 19:03:29 [INFO] [poly.tique.info] acme: Obtaining bundled SAN certificate
2025/10/20 19:03:31 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "9FCfmSBhbP-MUFNbwvsH_UWhNBcz1kZXGcMMNmkP0MKaiSfYw28"
2025/10/20 19:03:32 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "9FCfmSBhRum-dgbt7tnjvY0pNDqpC_rGXOV6S0yJlFumhpxiXJM"
2025/10/20 19:03:34 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "NKzugzQtO7mdK2C8HPxdZIJ5-dU49_oWcFxLB-45WY8Q5Y9HSpg"
2025/10/20 19:03:36 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "NKzugzQtAuNhbn5Y0XYHCBnwywPdRLJYdeFGDLrjL_8oyWLpAn4"
2025/10/20 19:03:38 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "9FCfmSBh3qYCJKFs9J8D14xtugiLGLbWb9EWy71yMOdaRRhiZcg"
2025/10/20 19:03:40 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "NKzugzQtS2FnzvEncUngc3edbjnq74ofXely-ZE6xBx4z9JyMy4"
2025/10/20 19:03:43 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "NKzugzQt-2QKa1G_abDOCTwE_r-DQbyDspYiI9oqrEbcdAI0ggk"
2025/10/20 19:03:48 acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:badNonce :: Unable to validate JWS :: JWS has an invalid anti-replay nonce: "9FCfmSBhR2edpF02kgYtAvvtRgmtb6FdhhvoTun6gn_CUPCzqlI"
So to me the conclusion is that the problem is in lego.
I will submit a bug report to lego.
Before I proceed to unpause my account, I can do some more tests, in case the context can't be reproduced afterwards.
Thank you all for your help and for all the things I have learned in this quest.
So Lego does a POST /acme/new-order, this returns a 429 error, then the exact same request is attempted again, which returns a 400 error for the reused nonce. Do I have that right?
That seems like very weird behaviour to me. Especially the fact that it doesn't show the 429 errors in the LEGO_DEBUG_CLIENT_VERBOSE_ERROR logs nor the logging you patched in.
Can you drop a link here once you open an issue on the Lego issue tracker? I'd be interested in seeing their analysis/fix for this issue.
Yeah, the code here might be the issue:
It only tries to fetch a new nonce if a bad nonce error is returned, but I don't think it makes sense to ever reuse a nonce. Transparently retrying a POST request if a 429 error is returned also doesn't seem that great.
Never mind that; I don't think that code is actually used by Lego. It just came up when I grep'd through all the downloaded Go modules in the container.
It seems like Lego's retrievablePost shouldn't do retries in this case, but perhaps it's bugged.
I have created an issue in lego.
Already closed…?
I must admit I don't fully understand the bug you filed. The problem appears to be that LEGO is presenting the same nonce more than once.
Are the logs that you included in that bug verbatim? Particularly the fact that you got the same exact error message, stating the same exact nonce, two requests apart? Because if yes, that's a serious Lego bug.
But if you were paraphrasing, and your logs don't actually show that, then we need to keep digging. And my own reading of the Lego code just now seems to indicate that it is doing the right thing with its nonces.
The log of the lego command is verbatim. But none of the nonces are repeated.