Error renewing certificate from LE: NS returned REFUSED for _acme-challenge

Your pcap hides the IPs.
I suspect there is an IPv4 or (more likely) IPv6 issue, where some of the IPs "work" for you and some don't.
Try testing each of the eight IPs individually:
dig +short NS your.domain @216.239.32.107
dig +short NS your.domain @2001:4860:4802:32::6b
...
[each IP for each of the four nameservers]
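
The per-IP test above can also be scripted. Below is a minimal sketch, assuming only the Python standard library: it hand-builds a non-recursive NS query and reports the response code from each server. The function names (`build_query`, `rcode_of`, `probe`) are made up for illustration, and the server list contains only the two addresses quoted above; the remaining addresses have to be filled in by hand.

```python
import socket
import struct

# Subset of DNS response codes, enough for this check.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 2, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query: one question, no EDNS, RD bit cleared.

    qtype 2 is NS, matching the dig commands above.
    """
    header = struct.pack("!HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server_ip: str, name: str, timeout: float = 3.0) -> str:
    """Send one UDP query to server_ip and return the rcode or a failure note."""
    family = socket.AF_INET6 if ":" in server_ip else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name), (server_ip, 53))
            data, _ = sock.recvfrom(512)
    except OSError as exc:  # timeouts and unreachable networks land here
        return f"no reply ({exc})"
    return rcode_of(data)


# Example (needs network access):
#   for ip in ["216.239.32.107", "2001:4860:4802:32::6b"]:
#       print(ip, probe(ip, "your.domain"))
```

A timeout on the IPv6 addresses together with an answer on the IPv4 ones would point at the firewall; a REFUSED on any of them would reproduce the error.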

1 Like

@rg305 IPv6 traffic isn't allowed by my firewall rules, but does that even matter if IPv4 works?

If it is trying an IPv6 address...
And the firewall is returning the negative response "Refused", then we might have found the problem.

1 Like

I doubt that is the issue.
This is what LE specifications say about IPv6:

When making outbound domain validation requests for a domain that has both IPv4 and IPv6 addresses (e.g. both A and AAAA records) Let’s Encrypt will always prefer the IPv6 addresses for the initial connection. If the IPv6 connection fails at the network level (e.g. there is a timeout) and there are IPv4 addresses available then we will retry the request with one of the IPv4 addresses.

So even if IPv6 fails, the validation should be successful in any case.
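
To make the quoted behaviour concrete, here is a rough sketch of "prefer IPv6, fall back to IPv4 on a network-level failure". It is only an illustration of the quoted paragraph, not Let's Encrypt's actual code; `connect_prefer_ipv6` and the `try_connect` callback are hypothetical names, and the addresses in the usage note are documentation-range placeholders.

```python
import ipaddress
from typing import Callable, Iterable


def connect_prefer_ipv6(
    addresses: Iterable[str], try_connect: Callable[[str], None]
) -> str:
    """Return the address that was used.

    try_connect(addr) is assumed to raise OSError (e.g. TimeoutError)
    on a network-level failure and return normally on success.
    """
    parsed = [ipaddress.ip_address(a) for a in addresses]
    v6 = [str(a) for a in parsed if a.version == 6]
    v4 = [str(a) for a in parsed if a.version == 4]
    if not v6 and not v4:
        raise ValueError("no addresses to try")
    if v6:
        try:
            try_connect(v6[0])  # the initial attempt always goes to IPv6
            return v6[0]
        except OSError:
            if not v4:
                raise  # nothing to fall back to
    try_connect(v4[0])  # single retry over IPv4
    return v4[0]
```

For example, with a `try_connect` that times out on every IPv6 address, `connect_prefer_ipv6(["192.0.2.1", "2001:db8::1"], try_connect)` ends up returning `"192.0.2.1"`.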

Looking at the pcap file, I can clearly see that the packet that gets the "Refused" is an IPv4 one.
The "Refused" is within the DNS response from the Google nameserver, it's not a "Refused" as in dropped packet.
Below the pcap packet details:

Request:

Frame 151: 94 bytes on wire (752 bits), 94 bytes captured (752 bits)
Ethernet II, Src: 9a:78:bc:f4:23:2b (9a:78:bc:f4:23:2b), Dst: 72:b9:75:24:91:70 (72:b9:75:24:91:70)
    Destination: 72:b9:75:24:91:70 (72:b9:75:24:91:70)
    Source: 9a:78:bc:f4:23:2b (9a:78:bc:f4:23:2b)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.10.1.216, Dst: 216.239.34.107
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
        0000 00.. = Differentiated Services Codepoint: Default (0)
        .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    Total Length: 80
    Identification: 0xd24a (53834)
    Flags: 0x40, Don't fragment
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 63
    Protocol: UDP (17)
    Header Checksum: 0x6216 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 10.10.1.216
    Destination Address: 216.239.34.107
User Datagram Protocol, Src Port: 37620, Dst Port: 53
    Source Port: 37620
    Destination Port: 53
    Length: 60
    Checksum: 0x078a [unverified]
    [Checksum Status: Unverified]
    [Stream index: 75]
    [Timestamps]
    UDP payload (52 bytes)
Domain Name System (query)
    Transaction ID: 0xb4a1
    Flags: 0x0000 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...0 .... .... = Recursion desired: Don't do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        _acme-challenge.mydomain.com: type TXT, class IN
    Additional records
    [Response In: 152]

Response:

Frame 152: 83 bytes on wire (664 bits), 83 bytes captured (664 bits)
Ethernet II, Src: 72:b9:75:24:91:70 (72:b9:75:24:91:70), Dst: 9a:78:bc:f4:23:2b (9a:78:bc:f4:23:2b)
    Destination: 9a:78:bc:f4:23:2b (9a:78:bc:f4:23:2b)
    Source: 72:b9:75:24:91:70 (72:b9:75:24:91:70)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 216.239.34.107, Dst: 10.10.1.216
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
        0000 00.. = Differentiated Services Codepoint: Default (0)
        .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    Total Length: 69
    Identification: 0xaeda (44762)
    Flags: 0x40, Don't fragment
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    ...0 0000 0000 0000 = Fragment Offset: 0
    Time to Live: 63
    Protocol: UDP (17)
    Header Checksum: 0x8591 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 216.239.34.107
    Destination Address: 10.10.1.216
User Datagram Protocol, Src Port: 53, Dst Port: 37620
    Source Port: 53
    Destination Port: 37620
    Length: 49
    Checksum: 0x9a18 [unverified]
    [Checksum Status: Unverified]
    [Stream index: 75]
    [Timestamps]
    UDP payload (41 bytes)
Domain Name System (response)
    Transaction ID: 0xb4a1
    Flags: 0x8085 Standard query response, Refused
        1... .... .... .... = Response: Message is a response
        .000 0... .... .... = Opcode: Standard query (0)
        .... .0.. .... .... = Authoritative: Server is not an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...0 .... .... = Recursion desired: Don't do query recursively
        .... .... 1... .... = Recursion available: Server can do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 0101 = Reply code: Refused (5)
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 0
    Queries
        _acme-challenge.mydomain.com: type TXT, class IN
    [Request In: 151]
    [Time: 0.000398000 seconds]
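
For anyone who wants to double-check the dissection, the 16-bit flags value can be decoded by hand. This is a small sketch following the header layout in RFC 1035 (with the AD/CD bits from the later DNSSEC RFCs); `decode_dns_flags` is just an illustrative helper name.

```python
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def decode_dns_flags(flags: int) -> dict:
    """Split the 16-bit DNS header flags field into its individual bits."""
    rcode = flags & 0xF
    return {
        "qr": (flags >> 15) & 1,       # 1 = response, 0 = query
        "opcode": (flags >> 11) & 0xF,  # 0 = standard query
        "aa": (flags >> 10) & 1,       # authoritative answer
        "tc": (flags >> 9) & 1,        # truncated
        "rd": (flags >> 8) & 1,        # recursion desired
        "ra": (flags >> 7) & 1,        # recursion available
        "ad": (flags >> 5) & 1,        # authenticated data
        "cd": (flags >> 4) & 1,        # checking disabled
        "rcode": RCODE_NAMES.get(rcode, str(rcode)),
    }


print(decode_dns_flags(0x8085))
```

For 0x8085 this gives qr=1, aa=0, rd=0, ra=1 and rcode REFUSED, which matches the dissection above: a response that is not marked authoritative, advertises recursion, and refuses the query.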

Can this be caused by the following error?

urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce

I have the feeling we dismissed that too quickly as a possible root cause.

1 Like

Make sure that you're actually running a recent Traefik version. Based on your other thread on the Traefik community (Error renewing certificate from LE: NS returned REFUSED for _acme-challenge - Traefik v2 (latest) - Traefik Labs Community Forum) I see that you're indeed using image: traefik:latest, but the local image may be stale. Very old Traefik versions (i.e. < 1.2) seem to lack correct retry logic on nonce errors. A GitHub search does not reveal any unusual spike of issues related to nonces in recent versions.
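
For context, "correct retry logic" on a nonce error roughly means: when the server answers 400 with badNonce, take the fresh nonce it returns (RFC 8555 has the server send one in the Replay-Nonce header of the error response) and resend the request with it. A minimal sketch of that loop, not Lego's actual implementation; `post_with_nonce_retry`, the `send` callback and the tuple shape of the response are assumptions made for illustration:

```python
from typing import Callable, Optional, Tuple

BAD_NONCE = "urn:ietf:params:acme:error:badNonce"

# Hypothetical response shape: (HTTP status, ACME error type or None, Replay-Nonce value)
Response = Tuple[int, Optional[str], str]


def post_with_nonce_retry(
    send: Callable[[str], Response], nonce: str, max_retries: int = 3
) -> Response:
    """Call send(nonce); on a badNonce error, retry with the nonce the server returned."""
    response = send(nonce)
    for _ in range(max_retries):
        status, error_type, fresh_nonce = response
        if not (status == 400 and error_type == BAD_NONCE):
            break  # success, or an error that a new nonce will not fix
        response = send(fresh_nonce)
    return response
```

With logic like this in place, a badNonce shows up as a single "retry due to" line followed by a normal result, so on its own it should be harmless.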

Based on other people's experiences with similar problems (Problem renewing the cert with DNS-01 challenge - Traefik v2 (latest) - Traefik Labs Community Forum, Constellix DNS-01 challenge not working · Issue #1188 · go-acme/lego · GitHub) I suspect something is wonky with Docker + DNS.

You can test this by skipping Traefik/Lego's internal challenge validation (disablePropagationCheck = true). This is really only a hack and not a viable solution, but it would at least give you a second network perspective (from LE) on the situation. Make sure to add a manual delay when doing this, as the records very likely aren't going to be there right away. Specify a delay of at least 2 * TTL (I think delayBeforeCheck does this? Or is this a no-op with propagation check disabled?) before skipping the propagation check.

3 Likes

Hi @Nummer378,
Traefik's version is:

time="2022-03-21T17:56:40Z" level=info msg="Traefik version 2.6.1 built on 2022-02-14T16:50:25Z"

I tried with the options you mentioned and this is what happens:

time="2022-03-23T11:06:17Z" level=debug msg="legolog: [INFO] [*.mydomain.com] acme: use dns-01 solver"
time="2022-03-23T11:06:17Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Could not find solver for: tls-alpn-01"
time="2022-03-23T11:06:17Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Could not find solver for: http-01"
time="2022-03-23T11:06:17Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: use dns-01 solver"
time="2022-03-23T11:06:17Z" level=debug msg="legolog: [INFO] [*.mydomain.com] acme: Preparing to solve DNS-01"
time="2022-03-23T11:06:18Z" level=debug msg="legolog: change (Create): {\"additions\":[{\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:19Z" level=debug msg="legolog: [INFO] Wait for apply change [timeout: 30s, interval: 3s]"
time="2022-03-23T11:06:19Z" level=debug msg="legolog: change (Get): {\"additions\":[{\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:22Z" level=debug msg="legolog: change (Get): {\"additions\":[{\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:22Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Preparing to solve DNS-01"
time="2022-03-23T11:06:23Z" level=debug msg="legolog: change (Create): {\"deletions\":[{\"kind\":\"dns#resourceRecordSet\",\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:23Z" level=debug msg="legolog: [INFO] Wait for apply change [timeout: 30s, interval: 3s]"
time="2022-03-23T11:06:23Z" level=debug msg="legolog: change (Get): {\"deletions\":[{\"kind\":\"dns#resourceRecordSet\",\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:23Z" level=debug msg="legolog: change (Create): {\"additions\":[{\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"KkbMgl-3zo1id6ByIuvmuXGZzU6ZcjXxhXIFy1Q-c-k\",\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:24Z" level=debug msg="legolog: [INFO] Wait for apply change [timeout: 30s, interval: 3s]"
time="2022-03-23T11:06:24Z" level=debug msg="legolog: change (Get): {\"additions\":[{\"name\":\"_acme-challenge.mydomain.com.\",\"rrdatas\":[\"KkbMgl-3zo1id6ByIuvmuXGZzU6ZcjXxhXIFy1Q-c-k\",\"acUH70DPvVbQCSzxvSk_UO0O3EaZRGGAwfKNS_oko7k\"],\"ttl\":120,\"type\":\"TXT\"}]}"
time="2022-03-23T11:06:24Z" level=debug msg="legolog: [INFO] [*.mydomain.com] acme: Trying to solve DNS-01"
time="2022-03-23T11:06:24Z" level=debug msg="legolog: [INFO] [*.mydomain.com] acme: Checking DNS record propagation using [1.1.1.1:53 8.8.8.8:53]"
time="2022-03-23T11:06:29Z" level=debug msg="legolog: [INFO] Wait for propagation [timeout: 3m0s, interval: 5s]"
time="2022-03-23T11:06:29Z" level=debug msg="Delaying 240000000000 rather than validating DNS propagation now." providerName=googleresolver.acme
time="2022-03-23T11:10:29Z" level=debug msg="legolog: [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/chall-v3/90613102090/IC3ZSg :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: \"0001i1WIfXE5pcJ3BvUXl--C5_jdCSjTmjdIYg_SE9q1HOA\""
time="2022-03-23T11:10:35Z" level=debug msg="legolog: [INFO] [*.mydomain.com] The server validated our request"
time="2022-03-23T11:10:35Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Trying to solve DNS-01"
time="2022-03-23T11:10:35Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Checking DNS record propagation using [1.1.1.1:53 8.8.8.8:53]"
time="2022-03-23T11:10:40Z" level=debug msg="Delaying 240000000000 rather than validating DNS propagation now." providerName=googleresolver.acme
time="2022-03-23T11:10:40Z" level=debug msg="legolog: [INFO] Wait for propagation [timeout: 3m0s, interval: 5s]"
time="2022-03-23T11:14:40Z" level=debug msg="legolog: [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/chall-v3/90613102100/BpYDlg :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: \"0002WHwIciFVvaiQLZodLQuQ6CqUkWYeHicrEjDLmCb19L4\""
time="2022-03-23T11:14:48Z" level=debug msg="legolog: [INFO] [mydomain.com] The server validated our request"
time="2022-03-23T11:14:48Z" level=debug msg="legolog: [INFO] [*.mydomain.com] acme: Cleaning DNS-01 challenge"
time="2022-03-23T11:14:49Z" level=debug msg="legolog: [INFO] [mydomain.com] acme: Cleaning DNS-01 challenge"
time="2022-03-23T11:14:49Z" level=debug msg="legolog: [INFO] [mydomain.com, *.mydomain.com] acme: Validations succeeded; requesting certificates"
time="2022-03-23T11:14:51Z" level=debug msg="legolog: [INFO] [mydomain.com] Server responded with a certificate."
time="2022-03-23T11:14:51Z" level=debug msg="Certificates obtained for domains [mydomain.com *.mydomain.com]" providerName=googleresolver.acme ACME CA="https://acme-v02.api.letsencrypt.org/directory"
time="2022-03-23T11:14:51Z" level=debug msg="Configuration received from provider googleresolver.acme: {\"http\":{},\"tls\":{}}" providerName=googleresolver.acme
time="2022-03-23T11:14:51Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-03-23T11:14:51Z" level=debug msg="Adding certificate for domain(s) mydomain.com,*.mydomain.com"
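
As a sanity check on the delay in the log above: the value in "Delaying 240000000000" looks like a Go duration printed as raw nanoseconds (Go's time.Duration counts nanoseconds), which would be four minutes, i.e. twice the 120-second TTL of the TXT records. A quick sketch of that arithmetic; the helper name is made up:

```python
from datetime import timedelta


def go_duration_to_timedelta(nanoseconds: int) -> timedelta:
    """Convert a raw nanosecond count (assumed Go time.Duration) to a timedelta."""
    return timedelta(microseconds=nanoseconds // 1000)


logged_delay = go_duration_to_timedelta(240_000_000_000)  # value from the log
record_ttl = timedelta(seconds=120)                       # "ttl":120 in the change sets

print(logged_delay)                    # 0:04:00
print(logged_delay == 2 * record_ttl)  # True
```

That is consistent with the timestamps: the delay starts at 11:06:29 and the next log line appears at 11:10:29.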

Traefik's dashboard and the internal Docker containers now have a valid certificate, and so do some external hosts. Others report a wrong certificate error, but that could be a configuration issue.
So, what did we learn from this test?

2 Likes

Since you successfully got certificates, the DNS-01 challenge did succeed.

Therefore the issue must be with the way Traefik/Lego/Docker/DNS are interacting with each other, which causes seemingly bad DNS results to be returned, even though everything is fine.

I'm wondering if this may be related to some DDoS protection? Traefik/Lego seems to hammer the authoritative servers rather frequently (which makes me wonder why it even requests a recursive resolver?) which might trigger DDoS protection on Google's nameservers, causing the servers to return REFUSED? (On second thought: Doesn't sound plausible)

3 Likes

In your opinion, what's the best way forward now?
Should I raise an issue in the lego repository for this?

That's up to you, but yeah if you want to do that go ahead. As an intermediate solution, you can keep the workaround to skip the internal challenge validation. This should be fine until you can figure out why Lego doesn't behave as expected.

4 Likes

That doesn't matter because you didn't have any AAAA records last time I checked.

1 Like

Your response is all about incoming IPv6 requests (from LE).
The problem shown ("Refused") was with outbound DNS requests (from Traefik).

You seem to miss(understand) my point.

1 Like

@rg305 I understood your point, but even if the outgoing IPv6 packet gets dropped isn't the client supposed to fall back on IPv4?
IPv6 could fail for any number of reasons, from the OS not supporting it, to the network not allowing the traffic, to the AAAA record being missing (ok, in this last case the client would get a NOERROR response and it would gracefully terminate the connection). Maybe I'm wrong, but I would assume the client doesn't require an IPv6 DNS response at all costs in order to fall back to IPv4, also because both lookups basically get triggered at the same time.

No, NXDOMAIN means there are no records at all on this label or any below. You'd get NOERROR.

1 Like

@9peppe yes you are right, it would be a NOERROR

1 Like

There was no "drop"; there was a "Refused" response.
[which I can only presume did NOT come from the actual DNS server queried]

1 Like

If we're talking about IPv6, the outgoing query got dropped by my firewall, that I'm 100% sure of, so it never left my network.
The "Refused" came in response to the IPv4 query and it was sent by Google's DNS. That is visible in the pcap I pasted in post #26, where it shows:

Internet Protocol Version 4, Src: 216.239.34.107, Dst: 10.10.1.216

And

Flags: 0x8085 Standard query response, Refused

Hello, I'm getting the same issue with Traefik + acme-dns both running in same Docker network...

Setting disablePropagationCheck = true on Traefik did the trick, but as said above and in the documentation, this shouldn't need to be set for things to work...

1 Like

This is confusing, as that IP (216.239.34.107) returns the expected response for your DNS zone.
This implies there is a "routing" issue or a MiTM issue.

1 Like

I agree it is confusing, but MiTM is very unlikely. Also, if that was the case I would see something unusual in the pcap.
A routing issue would be visible in the pcap file too.
I suspect it's just a bug in the lego client implementation; otherwise I can't explain why the client keeps waiting for propagation of the records when the propagation has already occurred.

It's worse than that.
It sends a DNS request and receives a DNS reply.
One that doesn't match the reality of what should be replied.
["Refused" is a DNS reply (from a DNS speaking system)]

1 Like