DNS-01 Challenge Failing

My domain is:
jellyoctomedia.com

I've been trying to use Caddy to reverse proxy. Unfortunately whenever I try I get this in the logs:

{"level":"error","ts":1719656534.6553514,"logger":"tls.obtain","msg":"will retry","error":"[jellyoctomedia.com] Obtain: [jellyoctomedia.com] solving challenges: presenting for challenge: could not determine zone for domain "_acme-challenge.jellyoctomedia.com": unexpected response code 'SERVFAIL' for _acme-challenge.jellyoctomedia.com. (order=https://acme-staging-v02.api.letsencrypt.org/acme/order/153444353/17488164413) (ca=https://acme-staging-v02.api.letsencrypt.org/directory)","attempt":5,"retrying_in":600,"elapsed":605.4238237,"max_duration":2592000}

I use Cloudflare as my provider. I've tried many different possible solutions, but none have helped so far. Unfortunately this isn't my area of expertise, and networking throws my head for a loop.

Frankly, I don't really understand Caddy's error message.

On the one hand, it says "could not determine zone for", which suggests it perhaps had some issues with the Cloudflare API or something like that.

On the other hand, it claims it got a "SERVFAIL" error from somewhere, which suggests a lookup for the _acme-challenge.jellyoctomedia.com hostname failed with that DNS error. Sometimes the Let's Encrypt validation error returns a SERVFAIL, but I can't reproduce that, and it's also not mentioned in the error message. Moreover, if you follow the provided order URL and check the authorizations of that order at https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/12956871693, you can clearly see all challenges in the "pending" state. Let's Encrypt did not even try to validate any of them, including the dns-01 challenge, so the SERVFAIL is not from the Let's Encrypt side.

So it's kinda weird to me, and Caddy is not helping debug this at all. At what stage did it encounter the SERVFAIL error? The CF API? Or something else?

Maybe @mholt, the developer of Caddy, can shed some light on this?

Unfortunately you've removed most of the questions from the questionnaire, including the one asking for the version of the ACME client you're using, which is Caddy in your situation. Of course we'd like to know which version of Caddy you're using. Maybe this is a bug which has been fixed already.

I must apologize about removing the ACME version. Like I said I've only recently started looking at networking and my head has been in a bit of a haze, especially with this in which debugging seems... tricky.

The version of Caddy I'm using is 2.8.4, with the Cloudflare DNS plugin. The only other question I missed that I can definitely answer is that I run it on Windows 11. I've made sure that the Windows firewall is configured properly and that the correct ports should have no issues.

Caddy looks up the authoritative zone for a domain so it can use the DNS provider API provided by the relevant 3rd party plugin, which takes the zone and list of zone-relative records as inputs, to set the TXT record needed for the ACME challenge.

If Caddy can't find the zone then you get "could not determine zone". The "SERVFAIL" means the DNS responder, whether local or remote, replied with the equivalent of an HTTP 500. This usually indicates a misconfigured DNS resolver, usually on the local machine or network.
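
To make "SERVFAIL" a little more concrete: in a DNS response it is simply response code 2 in the low four bits of the header flags word (RFC 1035). The following is a minimal sketch of decoding that field; the function and variable names are made up for illustration and are not taken from Caddy or CertMagic.

```go
package main

import "fmt"

// rcodeNames maps the 4-bit RCODE values defined in RFC 1035 to their
// conventional names. The map name is hypothetical, for illustration only.
var rcodeNames = map[uint16]string{
	0: "NOERROR",
	1: "FORMERR",
	2: "SERVFAIL",
	3: "NXDOMAIN",
	4: "NOTIMP",
	5: "REFUSED",
}

// rcodeName extracts the RCODE (the low 4 bits of the 16-bit header
// flags word) and returns its name.
func rcodeName(flags uint16) string {
	code := flags & 0x000F
	if name, ok := rcodeNames[code]; ok {
		return name
	}
	return fmt.Sprintf("RCODE%d", code)
}

func main() {
	// 0x8182 has QR, RD and RA set, with RCODE 2.
	fmt.Println(rcodeName(0x8182)) // SERVFAIL
	// 0x8180 is the same response header with RCODE 0.
	fmt.Println(rcodeName(0x8180)) // NOERROR
}
```

The code carries no detail beyond "server failure", which is why the message alone cannot say which resolver failed or why.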

@mholt Are there debugging logs so the user can actually pin-point where the problem originates? The current output saying "something went wrong somewhere, but I'm not telling you where exactly" doesn't really help IMO.

We already print the error we get from the remote. I'm not really an expert at debugging DNS, unfortunately. (It's beyond Caddy at this point.)

It might be helpful to know at what stage of the process the error originated. E.g., in this case it seems to be at the stage of determining which DNS plugin to use, right? (According to your previous post.) But for me, that's not apparent from the error message. For all I know, the error was from the Cloudflare API?

I agree. To get to that stage of the error, the local resolver had to at least resolve the Let's Encrypt API hostname (in this case the staging endpoint). So I wouldn't think it was a general problem with the local resolver.

@OctoL Without more clarity on which component is issuing that error we have to guess.

My guess is the error is coming from the Cloudflare API. Cloudflare itself is very reliable, so the query from the Let's Encrypt auth servers is not likely the one getting the SERVFAIL. And, as Osiris pointed out, the status of your cert request says it never reached the point where LE checks for that TXT record.

I suggest:
Review your Caddy Cloudflare config (see GitHub - caddy-dns/cloudflare: Caddy module: dns.providers.cloudflare).

If that doesn't resolve it, try posting at the Caddy Community help forum or maybe even the Cloudflare community.

Or even try using the HTTP Challenge in Caddy rather than Cloudflare, or the TLS-ALPN Challenge. Was there a particular reason you chose Cloudflare and the DNS Challenge? See Automatic HTTPS in the Caddy Documentation.

I don't have much expertise in Caddy. Maybe a different volunteer here will offer more specific advice.

It knows which plugin to use already, it's just trying to determine the inputs to it. A DNS provider plugin takes a zone and a list of records for that zone to perform the action on. Given only a FQDN in the config, a lookup has to be performed to determine the zone.

SERVFAIL is a DNS error message.
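
As a rough illustration of why a single failed lookup ends up as "could not determine zone", here is a minimal sketch of a zone search that walks up the labels of an FQDN. It is not CertMagic's actual code: the `soaLookup` type, the `healthyResolver` and `brokenResolver` stand-ins, and the exact error wording are all assumptions, and the real implementation issues SOA queries over the network rather than calling a fake.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// soaLookup is a hypothetical stand-in for a real SOA query. It reports
// whether name is a zone apex, or returns an error such as SERVFAIL.
type soaLookup func(name string) (isApex bool, err error)

// healthyResolver pretends only jellyoctomedia.com is a zone apex.
func healthyResolver(name string) (bool, error) {
	return name == "jellyoctomedia.com", nil
}

// brokenResolver fails every query, like a misbehaving local resolver.
func brokenResolver(name string) (bool, error) {
	return false, errors.New("unexpected response code 'SERVFAIL'")
}

// findZone strips the leftmost label until lookup reports a zone apex.
// Any lookup error aborts the search immediately.
func findZone(fqdn string, lookup soaLookup) (string, error) {
	name := strings.TrimSuffix(fqdn, ".")
	for name != "" {
		isApex, err := lookup(name)
		if err != nil {
			return "", fmt.Errorf("could not determine zone for domain %q: %v", fqdn, err)
		}
		if isApex {
			return name + ".", nil
		}
		_, rest, found := strings.Cut(name, ".")
		if !found {
			break
		}
		name = rest
	}
	return "", fmt.Errorf("could not determine zone for domain %q: no SOA found", fqdn)
}

func main() {
	zone, err := findZone("_acme-challenge.jellyoctomedia.com.", healthyResolver)
	fmt.Println(zone, err)

	zone, err = findZone("_acme-challenge.jellyoctomedia.com.", brokenResolver)
	fmt.Println(zone, err)
}
```

In this sketch the DNS provider plugin is never called when the resolver fails, which matches the explanation that the error arises before the Cloudflare API is involved.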

Actually -- we've seen this be a misconfiguration of a local resolver, most definitely.

This is common with split horizon DNS setups or really anything that messes with DNS (pihole, etc).

It is not coming from the Cloudflare API. SERVFAIL is a standard DNS error response. It's unfortunately so ambiguous, like HTTP 500, that there is a proposed standard to make more specific errors: RFC 8914: Extended DNS Errors

Sorry @OctoL -- but a lot of the other replies have wrong information. Unfortunately this is an error from the DNS resolver, either local or remote, and that is all the information it gives us.

We are quite aware of what SERVFAIL is. The question is which component is doing the DNS query resulting in that failure. Is it something in Caddy itself, or its Cloudflare module, or the Let's Encrypt auth server? I think we have ruled out LE.

Should they be able to reproduce the SERVFAIL with something like nslookup?

Why? I believe many Certbot plugins simply try the full name and, if that fails, remove the leftmost label and try again, until either the zone is found or nothing is left to try.

Also, which lookup was performed which produced the SERVFAIL message? To which DNS server? Does Go also use an equivalent to the glibc getaddrinfo() function? Would Go have the information to know which DNS server is faulty?

Et voilà, see here the confusion due to a lack of essential information in the error message. I still believe it would be helpful for the error message to mention at which stage the error was triggered.

I actually can't find the error message "presenting for challenge" or "solving challenges" in either the Caddy or CertMagic code.

It's not always prudent to make extra requests especially for APIs that have limits.

Perhaps. It would give us more information. It would need to be done from the exact same machine or container if relevant.

Since I actually can't find that error message in my code, I'm not even sure how I would go about improving it.

Usually, when presenting error messages, one presents some kind of "trace". If the error is produced by some plugin, the error message should reflect the fact that Caddy called a function in that plugin and that the plugin caused the error.

E.g.

"DNS plugin cloudflare produced an error when…: 'Lorem ipsum…'"

By the way, that specific string seems to be part of acmez:

In Go programs, errors are conventionally traced by prepending to the error messages:

if err != nil {
    return fmt.Errorf("doing something: %v", err)
}

Thus, you can search for "doing something" in the code and find out exactly which line of code produced the error, and by looking up a line you can see where the next part of the error came from.
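
To show how that convention builds up a message shaped like the one in this thread, here is a minimal sketch. All four function names are hypothetical and the prefixes are only modelled on the log line above; this is not the actual acmez or CertMagic call chain.

```go
package main

import (
	"errors"
	"fmt"
)

// determineZone stands in for the innermost step that actually fails.
func determineZone() error {
	return errors.New("unexpected response code 'SERVFAIL'")
}

// present wraps the inner error with its own context.
func present() error {
	if err := determineZone(); err != nil {
		return fmt.Errorf("could not determine zone: %v", err)
	}
	return nil
}

// solve adds the next prefix on the way up the call stack.
func solve() error {
	if err := present(); err != nil {
		return fmt.Errorf("presenting for challenge: %v", err)
	}
	return nil
}

// obtain is the outermost caller in this sketch.
func obtain() error {
	if err := solve(); err != nil {
		return fmt.Errorf("solving challenges: %v", err)
	}
	return nil
}

func main() {
	// Prints: solving challenges: presenting for challenge: could not determine zone: unexpected response code 'SERVFAIL'
	fmt.Println(obtain())
}
```

Read left to right, the prefixes go from the outermost caller to the innermost failure, so the rightmost part of the message is the one closest to the real cause.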

I'm pretty sure regular users don't want to dive into open source code to figure out what went wrong with the product...

But because we're volunteers on this great Community, let's dive deeper into the not-so-helpful error message. The next part seems to be coming from certmagic:

Which seems to be doing a SOA RR lookup:

Nothing wrong with OP's SOA RR:

osiris@erazer ~ $ dig _acme-challenge.jellyoctomedia.com SOA

; <<>> DiG 9.16.42 <<>> _acme-challenge.jellyoctomedia.com SOA
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11048
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;_acme-challenge.jellyoctomedia.com. IN	SOA

;; AUTHORITY SECTION:
jellyoctomedia.com.	1800	IN	SOA	destiny.ns.cloudflare.com. dns.cloudflare.com. 2345140976 10000 2400 604800 1800

;; Query time: 24 msec
;; SERVER: x.x.x.x#53(185.93.175.43)
;; WHEN: Sat Jun 29 18:07:05 CEST 2024
;; MSG SIZE  rcvd: 125

osiris@erazer ~ $ 

@OctoL Can you do the same SOA lookup as above on the computer you're running Caddy on? Not sure how that would work on Windows, probably using nslookup?

Yes, my Windows 11 has that. I think it was standard and not something I installed.

@OctoL Can you try this and show the result? If you are running Caddy inside a container or VM, you need to run it there.

nslookup -q=SOA _acme-challenge.jellyoctomedia.com

jellyoctomedia.com
        primary name server = destiny.ns.cloudflare.com
        responsible mail addr = dns.cloudflare.com
        serial  = 2345140976
        refresh = 10000 (2 hours 46 mins 40 secs)
        retry   = 2400 (40 mins)
        expire  = 604800 (7 days)
        default TTL = 1800 (30 mins)

Even when you think THAT would be easy. For some reason my Windows on that particular PC doesn't have it installed?

Edit: I can use Resolve-DnsName in PowerShell though, and that gives this:
