Cannot renew anymore, DNS problem

My domain is: flowm.daemon.contact

I ran this command: certbot renew

It produced this output: DNS problem: query timed out looking up A for flowm.daemon.contact; DNS problem: query timed out looking up AAAA for flowm.daemon.contact

My web server is (include version): n/a (we don't get that far)

The operating system my web server runs on is (include version): n/a

My hosting provider, if applicable, is: n/a

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 1.22.0

Hija,
I don't currently see what is wrong with the DNS. I can see DNS queries coming in and getting answered promptly.
I tried to test DNS with online tools but got the error "invalid domain name" (whatever that might mean). But Verisign and IPVOID say the DNSSEC is fine, all green.

DNSViz does present an error (the warnings are probably not relevant): flowm.daemon.contact | DNSViz.

Not sure if these errors are also the issue here, but they could be. What I did notice is how SLOW the analysis was! Painstakingly slow! I can't reproduce that slowness using dig +trace, but it makes me wonder if there are some nameservers in the chain which aren't very timely with their answers. Note that this doesn't have to be your DNS server; it could also be one of the .contact TLD nameservers. I just don't know.

Also strange is that Unboundtest seems to be quite happy with your hostname: https://unboundtest.com/m/A/flowm.daemon.contact/4TGC6DY3

Note that, according to the RFCs, a hostname may not have a CNAME together with other RRs. That is probably not relevant in this case, though.
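The CNAME rule mentioned here can be checked mechanically. Below is a minimal sketch, assuming the zone is already available as a list of (owner, type) pairs; the helper name `cname_conflicts` is hypothetical and not part of any DNS tool. RRSIG and NSEC are treated as permitted next to a CNAME, as is the case in DNSSEC-signed zones.

```python
from collections import defaultdict

# Record types that may legitimately sit next to a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Return {owner: [conflicting types]} for names holding a CNAME plus other data.

    `records` is an iterable of (owner, rrtype) pairs (hypothetical input format).
    """
    types_by_owner = defaultdict(set)
    for owner, rrtype in records:
        # DNS names compare case-insensitively; ignore a trailing dot.
        types_by_owner[owner.lower().rstrip(".")].add(rrtype.upper())

    conflicts = {}
    for owner, types in types_by_owner.items():
        if "CNAME" in types:
            extra = types - ALLOWED_WITH_CNAME
            if extra:
                conflicts[owner] = sorted(extra)
    return conflicts
```

A zone where flowm only carries a CNAME and flag carries the A record yields an empty result; adding an A record directly on flowm next to the CNAME would be reported.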

Hi, thanks,
now that's getting difficult...

I see. They say the error is that there is no answer over UDP. But I cannot find a place from where I could reproduce this. :frowning:
From any place I try, I get the answers over UDP just fine. (Obviously, when I ask for DNSKEY or other cipher stuff, I only get a truncated message over UDP, because it doesn't fit into a single packet. AFAIK that is the correct behaviour; the client should then implicitly switch to TCP.)
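The truncation behaviour described here hinges on a single bit in the DNS header. As a minimal sketch (the function name `is_truncated` is made up for illustration), a client can decide whether to retry over TCP by looking at the TC flag in the raw UDP response:

```python
import struct

TC_BIT = 0x0200  # "TrunCated" flag inside the 16-bit DNS header flags field


def is_truncated(message: bytes) -> bool:
    """Return True if a raw DNS message has the TC flag set.

    A client that sees TC on a UDP response is expected to repeat the
    same query over TCP.
    """
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # The header starts with the 16-bit message ID followed by the 16-bit flags.
    _msg_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_BIT)
```

This only inspects the header; it does not parse the question or answer sections.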

You can find out: just do a geolocation on the nameservers.

The Warnings are interesting, because I didn't know that. I deliberately chose the most extensive algorithm and the most challenging location, because I want to see where and why it fails if it fails. And I want to see that now, and not in two years when another DNSSEC KSK gets added to the cycle and then things suddenly fail. Or such.
(You know the Internet was originally designed for rugged military use, not for superfast-cloudflare-streaming-video-crap. The standards still hold to that, but if services don't comply with them any longer, then there is a problem.)

One cannot rely on these webtools. Maybe these guys have not updated their software since the vanity-TLDs came into play; I don't know.

Yes and there shouldn't be any. Did You find one? Then that is a mistake of mine.

Question would be: does Let's Encrypt work with a pure CNAME record? (I think it should, because lots of webserver virtual-hostname identities are just aliases to a single real machine name. And "flowm" should work in exactly that fashion.)

Unboundtest is developed by one of the developers of Let's Encrypt and part of the team developing the software used by Let's Encrypt (Boulder) :wink: It's supposed to mirror the behaviour of the Unbound-instances used by Boulder, although it's always possible slight differences might appear.

See the Unboundtest link above; it clearly shows an A RR as well as a CNAME RR.

Boulder does follow CNAME RRs, yes. Shouldn't be an issue.

Yeah, but that's the one that works and recognizes the domain, right?
And that link doesn't look like the usual web services, anyway. Looks rather like my system logs - I like it. :slight_smile:

Ah, didn't know that. So we might have a chance to see the error failing the verification in these logs also, right?

Sorry, I don't find it. Not in the zonefile, and not in that link:

;; ANSWER SECTION:
flowm.daemon.contact.	0	IN	CNAME	flag.daemon.contact.
flag.daemon.contact.	0	IN	A	51.158.21.23

"flowm" points to "flag", that's how it is intended.

I found a bunch of other possible issues.

  • I get a mass of AAAA requests - but IPv6 is not yet implemented on these public nameservers. OTOH, certbot runs in an infrastructure that does already use IPv6 and will likely connect outwards per IPv6.
    Then the query for flowm.daemon.contact. IN AAAA is answered with a CNAME. Does it then correctly unravel to the A record from there?

  • I get a couple of queries for CAA records. I don't have these. Should I?

  • One of the nameserver machines is sometimes not receiving data from here. When I try to reach this community from there, I get filtered replies. There are no errors, only the content is removed from the webpages, just like with russian webpages.
    Strangely this here is the only webpage that showed this problem, others do work normally.

Uch, my bad, my brains didn't recognize the difference, I guess it stopped reading after "fl.." :frowning_face: My apologies.

It'll try both A as well as AAAA. If an AAAA record doesn't exist, this should not be an issue.

They're not mandatory.

Oh sorry, this is indeed misleading, yes, indeed.
It just happened to grow that way - the pole and the flag are real nodes, and flowm is the name of a software.

It does only partially look like that.
It definitely honors the TrunCated flag and retries with TCP, so this is probably not the issue. But only a few origins do actually ask for an A record (while all do ask for AAAA):

3.16.166.167   | pole   | TCP   | NOERROR | cd      | FLOwM.daEMon.CoNTacT. IN A
3.16.166.167   | wand   | TCP   | NOERROR | cd      | FLAG.DaeMoN.cOnTAct. IN A
3.21.98.80     -> did not ask for it
3.72.17.85     -> did not ask for it
3.145.128.161  -> did not ask for it
3.145.213.155  -> did not ask for it
18.192.120.162 | pole   | TCP   | NOERROR | cd      | FlAG.DAEmOn.cOntacT. IN A
18.192.120.162 | wand   | TCP   | NOERROR | cd      | flOWM.dAemon.coNTact. IN A
34.209.250.20  -> did not ask for it
34.215.243.254 | pole   | TCP   | NOERROR | cd      | flAg.dAemON.ContACt. IN A
34.215.243.254 | wand   | TCP   | NOERROR | cd      | Flowm.DaEMOn.CONtacT. IN A
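To repeat this tally on the full log, a small parser can help. This is a sketch under assumptions: the column meaning (source IP, answering nameserver, protocol, rcode, flags, question) is inferred from the excerpt above, and `resolvers_by_a_query` is a hypothetical helper. Query names are folded to lower case, because the mixed-case spellings look like query-name case randomization and refer to the same names.

```python
def resolvers_by_a_query(log_text):
    """Split source IPs into those that asked for an A record and those that did not.

    Returns (asked, silent): `asked` maps source IP to the set of lower-cased
    query names, `silent` is the set of IPs marked as not having asked.
    """
    asked, silent = {}, set()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "->" in line:
            # e.g. "3.21.98.80     -> did not ask for it"
            silent.add(line.split("->", 1)[0].strip())
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 6:
            continue  # not a query row in the assumed layout
        source, _server, _proto, _rcode, _flags, question = fields
        parts = question.split()
        if len(parts) != 3:
            continue
        qname, _qclass, qtype = parts
        if qtype.upper() == "A":
            # FLOwM.daEMon.CoNTacT. and flowm.daemon.contact. are the same name.
            asked.setdefault(source, set()).add(qname.lower())
    return asked, silent - set(asked)
```

Feeding it the excerpt above should list 3.16.166.167, 18.192.120.162 and 34.215.243.254 as having asked for both flowm and flag, and the remaining addresses as silent.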

Full log attached:
dnstap.txt (30.9 KB)

[edit]
It doesn't actually look like a "timeout" to me. In the log we can see for all queries at first a request appearing via UDP, getting TC reply, and then a request via TCP. For those servers not asking for an A record, we see nothing at all there. If there were indeed timeouts, they would appear somewhere in the process.

But these IP addresses did ask for an AAAA RR? Weird. Personally, I don't know what's going on, to be honest. Maybe someone else does. If no one does, we might ask the LE staff for help; maybe they know something we don't.

Trying dig -t AAAA flowm.daemon.contact@pole.daemon.contact I got one SERVFAIL, then just NXDOMAIN. I'm not sure whether I'm trying the wrong nameserver, but an intermittent SERVFAIL sounds like a server needs to be restarted.

dig -t NS flowm.daemon.contact returns a CNAME, and I was expecting a nameserver or two.

Well, requesting flowm.daemon.contact\@pole.daemon.contact. IN AAAA is probably resulting in a NXDOMAIN for a lot of servers too :stuck_out_tongue: Probably needs a space before the @?

When I "hammer" the servers for a little bit, most of the time I'm getting a response immediately, but often it's also a little bit slow in the order of multiple seconds (about 5). Not sure if that's long enough to cause a timeout though.

Absolutely yes. And DNSKEY and (some) CAA.

Okay. I do now see the logs from yesterday, they look the same.

I will now start a test suite. I would like to see what happens when I switch off the second nameserver. I would also like to see the reaction without DNSSEC, but then, my whole intranet is attached below this domain and will fall apart when I delete the RRSIG records, and it will become an elaborate training in disaster recovery...

Hi. This looks like a syntax issue. (dig doesn't care about these and just produces the literal answer. That's the philosophy of the ISC people. :wink: )
Try this one:
dig -t NS @pole.daemon.contact flowm.daemon.contact

This is AFAIK as it should be. As @Osiris mentioned further up, the CNAME should be the only record for the name. In this case, it returns the CNAME, and it returns the SOA (to be used for further queries).
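For scripted tests, the syntax pitfall from the earlier command (the @server glued onto the name) can be avoided by building the argument list explicitly. A minimal sketch, assuming dig is invoked via an argument vector; `dig_argv` is a hypothetical helper, and only the `-t` option and the `@server` argument from the command above are used:

```python
def dig_argv(name, rrtype, server=None):
    """Build an argument list for dig with the server as its own argument."""
    argv = ["dig", "-t", rrtype]
    if server:
        # The @server part must be a separate argument; glued to the name it
        # would be treated as part of the query name.
        argv.append("@" + server)
    argv.append(name)
    return argv
```

The resulting list can be handed to `subprocess.run(...)` as-is, which sidesteps shell quoting as well.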

Sadly, I cannot test the issue, because the test environment works.

My DNS is apparently only broken for the Let's Encrypt production environment, not for the staging environment.

[edit]
This is reproducible. The production env has always failed. The staging env gives repeated success. (It failed once, but such has happened earlier, too.)

The next step therefore should be to look into the design differences of the DNS lookup between your staging and production environment. But I cannot do this. :frowning:

Some more findings:

  • In the staging environment the validation takes 11 seconds and returns successful.
    In the production environment the validation does always take 31 seconds and reports DNS problem: query timed out ...

  • I tried switching off one of the nameservers. (Things are supposed to work nevertheless, nameservers are redundant only for failsafety, since they can fail.) No matter which one I switch off, the result is always the same: DNS problem: query timed out ...

  • I switched off DNSSEC entirely. That is, I quickly swapped in the unsigned raw zonefile, restarted the nameservers, ran the certbot renew, swapped the zonefiles back and restarted the nameservers.
    Now the answers were much smaller, and queries did not need to repeat per TCP - OTOH this would look just like a MitM attack, and might/should fail for a couple of reasons. Anyway, the result was just the same: DNS problem: query timed out ...

  • Finally, I added the AAAA record for the name. This doesn't help either, the error is still DNS problem: query timed out looking up A for flowm.daemon.contact; DNS problem: query timed out looking up AAAA for flowm.daemon.contact

I got one step further:

  • I changed the CNAME into an A record. Now the error message changes: DNS problem: query timed out looking up CAA for flowm.daemon.contact

  • So I added these CAA records, one for the delegation point and one for flowm.daemon.contact. Then I got this error: DNS problem: query timed out looking up A for flowm.daemon.contact; no valid AAAA records found for flowm.daemon.contact
    This is okay, I had removed the AAAA record again, because it points to nowhere.

  • So I added the AAAA record back into the zonefile. And then, finally, I got this message:

        "type": "urn:ietf:params:acme:error:connection",
        "detail": "2001:bc8:32d7:135::2a: Fetching http://flowm.daemon.contact/.well-known/acme-challenge/tI5_jV5R2Jc5BfwGeAfRgwuUp6Ct3UhY6Qy3k_7L5d8: Timeout during connect (likely firewall problem)",

This is now alright: it tries to connect to an IPv6 address, and it is indeed the address I had edited into the zonefile! This address is not wired, not routed, not connected, not enabled, no nothing yet. So it cannot work, and the timeout appears to be the correct diagnosis.

So this is still no success, but it is now an expected and understandable issue.

Further finding:
at 31.05.2022 23:03:51 CEST validation resulted in failure.
My webserver logs show the following:

2022-05-31 23:03:53.69+02   200    ec2-18-191-108-177.us-east-2.compute.amazonaws.com  /.well-known/acme-challenge/k9IMoAxh0yQwbn0RmbU_c1Yye3_Hhku9mxYTfCzDFz8  Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)
2022-05-31 23:03:55.173+02  200    ec2-3-67-70-80.eu-central-1.compute.amazonaws.com  /.well-known/acme-challenge/k9IMoAxh0yQwbn0RmbU_c1Yye3_Hhku9mxYTfCzDFz8  Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)
2022-05-31 23:04:04.557+02  200    ec2-54-245-188-205.us-west-2.compute.amazonaws.com  /.well-known/acme-challenge/k9IMoAxh0yQwbn0RmbU_c1Yye3_Hhku9mxYTfCzDFz8  Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)

All three downloads were successful (code 200).
At 2022-05-31 23:04:22,070 however, the error message says that the host could not be accessed due to DNS failure:

        "type": "urn:ietf:params:acme:error:dns",
        "detail": "DNS problem: query timed out looking up A for flowm.daemon.contact; no valid AAAA records found for flowm.daemon.contact",

Did the 200s return a redirection?

I only get:
curl: (56) Recv failure: Connection reset by peer

Yes, but the production system currently checks from 4 locations. You saw the 3 servers in AWS, but LE currently also bases one server pool at Flexential (Colorado).

No they don't. They're straight file read.

I don't know what You tried, but there is currently nothing You could read. You should get 403 or 404.

Alias "/.well-known" "/ext/www/.well-known"
<Directory "/ext/www/.well-known/acme-challenge">
    Options None
    AllowOverride None
    Require all granted
</Directory>

I throw something in, so You can read something:

$ echo "Hello!" >  /ext/www/.well-known/acme-challenge/.hello
$ fetch http://flowm.daemon.contact/.well-known/acme-challenge/.hello
.hello                                                   7  B   53 kBps    00s
$ cat .hello
Hello!
$

It's an IPv6 problem:

curl -Ii6 http://flowm.daemon.contact/.well-known/acme-challenge/.hello
curl: (56) Recv failure: Connection reset by peer

curl -Ii4 http://flowm.daemon.contact/.well-known/acme-challenge/.hello
HTTP/1.1 200 OK
Date: Sat, 04 Jun 2022 02:38:55 GMT
Server: Apache/2.4.53 (FreeBSD)
Last-Modified: Sat, 04 Jun 2022 02:34:29 GMT
ETag: "7-5e096140a1437"
Accept-Ranges: bytes
Content-Length: 7

IPv4 works.
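The same comparison can be scripted to check both address families in one go. This is a rough sketch using only the Python standard library; `http_head_status` is a hypothetical helper, and it sends a bare HEAD request over a socket pinned to one address family, roughly what curl -I4 / -I6 do above.

```python
import socket


def http_head_status(host, path, family, port=80, timeout=5.0):
    """Send an HTTP HEAD request over one address family.

    Returns the numeric status code, or None if the name has no address of
    that family or every connection attempt fails (refused, reset, timeout).
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None  # e.g. no AAAA record when asking for AF_INET6
    request = (
        f"HEAD {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    for fam, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                sock.sendall(request)
                status_line = sock.recv(4096).split(b"\r\n", 1)[0]
        except OSError:
            continue  # connection reset, refused or timed out: try next address
        parts = status_line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            return int(parts[1])
    return None
```

Calling it once with `socket.AF_INET` and once with `socket.AF_INET6` for flowm.daemon.contact and the `.hello` path shows whether the two families behave differently; a `None` for IPv6 alongside a status code for IPv4 would match the curl results above.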
