DNS problem - SERVFAIL for (seemingly) correctly replied names

Hello,

we are a webhosting provider trying to troubleshoot occasional certificate renewal failures, where the LE API responds with a DNS-related error like this one:

{ "type":"urn:ietf:params:acme:error:dns",
  "detail":"DNS problem: SERVFAIL looking up TXT for _acme-challenge.lidovapisen.cz - the domain's nameservers may be malfunctioning",
  "status":400 }

We tried checking our servers for errors and outages, but in the end - after we came up with nothing - we just took a packet dump of port 53 traffic during the LE renewal process. From the packet dump it seems like everything worked correctly on our side; see the example for one of the domains below:

Nameserver 1:

No.	Time	Source	Destination	Protocol	Length	Info
3378442	7369.380986	66.133.109.36	91.239.200.243	DNS	101	Standard query 0x5181 TXT _acme-challenge.lidovapisen.cz OPT
3378443	7369.381195	91.239.200.243	66.133.109.36	DNS	157	Standard query response 0x5181 TXT _acme-challenge.lidovapisen.cz TXT OPT

(repeated 3 times)

Nameserver 2:

3952689	7368.768877	66.133.109.36	82.100.6.2	DNS	101	Standard query 0xde97 TXT _acme-challenge.lidovapisen.cz OPT
3952690	7368.769044	82.100.6.2	66.133.109.36	DNS	236	Standard query response 0xde97 TXT _acme-challenge.lidovapisen.cz TXT NS ns1.thinline.cz NS ns2.thinline.cz NS

(repeated 2 times)

Nameserver 3:

3177831	7368.831524	66.133.109.36	91.239.202.18	DNS	101	Standard query 0x06d9 TXT _acme-challenge.lidovapisen.cz OPT
3177832	7368.831902	91.239.202.18	66.133.109.36	DNS	236	Standard query response 0x06d9 TXT _acme-challenge.lidovapisen.cz TXT NS ns1.thinline.cz NS ns2.thinline.cz NS ns3.cesky-hosting.eu OPT

(repeated 4 times)

(All times are in seconds relative to approximately 1:30 UTC on November 30th.)

Additionally, there were other source IP addresses requesting the same record; I presume those are different machines belonging to the LE infrastructure checking the records from different places.

The failure is not restricted to TXT records, it appears for CAA records as well:

{ "type":"urn:ietf:params:acme:error:dns",
  "detail":"DNS problem: SERVFAIL looking up CAA for www.laysedlakova.cz - the domain's nameservers may be malfunctioning",
  "status":400 }

Again, from the packet dump it seems like the server replied correctly - both for this variant of the name and for no-www variant laysedlakova.cz as well (shortened listing with nameservers merged):

No.	Time	Source	Destination	Protocol	Length	Info
8819581	18415.323444	3.137.221.195	91.239.200.243	DNS	86	Standard query 0x8f84 CAA LaysedLAKovA.CZ OPT
8819582	18415.323826	91.239.200.243	3.137.221.195	DNS	146	Standard query response 0x8f84 CAA LaysedLAKovA.CZ SOA ns1.thinline.CZ OPT
8822237	18420.615618	66.133.109.36	91.239.200.243	DNS	90	Standard query 0xe991 CAA www.laysedlakova.cz OPT
8822238	18420.615784	91.239.200.243	66.133.109.36	DNS	150	Standard query response 0xe991 CAA www.laysedlakova.cz SOA ns1.thinline.cz OPT

From what I can see, it seems like our nameservers send a reply properly and it gets lost in transit somewhere. If that is the case, it's definitely an intermittent problem: we are generally trying to renew a few hundred certificates each day and only get these failures for 5-10 of them. Considering that one of the servers is in a different datacenter and the problem appears when checking all nameservers at the same time, it feels like packet loss on some international line or further, i.e. nothing we will be able to fix.

So, a few questions:

  1. Can you spot something we clearly missed while trying to solve this?

  2. Is LE trying to evaluate all nameservers and returning failure when it doesn't get response from all of them? Would it be possible to re-try later?

  3. How should we solve this? I mean retrying from our side is an obvious solution to this problem but that feels like doing it at an incorrect end of things. (That said, we are not opposed to retrying - that's what we've been doing to renew affected certificates anyway. But we would like a confirmation from LE side that this is the preferred way of handling the situation.)
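
For the client-side retry mentioned in question 3, a minimal sketch of how a renewal script might recognise this particular failure is shown here. It only inspects the ACME problem document quoted above; the function name is hypothetical, and treating every DNS SERVFAIL as worth a retry is an assumption, not something Let's Encrypt has confirmed.

```python
import json

# Error type taken from the problem documents quoted in this post.
DNS_ERROR_TYPE = "urn:ietf:params:acme:error:dns"


def is_retryable_dns_failure(problem_json: str) -> bool:
    """Return True if the ACME problem document reports a DNS SERVFAIL.

    Assumption: a SERVFAIL reported by the CA is treated as transient.
    """
    problem = json.loads(problem_json)
    return (
        problem.get("type") == DNS_ERROR_TYPE
        and "SERVFAIL" in problem.get("detail", "")
    )


example = '''{ "type":"urn:ietf:params:acme:error:dns",
  "detail":"DNS problem: SERVFAIL looking up CAA for www.laysedlakova.cz - the domain's nameservers may be malfunctioning",
  "status":400 }'''

print(is_retryable_dns_failure(example))
```

Any other error type (for example an authorization failure) would fall through and should be handled as a real problem rather than retried blindly.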

Additional notes:

I tried to search the LE forum for similar problems but came up with nothing that felt relevant. The cases I found seemed related to oddly behaving nameservers or randomized capitalization not being supported on the nameserver side. None of these seem to apply in this case.

The machine handling the renewal process runs Debian Linux (Buster, then Bullseye); the client used is unixcharles/acme-client, a Ruby client for Let's Encrypt's ACME protocol. (Considering the type of failure it feels like this is not relevant, but I'm putting it here for completeness.)



Wow, thanks for including all that detail. I'm not sure if it'll be me, but I suspect somebody here will be able to help you.

First, to try to answer your questions:

Well, I've got some questions below to help dig into this, but it does seem like something weird is going on, that you'd see a specific response but Let's Encrypt would see a SERVFAIL, especially if the domain isn't using DNSSEC.

Let's Encrypt does check multiple nameservers, from multiple places on the Internet. They need to ensure that you actually own the name as seen from everywhere on the Internet, and the Internet doesn't always show the same things at different places, so they need to be thorough.

If the error message says "secondary validation" in it, then it worked from one location (their primary) but failed at a couple of their other locations. I don't know that that's what's happening for you here, though.

Well, the general advice for when to renew is to start 30 days before certificate expiration, and then if you have a problem you can continue retrying once or twice a day. Intermittent problems do happen, and Let's Encrypt does occasionally go down or suspend issuance, and you should probably have monitoring that alerts you if several attempts for a name haven't worked and the expiration is nearer. But if you're consistently having 5%-ish of your attempts fail, then that does seem high and yes it's probably good to dig into it like you are here.
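
As a rough illustration of that advice, here is a sketch of a per-certificate decision function. The 30-day window comes from the paragraph above; the alert thresholds (3 consecutive failures, 14 days left) and all names are assumptions picked for the example, not official recommendations.

```python
from datetime import datetime, timedelta, timezone

RENEW_WINDOW = timedelta(days=30)   # start renewing 30 days before expiry
ALERT_WINDOW = timedelta(days=14)   # assumption: "expiration is nearer"
ALERT_AFTER_FAILURES = 3            # assumption: "several attempts"


def renewal_action(not_after: datetime, now: datetime, consecutive_failures: int) -> str:
    """Decide what to do for one certificate on this run."""
    remaining = not_after - now
    if remaining > RENEW_WINDOW:
        return "wait"
    if consecutive_failures >= ALERT_AFTER_FAILURES and remaining <= ALERT_WINDOW:
        # Keep trying, but also tell a human.
        return "attempt-and-alert"
    return "attempt"


now = datetime(2021, 11, 30, tzinfo=timezone.utc)
print(renewal_action(now + timedelta(days=20), now, consecutive_failures=1))
```

Run once or twice a day, a single intermittent SERVFAIL then just shifts the renewal to the next run without anyone being paged.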

(And just to be clear, since you're saying you want "confirmation from LE side", I'm just a random person on the Internet and not any sort of official spokesperson for Let's Encrypt.)


Secondly, I've got a few questions for you, too:

  1. Are you always both the web hosting provider as well as the DNS provider for the domains that (sometimes) have problems? That is, is it the same DNS servers for all these names?
  2. Are you always using the DNS-01 challenge?
  3. Can you leave some test value in a _acme-challenge TXT record so that we can try hammering it with various DNS clients and seeing if we can see something odd?
  4. It seems weird to me that your packet captures are for IPv4 addresses, when your DNS servers (or at least the servers for _acme-challenge.lidovapisen.cz) support IPv6 as well. Let's Encrypt would use IPv6 if possible. Do you have any packet captures for IPv6?
  5. Are your packet captures for UDP, TCP, or both?
  6. When you try your hundreds of certificates and 5–10 fail, how often do those failures then work on a retry attempt? Like, does it almost always work the second time?
  7. What time of day does your renewal process happen? Is it spread out throughout the day for those hundreds of certificates, or do they all happen "at once"? Does it often happen at zero minutes past an hour?
  8. Do any of the names use DNSSEC?
  9. What DNS server software and version are you running?

Hopefully that will help people here be able to dig into this more.


There's one aspect that comes to my mind here:

In the (not so distant) past there were issues with Let's Encrypt returning misleading error messages (such as query timed out, or SERVFAIL messages) when it was in fact the Let's Encrypt software erroring out (e.g. due to high load). Here's the thread for reference: Consistent "During secondary validation: DNS problem" between 01:00 UTC and 02:00 UTC? - #13 by jcjones

However that issue (which turned out to be DNS queries getting dropped due to rate-limits) and the related misleading error messages should have been fixed already. So I wouldn't immediately assume correlation. But I do think that spurious DNS error messages can still happen during high load times - for varying reasons potentially not related to previous things.

Can you reproduce these errors during any time of the day, or only during high-load times (which were last reported to spike at around 01:00 - 02:00 UTC)?


Thanks for the replies, I will try to answer both of them here.

Yes, during my research I did find out that this can happen, but as far as I remember, this never happened to us. What I was wondering about here is whether LE requires all your nameservers to respond, or whether they refuse the validation if one of them fails to respond. The difference between those two is (the first) "transmission error somewhere upstream, nothing we can do" and (the second) "intermittent connection failure in our datacenter or close to it, might be able to do something about it".

Yes, yes and yes. The company decided that you are only allowed to request an LE certificate if you are using our nameservers. This is checked on each renewal too, for all names in the certificate. (The software taking care of the issue/renew process can theoretically do HTTP-01 but it prefers DNS-01 and I never noticed it doing anything else.)

Sure, _acme-challenge.ledebug.eboss.cz and _acme-challenge.ledebug.gorun.cz - one has DNSSEC, the other doesn't.

This struck me as odd as well. But that was the only address I could easily identify as common source. The packet dumps have accesses from IPv6 though (I assumed those are from the secondary locations) and are both TCP and UDP (used tcpdump's 'port 53' filter). See an example for IPv6:

No.	Time	Source	Destination	Protocol	Length	Info
8708444	18209.202829	2600:3000:2710:200::15	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	123	Standard query 0x3335 TXT _aCME-CHAllenGE.WwW.BfkONTAkT.cZ OPT
8708445	18209.203205	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:3000:2710:200::15	DNS	179	Standard query response 0x3335 TXT _aCME-CHAllenGE.WwW.BfkONTAkT.cZ TXT OPT

As far as I know, it pretty much always works the second time, trying on the same day.

The renewal process is currently scheduled to start at 2:26 UTC (1:26 UTC during daylight savings time.) Usually it takes several (8-ish) hours to finish. As of now, there is no concurrency, only one certificate is being worked on at any given time. The software isn't trying to avoid any time of the day so yes, some certificates probably start their renewal at 0 minutes past an hour.

I am aware that there's a problem of people starting their cron jobs at rounded times (which we try to avoid for exactly this reason), but I don't think this is the cause of the issue. The failures - or, more precisely, the times LE tried to check DNS for failed certificates - are spread throughout the hour.

Yes, they do. Doesn't seem to make a difference though, as far as I am aware, both DNSSEC-enabled and not enabled domains are failing.

PowerDNS and NSD, both from Debian Bullseye. They were recently upgraded from Debian Buster but to my knowledge, that made things neither worse nor better. Specific versions are 4.1.6 and 4.4.1 for PowerDNS, 4.1.26 and 4.3.5 for NSD.

(Note: we are using PowerDNS live signing mode for DNSSEC. Thus, all domains - even those without DNSSEC - are having their SOA serial incremented by 1 on each Thursday by the primary nameserver. This change is not immediately propagated to secondaries.

So, if anyone tries to check our nameservers today - or on Thursday in general - there is no need to tell us that our serial differs and our nameserver synchronization might not be working. We know about it, it is working and any other change triggers immediate zone transfer. Also, the software handling certificate renewal queries all our nameservers for the required DNS records first - before contacting LE and asking for validation.)


Let's Encrypt uses the unbound library for DNS resolving (together with some custom glue code).

Each Validation Authority (VA) in Let's Encrypt checks your DNS (TXT and CAA records) independently. They all do their own resolving. I believe unbound picks one of your domain's authoritative nameservers at random to resolve the query. Given that there seem to be currently 4 VAs, that means an authoritative nameserver is chosen at random 4 times, so there's a high chance that the queries hit more than one nameserver.

There is definitely retry logic (in the VA glue code and/or unbound) to retry in case one nameserver does not respond as intended, but I'm not sure what the exact conditions for retry are*. I believe that if the VA gets at least one valid answer, it will consider the query valid. In order for the entire validation to be successful, the primary datacenter must be successful, and at least 2 out of 3 secondaries (primary and secondary refer to Let's Encrypt's infrastructure).

There's a website - https://unboundtest.com/ - that has configured its instance very similarly to Let's Encrypt's production servers. It shows you the logs and all, so if you can reproduce your errors there, Let's Encrypt will see the same.

*I believe the Let's Encrypt VA tries another unbound instance if the current one failed, and unbound will probably do a similar thing for the nameservers. Retry may only apply in case there was no response at all - not sure about the current behaviour here.


Hmm… I've tried the things I can think of and the tools I know about and can't see a problem.

  • Unboundtest (I even tried a few times, and get NOERROR and the TXT value every time.)
  • DNSViz
  • ISC EDNS tester
  • dig +dnssec +bufsize=512 TXT _acme-challenge.ledebug.gorun.cz from my own system also looks good; I tried against each of the IPv4 and IPv6 authoritative addresses.

I did note that your packet logs in your first post don't look like they always have the 0x20-case-randomization going on (It's a request for _acme-challenge.lidovapisen.cz in all lowercase). Maybe it's that your logging lowercases it, but then the CAA checks having Standard query 0x8f84 CAA LaysedLAKovA.CZ in mixed case but the www CAA check having Standard query 0xe991 CAA www.laysedlakova.cz OPT in lowercase seems weird. And your IPv6 log has _aCME-CHAllenGE.WwW.BfkONTAkT.cZ in mixed case again. I would expect any half-recent version of PowerDNS to handle case randomization properly, though, so I think I'm probably just grasping at straws.

Do you have any indication of when the problem started? That is, do you think you might have been having this issue all along (and how long have you been using this setup), and are just now digging into failures to try to find out why, or did "something change" at some point?

I think I may also try to escalate this to the big leagues: Hey @jsha, if you get a chance, can you take a look at another weird DNS issue? This integrator is getting 5%-ish of their DNS-01 challenges rejected with a SERVFAIL message, but it generally works on a retry, and we can't figure out why.


I noticed that one as well but unless tcpdump/wireshark decided to lowercase those particular packets and not others, my conclusion was that for some reason the requests came in lowercased and were (correctly) replied to in lowercase.

Good catch on your side - now that you pointed it out, I did some digging in that direction. I went over the logs and the packet dump again and looked up packets for 10 failed domains of that batch. Each and every one of them was queried in lowercase for the record that failed. See an example for a domain that failed with "SERVFAIL looking up CAA for www.esterpavlu.cz":

No.	Time	Source	Destination	Protocol	Length	Info
6297496	13475.122192	66.133.109.36	91.239.200.243	DNS	104	Standard query 0x8907 TXT _aCME-cHAlleNgE.wWW.ESterpavlu.CZ OPT
6297498	13475.122562	91.239.200.243	66.133.109.36	DNS	160	Standard query response 0x8907 TXT _aCME-cHAlleNgE.wWW.ESterpavlu.CZ TXT OPT
6297528	13475.198045	54.201.180.224	91.239.200.243	DNS	88	Standard query 0x1b8b CAA www.eSTErPaVLu.CZ OPT
6297529	13475.198247	91.239.200.243	54.201.180.224	DNS	148	Standard query response 0x1b8b CAA www.eSTErPaVLu.CZ SOA ns1.thinline.CZ OPT
6299746	13479.040015	66.133.109.36	91.239.200.243	DNS	88	Standard query 0x3144 CAA www.esterpavlu.cz OPT
6299747	13479.040178	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0x3144 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT
6300480	13480.697110	66.133.109.36	91.239.200.243	DNS	88	Standard query 0xef70 CAA www.esterpavlu.cz OPT
6300481	13480.697187	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0xef70 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT
6300642	13481.005659	66.133.109.36	91.239.200.243	DNS	88	Standard query 0xa356 CAA www.esterpavlu.cz OPT
6300643	13481.005773	91.239.200.243	66.133.109.36	DNS	148	Standard query response 0xa356 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT

TXT record is resolved with case randomization by 66.133.109.36 (outbound1.letsencrypt.org). Same goes for CAA record from IP 54.201.180.224 (unnamed AWS machine). However, 66.133.109.36 did query for the CAA record in all-lowercase, multiple times.

This behaviour matches the error message returned by LE API in all 10 cases - the record that is resolved in all-lowercase is the record the message from the API names as unresolvable.
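
For anyone wanting to repeat this check on their own capture, here is a small sketch that flags all-lowercase queries. It assumes the capture was exported as tab-separated text with the same seven columns as the listings in this thread; the helper names are made up for the example and the sample lines are copied from the esterpavlu.cz listing above.

```python
from typing import NamedTuple, Optional


class Query(NamedTuple):
    time: float
    source: str
    qtype: str
    qname: str


def parse_query(line: str) -> Optional[Query]:
    """Parse one tab-separated capture line; return None unless it is a query."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        return None
    info = fields[6].split()
    # Queries look like: Standard query 0x3144 CAA www.esterpavlu.cz OPT
    # Responses have "response" as the third token and are skipped,
    # as is the header line (its Info column is just "Info").
    if len(info) < 5 or info[:2] != ["Standard", "query"] or info[2] == "response":
        return None
    return Query(float(fields[1]), fields[2], info[3], info[4])


def is_all_lowercase(qname: str) -> bool:
    """True if the name contains letters and none of them is uppercase."""
    return any(ch.isalpha() for ch in qname) and qname == qname.lower()


def lowercase_queries(lines):
    """Return the queries whose name shows no 0x20 case randomization."""
    queries = (parse_query(line) for line in lines)
    return [q for q in queries if q is not None and is_all_lowercase(q.qname)]


SAMPLE = [
    "No.\tTime\tSource\tDestination\tProtocol\tLength\tInfo",
    "6297528\t13475.198045\t54.201.180.224\t91.239.200.243\tDNS\t88\tStandard query 0x1b8b CAA www.eSTErPaVLu.CZ OPT",
    "6299746\t13479.040015\t66.133.109.36\t91.239.200.243\tDNS\t88\tStandard query 0x3144 CAA www.esterpavlu.cz OPT",
    "6299747\t13479.040178\t91.239.200.243\t66.133.109.36\tDNS\t148\tStandard query response 0x3144 CAA www.esterpavlu.cz SOA ns1.thinline.cz OPT",
]

for q in lowercase_queries(SAMPLE):
    print(f"{q.time:.6f} {q.source} {q.qtype} {q.qname}")
```

Grouping the flagged queries by name and type should then line up with the names mentioned in the API error messages, if the pattern described above holds.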

In contrast, I randomly picked a few domains that got their certificate without any error and searched for packets from the same IP. An example:

No.	Time	Source	Destination	Protocol	Length	Info
6464370	13805.054637	66.133.109.36	91.239.200.243	DNS	106	Standard query 0xdbbe TXT _ACME-chALLEngE.wwW.foXmarKetINg.cZ OPT
6464371	13805.055080	91.239.200.243	66.133.109.36	DNS	162	Standard query response 0xdbbe TXT _ACME-chALLEngE.wwW.foXmarKetINg.cZ TXT OPT
6464435	13805.198432	66.133.109.36	91.239.200.243	DNS	86	Standard query 0xa69b CAA fOxmaRkeTinG.CZ OPT
6464436	13805.198641	91.239.200.243	66.133.109.36	DNS	146	Standard query response 0xa69b CAA fOxmaRkeTinG.CZ SOA ns1.thinline.CZ OPT

Randomized case, single attempt for both records, certificate issued.

I mean... if the CAA record has 15 characters, then with one bit per character there's a reasonable 1 in 30000 chance to get all-lowercase after randomization. And since the client knows it is using randomization, it might be rejecting reply to such a request based on the lowercase only, without checking what went out? Thing is, even if that was the case, getting all-lowercase randomly with TXT record (13 bits just in acme-challenge) is far less likely and getting it in 10 cases during one night should be statistically impossible.
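
To put numbers on that estimate, here is a small worked example. It assumes, as above, that each letter's case is an independent fair coin flip (one bit per letter); digits, dots, hyphens and underscores carry no case bit.

```python
def case_bits(name: str) -> int:
    """Number of letters in the name, i.e. bits available for 0x20 randomization."""
    return sum(1 for ch in name if ch.isalpha())


def all_lowercase_probability(name: str) -> float:
    """Chance that a randomized name comes out entirely lowercase by accident."""
    return 0.5 ** case_bits(name)


for name in ("www.esterpavlu.cz", "_acme-challenge", "_acme-challenge.www.esterpavlu.cz"):
    bits = case_bits(name)
    print(f"{name}: {bits} bits, 1 in {2 ** bits}")

# Ten independent all-lowercase CAA queries in one night, 15 bits each:
print(all_lowercase_probability("www.esterpavlu.cz") ** 10)
```

So a 15-letter name gives 1 in 32768 (the "1 in 30000" ballpark), and ten such coincidences in one batch would be 2 to the power of -150, which is why chance alone does not explain the capture.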

I have only wild theories at this point... LE DNS client just randomly deciding to not randomize but to expect randomized reply? Some cleverbox en route randomly altering DNS packets?

If I remember correctly, this has been a problem for the past year, maybe two? I think it sort of crept in, with the number of failures slowly increasing over time from something that could be dealt with manually to a daily time-consuming annoyance. It is certainly possible that the failure rate is constant over time but the number of certificates grew.

The setup has been in use since LE started. Except for bi-yearly Debian upgrades, no other change comes to mind.


Yikes. That is interesting, and definitely makes me think the case randomization is somehow going wrong, but I'm not sure what else to look at to try to figure it out. The unbound "use-caps-for-id" option (which is what Let's Encrypt is using, if I understand correctly) I think may try to do something smart to figure out if the DNS server supports case-echoing, but I'm not sure what, or why it would only sometimes be applying.

There's no chance of some sort of "smart" firewall intercepting and rewriting DNS queries somewhere on your network before your main DNS server gets them, is there? Again, I'm not sure where else to look.


A very interesting problem. First, a couple of minor clarifications:

Let's Encrypt would use IPv6 if possible.

One thing I should clarify on that page: for the HTTP-01 request, we have an IPv6 preference. For the DNS resolution, it's up to Unbound how to select which IP address to connect to. I think it chooses mainly by past responsiveness, within bands, but would have to double check.

We actually use standalone Unbounds as a separate process.

I believe the Let's Encrypt VA tries another unbound instance if the current one failed, and unbound will probably do a similar thing for the nameservers.

This gets a little complicated. Yes, VA will try to query another DNS backend if the first one failed. But if we hit an unbound that already has a query in progress for the same qtype and qname, it will try to coalesce them rather than have duplicate queries.

Now, the finding about lowercase is interesting. Unbound tries to keep track of whether an authority server supports mixed-case response aka dns-0x20 (by echoing the query case in the response). What I might expect to see is an authority server that fails at 0x20 handling, causing the first validation to fail, while a second validation succeeds because Unbound has figured out the authority doesn't support 0x20.
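
For readers unfamiliar with dns-0x20, the idea can be sketched in a few lines. This is only an illustration of the technique (randomize the case of the query name, then require the authority to echo it exactly); it is not Unbound's actual implementation, and the function names are invented for the example.

```python
import random


def randomize_case(qname: str, rng: random.Random) -> str:
    """Flip each letter to upper or lower case at random; leave other characters alone."""
    return "".join(
        (ch.upper() if rng.random() < 0.5 else ch.lower()) if ch.isalpha() else ch
        for ch in qname
    )


def echo_matches(sent_qname: str, response_qname: str) -> bool:
    """A 0x20-aware resolver only accepts a reply that echoes the exact case sent."""
    return sent_qname == response_qname


rng = random.Random()
sent = randomize_case("www.laysedlakova.cz", rng)
print(sent)
print(echo_matches(sent, sent))          # authority echoes the case: accepted
print(echo_matches(sent, sent.lower()))  # lowercased reply: rejected unless sent was already lowercase
```

An authority (or a middlebox) that lowercases names would fail the second check, which is the failure mode one would normally expect to see, and the opposite of what the captures here show.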

However, we seem to have the reverse! You're getting success when you receive the mixed-case queries, and failure when you receive the all-lowercase queries. I'm not sure what could be causing that. The "clever middlebox lowercasing queries" idea is tempting but seems too bizarre to be true. Unless you happen to know you might be behind some odd firewall or anti-DoS device?

If we discard the clever middlebox hypothesis for now, the main thing that would cause lowercase queries is if Unbound concluded your authority servers don't support dns-0x20. Which seems unlikely, since PowerDNS supports it, and I just checked each of the IPs for each of your NS records (based off the example of www.laysedlakova.cz), and they all support dns-0x20.

The DNSSEC angle seems interesting, particularly since we saw a bug in PowerDNS some years ago with regards to DNSSEC signing of empty responses (and CAA responses are usually empty). But you say this happens for signed and unsigned zones alike, which seems to rule out a DNSSEC related cause.

I'll keep thinking on it. This is really interesting. Thank you for the detailed background on your setup and the packet capture information. That really makes it much easier and quicker to reason about what might be going on.


Thanks for looking into it. I still don't see how, even if somehow Let's Encrypt's Unbound thought that it should use an all-lowercase request (like somehow one unbound process on one server somewhere didn't have use-caps-for-id enabled or something wacky like that), why the lowercase response would cause a SERVFAIL.

Problems that can't be easily reproduced are the worst to try to diagnose. *sigh*


I looked at our logs for the last 7 days, filtered by your account and "SERVFAIL." I noticed that the validation latency was consistently right around 8.5 seconds, which is slightly unusual. Boulder is configured so that if a request to Unbound times out, it will retry up to twice, with a timeout of 10 seconds. So when a DNS server is unresponsive, we often see validation latency at 30s. On the other hand, if there's a DNSSEC problem we would expect to see SERVFAIL right away.

Unbound is a little inconsistent about when it decides to return a SERVFAIL vs wait a long time to reply to Boulder. This page describes the logic. Basically it seems Boulder is getting a SERVFAIL (not a timeout) from the first Unbound it queries.
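
The arithmetic behind that reading can be spelled out as follows. The constants are the figures quoted above (10-second timeout, up to two retries); the function names are just for illustration and this is not Boulder code.

```python
BOULDER_TIMEOUT_S = 10.0  # per-attempt timeout quoted above
BOULDER_RETRIES = 2       # "retry up to twice"


def worst_case_latency(timeout_s: float = BOULDER_TIMEOUT_S, retries: int = BOULDER_RETRIES) -> float:
    """Latency seen when the first attempt and every retry time out."""
    return (1 + retries) * timeout_s


def hit_boulder_timeout(observed_latency_s: float, timeout_s: float = BOULDER_TIMEOUT_S) -> bool:
    """True if the observed latency is long enough for at least one timeout."""
    return observed_latency_s >= timeout_s


print(worst_case_latency())      # 30.0, the "latency at 30s" case
print(hit_boulder_timeout(8.5))  # False: the answer came back before any timeout
```

An 8.5-second latency is below even a single timeout, so the SERVFAIL must have been an actual reply from Unbound rather than Boulder giving up.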

Your packet capture from DNS problem - SERVFAIL for (seemingly) correctly replied names - #7 by tlghacc (for A www.esterpavlu.cz) seems to indicate Unbound sent 1 query with mixed case at 75.19 from a remote VA, which presumably succeeded. A different Unbound sent 1 query with lowercase at 79.04 (about four seconds later), then retried twice more with lowercase at 80.70 and 81.05. It seems like Unbound was not getting these responses even though you were sending them.

Another useful thing to know: Unbound triggers its capsforid fallback (i.e. use lowercase) after 3 timeouts: unbound/iterator.c at 919c8c9527281a7289415c00f8f2aed12b17a9aa · NLnetLabs/unbound · GitHub. I suspect the root cause here is that the replies from your authority server are not reaching Unbound, and the lowercase queries are just a symptom of that - Unbound has determined (incorrectly) that the authority server does not support 0x20.

From a cursory reading of that code, it looks like Unbound tracks capsforid fallback on a per query basis. So Unbound's conclusion that the server doesn't support 0x20 probably isn't held over from a previous query. But if that were the case, why don't we see the initial, mixed-case queries for www.esterpavlu.cz in the packet captures?


So, if I can rephrase what you're saying, it looks like there's some level of packet loss between Let's Encrypt's main datacenter and these DNS servers, such that Unbound keeps on retrying, including falling back to an all-lowercase query, and eventually gives up on the server entirely and returns SERVFAIL.

So, if that's the case, is there some easy way to measure packet loss level (preferably of just UDP packets?) between Let's Encrypt's datacenter and these DNS servers? Though it seems like something where all the UDP port 53 packets are lost entirely for seconds at a time before connectivity is restored, rather than something where like 1% of all packets are just randomly dropped.

Can you check logs for the secondary datacenters too to see if they have the same kind of packet-loss/timeouts/SERVFAILs? Or do the secondary systems not get checked at all if the primary validation fails?


The secondary validation happens in parallel with the primary validation. If both primary and secondary validation fail, we only report the details of the primary validation. If primary validation succeeds but secondary validation fails, we report a random choice of the secondary validation errors and report that with "During secondary validation: ...". For secondary validation to fail, more than a threshold of remote VAs must fail - currently that's 1. Since this error doesn't say "During secondary validation," it's an issue at the primary datacenter.
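
The reporting rules just described can be summarised in a short sketch. This is a reading of the paragraph above, not Boulder's code; the function name and return labels are invented, and the case "primary fails, secondaries succeed" is assumed to be reported as a primary error as well.

```python
from typing import Sequence


def validation_outcome(primary_ok: bool, remote_ok: Sequence[bool], max_remote_failures: int = 1) -> str:
    """Return "valid", "primary" or "secondary" (which error would be reported).

    Secondary validation fails only when MORE than max_remote_failures
    remote VAs fail; the threshold is currently 1 according to the post.
    """
    remote_failures = sum(1 for ok in remote_ok if not ok)
    if not primary_ok:
        # Primary details win even if the secondaries failed too.
        return "primary"
    if remote_failures > max_remote_failures:
        # Reported with the "During secondary validation: ..." prefix.
        return "secondary"
    return "valid"


print(validation_outcome(True, [True, True, False]))     # one remote failure is tolerated
print(validation_outcome(True, [True, False, False]))    # two remote failures
print(validation_outcome(False, [False, False, False]))  # both fail: primary is reported
```

With three remote VAs and a threshold of 1, this matches the "primary plus at least 2 out of 3 secondaries" rule mentioned earlier in the thread.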

I'll ask for some traceroutes to those nameservers.


Thanks again. I guess I was hoping that if the secondary validations were running (which it sounds like they are), you could easily tell from the logs whether the same retries and timeouts were happening there too, or whether all those requests were succeeding. That might help narrow down whether the network gremlins were closer to Let's Encrypt's network or the DNS server's network.


Perhaps to help clarify matters, one might add some test subdomains (zones) and have each one only use one of the listed authoritative DNS servers.
Like:

ns1zone.laysedlakova.cz nameserver = ns1.thinline.cz
ns2zone.laysedlakova.cz nameserver = ns2.thinline.cz
ns3zone.laysedlakova.cz nameserver = ns3.cesky-hosting.eu

Then try issuing a cert with like 50 names on it (one per zone) - thus trying to force DNS via TCP.

Trying to isolate the DNS servers from each other and DNS UDP from DNS TCP.

If that fails to make anything obvious... I would have to dig deeper...
Like:

  • IPv4 / IPv6 differences
  • locality/region (AS numbers involved, Country) issues
    [Are you required to only use DNS servers located in your country?]

Well, our server housing provider has some kind of anti-DDoS solution - as far as I know, it's something developed in-house and it's a blackbox from our point of view. However, NS2 is running in a different datacenter operated by a different provider and to our knowledge, they have nothing like that.

(We tried to check with the provider a few times in the past when dealing with issues that could be explained by undesired interference by their anti-DDoS. According to their logs, it was never the case. And with one of those issues they even helped us to convict someone else of interfering, so all in all, they seem to have a good track record. Also I don't think that an anti-DDoS would filter traffic going out of it in the first place, which is what seems to be taking place here.)

Yes, happens for both - an example for a signed domain:

No.	Time	Source	Destination	Protocol	Length	Info
11230343	22690.119849	2600:3000:2710:200::19	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0xd5fb CAA SERemeDKY.CZ OPT
11230347	22690.121915	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:3000:2710:200::19	DNS	464	Standard query response 0xd5fb CAA SERemeDKY.CZ SOA ns1.thinline.CZ RRSIG NSEC3 RRSIG OPT
11230351	22690.129603	2600:1f14:804:fd01:9719:d96:1bb8:1ffa	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0x7e90 CAA sEREmeDKY.cZ OPT
11230353	22690.129879	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:1f14:804:fd01:9719:d96:1bb8:1ffa	DNS	464	Standard query response 0x7e90 CAA sEREmeDKY.cZ SOA ns1.thinline.cZ RRSIG NSEC3 RRSIG OPT
11230415	22690.267832	2600:3000:2710:200::19	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0xdb02 DNSKEY sEREMEdkY.CZ OPT
11230416	22690.268000	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:3000:2710:200::19	DNS	291	Standard query response 0xdb02 DNSKEY sEREMEdkY.CZ DNSKEY RRSIG OPT
11230429	22690.311221	18.237.94.63	91.239.200.243	DNS	83	Standard query 0x56df DNSKEY SErEMedkY.cZ OPT
11230430	22690.311432	91.239.200.243	18.237.94.63	DNS	271	Standard query response 0x56df DNSKEY SErEMedkY.cZ DNSKEY RRSIG OPT
11230431	22690.312743	2600:1f16:269:da00:932c:cecf:74a2:e192	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0x298c DNSKEY sEREmEDky.CZ OPT
11230432	22690.312964	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:1f16:269:da00:932c:cecf:74a2:e192	DNS	291	Standard query response 0x298c DNSKEY sEREmEDky.CZ DNSKEY RRSIG OPT
11230527	22690.521433	35.158.38.206	91.239.200.243	DNS	83	Standard query 0x3a4f CAA seRemEDky.cZ OPT
11230528	22690.521576	91.239.200.243	35.158.38.206	DNS	444	Standard query response 0x3a4f CAA seRemEDky.cZ SOA ns1.thinline.cZ RRSIG NSEC3 RRSIG OPT
11230541	22690.545154	2600:1f14:804:fd01:1489:bd54:5710:fd35	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	123	Standard query 0xfa65 TXT _aCME-CHALleNGe.WWW.sErEMEdKY.Cz OPT
11230543	22690.545712	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:1f14:804:fd01:1489:bd54:5710:fd35	DNS	287	Standard query response 0xfa65 TXT _aCME-CHALleNGe.WWW.sErEMEdKY.Cz RRSIG TXT OPT
11230589	22690.623655	2600:1f16:269:da02:56d2:238b:b12d:a6c4	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0xb905 DNSKEY sereMedKY.CZ OPT
11230590	22690.623873	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:1f16:269:da02:56d2:238b:b12d:a6c4	DNS	291	Standard query response 0xb905 DNSKEY sereMedKY.CZ DNSKEY RRSIG OPT
11230767	22690.909988	2600:3000:2710:200::21	2a00:1ed0:2:0:1:5bef:c8f3:1	DNS	103	Standard query 0xd525 DNSKEY SEReMeDky.cz OPT
11230768	22690.910158	2a00:1ed0:2:0:1:5bef:c8f3:1	2600:3000:2710:200::21	DNS	291	Standard query response 0xd525 DNSKEY SEReMeDky.cz DNSKEY RRSIG OPT
11233358	22695.746506	66.133.109.36	91.239.200.243	DNS	87	Standard query 0x4cba CAA www.seremedky.cz OPT
11233360	22695.748858	91.239.200.243	66.133.109.36	DNS	447	Standard query response 0x4cba CAA www.seremedky.cz SOA ns1.thinline.cz RRSIG NSEC3 RRSIG OPT
11233434	22695.900268	66.133.109.36	91.239.200.243	DNS	87	Standard query 0x04a7 CAA www.seremedky.cz OPT
11233435	22695.900416	91.239.200.243	66.133.109.36	DNS	447	Standard query response 0x04a7 CAA www.seremedky.cz SOA ns1.thinline.cz RRSIG NSEC3 RRSIG OPT
11235240	22698.905169	66.133.109.36	91.239.200.243	DNS	87	Standard query 0x5f82 CAA www.seremedky.cz OPT
11235241	22698.905396	91.239.200.243	66.133.109.36	DNS	447	Standard query response 0x5f82 CAA www.seremedky.cz SOA ns1.thinline.cz RRSIG NSEC3 RRSIG OPT
11235324	22699.056819	66.133.109.36	91.239.200.243	DNS	87	Standard query 0x4d10 CAA www.seremedky.cz OPT
11235325	22699.057065	91.239.200.243	66.133.109.36	DNS	447	Standard query response 0x4d10 CAA www.seremedky.cz SOA ns1.thinline.cz RRSIG NSEC3 RRSIG OPT

A question about this one: are 2600:3000:2710:200::19, ::21 and 66.133.109.36 the same machine? If so, the capture suggests this chain of events to me:

  1. ::19 asks for CAA seremedky.cz at 22690.119849 and gets a reply.
  2. Immediately after receiving the reply, ::19 asks for DNSKEY seremedky.cz at 22690.267832 (the time difference matches the RTT between the hosts); a reply is generated but not received.
  3. ::19, now as ::21, asks for DNSKEY seremedky.cz again at 22690.909988. The time difference here is 640 ms; in my test with an IPv6-enabled Unbound whose IPv6 connectivity was taken away, it retried after 380 ms, then fell back to IPv4 after another 380 ms.
  4. Unbound gives up on IPv6.
  5. No traffic is seen from any of those 3 IPs for 5 seconds. Might the route be dropping packets in both directions now?
  6. Unbound has fallen back to IPv4 and to all-lowercase names; 66.133.109.36 queries the CAA record repeatedly in all-lowercase, probably still not getting the replies.
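
The silent periods in the steps above can be pulled out of a text export mechanically. Below is a minimal sketch, assuming the capture is exported as whitespace-separated rows in the same column order as the dumps in this thread (No., Time, Source, Destination, Protocol, Length, Info). The helper names `parse_export` and `query_gaps` are made up for illustration, and treating the listed addresses as one resolver is exactly the unconfirmed assumption from the question above.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Packet(NamedTuple):
    time: float  # seconds, relative to the start of the capture
    src: str
    dst: str
    info: str    # Wireshark "Info" column


def parse_export(lines: Iterable[str]) -> List[Packet]:
    """Parse rows shaped like: No. Time Source Destination Protocol Length Info."""
    packets = []
    for line in lines:
        parts = line.split(None, 6)  # the Info column keeps its internal spaces
        if len(parts) < 7 or parts[4] != "DNS":
            continue  # header row, blank line, or non-DNS packet
        try:
            t = float(parts[1])
        except ValueError:
            continue
        packets.append(Packet(t, parts[2], parts[3], parts[6].strip()))
    return packets


def query_gaps(packets: List[Packet], resolver_ips: Set[str],
               min_gap: float = 1.0) -> List[Tuple[float, float, float]]:
    """Return (previous_time, next_time, gap) for silences between resolver queries."""
    times = sorted(
        p.time for p in packets
        if p.src in resolver_ips
        and not p.info.startswith("Standard query response")
    )
    return [(a, b, b - a) for a, b in zip(times, times[1:]) if b - a >= min_gap]


# Rows copied from the seremedky.cz dump above.
SAMPLE = [
    "11230767 22690.909988 2600:3000:2710:200::21 2a00:1ed0:2:0:1:5bef:c8f3:1 DNS 103 Standard query 0xd525 DNSKEY SEReMeDky.cz OPT",
    "11230768 22690.910158 2a00:1ed0:2:0:1:5bef:c8f3:1 2600:3000:2710:200::21 DNS 291 Standard query response 0xd525 DNSKEY SEReMeDky.cz DNSKEY RRSIG OPT",
    "11233358 22695.746506 66.133.109.36 91.239.200.243 DNS 87 Standard query 0x4cba CAA www.seremedky.cz OPT",
    "11233360 22695.748858 91.239.200.243 66.133.109.36 DNS 447 Standard query response 0x4cba CAA www.seremedky.cz SOA ns1.thinline.cz RRSIG NSEC3 RRSIG OPT",
    "11233434 22695.900268 66.133.109.36 91.239.200.243 DNS 87 Standard query 0x04a7 CAA www.seremedky.cz OPT",
]

# Assumption: these addresses belong to the same resolver.
RESOLVERS = {"2600:3000:2710:200::21", "66.133.109.36"}

for prev_t, next_t, gap in query_gaps(parse_export(SAMPLE), RESOLVERS):
    print(f"silence of {gap:.3f}s between {prev_t} and {next_t}")
```

Run over the whole export, this would list every period where the suspected resolver went quiet for at least `min_gap` seconds, which makes it easier to line the gaps up with timestamps from the providers.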

It seems to be consistent with the previous paragraph regarding seremedky.cz. At 13475.12 the IP 66.133.109.36 sends a mixed-case query for www.esterpavlu.cz TXT; the query is received by the authoritative NS but the reply does not reach Unbound (speculation: the route is dropping packets in one direction). Then we see a 4-second gap again (speculation: Unbound is retrying but the route is dropping packets in both directions). And finally at 79.04 we see an all-lowercase CAA query (speculation: during the 4-second gap we lost the mixed-case CAA query and Unbound went for the fallback; the route is dropping in one direction again).
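
To spot where the resolver switches from mixed-case to all-lowercase names, the Info column can be classified per query. This is only a sketch under the assumption that the Info text follows the "Standard query 0xID TYPE NAME ..." shape seen in the dumps above; `query_name` and `case_style` are illustrative names, not part of any tool.

```python
from typing import Optional, Tuple


def query_name(info: str) -> Optional[Tuple[str, str]]:
    """Return (qtype, qname) from a Wireshark DNS Info string, or None."""
    tokens = info.split()
    for i, tok in enumerate(tokens):
        # The transaction ID (e.g. 0xfa65) is followed by the type and the name.
        if tok.startswith("0x") and i + 2 < len(tokens):
            return tokens[i + 1], tokens[i + 2]
    return None


def case_style(qname: str) -> str:
    """'mixed' if the name contains any uppercase letter, otherwise 'lowercase'."""
    return "lowercase" if qname == qname.lower() else "mixed"


# Info strings copied from the seremedky.cz dump above.
INFOS = [
    "Standard query 0xfa65 TXT _aCME-CHALleNGe.WWW.sErEMEdKY.Cz OPT",
    "Standard query 0x4cba CAA www.seremedky.cz OPT",
]

for info in INFOS:
    parsed = query_name(info)
    if parsed is not None:
        qtype, qname = parsed
        print(qtype, qname, case_style(qname))
```

A randomized name can come out all-lowercase purely by chance, so a single lowercase query is weak evidence; a run of lowercase retries for the same name, as in the CAA queries above, is more telling.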

If there's something you want us to try or check, let me know.

Also thanks to everyone spending their time on this.

As far as I am aware, we are not.


If you've got things narrowed down with good timestamps, it may be a good idea to reach out to those providers anyway at this point (assuming that it's not expensive or complicated to open a support case with them). It wouldn't be the first time that an overaggressive firewall saw a bunch of traffic coming at once from Let's Encrypt and thought it should "protect" the DNS server by dropping traffic. Let's Encrypt does send several very similar requests all at once from several geographically distributed locations. Even if it's not the providers' anti-DDoS equipment directly, they might be able to trace routes with packet loss, look in their logs, and help with the diagnostics.

Can you tell whether, during these 5-second-or-so outages, other DNS traffic (not related to Let's Encrypt) is being sent/received successfully by your servers?
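
One way to answer that from an existing capture is to count the DNS packets exchanged with everyone else inside the quiet window. The sketch below assumes the same whitespace-separated text export as the dumps in this thread; `other_traffic` is a made-up helper, and the sample rows using 192.0.2.10 and 198.51.100.7 are synthetic placeholders (documentation address ranges), not real capture data.

```python
from collections import Counter
from typing import Iterable, Set


def other_traffic(lines: Iterable[str], start: float, end: float,
                  exclude_ips: Set[str]) -> Counter:
    """Count queries from, and responses to, clients outside exclude_ips in [start, end]."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(None, 6)  # No. Time Source Destination Protocol Length Info
        if len(parts) < 7 or parts[4] != "DNS":
            continue
        try:
            t = float(parts[1])
        except ValueError:
            continue
        if not (start <= t <= end):
            continue
        src, dst, info = parts[2], parts[3], parts[6]
        if info.startswith("Standard query response"):
            if dst not in exclude_ips:
                counts["responses"] += 1
        elif info.startswith("Standard query"):
            if src not in exclude_ips:
                counts["queries"] += 1
    return counts


# Synthetic rows for illustration only, except the last one, which is from the dump above.
ROWS = [
    "1 22691.100000 192.0.2.10 91.239.200.243 DNS 80 Standard query 0x1111 A example.com OPT",
    "2 22691.100200 91.239.200.243 192.0.2.10 DNS 96 Standard query response 0x1111 A example.com A 192.0.2.1 OPT",
    "3 22693.500000 198.51.100.7 91.239.200.243 DNS 80 Standard query 0x2222 A example.com OPT",
    "4 22695.746506 66.133.109.36 91.239.200.243 DNS 87 Standard query 0x4cba CAA www.seremedky.cz OPT",
]

print(other_traffic(ROWS, 22691.0, 22695.5, {"66.133.109.36"}))
```

Since the capture is taken on the nameserver, incoming queries from other clients show that the inbound path works for them, while outgoing responses only show that a reply was sent, not that it arrived.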


Then, I would look into adding some geographic diversity to your authoritative DNS presence.


While some geographic diversity might help, they're already using two different Internet providers. And assuming most of their users are near them, it doesn't make much sense to put a DNS server on the other side of the world just to help Let's Encrypt find it, especially if it might also mean that their users just get the packet loss issues trying to talk to the DNS server instead.

However, it may make sense for them to ensure that their Internet providers actually have different upstream connections to some reasonable degree. There are plenty of cautionary tales out there of companies that thought they had redundancy but one backhoe managed to take both of their connections out at once. It may not be feasible, though, depending on how many Internet transit points the region actually has.


A. It doesn't have to be on the other side of the world - it merely needs to be outside their country.
B. DNS is built for redundancy - putting all DNS eggs into any one basket pretty much takes away all that potential redundancy.

So, to me, it makes all the sense in the (internet) world to spread DNS around as much as possible.
Case in point:
[from which I've never had a DNS related issue]

image

Recap:
4 AS numbers
4 ISPs
2 Countries
[and that's just a play site - business critical sites should be as good (or better)]
