Spurious CAA SERVFAIL responses during finalize

jvanasco · November 30, 2023, 10:04pm

What rate-limits are you experiencing from this pattern or are worried about?

The only one I can imagine that would be affected by these failures is the pending authorization limit. That limit can jam up hosting providers, SaaS/PaaS vendors, and others who use multi-SAN certs, because the aggregate of pending authorizations across orders will often exceed the limit and prevent new orders from being made.

If that is the limit you are worried about, your client can/should be cleaning up pending authorizations by deactivating all the still-pending authorizations if a certificate fails. That is the status quo and recommended practice, unless you have other methods in place -- such as (iteratively) trying to complete the order with the successful and pending validations (e.g. retry the order without the failed domain, and repeat this process until the order eventually passes).
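The iterative retry mentioned above could look roughly like the following. This is a minimal sketch, not a real ACME client: `submit_order` is a hypothetical callable standing in for whatever your client does to place and validate an order, and it is assumed to return the set of domains that failed (an empty set meaning the order passed).

```python
from typing import Callable, Iterable, Set, Tuple


def issue_dropping_failures(
    domains: Iterable[str],
    submit_order: Callable[[Tuple[str, ...]], Set[str]],
    max_rounds: int = 5,
) -> Tuple[Tuple[str, ...], Set[str]]:
    """Retry an order, removing the domains that failed, until it passes.

    `submit_order` is a hypothetical stand-in for the client's order call.
    Returns (issued_domains, dropped_domains); issued_domains is empty if
    no order passed within `max_rounds`.
    """
    remaining = tuple(dict.fromkeys(domains))  # de-duplicate, keep order
    dropped: Set[str] = set()
    for _ in range(max_rounds):
        if not remaining:
            break
        failed = submit_order(remaining)
        if not failed:
            return remaining, dropped
        dropped |= failed
        # Next round: same order minus the domains that just failed.
        remaining = tuple(d for d in remaining if d not in failed)
    return (), dropped
```

The dropped set is returned so the caller can deactivate or follow up on those domains separately.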

In terms of being worried about Rate Limits in general...

Digging into your Cert history and company, I see the following setup:

You create a certificate with a CN ssl{n}.ipaper.io and assign <=99 customer domains to the SAN.
You host multiple subdomains per customer; one on crt.sh | 6181122217 has 30 subdomains
The customer's subdomain CNAMEs onto your system

LetsEncrypt has a rate-limit of 50 Certificates per Registered Domain per week. Considering you are aggregating everything under 11 Certificates on your domain right now - I would not worry about your domain triggering a ratelimit under normal circumstances.

I think you are triggering that rate-limit because your usage seems to be constantly re-configuring the domains assigned to ssl{n}.ipaper.io and re-issuing "that" certificate within a few days of the last issue. IMHO that is an anti-pattern, and I don't think LetsEncrypt is likely to approve a rate limit exception when they look at your usage during review.

Personally, I suggest you stop reconfiguring the certificates, and instead reconsider how you deploy them in your system(s). Several gateways and webservers will let you dynamically load an SSL certificate. If you are on nginx, the OpenResty fork supports doing this through Lua scripting and should not take more than an afternoon to implement. I can point you to resources and code examples if interested.

I would also start migrating your certificate structure to a dedicated client-based one -- something like:

cn: ssl{x}.ipaper.io
san: x.client1.com, y.client1.com

This is the strategy that cloudflare utilized on their systems when they required their name on a certificate (I don't think they do so any longer). If you follow that convention, you can onboard 50 new clients per week after validating DNS for all their domains is correctly set up -- as renewals do not count towards rate limits.

I think you are also running into this issue because you have one client's DNS changes affecting all the other clients on that certificate. By isolating clients to their own certificates, that problem will stop.
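As a rough illustration of the per-client layout, the grouping step might be sketched like this. The input shape (pairs of client id and hostname) and the function name are assumptions made for the example; the only external fact relied on is Let's Encrypt's limit of 100 names per certificate, with one slot kept for the CN.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_NAMES_PER_CERT = 100  # Let's Encrypt's limit on names per certificate


def partition_by_client(
    hostnames: Iterable[Tuple[str, str]],
    max_sans: int = MAX_NAMES_PER_CERT - 1,  # keep one slot for the CN
) -> Dict[str, List[List[str]]]:
    """Group (client_id, hostname) pairs into per-client SAN lists.

    Each inner list is the SAN set for one certificate; a client with more
    than `max_sans` hostnames gets several certificates.
    """
    by_client: Dict[str, List[str]] = defaultdict(list)
    for client_id, hostname in hostnames:
        if hostname not in by_client[client_id]:
            by_client[client_id].append(hostname)
    return {
        client: [names[i:i + max_sans] for i in range(0, len(names), max_sans)]
        for client, names in by_client.items()
    }
```

With this layout, a DNS change by one client only affects the certificates in that client's own list.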

In the interim, I would have someone on your team write a script to do a pre-flight check before renewal and ensure the DNS is correctly set up for every domain on each certificate before trying to renew.
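A pre-flight check along those lines might be sketched as follows. It assumes "correctly set up" means each customer domain still CNAMEs to a name under your own zone; `lookup_cname` is a hypothetical callable you would back with your DNS library of choice (it returns the CNAME target, or None when there is none), so nothing here performs real DNS queries.

```python
from typing import Callable, Dict, Iterable, List, Optional


def points_at(target: Optional[str], expected_zone: str) -> bool:
    """True if `target` is `expected_zone` or a name underneath it."""
    if not target:
        return False
    host = target.rstrip(".").lower()
    zone = expected_zone.strip(".").lower()
    return host == zone or host.endswith("." + zone)


def preflight(
    certs: Dict[str, Iterable[str]],
    lookup_cname: Callable[[str], Optional[str]],
    expected_zone: str,
) -> Dict[str, List[str]]:
    """Return, per certificate, the domains that no longer point at us.

    `certs` maps a certificate name (e.g. "ssl4") to its customer domains.
    Certificates with no problems are left out of the result.
    """
    problems: Dict[str, List[str]] = {}
    for cert_name, domains in certs.items():
        bad = [d for d in domains if not points_at(lookup_cname(d), expected_zone)]
        if bad:
            problems[cert_name] = bad
    return problems
```

An empty result means every domain passed; anything else lists the domains to fix or drop before the renewal is attempted.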

So my advice in order:

pre-flight before renewal
look into dynamic SSL loading
partition certificates by client
request a ratelimit extension.

I am confident that LetsEncrypt will approve a ratelimit extension for a SAAS/PAAS vendor that partitions certificates by client and uses them for 90 days (unless an addition/deletion is needed), as that is very typical for them to do.

I am not confident that LetsEncrypt will approve an extension for a SAAS/PAAS vendor that constantly repartitions various domains across certificates and orders a re-issue within a few days, as that is an anti-pattern that ties up their resources.

theduderog · December 1, 2023, 1:18am

Updating here: we are still seeing a high number of failures.

ITNiels · December 1, 2023, 10:24am

@jvanasco Thank you very much for that detailed explanation. We are aware that our current system is very far from ideal, and a complete rewrite of how we do certificates is on our roadmap; partitioning them by customer is very high on the wish list, as it would avoid 99% of the issues we are facing today. Right now, though, we are stuck with this setup for at least the next 6 months, so we will have to come up with some solution in the meantime to get going again.

Yes, it is the 50 certificate renewals per week that we are worried about running into. It has only happened twice ever, afaik, so we are being very careful not to hit the limits. But as the service works now, it takes a lot of manual work and checking when something fails.

I will try and see if it will make a successful run today, and see if there are some repeats that could be causing the failures as well by checking them manually.

Thanks again for the clarifications and help

ITNiels · December 1, 2023, 12:13pm

I just did 2 runs 10 minutes apart, it needed to renew 8 certificates (ssl4-ssl11)
In the first run ssl9 fails to check CAA for a single domain
In the second run, ssl8 fails to validate 2 domains (that previously succeeded 10 mins earlier) with a Rechecking CAA exception

So it does not seem to reuse the CAA checks between runs even with identical certificates.

ITNiels · December 1, 2023, 1:54pm

Okay!!
I have added some more retry logic so that we no longer have to retry ALL 8 certificates; instead we can retry a single one up to 3 times using the same request.

I did now see the same domain stick out again. It seems they are using ns.namebrightdns.com, for which there is an old article on this site, so we may have to exclude this domain from our list!

Guide to change DNS servers from NameBright to CloudFlare - Issuance Tech - Let's Encrypt Community Support (letsencrypt.org)

ITNiels · December 1, 2023, 2:50pm

I think I hit some limit somewhere, not getting certificates back anymore for ssl{4-8}.ipaper.io.
It used to say how many certificates were counted towards your limit on letsdebug.net. Is there a way to see my current status?

aarongable · December 1, 2023, 5:44pm

Ah, I know what's happening here. I described the behavior slightly incorrectly in my post above. We don't actually cache CAA lookups independently from domain control validation -- they're stored in the same place. When we do initial validation and CAA checks, if both succeed, we store that success in the database. We continue to use the domain control validation record for up to 30 days, but any time after the first 7 hours, we have to re-check CAA.

So any domains which completed domain control validation a couple days ago will now be doing CAA checks every time you retry. (This confusing behavior is one of the many reasons that we'd like to shorten our domain control validation caching time to be the same 7 hours as CAA. But that's a topic for elsewhere.)
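To make the two windows concrete, here is a small model of the behavior described above. It is only a sketch built from the 7-hour and 30-day figures in this post, not Let's Encrypt's actual code; the function name, the return labels, and the exact boundary handling are assumptions.

```python
from datetime import datetime, timedelta

CAA_REUSE = timedelta(hours=7)         # CAA result reusable this long
VALIDATION_REUSE = timedelta(days=30)  # domain control validation reusable this long


def checks_needed(validated_at: datetime, now: datetime) -> str:
    """Classify what a retry at `now` would trigger for one authorization.

    Returns "reuse" (nothing rechecked), "recheck-caa" (validation reused,
    CAA looked up again), or "revalidate" (validation has expired).
    """
    age = now - validated_at
    if age > VALIDATION_REUSE:
        return "revalidate"
    if age > CAA_REUSE:
        return "recheck-caa"
    return "reuse"
```

Under this model, an authorization validated two days ago lands in "recheck-caa" on every retry, which matches the repeated CAA lookups seen in the runs above.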

There are currently basically two ways around this:

Deactivate all of the current authorizations. This will force the next round of renewals to create brand-new authorizations. Then you can retry for up to 7 hours before CAA rechecks start happening again.
Switch to wildcard certificates, one per *.sslX.ipaper.io. You'll still be issuing the same number of certificates, with the same number of private keys, deployed on the same number of servers you control. I don't think this would require any fundamental changes to your deployment methodology. But it would reduce the number of validations and CAA checks you have to do from ~100 per cert to 1 per cert. Edit: My apologies, now that I've taken a closer look at one of your certificates, I now realize why this route doesn't work for you.

I also fully support @jvanasco's recommendations, but recognize that they're a more complex overhaul of your system that's not achievable in the immediate short-term. I'll also re-state my support for requesting a rate limit override so you could simply increase your number of sslX.ipaper.io hosts by a factor of 10, and reduce the number of domains per cert by the same factor of 10.

Nummer378 · December 1, 2023, 7:52pm

Let's Debug's cert-search tool tries to estimate the rate limit based on public information. The tool's data is not authoritative though and can miscalculate things. The tool uses public data from crt.sh to calculate rate limits.

For your specific domain (ipaper.io), it appears that there are more than 10000 certificates issued for it in total. There's a known limitation that breaks certificate search once more than 10000 certificates exist for a domain. This bug causes Let's Debug to find 0 active certificates for your domain, which then causes the rate limit calculation to not work.

This bug used to affect both crt.sh's web-based search and Let's Debug. It appears that crt.sh's frontend has since been improved to handle this a bit better; perhaps Let's Debug can be fixed as well. Will have to look into it sometime.

theduderog · December 1, 2023, 9:44pm

@jcjones Our renewals for this domain are failing consistently today. Could you dump a detailed log please? I think it would be helpful to narrow down what's wrong with our DNS setup.

Error finalizing order :: Rechecking CAA for "*.uaenorth.azure.glb.confluent.cloud" and 1 more identifiers failed. Refer to sub-problems for more information, problem: "urn:ietf:params:acme:error:caa" :: Error finalizing order :: While processing CAA for *.uaenorth.azure.glb.confluent.cloud: DNS problem: SERVFAIL looking up CAA for uaenorth.azure.glb.confluent.cloud - the domain's nameservers may be malfunctioning, problem: "urn:ietf:params:acme:error:caa" :: Error finalizing order :: While processing CAA for *.cert-intl-9d8pq0m5.uaenorth.azure.glb.confluent.cloud: DNS problem: SERVFAIL looking up CAA for cert-intl-9d8pq0m5.uaenorth.azure.glb.confluent.cloud - the domain's nameservers may be malfunctioning

MikeMcQ · December 1, 2023, 10:11pm

@theduderog dnsviz is reporting that you are not sending SOA records for nodata.
And, that was one of the things noted by someone else responding to you in a different thread.

I don't know why unboundtest would not flag that when dnsviz does.

https://dnsviz.net/d/uaenorth.azure.glb.confluent.cloud/dnssec/

theduderog · December 1, 2023, 10:21pm

@MikeMcQ Any idea why dnsviz would say that but dig shows we are returning SOA?

dig a uaenorth.azure.glb.confluent.cloud

; <<>> DiG 9.10.6 <<>> a uaenorth.azure.glb.confluent.cloud
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20839
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;uaenorth.azure.glb.confluent.cloud. IN	A

;; AUTHORITY SECTION:
uaenorth.azure.glb.confluent.cloud. 900	IN SOA	glbns1.confluent.cloud. hostmaster.confluent.cloud. 1 7200 900 1209600 900

;; Query time: 63 msec
;; SERVER: 100.217.4.1#53(100.217.4.1)
;; WHEN: Fri Dec 01 14:21:05 PST 2023
;; MSG SIZE  rcvd: 117

Bruce5051 · December 1, 2023, 10:42pm

Here is the https://www.hardenize.com/ perspective:
Hardenize Report: uaenorth.azure.glb.confluent.cloud

MikeMcQ · December 1, 2023, 10:43pm

I do not. It is the first time I have ever seen it say that. It just reminded me of the comments on the other thread.

Bruce5051 · December 1, 2023, 10:49pm

This is what I get using nslookup

$ nslookup -q=soa uaenorth.azure.glb.confluent.cloud glbns1.confluent.cloud.
Server:         glbns1.confluent.cloud.
Address:        75.2.51.54#53

uaenorth.azure.glb.confluent.cloud
        origin = glbns1.confluent.cloud
        mail addr = hostmaster.confluent.cloud
        serial = 1
        refresh = 7200
        retry = 900
        expire = 1209600
        minimum = 900

MikeMcQ · December 1, 2023, 10:55pm

I also got failures at the edns compliance tester below. This was also mentioned on that other thread

Using your zone and DNS server

https://ednscomp.isc.org/ednscomp/36090ff4e7

Bruce5051 · December 1, 2023, 10:57pm

And check these results too https://check-your-website.server-daten.de/?q=uaenorth.azure.glb.confluent.cloud

rg305 · December 2, 2023, 12:01am

These first two seem problematic (to me):

nslookup -q=ns uaenorth.azure.glb.confluent.cloud
*** UnKnown can't find uaenorth.azure.glb.confluent.cloud: Server failed

nslookup -q=ns azure.glb.confluent.cloud
*** UnKnown can't find azure.glb.confluent.cloud: Server failed

These last two answer as expected:

nslookup -q=ns glb.confluent.cloud
glb.confluent.cloud     nameserver = glbns1.confluent.cloud
glb.confluent.cloud     nameserver = glbns2.confluent.cloud

nslookup -q=ns confluent.cloud
confluent.cloud nameserver = ns-1101.awsdns-09.org
confluent.cloud nameserver = ns-2008.awsdns-59.co.uk
confluent.cloud nameserver = ns-336.awsdns-42.com
confluent.cloud nameserver = ns-648.awsdns-17.net

I tried the "same thing" with one of my domains and they all provided (better) answers:

nslookup -q=ns non.existent.sub.domains.beer4.work
*** Can't find non.existent.sub.domains.beer4.work: No answer
Authoritative answers can be found from:
beer4.work
        serial = 2023021815
        ...

nslookup -q=ns sub.domains.beer4.work
*** Can't find sub.domains.beer4.work: No answer
Authoritative answers can be found from:
beer4.work
        serial = 2023021815
        ...

jvanasco · December 2, 2023, 5:21pm

Writing a pre-flight script to check the domains on all the certs first is pretty simple. A few weeks ago, a LetsEncrypt staff member posted a quick script used to check TLDs for CAA records. It should be a good starting point to write something that can analyze your certs.

I get that DevOps stuff and housekeeping can get pushed back, but your Product/Tech leads should be prioritizing this in your next sprint. I am not speaking as a community member here, but someone who was formerly c-level of your target customer demographic and currently advises companies who are in your target demographic. You are in a hole of technical debt and should be digging yourself out of it, not deeper into it.

theduderog · December 4, 2023, 3:44am

We still do not know what the root cause is of these SERVFAIL responses but have noticed that adding a CAA record to our domain prevents the error from happening.

It would still be good to understand the root cause and why it's non-deterministic, failing most of the time but occasionally succeeding.

ITNiels · December 5, 2023, 4:34pm

Just a quick status update, and thank you to everyone who commented and helped.
We are now back to issuing certificates again. I have improved our flow quite a bit by caching the successful certificates: say we have to renew 8 and 7 succeed, then we save those and only retry number 8. We also added retrying each one a few times to get around the spurious CAA check failures.
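A flow like that could be sketched as follows. This is only an illustration of the idea, not the actual implementation: `renew_one` is a hypothetical callable wrapping a single certificate request (returning the certificate bytes, or None on failure), and the `issued` dictionary stands in for whatever storage keeps the successful certificates between runs.

```python
from typing import Callable, Dict, Iterable, Optional


def renew_all(
    cert_names: Iterable[str],
    renew_one: Callable[[str], Optional[bytes]],
    issued: Optional[Dict[str, bytes]] = None,
    max_attempts: int = 3,
) -> Dict[str, bytes]:
    """Renew each certificate, skipping ones that already succeeded.

    `renew_one` is a hypothetical per-certificate request; a failed attempt
    is retried up to `max_attempts` times before moving on.
    """
    issued = {} if issued is None else issued
    for name in cert_names:
        if name in issued:
            continue  # saved from an earlier run, do not request it again
        for _ in range(max_attempts):
            cert = renew_one(name)
            if cert is not None:
                issued[name] = cert
                break
    return issued
```

Passing the previous result back in as `issued` means a second run only touches the certificates that are still missing.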

We will still be doing a major overhaul of our LE service next year, but at least we are back and doing things slightly better than before.

Have a great day everyone.
Kind regards
Niels
