DNS problem: server failure at resolver looking up CAA for chat

nurupo · August 3, 2019, 5:34pm

The Let’s Encrypt API endpoint failed to issue certificates for some *.tox.chat subdomains while succeeding for others. Re-running the request worked after a couple of retries. Perhaps some API backend instance has DNS issues?

{
  "type": "urn:acme:error:caa",
  "detail": "Error creating new cert :: While processing CAA for nodes.tox.chat: DNS problem: server failure at resolver looking up CAA for chat",
  "status": 403
}
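
Because the failure is intermittent, a renewal script can wrap the issuance call in a small, bounded retry. The following is a minimal Python sketch, not part of any ACME client: `issue` stands for whatever function performs the request, and `AcmeError`, `retry_issue`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class AcmeError(Exception):
    """Hypothetical exception carrying the parsed ACME problem document."""

    def __init__(self, problem):
        super().__init__(problem.get("detail", ""))
        self.problem = problem


def retry_issue(issue, attempts=3, base_delay=30.0, sleep=time.sleep):
    """Call issue() until it succeeds or `attempts` tries are used up.

    Waits base_delay, 2 * base_delay, ... seconds between tries.
    """
    for attempt in range(attempts):
        try:
            return issue()
        except AcmeError:
            if attempt == attempts - 1:
                raise  # out of tries: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Keeping `attempts` low avoids hammering the API when the failure turns out not to be transient.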

jsha · August 4, 2019, 2:01pm

Hm, this one is a little unusual. More often, we see SERVFAIL problems, which mean either the authoritative servers timed out or DNSSEC was signed incorrectly. However, “server failure at resolver” is more of a catch-all that may indicate some timeout internal to Let’s Encrypt.

Do you get this error consistently? Have you been able to issue for other names under .chat? Is this your first time issuing for a .chat name?
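
For context on why the error names `chat` rather than `nodes.tox.chat`: CAA checking walks up the DNS tree (RFC 8659), so the parent names, up to the top-level domain, can be queried as well. A small Python sketch of the candidate names follows; `caa_lookup_names` is an illustrative helper, not Boulder code, and it says nothing about the order or parallelism of the real lookups.

```python
def caa_lookup_names(fqdn):
    """Return the names a CAA check may query, most specific first."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


print(caa_lookup_names("nodes.tox.chat"))
# prints ['nodes.tox.chat', 'tox.chat', 'chat']
```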

https://github.com/letsencrypt/boulder/blob/6f93942a0449ea6bc0f6927c5e7520c1034f49b6/bdns/problem.go#L36


			} else {
				detail = detailDNSNetFailure
			}
			// Note: we check d.underlying here even though `Timeout()` does this because the call to `netErr.Timeout()` above only
			// happens for `*net.OpError` underlying types!
		} else if d.underlying == context.Canceled || d.underlying == context.DeadlineExceeded {
			detail = detailDNSTimeout
		} else {
			detail = detailServerFailure
		}
	} else if d.rCode != dns.RcodeSuccess {
		detail = dns.RcodeToString[d.rCode]
	} else {
		detail = detailServerFailure
	}
	return fmt.Sprintf("DNS problem: %s looking up %s for %s", detail,
		dns.TypeToString[d.recordType], d.hostname)
}


// Timeout returns true if the underlying error was a timeout
func (d DNSError) Timeout() bool {
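
The format string in the excerpt above, `DNS problem: %s looking up %s for %s`, means the `detail` field can be split back into its parts on the client side. A hedged Python sketch follows; the regular expression and the name `parse_dns_problem` are assumptions based only on that format string, which may differ in other Boulder versions.

```python
import re

_DNS_PROBLEM = re.compile(
    r"DNS problem: (?P<detail>.+) looking up (?P<rtype>\S+) for (?P<hostname>\S+)$"
)


def parse_dns_problem(message):
    """Extract (detail, record type, hostname) from a Boulder-style message.

    Uses search() because the text is usually wrapped in a longer prefix.
    Returns None when the message does not match the expected shape.
    """
    match = _DNS_PROBLEM.search(message)
    if match is None:
        return None
    return match.group("detail"), match.group("rtype"), match.group("hostname")
```

Applied to the message from the first post, this yields the detail `server failure at resolver`, the record type `CAA`, and the hostname `chat`.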

nurupo · August 4, 2019, 4:47pm

Do you get this error consistently?

No, it's not consistent. After first encountering the error I re-ran the script; it failed with the same error once more, but on the second retry it succeeded.

Have you been able to issue for other names under .chat?

Yes. Only 2 out of 10 or so subdomains failed, both with this error. There is nothing special about the names that failed, we haven't touched the DNS records in a long time, and after a couple of retries those 2 succeeded too.

Is this your first time issuing for a .chat name?

We have been using Let's Encrypt on this domain since mid-2016 or so, and it has never returned an error response before, neither this one nor any other.

Something I have noticed is that the certificate transparency logs show Let's Encrypt issued a pre-certificate for nodes.tox.chat but no leaf certificate for it. This happened when the error occurred. https://censys.io/certificates?q=347830211024851204302430990231197014451134

mnordhoff · August 4, 2019, 6:03pm

Let's Encrypt logs final certificates on a best-effort basis. After all, you can't unissue a certificate if there's a CT log outage after you've already done it. They could put them in a queue and retry later but, as far as I know, they don't.

You can currently download the certificate here:

https://acme-v02.api.letsencrypt.org/acme/cert/03fe2e5de5c479933ca73bdfc79e464663be

The number in the URL is the serial number in hex. (Your Censys link used the serial number in decimal.)
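
Converting between the two serial number forms is a one-liner in most languages. A minimal Python sketch follows; the function names are illustrative, and the default width of 36 hex digits is an assumption taken from the URL above rather than a documented guarantee.

```python
def serial_hex_to_decimal(serial_hex):
    """Convert a hex serial (as used in the cert URL) to a decimal integer."""
    return int(serial_hex, 16)


def serial_decimal_to_hex(serial, width=36):
    """Convert a decimal serial to zero-padded lowercase hex."""
    return format(serial, "x").zfill(width)


print(serial_hex_to_decimal("03fe2e5de5c479933ca73bdfc79e464663be"))
```

The zero padding matters: the leading `0` in `03fe...` is dropped by a plain integer-to-hex conversion.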

I just downloaded it and submitted it to some CT logs, and now it shows up on crt.sh.

jsha · August 4, 2019, 9:52pm

That's correct. We try to submit for five minutes, and if that fails we give up.

nurupo · August 4, 2019, 10:27pm

Interesting. So, do I get this right: the Let’s Encrypt API endpoint replied to me with urn:acme:error:caa, so I didn’t get the certificate, but since Let’s Encrypt had created it locally, LE reported it to the CT logs? I guess that makes sense.

I’m curious what the next step is after:

1. verifying that the request to issue a certificate is legitimate (e.g. through .well-known/acme-challenge)
2. signing the certificate
3. reporting the certificate to CT / adding it to the report queue

i.e. what is the 4th step, the one that failed with the urn:acme:error:caa DNS error?

_az · August 4, 2019, 11:34pm

The CAA check is always strictly before any kind of certificate is issued.

The finalization process is more like:

1. Check the authorizations are still valid
2. Recheck the domain’s CAA records
3. Issue the precertificate
4. Submit precertificate to CT logs in exchange for SCTs
5. Issue final certificate, embedding the SCTs
   a. Return the final certificate to the client
   b. Submit final certificate to CT logs (asynchronous, best-effort)
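
The ordering above can be modeled as a simple list to reason about what already exists when a given step fails. This is an illustrative Python sketch of the description in this post, not Boulder's actual code; the step labels and function names are made up for the example.

```python
# Step labels paraphrase the finalization list above, in order.
FINALIZATION_STEPS = [
    "check authorizations",
    "recheck CAA",
    "issue precertificate",
    "submit precertificate to CT",
    "issue final certificate",
]


def completed_before(failed_step):
    """Return the steps that finished before `failed_step` failed."""
    return FINALIZATION_STEPS[: FINALIZATION_STEPS.index(failed_step)]


def precertificate_exists(failed_step):
    """True if a precertificate was already issued when `failed_step` failed."""
    return "issue precertificate" in completed_before(failed_step)
```

Under this model, a failure at the CAA recheck leaves no precertificate behind, while a failure at the last step does.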

nurupo · August 5, 2019, 12:47am

So, how come the final certificate was submitted to CT logs (https://crt.sh/?id=1740322533) if the client, instead of getting the final certificate, got the urn:acme:error:caa DNS error back? Based on these steps, even the pre-certificate (https://crt.sh/?id=1731001687) shouldn’t have been submitted, the process should have failed somewhere in step (2).

_az · August 5, 2019, 1:37am

I’m going to go out on a limb and say that’s impossible. The CAA error and that certificate came from different orders.

mnordhoff · August 5, 2019, 8:34am

To clarify – or, more likely, to write a wall of text and make things less clear:

We don’t know for sure whether Let’s Encrypt tried to submit that final certificate to any CT logs or whether it succeeded.

We know that they issued it, stored it, made it available to the certificate download API, and are properly handling OCSP.

I submitted the certificate to several CT logs. (Anyone in the world can submit a public certificate to a public CT log. You just need a copy of the certificate and the intermediate(s) and some CT software. You don’t need to be a CA or have access to any private information possessed by the CA or the subscriber.)
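
For reference, RFC 6962 logs accept submissions as an HTTP POST to `<log URL>/ct/v1/add-chain`, with a JSON body whose `chain` array holds base64-encoded DER certificates, end-entity first. A minimal Python sketch that builds such a body from a PEM bundle follows; the function name is illustrative, and actually sending the request (and picking a log that accepts the chain) is left out.

```python
import json
import re

_PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def add_chain_body(pem_bundle):
    """Build the JSON body for a CT add-chain request from PEM text.

    The PEM payload is already base64 of the DER encoding, so only the
    line breaks need to be removed. Certificates keep their input order.
    """
    chain = ["".join(block.split()) for block in _PEM_CERT.findall(pem_bundle)]
    if not chain:
        raise ValueError("no certificates found in PEM input")
    return json.dumps({"chain": chain})
```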

My actions are why it shows up on crt.sh and, now, Censys.

The problem is that visibility into Certificate Transparency is cloudy. [sound effect] A human can’t read CT logs directly with a web browser and their own eyes. You need software to do it. Websites like crt.sh and Censys run software to download data from the CT servers and provide a human-friendly interface to it.

But, for a human to use one of those websites to learn what’s going on, you’re relying on how the website is implemented, whether it’s working correctly, and even whether it’s honest.

To make up a simple hypothetical example, if you were using a CT website that only updated once a day, you wouldn’t have any idea whether a certificate issued a few hours ago was logged.

I believe Let’s Encrypt logs final certificates to the Argon and/or Oak logs. (I also submitted it to those two logs, among others.) Censys appears not to track Oak (it’s new). And crt.sh is backlogged processing them both:

https://crt.sh/monitored-logs

Going by Censys’s data, it appears that Let’s Encrypt did not successfully log the certificate in question to Argon, but I don’t know whether they logged it to Oak.

nurupo · August 5, 2019, 1:51pm

Oh, the forum got fixed. I wanted to reply earlier, but neither I nor this thread existed on the forum.

I’m going to go out on a limb and say that’s impossible. The CAA error and that certificate came from different orders.

You are actually right. I just double-checked the logs, and it looks like the cron'ed renewal failed with a different error than the one I got (and posted here) when I manually tried to renew the cert after the cronjob had failed. I mistakenly thought it was the same error, but they are two different ones. I'm sorry for the confusion. Here is the error the cronjob got on Fri, 2 Aug 2019 01:08:11 +0000 (UTC), i.e. when the pre-cert was published to CT (https://crt.sh/?id=1731001687):

Parsing account key...
Parsing CSR...
Registering account...
Already registered!
Verifying nodes.tox.chat...
nodes.tox.chat verified!
Signing certificate...
Traceback (most recent call last):
  File "/usr/local/bin/acme_tiny.py", line 198, in <module>
    main(sys.argv[1:])
  File "/usr/local/bin/acme_tiny.py", line 194, in main
    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)
  File "/usr/local/bin/acme_tiny.py", line 161, in get_crt
    raise ValueError("Error signing certificate: {0} {1}".format(code, result))
ValueError: Error signing certificate: 500 {
  "type": "urn:acme:error:serverInternal",
  "detail": "Error creating new cert",
  "status": 500
}

Anyway, this is getting a bit off-topic. All I wanted to do with this thread was to report the API endpoint failing, which I hadn't seen before, in the hope that you could identify the issue internally and maybe fix it, if there is an issue.

I got rather curious about how LE was able to issue a certificate before getting the DNS error, and the whole thread got sidetracked.

mnordhoff · August 5, 2019, 4:47pm

Are you getting those serverInternal errors frequently?

Nothing is perfectly reliable, but Let’s Encrypt shouldn’t return that error message often.

At best, it indicates something prosaic happened but the real error message got eaten. At worst, there’s a bug or operational issue, and it ought to get fixed.

jsha · August 5, 2019, 6:59pm

Thanks for all the help diagnosing and explaining, all! The fact that you also got a 500 for issuance helps explain what’s going on. If you get a 403 (due to failed CAA check), the issuance never proceeds.

If you get a 500, often that means that the issuance was successful, but there was a timeout writing the issued certificate to the database. If that’s the case, a precertificate has already been submitted. In some cases, despite the timeout, the DB query was actually successful, so the final certificate shows up in the DB. We also log the full certificate in such cases, and have a job that handles cleaning up and resubmitting to the DB until it succeeds.
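
On the client side, the distinction between the two responses could be captured roughly as follows. This Python sketch only restates the explanation above; `describe_issuance_error` and its return strings are hypothetical, and the matching on the `type` suffix is an assumption based on the two error bodies quoted in this thread.

```python
def describe_issuance_error(problem):
    """Map an ACME problem document to a short note based on this thread."""
    kind = problem.get("type", "")
    status = problem.get("status")
    if status == 403 and kind.endswith(":caa"):
        return "CAA check failed; issuance did not proceed"
    if status == 500 and kind.endswith(":serverInternal"):
        return "server error; a certificate may have been issued anyway"
    return "unclassified error"
```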
