Consistent 500's for new-cert (failing CAA for one domain)

I'm not sure if it's appropriate to revitalize this thread or start a new one.

We're seeing consistent failures (500 urn:acme:error:serverInternal: Error creating new cert) when creating a cert with the following domain list:

5642684278505472-fe2.pantheonsite.io
acc.vanillastage.com
alljerseymovers.com
blog.socialsecurity.gov
buderim9.com.au
c-s-f.org
chasschwartz.com
client.achieveinternet.com
crc.thewhitehawkgroup.com
ctb.pernod-ricard.io
dev.dailywire.com
farmermac.com
fuseiq.com
gardnermuseum.org
globalcdn.bcrfcure.org
healthyerhacks.com
hissingkitty.com
i3.neimanhpi.com
info.phsc.edu
inogo.stanford.edu
internal-setenv.com
lematinshopping.ch
lipkinhiggins.com
live.fordsgin.com
live.odden.io
m.medifab.co.nz
medifab.co.nz
medifab.com.au
medsummit-cecme.org
messageagency.com
miss-cme.org
mobile-test.nationalreview.com
navesdemerced.org
neimanhpi.org
novahealthfdn.org
novonordiskportal.ca
nowrealtynd.com
oa.achieveinternet.com
odden.io
openatrium.achieveinternet.com
opm.rent
pags-cme.org
pcpc-cme.com
philanthropysouthwest.org
phsc.edu
plymouthbayculture.com
plymouthbayculture.org
prixharmonie.com
stepaheadpaediatrics.com.au
sunshinecoastopenhouse.com.au
test.petersonhealthcare.org
test.wellcertified.com
umces.edu
www.alljerseymovers.com
www.buderim9.com.au
www.c-s-f.org
www.chasschwartz.com
www.farmermac.com
www.fuseiq.com
www.gardnermuseum.org
www.healthyerhacks.com
www.hissingkitty.com
www.info.phsc.edu
www.internal-setenv.com
www.lematinshopping.ch
www.lipkinhiggins.com
www.medifab.co.nz
www.medifab.com.au
www.messageagency.com
www.neimanhpi.org
www.novahealthfdn.org
www.novonordiskportal.ca
www.nowrealtynd.com
www.odden.io
www.opm.rent
www.pawpatrollive.com
www.philanthropysouthwest.org
www.phsc.edu
www.plymouthbayculture.com
www.plymouthbayculture.org
www.prixharmonie.com
www.stepaheadpaediatrics.com.au
www.sunshinecoastopenhouse.com.au
www.townsendsecurity.com
www.umces.edu
www.wwpr.org
www.wyattirrigation.com
www.wyattsupply.com
www.zelojobs.com
wyattirrigation.com
wyattsupply.com
zelojobs.com

@cpu, could you take a look into the circumstances surrounding this new 500 error?

I've split out this topic since it's a different root cause than the previous thread. This is related to the issuance-time CAA checking we just enabled. One of the domains on that certificate is failing CAA checks. This is the error Boulder is supposed to show you:

unable to create new cert: Rechecking CAA: DNS problem: SERVFAIL looking up CAA for prixharmonie.com, DNS problem: SERVFAIL looking up CAA for www.prixharmonie.com

Unfortunately, there appears to be a bug that is turning this into a generic ServerInternal error with no additional detail. We'll fix that bug in next Thursday's release.

The immediate fix is to issue a certificate without prixharmonie.com and www.prixharmonie.com, and contact the owners of that site to see about fixing their DNS. See the Let's Encrypt documentation page "Certificate Authority Authorization (CAA)" for more about CAA debugging. You could also, of course, continue to use the current certificate until it is closer to expiry, to give prixharmonie.com as much time as possible to fix their DNS.

If you or anyone else experiences consistent 500's for new-cert between now and September 7, when we expect to deploy a better error message, a quick and easy test to see if it's the same issue is to run dig caa example.COM @8.8.8.8 (note intentionally mixed case) for each domain in the certificate. Note that Let's Encrypt's nameservers are slightly stricter than 8.8.8.8, so it's possible this may miss some issues, but it should catch most of them.
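The check described above can be scripted for a long domain list. The following is a minimal sketch in Python, assuming `dig` is installed and on the PATH; the helper names (`mixed_case`, `dig_caa_command`, `caa_status`) are made up for this example and are not part of any Let's Encrypt tooling. It only reads the `status:` field from dig's header line.

```python
import random
import subprocess


def mixed_case(domain, rng=random):
    """Randomly upper- or lower-case each character (ASCII names assumed)."""
    return "".join(rng.choice((ch.lower(), ch.upper())) for ch in domain)


def dig_caa_command(domain, resolver="8.8.8.8", rng=random):
    """Build the argv for `dig caa <Mixed-Case-Name> @<resolver>`."""
    return ["dig", "caa", mixed_case(domain, rng), "@" + resolver]


def caa_status(domain, resolver="8.8.8.8"):
    """Run dig and return the header status (e.g. 'NOERROR', 'SERVFAIL').

    Returns None if no status line is found in the output.
    """
    result = subprocess.run(
        dig_caa_command(domain, resolver),
        capture_output=True,
        text=True,
        check=False,
    )
    for line in result.stdout.splitlines():
        if "status:" in line:
            return line.split("status:")[1].split(",")[0].strip()
    return None
```

Calling `caa_status(name)` for each name on the certificate and flagging anything other than `NOERROR` should surface the same kind of failure, with the caveat already noted that 8.8.8.8 is somewhat more lenient than Let's Encrypt's resolvers.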

Hi @jsha and @cpu, we just encountered this error again trying to renew a certificate, where the CAA lookup for a domain was returning SERVFAIL but Let’s Encrypt was returning 500. The domain is test.MIKe-riChARdSON.COm and the error 500 urn:acme:error:serverInternal: Error creating new cert

Hi @marktheunissen,

The bugfix for returning a non-500 when rechecking CAA fails hasn’t gone to production yet. I believe it will as part of this afternoon’s Boulder update.

Hi again @marktheunissen,

Today’s update of the production Boulder instance is finished. The changelog includes the fix for the bug you’re observing.

Please let us know if you continue to see this error now that the bugfix has been deployed.

Thanks!

Hi @cpu, thanks, we can see the 403 error now. Our concern is that we now need to parse the error message, e.g. for example.com:

403 urn:acme:error:unauthorized: Error creating new cert :: Rechecking CAA: DNS problem: SERVFAIL looking up CAA for example.com

This contains the failing domain example.com, but parsing error messages is not ideal for an automated cert issuance system. If multiple domains are failing, how are they separated, and can we safely rely on the format of that error message not changing in future?

We could check the CAA ourselves, but reproducing Boulder’s logic in our system seems like an untenable practice due to the likely divergence of the codebases.

Is there an API we can use to determine if a certificate can successfully be issued? We were relying on auths but in this case, the auth will be valid yet the cert will fail to issue.
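To make the concern concrete, here is a minimal sketch in Python of the kind of parsing this forces on a client. It assumes the detail string keeps the shape seen in this thread ("DNS problem: <RCODE> looking up CAA for <name>", with multiple failures separated by commas); that format is not a documented contract, and the function name is made up for this example.

```python
import re

# Pattern inferred from the error strings quoted in this thread; an
# assumption about the message shape, not a stable interface.
_CAA_FAILURE = re.compile(r"DNS problem: (\w+) looking up CAA for ([^\s,]+)")


def failing_caa_lookups(detail):
    """Return (name, rcode) pairs for each CAA lookup failure in `detail`."""
    return [(m.group(2), m.group(1)) for m in _CAA_FAILURE.finditer(detail)]
```

Anything built on this would break silently if the wording changes, which is why a structured response would be preferable.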

Yep, I agree it’s not ideal that you have to parse out the results. We’ll brainstorm a solution to get the results in a more structured way.

@jsha @cpu We're getting an error now that says CAA is failing for a domain that we aren't trying to issue a cert for.

The error is:

403 urn:acme:error:unauthorized: Error creating new cert :: Rechecking CAA: DNS problem: SERVFAIL looking up CAA for sitelockcdn.net

The list of domains we are trying to issue the cert for is below; it does not include sitelockcdn.net:

[dev.famsf.org legionofhonor.org famsf.org www.apacf.org www.deyoungmuseum.org test.famsf.org www.bigstepslongstrides.com www.prsnapshot.com jmfebizhub.net www.sprint-review.pan-dns-test-4.com deyoungmuseum.org blog.haircuttery.com president.utexas.edu manningleaver.com wfmt.com dev-deyoung.famsf.org www.launchpad.utexas.edu www.gonetotexas.utexas.edu www.trustntm.com www.wintonsteak.com themodern.org launchpad.utexas.edu dev-radionetwork.wfmt.com thinker.org www.jmfebizhub.net www.alumni.techhub.osu.edu pantheon.sujay.me www.labragirlfilmproject.org ajc-live.jacksonriverdev.com standardsfacility.org www.cpusa.org gonetotexas.utexas.edu merchantcapitalsource.com www.famsf.org bluecandlelight.org test-legionofhonor.famsf.org www.brit.org test-deyoung.famsf.org legionofhonor.famsf.org soundonsound.com www.legionofhonor.org wintonsteak.com bigstepslongstrides.com www.standardsfacility.org apacf.org thinker.com www.orangeenergysolutions.com www.merchantcapitalsource.com dev-legionofhonor.famsf.org www.motorsportsmediagroup.com cccu.org www.thinker.org www.themodern.org www.manningleaver.com highlands.famsf.org www.president.utexas.edu radionetwork.wfmt.com presentation.fjorgedigital.com blogs.brit.org orangeenergysolutions.com 5769015641243648-fe4.pantheonsite.io brit.org www.cccu.org deyoung.famsf.org motorsportsmediagroup.com www.thinker.com www.bluecandlelight.org www.bodylogix.ca trustntm.com bodylogix.ca labragirlfilmproject.org www.wfmt.com test-radionetwork.wfmt.com alumni.techhub.osu.edu prsnapshot.com]

Any idea what's going wrong?

We've checked the CAA for all of the ones we are trying to put on the cert, and they're all fine from what we can tell.

This continues to fail even after retrying; we’ve been trying for about an hour now.

It looks like one of the domains in that list (www.merchantcapitalsource.com) has a CNAME to sitelockcdn.net.

dig +short www.merchantcapitalsource.com
35uee.sitelockcdn.net.
107.154.155.92

Chasing that CNAME and resolving CAA for sitelockcdn.net results in a SERVFAIL:

dig @8.8.8.8 -t CAA sitelockcdn.net

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @8.8.8.8 -t CAA sitelockcdn.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 36048
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;sitelockcdn.net.		IN	CAA

;; Query time: 795 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Sep 15 12:35:15 EDT 2017
;; MSG SIZE  rcvd: 44

This CNAME-chasing behaviour is unfortunate, but it is required if you follow the letter of the CAA RFC while ignoring recent errata, which is the position we find ourselves in now, based on the Baseline Requirements interpretation presented by the root programs.

Thanks for the explanation, would it be possible to get the original domain we requested back in the error message?

I opened letsencrypt/boulder issue #3094, "Include original domain in `treeClimbingLookupCAAWithCount` errors", for this.

If we follow the current RFC 6844, would we not be trying to get CAA records for the following:

  1. X.Y.Z
  2. Alias (X.Y.Z)
  3. Y.Z
  4. Alias (Y.Z)
  5. Z
  6. Alias (Z)

For this particular case, we would try to get CAA records for:

  1. www.merchantcapitalsource.com
  2. Alias (www.merchantcapitalsource.com) => 35uee.sitelockcdn.net
  3. merchantcapitalsource.com
  4. Alias (merchantcapitalsource.com) => not an alias
  5. com
  6. Alias (com) => not an alias

Note that unlike “www.merchantcapitalsource.com”, “merchantcapitalsource.com” is not an alias. In other words, we should not have to look up CAA records for “sitelockcdn.net”, which is the lookup that results in a SERVFAIL.

Therefore one should be able to complete all six steps with no error. The following are the relevant dig commands for the six steps:

Step 1
dig @8.8.8.8 www.merchantcapitalsource.com caa

; <<>> DiG 9.10.4-P8 <<>> @8.8.8.8 www.merchantcapitalsource.com caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43355
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.merchantcapitalsource.com. IN CAA

;; ANSWER SECTION:
www.merchantcapitalsource.com. 3599 IN CNAME 35uee.sitelockcdn.net.

Step 2
dig @8.8.8.8 35uee.sitelockcdn.net. caa

; <<>> DiG 9.10.4-P8 <<>> @8.8.8.8 35uee.sitelockcdn.net. caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63359
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;35uee.sitelockcdn.net. IN CAA

Step 3
dig @8.8.8.8 merchantcapitalsource.com caa

; <<>> DiG 9.10.4-P8 <<>> @8.8.8.8 merchantcapitalsource.com caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20824
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;merchantcapitalsource.com. IN CAA

Step 4
Not applicable as merchantcapitalsource.com is not an alias.

Step 5
dig @8.8.8.8 com caa

; <<>> DiG 9.10.4-P8 <<>> @8.8.8.8 com caa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20644
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;com. IN CAA

Step 6
Not applicable as com is not an alias.

Your comments on the above interpretation of the RFC would be appreciated.

Your reading of the RFC describes the tree climb as I understand erratum 5065 to specify it, not as the legacy interpretation of the base RFC specifies it.

RFC6844 specifies both an algorithm, and an example application of that algorithm. Unfortunately, the example doesn't match the algorithm. It looks like this is taken from the example section of RFC6844:

That's a reasonable interpretation, and is in fact what erratum 5065 specifies. Unfortunately, the legacy implementation has to additionally look up Parent(Alias(X.Y.Z)), Parent(Parent(Alias(X.Y.Z))), and so on. That's how we get to sitelockcdn.net.
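The difference between the two readings can be sketched as follows. This is a minimal illustration in Python, not Boulder's implementation: the `alias` dictionary stands in for real CNAME/DNAME lookups, the function names are made up for this example, every CAA record set is assumed to be empty (so the climb never stops early), names are written without a trailing dot, and alias loops are not handled.

```python
def parent(name):
    """Return the name immediately above `name`, or None at a top-level domain."""
    _, _, rest = name.partition(".")
    return rest or None


def lookup_order_erratum_5065(name, alias):
    """Names queried for CAA when only CAA(A(X)) is checked at each level."""
    order = []
    while name is not None:
        order.append(name)
        target = alias.get(name)
        if target is not None:
            # Erratum 5065: look at the alias target itself, no further climb.
            order.append(target)
        name = parent(name)
    return order


def lookup_order_legacy(name, alias):
    """Names queried for CAA when R(A(X)) is evaluated recursively."""
    order = []
    while name is not None:
        order.append(name)
        target = alias.get(name)
        if target is not None:
            # Original RFC 6844 text: climb the alias target's own tree too.
            order.extend(lookup_order_legacy(target, alias))
        name = parent(name)
    return order
```

With `alias = {"www.merchantcapitalsource.com": "35uee.sitelockcdn.net"}`, the legacy order for www.merchantcapitalsource.com includes sitelockcdn.net (the name that returns SERVFAIL), while the erratum 5065 order goes straight from 35uee.sitelockcdn.net to merchantcapitalsource.com.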

After reading erratum 4515 (which is covered by erratum 5065), I see the discrepancy in the original RFC to which you were referring. I suppose either interpretation of the original RFC is somewhat valid, given the discrepancy and the subsequent correction.
