Certificate renewal fails: urn:ietf:params:acme:error:caa 403

The issue found by @rg305 was sadly false. It was just the way I had copied the names over from the existing cert and incorrectly added a www. prefix to the CN and wildcard domain. It's not in the cert, and I posted the actual SAN list ... which for some reason cannot be copied as a block out of Firefox, hence recreating it manually, hence the cock-up in transcription. Oh that life could be so easy :slight_smile:

This is true for the first issue.

But the second:

And third:

And fourth:
sprakekingsleyllp.co.uk | DNSViz

have not been addressed.

So you see issues in the signing of some zones? That's odd. I'd best check them all. Thanks for the lead :slight_smile:

Interesting that DNSViz thinks there's an RRSIG issue...

chloefox.co.uk

chloefox.me.uk

chloe-fox.co.uk

chloe-fox.me.uk

So VeriSign say there's no issue with the zone signing, but then look at that 2x no NS response. Is this the issue ... the missing NS response leading to both the LE renewal failure and the DNSViz results, given that VeriSign says the sigs are in fact good? Back to Mike's thoughts, or areas around no response?

Just to throw something else in...

I tried a few single domain renewals from the same machine, so just the unqualified name + www ... all worked without issue. I tried another principal + 5 domain alias 'subscription' ... so equal in size to the ChloeFox sub ... and that one worked, while the ChloeFox one does not, nor the larger SprakeKingsley one.

Was working on 'proving' the longer SAN list provoking the issue theory, but the alternative 6 name renewal working blew that idea :face_with_symbols_over_mouth:

DNSViz reporting all domains as being okay today, as VeriSign were doing last night, but LE cert rotation still fails with the same error, but then...

In Plesk, deselect all cert options except the unqualified principal domain name, i.e. the CN. No www name. No wildcard. No additional domains, just the one single unqualified name, run the update and it works. But it doesn't end there. Look at what was issued...

Of course this leaves the issue unresolved. Just what is going on and why these failures? I'll have to ask Plesk why I get a 'full set' renewal having selected only the single unqualified name, but the fact the cert was issued says DNS must be okay.

This post gets to an interesting conclusion. I hope it’s worth the read :slightly_smiling_face:

On the sprakekingsleyllp.co.uk rotation failure, I went through and set CAA on every name and now the rotation succeeds.

Previously there were no CAA records, and the block wasn't reported as the CAA not permitting issuance. I did see that error after making a mistake on one CAA record, but after correcting it, it all went through. So, is this a lazy DNS not responding when there is nothing to serve? Seems unlikely, as other cert rotations use the same DNS, also have no CAA RR, and do not suffer this issue. DNSViz was showing RRSIG failures last night, but I’ve never had an issue with zone signing, and at the same time VeriSign was showing all well. DNS latency issue?

So, my 'issue' is I haven't really identified why it worked and how I solved it. And that means it will be back and as a community we've not really learnt anything, and I've seen others reporting the identical issue. It would be good to get to the bottom of this.

There may be a Plesk dimension to this. In both successes I deselected all renewal options including any ‘alias domains’, deselected wildcard and www prefix, and opted to renew only the unqualified CN. These went through, but with the complete name list, i.e. with principal name wildcard and all domain alias unqualified and www prefix names.

I’ve just retried sprakekingsleyllp.co.uk leaving all the options enabled and saw two new things…

  1. LE asked for the _acme-challenge value to be updated … not unusual, but this time to the exact-same string already set … and indeed not updating and continuing then worked!
  2. Then LE threw a new error: ‘Could not obtain a replay nonce: Server error: HEAD https://acme-v02.api.letsencrypt.org/acme/new-nonce resulted in a 503 Service Temporarily Unavailable response’

Not seen either of those before!

Anyway, attempting a rotation with all the Plesk cert options enabled got me back to the old ‘Type: urn:ietf:params:acme:error:caa … Error finalizing order :: Rechecking CAA’ error. But this does reveal something…

  1. Yes, we have Plesk asking for ‘the full load’ even when only the unqualified CN is requested, but the full load does then work, i.e. there is no issue with CAA records, RRSIGs or anything else. Go check out the A+ rating at SSL Server Test: sprakekingsleyllp.co.uk (Powered by Qualys SSL Labs)
  2. But then in actually selecting options for the ‘full load’ we get: ‘Detail: Error finalizing order :: Rechecking CAA for "sprakekingsleyltd.co.uk" and 11 more identifiers failed. Refer to sub-problems for more information’

This is significant as it tells us there is misreporting/misdirection by LE. Point 1 tells us unambiguously that there are not issues with CAA, RRSIG or anything else. The checks are good and the cert can be issued. Why then in 2 is LE telling us the issue is in ‘Rechecking CAA’? The success in 1 proves there is no issue with the CAA, so why is LE reporting this as the issue? Whatever the issue, and whatever the difference in what Plesk is requesting between 1 and 2 might be, LE is misdirecting in its report about what is blocking cert issuance.

I will see if I can learn more from the Plesk dimension, but I think we now know there is an LE issue: it falsely points to CAA checks when the CAA RR are proven to be okay by 1.

The "full load" only involved chloefox and chloe-fox domain names. So, the second test failing for sprakekingsleyltd is a different cert request with different names. The second failure proves nothing about the first request.

Sure others have seen this same error message. But, the underlying reasons may be different. LE issues like 3 million certs / day. Your erratic DNS issues are more likely caused by your DNS setup / servers rather than a fundamental problem at LE. That is just playing the odds :slight_smile:

No sorry @MikeMcQ, you have it entirely wrong.

Yes, there are two certs and two renewals here, both displaying the same 'CAA recheck' problem while other cert renewals go through without issue.

The ‘it fixed it, but why?’ solution to both renewals was to deselect all the name options in Plesk except the unqualified CN. This isn't an ‘always works’ cure, but it is what I did during testing that resulted in both renewals going through with the full load of names despite the selected option. So, no, Mike, I am not pointing to results on the ChloeFox renewal and applying them to the sprakekingsleyllp.co.uk renewal.

Just to avoid any misunderstanding… for both certs I found that after deselecting all name options except the principal domain unqualified name (the CN), the certs refreshed with the full load of names despite the name selection. This was true with both certs. Then with both certs, retrying the renewal but selecting the full load of names, both failed with the CAA error.

Hence what I said was correct. The initial success including all names proves there are no CAA issues, as both renewals went through. However, then retrying but actually selecting the full load of names fails, giving this CAA error … yet the prior success confirms the CAA to be good. I meant what I said because it is accurate.

Doing yet more test renewals…

Starting point:

  • Both the ChloeFox and SK certs rotated successfully this morning after deselecting all name options except the base CN. These certs include the full name list
  • Both ‘full load’ certs were issued even though the renewals were requested as only the unqualified CN. What is Plesk doing here??

Now, running some back-to-back rotations:

On the sprakekingsleyltd.co.uk cert:

  1. SK full load names selected > Detail: Error finalizing order :: While processing CAA for www.sprakekingsley.biz: DNS problem: query timed out looking up CAA for www.sprakekingsley.biz
  2. Repeat SK all options > Detail: Error finalizing order :: Rechecking CAA for "www.sprakekingsley.org" and 1 more identifiers failed. Refer to sub-problems for more information (the usual!)
  3. SK unqual. CN only > Success. The full load of names are included despite the selection option. No CAA or other issue. This shows the CAA must be good
  4. SK all options > Asked me to update the _acme-challenge to the existing value. That’s twice now. Failed: Detail: Error finalizing order :: While processing CAA for sprakekingsleyltd.uk: DNS problem: query timed out looking up CAA for sprakekingsleyltd.uk

The time-out in 1 is interesting. Is this the cause? Are CAA checks failing due to timeouts but being incorrectly flagged in the usual ‘Rechecking CAA’ error message??

But note … request a full load, get a ‘CAA recheck’ error, select CN only, get a full load cert, re-request full load, get a ‘CAA recheck’ error. But clearly the CAA must be okay as certs are issued including all names.

Now moving on to the ChloeFox cert:

  1. ChloeFox all options > Asked me to update the _acme-challenge to the existing value. That’s a third time now. > Failed: Detail: Error finalizing order :: Rechecking CAA for "chloe-fox.org.uk" and 10 more identifiers failed. Refer to sub-problems for more information

  2. ChloeFox unqual. CN only >

  3. Attempt 1: Detail: Error finalizing order :: Rechecking CAA for "chloe-fox.org.uk" and 10 more identifiers failed. Refer to sub-problems for more information – Note "chloe-fox.org.uk" was not requested in the Plesk UI

  4. Attempt 2: Detail: Error finalizing order :: Rechecking CAA for "chloe-fox.org.uk" and 10 more identifiers failed. Refer to sub-problems for more information … but issued the cert in only the unqualified CN … so says it fails, actually worked, and actually resulted in only the single unqual name as requested

  5. Attempt 3: Reports success. Single unqualified CN only

  6. ChloeFox unqual + www principal names only > Success as requested only. Not full load

  7. ChloeFox principal wildcard > Success as requested only. Not full load … but asks for the _acme-challenge to be updated to the existing value. 4th occasion for this now.

  8. ChloeFox request full load > asks for the _acme-challenge to be updated to the existing value. 5th occasion > Detail: Error finalizing order :: Rechecking CAA for "chloe-fox.me.uk" and 11 more identifiers failed. Refer to sub-problems for more information

  9. ChloeFox full load > Detail: Error finalizing order :: Rechecking CAA for www.chloefox.co.uk and 11 more identifiers failed. Refer to sub-problems for more information

  10. ChloeFox full load minus chloefox.co.uk (reported problem) > Detail: Error finalizing order :: While processing CAA for chloefox.co.uk: DNS problem: SERVFAIL looking up CAA for chloefox.co.uk - the domain's nameservers may be malfunctioning … note inclusion of the chloefox.co.uk even though de-selected

  11. ChloeFox full load - chloefox.co.uk (reported problem) > Detail: Error finalizing order :: Rechecking CAA for www.chloefox.co.uk and 11 more identifiers failed. Refer to sub-problems for more information … again, note “chloefox.co.uk” was excluded from the cert request

  12. ChloeFox principal wildcard > Detail: Error finalizing order :: Rechecking CAA for "chloe-fox.org.uk" and 11 more identifiers failed. Refer to sub-problems for more information … note chloe-fox.org.uk was again excluded from the request

  13. ChloeFox full load > Detail: Error finalizing order :: Rechecking CAA for www.chloefox.co.uk and 11 more identifiers failed. Refer to sub-problems for more information

So having got a ‘full load’ renewal this morning, I can no longer repeat and so the best I’m doing right now is principal domain with wildcard, but no alias names. Makes the point this ‘deselect all but the CN’ solution is far from a sure-fire solution. But what we have learnt is…

  1. LE will report DNS timeouts – “DNS problem: query timed out looking up CAA…”
  2. LE will report DNS SERVFAILs – “DNS problem: SERVFAIL looking up CAA…”
  3. But LE will also ‘misreport’ “Error finalizing order :: Rechecking CAA” when the CAA are in fact good … but presumably this isn’t a timeout or SERVFAIL, or at least not reported as such
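To make that three-way split easier to tally across many attempts, here is a minimal sketch that sorts a ‘Detail:’ string into timeout, SERVFAIL or the generic recheck message. The patterns are copied from the error text quoted in this thread only; the function name `classify_caa_error` and the returned dictionary layout are made up for illustration, and other wordings that LE may emit are not covered.

```python
import re

# Patterns taken from the error details quoted in this thread (not an
# exhaustive list of what the CA can return).
_TIMEOUT = re.compile(r"query timed out looking up CAA for (?P<name>\S+)")
_SERVFAIL = re.compile(r"SERVFAIL looking up CAA for (?P<name>\S+)")
_RECHECK = re.compile(
    r'Rechecking CAA for "?(?P<name>[^"\s]+)"? and (?P<more>\d+) more identifiers failed'
)


def classify_caa_error(detail: str) -> dict:
    """Sort one 'Detail:' string into timeout, servfail, recheck or other.

    Hypothetical helper: 'kind' is the category, 'name' is the identifier
    named in the message, 'more' is the count of further identifiers that
    the generic recheck message mentions (0 for the specific DNS errors).
    """
    for kind, pattern in (("timeout", _TIMEOUT), ("servfail", _SERVFAIL)):
        match = pattern.search(detail)
        if match:
            # Drop trailing punctuation that may follow the name.
            return {"kind": kind, "name": match.group("name").rstrip(".,:"), "more": 0}
    match = _RECHECK.search(detail)
    if match:
        return {
            "kind": "recheck",
            "name": match.group("name"),
            "more": int(match.group("more")),
        }
    return {"kind": "other", "name": None, "more": 0}
```

Feeding each failed attempt's detail line through this and counting the `kind` values would show how often the specific DNS causes appear compared with the generic message. It cannot reveal what is behind a generic ‘recheck’ failure; that detail lives in the sub-problems of the ACME response, which the Plesk UI output quoted here does not show.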

I was hoping for a ‘fails this way, works this way’ result. In fact the ChloeFox tests show how flaky the ‘works sometimes’ solution is. Clearly Plesk is doing something different, and why does it result in full load renewals when only the CN has been requested? It seems to be remembering what was done before and ending up one request behind. Odd.

Maybe the timeouts and SERVFAIL are the root of my issues, although these two certs seem to attract this issue while others on the same machine do not, which doesn’t seem to make sense. And if timeout and SERVFAIL are the issue, it’s sad LE isn’t reporting these (as it can) and so pointing people towards a solution. I still think there’s a lesson here.

This has to be DNS playing in here but not being reported by LE.

Have just had another round of trying with various 'CAA rechecking' errors on domains excluded from the renewal in the Plesk UI, and then just keep trying and it works. See it here, 'full load' names - SSL Server Test: chloefox.org.uk (Powered by Qualys SSL Labs)

This has got to be DNS timeouts and/or SERVFAILs, but it would help greatly if LE could report these or even include a hint in the error report when it points people at bad CAA. I am guessing a CAA lookup is failing. Sometimes this can be reported as such, but it would seem too often it's reported as a 'CAA Recheck' which is the point of the failure, not the reason for the failure.

I guess that's the takeaway ... LE is reporting what stage it got to, not the cause of the failure. Time to probe DNS I think, as well as find out what Plesk is doing, as I'm quite sure LE wouldn't return a full load cert unless this is what is being asked for, contrary to the UI.

At least with DNS query timeouts that is true I believe. There is some cumulative timeout allowed and the failing one might just be where it reached the limit.

For SERVFAIL I don't think that is true. That is just an outright failure.
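The cumulative-timeout idea can be pictured with a toy model. This is only a sketch of the reasoning above, not a description of how Let's Encrypt actually schedules its lookups: the function name `first_name_over_budget`, the sequential ordering, the durations and the budget are all invented for illustration.

```python
from typing import List, Optional, Tuple


def first_name_over_budget(
    lookups: List[Tuple[str, float]], budget_seconds: float
) -> Optional[str]:
    """Return the name being looked up when a shared time budget runs out.

    Toy model only: lookups are assumed to run one after another and to
    share a single budget. Returns None if everything fits in the budget.
    """
    elapsed = 0.0
    for name, seconds in lookups:
        elapsed += seconds
        if elapsed > budget_seconds:
            # This name gets the blame, even if its own lookup was fast.
            return name
    return None


# Hypothetical timings: two slow lookups followed by a fast one.
example = [("a.example", 2.0), ("b.example", 2.0), ("c.example", 0.5)]
blamed = first_name_over_budget(example, budget_seconds=4.2)
```

In this made-up run the name reported is the fastest of the three, simply because the budget ran out while it was being queried. If something like this is going on, the name in a timeout message may vary from attempt to attempt without that name's zone being the real culprit.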

Agree it can be hard to pinpoint DNS failures. Individual testing tools can't mimic the burst volumes of requests LE makes from various global points. And, it is hard to know what is happening with the queried DNS servers. Maybe at times they get too busy handling other requests so Let's Encrypt queries just push them over the edge so to speak.

This is why I suggested trying another DNS provider like Cloudflare. And, I suggested adding CAA records to reduce the number of queries. This might help reduce the load on the queried DNS servers.
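To show why adding CAA records can cut the number of lookups: RFC 8659 has the CA start at the requested name and climb towards the root until it finds a CAA record set, and for a wildcard it starts from the name with the leading `*.` removed. Below is a minimal sketch of that climb. The function name `caa_query_chain` and the `names_with_caa` parameter are hypothetical, CNAME handling is left out, and the real number of queries will also depend on resolver caching and on how the CA implements the check.

```python
from typing import FrozenSet, List


def caa_query_chain(
    identifier: str, names_with_caa: FrozenSet[str] = frozenset()
) -> List[str]:
    """List the names checked for CAA, climbing from the identifier upwards.

    The climb stops at the first name that publishes a CAA record set
    (given here by the caller in names_with_caa), or after the top-level
    label if none does. The DNS root itself is never queried.
    """
    name = identifier.lower().rstrip(".")
    if name.startswith("*."):
        # Wildcards are checked from the name without the "*." label.
        name = name[2:]
    labels = name.split(".")
    chain = []
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        chain.append(candidate)
        if candidate in names_with_caa:
            break
    return chain


# No CAA anywhere: every level up to the TLD is checked.
no_caa = caa_query_chain("www.chloefox.co.uk")

# CAA published on the name itself: the climb stops at once.
with_caa = caa_query_chain(
    "www.chloefox.co.uk", frozenset({"www.chloefox.co.uk"})
)
```

With no CAA records, each www name under a .co.uk domain means four names to check; with a CAA record on the name itself, it is one. Across a certificate with a dozen identifiers that difference adds up, which is the reasoning behind the suggestion above.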

PS: Sorry for mis-reading that recent post. I read too many posts too quickly.

CAA records added to both these bundles.

I'm wondering if the DNS is under DoS and/or a DoS mechanism might be triggering ... but then it's not consistent and the same bundle of names is going to generate the same volume of requests and it seems more transitory than that.

One piece that bothers me is that other Plesk 'subscriptions' do not suffer the same problem, and yet more names, more NS lookups, you'd think would result in a greater failure rate. Still not really worked out what is going on!

And there's this strange request to update the _acme-challenge to the same string as already being published?? :thinking:
