Thanks, both. I read the CAB requirements, and I am still unsure of where the boundary falls for "possibly affected certs" versus "definitely affected certs" for this incident, and I can appreciate erring on the side of caution for something as serious as CAA validation. A serious warning to all possibly affected subscribers is definitely in order either way, but since severe downtime will result for many affected subscribers, I would hope that the response is a bit more moderated, e.g. immediate revocation happens only for certs where an included domain's CAA history has changed and could indicate a compromise of validation integrity.
In my org's case, no revocation was necessary because all relevant domains have never had CAA records; I'm not sure how your logs could prove this, but if possible, that sort of criterion could be used to limit the revocations to a much smaller set of certificates. Could you describe how you generated the list mentioned above? I see that the list has annotations such as "missing CAA checking results for <domain>" but it only mentions that for one of the domains in each cert.
Edit: I understand that the above missing results may invalidate the whole cert due to BR 4.9.1.1 point 4 in the first list, just because it's questionable, i.e. you can't prove it was valid so it doesn't matter whether anyone else can prove it wasn't valid.
My apologies if this is a stupid question, and please know that I very much appreciate your replies so far.
This is actually a great question, and gets at the heart of how the BRs and CAA work. I've split it out into its own topic.
The BRs are process-oriented more than outcome-oriented, particularly in the CAA requirements. The process is "CAs must check CAA for all hostnames within 8 hours prior to certificate issuance." The outcome is "hostnames with CAA records forbidding issuance don't get certificates issued for them."
Even though the outcome is the same for your hosts (because there were never any CAA records), the problem is that we failed to properly follow the process. Per the BRs, that's grounds for revocation.
On a more practical level, because we didn't check CAA records at the proper time, we have no way of knowing for sure that they didn't exist at that time (as you mentioned).
I'll also note: If you've checked your certificates and found they don't need renewal, it's probably not because they lacked CAA records, but because they were renewed more recently than the issuance that was affected by the bug.
I forgot to answer this part: For each hostname on each certificate issued during the currently-valid window, we searched for records from our validation server saying "Checked CAA for " within the past 8 hours. If any of the names didn't show up as properly checked, we considered the whole certificate affected, and listed that name as an example. Some certificates might have only one affected name; others might have several. We didn't include the whole list of affected names because more names wouldn't change the status of the certificate.
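To make the procedure above concrete, here is a minimal sketch of that kind of log search. It is not Let's Encrypt's actual tooling: the record shapes (`certs`, `caa_checks`), the function name `find_affected`, and the sample hostnames and serials are all assumptions made for illustration. Only the 8-hour window and the "one missing name marks the whole certificate" rule come from the description above.

```python
from datetime import datetime, timedelta

CAA_WINDOW = timedelta(hours=8)  # window mentioned in the post above


def find_affected(certs, caa_checks):
    """Return {serial: example_name} for certificates with a name that
    has no logged CAA check in the 8 hours before issuance.

    certs: list of dicts with 'serial', 'issued_at', 'names' (hypothetical shape)
    caa_checks: dict mapping hostname -> list of check timestamps (hypothetical)
    """
    affected = {}
    for cert in certs:
        window_start = cert["issued_at"] - CAA_WINDOW
        for name in cert["names"]:
            checked = any(
                window_start <= t <= cert["issued_at"]
                for t in caa_checks.get(name, [])
            )
            if not checked:
                # One unchecked name is enough; record it as the example.
                affected[cert["serial"]] = name
                break
    return affected


# Made-up sample data, not real log entries.
issued = datetime(2020, 2, 1, 12, 0)
certs = [
    {"serial": "aa", "issued_at": issued, "names": ["a.example", "b.example"]},
    {"serial": "bb", "issued_at": issued, "names": ["c.example"]},
]
caa_checks = {
    "a.example": [issued - timedelta(hours=1)],
    "b.example": [issued - timedelta(days=10)],  # too old to count
    "c.example": [issued - timedelta(minutes=5)],
}
print(find_affected(certs, caa_checks))
```

Stopping at the first unchecked name mirrors why the published list shows only one name per certificate: further names would not change the certificate's status.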
Ok, but there is one thing I don't understand yet. Going by the description given at Bugzilla:
Our CA software, Boulder, checks for CAA records at the same time it validates a subscriber’s control of a domain name. Most subscribers issue a certificate immediately after domain control validation, but we consider a validation good for 30 days. That means in some cases we need to check CAA records a second time, just before issuance. Specifically, we have to check CAA within 8 hours prior to issuance (per BRs §3.2.2.8), so any domain name that was validated more than 8 hours ago requires rechecking.
The bug: when a certificate request contained N domain names that needed CAA rechecking, Boulder would pick one domain name and check it N times. What this means in practice is that if a subscriber validated a domain name at time X, and the CAA records for that domain at time X allowed Let’s Encrypt issuance, that subscriber would be able to issue a certificate containing that domain name until X+30 days, even if someone later installed CAA records on that domain name that prohibit issuance by Let’s Encrypt.
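For readers wondering how software can end up checking one name N times, here is a small analogy. It is not Boulder's code (Boulder is written in Go, and this is Python); it only shows one common way such a symptom can arise, where deferred work items all refer to the same loop variable. The function names `recheck_buggy`, `recheck_fixed`, and `names_seen_by` are hypothetical.

```python
def recheck_buggy(names, check_caa):
    """Queue one deferred CAA check per name, then run them all."""
    pending = []
    for name in names:
        # Every lambda refers to the same variable `name`, which holds
        # the last element by the time the checks actually run.
        pending.append(lambda: check_caa(name))
    return [task() for task in pending]


def recheck_fixed(names, check_caa):
    """Same as above, but each deferred check keeps its own name."""
    pending = []
    for name in names:
        # The default argument is evaluated now, binding the current value.
        pending.append(lambda n=name: check_caa(n))
    return [task() for task in pending]


def names_seen_by(recheck):
    """Record which names the checker was actually asked about."""
    seen = []
    recheck(["a.example", "b.example", "c.example"], seen.append)
    return seen


print(names_seen_by(recheck_buggy))  # one name, repeated three times
print(names_seen_by(recheck_fixed))  # each name once
```

In the buggy variant the right number of checks is performed, so the mistake is easy to overlook: the count looks correct, but the other names are never examined.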
This seems to say that the CAA is always checked at the same time as the challenge (DNS, www, whatever), and that it is checked correctly at that time. So if the certificate is issued right after that, the BRs should be met.
Our automation does immediately get the certificate. I don't think it even has any other mode of operation. Yet several of our certificates were included in the bad list with the annotation "missing CAA checking results for ...". One of those serial numbers is 03a00c2c5832cbf394e623d900f555844d5b . Those facts seem inconsistent, so I'm thinking I'm still not fully understanding something.
No, it's not. If the challenge result is cached for 30 days (without a new check) and a user creates a new certificate (for example, ten days later), CAA must be rechecked.
That required recheck didn't work correctly -> that's the bug.
If a user creates new certificates 60 days after the old certificates, no cached result is used -> no problem.
The bug is triggered if users create certificates with a subset of domain names some days after creating the first certificate.
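The three cases above can be sketched as a small decision function. This is only an illustration of the timing rules quoted in this thread (validations reused for 30 days, CAA checked within 8 hours before issuance); the function name `issuance_plan` and its return labels are made up here and are not Boulder's terminology.

```python
from datetime import datetime, timedelta

VALIDATION_LIFETIME = timedelta(days=30)  # reuse period quoted from Bugzilla
CAA_MAX_AGE = timedelta(hours=8)          # limit quoted from Bugzilla


def issuance_plan(validated_at, issue_at):
    """Classify what must happen for one name before issuance."""
    age = issue_at - validated_at
    if age > VALIDATION_LIFETIME:
        # Cached validation expired: a fresh challenge is needed, and
        # CAA is checked together with it.
        return "revalidate"
    if age > CAA_MAX_AGE:
        # Validation is reused, but CAA must be checked again.
        # This is the path the bug affected.
        return "recheck-caa"
    # Validation and its CAA check are both recent enough.
    return "reuse"


validated = datetime(2020, 1, 1, 9, 0)
print(issuance_plan(validated, validated + timedelta(minutes=5)))
print(issuance_plan(validated, validated + timedelta(days=10)))
print(issuance_plan(validated, validated + timedelta(days=60)))
```

Only the middle case depends on the separate recheck, which is why immediate issuance and issuance after 60 days were both unaffected.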
@mnordhoff Indeed, we were fine-tuning our setup while switching to Let’s Encrypt. So indeed we got new certificates a few times for the same domains. But am I to understand that for those cases, the CAA records were not checked again, even if the DNS challenge was done again? That seems to contradict the text I quoted.
As you can see, we have already replaced the certificates, so it’s not about that. It is just that I think I see a contradiction, which may mean that I’m missing something, or that the description is imperfect, or possibly that there is still some bug somewhere (whether on the over-cautious side, or on the insecure side). I’m trying to find out which.
I don’t think you’re missing something. Either the DNS challenge wasn’t actually done again for that one certificate, or something weird and unexplained is happening.
One possibility is that if you use Certbot, older versions go through the motions of validating even when they don’t have to. It would look like you’re validating again but Let’s Encrypt would not actually do so. You have to check the logs to be certain.
@mnordhoff We’re using lego (Let’s Encrypt in Go), so it’s probably not that. In any case, you have flattered me by saying that I’m not missing something, so I’m happy. I also feel satisfied that I pointed out something unexplained, which Let’s Encrypt could possibly look into, if it seems suspicious enough.
So the renew-by-default bug did increase the number of certificates affected, but some websites were not affected by the revocation, because newer certificates could have been created since the discovery and fix of the CAA problem.