I agree that we should have informed the community sooner, and I apologize for the difficulty this incident has caused you. The time since initial discovery was spent building as accurate a list of affected certificates as we could given the time constraints, not deciding whether we should build tooling to reduce the size of the affected subscriber list. That decision period was relatively short and ran in parallel with other blocking tasks. Once the list was compiled, we began the public notification process.
This is kind of odd - technically I'm affected, though I'm not.
I downloaded the list and found one affected certificate, but this is the really strange thing - sure, this is one of the certificates issued to me and it is theoretically still valid, but the stated time of the failure, meaning
> missing CAA checking results for <domainname> at <date and time> +0000 UTC
is basically the same time I last renewed the certificate for these domains (I might be slightly off because of a time zone difference). The affected certificate is the one I replaced at the mentioned time, and as far as I know the script I am using (acme_tiny.py) doesn't work in a way that causes the error.
So I would assume I'm in the list because my renewal somehow triggered the bug, but why would that cause my old certificate to be flagged?
We have 20k+ domains spread over multiple certs, and with the ongoing Network Solutions/Web.com throttling issue (see "DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers") we're going to have a lot of unprotected sites. The turnaround time is not sufficient. We received the email 2 hours ago.
@eldoran I'm not exactly sure I understand what you're saying correctly, but yes, it's possible that you have a newer certificate that wasn't affected by the bug. The only way to know which certs were affected is by using the checking tool, or by downloading the list linked in the first post in this thread and comparing serial numbers.
If you're wondering about the timestamp in the log message, it is expected to be the same as when your cert was issued, which is when CAA checking should have been done (but wasn't).
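For anyone who wants to script the serial comparison described above, here is a minimal Python sketch. It assumes that each line of the downloaded list contains the certificate serial as a long run of hex digits; the sample line in the test data and the function names are illustrative assumptions, not the confirmed file format. A certificate's serial can be printed with `openssl x509 -in cert.pem -noout -serial`, which produces output of the form `serial=0FD0...`.

```python
import re

# Assumption: serials appear in the list as runs of 16 or more hex digits.
HEX_TOKEN = re.compile(r"[0-9A-Fa-f]{16,}")


def normalize_serial(text):
    """Reduce 'serial=0FD0...' or '0f:d0:...' to bare lowercase hex."""
    value = text.strip()
    if "=" in value:
        value = value.split("=", 1)[1]
    value = value.replace(":", "").replace(" ", "").lower()
    # Drop leading zeros so differently padded forms compare equal.
    return value.lstrip("0")


def affected_serials(list_lines, my_serials):
    """Return the entries of my_serials that appear in list_lines.

    list_lines can be any iterable of strings, for example an open file,
    so a large list is streamed rather than loaded into memory.
    """
    wanted = {normalize_serial(s): s for s in my_serials}
    found = set()
    for line in list_lines:
        for token in HEX_TOKEN.findall(line):
            serial = normalize_serial(token)
            if serial in wanted:
                found.add(wanted[serial])
    return found
```

To run it against the real download, open the file and pass the handle directly, for example `with open(path) as fh: affected_serials(fh, my_serials)`, where `path` is wherever you saved the list.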
@prashantrajan I understand the pain that this issue has been causing and that the timing of the notification doesn't give you much time to react. Unfortunately we are required by the Baseline Requirements to revoke the affected certificates within this deadline. We worked over the weekend to compile the list of affected certificates and sent the notifications as soon as we were ready to. We regret that sending all of the emails took so long, and we're looking for ways to make that faster in the future.
If anyone does not have access to the serial numbers of their certs but has the domains, this PHP script will cross-reference the serial dump by domain. Grepping 1.3GB thousands of times is not the fastest approach, but it was quick to throw together and let me identify a few of our certs to re-order.
<?php
$domain_file = '/home/dave/potential_domains.csv';
$cert_issues = '/home/dave/Downloads/caa-rechecking-incident-affected-serials.txt';
$match_dump_file = '/home/dave/affected_domain_match.csv';
$counter = 0;
$match_domains = array();
# Grab the seed domain names
if (($handle = fopen($domain_file, "r")) !== FALSE) {
    while (($data = fgetcsv($handle)) !== FALSE) {
        # Clean up the domain name to grep the other file
        $domain_name = trim((string) $data[0]);
        # Skip blank rows, which would otherwise match every line
        if ($domain_name === '') {
            continue;
        }
        $counter++;
        # Initialise the match state
        $status = 'NOT_MATCHED';
        # Create the command line to grep the file. -F matches the domain
        # literally (so "." is not a wildcard) and escapeshellarg() quotes
        # both arguments safely. This is still a substring match.
        $command_line = 'grep -F -- ' . escapeshellarg($domain_name) . ' ' . escapeshellarg($cert_issues);
        # exec() returns the last line of output, which is all we need for a match
        $last_line = (string) exec($command_line);
        if (trim($last_line) != '') {
            $status = 'MATCHED';
        }
        # Dump the status
        echo $counter . " :: " . $status . " :: " . $command_line . " --> [" . $last_line . "]\n";
        # Remember the matched domains
        if ($status == 'MATCHED') {
            $match_domains[] = $domain_name;
        }
    }
    fclose($handle);
}
echo "\n\n Dumping Matched \n\n";
var_export($match_domains);
echo "\n\n DONE \n\n";
# Write one matched domain per line
if (($fp = fopen($match_dump_file, 'w')) !== FALSE) {
    foreach ($match_domains as $domain) {
        fputs($fp, $domain . PHP_EOL);
    }
    fclose($fp);
}
?>
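Since the script above runs one grep over the full list per domain, a single-pass variant may be faster for long domain lists. The following Python sketch is one possible approach, not a drop-in replacement: it assumes each line of the list contains the phrase "results for <domainname> at" as in the log message quoted earlier in this thread, and the function name is made up for illustration. Unlike the grep version, it matches whole domain names only, not substrings.

```python
import re

# Assumed line shape, taken from the message quoted earlier in the thread:
# "... missing CAA checking results for <domainname> at <date and time> +0000 UTC"
DOMAIN_IN_LINE = re.compile(r"results for (\S+) at ")


def match_domains(list_lines, domains):
    """Return the sorted domains that appear in list_lines.

    The domains are held in a set, so the (large) list is read only once
    and each line costs a single regex search plus a set lookup.
    """
    wanted = {d.strip().lower() for d in domains if d.strip()}
    matched = set()
    for line in list_lines:
        hit = DOMAIN_IN_LINE.search(line)
        if hit and hit.group(1).lower() in wanted:
            matched.add(hit.group(1).lower())
    return sorted(matched)
```

Passing an open file handle as `list_lines` streams the file line by line, so the 1.3GB list never has to fit in memory.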
Thanks @instiller, that script is much appreciated.
As we have a large number of customers, I had no other option but to parse the files and build a lookup tool. If anyone finds it useful, it can be found here: https://www.certic.info/tools-letsencryptrenewcheck.php
Unfortunately, it was obvious during the outage late on the night of February 29th that this was about to happen. I asked for more information, but my request was completely ignored.
Now we are facing short notice. Let's Encrypt is a serious project and probably one of the best ever, but it really needs to come up with better support on public networks.
https://twitter.com/cs_networks/status/1233704143224791042
Totally ignored, yet it was clear this was likely to happen. Now we are facing a few hours' notice, which does the public no good. Let me know if I can be of any help, but PR really needs to get a bit better on this.
@yuriks I think you and the letsencrypt team should stop trying to explain why you wasted away the time figuring out which certificates were affected and blaming your late notification on the Baseline Requirements. If you had five days of notice period, you should have informed the community immediately, not after you had compiled a list of affected certificates. letsencrypt - you provide a great service for the web community, but just take it on board that you've handled this issue terribly - you need to stop trying to make excuses for it, just apologise, accept our feedback and move on without offering up excuses.
Maybe they could have written a message like "you may need to renew your certificate, we don't know yet", but it might have had harmful consequences:
- Too many people trying to renew without need, which could have caused an outage
- Too many people on the forum asking for details, that they couldn't give yet, diverting their attention from more urgent things
Unfortunately at the time we weren't sure of the scale of the impact and so wouldn't have been able to give people useful guidance yet. At the time we were focusing on patching the bug and then posted an explanation of the issue at 2020.02.29 CAA Rechecking Bug.
Thanks for posting your checking tool. I have some questions about it that I'm going to send in a private message so as not to clutter the thread.
The system should be built so that it can handle all of the certificates being requested simultaneously: regardless of the likely random timing of requests in normal operation, it is already possible that a large percentage of all certificates could be requested for renewal at the same time.
With respect to your second point, that is why clear, concise communication is more effective and important than verbose explanations and excuses. A perfect example is that the email mentioned a date without a time and/or a timezone; that suggests that LE's communication is a last-minute thought and LE has clearly underestimated how much time people needed to renew their certificates across their many distributions and personal scenarios.
Again, it's not worth making excuses for this - it's worth finding the reasons that the correct procedures and processes were not in place for such an event as this, which is why I asked whether the governance documents are available for LE so that the community can perhaps help contribute to better disaster resolution processes and procedures and quite frankly, if LE lacks the number of people that are required to handle such an event, they need to ask the community for more help. LE doesn't just need technical staff that are capable of handling the bug, like in this circumstance; they need the right people in the organisation to help ensure that events such as these are planned for, well thought out and tested in advance for robustness.
We did not receive any email from LetsEncrypt and found out about this on ArsTechnica. 12am UTC deadline is absolutely unreasonable.
I think it's very easy, especially as engineers, to respond in such a way: promoting better practices, showing the golden path for a new problem. The fact is that we are here, and all of the shoulda, coulda, woulda... are not helpful in a thread like this. The aim is to help people solve the problem, so that when folks discover they are running hundreds of affected domains, they can fix it and not get bogged down in posts about what could have been. Open a new thread and link it to this one. I'm not a member of the Let's Encrypt team in any way, shape or form, but helping the community is a better use of time than debating past decisions.
In addition to this, I used https://github.com/hannob/lecaa earlier today. I have about 140 domains with LE, and it was a great help. Hopefully it's correct.
Do we know when the certs will actually be revoked? I was told 3/4/2020 18:00 UTC by my CDN provider, but I don't see the same confirmation from LE. Does anyone have any details they can share?
I'm not talking about past decisions - I am talking about how LE are handling this event and providing feedback on the answers they are providing in this thread. My comments about how they can be better prepared in the future are not about a "past decision" - they are about the decisions they are making right now in this thread, and my comments in this thread have already led to a clarification about the timezone as being UTC. We are not getting bogged down in posts about what could or should have been, and even if you feel that your criticism is appropriate, it is equally as "off point" as mine would be.
The community should feel welcome to comment about whatever they wish - if you feel like creating a new thread that should be focussed on specific technical fixes, go right ahead and do that.
@jxman they have mentioned in the edits at the top of this thread that they do not have a locked down specific time, but that they suggest that you should consider your certificates as having been revoked as of 2020-03-04T00:00Z (midnight at the start of the 4th March UTC)
We have not started revocations but stated that 00:00 UTC on 04 March 2020 would be the earliest we would start that process. When we begin the revocations, we will post an update here.
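Because the deadline is stated in UTC, it may help to convert it before planning renewals. Below is a small Python sketch of that arithmetic; the fixed UTC offsets are illustrative assumptions (substitute your own), and the helper names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Earliest revocation start as stated in this thread: 00:00 UTC on 04 March 2020.
REVOCATION_START_UTC = datetime(2020, 3, 4, 0, 0, tzinfo=timezone.utc)


def in_offset(moment, hours):
    """Express an aware datetime at a fixed offset of `hours` from UTC."""
    return moment.astimezone(timezone(timedelta(hours=hours)))


def time_remaining(now, deadline=REVOCATION_START_UTC):
    """Return the time left until the deadline (negative once it has passed)."""
    return deadline - now


if __name__ == "__main__":
    # Example offsets only: UTC-8 and UTC+1.
    for hours in (-8, 1):
        print(in_offset(REVOCATION_START_UTC, hours).isoformat())
```

At UTC-8, for instance, midnight UTC on 4 March falls at 16:00 on 3 March local time, so the working window is shorter than the bare date suggests.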
A post was split to a new topic: Replacing certificates with acme.sh