Revoking certain certificates on March 4

I agree that we should have informed the community sooner and I apologize to you for the difficulty this incident has caused you. The time since initial discovery was spent building as accurate a list of affected certificates as we could given the time constraints, not deciding whether we should build tooling to reduce the size of the affected subscriber list. That last decision period was relatively short, and was performed in parallel with other blocking tasks. Once the list was compiled we began the public notification process.

5 Likes

This is kind of odd - technically I’m affected, though I’m not.

I downloaded the list and found one affected certificate, but this is the really strange thing - sure this is one of the certificates issued to me and it is theoretically still valid, but the stated time of the failure, meaning
> missing CAA checking results for <domainname> at <date and time> +0000 UTC
is basically the same time I last renewed the certificate for these domains (I might be slightly off due to a different time zone). The affected certificate is the one I replaced at the mentioned time, and as far as I know the script I am using (acme_tiny.py) doesn’t work in a way that causes the error.

So I would assume I’m in the list because my renewal somehow triggered the bug, but why would that cause my old certificate to be flagged?

1 Like

We have 20k+ domains spread over multiple certs, and with the ongoing Network Solutions/Web.com throttling issue (see “DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers”) we’re going to have a lot of unprotected sites. The turnaround time is not sufficient. We received the email 2 hours ago.

2 Likes

@eldoran I’m not exactly sure if I understand what you’re saying correctly, but yes, it’s possible that you may have a newer certificate that wasn’t affected by the bug. The only way to know which certs were affected is by using the checking tool or downloading the list linked to in the first post in this thread and comparing their serial numbers.

If you’re wondering about the timestamp in the log message, this would be expected to be the same as when your cert was issued, which is when CAA checking should have been done (but wasn’t).
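
For anyone doing the serial comparison by hand, here is a minimal Python sketch of it. It assumes (this is not confirmed anywhere in this thread) that each line of the downloaded list contains the certificate serial as a whitespace-separated hex string. `normalize_serial` and `affected_serials` are hypothetical helper names, not part of any Let’s Encrypt tool.

```python
def normalize_serial(serial):
    """Lowercase a hex serial and drop separators such as ':' or spaces."""
    return "".join(ch for ch in serial.lower() if ch in "0123456789abcdef")


def affected_serials(my_serials, dump_lines):
    """Return the normalized serials from my_serials that occur in dump_lines.

    Assumption: the serial shows up as its own whitespace-separated field
    somewhere on the line. dump_lines can be any iterable of strings, so an
    open file handle works and the large file is streamed, not loaded whole.
    """
    wanted = {normalize_serial(s) for s in my_serials}
    hits = set()
    for line in dump_lines:
        for field in line.lower().split():
            if field in wanted:
                hits.add(field)
    return hits
```

Pass only the serial itself to `normalize_serial` (for example `03:AB:CD`), not a whole line of tool output, and pass an open file handle for the list as `dump_lines`.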

4 Likes

@prashantrajan I understand the pain that this issue has been causing and that the timing of the notification doesn’t give you much time to react. Unfortunately we are required by the Baseline Requirements to revoke the affected certificates within this deadline. We worked over the weekend to compile the list of affected certificates and sent the notifications as soon as we were ready to. We regret that sending all of the emails took so long, and we’re looking for ways to make that faster in the future.

3 Likes

If anyone does not have access to the serial numbers of their certs but has the domains, this PHP script will cross-reference the serial dump by domain. Grepping 1.3GB thousands of times is not the fastest approach, but it was quick to throw together and let me identify a few of our certs to re-order.

    <?php 

	$domain_file = '/home/dave/potential_domains.csv'; 
	$cert_issues = '/home/dave/Downloads/caa-rechecking-incident-affected-serials.txt'; 
	$match_dump_file = '/home/dave/affected_domain_match.csv';

	$counter = 0;
	$match_domains = array();

	# Read the seed domain names
	if (($handle = fopen($domain_file, "r")) !== FALSE) {
		while (($data = fgetcsv($handle)) !== FALSE) {
			# Clean up the domain name to grep the other file 
			$counter++;
			$domain_name = trim($data[0]);

			# initialise the match state 
			$status = 'NOT_MATCHED'; 

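			# Create the command line to grep the file; -F treats the domain as a
			# literal string and escapeshellarg() keeps the shell from parsing it
			$command_line = 'grep -F -- ' . escapeshellarg($domain_name) . ' ' . escapeshellarg($cert_issues);

			# exec() returns the last line of output, which is all we need for a match
			$buffer = exec($command_line);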
			if (trim($buffer) != '') {
				$status = 'MATCHED'; 
			}

			# Dump the status 
			echo $counter . " :: " . $status . " :: " .  $command_line .  " --> [" . $buffer .  "]\n";

			# Remember matched domains for the final dump
			if ($status == 'MATCHED') {
				$match_domains[] = $domain_name;
			}
		}

		fclose($handle);
	}

	echo "\n\n Dumping Matched \n\n";
	var_export($match_domains);
	echo "\n\n DONE \n\n";

	$fp = fopen($match_dump_file, 'w');

	# Write one matched domain per line
	foreach ($match_domains as $fields) {
		fputs($fp, $fields . "\n");
	}

	fclose($fp);


    ?>

6 Likes

Thanks @instiller, that script is much appreciated.
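
For larger domain lists, the per-domain grep in the script above rescans the whole dump once for every domain. A single pass over the file may be faster. The following Python sketch shows one way to do that. It assumes only that the affected hostnames appear as text on each line of the dump. `find_affected` is a hypothetical helper name. Unlike grep, it matches whole hostnames, so `example.com` will not match a line that only mentions `sub.example.com`.

```python
import re

# Runs of characters that can appear in a hostname ("*" covers wildcard names).
TOKEN = re.compile(r"[A-Za-z0-9*_.-]+")


def find_affected(domains, dump_lines):
    """Return the domains that appear as whole tokens in any dump line.

    domains: iterable of hostnames to look for.
    dump_lines: iterable of lines, e.g. an open file handle, read only once.
    """
    wanted = {d.strip().lower() for d in domains if d.strip()}
    found = set()
    for line in dump_lines:
        for token in TOKEN.findall(line.lower()):
            token = token.strip(".")  # drop stray leading/trailing dots
            if token in wanted:
                found.add(token)
    return found
```

Typical use would be to read the domain CSV into a list, open the dump file, and pass the file handle as `dump_lines`.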

3 Likes

As we host a large number of customers, I had no other way but to parse the files and build a lookup tool. If someone finds it useful, it can be found here: https://www.certic.info/tools-letsencryptrenewcheck.php

Unfortunately, it was obvious during the outage late in the night of February 29th that this was about to happen. I asked for more information, but the request was ignored completely.

Now we are facing short notice. Let’s Encrypt is a serious project and probably one of the best ever, but it really needs to come up with better support on public networks.

https://twitter.com/cs_networks/status/1233704143224791042

It was totally ignored, yet it was clear this was likely to happen. Now we are facing a few hours’ notice, which does the public no good. Let me know if I can be of any help, but PR really needs to get a bit better on this.

2 Likes

@yuriks I think you and the letsencrypt team should stop trying to explain why you wasted away the time figuring out which certificates were affected and blaming your late notification on the Baseline Requirements. If you had five days of notice period, you should have informed the community immediately, not after you had compiled a list of affected certificates. letsencrypt - you provide a great service for the web community, but just take it on board that you’ve handled this issue terribly - you need to stop trying to make excuses for it, just apologise, accept our feedback and move on without offering up excuses.

3 Likes

Maybe they could have written a message like "you may need to renew your certificate, we don't know yet", but it might have had harmful consequences:

  • Too many people trying to renew needlessly, which could have caused an outage
  • Too many people on the forum asking for details that they couldn't give yet, diverting their attention from more urgent things

6 Likes

Unfortunately at the time we weren’t sure of the scale of the impact and so wouldn’t be able to give people useful guidance yet. At the time we were focusing on patching the bug and then posted an explanation of the issue at 2020.02.29 CAA Rechecking Bug.

Thanks for posting your checking tool. I have some questions about it I’m going to send in a private message to not clutter the thread.

8 Likes

@tdelmas

The system should be built so that it can handle all of the certificates being requested simultaneously. Request timing in normal operation is likely random, but it is already possible that a large percentage of all certificates could be requested for renewal at the same time.

With respect to your second point, that is why clear, concise communication is more effective and important than verbose explanations and excuses. A perfect example is that the email mentioned a date without a time and/or a timezone; that suggests that LE’s communication is a last minute thought and LE has clearly underestimated how much time people needed to renew their certificates across their many distributions and personal scenarios.

Again, it’s not worth making excuses for this - it’s worth finding the reasons that the correct procedures and processes were not in place for such an event as this, which is why I asked whether the governance documents are available for LE so that the community can perhaps help contribute to better disaster resolution processes and procedures and quite frankly, if LE lacks the number of people that are required to handle such an event, they need to ask the community for more help. LE doesn’t just need technical staff that are capable of handling the bug, like in this circumstance; they need the right people in the organisation to help ensure that events such as these are planned for, well thought out and tested in advance for robustness.

2 Likes

We did not receive any email from LetsEncrypt and found out about this on ArsTechnica. 12am UTC deadline is absolutely unreasonable.

1 Like

I think it’s very easy, especially as engineers, to respond in such a way: promoting better practices, showing the golden path for a new problem. The fact is that we are here, and all of the shoulda, coulda, woulda… are not helpful in a thread like this. The goal is helping people solve the problem, so that when folks discover they are running X hundred domains, they can fix it and not get bogged down with posts about what could have been. Open a new thread and link it to this one. I’m not a member of the Let’s Encrypt team in any way, shape or form, but helping the community is a better use of time than supporting past decisions.

4 Likes

In addition to this, I used https://github.com/hannob/lecaa earlier today. I have about 140 domains with LE, and it was a great help. Hopefully it’s correct.

4 Likes

Do we know when the certs will actually be revoked? I was told 3/4/2020 18:00 UTC by my CDN provider, but I don’t see the same confirmation from LE. Does anyone have any details they can share?

1 Like

I’m not talking about past decisions - I am talking about how LE are handling this event and providing feedback on the answers they are providing in this thread. My comments about how they can be better prepared in the future are not about a ‘past decision’ - they are about the decisions they are making right now in this thread, and my comments in this thread have already led to a clarification that the timezone is UTC. We are not getting bogged down in posts about what could or should have been, and even if you feel that your criticism is appropriate, it is equally as ‘off point’ as mine would be.

The community should feel welcome to comment about whatever they wish - if you feel like creating a new thread that should be focussed on specific technical fixes, go right ahead and do that.

2 Likes

@jxman they have mentioned in the edits at the top of this thread that they do not have a locked down specific time, but that they suggest that you should consider your certificates as having been revoked as of 2020-03-04T00:00Z (midnight at the start of the 4th March UTC)
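
Since the timezone caused some confusion, here is a small Python sketch for converting that earliest revocation time to a local offset and checking how many hours remain. The timestamp is the one quoted above. `in_offset` and `hours_left` are hypothetical helper names, and the UTC-8 offset in the usage note is only an illustration.

```python
from datetime import datetime, timedelta, timezone

# Earliest revocation time stated in this thread: 2020-03-04T00:00Z.
EARLIEST_REVOCATION = datetime(2020, 3, 4, 0, 0, tzinfo=timezone.utc)


def in_offset(moment, hours):
    """Express a timezone-aware moment in a fixed UTC offset of whole hours."""
    return moment.astimezone(timezone(timedelta(hours=hours)))


def hours_left(now):
    """Hours from a timezone-aware `now` until the earliest revocation time."""
    return (EARLIEST_REVOCATION - now).total_seconds() / 3600
```

For example, `in_offset(EARLIEST_REVOCATION, -8)` gives 16:00 on 3 March for a UTC-8 location, and `hours_left(datetime.now(timezone.utc))` goes negative once the time has passed.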

1 Like

We have not started revocations but stated that 00:00 UTC on 04 March 2020 would be the earliest we would start that process. When we begin the revocations, we will post an update here.

1 Like

A post was split to a new topic: Replacing certificates with acme.sh