Revoking certain certificates on March 4

@fleish

I truly understand your sentiment, but I apologize that we must keep a strict deadline here or risk further Baseline Requirements violations. We definitely have an improvement to make in our notification emailer throughput. Unfortunately, that doesn’t help at this very moment.

1 Like

@Phil_LE, the improvement to make is that in future, you should not use up the five-day notification period building a tool to check which certificates were affected; you should have immediately informed the community and let them renew all of their certificates to resolve the issue.

Is there a place where we can see how your internal governance planned to handle an event such as this one? I think that the community is quite obviously telling you all that your process in this circumstance was wildly misguided and I want to know how you all plan to improve upon this in the future.

To be clear, the fact you had a bug is not the problem; the fact that you completely wasted the notification period is the problem.

5 Likes

Yeah, same here. Got an email 4.5h before 00:00 UTC. Glad I get off on abuse and risk. :blush:

1 Like

There are 3 million certificates affected out of 116 million. Under which circumstances is a certificate affected?

I am asking because in my case, I got the email for my personal websites and account where every domain was affected. For my business I did not get an email, so I checked all domains against the list with affected serials and not one of them is affected (list contains ~2000 domains).

I am quite scared that my customers will contact us tomorrow and report that their domains are not working anymore. But I have now checked it many times in different ways (extracting the serial from the pem file via openssl, and searching for the domain names in that file) and never got a result, so I am probably safe, but it still does not feel good.
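For anyone repeating this check across many certificates, here is a minimal Python sketch of the serial comparison. It assumes the downloaded list contains each affected serial as a run of hex digits delimited by non-hex characters, which is an assumption about the file layout rather than a documented format. The function names are made up for illustration. Serials can be taken from `openssl x509 -in cert.pem -noout -serial`, which prints a line of the form `serial=<HEX>`.

```python
import re
from typing import Iterable, Set

# Runs of hex digits; assumes serials in the list are delimited by non-hex characters.
HEX_RUN = re.compile(r"[0-9a-f]+")


def normalize_serial(serial: str) -> str:
    """Lowercase a serial and drop the 'serial=' prefix and ':' separators openssl may print."""
    s = serial.strip().lower()
    if s.startswith("serial="):
        s = s[len("serial="):]
    return s.replace(":", "")


def affected_serials(my_serials: Iterable[str], list_lines: Iterable[str]) -> Set[str]:
    """Return the normalized serials from my_serials that appear in the list lines."""
    wanted = {normalize_serial(s) for s in my_serials}
    found: Set[str] = set()
    # Stream the (large) list once instead of searching it once per serial.
    for line in list_lines:
        for token in HEX_RUN.findall(line.lower()):
            if token in wanted:
                found.add(token)
    return found
```

Passing an open file handle as `list_lines` keeps memory use flat, since the file is read line by line. An empty result only means no exact serial match was found under the layout assumption above, so it is worth confirming with the official checking tool as well.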

Can you give us any more information about why some domains are affected and some are not? Is it only because of regular reissuance, as you said in your first post? For my private websites I am using Traefik, so is that web server renewing the certs more regularly than my business setup, which uses the official letsencrypt client?

EDIT: The bug was confirmed on the 29th of February (according to the announcement). There have been some renewals since then, but in general the registration and renewal time frames are quite random, so not all of the certificates have been renewed recently; there are still several from e.g. January.

1 Like

Yes you are right, it was renewed 30 days before it was supposed to expire. Thanks again @yuriks for looking into this

2 Likes

@digilist Only certificates that contain more than one SAN (domain) and that met a specific timing condition are affected: the authorization was reused 8 hours or more after the initial CAA check. See 2020.02.29 CAA Rechecking Bug for details. If the serials for your certs are not in https://letsencrypt.org/caaproblem/ then they will not be revoked.
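As a rough illustration of the two conditions described in this post, the following Python sketch restates them as a predicate. It is not the CA's actual logic: the function name and parameters are hypothetical, and it reads "reused 8 hours or more after the initial CAA check" as "the certificate was issued at least 8 hours after the CAA check for at least one name". The published serial list remains the only authoritative answer.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Sequence

# Threshold quoted in the post; everything else here is an illustrative assumption.
CAA_RECHECK_AFTER = timedelta(hours=8)


def matches_described_conditions(
    san_names: Sequence[str],
    caa_checked_at: Iterable[datetime],
    issued_at: datetime,
) -> bool:
    """True if a certificate fits both conditions from the post (hypothetical helper)."""
    # Condition 1: the certificate covers more than one distinct name.
    multi_name = len(set(san_names)) > 1
    # Condition 2: at least one authorization's CAA check was 8+ hours old at issuance.
    reused_late = any(issued_at - checked >= CAA_RECHECK_AFTER for checked in caa_checked_at)
    return multi_name and reused_late
```

A single-name certificate never satisfies the first condition, which is consistent with the post saying only multi-SAN certificates are affected.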

2 Likes

I added a new entry to the FAQ talking about reasons you may receive an email and yet not need to renew your certificate. This covers the problem experienced by @vedranl as well as some others. Thank you!

3 Likes

I agree that we should have informed the community sooner and I apologize to you for the difficulty this incident has caused you. The time since initial discovery was spent building as accurate a list of affected certificates as we could given the time constraints, not deciding whether we should build tooling to reduce the size of the affected subscriber list. That last decision period was relatively short, and was performed in parallel with other blocking tasks. Once the list was compiled we began the public notification process.

4 Likes

This is kind of odd - technically I’m affected, though I’m not.

I downloaded the list and found one affected certificate, but this is the really strange thing - sure this is one of the certificates issued to me and it is theoretically still valid, but the stated time of the failure, meaning
> missing CAA checking results for <domainname> at <date and time> +0000 UTC
is basically the same time I last renewed the certificate for these domains (I might be mistaken due to a slightly different time zone). The affected certificate is the one I replaced at the mentioned time, and as far as I know the script I am using (acme_tiny.py) doesn’t work in a way that causes the error.

So I would assume I’m in the list because my renewal somehow triggered the bug, but why would that cause my old certificate to be flagged?

We have 20k+ domains spread over multiple certs, and with the ongoing Network Solutions/Web.com throttling issue (see “DNS failures (SERVFAIL, timeout) for domains using Network Solutions/Web.com/worldnic.com nameservers”) we’re going to have a lot of unprotected sites. The turnaround time is not sufficient. We received the email 2 hours ago.

2 Likes

@eldoran I’m not exactly sure if I understand what you’re saying correctly, but yes, it’s possible that you may have a newer certificate that wasn’t affected by the bug. The only way to know which certs were affected is by using the checking tool or downloading the list linked to in the first post in this thread and comparing their serial numbers.

If you’re wondering about the timestamp in the log message, this would be expected to be the same as when your cert was issued, which is when CAA checking should have been done (but wasn’t).
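To rule out a time zone mix-up when comparing that timestamp with your own renewal time, a small Python sketch like the one below may help. It relies only on the message shape quoted earlier in the thread (`missing CAA checking results for <domainname> at <date and time> +0000 UTC`). The `YYYY-MM-DD HH:MM:SS` layout of the date part is an assumption, and the helper names and the five-minute tolerance are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

LOG_RE = re.compile(
    r"missing CAA checking results for (?P<domain>\S+) at (?P<when>.+?) \+0000 UTC"
)


def parse_caa_message(line: str) -> Optional[Tuple[str, datetime]]:
    """Extract (domain, UTC timestamp) from a list line, or None if the text is absent."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    # Assumed layout: "YYYY-MM-DD HH:MM:SS", optionally followed by fractional seconds.
    # strptime raises ValueError if the real layout differs.
    when = datetime.strptime(m.group("when")[:19], "%Y-%m-%d %H:%M:%S")
    return m.group("domain"), when.replace(tzinfo=timezone.utc)


def same_moment(a: datetime, b: datetime, tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """Compare two timezone-aware datetimes, ignoring small clock differences."""
    return abs(a - b) <= tolerance
```

Both arguments to `same_moment` need to be timezone-aware; a local renewal time can be built with `datetime(..., tzinfo=timezone(timedelta(hours=1)))` for a UTC+1 clock, for example.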

3 Likes

@prashantrajan I understand the pain that this issue has been causing and that the timing of the notification doesn’t give you much time to react. Unfortunately we are required by the Baseline Requirements to revoke the affected certificates within this deadline. We worked over the weekend to compile the list of affected certificates and sent the notifications as soon as we were ready to. We regret that sending all of the emails took so long, and we’re looking for ways to make that faster in the future.

2 Likes

If anyone does not have access to the serial numbers of their certs but has the domains, this PHP script will cross-reference the serial dump by domain. Grepping 1.3GB thousands of times is not the fastest approach, but it was quick to throw together and let me identify a few of our certs to re-order.

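    <?php

    // Input: CSV whose first column holds the domain names to look up.
    $domain_file = '/home/dave/potential_domains.csv';
    // Input: the affected-serials dump downloaded from Let's Encrypt.
    $cert_issues = '/home/dave/Downloads/caa-rechecking-incident-affected-serials.txt';
    // Output: one matched domain per line.
    $match_dump_file = '/home/dave/affected_domain_match.csv';

    $counter = 0;
    $match_domains = array();

    // Read the seed domain names.
    if (($handle = fopen($domain_file, "r")) !== FALSE) {
        while (($data = fgetcsv($handle)) !== FALSE) {
            // Skip blank lines (fgetcsv returns array(null) for them).
            if (!isset($data[0]) || trim($data[0]) === '') {
                continue;
            }

            // Clean up the domain name before grepping the other file.
            $counter++;
            $domain_name = trim($data[0]);

            // Initialise the match state.
            $status = 'NOT_MATCHED';

            // Build the grep command. -F matches the name literally (so "." is not
            // a wildcard) and escapeshellarg() keeps odd input out of the shell.
            // This is still a substring match, so review the hits by hand.
            $command_line = 'grep -F -- ' . escapeshellarg($domain_name) . ' ' . escapeshellarg($cert_issues);

            // exec() returns the last line of output, which is enough to detect a match.
            $last_line = (string) exec($command_line);
            if (trim($last_line) != '') {
                $status = 'MATCHED';
            }

            // Dump the status.
            echo $counter . " :: " . $status . " :: " . $command_line . " --> [" . $last_line . "]\n";

            // Remember the matched domains for the output file.
            if ($status == 'MATCHED') {
                $match_domains[] = $domain_name;
            }
        }

        fclose($handle);
    }

    echo "\n\n Dumping Matched \n\n";
    var_export($match_domains);
    echo "\n\n DONE \n\n";

    // Write one matched domain per line.
    $fp = fopen($match_dump_file, 'w');

    foreach ($match_domains as $domain) {
        fputs($fp, $domain . PHP_EOL);
    }

    fclose($fp);

    ?>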
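
Since the script above greps the 1.3GB dump once per domain, a single-pass variant may be noticeably faster for long domain lists. The Python sketch below is one way to do that, not a port of the PHP: it assumes domain names appear in the dump as tokens delimited by characters that cannot occur in a hostname, and it matches whole names only, so `example.com` does not match `notexample.com`. The function names are hypothetical.

```python
import re
from typing import Iterable, Set

# Characters that can appear in a (possibly wildcard) hostname.
NAME_TOKEN = re.compile(r"[a-z0-9*_.-]+")


def load_domains(csv_lines: Iterable[str]) -> Set[str]:
    """Take the first CSV column of each line as a lowercase domain name."""
    domains = set()
    for line in csv_lines:
        name = line.split(",")[0].strip().lower().rstrip(".")
        if name:
            domains.add(name)
    return domains


def matched_domains(domains: Iterable[str], list_lines: Iterable[str]) -> Set[str]:
    """Return the domains that occur as whole tokens anywhere in the dump."""
    wanted = set(domains)
    found: Set[str] = set()
    # One pass over the dump; each token lookup is a constant-time set check.
    for line in list_lines:
        for token in NAME_TOKEN.findall(line.lower()):
            token = token.strip(".")
            if token in wanted:
                found.add(token)
    return found
```

Both functions accept open file handles, so a call such as `matched_domains(load_domains(open(domain_csv)), open(dump_path))` streams the dump without loading it into memory (`domain_csv` and `dump_path` being whatever paths you use).
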
5 Likes

Thanks @instiller, that script is much appreciated.

2 Likes

As we have a large number of customers, I had no other option but to parse the files and build a lookup tool. If someone finds it useful, it can be found here: https://www.certic.info/tools-letsencryptrenewcheck.php

It was obvious during the outage late on the night of February 29th that this was about to happen. I asked for more information, but unfortunately the request was ignored completely.

Now we are facing short notice. Let’s Encrypt is a serious project and probably one of the best ever, but it really needs to come up with better support on public networks.

https://twitter.com/cs_networks/status/1233704143224791042

Totally ignored, yet it was clear this was likely to happen. Now we are facing a few hours’ notice, which does the public no good. Let me know if I can be of any help, but PR really needs to get a bit better on this.

1 Like

@yuriks I think you and the letsencrypt team should stop trying to explain why you wasted away the time figuring out which certificates were affected and blaming your late notification on the Baseline Requirements. If you had five days of notice period, you should have informed the community immediately, not after you had compiled a list of affected certificates. letsencrypt - you provide a great service for the web community, but just take it on board that you’ve handled this issue terribly - you need to stop trying to make excuses for it, just apologise, accept our feedback and move on without offering up excuses.

2 Likes

Maybe they could have written a message like “you may need to renew your certificate, we don’t know yet”, but it might have had harmful consequences:

  • Too many people trying to renew without need, which could have caused an outage
  • Too many people on the forum asking for details that they couldn’t give yet, diverting their attention from more urgent things
5 Likes

Unfortunately, at the time we weren’t sure of the scale of the impact and so wouldn’t have been able to give people useful guidance yet. We were focusing on patching the bug and then posted an explanation of the issue at 2020.02.29 CAA Rechecking Bug.

Thanks for posting your checking tool. I have some questions about it I’m going to send in a private message to not clutter the thread.

7 Likes

@tdelmas

The system should be built so that it can handle all of the certificates being requested simultaneously. Regardless of the likely random timing of requests in normal operation, it is already possible that a large percentage of all certificates could be requested for renewal at the same time.

With respect to your second point, that is why clear, concise communication is more effective and important than verbose explanations and excuses. A perfect example is that the email mentioned a date without a time and/or a timezone; that suggests that LE’s communication is a last minute thought and LE has clearly underestimated how much time people needed to renew their certificates across their many distributions and personal scenarios.

Again, it’s not worth making excuses for this. It’s worth finding out why the correct procedures and processes were not in place for an event like this one. That is why I asked whether LE’s governance documents are available, so that the community can perhaps help contribute to better disaster resolution processes and procedures. Quite frankly, if LE lacks the number of people required to handle such an event, they need to ask the community for more help. LE doesn’t just need technical staff who are capable of handling the bug, as in this circumstance; they need the right people in the organisation to help ensure that events such as these are planned for, well thought out and tested in advance for robustness.

1 Like

We did not receive any email from LetsEncrypt and found out about this on ArsTechnica. 12am UTC deadline is absolutely unreasonable.