Hi there! Got an obscure bug I could use a pointer or two on. Here's the synopsis:
I'm running a frontend using the golang autocert package, in order to provision certs just-in-time for user-supplied domain names.
It mostly works, however I see an issue for the very first request: Chrome on Mac will display the certificate as invalid for the rest of the session; and Safari on Mac will display it as invalid for just that request. Example from Safari:
Inspecting the cert chain in the UI shows everything is valid.
I'm wondering if this comes down to some sort of expected client/server race condition with local time and NotBefore of the issued certs. If that's the case, I am guessing this comes down to the consequences of JIT provisioning & imperfect client clocks. I can think of three possible solutions:
Ask LetsEncrypt for an earlier NotBefore timestamp. (Probably not allowed, for good reason..?)
Provision the cert as early as possible, ahead of any real customer request (i.e. avoid JIT).
Add an artificial extra delay on first response (eg delay 5 seconds = we can tolerate clients whose clocks are up to 5sec slow).
As far as I know, LE already issues certs with an earlier NotBefore entry, I think it's an hour earlier than the actual issuing date (but could be mistaken in that one hour).
Unfortunately, I can't see in your screenshot why the certificate was marked as invalid, as the valid snippet is pasted over the error..
Wow thanks for the fast responses! Great community here.
The error in Chrome is indeed NET::ERR_CERT_AUTHORITY_INVALID. If you are curious to try reproducing it directly, I've pointed *.taps.live at the frontend; any random host repros it for me.
I do have one theory about why this kind of thing could happen ... but I have very little proof so far. (My related post).
I think it could be related to clock skew, but related to the timestamp in the certificate transparency SCT rather than in the certificate itself. The SCT timestamp is not backdated, and browsers are supposed to reject timestamps in the future. Chromium and Safari both enforce SCTs, but I'm not sure whether they have done something to account for this. Safari is closed source and Chromium's SCT validator seems to change constantly.
In your case though, if you're getting a transient ERR_CERT_AUTHORITY_INVALID error (rather than the CT one), I think it's probably something else.
You could try creating the net-export dump as documented in my post anyway, as it creates very detailed logs about the certificate verification process. Could help, who knows.
FWIW I couldn't reproduce the problem with your domain on Linux+Chromium.
I can't reproduce it either on my Chromium browser. But of course, I've got ntpd running
Hah - forgot to say I did check that as well; no smoking gun, drift of a few microseconds..
You could try creating the net-export dump as documented in my post anyway, as it creates very detailed logs about the certificate verification process. Could help, who knows.
I'll give that a shot! I've noticed the repro isn't 100% reliable; I managed to see one domain complete successfully.
Okay! Got a capture of a repro with Chrome. If you're interested in taking a look, would you mind DM'ing me? (I don't think my newbie forum rep allows me to initiate, but am guessing I can respond).
Looks like it is not the SCTs. Chrome's certificate verification process begins around ~350-400ms after the SCT timestamps in the certificate; no problems there.
The actual cause of the ERR_CERT_AUTHORITY_INVALID is unfortunately not surfaced in the logs. But I really appreciate you collecting them.
I had the same thing when trying to catch this on Windows. I wonder if there's a difference between the Chromium and Google builds relating to when built-in libraries vs the macOS Security Framework is used to verify.
Doubly annoying because there's a neat cert_verify_tool that you can use to debug verification at a certain timestamp without constantly recreating certificates, but you'd need the one from a Google build where the problem is reproducible.
I wonder if there's a difference between the Chromium and Google builds relating to when built-in libraries vs the macOS Security Framework is used to verify.
Hey! I think you might be on to something there. I don't know these systems all that well, but on a lark I popped open console, tried to repro, and dug through all the garbage in syslog when I did. Sure enough:
Damned if that doesn't look like some sort of millisecond-precision clock and microsecond-precision clock getting into a silly disagreement, eh...?
Correcting myself: Probably right idea, wrong units: The value on the right (trustd's now()?) looks like a seconds-precision timestamp being converted to & compared with millis. In all repros I've found, the value on the right is an exact multiple of 1000 (value % 1000 == 0). Fresh example:
SCT is in the future: 1604197302365 > 1604197302000
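To make the arithmetic behind that log line concrete, here is a minimal Go sketch. It is only an illustration of the suspected comparison, not trustd's actual code; the helper names and the 400ms gap between SCT signing and verification are assumptions chosen to match the timings mentioned earlier in the thread.

```go
package main

import "fmt"

// truncateToSeconds drops the sub-second part of a millisecond Unix
// timestamp, mimicking a clock that is read with whole-second precision
// and then scaled back up to milliseconds.
func truncateToSeconds(ms int64) int64 {
	return (ms / 1000) * 1000
}

// sctInFuture reports whether an SCT timestamp is later than the
// verifier's idea of "now" (both in milliseconds).
func sctInFuture(sctMillis, nowMillis int64) bool {
	return sctMillis > nowMillis
}

func main() {
	const sct = 1604197302365 // SCT timestamp from the log line above
	const realNow = sct + 400 // hypothetical: verification starts 400ms later

	// Full-precision clock: the SCT is in the past.
	fmt.Println(sctInFuture(sct, realNow)) // false
	// Seconds-precision clock: "now" becomes 1604197302000, so the SCT looks future-dated.
	fmt.Println(sctInFuture(sct, truncateToSeconds(realNow))) // true
}
```

Under this reading, any handshake that starts within the same wall-clock second as the SCT was signed could fail, even on a client with a perfectly synced clock.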
Wow, trustd logs. That's a really nice find, good job! I am also very pleased that my theory actually went somewhere.
Let's say that the immediate issue is that trustd is querying a clock with seconds precision (or is lazy code which only looks at timeval.tv_sec?).
Doesn't this mean that certificate backdating (to fix client clock skew) is basically ineffectual for any UA with a CT policy? I don't think it's possible/practical to backdate SCTs. @jsha have you come across this before?
Interesting! I haven't seen anything like this before. And I think you may be right that backdating to any time earlier than one of the SCTs would be expected to break on clients that enforce "SCT not in future."
One simple fix for ACME clients is to wait a little while before making a certificate available. In general one can't rely on issuance happening within seconds anyhow.
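A rough Go sketch of that "wait a little while" idea, using only the standard library. Everything here is an assumption for illustration: `holdFresh` and `remainingHold` are hypothetical helpers, and the grace period is measured from when this process first sees a certificate, not from its real issuance time. The wrapped function has the same signature as `tls.Config.GetCertificate`, which autocert's `Manager.GetCertificate` should also satisfy.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"fmt"
	"sync"
	"time"
)

// getCertFunc matches the signature of tls.Config.GetCertificate.
type getCertFunc func(*tls.ClientHelloInfo) (*tls.Certificate, error)

// remainingHold returns how much longer a certificate must be held back,
// given how long ago it was first seen and the desired grace period.
func remainingHold(elapsed, grace time.Duration) time.Duration {
	if elapsed >= grace {
		return 0
	}
	return grace - elapsed
}

// holdFresh (hypothetical helper) wraps next so that handshakes receiving a
// newly seen certificate block until the grace period has passed.
func holdFresh(next getCertFunc, grace time.Duration) getCertFunc {
	var mu sync.Mutex
	firstSeen := make(map[[sha256.Size]byte]time.Time)
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := next(hello)
		if err != nil || cert == nil || len(cert.Certificate) == 0 {
			return cert, err
		}
		// Do not slow down ACME tls-alpn-01 challenge handshakes.
		for _, p := range hello.SupportedProtos {
			if p == "acme-tls/1" {
				return cert, nil
			}
		}
		key := sha256.Sum256(cert.Certificate[0]) // fingerprint of the leaf
		mu.Lock()
		seen, ok := firstSeen[key]
		if !ok {
			seen = time.Now()
			firstSeen[key] = seen
		}
		mu.Unlock()
		time.Sleep(remainingHold(time.Since(seen), grace))
		return cert, nil
	}
}

func main() {
	// Stand-in for a real certificate source; the bytes are placeholders.
	fake := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		return &tls.Certificate{Certificate: [][]byte{{0x30, 0x00}}}, nil
	}
	get := holdFresh(fake, 200*time.Millisecond)
	for i := 0; i < 2; i++ {
		start := time.Now()
		if _, err := get(&tls.ClientHelloInfo{}); err != nil {
			fmt.Println("error:", err)
		}
		fmt.Printf("call %d took %v\n", i+1, time.Since(start).Round(time.Millisecond))
	}
}
```

Caveats: certificates loaded from a cache after a restart are also held once, the map is never pruned, and a slow first handshake may hit client timeouts if the grace period is large. A grace period of a second or two would already cover the truncation effect described above.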
Today I've learned a lot more about CT than I bargained for, but it's been fun! Thanks for all the help and pointers. I have a workaround, and now just some curiosities:
I think you may be right that backdating to any time earlier than one of the SCTs would be expected to break on clients that enforce "SCT not in future."
Which system actually creates this timestamp, and what would be the consequences of back dating it?
Something that's unsatisfying to me -- and maybe it just needs to be -- is that there doesn't seem to be an accommodation for clock skew possible in the CT protocol here (where dialing back NotBefore an hour is the solution for the same real-world problem for cert validity).
In general one can't rely on issuance happening within seconds anyhow.
I wonder if autocert is a bit of an unwitting culprit here, maybe even shifting that expectation: It sure does make it very easy to code up "just in time"-issuing frontends. Which in turn seem like they can achieve a much shorter window between <time cert issued> and <time of first real client request>, leading to this kind of sensitivity.
(I guess the trustd lazy timeval bug is just 'helpfully' simulating up to 999ms of ntp drift..)
One simple fix for ACME clients is to wait a little while before making a certificate available.
Let's Encrypt gets countersignatures from Google + 1 other trusted log operator, for the certificate it intends to issue, in the form of SCTs.
The timestamp itself is generated by that log operator and Let's Encrypt doesn't have any say in its value.
Let's Encrypt then includes those SCTs in the final certificate as proof that the certificate has been publicly logged.
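For anyone who wants to look at those embedded timestamps themselves, here is a small Go sketch that pulls them out of a certificate. It assumes RFC 6962 v1 SCTs embedded under the extension OID 1.3.6.1.4.1.11129.2.4.2, does no signature checking, and the function names are made up for this example. The demo in `main` uses a synthetic SCT list, not a real certificate.

```go
package main

import (
	"crypto/x509"
	"encoding/asn1"
	"encoding/binary"
	"errors"
	"fmt"
)

// OID of the embedded SCT list extension (RFC 6962, section 3.3).
var oidSCTList = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}

// sctListFromCert returns the TLS-encoded SCT list embedded in cert,
// or nil if the certificate carries no such extension.
func sctListFromCert(cert *x509.Certificate) ([]byte, error) {
	for _, ext := range cert.Extensions {
		if ext.Id.Equal(oidSCTList) {
			var list []byte // the extension value wraps the list in an OCTET STRING
			if _, err := asn1.Unmarshal(ext.Value, &list); err != nil {
				return nil, err
			}
			return list, nil
		}
	}
	return nil, nil
}

// sctTimestamps extracts the millisecond timestamps from a TLS-encoded
// SignedCertificateTimestampList.
func sctTimestamps(list []byte) ([]int64, error) {
	if len(list) < 2 || int(binary.BigEndian.Uint16(list)) != len(list)-2 {
		return nil, errors.New("bad SCT list length")
	}
	list = list[2:]
	var out []int64
	for len(list) > 0 {
		if len(list) < 2 {
			return nil, errors.New("truncated SCT length")
		}
		n := int(binary.BigEndian.Uint16(list))
		list = list[2:]
		// version (1 byte) + log ID (32 bytes) + timestamp (8 bytes) = 41 bytes minimum
		if n < 41 || n > len(list) {
			return nil, errors.New("truncated SCT")
		}
		out = append(out, int64(binary.BigEndian.Uint64(list[33:41])))
		list = list[n:]
	}
	return out, nil
}

func main() {
	// Synthetic one-entry list: only the timestamp field is filled in.
	sct := make([]byte, 41)
	binary.BigEndian.PutUint64(sct[33:], 1604197302365)
	list := append([]byte{0, 43, 0, 41}, sct...)

	ts, err := sctTimestamps(list)
	fmt.Println(ts, err)
}
```

With a real certificate, the output of `sctListFromCert` can be fed into `sctTimestamps` and the latest value compared against the local clock.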
Reading RFC6962-bis ("Certificate Transparency Version 2.0"), there are some changes:
The requirement to reject future timestamps is gone and instead it's up to the client.
Log entries don't need to be in chronological order
but I don't think these exist to enable backdating.
I also realized that OCSP is probably one more place where you could run into trouble because those responses are not backdated by Let's Encrypt either. However, I think browsers are pretty lax when it comes to OCSP problems.
On macOS/OSX, Chrome seems to cache invalid certs until you restart the browser application. This has happened to me a few times in the past.
In the cases I have witnessed, it was not because of this time-delay issue, but because of expired certificates. The servers had been restarted with updated certificates and were serving them - even for days - but Chrome kept believing/insisting the old certificates were in use. While I usually experienced this as a publisher, I experienced this as a consumer this past week, when gitter.im pushed a new release and accidentally deployed an expired certificate.
Perhaps the Chrome bug is not due to caching the certificate, but due to caching the certificate's validation.
I think this is true, and in general I strongly discourage writing just-in-time issuing frontends. A couple of reasons:
This style of frontend takes a much harder dependency on Let's Encrypt's uptime. If we're down, you're down.
We don't guarantee that issuance will happen in seconds. We may very well add validation processes in the future that take longer, which would break the assumptions of this style of client.
I have also noticed that autocert in particular is prone to a particular failure mode where it spams our service extremely rapidly - up to millions of requests. Because it doesn't log requests or alert its maintainer, this often goes unfixed. And because autocert's example code doesn't set an email address, most autocert users don't set one in their configs. That means it's hard for us to reach out when clients go bad and get the underlying bugs fixed.