OCSP server responses for new certificates

A day ago, on Dec. 2nd at ~22:41 UTC, I renewed one of my certificates. The OCSP server, though, did not have the certificate status yet. This is a problem because the certificate has the "must-staple" extension, so clients will reject the certificate if there is no valid OCSP response. I tried fetching the new certificate's OCSP response from the server, and it took 30 minutes until the OCSP server had it.

  1. Was there a known problem with distributing new cert statuses to the OCSP server?

  2. Is there a time window that a client should wait between certificate issuance and expecting a successful OCSP fetch? Or should the OCSP servers have the information instantly?

  3. Delayed OCSP status availability has been a problem here before (most people don't use must-staple certs and so don't see these issues). I think Let's Encrypt needs to improve its processes there. For a certificate that has the must-staple extension, I think the issued/signed certificate should only be made available to the ACME client once Let's Encrypt has made sure that the OCSP server(s) have the status for the new certificate. If you cannot ensure that the OCSP server(s) have the information, you should delay the reply to the ACME client or let the issuance process fail. If you let the issuance process finish without the OCSP server being ready, you are breaking the sites using the certs.

This is interesting - thanks for reporting it!

  1. No, we've had no known problems with OCSP distribution recently.

  2. Under all normal conditions, there will be no delay at all. A conservative wait time would be two seconds. But the ideal ACME client / OCSP stapling stack should reach out and double-check that a newly obtained certificate has a valid OCSP response available before actively using it. We try to guarantee this from our side, but this approach would be most fault-tolerant.

  3. I agree; we already try hard to ensure immediate availability, and failures are rare. Here's what our current architecture looks like:

  • Our front-end CDN caches our origin's responses for 12 hours. It doesn't have its own database: all uncached requests are forwarded to our origin.
  • When we revoke a certificate, we reach out and immediately purge its old (valid) OCSP response from the CDN's cache.
  • Our origin is load balanced (usually equally) across multiple data centers.
  • Our origin OCSP service reads directly from the same (ACID compliant, MySQL flavor) database that the rest of our CA stack uses.
  • An OCSP response is generated and stored in the database before an issued certificate is returned to the client.
  • Our database uses a simple primary (read/write) / replica (read-only) model.

If we have a long delay replicating from our primary DB to one of our replicas (and one of our CDN's POPs happens to hit the affected replica), that's when a client could have received a certificate without a valid OCSP response being available yet.

This is pretty rare: our replication lag is usually far lower (a few milliseconds) than the time it takes the client to retrieve the certificate and get its first OCSP request back through to one of our database servers. We monitor this metric closely and alert our SRE team when it's elevated. I just checked this metric at your issuance time, and it appears to have been too low for a failure to happen this way. So, I'm stumped.

If you observe this again, it would be really useful to know the timestamp, the certificate's serial number, and which source and destination IP addresses you used for the OCSP request. We keep minimal OCSP logs for privacy reasons, but this will give us the best chance of tracking down a little more information about how your request was served.

What's the caching policy for negative responses? That is, if the first request to the front-end CDN happened moments before the certificate status shows up in the back-end, would that cause something like this where it's not available for some length of time? Or is there a CDN cache purge once the certificate status is available in the back-end?

Yes, that is indeed possible.

We do cache negative responses, because we sometimes see continuous high request volumes for serial numbers that don’t exist. And unfortunately, our CDN would not support the volume of proactive cache purges if we did one for every OCSP update.

It’s on our radar to offload part of our OCSP handling to a separate layer that’s easier to scale, which could allow a shorter cache time at the external CDN.
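To make the negative-caching scenario concrete, here is a toy simulation. It is only a sketch: `CachingFrontEnd`, `demo`, and the 30-minute TTL are made up for illustration (the TTL just mirrors the delay reported at the top of this thread, it is not a confirmed setting), and the "origin" is a plain dictionary rather than a real OCSP responder.

```python
from typing import Callable, Dict, Optional, Tuple


class CachingFrontEnd:
    """Toy CDN: caches every origin answer, including 'unknown' (None), for ttl seconds."""

    def __init__(
        self,
        origin: Callable[[str], Optional[str]],
        ttl: float,
        clock: Callable[[], float],
    ) -> None:
        self.origin = origin
        self.ttl = ttl
        self.clock = clock
        # serial -> (expiry time, cached status)
        self.cache: Dict[str, Tuple[float, Optional[str]]] = {}

    def lookup(self, serial: str) -> Optional[str]:
        now = self.clock()
        entry = self.cache.get(serial)
        if entry is not None and now < entry[0]:
            return entry[1]  # served from cache, possibly a stale negative answer
        status = self.origin(serial)
        self.cache[serial] = (now + self.ttl, status)
        return status

    def purge(self, serial: str) -> None:
        """Drop a cached entry so the next lookup goes to the origin."""
        self.cache.pop(serial, None)


def demo() -> Tuple[Optional[str], Optional[str], Optional[str]]:
    now = [0.0]                      # fake clock, in seconds
    database: Dict[str, str] = {}    # stands in for the origin's view of the database
    cdn = CachingFrontEnd(database.get, ttl=1800.0, clock=lambda: now[0])

    first = cdn.lookup("serial-1")   # origin has no status yet: negative answer is cached
    database["serial-1"] = "good"    # status becomes visible at the origin
    now[0] = 60.0
    second = cdn.lookup("serial-1")  # still answered from the cached negative
    now[0] = 1800.0
    third = cdn.lookup("serial-1")   # cache entry expired: origin is asked again
    return first, second, third
```

In this model, one early request is enough to hide the status until the cached entry expires, unless `purge` is called for that serial.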

2s sounds like a reasonable time to wait; however, I had already added a 5s delay to my dehydrated ACME client after previous issues with the OCSP response not being distributed to the OCSP servers in time. 30 minutes for the status to show up on the servers really did not look okay, and made me finally bring up this topic here.

Sure, but I would rather not post those details here in this forum. Is there an email address I can send that information to?

I don't know how well other ACME clients handle this; for dehydrated I filed this: cert deployment should not be finished if ocsp fetch failed with must_staple enabled · Issue #787 · dehydrated-io/dehydrated · GitHub
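For illustration, the kind of check I have in mind could look roughly like the sketch below. It is not dehydrated's code: `wait_for_ocsp`, `deploy_if_stapleable`, and the `fetch`/`deploy` callables are hypothetical names, and `fetch` is assumed to return the OCSP response bytes once the responder reports a valid status, or `None` otherwise. The retry count and delays are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_ocsp(
    fetch: Callable[[], Optional[bytes]],
    attempts: int = 6,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch until it returns a response or the attempts are used up."""
    delay = initial_delay
    for attempt in range(attempts):
        response = fetch()
        if response is not None:
            return response
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # back off: 2s, 4s, 8s, ...
    return None


def deploy_if_stapleable(
    fetch: Callable[[], Optional[bytes]],
    deploy: Callable[[Optional[bytes]], None],
    must_staple: bool,
    **wait_options,
) -> bool:
    """Deploy the new certificate unless it is must-staple and has no OCSP response yet."""
    response = wait_for_ocsp(fetch, **wait_options)
    if response is None and must_staple:
        return False  # keep serving the old certificate and try again later
    deploy(response)
    return True
```

The point is only the ordering: for a must-staple certificate, the old certificate stays in place until a usable OCSP response for the new one has actually been fetched.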

Could it support the volume of proactive cache purges just for new must-staple certificates? (I'm just brainstorming and certainly don't know any more specifics of your infrastructure, or if this is even the right approach at all.)

Feel free to DM info to me here on the forum - that’d be the best way.

Yes, I think so - but I expect we’d rather focus on the root cause of any replication lag, and block issuance on that instead, since cache purges aren’t instant either.

I debugged this further with the help of JamesLE, thanks again James! It turned out to be an HAProxy problem: HAProxy allows certificate replacements on the fly, as described in https://www.haproxy.com/blog/dynamic-configuration-haproxy-runtime-api/. This doesn't work, though, if the intermediate certificate changes: HAProxy continues to use the previously configured intermediate cert and does not put the new one in place, as one would expect. With the wrong intermediate cert, it is then unable to match the OCSP response because the issuer does not match. To get the new intermediate activated, you need to reload the haproxy service.
Maybe this is helpful for some other people here, and hopefully this will be fixed in HAProxy in one of the next releases.
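As a workaround, a deploy hook could decide between an on-the-fly update and a full reload by comparing the intermediates in the old and new PEM bundles. The following is only a sketch under assumptions: `split_chain` and `needs_reload` are hypothetical helpers, the bundle is assumed to list the leaf certificate first and the intermediates after it, and triggering the actual reload is left to the hook.

```python
import re
from typing import List

# Matches each certificate block; other blocks (e.g. a private key) are ignored.
_PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_chain(pem_bundle: str) -> List[str]:
    """Return the certificates in a PEM bundle with all whitespace stripped."""
    return ["".join(block.split()) for block in _PEM_CERT.findall(pem_bundle)]


def needs_reload(old_bundle: str, new_bundle: str) -> bool:
    """True when anything after the leaf certificate (the intermediates) differs."""
    return split_chain(old_bundle)[1:] != split_chain(new_bundle)[1:]
```

If `needs_reload` returns `True`, the hook would reload the haproxy service instead of replacing the certificate at runtime.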
