OCSP Request failed with following message

@ibehm Not really the greatest news, but many browsers don’t actually check OCSP when browsing to a site. Chrome has its own curated CRL (CRLSets) and won’t check CA-maintained OCSP or CRLs. Firefox currently checks only OCSP by default, then fails open if it can’t get a response. Last I checked, Microsoft checks OCSP, then the CRL if OCSP fails, and may or may not then warn that there is an issue. Safari doesn’t check OCSP. Opera checks, but also fails open on no response, I believe.

So, in general, if OCSP is unavailable most browsers will happily assume the certificate is fine. The browser with the largest market share doesn’t check OCSP at all. This is one reason Let’s Encrypt believes 90-day certs are so important: shortening the lifetime is the only reliable way of making sure a compromised cert is unusable.


@isk I’ve only received one more 503 since my last post:

[Wed Jan 24 19:26:02.986209 2018] [ssl:error] [pid 25266:tid 139905301694208] [client] AH01980: bad response from OCSP server: 503 Service Unavailable

My IP is

@isk Just a short update… we got 318 503 errors in the last hour (Germany). I provided some details via PM earlier.

I am still getting random OCSP server errors from Nagios monitoring my SSL certificates from Let’s Encrypt. Servers are in the UK.

ocsp.int-x{1…4}.letsencrypt.org resolving to 2.21.67.*

My server IP is

Update, now working again, ocsp server IPs now 2.22.146.*

I’ve noticed that while renewing certificates with certbot-auto I tend to get 503s quite often on my sites that get very infrequent traffic. If you need any info, let me know.

The infrequent traffic component seems to fit well with some of what we’re looking at now. Those sites would be less likely to already be in the cache.

We have made some progress, I am still looking at this and will hopefully have another update soon.


Evening update: We’ve spent the day troubleshooting the internet. At present there is a lot of “it’s not our problem, it’s their problem” going on, and undoubtedly at least one of the potentially responsible parties is probably mostly right. Right now we have a plan to try something that may be somewhat disruptive in about 13 hours, and we will be prepping to make it as non-disruptive as possible. That may fix this issue, or it may just reduce the chance that it is being caused by that particular piece of the environment.

We’re working on it. We’ll update here and ask you to let us know how that turns out for you.

Just for the curious among you: from CDN logs, the problem does still seem to occur primarily with traffic coming from Europe, for no reason we’ve found yet, although we have been able to determine it occasionally happens with traffic from Asia and North America as well.


Okay, we’ve made a change that will help us pin this down. It might even temporarily fix it. Can I get a sound off over the next little while from my loyal testers/victims here as to whether there is a change in the frequency of 500s starting roughly 5 minutes ago?

This is your minion from Saltmine “A-38 (left)”, please accept this humble state of being report:

Restarted nginx (somewhere in Germany) at 18:35:17 UTC to clear the cache.
The last 503 seen was even before that, at 16:19:47 UTC!

Here is the distribution of 503 errors over time (UTC hours on the 26th):

Hour: ErrorCount
01: 265
02: 483
03: 540
04: 600
05: 354
06: 128
07: 210
08: 261
09: 412
10: 395
11: 230
12: 312
13: 261
14: 198
15: 18
16: 2
17: 0
18: 0
19: 8 (still running)
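For anyone who wants to build a table like the one above from their own logs, here is a minimal sketch. It assumes the Apache mod_ssl AH01980 error format quoted earlier in this thread; the regex and function name are my own illustration, not anything from the posters’ actual monitoring setups:

```python
import re
from collections import Counter

# Matches the Apache mod_ssl error shown earlier in the thread, e.g.:
# [Wed Jan 24 19:26:02.986209 2018] [ssl:error] ... AH01980: bad response from OCSP server: 503 ...
LINE_RE = re.compile(
    r"^\[\w{3} \w{3} \d{2} (\d{2}):\d{2}:\d{2}\.\d+ \d{4}\].*AH01980.*503"
)

def hourly_503_counts(lines):
    """Return a {hour: count} map of OCSP 503 errors per log hour."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

Feed it the error log (e.g. `hourly_503_counts(open("/var/log/apache2/error.log"))`) and print the result sorted by hour. Note the hours are in whatever timezone Apache logs in, so convert to UTC if yours differs.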

Basically it was looking great… until 19:07:10 to 19:11:32 UTC (all 8 errors happened in that timeframe).

Currently I am kicking myself for not getting a reference number for any of those 8 events. I will report again as soon as I get something useful :frowning:

Back to the saltmines for me…
I thank the great overlord(s) for the time, effort, and progress on this issue.


Actually, @ePhil I like that answer better. I’ll move toward scheduling a brief service interruption for the next change.

Thanks @isk, maybe I get to see daylight again :slight_smile:

Got 1 (just one) more 503 at 19:33:24 UTC with Reference #102.57d91002.1516995204.4c1b233.

Testing further with yet another change to rule out a possible cause. Let me know if things worsen, please.

It’s been quiet for me in the UK. No errors seen since the change, but then again I hadn’t seen any errors for ~5 hours before the change either (last error was at 13:04 UTC).

I’m merely another system admin, trying to provide some helpful guidance from my own prior experience.

I’d like to just take a moment to point out that (unless things have quite recently improved a great deal), using the built-in OCSP stapling functionality of either Apache or nginx is folly.

Seriously, don’t use those. They’ll amplify a momentary glitch fetching an OCSP response into a site outage for you. Promise. Just give them time.

Last time I looked (several months ago) they were finally looking into improving the validation of the responses they got, and into caching old responses to hold in case they later get a bad response or no response, etc. But they were also still discussing things like whether the OCSP fetcher even had to be an HTTP/1.1 client. (For what it’s worth, whether or not full HTTP/1.1 support is actually required, at a minimum the Host: header is essential for getting OCSP responses off the CDNs from which you’re generally fetching them.)
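To make the Host: header point concrete, here is a sketch of what a well-formed OCSP POST looks like on the wire. The function name is mine and the DER bytes in the usage example are a placeholder, not a real OCSP request:

```python
def build_ocsp_post(host, path, der_request):
    """Build a raw HTTP/1.1 POST carrying a DER-encoded OCSP request.

    The Host header is what lets a CDN edge node route the request to
    the right origin -- omit it and you'll typically get an error page
    back instead of an OCSP response.
    """
    headers = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: application/ocsp-request\r\n"
        f"Content-Length: {len(der_request)}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    )
    return headers.encode("ascii") + der_request
```

For example, `build_ocsp_post("ocsp.int-x3.letsencrypt.org", "/", der_bytes)` yields a request a CDN edge can actually route; a bare HTTP/1.0-style fetcher that skips the Host line generally cannot get a response from these endpoints.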

It’s great you want to improve security overall by OCSP stapling your TLS sessions. But… Those two pieces of software have completely immature OCSP stapling mechanisms (again, at least as of several months ago.)

Furthermore, even if/when they fix that (either or both of them), it will take significant time before such improvements in the main line are imported into the builds that your typical Linux distro is shipping. Probably more than a year. Sometimes years. The downstream packagers are looking for critical matters and security fixes, not enhancements.

If you insist on doing OCSP stapling, run Microsoft’s web server (and who would have thought?!?) or Caddy.

The CDNs (and some CAs) are to blame for OCSP reliability being what it is, but no single element on the Internet is meant to be reliable anyway.

Not all OCSP stapling implementations are equal. Last I looked, Apache’s and NGINX’s were far less equal than others.


You are correct. And although I am biased, I will echo your recommendation to use Caddy. :slight_smile: Its OCSP stapling implementation is more robust than that of nginx and Apache.

  • Caddy updates the staple halfway through its validity period.
  • Updates happen in the background, not during requests.
  • Staples are only accepted if they are actually valid, including edge cases we’ve seen where the staple might expire after the certificate.
  • Caddy staples OCSP to all qualifying certificates by default.
  • OCSP staples are cached to disk so it can weather outages that are several days long, usually plenty long enough to gain connectivity to one of the responders.

When major OCSP outages happened a few months ago that even took down gnu.org and many other sites in Firefox and other clients that enforce revocation checking, Caddy sites stayed afloat.

Caddy isn’t Free software, which makes it a non-viable option for me. (Corrected: thanks to @elcore for pointing out my error.)

If Let’s Encrypt wants widespread adoption of OCSP stapling, rather than lots of people disabling it, then it sounds like some work needs to be done with the Apache and nginx projects to improve the implementations in these web servers.

Caddy’s source code is licensed under Apache 2.0


Elcore is right, Caddy is entirely free software. The licenses are for the binaries downloaded from the website which are convenient for businesses or personal use. Personal use is also free. It’s like nginx and nginx Plus, except Caddy doesn’t limit you on features. And plenty of people use nginx and nginx Plus.


@mdhardeman ( @mholt / @chrisc ) while I do think that you are raising a very valid point, I feel that this thread will go (way) off topic if we were to follow that train of thought. Please note that I feel bad writing this, especially since, from what I can tell, you are quite right. But… in the end, if nobody informs Let’s Encrypt or Akamai of a problem with 503s, how are they supposed to know(1)? That said: thank you for posting the writeup.
(1) One could analyze the delivered 503s and look for… ehhh, let’s not get into that either :slight_smile:

@isk all in all I got around 100 503s yesterday and only 1 today (the day is currently not-so-young, ~20:00 MET).


I agree and don’t wish to change the context of the broader discussion into system administration practices, but I felt the dialog presented here might lead some down the path of assuming that OCSP stapling and OCSP must-staple represent a bigger-picture best practice right now.

The state of the most commonly used server software presently negates that possibility.

I agree that stabilizing the OCSP response infrastructure is a vital function for further into the future, but pragmatically, I think really significant use of OCSP in the future may not care about brief spontaneous failures.

As OCSP must-staple becomes a thing and as browser implementers increasingly abandon live OCSP checking, demands on the OCSP delivery infrastructure will lighten. In a world in which the only OCSP queries come from server instances needing current staples, the volume of requests drops so much that massive CDN distribution perhaps becomes unnecessary. In short, a significant shift in the deployment and use pattern is likely to reshape the delivery infrastructure regardless.

I just wanted to help out by saving any ambitious new admins out there some future grief as they try to get ahead of the curve on the next thing. Unless you’re a domain expert or running a solution which automatically gets stapling right (as noted, that’s rare), getting ahead of the curve on OCSP must-staple likely means you are going to run right off the track. :slightly_smiling_face: Being early to the OCSP must-staple race is unlikely to get you a raise. Causing significant site outages by deploying, without any mandate or industry consensus on best practice, a new security technology whose goal isn’t to improve your security but rather that of the average browser, on the other hand, might get you fired.