OCSP request failed with the following message

Thanks for the suggestion - our operations team has opened a status page incident for this: https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/5a6753733800d404cc4ea7db

I’m hopeful someone will be able to update this thread with more information about the remediation discussions later today.

cc @isk @devnullisahappyplace

Just a thought:
If we disable OCSP stapling on our servers, we would only move the problem to our users: the browsers/user agents would still try to fetch the OCSP response themselves, presumably resulting in the same error.
(Please correct me if browsers can handle this in a better fashion.)

On one of my servers (which doesn’t support stapling) I fetch the OCSP response manually, save it to a file, and hand that file to nginx. That runs as a cron script every hour.
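For anyone who wants to try the same approach, here is a minimal sketch of such an hourly fetch script. All paths and the responder URL are assumptions for your own setup, and it only swaps in a new response file when the fetch actually succeeds, so a 503 from the responder never clobbers a still-valid cached response:

```shell
#!/bin/sh
# Sketch of an hourly OCSP pre-fetch for nginx's ssl_stapling_file.
# CERT/CHAIN/OUT paths and the responder URL are assumptions; adjust them.
CERT=${CERT:-/etc/nginx/ssl/cert.pem}
CHAIN=${CHAIN:-/etc/nginx/ssl/chain.pem}
OUT=${OUT:-/etc/nginx/ssl/ocsp.der}
URL=${URL:-http://ocsp.int-x3.letsencrypt.org}

STATUS=skipped
if [ -f "$CERT" ] && [ -f "$CHAIN" ]; then
    # Fetch into a temp file; only replace the old response on success.
    TMP=$(mktemp)
    if openssl ocsp -issuer "$CHAIN" -cert "$CERT" -url "$URL" \
            -no_nonce -respout "$TMP" >/dev/null 2>&1; then
        mv "$TMP" "$OUT"
        STATUS=updated
        # Tell the running nginx to pick up the new response.
        nginx -s reload
    else
        rm -f "$TMP"
        STATUS=failed
    fi
fi
echo "ocsp refresh: $STATUS"
```

The output file is what `ssl_stapling_file` would point at in the nginx server block. The `-no_nonce` flag matters here: a CDN-cached responder generally cannot echo back a per-request nonce, so requests with nonces tend to fail verification.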

I’ve had failures for SOME requests (not all) at least since 2018-01-15, but also some months ago (2017-09-22 08:23 UTC).
Going with the hypothesis that this is just a load problem, I simply moved my cron script away from the default on-the-hour execution times, which has resulted in fewer failures (so far; or maybe you just changed something on your side).

Also: please PM me if someone knows how browsers implement redundancy in OCSP requests when an OCSP responder is dead or returns rubbish. Do they try multiple or all hosts behind the CNAME of the OCSP URI in our certificates? What would happen if the CA specified multiple OCSP URIs?
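On the multiple-URIs part of that question: the responder URL(s) come from the certificate’s Authority Information Access (AIA) extension, which may list more than one OCSP URI, and a validator is free to try them in turn. A quick way to see this with openssl (the two responder hostnames below are made up, and the throwaway cert is self-signed purely for illustration):

```shell
# Build a throwaway self-signed cert carrying two (hypothetical) OCSP URIs
# in its AIA extension, then read them back the way a validator would.
cat > /tmp/aia.cnf <<'EOF'
[req]
distinguished_name = dn
x509_extensions = ext
prompt = no
[dn]
CN = example.test
[ext]
authorityInfoAccess = OCSP;URI:http://ocsp-a.example/, OCSP;URI:http://ocsp-b.example/
EOF

openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/aia.key \
    -out /tmp/aia.pem -days 1 -config /tmp/aia.cnf >/dev/null 2>&1

# -ocsp_uri prints every OCSP URI found in the AIA extension, one per line.
openssl x509 -in /tmp/aia.pem -noout -ocsp_uri
```

Whether a client actually falls back to the second URI when the first fails is up to the client; the standard only gives it the list.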

@seanmavley Unfortunately nginx will cache DNS resolutions indefinitely after querying once. Restarting nginx would make it resolve a new IP address for the OCSP server, which may help things.

@cpu @isk @devnullisahappyplace I’m not sure if you’re aware of this nginx behavior, and people are experiencing this with other clients so I doubt stale IPs are responsible, but it could be exacerbating the situation…
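For reference, nginx only re-resolves the responder hostname for stapling fetches if you give it a `resolver` with a capped TTL; without that directive it resolves once and pins the IP. Something along these lines (the resolver IPs, TTL, and chain path are placeholder values):

```nginx
# Re-resolve DNS names (including the OCSP responder) at most every 5 minutes
# instead of caching the first answer indefinitely.
resolver 8.8.8.8 1.1.1.1 valid=300s;
resolver_timeout 5s;

ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/nginx/ssl/chain.pem;
```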

There have actually been several issues that we have been chasing down. The problem affecting users in Central Europe has been solved by Akamai shutting down traffic from their Germany region at 1730 UTC and shifting traffic to their Italy region between 1800 and 1900 UTC. Since the region swap event, we have not received any origin connection failure alerts. A separate, but seemingly related, issue regarding OCSP responder timeouts has also been fixed.

Can you please confirm that you’re seeing successful OCSP lookups for your domains?


I haven’t faced any issues in the last 36 hours in the EU. However, I faced a similar issue in the US on Jan 23 05:06 PST (about 3 errors around that time).

Sorry, but we are still seeing the 503s (based in Germany). On the bright side, I am no longer able to replicate this on the CLI :confused: Maybe I did not try hard enough (a few hundred requests)?

At first I restarted nginx, then I started playing with the nameservers (originally we were using 8.8.8.8).

Google currently resolves ocsp.int-x3.letsencrypt.org to

a771.dscq.akamai.net.
2.18.212.56
2.18.212.72

… while my local ISP gives me…

a771.dscq.akamai.net.
2.20.189.244
2.20.190.17

Tcpdump was able to get me some references for/from Akamai:
From: 2.20.190.17
Reference #102.27d4dd58.1516779467.46d5c8d
Reference #102.16d4dd58.1516779897.e7dfba
From: 2.20.189.244
Reference #102.16d4dd58.1516780015.e86d3e

Side Note 1: Browsing through the log, the times of the 503s seem to be “clustered” together.
Side Note 2: The “timeout thing” did not affect us, only 503.
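In case it’s useful to others chasing that clustering: you can bucket the error-log lines by minute and count them. The sample log lines below are fabricated for illustration, matching the Apache-style timestamps posted in this thread:

```shell
# Count OCSP 503 log lines per minute to see whether they cluster in time.
# The sample log below is made up for illustration.
cat > /tmp/ocsp-errors.log <<'EOF'
[Wed Jan 24 09:38:40.506794 2018] [ssl:error] AH01980: bad response from OCSP server: 503 Service Unavailable
[Wed Jan 24 09:38:41.102938 2018] [ssl:error] AH01980: bad response from OCSP server: 503 Service Unavailable
[Wed Jan 24 19:26:02.986209 2018] [ssl:error] AH01980: bad response from OCSP server: 503 Service Unavailable
EOF

# Fields 2-3 are month/day, field 4 is the time; keep only HH:MM as the bucket.
grep '503 Service Unavailable' /tmp/ocsp-errors.log \
    | awk '{print $2, $3, substr($4, 1, 5)}' | sort | uniq -c
```

Point it at the real error log instead of the fabricated one and spikes in a single minute stand out immediately.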

Currently I am still hoping this might be a DNS issue but I seriously doubt it.

For my box in Amsterdam, the issue rears its ugly head only on weekends. So Saturday/Sunday; we’ll see how it goes.

The other box in London hasn’t had a single issue since this OCSP thing started. Time will tell.

That seems to be working for me, at least for the past 4 weeks. Everything comes back to normal immediately after nginx restart.

Since switching SSLUseStapling back to on about 2 hours ago I’ve had one 503 in my logfiles:

[Wed Jan 24 09:38:40.506794 2018] [ssl:error] [pid 2322:tid 139904619071232] [client 192.168.0.3:28818] AH01980: bad response from OCSP server: 503 Service Unavailable

Too early to tell whether that’s the same frequency of error messages I was seeing in my logfiles a few days ago, when I last had stapling switched on.

Using manual openssl commands to check the response, I’ve only got OK messages back over the last couple of hours, whereas a few days ago I would get a 503 in roughly 1 out of every 8 or so attempts. So that has certainly improved.

Thanks for the responses regarding the 503s. We are continuing to work with Akamai. We believe the two issues devnull mentioned would both manifest to end users as 503s from Akamai as in both cases Akamai believes they were unable to get a valid response from our origin servers.

@m.sanjay94 Right now we believe the errors you saw in the US were likely due to the responder timeouts issue, which is resolved. Everyone, please let us know if you are currently seeing these issues outside of Europe.

@seanmavley It is really strange to me that this would only happen on weekends. Would you mind posting here or in DM the IP that Akamai would see that traffic coming from?

@rleeden we’re definitely interested in the pattern of 503s you see. Would you mind posting here or in DM the IP that Akamai would see the traffic coming from?

@isk I received the same response I mentioned earlier, in both the US and EU, about 5 hours back.

‘An error occurred while processing your request.
Reference #102.27d4dd58.1516846269.6f87e2a’

I don’t think these are the timeouts you mentioned. Do you want my outgoing request’s IP to help investigate the issue further?

Yes, please send the Akamai IP you are getting as well as the IP you are coming from. That reference number is also helpful.

If you got the issue only 5 hours ago, it probably means Akamai is having a problem outside of Europe as well.

@ibehm Not really the greatest news, but many browsers don’t actually check OCSP on browsing to a site. Chrome has its own curated CRL and won’t check CA maintained OCSP or CRL. Firefox currently checks only OCSP by default then fails open if it can’t get a response. Last I checked, Microsoft does check OCSP, then the CRL if OCSP fails and it may or may not then warn that there is an issue. Safari doesn’t check OCSP. Opera checks, but also fails open on no response, I believe.

So, in general, if OCSP is unavailable most browsers will happily assume the certificate is fine. The browser with the largest market share doesn’t check OCSP at all. This is one reason Let’s Encrypt believes 90 day certs are so important. Shortening the lifetime is the only reliable way of making sure a compromised cert is unusable.


@isk I’ve only received one more 503 since my last post:

[Wed Jan 24 19:26:02.986209 2018] [ssl:error] [pid 25266:tid 139905301694208] [client 162.197.232.103:59521] AH01980: bad response from OCSP server: 503 Service Unavailable

My IP is 77.96.80.209.

@isk Just a short update… we got 318 503 errors in the last hour (Germany). I provided some details via PN earlier.

I am still getting random OCSP server errors from Nagios monitoring my SSL certificates from Let’s Encrypt. Servers in the UK.

ocsp.int-x{1…4}.letsencrypt.org resolving to 2.21.67.*

My server IP is 89.145.86.110.

Update, now working again, ocsp server IPs now 2.22.146.*

I’ve noticed that while running certbot-auto I tend to get 503s quite often on my sites that get very infrequent traffic. If you need any info let me know.

The infrequent traffic component seems to fit well with some of what we’re looking at now. Those sites would be less likely to already be in the cache.

We have made some progress, I am still looking at this and will hopefully have another update soon.


Evening update: We’ve spent the day troubleshooting the internet. At present there is a lot of “it’s not our problem, it’s their problem” going on, and undoubtedly at least one of the potentially responsible parties is probably mostly right. Right now, we’ve got a plan to try something that may be somewhat disruptive in about 13 hours and we will be prepping to make that as non-disruptive as possible. That may fix this issue or it may just reduce the chance that it is being caused by that particular piece of the environment.

We’re working on it. We’ll update here and ask you to let us know how that turns out for you.

Just for the curious among you: from CDN logs, the problem does still seem to occur primarily with traffic coming from Europe, for no reason we’ve found yet, although we have been able to determine it occasionally happens with traffic from Asia and North America as well.


Okay, we’ve made a change that will help us pin this down. It might even temporarily fix it. Can I get a sound off over the next little while from my loyal testers/victims here as to whether there is a change in the frequency of 500s starting roughly 5 minutes ago?