OCSP Request failed with following message

Since 13.01.2018 12:59 our web server logs (NGINX server in Germany) are also being flooded with:
OCSP responder sent invalid “Content-Type” header: “text/html” while requesting certificate status, responder: ocsp.int-x3.letsencrypt.org, peer: 2.16.186.27:80

In case anyone is wondering about a temporary and hacky solution, I’ve added to /etc/hosts:

63.243.228.17   ocsp.int-x3.letsencrypt.org

So far, I’ve had no errors for 3 days. We need to follow this up (and remove this dirty fix when things go back to normal), so indeed, as K.A.B mentioned, it would be really nice to have some visibility on https://letsencrypt.status.io/


It’s safe to say that Akamai’s servers are still not OK :-(

# cd /etc/letsencrypt/live/status.dogsbody.com
# openssl ocsp -issuer chain.pem -cert cert.pem -text -url http://ocsp.int-x3.letsencrypt.org
OCSP Request Data:
    Version: 1 (0x0)
    Requestor List:
        Certificate ID:
          Hash Algorithm: sha1
          Issuer Name Hash: 7EE66AE7729AB3FCF8A220646C16A12D6071085D
          Issuer Key Hash: A84A6A63047DDDBAE6D139B7A64565EFF3A8ECA1
          Serial Number: 0356E5FA189305B3BA1BC2C8D13D993D83F8
    Request Extensions:
        OCSP Nonce: 
            041047D59827ABED1246FAB062C7E3A7C30F
Error querying OCSP responder
140474384783000:error:27076072:OCSP routines:PARSE_HTTP_LINE1:server response error:ocsp_ht.c:314:Code=400,Reason=Bad Request

A host lookup on ocsp.int-x3.letsencrypt.org from the affected server…

# host ocsp.int-x3.letsencrypt.org
ocsp.int-x3.letsencrypt.org is an alias for ocsp.int-x3.letsencrypt.org.edgesuite.net.
ocsp.int-x3.letsencrypt.org.edgesuite.net is an alias for a771.dscq.akamai.net.
a771.dscq.akamai.net has address 92.123.64.234
a771.dscq.akamai.net has address 92.123.64.201
a771.dscq.akamai.net has IPv6 address 2a02:26f0:e8::6856:6fb0
a771.dscq.akamai.net has IPv6 address 2a02:26f0:e8::6856:6f88
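Since the answer depends on which resolver you ask, it can help to compare the edge addresses seen from different servers. Here is a minimal Python sketch that parses `host` output like the above into aliases and addresses so two lookups can be compared. The function name `parse_host_output` is made up for illustration, and the parsing assumes the output format shown here.

```python
import re


def parse_host_output(text):
    """Split `host` output into CNAME aliases and IPv4/IPv6 addresses."""
    aliases, ipv4, ipv6 = [], [], []
    for line in text.splitlines():
        line = line.strip()
        # e.g. "name is an alias for target." (trailing dot is dropped)
        m = re.match(r"^(\S+) is an alias for (\S+?)\.?$", line)
        if m:
            aliases.append((m.group(1), m.group(2)))
            continue
        # Check the IPv6 form first, since it is the more specific phrase.
        m = re.match(r"^\S+ has IPv6 address (\S+)$", line)
        if m:
            ipv6.append(m.group(1))
            continue
        m = re.match(r"^\S+ has address (\S+)$", line)
        if m:
            ipv4.append(m.group(1))
    return {"aliases": aliases, "ipv4": ipv4, "ipv6": ipv6}


SAMPLE = """\
ocsp.int-x3.letsencrypt.org is an alias for ocsp.int-x3.letsencrypt.org.edgesuite.net.
ocsp.int-x3.letsencrypt.org.edgesuite.net is an alias for a771.dscq.akamai.net.
a771.dscq.akamai.net has address 92.123.64.234
a771.dscq.akamai.net has address 92.123.64.201
a771.dscq.akamai.net has IPv6 address 2a02:26f0:e8::6856:6fb0
a771.dscq.akamai.net has IPv6 address 2a02:26f0:e8::6856:6f88
"""

if __name__ == "__main__":
    print(parse_host_output(SAMPLE))
```

Running the same parse on output from a second server and comparing the `ipv4` lists as sets shows whether both are being sent to the same Akamai edge nodes.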

Please at least update your status page to show that this is an ongoing issue :-/


We have set SSLUseStapling off in our Apache config for now, and the “bad response from OCSP server: 503 Service Unavailable” error has been gone from the Apache logs since then.

I have visited three websites in the last few days where Firefox wouldn’t allow me to access the site due to OCSP being unavailable. I have no control over how they set up their servers :-/

I’ve also been seeing a lot of these errors. Here is an example from an Apache error log from this morning:

[Tue Jan 23 10:08:05.314862 2018] [ssl:error] [pid 23183] AH01941: stapling_renew_response: responder error

I’m using the Mozilla recommended settings:

SSLUseStapling          on
SSLStaplingResponderTimeout 5
SSLStaplingReturnResponderErrors off
SSLStaplingCache        shmcb:${APACHE_RUN_DIR}/ocsp(128000)

Is there a way to improve this configuration in order to mitigate the current situation where the Let’s Encrypt OCSP servers are rather unreliable?


Yes, you can turn stapling off in the interim to remove the server’s dependency on the OCSP servers.

SSLUseStapling off

There may be some way to keep stapling on and tune it to deal with the errors better, but I am not sure that an acceptable configuration is possible with the way Apache currently works.

Thanks @_az, if there is no better configuration then I guess disabling it on all our servers is the best option, since the only other option would be to advise clients to disable it in their web browsers :-(

Thanks for the suggestion - our operations team has opened a status page incident for this: https://letsencrypt.status.io/pages/incident/55957a99e800baa4470002da/5a6753733800d404cc4ea7db

I’m hopeful someone will be able to update this thread with more information about the remediation discussions later today.

cc @isk @devnullisahappyplace

Just a thought:
If we disable OCSP stapling on our servers, we would only move the problem to our users.
The browsers/user agents would still try to fetch the OCSP response, presumably resulting in the same error.
(please correct me if browsers can handle this in a better fashion)

On one of my servers (it doesn’t support stapling) I fetch the OCSP response manually into a file and provide that to my nginx. That process runs as a cron script every hour.

I had failures of SOME requests (not all) at least since 15.1.18. But also some months ago (22.09.17 08:23UTC).
So going with the hypothesis that this is just a load problem, I simply changed the execution-time of my cron script to not-so-defaulty-times, resulting in fewer failures (so far - or maybe you just changed something on your side).
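For anyone wanting to try the same approach, here is a rough Python sketch of such a cron job: it waits a random delay so the request does not land exactly on the hour, then fetches the response with `openssl ocsp` and only replaces the existing file when the fetch succeeds. This is not the script described above, just an illustration. The function names, the 15-minute maximum delay and the `.tmp` suffix are assumptions, and the `-header Host=...` form assumes OpenSSL 1.1.0 or later (older versions take the name and value as two separate arguments).

```python
import os
import random
import subprocess
import time
from urllib.parse import urlparse


def build_ocsp_command(issuer, cert, url, respout, send_host_header=True):
    """Build the argument list for one `openssl ocsp` query."""
    cmd = [
        "openssl", "ocsp",
        "-issuer", issuer,
        "-cert", cert,
        "-url", url,
        "-no_nonce",          # a stapled response is reused, so no nonce
        "-respout", respout,  # write the DER-encoded response here
    ]
    if send_host_header:
        # Assumption: OpenSSL 1.1.0+ syntax (name=value).
        cmd += ["-header", "Host=" + urlparse(url).hostname]
    return cmd


def jitter_seconds(max_delay=900.0, rng=random):
    """Pick a random delay so runs do not pile up at the top of the hour."""
    return rng.uniform(0, max_delay)


def refresh_ocsp_file(issuer, cert, url, target, max_delay=900.0):
    """Fetch a fresh response; keep the old file if the fetch fails."""
    time.sleep(jitter_seconds(max_delay))
    tmp = target + ".tmp"
    result = subprocess.run(
        build_ocsp_command(issuer, cert, url, tmp),
        capture_output=True, text=True,
    )
    if result.returncode != 0 or not os.path.exists(tmp):
        return False
    os.replace(tmp, target)  # atomic swap on the same filesystem
    return True
```

Calling `refresh_ocsp_file("chain.pem", "cert.pem", "http://ocsp.int-x3.letsencrypt.org", "ocsp.der")` from cron would be one way to use it. A real script would probably also want to check the certificate status in the response before swapping the file in, since a zero exit code alone does not guarantee a good response.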

Also: Please PM me if someone knows how browsers implement redundancy in OCSP requests when an OCSP responder is dead or returns rubbish. Do they try multiple or all hosts behind the CNAME of the OCSP URI in our certificates? What would happen if the CA specifies multiple OCSP URIs?

@seanmavley Unfortunately nginx will cache DNS resolutions indefinitely after querying once. Restarting nginx would make it resolve a new IP address for the OCSP server, which may help things.

@cpu @isk @devnullisahappyplace I’m not sure if you’re aware of this nginx behavior, and people are experiencing this with other clients so I doubt stale IPs are responsible, but it could be exacerbating the situation…

There have actually been several issues that we have been chasing down. The problem affecting users in Central Europe has been solved by Akamai shutting down traffic from their Germany region at 1730 UTC and shifting traffic to their Italy region between 1800 and 1900 UTC. Since the region swap event, we have not received any origin connection failure alerts. A separate, but seemingly related, issue regarding OCSP responder timeouts has also been fixed.

Can you please confirm that you’re seeing successful OCSP lookups for your domains?


I haven’t faced any issue in the last 36 hours in the EU. However, I faced a similar issue in the US on ‘Jan 23 05:06 PST’ (about 3 errors around the same time).

Sorry, but we are still seeing the 503s (based in Germany). On the bright side, I am no longer able to replicate this on the CLI :-/ Maybe I did not try hard enough (a few hundred requests)?

At first I restarted nginx, then I started playing with the nameservers (originally we were using 8.8.8.8).

Google currently resolves ocsp.int-x3.letsencrypt.org to

a771.dscq.akamai.net.
2.18.212.56
2.18.212.72

... while my local ISP gives me...

a771.dscq.akamai.net.
2.20.189.244
2.20.190.17

Tcpdump was able to get me some references for/from Akamai:
From: 2.20.190.17
Reference #102.27d4dd58.1516779467.46d5c8d
Reference #102.16d4dd58.1516779897.e7dfba
From: 2.20.189.244
Reference #102.16d4dd58.1516780015.e86d3e

Side Note 1: Browsing through the log, the times of the 503s seem to be "clustered" together.
Side Note 2: The "timeout thing" did not affect us, only 503.

Currently I am still hoping this might be a DNS issue but I seriously doubt it.

For my box in Amsterdam, the issue rears its ugly head only on weekends, so Saturday and Sunday. We will see how it goes.

The other box in London hasn't had a single issue since this OCSP thing started. Time will tell.

That seems to be working for me, at least for the past 4 weeks. Everything comes back to normal immediately after nginx restart.

Since switching SSLUseStapling back to on about 2 hours ago I’ve had one 503 in my logfiles:

[Wed Jan 24 09:38:40.506794 2018] [ssl:error] [pid 2322:tid 139904619071232] [client 192.168.0.3:28818] AH01980: bad response from OCSP server: 503 Service Unavailable

It’s too early to tell whether that’s the same frequency as the errors I was seeing a few days ago, when I last had stapling switched on.

Using manual openssl commands to check the response, I’ve only got OK messages back over the last couple of hours, whereas a few days ago I would get a 503 roughly once in every 8 attempts. So that has certainly improved.
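To compare the error frequency before and after a change with numbers rather than by eye, something like the following Python sketch could count stapling errors per hour in an Apache error log. It keys on the `AH01980` and `AH01941` codes seen in the log lines posted in this thread; the function name and the hourly bucketing are just illustrative choices, and the timestamp parsing assumes the log format shown above.

```python
import re
from collections import Counter
from datetime import datetime

# Leading timestamp, e.g. "[Wed Jan 24 09:38:40.506794 2018]"
STAMP = re.compile(r"^\[(\w{3} \w{3} +\d+ [\d:.]+ \d{4})\]")
# Error codes taken from the log lines quoted in this thread.
MARKERS = ("AH01980", "AH01941")


def stapling_errors_per_hour(lines):
    """Count OCSP stapling error lines, grouped by hour."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        m = STAMP.match(line)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S.%f %Y")
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return counts


SAMPLE_LINES = [
    "[Tue Jan 23 10:08:05.314862 2018] [ssl:error] [pid 23183] "
    "AH01941: stapling_renew_response: responder error",
    "[Wed Jan 24 09:38:40.506794 2018] [ssl:error] [pid 2322:tid 139904619071232] "
    "[client 192.168.0.3:28818] AH01980: bad response from OCSP server: "
    "503 Service Unavailable",
]

if __name__ == "__main__":
    for hour, count in sorted(stapling_errors_per_hour(SAMPLE_LINES).items()):
        print(hour, count)
```

Feeding it the lines of a real error log (for example via `open(path)`) would also show whether the errors cluster at particular times.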

Thanks for the responses regarding the 503s. We are continuing to work with Akamai. We believe the two issues devnull mentioned would both manifest to end users as 503s from Akamai as in both cases Akamai believes they were unable to get a valid response from our origin servers.

@m.sanjay94 Right now we believe the errors you saw in the US were likely due to the responder timeouts issue, which is resolved. Everyone, please let us know if you are currently seeing these issues outside of Europe.

@seanmavley It is really strange to me that this would only happen on weekends. Would you mind posting here or in DM the IP that Akamai would see that traffic coming from?

@rleeden we’re definitely interested in the pattern of 503s you see. Would you mind posting here or in DM the IP that Akamai would see the traffic coming from?

@isk I received the same response as I mentioned earlier, in both the US and the EU, about 5 hours ago.

‘An error occurred while processing your request.
Reference #102.27d4dd58.1516846269.6f87e2a’

I don’t think these are the timeouts you mentioned. Do you want my outgoing request’s IP to investigate the issue further?

Yes, please send the Akamai IP you are getting as well as the IP you are coming from. That reference number is also helpful.

If you got the issue only 5 hours ago, it probably means Akamai is having a problem outside of Europe as well.