OCSP responder timed out

Since Feb 19, the following has been written to the nginx/1.22.1 error log:

[error] 1429815#1429815: OCSP responder timed out (110: Connection timed out) while requesting certificate status, responder: r3.o.lencr.org, peer: [2a02:26f0:ee00::217:5fa8]:80, certificate: "/etc/ssl/private/hopyard.pem"

[error] 1429815#1429815: OCSP responder timed out (110: Connection timed out) while requesting certificate status, responder: r3.o.lencr.org, peer: [2a02:26f0:ee00::217:5fa1]:80, certificate: "/etc/ssl/private/hopyard.pem"

This message is repeated for both 2a02:26f0:ee00::217:5fa1 and 2a02:26f0:ee00::217:5fa8 every hour or so. letsencrypt.status.io says everything is fine. The Nginx configuration has not changed:

ssl_stapling            on;
ssl_stapling_verify     on;

I have NOT set an ssl_stapling_responder URL, as the responder can be extracted from the certificate just fine:

$ openssl x509 -in /etc/ssl/private/hopyard.pem -noout -text | grep -A2 'Authority Information'
            Authority Information Access: 
                OCSP - URI:http://r3.o.lencr.org
                CA Issuers - URI:http://r3.i.lencr.org/

....or so I thought, because:

$ getent hosts r3.o.lencr.org
2a02:26f0:300::58dd:da32 a1887.dscq.akamai.net
2a02:26f0:300::58dd:da09 a1887.dscq.akamai.net

...these IP addresses look very different from the addresses reported above.
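
For completeness: if I ever did want to pin the responder explicitly instead of relying on the certificate, my understanding is it would be a single directive in the server block - just a sketch, not something I have configured:

    # hypothetical explicit override of the OCSP responder URL that is otherwise
    # taken from the certificate's Authority Information Access extension
    ssl_stapling_responder http://r3.o.lencr.org;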

This topic turns up every now and then, and I have seen these errors in my webserver logs before. I think the last time they just vanished after a few days, and I suspect the same will happen this time.

Does anybody know of current problems with the OCSP responder that are not shown on letsencrypt.status.io?

Also, does anybody know why the Nginx documentation recommends setting an explicit resolver directive, thus overriding the (working) system resolver? I know this would be better asked in an Nginx discussion forum, but maybe someone here knows about this anyway.

Thanks for any ideas on this topic.

The different IPs are likely because you're hitting different CDN nodes. I don't have a lot of visibility into how the CDN vendor manages its IPs to steer traffic around, but all the IPs you listed belong to Akamai.

Overall our success rates on OCSP seem well within normal and our monitoring says we're healthy, but I'll take a bit of a deeper look at performance. We do heavily rely on caching for OCSP, as the traffic is very much above what we could ever hope to handle within the CA infrastructure itself, so there are definitely times we slow down a bit with heavier load.

Do you know what timeout value you're hitting? (ie, 1 second?)

6 Likes

nginx doesn't use the glibc/system resolver; it uses its own DNS resolver. There is a fallback to the resolvers found in /etc/resolv.conf, but that might not work properly in all cases (DNS architecture on Linux is fairly complicated and various setups are in use, e.g. systemd-networkd can be used entirely over D-Bus). Having an explicit resolver specified may therefore be required for nginx to resolve a name correctly. It can also be important in split-DNS scenarios, where only certain resolvers can resolve a given name.
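
For example, a minimal sketch of what that looks like in the http or server block - the address below is just a placeholder for whatever resolver actually serves your host (e.g. the systemd-resolved stub):

    # placeholder address - point this at your actual local or upstream resolver
    resolver 127.0.0.53;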

6 Likes

Two main reasons I have been told in the past:

  1. The resolver directive gives fine-grained control - not just over which resolver to use, but over how to use it (see the sketch after this list). This means a multi-node application's configuration lives within nginx - which is ideal - and does not require configuring the operating systems themselves.
  2. Local system DNS resolvers tend to be blocking, which can cause dogpiles and timeouts in high-performance systems. Nginx is designed for high-throughput multiprocessing and simultaneous connections, so this design choice avoids congestion issues.
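
A rough sketch of the kind of fine-grained control meant in point 1 - every address and value below is made up for illustration:

    # choose exactly which servers nginx queries and in what order,
    # skip IPv6 lookups if they cause trouble, and override how long
    # answers are cached (instead of honouring the DNS TTL)
    resolver 192.0.2.53 192.0.2.54 ipv6=off valid=60s;
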
6 Likes

Wow, so many responses so quickly, I didn't expect that :slight_smile:

Overall our success rates on OCSP seem well within normal and our monitoring says we're healthy, but I'll take a bit of a deeper look at performance. We do heavily rely on caching for OCSP, as the traffic is very much above what we could ever hope to handle within the CA infrastructure itself, so there are definitely times we slow down a bit with heavier load. Do you know what timeout value you're hitting? (ie, 1 second?)

OK, thanks for checking, but if nobody else complains, please don't waste more time on this. I don't know the (internal) timeout that Nginx hits here, but your comment made me consider adding the ssl_ocsp_cache option to maybe lessen the load on the OCSP responders.

nginx doesn't use glibc/system resolvers, but uses its own DNS resolver. There is a fallback to resolvers found in /etc/resolv.conf, but that might not work properly in all cases

OK, I didn't know that. I've never set the resolver option before, as everything else seemed to just work. But reading up on why people set this directive may even explain what's happening here:

but the default behaviour of the Nginx upstream and proxy modules is to resolve DNS at startup, but not to re-resolve DNS during the operation of the application.

So, maybe the address of r3.o.lencr.org has changed (and it looks like it has), but Nginx never re-resolves it and logs these errors instead. I'm indeed using systemd-networkd and, as mentioned, this seems to work just fine, but I'll try the resolver directive and see if the messages go away. Actually, I've already restarted nginx, and of course it re-resolved the names, so... that "helped" already and the messages are gone now :person_facepalming:
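
As an aside, the workaround I've seen mentioned for that upstream/proxy behaviour is to combine resolver with a variable, so the name is looked up at request time instead of only at startup - roughly like this, with backend.example.com and the resolver address as pure placeholders:

    resolver 192.0.2.53 valid=30s;

    location / {
        # using a variable forces nginx to resolve the name at request time
        # (subject to the resolver cache) instead of once at startup
        set $backend_host backend.example.com;
        proxy_pass http://$backend_host;
    }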

Regarding the last response there: understood, thank you. However, in this case neither multi-node applications nor anything close to "high-throughput multiprocessing" are in play - this is a really idle webserver for personal use :slight_smile:

Thank you all for responding!

5 Likes

We pushed a change just this morning that is intended to lower latency and give better observability for our new OCSP responder infrastructure, unrelated to the timeouts you're seeing... I'm joining Matthew in looking through performance data, trying to figure out where we might be hitting issues.

There weren't changes over the weekend, but we are actively making adjustments to this infrastructure -- it's comparatively very new. Nothing we're seeing implies that our 99th-percentile response times are ever, even briefly, above 2.4s. That said, 2.4s is still a problem we're working on with changes like the one this morning. Our 90th-percentile latency within our DCs is more like 1.1s (and the 50th percentile is 3.1ms):

[graph: OCSP responder latency, p50/p90/p99]

I know nginx doesn't let you set stapling timeouts like Apache does; I don't remember how tight theirs is, but I'd be surprised if 2.4s was hitting it. What that tells me is that something is also getting slow someplace that isn't instrumented (cough cough, CDN).

Thank you for posting about this. We're working on smoothing out those bumps in the p99 graph above.

8 Likes

FWIW, there is a configurable resolver_timeout

OCSP connect/read/write timeouts were originally hardcoded to 60s, with responses cached for 1 hour. I do not know if these were ever changed. See http://nginx.org/patches/attic/ocsp-stapling/README.txt
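
For example (the address is a placeholder and the value made up; the default is 30s):

    resolver 192.0.2.53;
    # give up on a DNS lookup after 5s instead of the 30s default
    resolver_timeout 5s;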

Akamai is weird. Personally, I avoid them as a vendor because of billing issues, but their service is just weird and hard to debug around, especially with automated tools. For example (I don't know if they still do this): error pages generated on their edge servers would be sent as plaintext but with headers indicating zip compression. Anyway, good luck. You'll need it.

5 Likes

Looks like I spoke too soon; this wasn't slowness-to-timeout in our infra, but a DNS resolution timeout [1]. Still, my graph above is a focus, and I've some ideas to try and pull the p99 latency down during high-load times.

[1] Of course

6 Likes

We pushed a change just this morning that is intended to lower latency and give better observability for our new OCSP responder infrastructure

Impressive to see how big an infrastructure is needed to provide "just" an OCSP service!

FWIW, there is a configurable resolver_timeout

Yes, now that I've set the resolver directive, I'll see if the resolver_timeout default of 30 seconds works for me before tuning further.

[1] Of course

Again, thanks for all the responses on this topic. Ever since I restarted nginx yesterday evening the timeout messages have been gone, and thanks to check_logfiles I will notice if they reappear, and maybe report back here.

5 Likes

Slight note on this: in your original post, here's a snippet of the error message:

OCSP responder timed out (110: Connection timed out) while requesting certificate status, responder: r3.o.lencr.org, peer: [2a02:26f0:ee00::217:5fa8]:80

(note the "Connection timed out" part)

"Connection timed out" refers to a TCP connection, and a TCP connection can only be established after a DNS lookup succeeds. And indeed right there in the error message, there's an IP address.

So this isn't a timeout looking up DNS. I think you were on the right track earlier:

So, maybe the address of r3.o.lencr.org has changed (and it looks like it has), but Nginx never re-resolves it and logs these errors instead.

To add detail: CDNs will rotate out frontends over time. They do that by removing those frontend IP addresses from DNS and expecting traffic to stop. If an application keeps sending traffic to those frontends after they are taken down, that application would see TCP connection timeouts (or, less likely, connection refused). So that's compatible with the symptoms you observed.

It's also the case that restarting Nginx might make these symptoms go away for days or weeks, since that causes a re-resolution of DNS, and Akamai might not rotate out the particular frontends you are talking to for a long time.

but the default behaviour of the Nginx upstream and proxy modules is to resolve DNS at startup, but not to re-resolve DNS during the operation of the application.

In the quote above, note the words upstream and proxy. The article you cited is talking about those modules, not the ssl module. However, I just confirmed that the OCSP directives also use resolver:

For a resolution of the OCSP responder hostname, the resolver directive should also be specified.

So setting the resolver directive is definitely correct, and I think it will solve your problem.

To synthesize what's going on here: the system resolver (e.g. getaddrinfo()) is blocking. Almost everything in Nginx is non-blocking (also known as async), so that Nginx can achieve high-throughput multiprocessing (whether you need it or not). Blocking and non-blocking are like oil and water. Calling a blocking API in a thread that is also doing non-blocking work will stall out that other work unpredictably.

So, presumably, Nginx came up with a simple solution early on: there are very few things in Nginx that require name resolution, and most of them are statically configured, as the mailing list post you shared mentions. So Nginx can look them up once at startup, using the system resolver in its blocking style, without worrying about slowing down any live traffic. But it can never look them up again or it will risk stalling many live requests.

That turns out to be inconvenient for proxy_pass, and downright incorrect for ssl_stapling. It would be better to do DNS lookup in a non-blocking way. There's literally no way to do this with the system resolver (argh!), so Nginx has to make the DNS queries itself, as a regular UDP service. Doing that requires that Nginx knows where the DNS resolver is, which is why it takes a whole separate directive to specify. So when the resolver directive is present, it's feasible for Nginx to do re-resolution while the server is running, because it becomes a regular network task that can be run non-blocking.

That's why the resolver directive is required for correct operation of ssl_stapling.

14 Likes

"Connection timed out" refers to a TCP connection, and a TCP connection can only be established after a DNS lookup succeeds. And indeed right there in the error message, there's an IP address. So this isn't a timeout looking up DNS.

Yes, of course, thank you for clarifying: the stale DNS entry and the ensuing connection attempt to a no-longer-active CDN node are indeed two different things.

That's why the resolver directive is required for correct operation of ssl_stapling.

Makes sense. I've now used both ssl_ocsp_cache and resolver, and I'll keep an eye on those logs. Thanks!
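
For the record, the stapling-related part of my config now looks roughly like this - the resolver address is a placeholder for my local resolver, and the cache zone name and size are picked arbitrarily:

    ssl_stapling        on;
    ssl_stapling_verify on;
    # re-resolve r3.o.lencr.org via the local resolver instead of only at startup
    resolver            127.0.0.53 valid=300s;
    # note: the docs describe ssl_ocsp_cache as a cache for client-certificate
    # OCSP validation (ssl_ocsp), so whether it helps with stapling is unclear to me
    ssl_ocsp_cache      shared:OCSP:10m;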

3 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.