Intermittent problem with certificates

I have a weird problem with certificates on all of my websites that use Let's Encrypt. When I browse one of my sites, I occasionally get "certificate date not valid", but refreshing the page with F5 a few times fixes it. Refreshing several more times in a row may bring the certificate error back. When an HTTPS page load fails, I can click the error message in the browser and see that an outdated certificate was served. When the page loads fine, the certificate date is OK.

All this started about 3 months ago. I've used Let's Encrypt certificates for about 2 years in total and never had a problem before. I am located in Moscow, Russia. I hear that people have experienced this behavior because of erroneous IP blocks imposed by Russian authorities, but if that is the case I need to confirm it and rule out my own configuration as the cause. I also keep getting error notifications from Yandex (a major Russian search engine/indexing service).

Thank you.

Domain: www.gsm-plus.ru, www.plenk.ru, www.mixmo.ru
Web server: Apache/2.4.18 (Ubuntu)
OS: Linux 4.4.0-179-generic
Hosting provider: Leaseweb DE, virtual server
Root login: Yes
Client: Certbot 0.28.0

When the certificate date shows up incorrectly, get a copy of the entire cert in use (not just the date).

I think the problem is that your webserver has orphaned processes which have an old version of the certificate loaded:

$ openssl s_client -connect  www.mixmo.ru:443  -showcerts 2>/dev/null  | openssl x509 -noout -dates
notBefore=Apr 14 19:01:23 2020 GMT
notAfter=Jul 13 19:01:23 2020 GMT

$ openssl s_client -connect  www.mixmo.ru:443  -showcerts 2>/dev/null  | openssl x509 -noout -dates
notBefore=Aug 15 23:00:42 2020 GMT
notAfter=Nov 13 23:00:42 2020 GMT

In the above, you can see I connect twice to the domain. In the first case I see an expired certificate, in the second case I see the unexpired one.

What happens then is that when a new connection comes in, that connection has a 1/N chance of connecting to one of the orphaned workers, and getting the old certificate.
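
If you want to put a number on that 1/N, here is a rough Python sketch that opens a fresh connection many times and counts how often each certificate is served. The function names and the default of 20 attempts are my own choices, not anything from certbot or Apache, and certificate verification is switched off deliberately so that the expired certificate is still returned rather than raising an error.

```python
import hashlib
import socket
import ssl
from collections import Counter


def fetch_cert_fingerprint(host, port=443, timeout=5.0):
    """Return the SHA-256 fingerprint of the leaf certificate the server presents."""
    ctx = ssl.create_default_context()
    # Verification is disabled on purpose: we want to see expired certs too.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def tally_certs(host, attempts=20, fetch=fetch_cert_fingerprint):
    """Open `attempts` fresh connections and count how often each cert is seen."""
    counts = Counter()
    for _ in range(attempts):
        counts[fetch(host)] += 1
    return counts
```

For example, tally_certs("www.mixmo.ru") returning two different fingerprints would confirm that two certificates are being served side by side, while a single fingerprint suggests all workers agree.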

Why does this happen? Not sure.

How can you fix it? Stop your webserver. Then make sure you kill any leftover webserver processes with killall -9. Or reboot your server entirely to make sure they're all gone.

That means it has probably been like that since before June 13.
Check all the available resources before you reboot it, for example:
top
free
df -h
[anything else you can think of]

Then compare them after the reboot.

I got this old one:

_az, I think you pointed me in the right direction, thank you! I just rebooted the server and it looks good now. Is it possible to check the Apache2 logs to see the actual path of the certificates served? However, I don't understand how this happens, since it is not possible to have more than one server listening on the same port. That would mean it's actually Apache that randomly serves the old certificate.

The certificates are served from memory, not from disk. I'm not sure there's any straightforward way to inspect that.

It's not quite true that only one process can listen on a given port. It depends on how Apache is configured, but the simple version is: the master Apache process creates the listening socket, then forks a bunch of workers. These workers all share the same socket and accept clients from it.

When the Apache master process reloads its configuration (say, after a certificate renewal), it forks some new workers using the new configuration, waits for the old workers to finish up, and kills them. (Somewhat oversimplified and probably incorrect).

If those old workers (for some reason) don't get killed properly and continue to take new connections from the listening socket, you end up seeing some connections being served on the old configuration, and others on the new configuration.
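
To see that this works outside of Apache, here is a toy Python sketch (Linux/Unix only, and nothing like Apache's real code): a parent creates one listening socket, forks two workers that stand in for an "old" and a "new" generation, and each worker answers with its own label. The labels, function names and sample count are made up for illustration.

```python
import os
import signal
import socket


def spawn_worker(listener, label):
    """Fork a child that answers every accepted connection with `label`."""
    pid = os.fork()
    if pid == 0:  # child: it inherited the listening socket from the parent
        try:
            while True:
                conn, _ = listener.accept()
                conn.sendall(label)
                conn.close()
        finally:
            os._exit(0)
    return pid


def ask(port):
    """Open a fresh connection and return whatever the worker sends back."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        return client.recv(64)


def demo(samples=50):
    """Return the set of labels seen over `samples` fresh connections."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    old = spawn_worker(listener, b"old cert")  # stands in for a pre-reload worker
    new = spawn_worker(listener, b"new cert")  # stands in for a post-reload worker
    try:
        return {ask(port) for _ in range(samples)}
    finally:
        for pid in (old, new):
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
        listener.close()
```

Calling demo() usually returns both labels, because the kernel hands each new connection to whichever worker is waiting in accept(). That is the same effect as getting the old or the new certificate at random.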

Indeed, I can't see the old certificate anymore.

:thinking:

That would allow for uninterrupted service. I would think these threads might be kept alive with "persistent connections", but that certainly doesn't explain the old threads taking on new connections, which would be a deliberate design choice. I'm obviously not familiar with the internals, but your analysis sparked a few thoughts. Very interesting...

Very low resources can be one cause of this.

Look into "graceful shutdowns" or "reloads" (not stop & start)

Or if you really need to see the processes "in action", try:
sudo lsof -iTCP -sTCP:LISTEN -P | grep -Ei 'nginx|apache|command'

I get:

COMMAND     PID            USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nginx      1291            root    6u  IPv4  28235      0t0  TCP *:80 (LISTEN)
nginx      1291            root    7u  IPv6  28236      0t0  TCP *:80 (LISTEN)
nginx      1292        www-data    6u  IPv4  28235      0t0  TCP *:80 (LISTEN)
nginx      1292        www-data    7u  IPv6  28236      0t0  TCP *:80 (LISTEN)
nginx      1293        www-data    6u  IPv4  28235      0t0  TCP *:80 (LISTEN)
nginx      1293        www-data    7u  IPv6  28236      0t0  TCP *:80 (LISTEN)
apache2    1299            root    4u  IPv6  25845      0t0  TCP *:81 (LISTEN)
apache2   41737        www-data    4u  IPv6  25845      0t0  TCP *:81 (LISTEN)
apache2   41738        www-data    4u  IPv6  25845      0t0  TCP *:81 (LISTEN)

And also:
sudo ps -ef | grep -Ei 'nginx|apache'

Unfortunately, I saw your reply too late to capture free memory info before the reboot. However, I can say that I have /usr/bin/certbot renew set up as a daily cron task. Could that have triggered this behavior? I put it there because certificates were sometimes not renewed automatically.

Not likely.
certbot renew (usually) would not be trying to stop/start your web server.
[unless you told it to do that]

That is, until an actual renewal takes place and the apache plugin is used.

I was wondering if we could find a pointer (with an inode number) in /proc/[pid]/fd but it turns out that we can't in this case because the file is closed, not held open.
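
For reference, this is roughly the check I mean, as a small Python sketch that reads the /proc/[pid]/fd symlinks (Linux only; you need root to look at another user's process). The helper names are hypothetical. For Apache it comes back empty for the certificate files, since they are not held open.

```python
import os


def open_files(pid):
    """Map each file descriptor of `pid` to the target of its /proc symlink."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def holds_open(pid, path):
    """True if `pid` currently has `path` open, including unlinked files."""
    real = os.path.realpath(path)
    # The kernel appends " (deleted)" to the link target of an unlinked file.
    wanted = {real, real + " (deleted)"}
    return any(target in wanted for target in open_files(pid).values())
```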

I was able to attach to a running apache2 process with gdb and cause a core dump

host# gdb
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) cd /root
Working directory /root.
(gdb) 
(gdb) file /usr/sbin/apache2
Reading symbols from /usr/sbin/apache2...(no debugging symbols found)...done.
(gdb) !pidof apache2
28384 28383 28382
(gdb) attach 28384
Attaching to program: /usr/sbin/apache2, process 28384
[New LWP 28390]
[New LWP 28391]
[New LWP 28392]
[New LWP 28393]
[New LWP 28394]
[New LWP 28395]
[New LWP 28396]
[New LWP 28397]
[New LWP 28398]
[New LWP 28399]
[New LWP 28400]
[New LWP 28401]
[New LWP 28402]
[New LWP 28403]
[New LWP 28404]
[New LWP 28405]
[New LWP 28406]
[New LWP 28407]
[New LWP 28408]
[New LWP 28409]
[New LWP 28410]
[New LWP 28411]
[New LWP 28412]
[New LWP 28413]
[New LWP 28414]
[New LWP 28415]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc166b8a394 in __libc_read (fd=7, buf=0x7ffd6c923473, nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:27
27	../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) generate-core-file 
warning: target file /proc/28384/cmdline contained unexpected null characters
Saved corefile core.28384

The core file contains some certificate data (e.g. strings core.28384 | grep ocsp), but this didn't feel super-practical:

(1) sometimes gdb failed to attach (maybe it mattered what syscall apache2 was inside of or something?)
(2) the corefile is kind of enormous, I guess it contains all of the shared libraries and allocated-but-uninitialized memory
(3) I don't happen to know what data structures to look for (presumably you could learn more about symbols that you could read in the apache2 process in order to find certificates in memory more expeditiously with gdb)
(4) this feels moderately dangerous on a production system, at least because users' incoming web connections might time out while a particular apache2 is frozen by gdb
(5) it might not be a good idea to randomly make spare copies of your private keys this way :slight_smile:
(6) I think various security-oriented kernel patches and defaults will reduce your ability to attach a debugger to a running process

Arguably a better answer is to get Apache to behave better in response to service apache2 graceful, or else to substitute service apache2 restart for service apache2 graceful in the command you're using to restart Apache (although this might also cause timeouts for live users).

But, the information in memory on your system, including certificates and key material, is ultimately accessible to you as the sysadmin if your OS hasn't been designed to hide it from you. :slight_smile:

(I mean, it seems like this behavior shows that old worker processes can somehow get in a state where they persist much longer than you would expect after a graceful, which seems like an Apache bug...)

I am not sure that restart is reliable either. Based on what I've seen from other users who have hit this, it seems like those orphans become totally detached from the master and keep living even after the original PPID is gone.

Or maybe sysvinit/systemd has lost track of the original PPID, so at some point the control scripts become totally unable to interact with the right processes.
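
One way to look for such leftovers is to compare each worker's parent PID with the PID recorded in the pidfile. Below is a rough Python sketch that does this on the output of ps -eo pid=,ppid=,comm=. The function name is made up, and the process name apache2 and the pidfile location are assumptions that depend on your distribution.

```python
def find_orphans(ps_lines, master_pid, name="apache2"):
    """Return PIDs of `name` processes that are not children of `master_pid`.

    `ps_lines` are rows of `ps -eo pid=,ppid=,comm=` (no header line).
    """
    orphans = []
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip blank or malformed rows
        pid, ppid, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        if comm != name or pid == master_pid:
            continue  # other programs, or the master itself
        if ppid != master_pid:
            orphans.append(pid)
    return orphans
```

You would feed it the lines printed by ps together with the number stored in the pidfile (on Ubuntu that is typically /var/run/apache2/apache2.pid, but check your own setup). Any PID it returns is an apache2 process that the current master does not own.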

Or maybe the system is operating with insufficient resources...
Or with outdated underlying components...
Or is a side effect of some nefarious external actions...

any of which could be causing (or helping to cause) this buggy behavior.

[Apache 2.4.18 should be able to handle these requests as expected]

I'm a little confused about why the new Apache is allowed to rebind the same TCP ports under this condition. I would think the kernel would forbid binding the port if an independent process still had it bound.

... huh, apparently not with SO_REUSEPORT? ... maybe Apache is using that for other speed or reliability reasons?

It does on my system!

[root@plugindev ~]# bpftrace -e 'tracepoint:syscalls:sys_enter_setsockopt /comm == "httpd"/ { printf("%s (tid %d) name:%ld val:%ld\n", comm, tid, args->optname, *args->optval); }'
Attaching 1 probe...
httpd (tid 12899) name:15 val:1

/usr/include/asm-generic/socket.h:#define SO_REUSEPORT  15

But probably the pidfile getting blasted (race condition in control scripts?) is still more likely.
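
If anyone wants to check the SO_REUSEPORT behavior on their own kernel, here is a small Python sketch (Linux, function names invented here). It binds two unrelated listening sockets to the same port with SO_REUSEPORT set, then tries a third bind without the option.

```python
import errno
import socket


def listen_on(port, reuseport):
    """Return a listening TCP socket on 127.0.0.1:`port`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuseport:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(8)
    except OSError:
        sock.close()
        raise
    return sock


def reuseport_demo():
    """True if a second SO_REUSEPORT bind works and a plain bind is refused."""
    first = listen_on(0, reuseport=True)       # port 0: kernel picks a free port
    port = first.getsockname()[1]
    second = listen_on(port, reuseport=True)   # independent socket, same port
    try:
        third = listen_on(port, reuseport=False)
    except OSError as exc:
        plain_rejected = exc.errno == errno.EADDRINUSE
    else:
        plain_rejected = False
        third.close()
    finally:
        first.close()
        second.close()
    return plain_rejected
```

On Linux I would expect reuseport_demo() to return True: the second socket binds happily, and only the socket without the option is refused with EADDRINUSE.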

Something with lsof perhaps? It will show the file path of deleted files ("DEL") along with the PID of the process.