I have a weird problem with certificates on all of my websites that use Let's Encrypt. When I browse one of my websites, I occasionally get "certificate date not valid"; refreshing the page with F5 a few times fixes it, but refreshing several more times in a row may bring the certificate error back. Whenever an https page load fails, I can click the error message in the browser and see that an outdated certificate was served. Whenever the https page loads fine, the certificate date is OK.
All this started about 3 months ago. I've used these certificates for about 2 years in total and never had a problem before. I am located in Moscow, Russia. I hear people have seen this behavior because of erroneous IP blocks imposed by Russian authorities, but if that is the case, I need to confirm it and rule out that my own configuration causes the error. I also keep getting error notifications from Yandex (a major Russian search engine/indexing service).
Thank you.
Domain: www.gsm-plus.ru, www.plenk.ru, www.mixmo.ru
Web server: Apache/2.4.18 (Ubuntu)
OS: Linux 4.4.0-179-generic
Hosting provider: Leaseweb DE, virtual server
Root login: Yes
Client: Certbot 0.28.0
In the above, you can see I connect twice to the domain. In the first case I see an expired certificate, in the second case I see the unexpired one.
What happens then is that when a new connection comes in, that connection has a 1/N chance of connecting to one of the orphaned workers, and getting the old certificate.
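The 1/N behavior can be checked from the outside by connecting many times and counting which certificate comes back. Here is a minimal Python sketch of that idea; the helper names (fetch_fingerprint, fingerprint, sample) are made up for illustration, and verification is deliberately disabled so that an expired certificate can still be inspected.

```python
import hashlib
import socket
import ssl
from collections import Counter


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def fetch_fingerprint(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open one fresh TLS connection and fingerprint the leaf certificate."""
    ctx = ssl.create_default_context()
    # Verification is switched off on purpose: an expired certificate would
    # otherwise abort the handshake before we could look at it.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return fingerprint(der)


def sample(host: str, attempts: int = 20, fetch=fetch_fingerprint) -> Counter:
    """Connect `attempts` times and count how often each certificate is served."""
    return Counter(fetch(host) for _ in range(attempts))


# Example (needs network access):
#   for fp, count in sample("www.gsm-plus.ru", 20).items():
#       print(count, fp)
```

If more than one fingerprint shows up for the same hostname, some connections are being answered with a different certificate than others, which is what the orphaned-worker explanation predicts.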
Why does this happen? Not sure.
How can you fix it? Stop your webserver. Then make sure you kill any leftover webserver processes with killall -9. Or reboot your server entirely to make sure they're all gone.
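Before reaching for killall -9, it can help to see which webserver processes are actually leftovers. The following is a rough Python sketch that reads /proc on Linux; the function names are hypothetical, and it assumes you know the PID of the current master process (on Ubuntu that is usually recorded in /var/run/apache2/apache2.pid, but check your own setup).

```python
import os


def list_processes(name: str):
    """Return (pid, ppid) for every process whose command name equals `name`."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout: "pid (comm) state ppid ..."; comm itself may contain spaces.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        if comm == name:
            found.append((int(entry), int(fields[1])))
    return found


def find_strays(procs, master_pid: int):
    """PIDs that are neither the master nor one of its direct children."""
    return sorted(
        pid for pid, ppid in procs if pid != master_pid and ppid != master_pid
    )


# Example (the PID file path is an assumption, adjust as needed):
#   master = int(open("/var/run/apache2/apache2.pid").read())
#   print(find_strays(list_processes("apache2"), master))
```

Anything this prints is an apache2 process that the current master does not own, which makes it a candidate for the orphaned workers described above.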
That means it has probably been like that since before June 13.
Check all the available resources before you reboot it:
Like: top, free, df -h
[anything else you can think of]
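If you would rather save those numbers to a file than read them off the screen, something like this Python sketch captures roughly what top, free and df -h show. It is Linux-only, the function name and dictionary keys are invented for illustration, and it reads /proc/meminfo directly.

```python
import os
import shutil


def snapshot(path: str = "/") -> dict:
    """Collect load average, memory and disk figures in one dictionary."""
    mem = {}
    with open("/proc/meminfo") as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts:
                mem[key] = int(parts[0])  # most values are in kB
    usage = shutil.disk_usage(path)
    return {
        "load_1m": os.getloadavg()[0],
        "mem_available_kb": mem.get("MemAvailable"),  # None on very old kernels
        "swap_free_kb": mem.get("SwapFree"),
        "disk_free_bytes": usage.free,
    }


# Example:
#   print(snapshot())
```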
_az, I think you pointed me in the right direction, thank you! I just rebooted the server and it looks like it's good now. Is it possible to check the Apache2 logs to see the actual path of the certificates served? However, I don't understand how this happens, since it is not possible to have more than one server listening on the same port. That means it's actually Apache that randomly serves the old certificate.
The certificates are served from memory, not from disk. I'm not sure there's any straightforward way to inspect that.
That's not quite true. It does depend how Apache is configured, but simple version: the master Apache process creates the listening socket, then it forks a bunch of workers. These workers all share the same socket and accept clients from it.
When the Apache master process reloads its configuration (say, after a certificate renewal), it forks some new workers using the new configuration, waits for the old workers to finish up, and kills them. (Somewhat oversimplified and probably incorrect).
If those old workers (for some reason) don't get killed properly and continue to take new connections from the listening socket, you end up seeing some connections being served on the old configuration, and others on the new configuration.
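To make the shared-socket idea concrete, here is a toy Python sketch (Unix only). It is not how Apache is implemented; it only shows that two forked workers which inherit one listening socket both pick up new connections. The "old" and "new" labels stand in for the old and new certificate, and all function names are made up.

```python
import os
import signal
import socket


def spawn_worker(listener: socket.socket, label: bytes) -> int:
    """Fork a worker that accepts on the inherited socket and replies with `label`."""
    pid = os.fork()
    if pid != 0:
        return pid  # parent: hand back the child's PID
    try:
        while True:  # child: serve until killed
            conn, _ = listener.accept()
            with conn:
                conn.sendall(label)
    finally:
        os._exit(0)  # never fall back into the parent's code path


def demo(connections: int = 20) -> list:
    """Start an 'old' and a 'new' worker on one socket and record who answers."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    workers = [spawn_worker(listener, b"old"), spawn_worker(listener, b"new")]
    replies = []
    try:
        for _ in range(connections):
            with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
                data = b""
                while True:
                    chunk = client.recv(16)
                    if not chunk:
                        break
                    data += chunk
                replies.append(data)
    finally:
        for pid in workers:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
        listener.close()
    return replies


# Example:
#   print(demo(10))
```

Every reply is either "old" or "new", and which worker answers a given connection is up to the kernel. That is the same situation a client is in when a stale worker is still accepting on the listening socket.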
That would allow for uninterrupted service. I would think these threads might be kept alive with "persistent connections", but that certainly doesn't explain the old threads taking on new connections, which would be a deliberate design choice. I'm obviously not familiar with the internals, but your analysis sparked a few thoughts. Very interesting...
Unfortunately, I saw your reply too late to capture free memory info before the reboot, however I can say I have /usr/bin/certbot renew set as my daily task in cron. Could that have triggered this behavior? I put it there because sometimes certificates were not updated automatically.
I was wondering if we could find a pointer (with an inode number) in /proc/[pid]/fd but it turns out that we can't in this case because the file is closed, not held open.
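For anyone who wants to repeat that check, this is roughly what it looks like in Python (Linux only, hypothetical function names). As noted, it is expected to come back empty for the certificate files, because the file is read into memory and then closed.

```python
import os


def open_paths(pid: int):
    """Return the targets of all file descriptors a process currently holds."""
    fd_dir = f"/proc/{pid}/fd"
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            paths.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # descriptor was closed while we were looking
    return paths


def holds_open(pid: int, needle: str):
    """Open paths of `pid` that contain `needle`, e.g. 'letsencrypt'."""
    return [p for p in open_paths(pid) if needle in p]


# Example (run as root, 28384 being one of the apache2 PIDs):
#   print(holds_open(28384, "letsencrypt"))
```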
I was able to attach to a running apache2 process with gdb and cause a core dump
host# gdb
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) cd /root
Working directory /root.
(gdb)
(gdb) file /usr/sbin/apache2
Reading symbols from /usr/sbin/apache2...(no debugging symbols found)...done.
(gdb) !pidof apache2
28384 28383 28382
(gdb) attach 28384
Attaching to program: /usr/sbin/apache2, process 28384
[New LWP 28390]
[New LWP 28391]
[New LWP 28392]
[New LWP 28393]
[New LWP 28394]
[New LWP 28395]
[New LWP 28396]
[New LWP 28397]
[New LWP 28398]
[New LWP 28399]
[New LWP 28400]
[New LWP 28401]
[New LWP 28402]
[New LWP 28403]
[New LWP 28404]
[New LWP 28405]
[New LWP 28406]
[New LWP 28407]
[New LWP 28408]
[New LWP 28409]
[New LWP 28410]
[New LWP 28411]
[New LWP 28412]
[New LWP 28413]
[New LWP 28414]
[New LWP 28415]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc166b8a394 in __libc_read (fd=7, buf=0x7ffd6c923473, nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:27
27 ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) generate-core-file
warning: target file /proc/28384/cmdline contained unexpected null characters
Saved corefile core.28384
which contains some certificate data (e.g. strings core.28384 | grep ocsp) but this didn't feel super-practical:
(1) sometimes gdb failed to attach (maybe it mattered what syscall apache2 was inside of or something?)
(2) the corefile is kind of enormous, I guess it contains all of the shared libraries and allocated-but-uninitialized memory
(3) I don't happen to know what data structures to look for (presumably you could learn more about symbols that you could read in the apache2 process in order to find certificates in memory more expeditiously with gdb)
(4) this feels moderately dangerous on a production system, at least because users' incoming web connections might time out while a particular apache2 is frozen by gdb
(5) it might not be a good idea to randomly make spare copies of your private keys this way
(6) I think various security-oriented kernel patches and defaults will reduce your ability to attach a debugger to a running process
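On point (3), one crude way to look for certificates without knowing Apache's data structures is to scan the dump for byte runs that are shaped like a DER-encoded certificate. The Python sketch below is only a heuristic: it assumes a certificate starts with a SEQUENCE header (30 82 plus a two-byte length) immediately followed by a second such header, it will produce false positives, and the function name is made up.

```python
def find_der_candidates(blob: bytes, min_len: int = 256):
    """Return (offset, total_length) for byte runs shaped like a DER certificate."""
    hits = []
    pos = blob.find(b"\x30\x82")
    while pos != -1:
        if pos + 8 <= len(blob):
            outer = int.from_bytes(blob[pos + 2 : pos + 4], "big")
            inner_tag = blob[pos + 4 : pos + 6]
            inner = int.from_bytes(blob[pos + 6 : pos + 8], "big")
            total = outer + 4  # outer length plus its own 4-byte header
            if (
                inner_tag == b"\x30\x82"
                and inner + 4 <= outer  # inner SEQUENCE must fit in the outer one
                and total >= min_len
                and pos + total <= len(blob)
            ):
                hits.append((pos, total))
        pos = blob.find(b"\x30\x82", pos + 1)
    return hits


# Example (reads the whole core file into memory, so mind its size):
#   data = open("core.28384", "rb").read()
#   for offset, length in find_der_candidates(data):
#       print(offset, length)
```

A candidate slice can be turned into PEM with ssl.DER_cert_to_PEM_cert and then examined with the usual tools to see whether it really is a certificate and what its dates are. The caveats in points (4) and (5) still apply to producing the dump in the first place.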
Arguably a better answer is to get Apache to behave better in response to service apache2 graceful, or else to substitute service apache2 restart for service apache2 graceful in the command you're using to restart Apache (although this might also cause timeouts for live users).
But, the information in memory on your system, including certificates and key material, is ultimately accessible to you as the sysadmin if your OS hasn't been designed to hide it from you.
(I mean, it seems like this behavior shows that old worker processes can somehow get in a state where they persist much longer than you would expect after a graceful, which seems like an Apache bug...)
I am not sure that restart is reliable either. Just based on observing other users have hit this, it seems like those orphans become totally detached from the master, and keep living even after the original PPID is gone.
Or maybe sysvinit/systemd has lost track of the original PPID, so at some point, the control scripts become totally unable to interact with the right processes.
Or maybe the system is operating with insufficient resources...
Or with outdated underlying components...
Or is a side effect of some nefarious external actions...
which are causing (or helping to cause) this bug-like behavior.
[Apache 2.4.18 should be able to handle these requests as expected]
I'm a little confused about why the new Apache is allowed to rebind the same TCP ports under this condition. I would think the kernel would forbid binding the port if an independent process still had it bound.
... huh, apparently not with SO_REUSEPORT? ... maybe Apache is using that for other speed or reliability reasons?
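The kernel side of this is easy to try out. Below is a small Python sketch (Linux, hypothetical function name) that binds two independent listening sockets to the same port, once with SO_REUSEPORT set on both and once without. Both sockets live in one process here for simplicity; whether Apache sets this option in a given configuration is not something this sketch tells you.

```python
import socket


def bind_pair(reuseport: bool) -> bool:
    """Bind two separate listening sockets to one port; True if the second succeeds."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuseport:
            # The option has to be set on both sockets before bind().
            for sock in (first, second):
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            second.listen(1)
        except OSError:
            return False  # typically "Address already in use"
        return True
    finally:
        first.close()
        second.close()


# Example:
#   print(bind_pair(True), bind_pair(False))
```

With the option set, the second bind goes through and the kernel spreads incoming connections across both sockets; without it, the second bind is refused, which matches the intuition that the port should be forbidden.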