Hi guys, my certbot behaves very strangely. It fails to renew the certificate in about 95% of cases. Sometimes it succeeds, but mostly it fails, and that is without any change to the configuration: two consecutive runs of the same command, one fails and one succeeds (I have logs of both runs).
Any idea what could be causing this? It had been working for months.
Help highly appreciated.
Could be due to some nginx configuration that certbot doesn't understand properly. I'm not seeing a difference between IPv4 and IPv6, so that's probably not it.
Could you paste the output of nginx -T? You can edit out /etc/nginx/mime.types as that probably won't be relevant.
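If trimming the dump by hand is tedious, here is a minimal Python sketch that drops whole files from it. It assumes the usual `nginx -T` layout, where each file is preceded by a line of the form `# configuration file <path>:`. The function name `drop_files` and its `skip_suffixes` parameter are made up for illustration.

```python
import re

# Header line that `nginx -T` prints before each dumped file (assumed layout).
HEADER = re.compile(r"^# configuration file (?P<path>.+):\s*$")


def drop_files(dump, skip_suffixes=("mime.types",)):
    """Return the dump without the sections whose path ends in a skipped suffix."""
    kept = []
    skipping = False
    for line in dump.splitlines():
        match = HEADER.match(line)
        if match:
            # A new file section starts here; decide whether to keep it.
            skipping = match.group("path").endswith(tuple(skip_suffixes))
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```

Pass it the captured text of `nginx -T` and paste what comes back. Adding further suffixes to `skip_suffixes` would also leave out files for unrelated production virtual hosts, but do read the result before posting it.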
@Osiris thanks for your reply. Unfortunately I cannot paste the whole nginx configuration here, as it contains production virtual hosts and I don't consider it safe to share publicly. I can paste the relevant parts though.
Just let me know which they are if I miss anything.
The commented-out parts are normally active; I commented them out while trying to make it work. Nothing helped though.
And the few successful challenges were not redirected to HTTPS, despite the "return 301..." line being present in the default server block for port 80.
And those files are the only references to api.bustravel.is? Because I don't see any reason why the server block added by certbot wouldn't be triggered.
Yes, those are the only mentions of that domain in the whole nginx configuration.
I don't see a reason either. It feels like nginx sometimes serves the challenge file and sometimes redirects to HTTPS and ends up with a 404. Could it be somehow affected by HSTS? I just managed to run it twice with exactly the same configuration and stored the logs from both runs. I am just not sure if it's safe to share them here publicly (the whole letsencrypt.log files).
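One way to confirm the "sometimes served, sometimes redirected" suspicion is to request the same challenge URL over plain HTTP many times and count the outcomes. Below is a rough Python sketch using only the standard library. It assumes you have a URL under /.well-known/acme-challenge/ that should currently answer with 200 over HTTP (for example while certbot is paused with its --debug-challenges option). The names `classify` and `probe` are hypothetical helpers, not anything certbot provides.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Optional


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so that a 301 stays visible to the caller."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx status


def classify(status: int, location: Optional[str]) -> str:
    """Map one HTTP answer onto the outcomes discussed in this thread."""
    if status == 200:
        return "served"
    if status in (301, 302, 307, 308) and (location or "").startswith("https://"):
        return "redirected-to-https"
    return f"other-{status}"


def probe(url: str, attempts: int = 20) -> Counter:
    """Request `url` repeatedly and count how each attempt was answered."""
    opener = urllib.request.build_opener(_NoRedirect)
    results = Counter()
    for _ in range(attempts):
        try:
            with opener.open(url, timeout=10) as resp:
                results[classify(resp.status, resp.headers.get("Location"))] += 1
        except urllib.error.HTTPError as err:
            results[classify(err.code, err.headers.get("Location"))] += 1
        except urllib.error.URLError:
            results["connection-error"] += 1
    return results
```

If `probe(...)` on an identical URL comes back with a mix of "served" and "redirected-to-https", then different nginx workers are answering with different configurations, which is not something the validation server itself could cause.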
As far as I know, the Let's Encrypt validation server ignores HSTS headers.
The log file only contains public keys (for the ACME connection), no private keys are stored in it. You can however remove parts if you think it's necessary.
Very strange. It should not randomly fail. The nginx configuration used looks the same to me between success and failure. The IPs aren't by any chance load balancers sitting in front of your actual server?
No, we don't have any load balancer here. Just a server with static IP.
Yes, that's what frustrated me after 3 hours of debugging yesterday. I couldn't find any reason for it to stop working (I thought an update of the nginx package or something similar might have indirectly broken it, but then I saw it succeed and fail with no change in config, so I decided to contact the letsencrypt community as you might have more experience with such problems).
@rg305
The lines that are commented out are normally in use. I commented them out while trying to make the certbot command work, because in the failed certbot runs the problem was that the challenge request (/.well-known/acme-challenge/....) was redirected to HTTPS (it shouldn't have been). If you check the letsencrypt.log files I posted above (pastebin links), the block actually looks like this:
As for the webroot plugin: I guess that one would work, but I haven't tried it yet. I'm counting on it as a fallback solution if we don't manage to fix the behavior with the nginx plugin. Thanks for the suggestion.
@Osiris I will try to adjust the nginx logging and check if I see anything wrong there (any suggestion what to look for or how to adjust the logging?)
@rg305 yes, I know. It is a much simpler method and easy to fall back to, but I don't like switching away from something that isn't working without understanding why.
@Osiris after certbot edits the nginx configuration, do you know how it reloads nginx afterwards? Could it be that it reloads the service asynchronously, so that if reloading takes longer, nginx doesn't manage to apply the changes fast enough? Or do you know anybody who could answer this? Would it be considered a bug?
Check the /var/log/letsencrypt/letsencrypt.log file.
If there is insufficient detail to answer your question, try it again with -v or -vv or -vvv
[each v would increase the amount of detail entered into the log file]
For lack of a better explanation, I think there's a good chance you're running into this asynchronous reloading issue.
I'd try the flag @_az has implemented if I were you and see if that helps!
A nice feature would be for certbot to only continue with the challenge once all previous worker processes have stopped. But with a quick Google search, I'm unable to find whether such a simple check exists.
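For what it's worth, such a check can be approximated outside certbot. The Python sketch below is only an illustration, not how certbot works: it assumes that old nginx workers draining after a reload show up in `ps` with the title "nginx: worker process is shutting down", and that `ps -eo pid,args` is available on the system. Both function names are invented for this example.

```python
import subprocess
import time
from typing import List

# Process title assumed for old workers that are draining after a reload.
OLD_WORKER_TITLE = "nginx: worker process is shutting down"


def old_worker_pids(ps_output: str) -> List[int]:
    """Extract PIDs of draining nginx workers from `ps -eo pid,args` output."""
    pids = []
    for line in ps_output.splitlines():
        pid, _, args = line.strip().partition(" ")
        if pid.isdigit() and args.strip().startswith(OLD_WORKER_TITLE):
            pids.append(int(pid))
    return pids


def wait_for_old_workers(timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until no draining workers remain; return False if the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ps = subprocess.run(
            ["ps", "-eo", "pid,args"],
            capture_output=True, text=True, check=True,
        ).stdout
        if not old_worker_pids(ps):
            return True
        time.sleep(interval)
    return False
```

Running `wait_for_old_workers()` right after a manual `nginx -s reload` would show how long the old workers linger on this server. If that takes more than a moment, it would fit the asynchronous reload theory.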