Certbot fails to renew certificate using nginx plugin

Hi guys, my certbot is behaving very strangely. It fails to renew the certificate in about 95% of attempts. Occasionally it succeeds, but two consecutive runs of the same command with no configuration change in between can give one failure and one success (I have logs of both kinds of run).
Any idea what could be causing this? It had been working for months.
Any help is highly appreciated.

My domain is: api.bustravel.is

I ran this command: certbot renew --cert-name api.bustravel.paxflow.io --dry-run

It produced this output:

Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/api.bustravel.paxflow.io.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cert is due for renewal, auto-renewing...
Plugins selected: Authenticator nginx, Installer nginx
Starting new HTTPS connection (1): acme-staging-v02.api.letsencrypt.org
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for api.bustravel.is
Waiting for verification...
Challenge failed for domain api.bustravel.is
http-01 challenge for api.bustravel.is
Cleaning up challenges
Attempting to renew cert (api.bustravel.paxflow.io) from /etc/letsencrypt/renewal/api.bustravel.paxflow.io.conf produced an unexpected error: Some challenges have failed.. Skipping.
All renewal attempts failed. The following certs could not be renewed:
  /etc/letsencrypt/live/api.bustravel.paxflow.io/fullchain.pem (failure)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
** DRY RUN: simulating 'certbot renew' close to cert expiry
**          (The test certificates below have not been saved.)

All renewal attempts failed. The following certs could not be renewed:
  /etc/letsencrypt/live/api.bustravel.paxflow.io/fullchain.pem (failure)
** DRY RUN: simulating 'certbot renew' close to cert expiry
**          (The test certificates above have not been saved.)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 renew failure(s), 0 parse failure(s)

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: api.bustravel.is
   Type:   unauthorized
   Detail: Invalid response from
   http://api.bustravel.is/.well-known/acme-challenge/DIPIp7zfacU_xL6wwzkd17QS_bb1VCEtyj4Rn4upc-U
   [2a01:4f8:221:205a::2]: "<html>\r\n<head><title>404 Not
   Found</title></head>\r\n<body>\r\n<center><h1>404 Not
   Found</h1></center>\r\n<hr><center>nginx</center>\r\n"

   To fix these errors, please make sure that your domain name was
   entered correctly and the DNS A/AAAA record(s) for that domain
   contain(s) the right IP address.

My web server is (include version): nginx version: nginx/1.16.1

The operating system my web server runs on is (include version):
CentOS Linux release 7.8.2003 (Core)

My hosting provider, if applicable, is:
Hetzner

I can login to a root shell on my machine (yes or no, or I don't know):
yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):
no, just bare console

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
certbot 1.7.0

$ cat /etc/letsencrypt/renewal/api.bustravel.paxflow.io.conf

# renew_before_expiry = 30 days
version = 1.0.0
archive_dir = /etc/letsencrypt/archive/api.bustravel.paxflow.io
cert = /etc/letsencrypt/live/api.bustravel.paxflow.io/cert.pem
privkey = /etc/letsencrypt/live/api.bustravel.paxflow.io/privkey.pem
chain = /etc/letsencrypt/live/api.bustravel.paxflow.io/chain.pem
fullchain = /etc/letsencrypt/live/api.bustravel.paxflow.io/fullchain.pem

# Options used in the renewal process
[renewalparams]
authenticator = nginx
installer = nginx
account = f6fb3dcb6db3eb975a0128963a92c3a4
server = https://acme-v02.api.letsencrypt.org/directory
renew_hook = nginx -t 2>&1 && systemctl reload nginx

Could be due to some nginx configuration which certbot doesn't understand properly. I'm not seeing a difference between IPv4 and IPv6, so that's probably not it.

Could you paste the output of nginx -T? You can edit out /etc/nginx/mime.types as that probably won't be relevant.
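
For anyone who wants to repeat the IPv4/IPv6 comparison from an outside machine, here is a rough sketch using curl. The function name and the token in the usage line are made up for illustration. A 404 for a made-up token is expected; the point is only whether both address families answer the same way.

```shell
# Hypothetical helper: request one URL over IPv4 and over IPv6
# and print the HTTP status code returned for each address family.
probe_challenge_path() {
    url="$1"
    for family in 4 6; do
        # -s: silent, -o /dev/null: discard the body, -w: print only the status code
        code=$(curl -"$family" -s -o /dev/null -w '%{http_code}' "$url")
        echo "IPv$family $code"
    done
}
```

Usage: `probe_challenge_path http://api.bustravel.is/.well-known/acme-challenge/test`. curl prints 000 when it gets no HTTP response over that address family.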

@Osiris thanks for your reply. Unfortunately I cannot paste the whole nginx configuration here, as it contains production virtual hosts that I don't consider safe to share publicly. I can paste the relevant parts, though.
Just let me know if I have missed anything.

redirecting HTTP to HTTPS

# redirecting HTTP to HTTPS
server {
       listen 80 default_server;
       listen [::]:80 default_server;

       #server_name paxflow.is *.paxflow.is;

   #include snippets/cbs-location-restrictions.conf;

   #location / {
   #    return 301 https://$host$request_uri;
   #}
}

configuration file /etc/nginx/sites-enabled/api.bustravel.paxflow.io:

server {
        listen 443 ssl;
        listen [::]:443 ssl;

        server_name api.bustravel.is;

        include snippets/paxflow.io/bustravel/ssl-api.conf;
        include snippets/ssl-params.conf;

        access_log /var/log/nginx/bustravel/api.bustravel.paxflow.io/access.log;
        error_log /var/log/nginx/bustravel/api.bustravel.paxflow.io/error.log;

        root /srv/www/bustravel/api.bustravel.paxflow.io/public;

        index index.php;

        location / {
                if (-f $request_filename) {
                        break;
                }
                rewrite ^/([^/]+)/([^/]+)/$ /index.php?module=$1&action=$2&$args? last;
                rewrite ^/([^/]+)/$ /index.php?module=$1&$args? last;
        }

        location ~ \.php$ {
                include /etc/nginx/fastcgi_params;
                fastcgi_pass  unix:/var/run/php-fpm/php-fpm.bustravel.sock;
                fastcgi_index index.php;
                fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }
}

server {
        listen 443 ssl;
        listen [::]:443 ssl;

        server_name api.bustravel.paxflow.io;

        include snippets/paxflow.io/bustravel/ssl.conf;
        include snippets/ssl-params.conf;

        return 301 https://api.bustravel.is$request_uri;
}

Looking into the letsencrypt log, this is the server block certbot added to nginx.conf during the challenge.

server {rewrite ^(/.well-known/acme-challenge/.*) $1 break; # managed by Certbot


       listen 80 ;
       listen [::]:80 ;

       #server_name paxflow.is *.paxflow.is;

   #include snippets/cbs-location-restrictions.conf;

   #location / {
   #    return 301 https://$host$request_uri;
   #}

server_name api.bustravel.is; # managed by Certbot
location = /.well-known/acme-challenge/k4czdimvkDg8cHZVL2zFF8n_KIheuhpEgHFGaDDrB6E{default_type text/plain;return 200 k4czdimvkDg8cHZVL2zFF8n_KIheuhpEgHFGaDDrB6E.JTUTFJLaYXwZTd2OS-y1CvfDsJWzwq-yWUrunS2zUSg;} # managed by Certbot

}

The commented-out parts are normally active; I commented them out while trying to make this work. Nothing helped, though.
And the few successful challenges were not redirected to HTTPS, even though the "return 301..." line was present in the default server block for port 80.

And those files are the only references to api.bustravel.is? Because I don't see any reason why the server block added by certbot wouldn't be triggered.

Yes, those are the only mentions of that domain in the whole nginx configuration.
I don't see a reason either. It feels like nginx sometimes serves the challenge file and sometimes redirects to HTTPS and ends up with a 404. Could it somehow be affected by HSTS? I just managed to run it twice with exactly the same configuration and stored the logs from both runs. I am just not sure whether it's safe to share them here publicly (the whole letsencrypt.log files).

As far as I know, the Let's Encrypt validation server ignores HSTS headers.

The log file only contains public keys (for the ACME connection), no private keys are stored in it. You can however remove parts if you think it's necessary.

The logs are here. I removed the parts listing all the domains (all nginx config files - but they are identical in both cases).

Successful run https://pastebin.com/Z1PQrnAV
Failed run https://pastebin.com/b3WEuunD

Very strange. It should not fail randomly. The nginx configuration used looks the same to me between success and failure. Those IPs aren't by any chance load balancers sitting in front of your actual server?

No, we don't have any load balancer here. Just a server with a static IP.

Yes, that's what frustrated me after 3 hours of debugging yesterday. I couldn't find any reason for it to stop working. I thought an nginx package update or something similar might have broken it indirectly, but then I saw it succeed and fail with no change in config, so I decided to contact the Let's Encrypt community, as you might have more experience with such problems.

Perhaps there is a difference in the nginx logs between a successful and a failed run? You could temporarily increase the verbosity of nginx logging.
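
If it helps, here is a rough way to pull the challenge requests out of an access log so the two runs can be compared side by side. It assumes nginx's default combined log format (status code in the ninth whitespace-separated field). The function name and the log path in the usage line are assumptions, since the temporary port-80 block logs wherever the http-level access_log points.

```shell
# Hypothetical helper: print client address, request path and status code
# for every ACME challenge request found in the given access log(s).
# Assumes the default "combined" log format.
acme_hits() {
    awk 'index($7, "/.well-known/acme-challenge/") == 1 { print $1, $7, $9 }' "$@"
}
```

Usage: `acme_hits /var/log/nginx/access.log`. Comparing the status column between a successful and a failed run should show whether the challenge request reached the block certbot added.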

You are missing an action and a document root location in that block!
All lines are #'ed out.

If you don't need it - delete it.

Have you tried authenticating via

--webroot -w /srv/www/bustravel/api.bustravel.paxflow.io/public
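
Spelled out as a full dry run, that could look roughly like the sketch below. The wrapper function is made up for convenience; the cert name and webroot path are the ones from this thread. It relies on the validation request ending up in that document root, which as far as I can tell from the posted config should hold, because Let's Encrypt follows the HTTP-to-HTTPS redirect.

```shell
# Hypothetical wrapper: dry-run the renewal with the webroot authenticator
# instead of the nginx plugin. The optional argument overrides the webroot path.
try_webroot() {
    webroot="${1:-/srv/www/bustravel/api.bustravel.paxflow.io/public}"
    certbot renew --cert-name api.bustravel.paxflow.io --dry-run \
        --webroot -w "$webroot"
}
```

Usage: `try_webroot`. With --dry-run nothing is saved, so the renewal config keeps using the nginx plugin.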

@rg305
the commented-out lines are usually in use. I commented them out while trying to make the certbot command work, because in the failed certbot runs the challenge request (/.well-known/acme-challenge/....) was redirected to HTTPS (it shouldn't have been). If you check the letsencrypt.log files I posted above (pastebin links), the block actually looks like this:

 server {
       listen 80 default_server;
       listen [::]:80 default_server;

       server_name xxxx;

   location / {
       return 301 https://$host$request_uri;
   }
}

As for the webroot plugin: I guess that one would work, but I haven't tried it yet. I'm counting on it as a fallback if we don't manage to fix the behavior with the nginx plugin. Thanks for the suggestion.

Webroot avoids all the modifications to nginx altogether and does the same thing (in the end).

@Osiris I will try to adjust the nginx logging and check whether I see anything wrong there (any suggestions on what to look for or how to adjust the logging?)

@rg305 yes, I know. It is a much simpler method and easy to fall back to, but I don't like switching away from something that isn't working when I don't understand why.

I'm afraid not. This is a very strange issue you have, I think.

@Osiris after certbot edits the nginx configuration, do you know how it reloads nginx afterwards? Could it be that it reloads the service asynchronously, so that if the reload takes longer, nginx doesn't apply the changes fast enough? Or do you know anybody who could answer this? Would it be considered a bug?

Inspired by comment from @_az in this thread: Certbot renew with nginx module - returns error 404 for challenge response

We added a flag for situations like that: --nginx-sleep-seconds (defaults to 1).

You can try bumping it to 30 or something and see if it helps.
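
Something along these lines, as a sketch. The wrapper function is only there to make it easy to try a few values, and 30 is an arbitrary starting point rather than a tuned recommendation.

```shell
# Hypothetical wrapper: dry-run the renewal while giving nginx more time
# to finish reloading before the challenge is attempted.
# The optional argument is the number of seconds to wait (default here: 30).
renew_with_sleep() {
    seconds="${1:-30}"
    certbot renew --cert-name api.bustravel.paxflow.io --dry-run \
        --nginx-sleep-seconds "$seconds"
}
```

Usage: `renew_with_sleep 30`. If that makes the dry run pass reliably, the value can be lowered step by step to see where failures come back.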

Check the /var/log/letsencrypt/letsencrypt.log file.
If there is insufficient detail to answer your question, try it again with -v or -vv or -vvv
[each v would increase the amount of detail entered into the log file]

It seems to run nginx -s reload:

Good chance you're running into this asynchronous reloading issue, I think, for lack of a better explanation.

I'd try the flag @_az has implemented if I were you and see if that helps!

A nice feature would be for certbot to only continue with the challenge once all previous worker processes have stopped. But from a quick Google search, I'm unable to find whether such a simple check exists.
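
Outside certbot, such a check could be approximated by hand. The sketch below is not something certbot does: it records the PIDs of the nginx workers before a reload and then waits until none of them is alive. Both function names and the timeout are made up, it assumes worker processes show up with the title "nginx: worker process", and it should run as root so that `kill -0` can see processes owned by the nginx user.

```shell
# Hypothetical helper: print the PIDs of the current nginx worker processes.
nginx_worker_pids() {
    ps -e -o pid= -o args= | awk '$2 == "nginx:" && $3 == "worker" { print $1 }'
}

# Hypothetical helper: wait until none of the given PIDs is alive.
# First argument is a timeout in seconds; returns 1 if it runs out.
wait_for_pids_gone() {
    remaining="$1"; shift
    for pid in "$@"; do
        # kill -0 sends no signal, it only tests whether the process exists
        while kill -0 "$pid" 2>/dev/null; do
            [ "$remaining" -le 0 ] && return 1
            sleep 1
            remaining=$((remaining - 1))
        done
    done
}
```

Usage: `old=$(nginx_worker_pids); nginx -s reload; wait_for_pids_gone 30 $old`. Old workers only exit after finishing their open connections, so long-lived connections can make this wait run into the timeout.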
