Certbot gets stuck before saying "Waiting 60 seconds for DNS changes to propagate"

My domain is: ordermade.com

I ran this command:

certbot certonly --cert-name ordermade --dns-rfc2136 --dns-rfc2136-credentials certbot-rfc2136.ini -d *.ordermade.com -d ordermade.com

It produced this output:

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator dns-rfc2136, Installer None
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for ordermade.com
dns-01 challenge for ordermade.ws
dns-01 challenge for ordermade.com
dns-01 challenge for ordermade.ws

My web server is (include version): apache2 2.4.41-4ubuntu3.9

The operating system my web server runs on is (include version): Ubuntu 20.04.4 LTS (focal)

My hosting provider, if applicable, is: DigitalOcean (although it is not applicable since I use my own BIND9 server)

I can login to a root shell on my machine (yes or no, or I don't know): yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): certbot 0.40.0

Version of bind: 1:9.16.1-0ubuntu2.9

The logs show everything is working fine, except that the last JSON says "status": "pending" and nothing else happens... Also, I never get the DNS TXT change happening in the ordermade.com zone. I have some other domains (beepbrake.app) which worked just fine (i.e. I pretty much immediately saw the "Waiting 60 seconds for DNS changes to propagate" message and after a minute or so, I got the certificate). I'm at a loss now.

Also, using nsupdate from another computer (not local, not from the secondary DNS), I can add the TXT field as expected.

It seems likely that Certbot is timing out while talking to your BIND server. This could be a networking problem, or just that BIND is dropping the query.

The RFC2136 plugin did not gain a default network timeout until Certbot v1.10.0 (thanks to @Osiris). You are using v0.40.0, so I would guess that it's just hanging forever, waiting for a response from BIND.

What about from the computer where Certbot is running?
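
As a rough way to test reachability from the Certbot host without waiting on a hang, here is a minimal Python sketch that opens a TCP connection to the DNS server with an explicit timeout. The function name `can_reach_dns_server` and the placeholder address are assumptions for illustration; this is not code from the RFC2136 plugin, and a successful connect only shows the port is reachable, not that the TSIG update would be accepted.

```python
import socket


def can_reach_dns_server(host: str, port: int = 53, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises on timeout instead of blocking forever.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Example call (192.0.2.53 is a documentation-only placeholder address;
# substitute the dns_rfc2136_server value from your credentials file):
#   can_reach_dns_server("192.0.2.53", 53, timeout=5.0)
```

If this returns False from the Certbot machine but nsupdate works elsewhere, the problem is more likely the network path or the configured address than BIND itself.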

Interesting...

Yes, the nsupdate ... command also works from the computer I run Certbot from. I first tested on that computer but thought that was not a very good test since it is likely authorizing in part because it is on the same computer.

I've checked my name server from https://unboundtest.com/ and also from https://mxtoolbox.com/ and it works just fine. So I'm not too sure what my next step should be at the moment... Any idea how I could get this to move forward?

Updating your Certbot client would be the first step.

There isn't a lot of visibility from the RFC2136 plugin to see what's happening.

I would try checking two things:

  • Whether anything shows up in BIND server logs at all
  • Record a packet capture and see where/what packets Certbot is trying to send when talking to BIND, and whether it receives any responses

@jvanasco I would agree, only I was able to create a few before it got stuck... so it sounds strange that it would be necessary to upgrade to finish up. I'll do that anyway, but I don't think it will help.

@_az Right... I'm getting several hits per second on port 53... I'll look into running a tool to see what all the packets are. From the logs, I haven't seen anything of interest. Many of the hits are for a domain named tavohosting.com, which I know nothing about. (It looks like some form of hacking because it comes from many places.)

You would be surprised! The various improvements since your version, like the default network timeout, tend to minimize the places where edge cases can happen - or improve logging to better debug. I would not be surprised if you are experiencing issues due to a few bugs or deficiencies in Certbot, which have been addressed over the past 30 months. A lot of problems have been solved here by simply updating the client to something recent.

Please make sure your DNS server isn't vulnerable to things like DNS amplification attacks. By vulnerable I mean being able to be the agent of such an attack, not the target.

@Osiris Thank you for the warning. My server is not a relay. mxtoolbox verifies that issue.

@jvanasco True, the timeout would probably be a good thing. Hopefully, there is a timeout on the other end too? So if you try again later, it will work?

@_az I think I found out what is causing the issue (it's still an issue...) I had certbot on my old server and it started "under my feet". And that's about the time when the "pending" issue started on my new server. It looks like my account's state could be wrong, at least at the moment. Is there something to do about that or will the account reset itself after a while? At the moment, it's still stuck after about 20 min. I'm thinking there may be a conflict because two different accounts made a query about the same domain name around the same time. Otherwise, I'll just try again later today and tomorrow. As a temporary measure, I'm using my "old" certificates which are still working (but do not include all the SAN I wanted on the new server...)

Old VPS instances unexpectedly booting or running is a very common issue.

The most common ways to wedge a LetsEncrypt account typically involve (i) more than 5 Duplicate Certificates per week, (ii) more than 300 Pending Authorizations, and (iii) more than 5 failed authorizations per hour. LetsEncrypt's implementation of ACME has a few optimizations that could contribute to a race condition with two competing servers sharing the same account. One that comes to mind is Pending Authorizations being recycled across AcmeOrders. I'm trying to think of ways that could create a race condition between two servers though, and honestly can't.

I think the most likely explanation is some issue with the DNS plugin or your BIND instance, one of which might be fixed by upgrading your client.

Since you're familiar with BIND, I'll assume you have some devops skills. The two things I suggest are:

  • If you're only using BIND for LetsEncrypt, instead of relying on BIND, consider running an acme-dns instance and delegating the _acme-challenge records to it. It's a much easier system to configure and maintain for DNS challenges. I also recommend using it for security concerns, but it's honestly the easiest way to handle DNS-01. You can use Certbot's pre/post hooks to enable/disable the server, so it only runs as needed.

  • If you need a quick cert and all domains are on that server, Certbot's standalone server running on an alternate port is a great option. I typically have nginx proxypass all /.well-known/acme-challenge requests on port 80 to a higher port, so if I ever have issues I can just run Certbot on that port.

That could be my situation. I possibly tried more than 5 times with ordermade.com within one hour. How long is an account wedged for? Is it possible to see the account's status and see whether I'm good or not?

In regard to BIND9, I currently manage 42 zones. So I'm not too sure that using acme-dns would be a good solution for me.

Please use the staging environment for testing purposes.

Okay! I found it...

When I ran my upgrade, the dns_rfc2136_server parameter got updated back to my old VPS...

It would be great to have some info in the logs about which IP gets used by the RFC2136 implementation. That would possibly have alerted me... My next version will regenerate those files, so I should not get the issue.

dns_rfc2136_server = 165.232.146.181         <-- this line was messed up
# Target DNS port
dns_rfc2136_port = 53
# TSIG key name
dns_rfc2136_name = letsencrypt_wildcard.
# TSIG key secret
dns_rfc2136_secret = ...edited...
# TSIG key algorithm
dns_rfc2136_algorithm = HMAC-SHA512
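
Since the plugin does not log which server it contacts, a small script can print the address from the credentials file before each run. The following is a minimal Python sketch under stated assumptions: the helper name `read_rfc2136_settings` is hypothetical, the sample address 192.0.2.1 is a documentation placeholder, and the parser is a simplified `key = value` reader, not the one Certbot itself uses.

```python
def read_rfc2136_settings(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so base64 secrets ending in "=" survive.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Sample content with placeholder values (not the real credentials).
SAMPLE = """\
# Target DNS server
dns_rfc2136_server = 192.0.2.1
# Target DNS port
dns_rfc2136_port = 53
"""

settings = read_rfc2136_settings(SAMPLE)
print(settings["dns_rfc2136_server"], settings["dns_rfc2136_port"])
```

In practice you would read the text from certbot-rfc2136.ini and compare the printed address against the server you expect to be updating.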

Thank you all for your help. In the end, it helped me think I wasn't completely crazy. :grin:

I've now gotten a new ordermade.com/ws SAN certificate.

It's a rolling rate limit, so it resets within an hour. Boulder (LetsEncrypt's ACME Server) has a plaintext error message for each rate limit, and Certbot exposes them clearly. If you hit this ratelimit, you would have known it.

Because you're just getting a hanging request on that DNS record set, that strongly suggests an issue with: (i) your BIND server, (ii) your Certbot version, or (iii) something else on your server or configuration; because that type of message is emitted before the LetsEncrypt API is involved with anything substantial there.
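
To illustrate what "rolling" means for the limit mentioned above, here is a minimal Python sketch of a sliding-window counter using the 5-per-hour figure from this thread. The class name `RollingLimit` is hypothetical and this is only an illustration of the general idea; it does not represent how Boulder actually tracks rate limits.

```python
from collections import deque


class RollingLimit:
    """Allow at most `limit` events in any trailing `window_seconds` period."""

    def __init__(self, limit: int = 5, window_seconds: float = 3600.0):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of counted events, oldest first

    def allow(self, now: float) -> bool:
        # Drop events that have aged out of the trailing window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True


limiter = RollingLimit(limit=5, window_seconds=3600)
# Five events in the first few seconds are counted, the sixth is rejected.
results = [limiter.allow(t) for t in (0, 1, 2, 3, 4, 5)]
# One hour after the first event, a slot has freed up again.
later = limiter.allow(3600)
```

The point is that there is no fixed reset time: each counted event simply stops counting once it is older than the window.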

@jvanasco Yes. It was definitely (iii): a file on my server was still referencing the old server IP address. It's working again now. Thank you for the details about the rate limit and that I would have been told. That in itself is good information!