How to avoid challenge failures due to slow propagation of Route53?

Hi Team, I am a Solutions Architect at AWS China.

I'm using Lambda and Certbot to automate obtaining certificates and uploading them to the AWS IAM certificate store in the AWS China Regions. We packaged this as a solution and released it: https://www.amazonaws.cn/en/getting-started/tutorials/create-ssl-with-cloudfront/?nc1=h_ls

However, after customer feedback and our own testing, we found that certificate issuance intermittently fails with errors that appear to be caused by slow DNS propagation.

Also, we found that dns-route53-propagation-seconds has been deprecated.

So we need a reliable way to ensure that the certificate is requested only after propagation has succeeded, or a command-line parameter (something like dns-route53-propagation-seconds or a maximum number of retries) to work around the problem.

Here is a code example:

import certbot.main

certbot_args = [
    '--config-dir', CERTBOT_DIR + "/config",
    '--work-dir', CERTBOT_DIR + "/work",
    '--logs-dir', CERTBOT_DIR + "/logs",

    '--cert-name', "ssl",

    # Obtain a cert but don't install it
    'certonly',

    # Run in non-interactive mode
    '--non-interactive',

    # Agree to the terms of service
    '--agree-tos',

    # Email of domain administrators
    '--email', email,

    # Use the DNS-01 challenge with the Route 53 DNS plugin
    '--dns-route53',
    # '--dns-route53-propagation-seconds', '720',  # deprecated
    '--preferred-challenges', 'dns-01',
    '--issuance-timeout', '900',

    # Use this server instead of the default ACME server
    # '--server', CERTBOT_SERVER,

    # Domains to provision certs for (comma separated)
    '--domains', domains

    # '--dry-run'
]

cert_code = certbot.main.main(certbot_args)

Here is a Certbot log showing the issue (if available):

1713517500553 Detail: During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.xxxx.people.a2z.org.cn
1713517500553 Detail: During secondary validation: DNS problem: query timed out looking up TXT for _acme-challenge.xxx.people.a2z.org.cn
1713517500553 Hint: The Certificate Authority failed to verify the DNS TXT records created by --dns-route53. Ensure the above domains have their DNS hosted by AWS Route53.
1713517500554 [DEBUG] 2024-04-19T09:05:00.554Z 5fc962e8-6c64-4dad-a4de-7e2aabc46111 Encountered exception:
1713517500554 Traceback (most recent call last):
1713517500554 File "/var/task/certbot/_internal/auth_handler.py", line 108, in handle_authorizations
1713517500554 self._poll_authorizations(authzrs, max_retries, max_time_mins, best_effort)
1713517500554 File "/var/task/certbot/_internal/auth_handler.py", line 212, in _poll_authorizations
1713517500554 raise errors.AuthorizationError('Some challenges have failed.')
1713517500554 certbot.errors.AuthorizationError: Some challenges have failed.
1713517500554 [DEBUG] 2024-04-19T09:05:00.554Z 5fc962e8-6c64-4dad-a4de-7e2aabc46111 Calling registered functions
1713517500554 [INFO] 2024-04-19T09:05:00.554Z 5fc962e8-6c64-4dad-a4de-7e2aabc46111 Cleaning up challenges

The operating system my web server runs on is (include version):
AWS Lambda with Python, certbot==2.10.0 and certbot-dns-route53==2.10.0

Hi @pek77, and welcome to the LE community forum :slight_smile:

It seems that your systems may be geoblocking requests.


So my understanding of the certbot plugin is that it uses the AWS GetChange API to confirm that all of the AWS servers have the updated records. So I don't think that's your problem, especially as the error message ("query timed out looking up TXT", during secondary validation) isn't what I'd expect for servers being out of sync. If a validator were querying a server that didn't have the update, it would get a no-record-found error. Instead it's not getting a reply at all (at least not within its timeout) from some of the "secondary validation" checks.
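To make the GetChange idea concrete, here is a minimal sketch of polling a Route 53 change until it reports INSYNC. This is not the plugin's actual code; it assumes `client` is a boto3 Route 53 client (`boto3.client("route53")`) and that `change_id` came from an earlier record update. The function name and the timeout values are made up for illustration.

```python
import time


def wait_until_insync(client, change_id, timeout=120.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll Route 53 until the change is INSYNC or the timeout expires.

    `client` is assumed to be a boto3 Route 53 client. Returns True if the
    change reached INSYNC, False if the timeout expired first.
    """
    deadline = clock() + timeout
    while True:
        # GetChange reports "PENDING" until all Route 53 servers have the change.
        status = client.get_change(Id=change_id)["ChangeInfo"]["Status"]
        if status == "INSYNC":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Note that INSYNC only says the Route 53 servers agree with each other. It says nothing about whether a validator elsewhere in the world can reach them, which is the distinction being made here.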

That's certainly one possibility. Do the DNS servers accept requests worldwide?

One thing to try is to remove certbot from the equation entirely: put a test TXT record up in your domain, then query it from various places around the world and see how quickly it gets a response, if it gets one at all. You may want to particularly check from Sweden and Singapore, but Let's Encrypt is planning on continuing to add more validation perspectives, so your DNS server needs to be able to reply worldwide.
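As a rough way to run that test without certbot, the sketch below sends a single TXT query over UDP straight to one nameserver IP and measures how long the reply takes. It uses only the Python standard library; the function names are made up for illustration, it does not parse or validate the response, and it only tells you about the network path from wherever you run it, so it would need to be run from several locations.

```python
import socket
import struct
import time


def build_txt_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the TXT record of `name`."""
    # Header: ID, flags (all zero, no recursion requested), QDCOUNT=1, rest 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 16 = TXT, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 16, 1)


def time_txt_query(server_ip, name, timeout=5.0):
    """Send one UDP query; return elapsed seconds, or None on timeout/network error."""
    family = socket.AF_INET6 if ":" in server_ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_txt_query(name), (server_ip, 53))
            sock.recvfrom(4096)
        except OSError:  # includes socket timeouts
            return None
        return time.monotonic() - start
```

Calling `time_txt_query(ip, "_acme-challenge.example.com")` for each IPv4 and IPv6 address of your authoritative nameservers (the name here is a placeholder) should show whether some of them time out from a given location.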

If it's just for use in CloudFront, wouldn't it be easier to use the Amazon-provided certificates that AWS Certificate Manager (ACM) has built in? Why are you looking to get a Let's Encrypt certificate? (I'm not familiar with AWS China, so there may be good reasons for it.)


Correct, the --dns-route53-propagation-seconds option was added to the Route53 plugin automatically through inheritance from the parent code, but the plugin never actually used it. You could set it to 1 second or to 1 year and it made no difference, which is why the option is deprecated. It can't simply be removed outright, as that would cause errors for anyone passing the option on the command line, and some users may have hardcoded it somewhere.


Before doing that, I would run the domain through letsdebug.net, which closely resembles the DNS resolving system that Let's Encrypt uses. That might surface another error more quickly.

I agree that geoblocking is the likely culprit, but do you know offhand if this error is similar to the one that typically happens when there is a bad IPv6 configuration next to a good IPv4?


Maybe, but assuming the domain name's NS delegation is set up correctly to the AWS Route 53 authoritative servers (which I guess should always be double-checked), I would expect the nameservers to work on both IPv4 and IPv6. I haven't seen any issues with AWS's IPv6 implementation of DNS, though I've only ever tested us-east-1 and not anything relating to China. Certainly possible that there's a country-level firewall blocking one and not the other or something like that, though.


Both the Route53 DNS servers and the DNS validation servers run on AWS. This kind of traffic is normally not routed through the Internet but stays within AWS's network. Because the failure is a DNS query timeout, it may be an AWS-specific issue.


I forgot about this earlier... this issue also often happens when there is a routing error on the network.

Going into that server and doing a traceroute can help identify this type of issue. If this is the case, it can either be on the customer's internal LAN, the Amazon WAN, or (in other cases) a global link between any 2 provider networks.


There are some differences in CloudFront in AWS China; here is the official doc: Amazon CloudFront - Getting Started with Amazon Web Services in China

This problem doesn't happen every time. I think the solution would be to give me a longer wait time for propagation? Maybe it's because the records propagate globally from Route 53 in China, or maybe it's caused by the GFW?

Historically Route 53 has always taken up to 60 seconds for the nameservers to sync, but a query timeout is not the same thing. I'd be looking more at the time the DNS server takes to actually respond to a single query, not the propagation time. Maybe check whether UDP is being blocked, whether there are IPv6-only faults, etc.


Like I said, your issue isn't propagation at all, but just queries to your authoritative DNS servers timing out.


Is there any way to solve it? I have no idea how to approach this.

Speak with your DSP.
As redundant as this may look, something is blocking DNS requests from some countries to these servers:

a2z.org.cn nameserver = ns1.amzndns-cn.cn
a2z.org.cn nameserver = ns1.amzndns-cn.biz
a2z.org.cn nameserver = ns1.amzndns-cn.com
a2z.org.cn nameserver = ns1.amzndns-cn.net
a2z.org.cn nameserver = ns2.amzndns-cn.cn
a2z.org.cn nameserver = ns2.amzndns-cn.biz
a2z.org.cn nameserver = ns2.amzndns-cn.com
a2z.org.cn nameserver = ns2.amzndns-cn.net

AWS China and AWS Global are separate (Amazon Web Services in China), and I think the DNS validation servers are on AWS Global. So a DNS query may have to cross the border from AWS Global to AWS China.


The IPs for those nameservers are all within vercara.com networks.
ARIN Whois/RDAP - American Registry for Internet Numbers - 156.154.60.0
ARIN Whois/RDAP - American Registry for Internet Numbers - 204.74.120.0
ARIN Whois/RDAP - American Registry for Internet Numbers - 2001:502:4612::
ARIN Whois/RDAP - American Registry for Internet Numbers - 2610:a1::


Well, we've given a few ideas for attempting to make a simpler reproduction of the problem:

But even once you do that, it's really just to give evidence that you can bring to AWS support about issues with their DNS servers' connectivity. The problem isn't on your side or the requests you're sending to update your DNS servers, but with connections from the world to your DNS servers.

Some other workarounds you might want to try are using a different DNS provider, or trying a different CA. (There are many that support ACME validation using certbot, but it's likely they may have the same sorts of issues connecting to your DNS servers as Let's Encrypt does.)


Indeed, I agree with your point that the problem is connections from the world to my DNS servers. But the "During secondary validation" issue is intermittent, not happening all the time. This is also something I'm very confused about.

So I would like to ask the experts in this community: is it possible to set a number of retries for certbot as a workaround? After all, I guess this DNS issue may be caused by the characteristics of the internet in China.

Our one-click deployment solution uses certbot with an IAM role that has Route53 permissions, which lets the solution publish DNS records (the TXT records from certbot) automatically via Lambda, without manual operation of Route53. With other DNS providers, it would be difficult for us to achieve full automation. Therefore, Route53 in AWS China remains the only option for this solution.

No, "propagation" has nothing to do with it, as Certbot isn't sending the request to the CA to validate until after the AWS DNS servers are in sync.

You should be able to try running certbot a few times, but I don't think there's anything internal to certbot for you to tweak here. Certbot's default recommended configuration is to run twice a day, and it starts trying to renew 30 days before expiration, which should generally allow one to get certificates before expiration if your system is reliable enough.

Sure, but it may be difficult to achieve full automation with this system too, if your DNS servers aren't accessible to the world all the time.

You may want to try a multi-CA strategy, where you try different CAs at different times to get a higher chance of one of them getting through.
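A rough sketch of what that multi-CA fallback could look like is below. It assumes `run` is a callable that takes a certbot argument list and raises on failure (for example a thin wrapper around `certbot.main.main`). The helper names are made up for illustration, and any second directory URL is something you would supply yourself. Other CAs may also need extra registration steps that this sketch does not cover.

```python
# Let's Encrypt's production ACME directory.
LETS_ENCRYPT = "https://acme-v02.api.letsencrypt.org/directory"


def with_server(base_args, server_url):
    """Return a copy of the certbot args that targets `server_url`."""
    args = []
    skip_next = False
    for arg in base_args:
        if skip_next:  # drop the value of an existing --server option
            skip_next = False
            continue
        if arg == "--server":
            skip_next = True
            continue
        args.append(arg)
    return args + ["--server", server_url]


def first_success(base_args, server_urls, run):
    """Try each ACME server in order; return the URL of the first that works."""
    failures = []
    for url in server_urls:
        try:
            run(with_server(base_args, url))
            return url
        except Exception as exc:  # hypothetical: narrow this to certbot's errors
            failures.append((url, exc))
    raise RuntimeError(f"all CAs failed: {failures}")
```

If the other CAs validate from networks that also can't reach your nameservers, this won't help, but it is cheap to try.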


The "During secondary validation" issue is intermittent, not happening all the time. This is also something I'm very confused about...

    try:
        cert_code = certbot.main.main(certbot_args)
    except certbot.errors.AuthorizationError as e:
        logger.error(e)
        # Retry once with the alternate arguments if the challenges failed
        if str(e) == "Some challenges have failed.":
            cert_code = certbot.main.main(new_certbot_args)
        else:
            raise

The above code is also my “clumsy” method.
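A slightly more general version of the same idea, as a sketch only: retry a bounded number of times and wait longer after each failure. Certbot has no built-in option for this as far as this thread establishes, so `run_with_retries` is a made-up helper, and the attempt count and delays are arbitrary assumptions.

```python
import time


def run_with_retries(run, attempts=3, base_delay=30.0,
                     retry_on=(Exception,), sleep=time.sleep):
    """Call run() up to `attempts` times, doubling the wait after each failure.

    Returns run()'s result on the first success; re-raises the last error
    if every attempt fails. Exceptions not listed in `retry_on` propagate
    immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except retry_on:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

With certbot this might be called as `run_with_retries(lambda: certbot.main.main(certbot_args), retry_on=(certbot.errors.AuthorizationError,))`. Keep the total wait inside the Lambda timeout, and keep the attempt count low, since Let's Encrypt rate-limits failed validations.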

Does this mean I can try options other than Let's Encrypt, such as ZeroSSL, SSLForFree, etc.?
