Issue with AWS API when using certbot-dns-route53 with many domains

My current theory is that sending multiple updates to the same hosted zone in succession makes the later updates very slow to go INSYNC. I think it might be possible to speed things up by adding some complexity to the dns-route53 plugin. Right now it sends each update individually, even if there are multiple updates to a single hosted zone. Instead, it could group updates by which hosted zone they are in, so it can send just one update per zone. Would you be interested in trying to write that patch?

Thanks,
Jacob

For what it’s worth, I’ve never had problems making multiple updates to the same zone.

(I use a custom Certbot manual auth hook with boto 2.)

I probably haven’t ever done more than 30-40 additions in quick succession. (Plus the same number of deletions, but I don’t check how quickly those go INSYNC.)

@jsha The patch sounds like a good idea, but there are some considerations.

The test that I ran and posted here was under a single Hosted Zone, not several as in our main production setup. In summary, it makes no difference whether there is one Hosted Zone or many: the UPSERT time for each _acme... TXT record is the same.

I don't know Python, but I can read it. Maybe this is a newbie question, but about this function of the certbot-route53-plugin ( https://certbot-dns-route53.readthedocs.io/en/latest/_modules/certbot_dns_route53/dns_route53.html ):

    def perform(self, achalls):
        self._attempt_cleanup = True

        try:
            change_ids = [
                self._change_txt_record("UPSERT",
                  achall.validation_domain_name(achall.domain),
                  achall.validation(achall.account_key))
                for achall in achalls
            ]

            for change_id in change_ids:
                self._wait_for_change(change_id)
        except (NoCredentialsError, ClientError) as e:
            logger.debug('Encountered error during perform: %s', e, exc_info=True)
            raise errors.PluginError("\n".join([str(e), INSTRUCTIONS]))
        return [achall.response(achall.account_key) for achall in achalls]

The calls to _change_txt_record() and, later, _wait_for_change() happen synchronously, right? In other words, all the records are inserted first, and then the code waits for each record, one by one, to sync?

I contacted AWS Support again, and they answered that in a test using the AWS CLI the records were inserted quickly, without delay. I asked for the command they used so that I can test it against our Route53. In addition, I will try to modify the plugin, for debugging and testing purposes, so that it only inserts the records and does nothing else, to see whether that step alone takes so long.

@mnordhoff If the code runs synchronously, the delay is in the insertion and not only in waiting for INSYNC. I tested here with 15 domains while manually refreshing the Route53 console to watch the records being inserted: each record takes about 10 seconds to be inserted, and about every 7 records a connection reset error is thrown.

Yep! Note that we need to wait for all records to be in sync before we proceed, so the order in which we check for sync doesn't really matter.

The change I'm proposing is something like this: modify _change_txt_record so it can take multiple [FQDNs, validation] pairs, so long as all those FQDNs are in the same hosted zone. Then do some grouping of FQDNs by hosted zone, so you only need a single change id per hosted zone.
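The grouping proposed above could be sketched roughly as follows. This is only an illustration, not the plugin's code: build_change_batches and the find_zone_id callable are hypothetical names (the plugin's own _find_zone_id_for_domain would play the latter role), and validations for the same record name are merged into one record set on the assumption that a batch should carry a single UPSERT per name.

```python
import collections


def build_change_batches(challenges, find_zone_id, ttl=10):
    """Group (fqdn, validation) pairs into one UPSERT ChangeBatch per hosted zone.

    `find_zone_id` is a hypothetical callable that maps an FQDN to the id of
    the hosted zone responsible for it.
    """
    # zone id -> record name -> list of TXT values for that name
    by_zone = collections.defaultdict(lambda: collections.defaultdict(list))
    for fqdn, validation in challenges:
        # TXT values are wrapped in double quotes, as the plugin does.
        by_zone[find_zone_id(fqdn)][fqdn].append(
            {"Value": '"{0}"'.format(validation)})

    batches = {}
    for zone_id, names in by_zone.items():
        batches[zone_id] = {
            "Comment": "certbot-dns-route53 certificate validation UPSERT",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "TXT",
                        "TTL": ttl,
                        "ResourceRecords": records,
                    },
                }
                for name, records in names.items()
            ],
        }
    return batches
```

Each value in the returned dict could then be passed as ChangeBatch, together with its key as HostedZoneId, to a single change_resource_record_sets call, giving one change id to wait on per hosted zone.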

BTW, to make sure we're on the same page: How many total hosted zones do you have, and how many FQDNs per hosted zone are you trying to issue for? I'm interested because you said things start erroring when you try to issue for 21 FQDNs. If that's 21 FQDNs in a single hosted zone, that's one thing; if that's 21 FQDNs across 21 hosted zones, that would lead us in a different direction.

I understand the changes; they make sense and look like a good optimization. I could try to code it, but since I'm not a Python developer it could take me a while. On the other hand, it sounds like a good opportunity to get familiar with Python.

To make sure we're on the same page: I ran certbot with the route53 plugin in both scenarios: 1) a total of 70 domains and subdomains across around 30 different Hosted Zones, and 2) a test with 15 subdomains in the same Hosted Zone.

In both cases, 1 and 2, the UPSERT of each domain takes the same time; it doesn't seem to matter whether the domains are in the same or different hosted zones. Actually, it was in case 2 that I realized a good part of the delay was caused by the UPSERT of each domain taking too long… since I could watch the UPSERTs in real time by manually refreshing the Hosted Zone records in the AWS console.

Yesterday, using a bash script and the AWS CLI, I tested inserting a burst of 15 domains into the same Hosted Zone used for the test. To simulate the certbot-route53-plugin, I made 15 separate requests rather than one batch (15 batches with 1 domain each). The result was that the domains were UPSERTed very quickly: it took ~1 minute, while the plugin takes ~10 minutes for the same number of domains in the same single Hosted Zone.

Now I want to try the same thing using boto3, just a plain UPSERT for 15 domains, to see how long it takes to insert the domains into the Hosted Zone.
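A timing test like the one described could look roughly like this. It is a sketch under assumptions, not the script actually used: time_upserts is a hypothetical helper, the client is expected to behave like boto3.client("route53") (only change_resource_record_sets and get_change are called, the same two calls the plugin makes), and the polling limits are arbitrary.

```python
import time


def time_upserts(client, zone_id, names, value, ttl=10,
                 poll_interval=5, max_polls=120):
    """Send one UPSERT per name, then wait for every change to be INSYNC.

    Returns (seconds spent submitting, total seconds including the waits).
    """
    start = time.monotonic()
    change_ids = []
    for name in names:
        response = client.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Comment": "timing test",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "TXT",
                        "TTL": ttl,
                        # TXT values must be wrapped in double quotes.
                        "ResourceRecords": [{"Value": '"{0}"'.format(value)}],
                    },
                }],
            },
        )
        change_ids.append(response["ChangeInfo"]["Id"])
    submitted = time.monotonic()

    # Only start polling once every request has been sent.
    for change_id in change_ids:
        for _ in range(max_polls):
            status = client.get_change(Id=change_id)["ChangeInfo"]["Status"]
            if status == "INSYNC":
                break
            time.sleep(poll_interval)
        else:
            raise RuntimeError("Timed out waiting for " + change_id)

    return submitted - start, time.monotonic() - start


# Usage against a real zone (needs boto3 and AWS credentials; the zone id and
# record names below are placeholders):
#   import boto3
#   names = ["_acme-challenge.{0}.example.com".format(i) for i in range(1, 16)]
#   print(time_upserts(boto3.client("route53"), "ZXXXXXXXXXXXXX", names, "test"))
```

Comparing the two returned numbers would show whether the time goes into sending the requests or into waiting for INSYNC.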

I’ve mostly used CREATE and DELETE instead of UPSERT. I wonder if UPSERT can hit performance issues.

Or maybe I’m just lucky, or never noticed any slowness I might have experienced, because I have one zone with few records…

@mnordhoff Doing the same UPSERT with the AWS CLI works very quickly, but not from the route53-plugin. I am now investigating whether it is something related to boto3.

Can you share your manual setup and scripts with us? I would like to test them with our domains.

It would be good if you could provide the script that you use for the manual route53 challenge, and how you execute certbot to call this script.

@jsha @schoen I am trying to make some changes to the route53-plugin code, and I just realized something. I always believed that the complete list of domains given in the certbot command was passed to the route53-plugin, which then iterated over them, but that is not how it works: it looks like the route53-plugin is called once per domain.

That explains why it takes ~10 seconds to add each domain: since the route53-plugin is called once per domain, it waits for that domain's UPSERT to be INSYNC before finishing, and is then called again with the next domain. As a note, the TTL is 10; that is why it takes around 10 seconds to be INSYNC.

Please correct me if I am wrong, but from this we can deduce two things. First, your idea of optimizing by batching UPSERTs per hosted zone is not possible (the plugin handles only one domain per call). Second, it would not be possible to optimize this from within the route53-plugin, since the plugin is always executed for one domain at a time.

Some possible solutions could be:
a) Have certbot pass the whole list of domains to the route53-plugin and handle the batch insertion as you propose; additionally, UPSERT all the records first and then wait for all of them to be INSYNC.
b) Have certbot call the route53-plugin with one domain at a time (as seems to be happening now) but asynchronously, so that all domains are inserted and wait for INSYNC in parallel rather than in a queue.

The way certbot interacts with the route53-plugin is not very optimized, and if it works like that, it all makes sense.

Your first belief was correct - the certbot-dns-route53 plugin does indeed get the full list of domains. It sends all UPSERTs sequentially, then waits for all changes sequentially. Here's the code: https://github.com/certbot/certbot/blob/5073090a20fa59fae45b4d90bfb41635bc181911/certbot-dns-route53/certbot_dns_route53/dns_route53.py#L47-L62.

Are you using an up-to-date copy of the plugin? Want to share the current contents of your dns_route53.py?
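To illustrate why the order matters, here is a small simulation of the two behaviours discussed in this thread: submitting and waiting one record at a time, versus submitting everything and then waiting. FakeRoute53 is a made-up stand-in, not the real API, and it assumes every change becomes INSYNC 10 seconds after submission (roughly what was observed above); real propagation times will vary.

```python
class FakeRoute53:
    """Stand-in for Route53: a change is INSYNC `delay` seconds after submit."""

    def __init__(self, delay=10):
        self.delay = delay
        self.now = 0          # simulated clock, in seconds
        self._submitted = {}  # change id -> simulated submission time

    def submit(self, name):
        change_id = "change-{0}".format(len(self._submitted))
        self._submitted[change_id] = self.now
        return change_id

    def wait(self, change_id, poll_interval=5):
        # Poll until the change is old enough; each poll "sleeps" 5 seconds.
        while self.now - self._submitted[change_id] < self.delay:
            self.now += poll_interval


def one_at_a_time(r53, names):
    """Submit a record and wait for it before submitting the next one."""
    for name in names:
        r53.wait(r53.submit(name))
    return r53.now


def submit_all_then_wait(r53, names):
    """Submit every record first, then wait for each change in turn."""
    change_ids = [r53.submit(name) for name in names]
    for change_id in change_ids:
        r53.wait(change_id)
    return r53.now


names = ["_acme-challenge.{0}.example.com".format(i) for i in range(1, 16)]
print(one_at_a_time(FakeRoute53(), names))         # 150 simulated seconds
print(submit_all_then_wait(FakeRoute53(), names))  # 10 simulated seconds
```

In the second variant the waits overlap: by the time the first change is INSYNC, the others, submitted at the same moment, are as well.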

Oh! The plugin version that we get when installing it with apt is not the latest one!

Also, I was reading the code here https://certbot-dns-route53.readthedocs.io/en/latest/_modules/certbot_dns_route53/dns_route53.html (which looks to be the latest one) alongside the installed plugin code, and I didn't notice the difference! Indeed, not knowing Python, I missed the line for achall in achalls in the web code, which is also the code I pasted in my comment. Everything looks to be working as you said.

This is the actual code that we have in the installed plugin:

"""Certbot Route53 authenticator plugin."""
import collections
import logging
import time

import boto3
import zope.interface
from botocore.exceptions import NoCredentialsError, ClientError

from certbot import errors
from certbot import interfaces
from certbot.plugins import dns_common

logger = logging.getLogger(__name__)

INSTRUCTIONS = (
    "To use certbot-dns-route53, configure credentials as described at "
    "https://boto3.readthedocs.io/en/latest/guide/configuration.html#best-practices-for-configuring-credentials "  # pylint: disable=line-too-long
    "and add the necessary permissions for Route53 access.")

@zope.interface.implementer(interfaces.IAuthenticator)
@zope.interface.provider(interfaces.IPluginFactory)
class Authenticator(dns_common.DNSAuthenticator):
    """Route53 Authenticator

    This authenticator solves a DNS01 challenge by uploading the answer to AWS
    Route53.
    """

    description = ("Obtain certificates using a DNS TXT record (if you are using AWS Route53 for "
                   "DNS).")
    ttl = 10

    def __init__(self, *args, **kwargs):
        super(Authenticator, self).__init__(*args, **kwargs)
        self.r53 = boto3.client("route53")
        self._resource_records = collections.defaultdict(list)

    def more_info(self):  # pylint: disable=missing-docstring,no-self-use
        return "Solve a DNS01 challenge using AWS Route53"

    def _setup_credentials(self):
        pass

    def _perform(self, domain, validation_domain_name, validation):
        try:
            change_id = self._change_txt_record("UPSERT", validation_domain_name, validation)

            self._wait_for_change(change_id)
        except (NoCredentialsError, ClientError) as e:
            logger.debug('Encountered error during perform: %s', e, exc_info=True)
            raise errors.PluginError("\n".join([str(e), INSTRUCTIONS]))

    def _cleanup(self, domain, validation_domain_name, validation):
        try:
            self._change_txt_record("DELETE", validation_domain_name, validation)
        except (NoCredentialsError, ClientError) as e:
            logger.debug('Encountered error during cleanup: %s', e, exc_info=True)

    def _find_zone_id_for_domain(self, domain):
        """Find the zone id responsible a given FQDN.

           That is, the id for the zone whose name is the longest parent of the
           domain.
        """
        paginator = self.r53.get_paginator("list_hosted_zones")
        zones = []
        target_labels = domain.rstrip(".").split(".")
        for page in paginator.paginate():
            for zone in page["HostedZones"]:
                if zone["Config"]["PrivateZone"]:
                    continue

                candidate_labels = zone["Name"].rstrip(".").split(".")
                if candidate_labels == target_labels[-len(candidate_labels):]:
                    zones.append((zone["Name"], zone["Id"]))

        if not zones:
            raise errors.PluginError(
                "Unable to find a Route53 hosted zone for {0}".format(domain)
            )

        # Order the zones that are suffixes for our desired to domain by
        # length, this puts them in an order like:
        # ["foo.bar.baz.com", "bar.baz.com", "baz.com", "com"]
        # And then we choose the first one, which will be the most specific.
        zones.sort(key=lambda z: len(z[0]), reverse=True)
        return zones[0][1]

    def _change_txt_record(self, action, validation_domain_name, validation):
        zone_id = self._find_zone_id_for_domain(validation_domain_name)

        rrecords = self._resource_records[validation_domain_name]
        challenge = {"Value": '"{0}"'.format(validation)}
        if action == "DELETE":
            # Remove the record being deleted from the list of tracked records
            rrecords.remove(challenge)
            if rrecords:
                # Need to update instead, as we're not deleting the rrset
                action = "UPSERT"
            else:
                # Create a new list containing the record to use with DELETE
                rrecords = [challenge]
        else:
            rrecords.append(challenge)

        response = self.r53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={
                "Comment": "certbot-dns-route53 certificate validation " + action,
                "Changes": [
                    {
                        "Action": action,
                        "ResourceRecordSet": {
                            "Name": validation_domain_name,
                            "Type": "TXT",
                            "TTL": self.ttl,
                            "ResourceRecords": rrecords,
                        }
                    }
                ]
            }
        )
        return response["ChangeInfo"]["Id"]

    def _wait_for_change(self, change_id):
        """Wait for a change to be propagated to all Route53 DNS servers.
           https://docs.aws.amazon.com/Route53/latest/APIReference/API_GetChange.html
        """
        for unused_n in range(0, 120):
            response = self.r53.get_change(Id=change_id)
            if response["ChangeInfo"]["Status"] == "INSYNC":
                return
            time.sleep(5)
        raise errors.PluginError(
            "Timed out waiting for Route53 change. Current status: %s" %
            response["ChangeInfo"]["Status"])

This is the version that we got by installing the certbot-route53-plugin with apt on Ubuntu 16.04 LTS:

$ dpkg-query -s python3-certbot-dns-route53
Package: python3-certbot-dns-route53
Status: install ok installed
Priority: optional
Section: python
Installed-Size: 44
Maintainer: Debian Let's Encrypt Team <team+letsencrypt@tracker.debian.org>
Architecture: all
Source: python-certbot-dns-route53
Version: 0.23.0-2+ubuntu16.04.1+certbot+1
Depends: python3-acme (>= 0.22.0~), python3-boto3, python3-certbot, python3-mock, python3-pkg-resources, python3-zope.interface, python3:any (>= 3.3.2-2~)
Enhances: certbot
Description: Route53 DNS plugin for Certbot
 The objective of Certbot, Let's Encrypt, and the ACME (Automated
 Certificate Management Environment) protocol is to make it possible
 to set up an HTTPS server and have it automatically obtain a
 browser-trusted certificate, without any human intervention. This is
 accomplished by running a certificate management agent on the web
 server.
 .
 This agent is used to:
 .
   - Automatically prove to the Let's Encrypt CA that you control the website
   - Obtain a browser-trusted certificate and set it up on your web server
   - Keep track of when your certificate is going to expire, and renew it
   - Help you revoke the certificate if that ever becomes necessary.
 .
 This package contains the Route53 DNS plugin to the main application.
Homepage: https://certbot.eff.org/

and the certbot version

$ certbot --version
certbot 0.26.1
$ dpkg-query -s certbot
Package: certbot
Status: install ok installed
Priority: optional
Section: web
Installed-Size: 48
Maintainer: Debian Let's Encrypt <team+letsencrypt@tracker.debian.org>
Architecture: all
Source: python-certbot
Version: 0.26.1-1+ubuntu16.04.1+certbot+2
Replaces: letsencrypt
Provides: letsencrypt
Depends: python3-certbot (= 0.26.1-1+ubuntu16.04.1+certbot+2), init-system-helpers (>= 1.18~), python3:any
Suggests: python3-certbot-apache, python3-certbot-nginx, python-certbot-doc
Breaks: letsencrypt (<= 0.6.0)
Conffiles:
 /etc/cron.d/certbot 0b97d70db8c43d86fcdc565590414c79
 /etc/letsencrypt/cli.ini dc5a5672c8f966a968ac0c98c447c14e
 /etc/logrotate.d/certbot a815a66a333f2637c00055fcd44b02d4
Description: automatically configure HTTPS using Let's Encrypt
 The objective of Certbot, Let's Encrypt, and the ACME (Automated
 Certificate Management Environment) protocol is to make it possible
 to set up an HTTPS server and have it automatically obtain a
 browser-trusted certificate, without any human intervention. This is
 accomplished by running a certificate management agent on the web
 server.
 .
 This agent is used to:
 .
   - Automatically prove to the Let's Encrypt CA that you control the website
   - Obtain a browser-trusted certificate and set it up on your web server
   - Keep track of when your certificate is going to expire, and renew it
   - Help you revoke the certificate if that ever becomes necessary.
 .
 This package contains the main application, including the standalone
 and the manual authenticators.
Homepage: https://certbot.eff.org/

This week I am on a short holiday :slight_smile: , but at the beginning of next week I will update the plugin code and test it again.

Thanks

Hello,

I have some good news: after updating the plugin code manually (because the latest python3-certbot-dns-route53 package in Ubuntu 16.04 LTS does not include the update), it works much better.

Generating an SSL certificate for 16 subdomains in the same Hosted Zone took around 1 minute; previously it took 10 minutes. Now, when I refresh the Hosted Zone in the AWS console, I see all the challenge TXT records added quickly, almost all at the same time; previously each domain took around 10 seconds to be UPSERTed to Route53.

On the other hand, I still see a few Resetting dropped connection: route53.amazonaws.com messages in the console. Previously they appeared while the plugin was doing the UPSERTs; now they appear when it cleans up the challenges. Here is the command output:

$ sudo /usr/bin/certbot certonly --non-interactive --dns-route53 --cert-name cuextest --domain 1.domain.com --domain 2.domain.com --domain 3.domain.com --domain 4.domain.com --domain 5.domain.com --domain 6.domain.com --domain 7.domain.com --domain 8.domain.com --domain 9.domain.com --domain 10.domain.com --domain 11.domain.com --domain 12.domain.com --domain 13.domain.com --domain 14.domain.com --domain 15.domain.com  --keep-until-expiring  --renew-with-new-domains --rsa-key-size 2048 --email g.ochsner@revolistic.com --agree-tos --test-cert --debug
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Starting new HTTP connection (1): 169.254.169.254
Starting new HTTP connection (1): 169.254.169.254
Found credentials from IAM Role: RevolisticOpsworksDevelopment
Plugins selected: Authenticator dns-route53, Installer None
Starting new HTTPS connection (1): acme-staging-v02.api.letsencrypt.org
Obtaining a new certificate
Performing the following challenges:
dns-01 challenge for 1.domain.com
dns-01 challenge for 10.domain.com
dns-01 challenge for 11.domain.com
dns-01 challenge for 12.domain.com
dns-01 challenge for 13.domain.com
dns-01 challenge for 14.domain.com
dns-01 challenge for 15.domain.com
dns-01 challenge for 2.domain.com
dns-01 challenge for 3.domain.com
dns-01 challenge for 4.domain.com
dns-01 challenge for 5.domain.com
dns-01 challenge for 6.domain.com
dns-01 challenge for 7.domain.com
dns-01 challenge for 8.domain.com
dns-01 challenge for 9.domain.com
Starting new HTTPS connection (1): route53.amazonaws.com
Waiting for verification...
Cleaning up challenges
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/cuextest/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/cuextest/privkey.pem
   Your cert will expire on 2019-02-24. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"

I will check whether Ubuntu 18.04 LTS ships the latest certbot-route53-plugin version out of the box in its repository. Also, I will soon use the latest plugin code to get a certificate for our production app, with more than 70 domains in different Hosted Zones.

So far, a good change!
Let me know if you have any questions.

I can't edit my previous post to update it, so I am creating a new one:

We successfully obtained an SSL certificate using certbot-route53-plugin for around 68 domains in different Route53 Hosted Zones. It took around 2:30 minutes, which is, I think, within expectations.

All of that worked thanks to @jsha's suggestion to check whether we were using the latest version of the plugin, since Ubuntu 16.04 LTS (where our system runs) provides version 0.23 of the plugin in the official PPA.

Ubuntu 18.04 LTS also doesn't have the latest plugin version in the python3-certbot-dns-route53 package. I am following the installation instructions here: https://certbot.eff.org/lets-encrypt/ubuntubionic-other

Additionally, the version of certbot installed from the official PPA package is not the latest one (0.28.0); it is 0.26.0.

It looks like the Debian Let's Encrypt maintainers are a bit behind, and they also don't keep all the packages at the same version level in the PPA. For example, in the repo for certbot version 0.26.0, the certbot-route53-plugin https://github.com/certbot/certbot/blob/v0.26.0/certbot-dns-route53/certbot_dns_route53/dns_route53.py is updated and differs from the version installed by the package that comes with certbot 0.26.0 from ppa:certbot/certbot

Here I can confirm it: https://launchpad.net/~certbot/+archive/ubuntu/certbot/+packages?field.name_filter=&field.status_filter=published&field.series_filter=bionic (the certbot-route53-plugin version is 0.23.0 and certbot is 0.26.0)
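A version mismatch like this one can also be checked from Python rather than from the package listing. The following is a minimal sketch with hypothetical helper names; it assumes Python 3.8 or newer (for importlib.metadata, which is newer than the Python shipped with Ubuntu 16.04) and that the packages are installed under the distribution names certbot and certbot-dns-route53.

```python
from importlib import metadata


def installed_versions(names=("certbot", "certbot-dns-route53"),
                       lookup=metadata.version):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = lookup(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def versions_in_step(versions):
    """True when every distribution is installed and all versions are equal."""
    found = set(versions.values())
    return None not in found and len(found) == 1
```

For the packages described above, installed_versions() would be expected to report different versions for the two names, and versions_in_step() would then return False.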

I have no idea where to request or report this.
But it's good that it works now.

Additionally, I would like to note that the Resetting dropped connection: route53.amazonaws.com messages continue, but they appear more quickly than with the old version of the plugin. I will try to follow up with the AWS Route53 team to determine the cause and update this conversation if I get any feedback.

I think there is no need to follow up with AWS about the reset connections: they are working as intended. Servers often reset connections from clients that have been connected for a while. Typically this is done to preserve resources like memory. That's why urllib3 has retry logic, which is working correctly here.

Regarding packaging: @bmw could you ask the Debian / Ubuntu maintainers to package the latest version of the DNS plugin?

I understand, but what is strange to me is that now (with the latest version of the plugin) I can see several Resetting dropped connection: route53.amazonaws.com warnings within a period of 10 or 15 seconds. Could that be normal?

This is the output of the command for the 68 domains (abridged), which took ~2:30 minutes. It contains 8 reset connection messages, 4 during the UPSERTs and exactly 4 when removing the challenge TXT records; all the warnings appear within a few seconds of each other:

[...]
dns-01 challenge for ....
dns-01 challenge for ....
Starting new HTTPS connection (1): route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Waiting for verification...
Cleaning up challenges
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Running deploy-hook command: ....
Running deploy-hook command: ....
[...]

Anyway, everything works correctly, so I am just curious about it. If you think that is normal, then there is no need to follow up and I will close the ticket with them.

That number of dropped connections is not totally out of line. I would expect something like that if the server dropped connections that were idle for 15 seconds, which is not an uncommon setting.

It looks like that message is showing up even at a low verbosity level in Certbot, which is interesting. Since it’s an info-level message, I wouldn’t expect it to show up. I’ll take a look and see if we can hide the message since it’s expected.

Perfect then. I will keep in mind the refactor of the plugin code to support batch transactions; I hope to start on it soon and use it as a good opportunity to learn Python better.

Thanks to everybody for the help, posts and ideas… and thanks @jsha for being closely involved with this thread and hitting on the right solution :slight_smile:

Looks like the log level thing is an issue with the version of urllib3 vendored by botocore. I’ve filed an issue to bring it up to date: https://github.com/boto/botocore/issues/1613
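For anyone running their own boto3 scripts on an older botocore and wanting to hide these messages in the meantime, raising the threshold of the relevant logger is one option. This is a sketch, not something Certbot does: the helper name is hypothetical, and the logger name is an assumption based on where the vendored urllib3 copy lives inside older botocore releases, so it may need adjusting.

```python
import logging

# Assumption: logger name used by the urllib3 copy vendored in older botocore
# releases. Adjust it if the messages come from a different logger.
NOISY_LOGGER = "botocore.vendored.requests.packages.urllib3.connectionpool"


def quiet_connection_pool_logging(logger_name=NOISY_LOGGER,
                                  level=logging.WARNING):
    """Raise one logger's threshold so its INFO messages are dropped."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    return logger
```

Calling quiet_connection_pool_logging() once at startup leaves warnings and errors from that logger visible while filtering out the INFO-level messages.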

It looks like more recent versions of boto3 don’t use the vendored copy of urllib3, which means that if you’ve got everything up-to-date, the “Resetting” messages get correctly logged at DEBUG level: https://github.com/boto/botocore/issues/1613#issuecomment-443348460. So no action needed there.

I was following it, great! We will soon update to Ubuntu 18.04 LTS, so maybe it comes with the latest version of boto. For the moment, knowing that the messages are not an issue, there is no problem; the whole process is automated anyway.

@bmw Any update on asking the Debian / Ubuntu maintainers to package the latest version of the DNS plugin? Or, if you can point me to where I can make this request, I will do it.

Thanks!

I reached out to them when jsha initially pinged me and they said they’d do it last week. I reached out to them again on Monday and they said they’d do it this week. It’s still not done though :frowning:

I’ll keep reminding them until they take the latest packages from Debian. With that said, if you or anyone else has experience building packages for Ubuntu and would like to help with this, please let me know. The problem is the PPA is run by volunteers with limited time to work on Certbot.

Due to problems like this, the Certbot team is currently working on our own packages but it’ll take some time before they’re available to the public. We could definitely use help maintaining the PPA in the meantime though.
