Issue with AWS API when using certbot-dns-route53 with many domains


#1

Hello,

We have this issue from like two months ago and I was exchanging several emails with AWS Support and the Route53 team trying to debug the issue, as it is a lot of material I will just publish the latest information and if you have any questions just let’s me know.

The problem

When you want to generate/renew a certificate with several domains (I don’t have the exactly number but for sure with 15+ domains) using the certbot-dns-route53 plugin certbot enter in a loop displaying the following error Resetting dropped connection: route53.amazonaws.com

How to reproduce it

TLDR;
Go to run the command section using route53 plugin with 15+ domains or subdomains to request a certificate.

Our instance

Ubuntu 16.04 LTS choosen direcly from the EC2 marketplace.

$ uname -a
Linux bristol 4.4.0-1055-aws #64-Ubuntu SMP Thu Apr 5 17:06:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Register a domain and place it in Route53
I think that it issue will be reproducible also if it is only one domain with 20 subdomains. So, the team doesn’t need 20 different domains.

Install AWS CLI and configure permissions (instance role, env variable, config file…)

Install certbot and dns-route53 pluggin

sudo apt-get update
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update
sudo apt-g$ et install certbot 
sudo apt-get install python3-certbot-dns-route53

More information here https://certbot.eff.org/lets-encrypt/ubuntuxenial-other

Configure AWS policy for Route54 in order to allow access to dns-route53 pluggin

    {
        "Version": "2012-10-17",
        "Id": "certbot-dns-route53 sample policy",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "route53:ListHostedZones",
                    "route53:GetChange"
                ],
                "Resource": [
                    "*"
                ]
            },
            {
                "Effect" : "Allow",
                "Action" : [
                    "route53:ChangeResourceRecordSets"
                ],
                "Resource" : [
                    "arn:aws:route53:::hostedzone/YOURHOSTEDZONEID"
                ]
            }
        ]
    }

More information here https://certbot-dns-route53.readthedocs.io/en/latest/

  • Run the command
/usr/bin/certbot certonly --non-interactive --dns-route53 --cert-name domain.com --domain domain.com --domain 1.domain.com --domain 2.domain.com --domain 3.domain.com --domain 4.domain.com --domain 5.domain.com --domain 6.domain.com --domain 7.domain.com --domain 8.domain.com --domain 9.domain.com --domain 10.domain.com --domain 11.domain.com --domain 12.domain.com --domain 13.domain.com --domain 14.domain.com --domain 15.domain.com --domain 16.domain.com --domain 17.domain.com --domain 18.domain.com --domain 19.domain.com --domain 20.domain.com  --keep-until-expiring  --renew-with-new-domains --rsa-key-size 2048 --email any@email.com --agree-tos --test-cert

What we reach after debugging

TLDR;
The debugging of the network traffic and certbot log using the --debug flag reveal that certbot is trying to reuse a TCP connection which was already in place to send new HTTP request, since the TCP connection might be already timed out, the certbot tries to reuse a TCP connection which was already closed and need to reset it, hence we see the following log line.

Technical information
Looking at the log files, we see successful connection as well as one which was initially dropped:
Successful:

2018-10-10 08:51:49,371:DEBUG:botocore.endpoint:Sending http request: <PreparedRequest [GET]> 2018-10-10 08:51:49,461:DEBUG:requests.packages.urllib3.connectionpool:"GET /2013-04-01/change/C2IWBA1RLANJZ5 HTTP/1.1" 200 314 2018-10-10 08:51:49,462:DEBUG:botocore.parsers:Response headers: {'Date': 'Wed, 10 Oct 2018 08:51:48 GMT', 'x-amzn-RequestId': 'b9b5027d-cc69-11e8-ae8a-03d834cd6f4e', 'Content-Type': 'text/xml', 

Dropped:

2018-10-10 08:51:44,033:DEBUG:botocore.endpoint:Sending http request: <PreparedRequest [GET]> 2018-10-10 08:51:44,033:INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: route53.amazonaws.com 2018-10-10 08:51:44,361:DEBUG:requests.packages.urllib3.connectionpool:"GET /2013-04-01/change/C2IWBA1RLANJZ5 HTTP/1.1" 200 314 2018-10-10 08:51:44,362:DEBUG:botocore.parsers:Response headers: {'Date': 'Wed, 10 Oct 2018 08:51:43 GMT', 'x-amzn-RequestId': 'b6a9e55e-cc69-11e8-ae8a-03d834cd6f4e', 'Content-Type': 'text/xml',

More from the feedback from AWS Support, I quote:

In a successful request, we see the library being able to get a TCP connection which is most likely to be open state and then it sends the HTTP request:
2018-10-10 08:51:49,461:DEBUG:requests.packages.urllib3.connectionpool:“GET /2013-04-01/change/C2IWBA1RLANJZ5 HTTP/1.1” 200 314

Based on that, I don’t think there’s an issue with Route53 endpoint. Its more likely the “certbot” trying to reuse an expired TCP connection and thus it gets reset by Route53 endpoint.

Also, some HTTP 405 were observed on letsencrypt endpoint which means that some “not allowed” http method are being received:

2018-10-10 08:43:40,613:DEBUG:acme.client:Sending HEAD request to https://acme-staging-v02.api.letsencrypt.org/acme/new-order .
2018-10-10 08:43:40,811:DEBUG:requests.packages.urllib3.connectionpool:“HEAD /acme/new-order HTTP/1.1” 405 0
2018-10-10 08:43:40,812:DEBUG:acme.client:Received response:
HTTP 405
Content-Length: 103
Server: nginx
Connection: keep-alive
Expires: Wed, 10 Oct 2018 08:43:40 GMT
Allow: POST
Content-Type: application/problem+json
Pragma: no-cache
Date: Wed, 10 Oct 2018 08:43:40 GMT
Cache-Control: max-age=0, no-cache, no-store

It is recommended by our Route53 team to try and perform this task using the AWS CLI as certbot is a third party tool and it is really difficult to guess what is happening under the hood (as to how the certbot sends the API request to Route53 endpoint and how can we force the certbot to always initiate a new TCP connection). Thus, we recommend to perform this task of record verification (using the GETCHANGE API) using the AWS CLI. We also recommend you to keep the packet capture and log collection (using --debug) running as it will help us to troubleshoot in case we run into similar issues. Performing this would help us to verify if this issue is pertaining to the certbot.

Conclusion

So far, we were not able to solve the issue and after a long time of debugging together with the AWS team, the issue looks to be from the certbot or certbot-dns-route53 side. It is really outside of our scope can solve it.

As it affect our production server, as we use Cloudfront and a quite complicate setup (that is why we can not use automatize http-01 challenge as we were doing before) I was able to generate the certificate doing a manual bypassing for the http challenge, now this is affecting the renew of the certificate, producing several error and loops.

I was able to reproduce this issue in several different EC2 instance, but I think that also using Route53 and certbot with the plugin in any server, the result will be the same. It happen with stagging and production let’s encrypt servers.

If you need the full log and network capture as well the command output I can provide this information.

Thanks,
J.


#2

I realize the fix may need to be in the certbot logic and how it handles stale connections.
But as that solution may take some time to develop.
You might want to try increasing (or decreasing) the timeout on your server and see if that can overcome this issue.

See:
https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html
http://www.tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html


#3

Thanks for reply @rg305 .

What is throwing timeout is the connection from certbot (which actually use boto) to AWS API servers, that is not configurable from our side :confused:, it is not in a external call to our http server.


#4

Hi @ejds,

Do you think, then, that this is an upstream bug in the boto library which the boto developers might be best-positioned to fix?


#5

Hello @schoen,

Not sure, it could be part of the implementation of the certbot-dns-route53 plugin, there are solution from AWS to deal with this kind of situation, check:

https://docs.aws.amazon.com/general/latest/gr/api-retries.html

I just mention that certbot-dns-route53 is using boto but not sure exactly where could be the issue at this moment, also could be in urllib3.

I am trying to debug it from top to down, and as a bug from AWS side is almost discarded, next is certbot and certbot-dns-route53.

So, since it is very technical maybe somebody more capable, with more experience, could confirm where the bug is, if in certbot, certbot-dns-route53 (most possible at this moment) or some dependency of them.

I see this as a important issue, I don’t have the exact number from which this become an issue, but it is very low amount of domains (8? 10? 15?), then using dns-01 challenge would be impossible to request/renew certificates from a certain numbers of domains, there is a limitation there given by this issue.

@schoen can you as Certbot engineer ensure that this is a boto issue? If yes, then I will open a ticket in boto and sending it here for reference.

Thanks for reply


#6

My first hunch would be that maybe boto3 needs to be setting some retry options when creating a urllib3 client; alternately it may be that urllib3 doesn’t properly handle reset connections.

To verify the latter, you could set up a small Python script that creates a urllib3 client in the same way as boto3, then requests an arbitrary URL from the AWS API, then sleeps a configurable amount of time, then requests that URL again. You would run it with gradually increasing amounts of time between requests in order to eventually get AWS to reset the connection, and see what the behavior is. If you can reproduce the “Resetting dropped connection” error, the issue is in urllib3.

Also, could you upload full Certbot logs as an attachment? The timing and details of the errors are probably informative.


#7

Hello @jsha,

I see what you mean, it makes sense, I am not experienced on python, urllib3 and boto3 but it looks not to be complicated to code, I will give a try!

I don’t find where upload or attach a file here, but I just uploaded it to google drive, here the link:
https://drive.google.com/file/d/1AKsKlQ3Ecb6m-85f55H6YusN48lICr5b/view?usp=sharing

Let’s me know if I can contribute with something else meanwhile,
Thanks for reply


#8

Looking at your logs, here’s the full output that mentions “Resetting”:

2018-10-10 08:51:44,033:INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: route53.amazonaws.com
2018-10-10 08:51:44,361:DEBUG:requests.packages.urllib3.connectionpool:"GET /2013-04-01/change/C2IWBA1RLANJZ5 HTTP/1.1" 200 314

Notice that that’s an INFO log line, and it’s followed by a successful fetch. I think that part of things is working as intended: The connection gets reset, and urllib is successfully retrying the request.

The root problem here is that when you submit your zone change to Route53, certbot-route53 has to wait for that’s change’s state to go from “PENDING” to “INSYNC.” In the log you shared, it looks like at least one of your zone changes stayed “PENDING” for eight minutes before you killed it. Route53 is usually much faster - on the order of a minute or two.

Maybe you could ask Route53 support why something might stay PENDING for a while? Do you do a lot of rapid automated updates to this zone, or (equivalently) do you have certbot-route53 running very frequently? I’ve seen some issues where the time to INSYNC increases when I’m doing lots of updates.


#9

I see, no we don’t have certbot-route53 running frequently, with few domains it works fine, the problem start when there are a lot of domains.

I will ask to AWS about the PENDING status.

First time that we tried to get a new certificate for all the domains with certbot-route53 it was runing more than 1 hour without success, and the command return several Resetting dropped connection: route53.amazonaws.com like in a loop.


#10

Do you have any other software automating updates to any of your domains?


#11

No, we don’t have other software touching the domains, actually we just migrated our domains to Route53 in order to can use the certbot-route53 plugin before we were using http-01 challenge but after the implementation of CloudFront it was not possible.

The only software is certbot and the cronjob for renew.


#12

BTW, is there a reason you need Let’s Encrypt certificates if your site is behind Cloudfront? I assume you are trying to secure the link from Cloudfront to your servers, which makes sense. But I’m wondering if Cloudfront offers other options for validating those certificates, like pinning to a specific self-signed certificate, or using an internal-only CA like Cloudflare’s Origin CA.


#13

yes, but we are also using it for the connection from the browser to CF, when a certificate is emited or renew, we run a script which automatically upload it to the AWS Certificate Manager that is from where CF and the ALB/ELB consume the certificate.

And our app support 9 languages each one with a domain, plus the www subdomain we there already have 18 domains, plus a main domain with three subdomains: www, api and static, there are 21 domains for LE. Additionally we have a lot of aliases with they www subdomain that redirect our main site (there are not really necessary to have certificate, but with 21 domains certbot-route53 already fail.

I got an answer from AWS regarding the “PENDING” status, I think that this could be the reason of the issue, I quote:

"I see that you have been using a script to make the GETCHANGE API calls. According to the AWS documentation for these API calls, usually it should take up to one minute to get a change updated around all Edge Locations, changing the status from PENDING to INSYNC. However, it also depends upon how big is the change batch that being performed which might take a few more minutes. It seems like the script waits for the change batch to be propagated around all Edge Locations and it waits for the status to change to INSYN to continue sending more change batches, however, at the Route53 endpoint we have a TCP idle timeout setting which would kill the connection if there’s no traffic passing through it. Since the script is waiting for the change to be propagated and there is no traffic for few minutes, R53 closes the connection and when the script tries to send more data, they get reset back from Route53.

Thus, it is recommended to implement a logic where certbot checks the TCP_STATE_MACHINE which would make sure that the TCP connection is open in order to continue to re-use a tcp connection or a logic where the certbot does not try to re-use tcp connection at all and send a new tcp connection for new change batch request."


#14

Is there any indication that the reset connections are a problem?

jsha noted above that boto3 seems to be succeeding in opening a new connection and making a request after it logs the connection reset.

It looks like a red herring.

Certbot’s Route 53 plugin has literally about 20 lines of code dealing with Route 53. If there’s a bug, it’s either in boto – which is developed by Amazon – or one of the underlying libraries boto uses, like urllib3, or Certbot is managing to misuse boto’s API in very little code. :confused:

I’m not trying to shift blame but… okay yes I am. They’re blaming Certbot for TCP stuff that Certbot is operating three steps removed from.


#15

I understand what you said and I don’t agree as well, but from their side the after checking they reach that there is not issue and the API is working as it should be.

So, then you think that it is not a certbot-dns-route53-plugin issue?


#16

It could be some firewalls’ stateful inspection (along the path) that is dropping “stale” connections after some specific amount of time.
Which generally explains why when both sides [client and server] insist that they are not to blame (as their connection is still seen as “active”) there is an actual “culprit”.


#17

As we said before, the “Resetting dropped connection” is almost certainly a red herring. You can see from the logs that requests recovers automatically from it, as it should.

@ejds, you can try this: In your Certbot installation, find dns_route53.py. Look for lines that look like this:

    def _wait_for_change(self, change_id):                                                                                                                                                           
        """Wait for a change to be propagated to all Route53 DNS servers.                                                                                                                            
           https://docs.aws.amazon.com/Route53/latest/APIReference/API_GetChange.html                                                                                                                
        """                                                                                                                                                                                          
        for unused_n in range(0, 120):       

Change that last line to

for unused_n in range(0, 1200):

That will cause the plugin to wait for up to 100 minutes for the changes to propagate, instead of for 10 minutes. Let us know if that works! And if not, please upload a fresh log.


#18

Thanks for reply and follow the issue,

I found the file in our distribution (Ubuntu 16.04 LTS) here /usr/lib/python3/dist-packages/certbot_dns_route53/dns_route53.py and modify the line changing the value of 120 to 1200, later I tried to get a certificate (the full command and it output is also attached with the log file).

The result was exactly the same, it doesn’t worked. As a note, I find that certbot command output Resetting dropped connection: route53.amazonaws.com at the same interval as if the value in the code were not changed.

Here is the clean full log:
https://drive.google.com/open?id=1wTHHnaNwaxRIREOFcIL60a358dm49uMR


#19

I am doing a test with only one domain (one Hosted Zones) and several sub domain where I found some interesting things.

As it is only one Hosted Zone I have the possibility to refresh the AWS Web Console and see how the _acme.. TXT subdomain for the challenge are inserted.

I executed the following command:

certbot certonly --non-interactive --dns-route53 --cert-name domaintest --domain 1.domain.com --domain 2.domain.com --domain 3.domain.com --domain 4.domain.com --domain 5.domain.com --domain 6.domain.com --domain 7.domain.com --domain 8.domain.com --domain 9.domain.com --domain 10.domain.com --domain 11.domain.com --domain 12.domain.com --domain 13.domain.com --domain 14.domain.com --domain 15.domain.com  --keep-until-expiring  --renew-with-new-domains --rsa-key-size 2048 --email one@email.com --agree-tos --test-cert --debug

I realize that to insert the acme TXT records it take a long time, like approximately 10 second per each record. After the insertion of 7 records the command output Resetting dropped connection: route53.amazonaws.com but later it continue with the insertion of the acme TXT records for the next domains, which after exactly 7 records more it throw another Resetting dropped connection: route53.amazonaws.com, later it follow with the insertion of the last domain and it finish properly.

I illustrate a merge between the AWS Console and the command output:

1.domain.com
_acme record created
10.domain.com
_acme record created
11.domain.com
_acme record created
12.domain.com
_acme record created
13.domain.com
_acme record created
14.domain.com
_acme record created
15.domain.com
_acme record created
> Resetting dropped connection: route53.amazonaws.com
2.domain.com
_acme record created
3.domain.com
_acme record created
4.domain.com
_acme record created
5.domain.com
_acme record created
6.domain.com
_acme record created
7.domain.com
_acme record created
8.domain.com
_acme record created
> Resetting dropped connection: route53.amazonaws.com
9.revoldev.com
_acme record created

Additionally after certbot finish to insert all the _acme records, this was the output of the command (full output attached in the file):

Resetting dropped connection: route53.amazonaws.com
Resetting dropped connection: route53.amazonaws.com
Waiting 10 seconds for DNS changes to propagate
Waiting for verification...
Resetting dropped connection: acme-staging-v02.api.letsencrypt.org
Cleaning up challenges
Resetting dropped connection: route53.amazonaws.com

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/cuextest/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/cuextest/privkey.pem
   Your cert will expire on 2019-02-11. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"

I find that the Resetting dropped connection: acme-staging-v02.api.letsencrypt.org could be interesting to notice, as well another Resetting dropped connection: route53.amazonaws.com after all acme records were inserted.

The whole command took ~10 minutes to run, so we can say that each approximately 3 minutes is a Resetting dropped connection: route53.amazonaws.com.

I still keep the change propuse by @jsha from 120 to 1200 in the dns_route53.py code.

After all the certificate was created correctly.

Here a clean log file and command output from this test:
https://drive.google.com/open?id=15dAm8F3YyCgwColM6ZEe5CkLUehWh-cf

Our production app which is affected by this have an amount of 70 domains/subdomains in several hosted zones.

Following the result of the test, get a new certificate or renew it could take up to 46 minutes (10 minutes each 15 domains), the first time that I tried to get a certificate for our production app, certbot was running more than 60 minutes without result, then I opened a ticket in AWS and after a month of research a ticket here.

I will try to leave certbot running several hours to see if it can renew the certificate. But, if I am not confuse, the cron tab should did it, but the certificate is already near to expire and it was not renew.

Let’s me know if you have any question or need more information.


#20

I confirm that I was able to get a certificate in staging with 70 domains using the initial setup domains. It took 43 minutes, more or less as the expected.

I don’t know why, on our first deployment (using OpsWorks (Chef)) it was running more than one hour without success. Now, I run it manually in the console, and after several Resetting dropped connection: route53.amazonaws.com it works.

Actually it were 10 Resetting dropped connection: route53.amazonaws.com, for 70 records, 1 each 7 records, as expected. I am curious about why the exactitud on these measures.

I attach the log and command output:
https://drive.google.com/open?id=1azmhh0wv7ieU0kB1g8D4S4EcD3h56KIa

Now, I am trying to debug, if it works… why in production the cron job for auto-renewal is not renewing the certificate, now it is at 27 days of the expiration Expiry Date: 2018-12-11 10:43:30+00:00 (VALID: 27 days) and it should be auto renewal automatically (yes, cerbot renew is in the cronjob, it comes out of the box in Ubuntu but also I checked)

UPDATE
The renewal looks that was not working because we had a wrong renwal configuration, from a manual test, using webroot which because CloudFront and our setup (more info in first post), doesn’t work.