Terraform AWS ACME certificate: time limit exceeded

Hello,

I'm facing an issue generating certificates with the Terraform provider vancluever/acme (2.7.0). The domain pacts.cloud is public and under my control, and I have a public Route 53 hosted zone for it.
Here is my code for the certificate creation:

resource "tls_private_key" "private_key" {
    algorithm = "RSA"
}

resource "acme_registration" "registration" {
    account_key_pem = tls_private_key.private_key.private_key_pem
    email_address   = "<my-mail>"
}

resource "acme_certificate" "certificate" {
    account_key_pem           = acme_registration.registration.account_key_pem
    common_name               = "vault.pacts.cloud"
    pre_check_delay           = 1200
    dns_challenge {
        provider = "route53"

        config = {
            AWS_HOSTED_ZONE_ID = var.pacts-cloud-zone-id
        }
    }

    depends_on = [acme_registration.registration]
}

My domain is:
vault.pacts.cloud

When running terraform apply, I'm getting the following output:

However, I can see the validation record during the 3 minutes Terraform is running.
To me, the error suggests that the validation record is not propagating in time. However, the delay set on the certificate seems to be ignored.

Any ideas or tips?

Best regards.

4 Likes

Hi @FlorianGerdes and welcome to the LE community forum :slight_smile:

If the script also removes the TXT, then this is going to be difficult to troubleshoot.
At the moment, there is no such TXT record found.

3 Likes

Thank you for the warm welcome and your response @rg305 .

I think it's the provider that automatically removes the TXT validation record. Actually, isn't the entire validation record only a temporary thing?

Here is what I've read in a tutorial:

Is there any way I can force the record to stay permanently?

3 Likes

The record is not required after use and is usually deleted.
But that complicates troubleshooting, as there is now no evidence of whether it was ever there.

If there was some way to pause the script (before the deletion)...

4 Likes

My bad, I was overcomplicating it. Of course there is a way to interrupt the script. Here we have the TXT validation record in Route 53:

Normally, rerunning the script would try to either delete the old resource and recreate it (if it was properly saved in the terraform state file) or run into conflicts (in case of trying to create another validation route).

1 Like

Perfect!
The TXT record is there.

Now you simply need to start a timer, pause the script, and check all (four) of your authoritative DNS servers to see how long it takes for them all to show that new TXT record.
Do that timer test three times. Then take the average update time and multiply it by three.
And replace the value in your line:

pre_check_delay = 1200

with that calculated number or 1200 [whichever is larger].

If the number was less than 1200, then we may have a real (yet unknown) problem and must continue searching for clues about it.
If the number was larger, then try the new number and report back your findings.
[remember to always use the staging system when conducting such obvious testing]
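
The timing rule and the staging advice above could be expressed in Terraform roughly as follows. This is only a sketch: the sample timings are hypothetical placeholders rather than measurements, and the local names `propagation_samples` and `pre_check_delay` are made up for illustration. The staging directory URL is the public Let's Encrypt staging endpoint.

```hcl
terraform {
  required_providers {
    acme = {
      source  = "vancluever/acme"
      version = "~> 2.7"
    }
  }
}

# Use the Let's Encrypt staging directory while testing, to stay clear of
# the production rate limits.
provider "acme" {
  server_url = "https://acme-staging-v02.api.letsencrypt.org/directory"
}

locals {
  # Seconds until all four authoritative name servers showed the TXT record,
  # one entry per timed run (hypothetical placeholder values).
  propagation_samples = [20, 35, 25]

  # Average of the runs, multiplied by three, but never below 1200.
  pre_check_delay = max(
    1200,
    ceil(3 * sum(local.propagation_samples) / length(local.propagation_samples))
  )
}
```

The certificate resource would then set `pre_check_delay = local.pre_check_delay` in place of the hard-coded value.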

2 Likes

What should happen is that the ACME client should be checking Route 53's API to see if the DNS servers are in sync (that is, if the change set is done) before proceeding. That'd be more reliable than just waiting for a while. I don't know if that particular client's Route 53 implementation does so.

I'm also quite confused as to why, even if the systems weren't in sync, a REFUSED response would be involved at all. Something about what's going on seems weird, especially if that REFUSED status is reproducible. That error message doesn't look like it's coming from the Let's Encrypt servers, even.

3 Likes

@FlorianGerdes Do you have a support contact at AWS? They might be able to look at logs to see why the Route53 DNS sent a Refused response.

I see in the RFC that the response is for

Refused - The name server refuses to 
                 perform the specified operation for
                 policy reasons.  For example, a name
                 server may not wish to provide the
                 information to the particular requester,
                 or a name server may not wish to perform
                 a particular operation (e.g., zone
                 transfer) for particular data.

I have no idea what policy is being violated. Especially since we can see the TXT record now. But AWS should know or be able to tell you better.

I am not at all an expert in DNS, so maybe the others here will still resolve it. I am just providing more clues.

UPDATE: Some more random ideas:

  1. I see Route53 offers Traffic Policies. Could you have one that would interfere with requests from LE Servers?
  2. I saw a post via google (which I lost) where someone said they got "Refused" error when the name servers listed in their Registered Domain section in AWS were all valid Route53 Name Servers but they were not identical to the ones listed in the Hosted Zone NS record. I would think if this was your problem many things would fail. Still, "Refused" seems rare so ...
2 Likes

Sorry for the late response. It took me a while to contact AWS support and to discuss the issue. Unfortunately, they weren't able to help me. They gave no hint as to why I get a "REFUSED". They pointed me to "Error creating certificate: error: one or more domains had a problem", thinking this might be related, but I can't see how.

1 Like

I'll do that check asap

1 Like

So I just did the test. I got rid of the pre_check_delay and re-executed the script. It turns out that (as far as I can tell) the TXT record is propagated to the Route 53 hosted zone name servers immediately.

ns-191.awsdns-23.com
ns-608.awsdns-12.net
ns-1278.awsdns-31.org
ns-2032.awsdns-62.co.uk

In addition, it seems to me that the pre_check_delay is ignored no matter what I put there, because the Terraform script reliably fails after 3 minutes.

I'm performing a few more tests to validate whether it's always the same NS server reporting the error.

2 Likes

Also, I don't have any traffic policies in place.

1 Like

Just observed another issue. Not sure whether this was just a one time thing.

Also, by retrying several times, I get the error from before for different name servers. So it's not just the one.

1 Like

One (out of four) DNS server being unreachable/down is not enough to break the check.

1 Like

Hi,

I have applied Terraform with TF_LOG=TRACE and passed recursive_nameservers = ["8.8.8.8:53"] to my acme_certificate resource to try to get more data. Here is the part of the output related to the acme_certificate resource creation:

Expand to see the output (it's rather long, which is why I have collapsed it!)
Additionally, we have tried setting AWS_PROPAGATION_TIMEOUT = 600, with the same result.
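
For readers trying the same two settings, here is a hedged sketch of where they would sit in the configuration from the first post. It reuses that post's `acme_registration.registration` resource and `var.pacts-cloud-zone-id` variable, so it is a fragment rather than a complete configuration. The timeout is written as a string on the assumption that the `config` map holds string values.

```hcl
resource "acme_certificate" "certificate" {
  account_key_pem = acme_registration.registration.account_key_pem
  common_name     = "vault.pacts.cloud"

  # Resolver used for the propagation pre-check instead of the system default.
  recursive_nameservers = ["8.8.8.8:53"]

  dns_challenge {
    provider = "route53"

    # AWS_PROPAGATION_TIMEOUT is the number of seconds the Route 53 challenge
    # provider waits for the record to propagate.
    config = {
      AWS_HOSTED_ZONE_ID      = var.pacts-cloud-zone-id
      AWS_PROPAGATION_TIMEOUT = "600"
    }
  }
}
```

Trace logging is switched on through the environment rather than the configuration, for example by running `TF_LOG=TRACE terraform apply`.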

Why do you not have a DNS record for vault.pacts.cloud?

I see your TXT record for it just fine (still), but, how do you plan to access that domain name without a DNS record?

One of my DNS lookup tools refused to show the TXT record because that was missing. Could Terraform be faulting and showing its own odd "refused" message for a similar reason? That may be why we can't reproduce using direct DNS inspection. And maybe why AWS cannot guide either.

Just for my curiosity, what is the Let's Encrypt (LE) cert for vault.pacts.cloud to be used for? I ask because I see that pacts.cloud is managed by AWS CloudFront and you have certs through AWS ACM for that (as normal). Some of your other subdomains also run through CloudFront (but some just EC2). Nothing unusual.

But, I also see an AWS ACM cert for vault.pacts.cloud from several days ago. It would help to understand the context for the LE cert a bit better. Is it for https between CF and your Origin Server, for example? Thanks.

1 Like

Perhaps it is not meant to be online (via the Internet - thus the DNS challenge).

Possible use case:

  • I need a CA validated cert
  • I don't want a wildcard cert
  • I don't want to publish the IP in global DNS
1 Like

Yeah, I understand that. But, we were not making progress so I thought it worth getting more background info. I scoured the webs and did not find many similar cases.

There was a post in 2018 on this forum with the same "refused" error message with Terraform and Route53 but no resolution.

The only other one I saw was for a different product (Concourse, not Terraform) but with the same "refused" message when looking up the TXT record for _acme-challenge. This one used Google name servers. It was apparently resolved by fixing poorly configured DNS.

When stuck, get more info :slight_smile:

3 Likes

Sorry for the late reply everyone.

So first off, on the use case:
With AWS ACM the issue is that Amazon fully manages the certificates for you. So there is no way to get your hands on the certificate's private key yourself to have an EC2 instance access it. That was my use case: I wanted an EC2 auto-scaling group to access the certificate directly.
Now, my (current) workaround is to put a load balancer in between, which can hold the ACM certificate. No need to get the certificate onto the EC2 instances this way.
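
A minimal sketch of that workaround, assuming an existing load balancer and target group whose ARNs are passed in through the hypothetical variables `lb_arn` and `vault_target_group_arn`, and an already issued ACM certificate for the host name. The resource names are illustrative and not taken from the actual setup.

```hcl
# Hypothetical inputs: ARNs of an existing load balancer and target group.
variable "lb_arn" {
  type = string
}

variable "vault_target_group_arn" {
  type = string
}

# Look up the already issued ACM certificate for the host name.
data "aws_acm_certificate" "vault" {
  domain      = "vault.pacts.cloud"
  statuses    = ["ISSUED"]
  most_recent = true
}

# Terminate TLS on the load balancer, so the EC2 instances in the
# auto-scaling group never need the private key.
resource "aws_lb_listener" "vault_https" {
  load_balancer_arn = var.lb_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = data.aws_acm_certificate.vault.arn

  default_action {
    type             = "forward"
    target_group_arn = var.vault_target_group_arn
  }
}
```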

Now to the question of why there is no record for vault.pacts.cloud: because I was struggling with the certificate in the first place. I would set up such a record once I have the rest of the pieces in place. Is that a problem?

My first instinct is to agree this is the best path anyway. Usually you want EC2 spin-ups to be fast and reliable. If you were going to acquire a new cert for each new spin-up, that would be problematic for those purposes and may run into LE rate limits. To avoid that, you would need to acquire certs in a separate dedicated process and store them durably for quick loading on spin-up, and occasionally refresh certs in long-running instances. Maybe this was your plan all along, but the LB does that for you.
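
As one possible illustration of "acquire once, store durably", the certificate from the first post could be written to a secret store that instances read at boot. AWS Secrets Manager is an assumption here (the thread does not name a store), and the secret name is hypothetical. The fragment relies on the `acme_certificate.certificate` resource from the first post.

```hcl
# Assumption: AWS Secrets Manager as the durable store; the name is made up.
resource "aws_secretsmanager_secret" "vault_tls" {
  name = "vault-pacts-cloud-tls"
}

# Store the issued certificate, its issuer chain, and the private key as one
# JSON document that instances can fetch on spin-up.
resource "aws_secretsmanager_secret_version" "vault_tls" {
  secret_id = aws_secretsmanager_secret.vault_tls.id

  secret_string = jsonencode({
    certificate_pem = acme_certificate.certificate.certificate_pem
    issuer_pem      = acme_certificate.certificate.issuer_pem
    private_key_pem = acme_certificate.certificate.private_key_pem
  })
}
```

With this layout the private key also ends up in the Terraform state, so the state backend would need the same protection as the secret itself.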

As to the DNS, I was just poking around. This is an extremely rare error message. Even AWS could not explain why Route53 would do that. Sometimes asking questions reveals helpful info. The TXT record works sometimes, and many of us have retrieved it. But the error persists, so something is likely not set up quite right. It remains a mystery.

2 Likes