Unable to issue ECDSA+RSA in ACMEv2 staging environment

We’ve been experiencing a weird behaviour in the Let’s Encrypt ACMEv2 staging environment. The behaviour described below isn’t reproducible in production right now.

We use our own solution based on the official ACME library for python3 (https://pypi.org/project/acme/). This has been reproduced with acme 0.31.0 and 0.32.0 using the dns-01 challenge.

Whenever a new certificate is configured, we request the ECDSA + RSA versions. On some occasions the Let’s Encrypt ACMEv2 staging environment marks the authz for the second version as invalid:

  1. ECDSA certificate gets issued
  2. RSA certificate fails to validate the dns-01 challenges.

We’ve also observed that both certificates get issued after waiting some minutes:

  1. ECDSA certificate gets issued
  2. RSA certificate fails to validate the dns-01 challenges.
  3. Wait 6 minutes
  4. ECDSA certificate gets issued
  5. RSA certificate gets issued

If we request the RSA certificate first, it succeeds and then the ECDSA one fails instead.
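
For reference, our flow per key type boils down to something like the sketch below. This is a condensed illustration of the acme library calls involved rather than our actual acme-chief code; publish_txt_record() is a hypothetical stand-in for the gdnsd sync subprocess that appears in the log at the end of this post, and account_key is the account's josepy JWK.

    from acme import challenges

    def issue_certificate(client, account_key, csr_pem, publish_txt_record):
        """Run a single dns-01 order end to end; called once for the
        ec-prime256v1 CSR and once for the rsa-2048 CSR of the same names."""
        order = client.new_order(csr_pem)
        for authz in order.authorizations:
            # pick the dns-01 challenge out of each authorization
            challb = next(c for c in authz.body.challenges
                          if isinstance(c.chall, challenges.DNS01))
            domain = authz.body.identifier.value
            # publish_txt_record() is a hypothetical helper standing in for our gdnsd sync
            publish_txt_record('_acme-challenge.' + domain,
                               challb.chall.validation(account_key))
            # signal the ACME server that the challenge can be validated
            client.answer_challenge(challb, challb.chall.response(account_key))
        # wait for the order to become valid and download the certificate chain
        return client.poll_and_finalize(order).fullchain_pem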

As an example of a failed order, with the full logs from our application at the end of this post: https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/30282144. The certificate that we’re trying to get issued is pretty simple, CN: tendril.wikimedia.org.

As can be seen in the application log at the end of the post, the ECDSA (ec-prime256v1) certificate gets issued successfully, but validation of the dns-01 challenges for the rsa-2048 one fails. Please take into account that our code verifies that the solved dns-01 challenges have been published on our DNS servers before proceeding. But for some reason the solved challenge that we get on the first attempt for the rsa-2048 certificate, mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE, doesn’t match the one expected by the LE staging environment, JtwCcdhuHEgC-lQ2R-rVIFkEgKvVpbWcyIrzFNpLF3E, obtained from https://acme-staging-v02.api.letsencrypt.org/acme/authz/hSHz4rypr5VBdA0uq8dDgerFrQ0fO9iT-3HKr2IfW1s
Also take into account that the challenge solving is handled by the official python3 ACME library and not by our custom integration.

Apr 10 09:33:30: Handling new certificate event for tendril / ec-prime256v1
Apr 10 09:33:30: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.tendril.wikimedia.org', 'eZEj0891mrN6bdKf6Qg3ocaPdurvno6u0aKa3ZZLyws']
Apr 10 09:33:33: Handling pushed CSR event for tendril / ec-prime256v1
Apr 10 09:33:33: Handling validated challenges event for tendril / ec-prime256v1
Apr 10 09:33:33: Handling pushed challenges event for tendril / ec-prime256v1
Apr 10 09:33:34: Handling order finalized event for tendril / ec-prime256v1
Apr 10 09:33:36: Pushing the new certificate for tendril
Apr 10 09:33:36: Waiting till tendril / rsa-2048 is generated to be able to push the new certificate
Apr 10 09:33:36: Handling new certificate event for tendril / rsa-2048
Apr 10 09:33:36: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.tendril.wikimedia.org', 'mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE']
Apr 10 09:33:38: Handling pushed CSR event for tendril / rsa-2048
Apr 10 09:33:38: Handling validated challenges event for tendril / rsa-2048
Apr 10 09:33:38: Handling pushed challenges event for tendril / rsa-2048
Apr 10 09:33:38: ACME Directory has rejected the challenge(s) for certificate tendril / rsa-2048
Apr 10 09:33:38: ACME directory has rejected the challenge(s) for order https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/30282144
--- OUTPUT OMITTED. Another attempt happens. On this occasion both ECDSA+RSA certs are issued successfully ---
Apr 10 09:39:30: Handling new certificate event for tendril / ec-prime256v1
Apr 10 09:39:31: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.tendril.wikimedia.org', 'hCDCpO1qpnPTWp-y6MJ5bo_BGzbuyjY0vXImO-VFDnU']
Apr 10 09:39:33: Handling pushed CSR event for tendril / ec-prime256v1
Apr 10 09:39:33: Handling validated challenges event for tendril / ec-prime256v1
Apr 10 09:39:33: Handling pushed challenges event for tendril / ec-prime256v1
Apr 10 09:39:35: Handling order finalized event for tendril / ec-prime256v1
Apr 10 09:39:36: Pushing the new certificate for tendril
Apr 10 09:39:36: Waiting till tendril / rsa-2048 is generated to be able to push the new certificate
Apr 10 09:39:36: Handling new certificate event for tendril / rsa-2048
Apr 10 09:39:37: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.tendril.wikimedia.org', 'COvB8oOd_FM8sVmyLXmKBIzd0HdQo0e-ZQZ2PFN5jDY']
Apr 10 09:39:39: Handling pushed CSR event for tendril / rsa-2048
Apr 10 09:39:39: Handling validated challenges event for tendril / rsa-2048
Apr 10 09:39:39: Handling pushed challenges event for tendril / rsa-2048
Apr 10 09:39:43: Handling order finalized event for tendril / rsa-2048
Apr 10 09:39:44: Pushing the new certificate for tendril

Hi @jvgutierrez

I have to run to a meeting in a moment and can’t investigate this problem fully, but based on the description I suspect your issuance process depends on valid authorizations being reused.

In production, if your account authorizes example.com with a DNS-01 challenge, that valid authorization will be reused for a subsequent order for example.com (within 30 days). E.g. two back-to-back orders, one challenge performed.

In staging we have valid authorization reuse disabled right now (I need to follow up on the context and whether we intended to revert that change and forgot). In staging the second order would have a pending authorization for example.com and a second challenge needs to be performed.

Understanding why this breaks your integration will require more digging. Perhaps your DNS provider doesn’t handle having two TXT records under the same label well and the 2nd stomps the first?

Authorization reuse is a Let’s Encrypt specific optimization so I think there’s value in figuring out how to make sure your process works even when it is disabled. If you ever needed to switch to a different RFC 8555 ACME server you could encounter this again.
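
In practice that means only performing challenges for authorizations that are still pending and treating already-valid (reused) ones as done. With the python acme library you’re using, that’s roughly the following (a sketch only, adapt to your integration):

    from acme import challenges, messages

    def pending_dns01_challenges(order):
        """Yield (domain, challenge body) only for authorizations that still need
        validating; valid (reused) authorizations need no new TXT record."""
        for authz in order.authorizations:
            if authz.body.status == messages.STATUS_VALID:
                continue  # reused authorization: nothing to publish or answer
            for challb in authz.body.challenges:
                if isinstance(challb.chall, challenges.DNS01):
                    yield authz.body.identifier.value, challb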

I hope the above gives you some foothold to debug from. I’ll try to circle back to this thread later to see if digging into the logs on our side will help shake out any other details.

Thanks for the super-fast response @cpu

I’ve indeed noticed that the staging environment currently has valid authorization reuse disabled.
This shouldn’t be an issue though: as you can see in our application log, we push two challenges to the authoritative DNS servers, and for some reason one sometimes fails while the other validates successfully. What puzzles me is the discrepancy between the dns-01 value that we send to the DNS servers and the one LE expects:
Apr 10 09:33:36: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.tendril.wikimedia.org', 'mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE'].

So we are adding _acme-challenge.tendril.wikimedia.org TXT mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE, but for some reason LE seems to expect JtwCcdhuHEgC-lQ2R-rVIFkEgKvVpbWcyIrzFNpLF3E, if I’m interpreting https://acme-staging-v02.api.letsencrypt.org/acme/authz/hSHz4rypr5VBdA0uq8dDgerFrQ0fO9iT-3HKr2IfW1s correctly.

Also take into account that our integration verifies that the TXT records have been added successfully to our authoritative DNS servers by performing DNS queries before signaling to the ACME v2 directory that the dns-01 challenges have been fulfilled.

Please consider that on the second attempt both ECDSA+RSA certs were issued successfully, and that at that point our DNS server was holding 4 TXT records for _acme-challenge.tendril.wikimedia.org added at the following timestamps:

  1. 09:33:33: eZEj0891mrN6bdKf6Qg3ocaPdurvno6u0aKa3ZZLyws --> ACCEPTED (ECDSA)
  2. 09:33:36: mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE --> REJECTED (RSA)
  3. 09:39:31: hCDCpO1qpnPTWp-y6MJ5bo_BGzbuyjY0vXImO-VFDnU --> ACCEPTED (ECDSA)
  4. 09:39:37: COvB8oOd_FM8sVmyLXmKBIzd0HdQo0e-ZQZ2PFN5jDY --> ACCEPTED (RSA)

gdnsd (our authoritative DNS server software) cleans up the TXT records used in dns-01 validations after 10 minutes.


The “token” and the TXT record aren’t supposed to be identical – the TXT record is a hash of the token plus your account key.

https://tools.ietf.org/html/rfc8555#section-8.4
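
Concretely, per that section the TXT value is the base64url-encoded SHA-256 digest of the key authorization, which is the token joined with your account key’s JWK thumbprint. A rough illustration with josepy (a sketch, not the library’s own code):

    import hashlib
    import josepy as jose

    def dns01_txt_value(token_b64, account_jwk):
        """Expected _acme-challenge TXT value per RFC 8555 section 8.4."""
        # key authorization = token || '.' || base64url(SHA-256 JWK thumbprint)
        thumbprint = jose.b64encode(account_jwk.thumbprint()).decode()
        key_authorization = token_b64 + '.' + thumbprint
        # TXT record value = base64url(SHA-256(key authorization))
        return jose.b64encode(hashlib.sha256(key_authorization.encode()).digest()).decode()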


You’re right, and we are publishing the proper TXT record:

>>> import josepy as jose
>>> import acme.challenges
>>> challenge = acme.challenges.DNS01(token=jose.b64decode('JtwCcdhuHEgC-lQ2R-rVIFkEgKvVpbWcyIrzFNpLF3E'))
>>> challenge.validation(account.jkey)
'mngAKhYePDExCl80HhVcB97bRt64YoRWPq3O4vp4LiE'

that matches the TXT record that got rejected at 09:33:36


Since yesterday I’ve been working on a new release of our integration software, because it had a small limitation regarding dns-01 challenge validation: it only checked the presence of the proper TXT records on one of our authoritative DNS servers. Now it checks their presence on all of them before signaling to the LE staging environment that the challenges have been successfully fulfilled.
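
The new check boils down to something like this (a simplified sketch using dnspython rather than our actual code; the nameservers here need to be the IP addresses of the authoritative servers, and retries/timeouts are omitted):

    import dns.resolver

    def txt_published_on_all_servers(name, expected_value, nameserver_ips):
        """Return True only when every authoritative server already serves the
        expected _acme-challenge TXT value."""
        for ip in nameserver_ips:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ip]
            try:
                answers = resolver.resolve(name, 'TXT')
            except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
                return False
            values = {b''.join(rdata.strings).decode() for rdata in answers}
            if expected_value not in values:
                return False
        return True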

We still see the same behaviour. Check for instance https://acme-staging-v02.api.letsencrypt.org/acme/order/7090084/30390738: here we have 3 names (SANs); two are successfully validated via dns-01 and one fails.

I’ve also discovered that adding an artificial 90-second wait solves the issue. Could the staging environment be experiencing some sort of dns-01 challenge caching issue?

Hi @jvgutierrez,

Thanks for the detailed follow-ups. I think I understand what's happening here and can explain.

Both our staging and production environments use recursive resolvers with a max cache TTL of 60s.

What's happening here is a combination of:

  1. quickly back-to-back issuing two identical certificates (modulo the different public key/alg)
  2. authorization reuse being disabled
  3. the cache max TTL.

Because the certificate subjects are the same between the two orders, the DNS-01 challenge lookups will be for the same DNS records. That means that if the second issuance happens within 60s, we'll be checking our cache and not your authoritative server, and will see the wrong keyauth.

So end-to-end it looks something like this:

When authorization re-use is enabled everything works fine:

  • the first order is created, and unique dns-01 challenge tokens provisioned
  • the key authzs get provisioned into the DNS zone. Self-checks see them.
  • the challenges are initiated, we do TXT lookups and get the correct key authzs
  • the order is fully authorized and a certificate is issued based on the CSR
  • a new order is created, with the same names
  • the valid DNS-01 challenges from the first order are reused, no new tokens/challenges.
  • no challenges are initiated. No TXT lookups are performed.
  • the order is fully authorized and a certificate is issued based on the CSR.

When authorization re-use is disabled the combination of identical names & the max TTL break the second order:

  • the first order is created, and unique dns-01 challenge tokens provisioned
  • the key authzs get provisioned into the DNS zone. Self-checks see them.
  • the challenges are initiated, we do TXT lookups and get the correct key authzs
  • the order is fully authorized and a certificate is issued based on the CSR
  • a new order is created, with the same names
  • with no reuse fresh pending authorizations for the names are created and fresh dns-01 challenge tokens are provisioned.
  • the challenges are initiated, and TXT lookups performed. Because the identifiers match between order 1 and 2 the lookups will be for the same DNS records as the initial order. If this happens within <60s (the max cache TTL on our end), the first DNS-01 key authorizations are seen, not the new ones.
  • The order fails to be authorized.

Not ideal! :-/

You could "solve" the problem by adding an artificial sleep longer than our max TTL and it should work fairly reliably.
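
For example, something as blunt as this would do it (a hypothetical sketch; 60s is our current max cache TTL, and I'd add a little margin on top):

    import time

    RESOLVER_MAX_TTL = 60  # Let's Encrypt's current max recursive-resolver cache TTL
    MARGIN = 5             # a little slack on top (hypothetical choice)

    def wait_out_resolver_cache(first_order_validated_at):
        """Before answering the second order's dns-01 challenges, sleep until the
        cached TXT answers from the first order's validation must have expired.
        first_order_validated_at is a time.monotonic() timestamp."""
        remaining = RESOLVER_MAX_TTL + MARGIN - (time.monotonic() - first_order_validated_at)
        if remaining > 0:
            time.sleep(remaining)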

A better idea might be to check if you can explicitly set a very low TTL on the TXT records your ACME client provisions in the zone. We should respect the TTL you send if it's lower than 60s. We allow a min TTL of 0s and I think setting your TXT records with that TTL will solve the problem as well.

I was out of the office yesterday but today I'll follow up about the authorization reuse in the staging environment and what our plan is.


This had slipped through the cracks and wasn't left disabled for any particular reason. I opened a ticket to restore staging's authorization reuse configuration to match production.

I'm going off-topic, but is that a change? I thought the maximum TTL was even lower, but I never knew the exact number, so I might have just assumed incorrectly.

Not as far as I know. I believe it's been 60s for the past ~3yrs.


A reason I can imagine is that people developing a new client want authorizations to expire quickly.

This has now been done.

@jvgutierrez did you have any luck investigating lowering the record TTL? Can I consider this case-closed?

Lowering the TTL below 60 seconds for ACME challenges requires some work in gdnsd, so it isn’t a trivial change. I’ll let you know as soon as we’ve conducted the tests. Thanks @cpu


Sounds good. Thanks @jvgutierrez!

A tangentially-related experience report from me:

I've been improving the "dry-run" function in our client by implementing authz deactivation at the conclusion of the issuance process. This fixes the first problem: the recently re-activated authz reuse reporting false-positive successes on dry-runs.
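
In case it's useful to anyone doing the same with the python acme library mentioned earlier in this thread, the deactivation step looks roughly like this (a sketch only):

    from acme import messages

    def deactivate_order_authzs(client, order):
        """After a dry-run, deactivate the order's valid authorizations so the next
        order gets fresh pending authorizations instead of reusing these."""
        for authz in order.authorizations:
            if authz.body.status == messages.STATUS_VALID:
                client.deactivate_authorization(authz)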

Now we have to deal with a scenario where you basically get alternating dry-run success/failure due to the 60s TTL on Unbound. We inadvertently had this hardcoded to 360s (oops), but reducing it to 1s (to satisfy a platform constraint of >0) seems to have done it. Duplicate orders in rapid succession have fresh authzs and never fail :partying_face: .

Thanks for the hint about TTL @cpu, that's really helped.

A lot of DNS hosts have weird restrictions on their TTLs though, so this same problem is probably going to be a giant pain in the ass for other, more generic ACME clients :frowning: .


Glad to hear it! Thanks for reporting back with your findings @_az.

When I started chasing this down I talked to @jsha about the 60s max TTL we're using. We're both open to the idea of trying to lower that max TTL, maybe down to 1s or 0s if possible. It will take some time to make sure it won't have too significant an impact on the resources of our Unbound instances, but I think it will be a relatively low-cost change we could make on our side that would help resolve this more generally. I'll try and pick that discussion up again in earnest next week when more of our staff are around to chat with.

