acme4j: Invalid HTTP challenge token sent by Let's Encrypt validation server

Please fill out the fields below so we can help you better. Note: you must provide your domain name to get help. Domain names for issued certificates are all made public in Certificate Transparency logs (e.g. crt.sh | example.com), so withholding your domain name here does not increase secrecy, but only makes it harder for us to provide help.

My domain is: auth.panw.pro/

I ran this command: GET /.well-known/acme-challenge/AGh83OQynTZe8stC4g6wQ1c7l62FZUZyDO6v1BrZikQ HTTP/1.1

It produced this output: 200 with empty response

My web server is (include version):

The operating system my web server runs on is (include version):

My hosting provider, if applicable, is:

I can login to a root shell on my machine (yes or no, or I don't know):

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): acme4j-client.version 2.11

Hi

We are seeing Let's Encrypt send an invalid HTTP challenge token in the well-known URL. The token in the request differs from the one that was obtained and stored on the web server initially. Can someone help check why requests are being sent for invalid tokens? Also, we are seeing only 3 requests from the Let's Encrypt validation server instead of the expected 5.
Certificate renewals are failing for our customer's domain because the HTTP challenge verification failed. Similar logs were seen for multiple domains.

Please see example logs below for the domain - auth.panw.pro

"GET /.well-known/acme-challenge/AGh83OQynTZe8stC4g6wQ1c7l62FZUZyDO6v1BrZikQ HTTP/1.1" 200 0 "http://auth.panw.pro/.well-known/acme-challenge/AGh83OQynTZe8stC4g6wQ1c7l62FZUZyDO6v1BrZikQ" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" "23.178.112.212" "auth.panw.pro" 

[ logs from the acme server service ] -  HTTP challenge obtained for domain=auth.panw.pro with tokenName=ykkl9FO4kGJkk_IYSW-SkhlazvVcIhWymcKL4e6M5ZA tokenValue=ykkl9FO4kGJkk_IYSW-SkhlazvVcIhWymcKL4e6M5ZA.T2cVsZoui77RnHprB72meoOPHJUF_ED8RKGIJSZ7yuE 
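
For readers comparing the two log lines above: in HTTP-01 the stored value (the key authorization) is the token, a dot, and the account key thumbprint, so the token requested in the URL has to equal the part of the stored value before the dot. The following is a minimal sketch of that comparison using only the JDK. The class and method names are made up for illustration and are not part of acme4j.

```java
public class ChallengeTokenCheck {
    // Path prefix used by HTTP-01 validation requests.
    static final String PREFIX = "/.well-known/acme-challenge/";

    /** Returns the token part of a key authorization ("token.thumbprint"). */
    public static String tokenOf(String keyAuthorization) {
        int dot = keyAuthorization.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not a key authorization: " + keyAuthorization);
        }
        return keyAuthorization.substring(0, dot);
    }

    /** Returns the token requested in the path, or null if it is not a challenge path. */
    public static String requestedToken(String path) {
        return path.startsWith(PREFIX) ? path.substring(PREFIX.length()) : null;
    }

    /** True only if the requested token belongs to the stored key authorization. */
    public static boolean matches(String path, String storedKeyAuthorization) {
        String requested = requestedToken(path);
        return requested != null && requested.equals(tokenOf(storedKeyAuthorization));
    }

    public static void main(String[] args) {
        // Values copied from the log lines in this post.
        String path = "/.well-known/acme-challenge/AGh83OQynTZe8stC4g6wQ1c7l62FZUZyDO6v1BrZikQ";
        String stored = "ykkl9FO4kGJkk_IYSW-SkhlazvVcIhWymcKL4e6M5ZA"
                + ".T2cVsZoui77RnHprB72meoOPHJUF_ED8RKGIJSZ7yuE";
        System.out.println("requested token matches stored challenge: " + matches(path, stored));
    }
}
```

With the values from these logs the check reports a mismatch: the validation server is asking for a challenge other than the one whose response was stored.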

Before getting into those log entries, can you explain how the routing should work with two IP addresses for that domain?

Because normally the DNS for a domain using an HTTP Challenge will have just the public IP of the target server. In this case the two IPs are for AWS Global Accelerator endpoints. Other services like GoDaddy Domain Forwarding use AWS for this and can cause problems.

auth.panw.pro. 0 IN A 3.33.189.110
auth.panw.pro. 0 IN A 15.197.181.212

See: Let's Debug


The domain is owned by the customer, and they configure a CNAME pointing to our (Okta) servers. Below is the configuration for this domain. Since Okta uses AWS, the request is routed to the AWS servers.

dig auth.panw.pro

; <<>> DiG 9.10.6 <<>> auth.panw.pro
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11994
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;auth.panw.pro.			IN	A

;; ANSWER SECTION:
auth.panw.pro.		300	IN	CNAME	dev-431764.customdomains.okta.com.
dev-431764.customdomains.okta.com. 300 IN CNAME	ok11-custom-crtrs.okta.com.
ok11-custom-crtrs.okta.com. 30	IN	CNAME	ok11-custom-crtrs.oktaedge.okta.com.
ok11-custom-crtrs.oktaedge.okta.com. 265 IN CNAME af77c9e516730cc51.awsglobalaccelerator.com.
af77c9e516730cc51.awsglobalaccelerator.com. 300	IN A 15.197.181.212
af77c9e516730cc51.awsglobalaccelerator.com. 300	IN A 3.33.189.110

;; Query time: 51 msec
;; SERVER: 172.24.92.1#53(172.24.92.1)
;; WHEN: Tue Feb 18 16:29:14 PST 2025
;; MSG SIZE  rcvd: 247

Update on this issue:
We noticed that the HTTP challenge succeeded after 7 days for this domain. The issue also resolved itself for the other domains that were failing for the same reason. They were all renewed 7 days after the last renewal failure.
Given that the authorization has an expiry of 7 days, could this issue be related to an incorrect authorization?

We are still seeing the issue for newly created domains.

Below is the sequence of code execution:

// Step 1: create an order and publish the challenge response
order = account.get().newOrder().domains(domain).create();
for (final Authorization auth : order.get().getAuthorizations()) {
    domain = auth.getIdentifier().getDomain();
    Http01Challenge http01Challenge = auth.findChallenge(Http01Challenge.class);
    String tokenName = http01Challenge.getToken();
    String tokenValue = http01Challenge.getAuthorization();
    // add token name and value on the http server
}

// Step 2: create an order again and trigger the challenge
order = account.get().newOrder().domains(domain).create();
for (final Authorization auth : order.get().getAuthorizations()) {
    domain = auth.getIdentifier().getDomain();
    Http01Challenge http01Challenge = auth.findChallenge(Http01Challenge.class);
    http01Challenge.trigger();
}

After the above code executes, Let's Encrypt sends challenge verification requests to the /.well-known/acme-challenge endpoint with an incorrect token name (one that does not match the value obtained from the HTTP challenge above).
(Example token challenge logs are pasted in the initial comment on this thread.)

What is the order and authorization expiry period?
If the previous order or authorization is invalid or pending, will calling auth.findChallenge return a new authorization with a new HTTP challenge?

"What is the order and authorization expiry period?"

The expiry date is on the JSON ACME Order Object.

"If the previous order or authorization is invalid or pending, will calling auth.findChallenge return a new authorization with a new HTTP challenge?"

Boulder, the ACME Server, will recycle the same PENDING AcmeOrder across multiple requests from the same account.

If a Challenge fails, it will cascade upwards: the Authorization will fail and the Order will fail.

Some clients are configured to clean up (deactivate) pending authorizations on an order failure. This was once required due to rate limits, but no longer is.

If you retry a failed order (from the same ACME Account), authorizations will be recycled if possible. Recently validated authorizations will be recognized by the server and automatically associated with the order, so new authorizations will not be required for that fully qualified domain name. Unused (still pending) authorizations will be associated with the new order.


The problem is that you are actually creating two separate domain orders for the same domain:

order = account.get().newOrder().domains(domain).create();

I assume it's because the file creation and the triggering are isolated steps in a workflow.

With the first order you are creating the challenge file for your web server.

With the second order you are triggering the challenge. However, since this is a separate order, you are triggering a different challenge than the one you created the challenge file for. This is why Let's Encrypt validates with a different token name.

I recommend that you store the URL of your first order object:

order = account.get().newOrder().domains(domain).create();
URL orderLocation = order.getLocation();  // this orderLocation needs to be stored
// now write the validation file to your web server

Instead of creating a second order, you can now recreate the original order for triggering:

order = account.get().getLogin().bindOrder(orderLocation);
// now trigger the challenge

Unfortunately Let's Encrypt does not implement getOrders(), so you need to store the order URLs in a local database anyway.
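
The bookkeeping for the stored order URLs could look roughly like the sketch below. It is an illustration only: the in-memory map stands in for the local database, the class and method names are invented, and the order URL in main is a made-up placeholder. In the first step you would pass it the URL from order.getLocation(); in the second step you would hand the stored URL back to bindOrder().

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class OrderUrlStore {
    // Assumption: one open order per domain. A real deployment would
    // persist this mapping in a database so it survives restarts.
    private final Map<String, URI> orderUrlByDomain = new ConcurrentHashMap<>();

    private static String key(String domain) {
        return domain.toLowerCase(Locale.ROOT);
    }

    /** Step 1: remember the order URL right after the order was created. */
    public void remember(String domain, URI orderUrl) {
        orderUrlByDomain.put(key(domain), orderUrl);
    }

    /** Step 2: look up the order that the challenge file was written for. */
    public Optional<URI> lookup(String domain) {
        return Optional.ofNullable(orderUrlByDomain.get(key(domain)));
    }

    /** Drop the entry once the order has completed, failed, or expired. */
    public void forget(String domain) {
        orderUrlByDomain.remove(key(domain));
    }

    public static void main(String[] args) {
        OrderUrlStore store = new OrderUrlStore();
        // Hypothetical order URL, for illustration only.
        store.remember("auth.example.com", URI.create("https://acme.example/order/123"));
        System.out.println(store.lookup("auth.example.com"));
        store.forget("auth.example.com");
        System.out.println(store.lookup("auth.example.com").isPresent());
    }
}
```

The point of the design is that the trigger step never calls newOrder() by itself. It only creates a new order when lookup() finds nothing, and in that case the challenge file has to be written again as well.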


Thanks, Shred. I came across a post mentioning that if there is an existing pending order or authorization for the domain, it will be reused: Pending Order expiration - #2 by JuergenAuer
We have been using this same code for more than a year and the renewals have been successful. We started seeing the issue only recently. Have there been any recent changes in how orders are processed by Let's Encrypt?

Maybe... But even if a pending order can be "recycled" like that, it's not a documented feature.

I would recommend storing the order URL, as I described above. This would be a clean way to implement it. Alternatively, you can store the http01Challenge URL in your first step, and in the second step re-bind the challenge using Login.bindChallenge() for triggering. But from my experience, you need to store the order URL anyway, so you could do it in your first step already.


@jvanasco If a challenge fails, we currently deactivate any pending authorizations for the associated domain before the next renewal attempt for that domain.

If a new order is created, what is the default expiry, i.e. how long is it valid? Is it valid for 7 days? We ask because after 7 days the issue resolves itself with a successful cert renewal.

If there is a pending order, will order = account.get().newOrder().domains(domain).create() create a new order or reassign the previous pending order for that domain?

As you mentioned, "any pending or unused authorizations are assigned to the new order", but it looks like that is not working, as auth.findChallenge(Http01Challenge.class) returns a different value each time.

First, I'll address the point that this reuse is "not a documented feature":

It is documented on Boulder as an implementation detail and is a well-known peculiarity of LetsEncrypt; IIRC it was deployed to guard against buggy clients and anti-patterns that would repeatedly and needlessly create a new order. See: boulder/docs/acme-implementation_details.md at main · letsencrypt/boulder · GitHub

Now getting back to some questions

If a challenge fails, there should be no pending authorizations for that FQDN. An Order will have one Authorization for each FQDN; each Authorization will have one or more potential Challenges (currently 3: DNS-01, HTTP-01, TLS-ALPN-01). A Challenge failure will force the Authorization object to fail, which will force the Order object to fail. The still-pending Authorizations in the order are the ones on which validation has not been attempted, i.e. those for other FQDNs. (These previously required cleanup with LetsEncrypt, but no longer do, though other CAs still rate-limit them.)

I don't know what the current expiry window is offhand. You have to check the AcmeOrder object that is created from newOrder to see the expiry. The RFC does not define these durations, and LetsEncrypt does not publish or guarantee them, as they are subject to change.

LetsEncrypt's Boulder server will attempt to reuse the Authorizations when possible.

I want to highlight the advice from @shred :

What is likely happening here as well is that there may be some race conditions and timeouts on the Order or Authorization objects. You are probably failing some challenges/authorizations and not realizing what broke when. Because validated authorizations are cached against an account for several days, the buggy code is probably iteratively validating all of the required challenges across multiple invocations and orders.

While these behaviors that we're talking about are documented and well-known, they are also all specific to LetsEncrypt's Boulder installation; other CAs do not support this.

Like @shred stated, you should be storing the order object and retrieving it by URL, not relying on this implementation detail. You should be checking the order object to determine whether it is still valid.
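
A rough sketch of what "checking the order object" could mean in practice is below. It is a standalone illustration, not acme4j code: the enum mirrors the order statuses defined in RFC 8555, and the rule for when to resume is my assumption rather than anything Boulder mandates. In a real client the status and expiry would come from the re-bound order object.

```java
import java.time.Instant;

public class OrderReuseCheck {
    /** Order states defined by RFC 8555. */
    public enum OrderStatus { PENDING, READY, PROCESSING, VALID, INVALID }

    /**
     * Assumed policy: resume a stored order only while it is still in
     * progress and has not expired. A null expiry is treated as unknown.
     */
    public static boolean shouldResume(OrderStatus status, Instant expires, Instant now) {
        boolean inProgress = status == OrderStatus.PENDING
                || status == OrderStatus.READY
                || status == OrderStatus.PROCESSING;
        boolean unexpired = expires == null || now.isBefore(expires);
        return inProgress && unexpired;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-02-18T00:00:00Z");
        Instant later = Instant.parse("2025-02-25T00:00:00Z");
        // A pending, unexpired order: keep using it and trigger its challenges.
        System.out.println(shouldResume(OrderStatus.PENDING, later, now));
        // A failed order: discard the stored URL and start over.
        System.out.println(shouldResume(OrderStatus.INVALID, later, now));
    }
}
```

When the check says no, the client should create a fresh order, rewrite the challenge file for that order, and only then trigger.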

I strongly suggest you set up some CI tests that use Pebble (a test CA from LetsEncrypt that makes different design choices), and ensure your solution works on that before testing against a local Boulder or the staging environment. Your current logic will fail on most other CAs, because you are leveraging the implementation details of Boulder as part of your solution.


Technically, Let's Encrypt recently published order lifetime values, since they're dependent on what profile is selected now:

But this is definitely true, that they shouldn't be relied upon. Per the rest of the good advice given here, if you follow the standards you won't have to worry about it.


Thanks for the suggestions. We will look into updating our code so that it does not create a new order but uses an existing one if it exists. That said, the current code is working for some domains, with successful challenge verifications. This issue started recently and is seen for a few domains at random.

Can you share more information on what you mentioned here - "Because validated authorizations are cached against an account for several days, the buggy code is probably iteratively validating all of the required challenges across multiple invocations and orders." ?

Successful validations are cached on the server and will be automatically applied to an order.

The classic example is that you want to get a cert with:

  • a.example.com
  • b.example.com
  • c.example.com

Attempt 1:

  • ACME Server replies :
    • Fulfill AuthZ-1 for a.example.com
    • Fulfill AuthZ-2 for b.example.com
    • Fulfill AuthZ-3 for c.example.com

You complete AuthZ-1 correctly, but fail on AuthZ-2.

Attempt 2:

  • ACME Server replies :
    • Fulfill AuthZ-4 for b.example.com
    • Fulfill AuthZ-5 for c.example.com

No challenge is needed for a.example.com, because the success is cached.
You complete AuthZ-4 correctly, but fail on AuthZ-5 -- some code on your client is failing on every 2nd request

Attempt 3:

  • ACME Server replies :
    • Fulfill AuthZ-6 for c.example.com

You only need to complete a challenge for 1 domain now.
You complete AuthZ-6 correctly, and the certificate issues.
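
The three attempts above can be replayed with a toy model. This is only a sketch under stated assumptions: the class is an invented stand-in for the server-side cache and is not Boulder's implementation, and the "every 2nd validation request fails" client bug is hard-coded to match the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AuthzCacheDemo {
    // Toy server state: names with a cached successful validation.
    private final Set<String> validated = new HashSet<>();
    private int nextAuthzId = 1;

    /** A new order only needs fresh authorizations for names not yet validated. */
    public Map<String, String> newOrder(List<String> names) {
        Map<String, String> pending = new LinkedHashMap<>();
        for (String name : names) {
            if (!validated.contains(name)) {
                pending.put(name, "AuthZ-" + nextAuthzId++);
            }
        }
        return pending;
    }

    public void markValid(String name) {
        validated.add(name);
    }

    /** Replays orders until every name is validated; returns the authz per attempt. */
    public static List<Map<String, String>> simulate(List<String> names) {
        AuthzCacheDemo server = new AuthzCacheDemo();
        List<Map<String, String>> attempts = new ArrayList<>();
        while (!server.validated.containsAll(names)) {
            Map<String, String> pending = server.newOrder(names);
            attempts.add(pending);
            int request = 0;
            for (String name : pending.keySet()) {
                request++;
                if (request == 2) {
                    break; // simulated client bug: the 2nd validation fails, so the order fails
                }
                server.markValid(name);
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> attempts =
                simulate(List.of("a.example.com", "b.example.com", "c.example.com"));
        for (int i = 0; i < attempts.size(); i++) {
            System.out.println("Attempt " + (i + 1) + ": " + attempts.get(i));
        }
    }
}
```

Each attempt in the model gets a shorter list of authorizations with new identifiers, which is why a challenge file written for an earlier attempt no longer matches what the server asks for.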

Because your code does not guarantee the same order is happening on the first and second block, you can run into situations where an error on the first order means the second order will be different. If there are any other systems that process domains, or if you have taskrunners overlapping, you might end up successfully authorizing a domain for one of those orders in another process.

What is possibly happening in your code is that you wrote "Authz-1" in the first block, did not properly catch a failure, and are trying to validate against a different "Authz-2" in the second block without having updated your challenges. Various bugs and implementation peculiarities are probably keeping your system from recognizing the "Authz-2" the server expects, so you need to wait for that authz to expire.

There are a lot of things that can go wrong because of that particular section of code. It is an anti-pattern, and there are likely other anti-patterns deployed as the entire setup appears to be fragile. You should fast-track fixing your client, as that should eliminate the ephemeral errors that you are experiencing.
