Cert renewal issue: wrong IP used for challenges

We are currently experiencing issues with our certificate renewal with certbot. Our infrastructure consists of two gateways (GWs) running nginx as load balancers.
Both GWs have the IPs 1.1.1.1 and 1.1.1.2 as VIPs on their respective NICs.
In DNS, testsite.de points to both 1.1.1.2 and 1.1.1.1.
We use keepalived for failover: GW1 uses GW2 as a backup, and GW2 uses GW1 as a backup.
If GW1 fails, GW2 takes over GW1's IP as well. Hence both GWs have the two IPs configured on their NICs.

If we start a dry run on GW1, we get the following output:

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for ext-services.testsite.de
http-01 challenge for www.testsite.de
http-01 challenge for testsite.de
Using the webroot path OBSCUREDPATH for all unmatched domains.
Waiting for verification...
Challenge failed for domain ext-services.testsite.de
Challenge failed for domain www.testsite.de
Challenge failed for domain testsite.de
http-01 challenge for ext-services.testsite.de
http-01 challenge for www.testsite.de
http-01 challenge for testsite.de
Cleaning up challenges
All challenges have failed.

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: ext-services.testsite.de
   Type:   unauthorized
   Detail: 1.1.1.2: Invalid response from
   http://ext-services.testsite.de/.well-known/acme-challenge/generatedNounce:
   404

   Domain: www.testsite.de
   Type:   unauthorized
   Detail: 1.1.1.2: Invalid response from
   http://www.testsite.de/.well-known/acme-challenge/generatedNounce:
   404

   Domain: testsite.de
   Type:   unauthorized
   Detail: 1.1.1.2: Invalid response from
   http://testsite.de/.well-known/acme-challenge/generatedNounce:
   404

   To fix these errors, please make sure that your domain name was
   entered correctly and the DNS A/AAAA record(s) for that domain
   contain(s) the right IP address.

We think the issue stems from certbot choosing the wrong IP address: 1.1.1.2 is being used, even though 1.1.1.1 should have been used.
By default, 1.1.1.1 should be in master state, as configured in the keepalived config, which was the case at the time of the test. For those challenges, both IPs are provided and appear in the response, but the wrong one is used.

{
  "identifier": {
    "type": "dns",
    "value": "www.testsite.de"
  },
  "status": "invalid",
  "expires": "2022-09-28T09:39:00Z",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:ietf:params:acme:error:unauthorized",
        "detail": "1.1.1.1: Invalid response from http://www.testsite.de/.well-known/acme-challenge/generatedNounce: 404",
        "status": 403
      },
      "url": "https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/3703319624/ddjOvg",
      "token": "generatedToken",
      "validationRecord": [
        {
          "url": "http://www.testsite.de/.well-known/acme-challenge/generatedNounce",
          "hostname": "www.testsite.de",
          "port": "80",
          "addressesResolved": [
            "1.1.1.1",
            "1.1.1.2"
          ],
          "addressUsed": "1.1.1.1"
        }
      ],
      "validated": "2022-09-21T09:39:01Z"
    }
  ]
}

This is part of the output from the letsencrypt.log file on GW2. In this case, 1.1.1.2 should have been used, since requests addressed to 1.1.1.1 are forwarded to GW1.
On GW1, the addressesResolved part looks like this:

          "addressesResolved": [
            "1.1.1.2",
            "1.1.1.1"
          ],

with 1.1.1.2 being used for all the requests, as well as being the value of "addressUsed".
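
To compare the resolved and used addresses across several runs, the authorization objects can be summarized with a few lines of Python. This is only a minimal sketch: it assumes the JSON object has already been copied out of letsencrypt.log into a string, and the function name `summarize_validation` is made up for illustration. The sample is a trimmed copy of the object posted above.

```python
import json


def summarize_validation(authz_json):
    """Return one (hostname, addresses_resolved, address_used) tuple
    per validation record found in an ACME authorization object."""
    authz = json.loads(authz_json)
    rows = []
    for challenge in authz.get("challenges", []):
        for record in challenge.get("validationRecord", []):
            rows.append((
                record.get("hostname"),
                record.get("addressesResolved", []),
                record.get("addressUsed"),
            ))
    return rows


# Trimmed copy of the authorization object from the GW2 log above.
sample = """
{
  "identifier": {"type": "dns", "value": "www.testsite.de"},
  "status": "invalid",
  "challenges": [
    {
      "type": "http-01",
      "status": "invalid",
      "validationRecord": [
        {
          "hostname": "www.testsite.de",
          "port": "80",
          "addressesResolved": ["1.1.1.1", "1.1.1.2"],
          "addressUsed": "1.1.1.1"
        }
      ]
    }
  ]
}
"""

for hostname, resolved, used in summarize_validation(sample):
    print(f"{hostname}: resolved={resolved} used={used}")
```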

If we shut down one of the GWs, the dry run finishes without any errors.
Is there a setting we are missing? Why won't certbot try all the resolved addresses?
I obscured most of the data, but I think it should suffice for the questions asked. If anything is missing or not clear enough, feel free to ask.

We are using certbot 0.40.0

When two IP addresses are provided for a given domain name, a client may use either of them, with no associated priority. So which IP address the ACME validation process tries first is effectively random. If that IP address does not connect, the validation process may try the second one.

That's what I would expect, but at the same time I would expect it to succeed or fail about 50% of the time over X tries. In our case, though, the failure rate on the same gateway was 100%, with the IP of the other gateway always being chosen. Same thing on the second gateway. I would also assume there was no attempt to connect to the "correct" IP first before swapping to the second, since there is nothing about that in the log file.

Your expectation may not hold, since the failure/success statistics depend on the algorithm the client uses to select the IP address. It may simply be a sorted list of IP addresses that is always tried in the same order. You should not assume any particular behavior on the client side when you configure your web server and your DNS.

You're totally right about that. It would just be a weird coincidence, since it's failing on both gateways. If it's a sorted list, at least one of them would succeed. If there were some randomness, at least 1 of X tries would succeed. But yeah, those are all assumptions on my part.

But is there no option/flag to make sure that all provided addresses are tried if one fails? Based on your first reply, this is what happens if the connection fails.

There is no such option, as far as I know.

I merely assumed that behavior as a plausible explanation, based on what you experienced.

Why not use an ACME client that supports multiple frontends working in parallel and answering ACME challenges in a stateless manner?

My ACME client, ght-acme.sh (GitHub: bruncsak/ght-acme.sh, "Get publicly trusted certificate via ACME protocol from LetsEncrypt or from BuyPass"), does this. I am pretty sure some other ACME clients are capable of doing the same.
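
For illustration, here is a rough sketch of why an http-01 challenge can be answered without shared state. Per RFC 8555, the expected response body is the key authorization, i.e. the token from the request URL, a dot, and the base64url-encoded thumbprint of the ACME account key. A frontend that knows only the account thumbprint can therefore answer any token. The function name and the thumbprint value below are placeholders, and this is not taken from ght-acme.sh or any other specific client.

```python
import re

ACME_PREFIX = "/.well-known/acme-challenge/"


def stateless_http01_body(request_path, account_thumbprint):
    """Build the http-01 response body from the request path alone.

    Returns None when the path is not a well-formed challenge URL.
    """
    if not request_path.startswith(ACME_PREFIX):
        return None
    token = request_path[len(ACME_PREFIX):]
    # ACME tokens use only the base64url alphabet.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", token):
        return None
    # Key authorization: token || "." || account key thumbprint.
    return f"{token}.{account_thumbprint}"


# Placeholder thumbprint, for illustration only.
print(stateless_http01_body(ACME_PREFIX + "generatedToken", "EXAMPLE_THUMBPRINT"))
```

With this approach, every gateway returns the same answer for a given token, so it no longer matters which IP address the validation request lands on.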

I will definitely look into your client and other ACME clients, thanks for that information.
I guess I was just hoping for an easy fix, since the solution seemed rather simple at first. Worst case, we could shut down one gateway every 2 months and renew the certificates manually via certbot instead of through the service.timer. I'll leave this thread open for now; maybe there will be some future revelations.

Thanks for taking the time and helping!

Another option is to use the DNS challenge. This can work well in this situation if you have API control of the DNS. If this is an option, note that the acme.sh ACME client supports many more DNS APIs than certbot. If you provide details on your config, we could give more specific advice.
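
As a rough illustration of why the DNS challenge sidesteps the two-gateway problem: with dns-01 nothing is fetched over HTTP. Per RFC 8555, the validation servers look up a TXT record at `_acme-challenge.<domain>` whose value is the base64url-encoded SHA-256 digest of the key authorization. The sketch below only computes that record; the function names and input values are placeholders, and publishing the record through a DNS provider API is left out.

```python
import base64
import hashlib


def dns01_record_name(domain):
    """Name of the TXT record that is queried for a dns-01 challenge."""
    return f"_acme-challenge.{domain}"


def dns01_txt_value(token, account_thumbprint):
    """TXT value: base64url(SHA-256(token || "." || thumbprint)), no padding."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder token and thumbprint, for illustration only.
print(dns01_record_name("testsite.de"))
print(dns01_txt_value("generatedToken", "EXAMPLE_THUMBPRINT"))
```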

Below is a thread that discusses various methods for dealing with multiple servers operating behind a load balancer. It is not exactly your case, but perhaps helpful anyway.

It's also important to realize that the Let's Encrypt servers are the ones making the challenge requests. There will be (currently) four requests from different locations around the globe. The ACME client is just requesting a cert from the LE server; it has no control over the challenge process. This doc page explains this in more detail.

This is an oversimplified and erroneous calculation.
The "math" to the problem is more like:
Flip a coin four times and have it land on the same side (heads or tails) all four times.
Why?
Because LE uses multiple points of validation (four).
And only when they have all passed will you obtain a cert.

So, the "math" is (1/2)^4 = 1/16 ≈ 6% chance of success [far less than 50%]
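
The same arithmetic as a small sketch. It assumes that each validation request picks one of the gateway IPs independently and uniformly at random, which the thread does not confirm for the real resolver; the function name is made up for illustration.

```python
from fractions import Fraction


def p_all_requests_hit_same_gateway(num_gateways, num_requests):
    """Probability that every one of num_requests independent, uniformly
    random picks lands on the one gateway that holds the challenge file."""
    return Fraction(1, num_gateways) ** num_requests


p = p_all_requests_hit_same_gateway(num_gateways=2, num_requests=4)
print(p, float(p))  # prints: 1/16 0.0625
```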

Yeah, that was my bad for phrasing it that way. What I meant by succeeding was the correct IP being chosen, not the whole process succeeding. Every dry run generated 3 of those blocks I posted in the beginning, each with an "addressUsed" field, and it always pointed to the wrong IP.

I will also look into the DNS challenge, since this might be the solution for our case, and keep this thread updated. I will also ask about sharing some config files so this case might become clearer, since it's not my place to simply publish them without enough knowledge about the topic.

Thanks to everyone for taking the time for this!
