429: Too many pending authz but logs show 200 respnse on every POST

Hi,

I have a server with 11 SAN certificates, each SAN has 100 domains (or close to it).

I am receiving this error when issuing any given SAN certificate:

Create new order error. Le_OrderFinalize not found. {
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/",
  "status": 429
}
  • I can run the issue command with --staging flag enabled and I get no errors.
  • I reviewed the letsEncrypt.log for open authz requests, but every time I observe a POST I always see a status 200 returned. Example:
2021-08-20 03:11:24,622:DEBUG:acme.client:Sending POST request to https://acme-v02.api.letsencrypt.org/acme/authz-v3/23422690450:
2021-08-20 03:11:24,722:DEBUG:urllib3.connectionpool:https://acme-v02.api.letsencrypt.org:443 "POST /acme/authz-v3/23422690450 HTTP/1.1" 200 794
  • I reviewed the acme.sh.log for open authz requests, and see similar. Example:
[Fri Aug 20 04:09:27 UTC 2021] POST
[Fri Aug 20 04:09:27 UTC 2021] _post_url='https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/312028258'
[Fri Aug 20 04:09:27 UTC 2021] _CURL='curl -L --silent --dump-header /home/sslService/.acme.sh/http.header  -g '
[Fri Aug 20 04:09:27 UTC 2021] _ret='0'
[Fri Aug 20 04:09:27 UTC 2021] code='200'

I'm not sure where to go from here. Any help or guidance is appreciated. This started in the past two days. Thanks for the great service and tools.

Other information

My domain is:

cert11.cloudpano.com

I ran this command:

acme.sh --issue --log -w [dir] -k 4096 --force -d cert11.cloudpano.com -d -our.homesite360.com -d hs360example.homesite360.com -d tours.4dtour.me -d [...other domains] --debug

It produced this output:

[Fri Aug 20 04:13:44 UTC 2021] Lets find script dir.
[Fri Aug 20 04:13:45 UTC 2021] _SCRIPT_='/home/sslService/.acme.sh/acme.sh'
[Fri Aug 20 04:13:45 UTC 2021] _script='/home/sslService/.acme.sh/acme.sh'
[Fri Aug 20 04:13:45 UTC 2021] _script_home='/home/sslService/.acme.sh'
[Fri Aug 20 04:13:45 UTC 2021] Using config home:/home/sslService/.acme.sh
https://github.com/acmesh-official/acme.sh
v2.8.6
[Fri Aug 20 04:13:45 UTC 2021] Running cmd: issue
[Fri Aug 20 04:13:45 UTC 2021] _main_domain='cert11.cloudpano.com'
[Fri Aug 20 04:13:45 UTC 2021] _alt_domains='tour.homesite360.com,hs360example.homesite360.com,tours.4dtour.me,[etc...]'
[Fri Aug 20 04:13:45 UTC 2021] Using config home:/home/sslService/.acme.sh
[Fri Aug 20 04:13:45 UTC 2021] ACME_DIRECTORY='https://acme-v02.api.letsencrypt.org/directory'
[Fri Aug 20 04:13:45 UTC 2021] DOMAIN_PATH='/home/sslService/.acme.sh/cert11.cloudpano.com'
[Fri Aug 20 04:13:45 UTC 2021] Using ACME_DIRECTORY: https://acme-v02.api.letsencrypt.org/directory
[Fri Aug 20 04:13:45 UTC 2021] _init api for server: https://acme-v02.api.letsencrypt.org/directory
[Fri Aug 20 04:13:45 UTC 2021] GET
[Fri Aug 20 04:13:45 UTC 2021] url='https://acme-v02.api.letsencrypt.org/directory'
[Fri Aug 20 04:13:45 UTC 2021] timeout=
[Fri Aug 20 04:13:45 UTC 2021] _CURL='curl -L --silent --dump-header /home/sslService/.acme.sh/http.header  -g '
[Fri Aug 20 04:13:45 UTC 2021] ret='0'
[Fri Aug 20 04:13:45 UTC 2021] ACME_KEY_CHANGE='https://acme-v02.api.letsencrypt.org/acme/key-change'
[Fri Aug 20 04:13:45 UTC 2021] ACME_NEW_AUTHZ
[Fri Aug 20 04:13:45 UTC 2021] ACME_NEW_ORDER='https://acme-v02.api.letsencrypt.org/acme/new-order'
[Fri Aug 20 04:13:45 UTC 2021] ACME_NEW_ACCOUNT='https://acme-v02.api.letsencrypt.org/acme/new-acct'
[Fri Aug 20 04:13:45 UTC 2021] ACME_REVOKE_CERT='https://acme-v02.api.letsencrypt.org/acme/revoke-cert'
[Fri Aug 20 04:13:45 UTC 2021] ACME_AGREEMENT='https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf'
[Fri Aug 20 04:13:45 UTC 2021] ACME_NEW_NONCE='https://acme-v02.api.letsencrypt.org/acme/new-nonce'
[Fri Aug 20 04:13:45 UTC 2021] ACME_VERSION='2'
[Fri Aug 20 04:13:45 UTC 2021] Le_NextRenewTime='1634530331'
[Fri Aug 20 04:13:45 UTC 2021] _on_before_issue
[Fri Aug 20 04:13:45 UTC 2021] _chk_main_domain='cert11.cloudpano.com'
[Fri Aug 20 04:13:45 UTC 2021] _chk_alt_domains='tour.homesite360.com,hs360example.homesite360.com,tours.4dtour.me,[etc...]'
[Fri Aug 20 04:13:45 UTC 2021] Le_LocalAddress
[Fri Aug 20 04:13:45 UTC 2021] d='cert11.cloudpano.com'
[Fri Aug 20 04:13:45 UTC 2021] Check for domain='cert11.cloudpano.com'
[Fri Aug 20 04:13:45 UTC 2021] _currentRoot='/home/user/big-data-homes/build'
[Fri Aug 20 04:13:45 UTC 2021] d='tour.homesite360.com'
[Fri Aug 20 04:13:45 UTC 2021] Check for domain='tour.homesite360.com'
[Fri Aug 20 04:13:45 UTC 2021] _currentRoot='/home/user/big-data-homes/build'
[Fri Aug 20 04:13:45 UTC 2021] d='hs360example.homesite360.com'
[Fri Aug 20 04:13:45 UTC 2021] Check for domain='hs360example.homesite360.com'
[Fri Aug 20 04:13:45 UTC 2021] _currentRoot='/home/user/big-data-homes/build'
[Fri Aug 20 04:13:45 UTC 2021] d='tours.4dtour.me'
[Fri Aug 20 04:13:45 UTC 2021] Check for domain='tours.4dtour.me'
[etc...]
[Fri Aug 20 04:13:47 UTC 2021] Getting domain auth token for each domain
[Fri Aug 20 04:13:47 UTC 2021] d='tour.homesite360.com'
[Fri Aug 20 04:13:47 UTC 2021] d='hs360example.homesite360.com'
[Fri Aug 20 04:13:47 UTC 2021] d='tours.4dtour.me'
[etc...]
[Fri Aug 20 04:13:48 UTC 2021] d='tours.mozesphotography.com'
[Fri Aug 20 04:13:48 UTC 2021] d
[Fri Aug 20 04:13:48 UTC 2021] url='https://acme-v02.api.letsencrypt.org/acme/new-order'
[Fri Aug 20 04:13:48 UTC 2021] payload='{"identifiers": [{"type":"dns","value":"cert11.cloudpano.com"},{"type":"dns","value":"tour.homesite360.com"},{"type":"dns","value":"hs360example.homesite360.com"},{"type":"dns","value":"tours.4dtour.me"},[etc...],{"type":"dns","value":"tours.mozesphotography.com"}]}'
[Fri Aug 20 04:13:48 UTC 2021] RSA key
[Fri Aug 20 04:13:48 UTC 2021] HEAD
[Fri Aug 20 04:13:48 UTC 2021] _post_url='https://acme-v02.api.letsencrypt.org/acme/new-nonce'
[Fri Aug 20 04:13:48 UTC 2021] _CURL='curl -L --silent --dump-header /home/sslService/.acme.sh/http.header  -g  -I  '
[Fri Aug 20 04:13:49 UTC 2021] _ret='0'
[Fri Aug 20 04:13:49 UTC 2021] POST
[Fri Aug 20 04:13:49 UTC 2021] _post_url='https://acme-v02.api.letsencrypt.org/acme/new-order'
[Fri Aug 20 04:13:49 UTC 2021] _CURL='curl -L --silent --dump-header /home/sslService/.acme.sh/http.header  -g '
[Fri Aug 20 04:13:49 UTC 2021] _ret='0'
[Fri Aug 20 04:13:49 UTC 2021] code='429'
[Fri Aug 20 04:13:49 UTC 2021] Le_LinkOrder
[Fri Aug 20 04:13:49 UTC 2021] Le_OrderFinalize
[Fri Aug 20 04:13:49 UTC 2021] Create new order error. Le_OrderFinalize not found. {
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/",
  "status": 429
}
[Fri Aug 20 04:13:49 UTC 2021] pid
[Fri Aug 20 04:13:49 UTC 2021] No need to restore nginx, skip.
[Fri Aug 20 04:13:49 UTC 2021] _clearupdns
[Fri Aug 20 04:13:49 UTC 2021] dns_entries
[Fri Aug 20 04:13:49 UTC 2021] skip dns.
[Fri Aug 20 04:13:49 UTC 2021] _on_issue_err
[Fri Aug 20 04:13:49 UTC 2021] Please check log file for more details: /home/sslService/.acme.sh/acme.sh.log
[Fri Aug 20 04:13:49 UTC 2021] socat doesn't exists.
[Fri Aug 20 04:13:49 UTC 2021] Diagnosis versions: 
openssl:openssl
OpenSSL 1.1.1  11 Sep 2018
apache:
apache doesn't exists.
nginx:
nginx version: nginx/1.14.0 (Ubuntu)
built with OpenSSL 1.1.1  11 Sep 2018
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fdebug-prefix-map=/build/nginx-H4cN7P/nginx-1.14.0=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -Wdate-time -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -fPIC' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_v2_module --with-http_dav_module --with-http_slice_module --with-threads --with-http_addition_module --with-http_geoip_module=dynamic --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module=dynamic --with-http_sub_module --with-http_xslt_module=dynamic --with-stream=dynamic --with-stream_ssl_module --with-mail=dynamic --with-mail_ssl_module
socat:

My web server is (include version):
nginx/1.14.0 (Ubuntu)

The operating system my web server runs on is (include version):
Ubuntu 18.04.1 LTS

My hosting provider, if applicable, is:
Digital Ocean

I can login to a root shell on my machine (yes or no, or I don't know):
yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):
no

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):
certbot 0.26.1
acme.sh v2.8.6

1 Like

Is it possible that certbot is still being called buy an old cron job?
The logs shown are from acme.sh and presumably the current and active ACME client.
Please show:
tail -n 20 /var/log/letsencrypt/letsencrypt.log

1 Like

The use of the force is hardly ever required outside of Star Wars!

3 Likes

Nice. I think you are onto something here. I see certificates that may be very old and not in use anymore.

user@server:~$ tail -n 40 /var/log/letsencrypt/letsencrypt.log
2021-08-20 03:11:39,553:WARNING:certbot.renewal:Attempting to renew cert (cloudpano.com-0001) from /etc/letsencrypt/renewal/cloudpano.com-0001.conf produced an unexpected error: Failed authorization procedure. cloudpano.com (http-01): urn:ietf:params:acme:error:unauthorized :: The client lacks sufficient authorization :: The key authorization file from the server did not match this challenge "juGMrMZs6RcaZDLwle_lkuahtG5UdI0uzIFclFHtZEs.Y_2-s0LS1zF7v-lZx1yiRUTcrYLVxAU0nghARdpYMpc" != "". Skipping.
2021-08-20 03:11:39,554:DEBUG:certbot.renewal:Traceback was:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/certbot/renewal.py", line 430, in handle_renewal_request
    main.renew_cert(lineage_config, plugins, renewal_candidate)
  File "/usr/lib/python3/dist-packages/certbot/main.py", line 1197, in renew_cert
    renewed_lineage = _get_and_save_cert(le_client, config, lineage=lineage)
  File "/usr/lib/python3/dist-packages/certbot/main.py", line 115, in _get_and_save_cert
    renewal.renew_cert(config, domains, le_client, lineage)
  File "/usr/lib/python3/dist-packages/certbot/renewal.py", line 305, in renew_cert
    new_cert, new_chain, new_key, _ = le_client.obtain_certificate(domains, new_key)
  File "/usr/lib/python3/dist-packages/certbot/client.py", line 334, in obtain_certificate
    orderr = self._get_order_and_authorizations(csr.data, self.config.allow_subset_of_names)
  File "/usr/lib/python3/dist-packages/certbot/client.py", line 370, in _get_order_and_authorizations
    authzr = self.auth_handler.handle_authorizations(orderr, best_effort)
  File "/usr/lib/python3/dist-packages/certbot/auth_handler.py", line 82, in handle_authorizations
    self._respond(aauthzrs, resp, best_effort)
  File "/usr/lib/python3/dist-packages/certbot/auth_handler.py", line 155, in _respond
    self._poll_challenges(aauthzrs, chall_update, best_effort)
  File "/usr/lib/python3/dist-packages/certbot/auth_handler.py", line 226, in _poll_challenges
    raise errors.FailedChallenges(all_failed_achalls)
certbot.errors.FailedChallenges: Failed authorization procedure. cloudpano.com (http-01): urn:ietf:params:acme:error:unauthorized :: The client lacks sufficient authorization :: The key authorization file from the server did not match this challenge "juGMrMZs6RcaZDLwle_lkuahtG5UdI0uzIFclFHtZEs.Y_2-s0LS1zF7v-lZx1yiRUTcrYLVxAU0nghARdpYMpc" != ""

2021-08-20 03:11:39,555:ERROR:certbot.renewal:All renewal attempts failed. The following certs could not be renewed:
2021-08-20 03:11:39,555:ERROR:certbot.renewal:  /etc/letsencrypt/live/www.cloudpano.com/fullchain.pem (failure)
  /etc/letsencrypt/live/app.cloudpano.com-0001/fullchain.pem (failure)
  /etc/letsencrypt/live/app.cloudpano.com/fullchain.pem (failure)
  /etc/letsencrypt/live/cloudpano.com/fullchain.pem (failure)
  /etc/letsencrypt/live/cloudpano.com-0001/fullchain.pem (failure)
2021-08-20 03:11:39,555:DEBUG:certbot.log:Exiting abnormally:
Traceback (most recent call last):
  File "/usr/bin/certbot", line 11, in <module>
    load_entry_point('certbot==0.26.1', 'console_scripts', 'certbot')()
  File "/usr/lib/python3/dist-packages/certbot/main.py", line 1364, in main
    return config.func(config, plugins)
  File "/usr/lib/python3/dist-packages/certbot/main.py", line 1276, in renew
    renewal.handle_renewal_request(config)
  File "/usr/lib/python3/dist-packages/certbot/renewal.py", line 455, in handle_renewal_request
    len(renew_failures), len(parse_failures)))
certbot.errors.Error: 5 renew failure(s), 0 parse failure(s)

I may be a little lost here. I had certbot as my primary client until late 2019 and I switched to acme.sh. Perhaps I did not configure things correctly during the switch. Any advice?

1 Like

Well I found something pretty interesting in a cronfile I didnt know existed until 30 seconds ago...

/etc/cron.d/certbot

0 */12 * * * root test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(43200))' && certbot -q renew
1 Like

Yeah, those certbot entries are from today!:

Remove the certbot cron job.
Uninstall certbot

1 Like

I was able to find thousands of open authz due to stale certificates from previous certbot client.

I confirmed that I can uninstall certbot without harm to acme.sh here: Using acme.sh; should I remove certbot

I then used the LE_FIND_PENDING_AUTHZ.py script that could help me remove the pending authorizations: Too many requests of a given type - how do you reset this?

It seems the script has gone stale (this pending auth 429 problem isn't seen much since 2017) so I made some quick fixes and put it here: GitHub - claytonrothschild/LE_FIND_PENDING_AUTHZ: Use Certbot Logs as a way of finding pending autz</t

It is now rolling through removing the pending authorizations. Feels great!

One final thought: If it wasn't for @rg305's observation of multiple clients in my original post this would not have been solved so easily at all. I wonder if it would be useful in the Help Template to ask if the user has or has had more than one client installed. I just put two clients down by happenstance, and truthfully I thought acme.sh was just a wrapper for certbot. So I got lucky.

Thanks all.

4 Likes

Another testament to there is no such thing as too much information.
I too can only imaging how long it would have taken us to figure this out without that clue.

Glad to hear that all is on the right track now :slight_smile:
Cheers from Miami :beers:

#FreeCUBA

3 Likes

Well, I misread the script output. When it said it found "3000" authorizations in a given log file, it did not imply these were pending, rather these were just authorization requests.

It seems that the LE_FIND_PENDING_AUTHORIZATIONS script is correctly checking all the authz in the log file, but every single one is coming back as valid, invalid, or expired. I have no pending statuses being returned. I am running this script on all three servers.

Any other ideas?

1 Like

Then you still having the "Too many pending authz"?

Did you remove certbot from all three?

1 Like

Correct

[Fri Aug 20 16:51:33 UTC 2021] Create new order error. Le_OrderFinalize not found. {
  "type": "urn:ietf:params:acme:error:rateLimited",
  "detail": "Error creating new order :: too many currently pending authorizations: see https://letsencrypt.org/docs/rate-limits/",
  "status": 429
}

I removed certbot from prod server. Lower environment are a mess and turns out they use certbot for SSL so in that case I just ensured only one LE client existed.

1 Like

FYI, I am very familiar with your problem and it is unfortunately common to the certificate partitioning scheme you identified above.

@rg305 is helping you get out of the current mess, so I am just going to address avoiding it in the future.

One or both of the clients on your system was not cleaning up authorization challenges when a domain validation failed. Not all clients do this, because this feature is not wanted for most LetsEncrypt users.

Most users will procure one or two certificates, each with one or two domains. If a challenge fails, leaving the pending authorizations open is typically desired -- as there is no reason to destroy them, only to immediately recreate them once a local challenge verification problem is fixed. LetsEncrypt associates authorizations primarily to the ACME Account - which is how both pending and verified authorizations are re-used across ACME Orders.

For users that have many certificates with many domains, this creates a cascade of failures that often wedges the account via the pending-authz ratelimit. In your situation, a first certificate attempt will create 100 pending authorizations; if the 10th challenge fails, you'll be left with 90 pending authorizations and the client will move on to the next certificate for renewal. In your situation, the 300 pending authz ratelimit can be hit as early as the 4th acme-order.

The ways to mitigate this are:

  1. Partition fewer domains onto Certificates. In your current scheme, a failed authorization can leave up to 99 pending authorizations. Even only dropping down to 50 domains per certificate will help avoid this by a lot.

  2. Use a client that will cleanup pending authorizations on failed orders, and enable that feature. Not all clients offer this. Last I checked, most do not. I only built this into my client, because it's targeted for infinitely scalable domains.

3 Likes

@jvanasco, we are all here to help, feel free to help whenever/wherever you can.
This is not a car dealership - he isn't my client - we don't get paid for any of this - there is no lost commission - LOL
If you can end his suffering any sooner, then please do so :slight_smile:

And you do bring up very valid points that I hadn't even thought of - since I've never even had to put that many entries into a SAN :wink:

2 Likes

You've got it all covered. Nothing I can add on the recovery concepts, and too many cooks will just get too confusing here.

I happen to know this cascading failure scenario very well, because I've lived through it a few times. One of the (many) reasons I wrote a client backed by SQL, is that I wanted to quickly query and filter pending authorizations to recover from situations like this. Even when using post-failure cleanups, you can miss some auths and still wedge an account.

3 Likes

@jvanasco I think your insights are extremely important, and your description of what can happen with large SAN certs is exactly the problem we experience all the time. We are always adding and removing domains from our SANs depending on our client's whitelabel of our product. The way we identify a domain to remove? We wait for the SAN to fail validation for a given domain :slight_smile: and then we remove the domain and try to issue the SAN cert again. The result, I imagine, is a lot of pending auths.

One interesting note, is that I used clearauth and LE_FIND_PENDING_AUTHZ tools on my servers and never could find the pending authz.

I would have kept going, but during my analysis of "not able to generate new certs", I ended up replacing a prod cert with a staging cert. And then I couldnt replace the staging cert because I was, again, unable to issue new certs. The result was I took 100 prod domains offline :man_facepalming:

Overall it was a cluster. Due to the fire in prod and difficulties finding the pending authz, I ended up registering a new account to quickly re-generate the offline domains. I wish I could have resolved this in a non-brute-force way. Given 1.) this scenario, 2.) @jvanasco insight's on large SANs, and 3.) already knowing the performance problems and high manual intervention involved in large SANs, I finally decided to document the new architecture I have wanted to implement for a year now, and our team will take it into development soon. So the brute force is a bandaid and this problem will eventually be mitigated by the new architecture.

Thanks all. This is my third time posting here over the same number of years and each time I am impressed with the community and quality of discussion and problem-solving. All the best.

3 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.