Malformed request message, challenge is not pending?

I manage a system of dozens of thousands of domains, with new ones arriving daily. We have a script that runs twice daily, attempting to generate a SAN cert for all new domains. We organize our domains as 100 per SAN.

This has been working mostly flawlessly for over a year. Intermittently over the last year, we get this error and validation fails:
The request message was malformed :: Unable to update challenge :: The challenge is not pending.
The solution every time has been to re-run the script and it always works the second time. However, it has now been failing for almost 24 hours straight with this message, and 12 attempts in a row have failed. It began Apr 23rd (yesterday).

This is a big issue for us. When cert generating fails, we immediately have customers waiting on us to cert them, and furthermore we renew certs 30 days before expiration, so we have a 30 day window (now 29) to solve this before we start expiring certs by the thousands :fearful:

My domain is:
I have dozens of thousands, grouped into 100 hostnames per SAN. This doesn’t seem to be specific to any one domain.

I ran this command:
letsencrypt certonly --webroot --staging --csr /path/to/my/san119.csr -w /var/www/html -d -d some.other.domain

In reality there will be dozens of domains in that command rather than the 2 I’m showing.

It produced this output:
The request message was malformed :: Unable to update challenge :: The challenge is not pending.

My web server is (include version):
Apache 2

The operating system my web server runs on is (include version):
Ubuntu 16.04.2 LTS

My hosting provider, if applicable, is:

I can login to a root shell on my machine:

I’m using a control panel to manage my site: No

I think this is everything relevant from the log file:

2018-04-24 16:01:31,146:DEBUG:requests.packages.urllib3.connectionpool:"POST /acme/challenge/bths5KrnlcrI2hrOWrNelpNUAJ87d3ROrM5Bg3p7k48/121848775 HTTP/1.1" 202 339
2018-04-24 16:01:31,148:DEBUG:root:Received <Response [202]>. Headers: {'Content-Length': '339', 'Expires': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Server': 'nginx', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Connection': 'keep-alive', 'Link': '<>;rel="up"', 'Location': '', 'Pragma': 'no-cache', 'Boulder-Requester': '1576757', 'Date': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Content-Type': 'application/json', 'Replay-Nonce': '6D_rJ5-Uwy6zNUA8zzdJnuO2FJDLmIW8fEVlasulphc'}. Content: '{\n  "type": "http-01",\n  "status": "pending",\n  "uri": "",\n  "token": "BmWwVBkVtrHrTq49WKHvt2aViKUvBDey0hO29uHNHeo",\n  "keyAuthorization": "BmWwVBkVtrHrTq49WKHvt2aViKUvBDey0hO29uHNHeo.L4FkFHqV7wwNCI6NWKvVp-n_AICpg-BjFowc6T3JXh8"\n}'
2018-04-24 16:01:31,148:DEBUG:acme.client:Storing nonce: "\xe8?\xeb'\x9f\x94\xc3.\xb35@<\xcf7I\x9e\xe3\xb6\x14\x90\xcb\x98\x85\xbc|Eej\xcb\xa5\xa6\x17"
2018-04-24 16:01:31,148:DEBUG:acme.client:Received response <Response [202]> (headers: {'Content-Length': '339', 'Expires': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Server': 'nginx', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Connection': 'keep-alive', 'Link': '<>;rel="up"', 'Location': '', 'Pragma': 'no-cache', 'Boulder-Requester': '1576757', 'Date': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Content-Type': 'application/json', 'Replay-Nonce': '6D_rJ5-Uwy6zNUA8zzdJnuO2FJDLmIW8fEVlasulphc'}): '{\n  "type": "http-01",\n  "status": "pending",\n  "uri": "",\n  "token": "BmWwVBkVtrHrTq49WKHvt2aViKUvBDey0hO29uHNHeo",\n  "keyAuthorization": "BmWwVBkVtrHrTq49WKHvt2aViKUvBDey0hO29uHNHeo.L4FkFHqV7wwNCI6NWKvVp-n_AICpg-BjFowc6T3JXh8"\n}'
2018-04-24 16:01:31,149:DEBUG:acme.client:Serialized JSON: {"keyAuthorization": "hwjQNt7CvpneFU4H_JbA7J02COFd532N8gjiyvZJEBs.L4FkFHqV7wwNCI6NWKvVp-n_AICpg-BjFowc6T3JXh8", "type": "http-01", "resource": "challenge"}
2018-04-24 16:01:31,150:DEBUG:acme.jose.json_util:Omitted empty fields: x5c=(), x5tS256=None, cty=None, jku=None, x5u=None, x5t=None, crit=(), kid=None, alg=None, jwk=None, typ=None
2018-04-24 16:01:31,151:DEBUG:acme.jose.json_util:Omitted empty fields: jku=None, x5tS256=None, cty=None, x5c=(), x5u=None, x5t=None, crit=(), nonce=None, kid=None, typ=None
2018-04-24 16:01:31,151:DEBUG:root:Sending POST request to args: (), kwargs: {'data': '{"header": {"alg": "RS256", "jwk": {"e": "AQAB", "kty": "RSA", "n": "qk8W97GVbO0FkwsnpkVzdEBuJcZvAAxxENe_KfrvX9uGu0VuoNsx2lRUp03-rHrvvulFgO8qoOKyxVWl5ldjGr4uzWq9zGqjVKjPga-W9lRGB1oKPGhi9VMY1H7fKh0OsrPe50erDQXk82rviCrPdFwnZdiOsee7VXgFJBQ8QzBxal0bVavDcNk4NugtDghmyxUSdBQ0Q5yLsMBtCHc2D6J6mRYMmSPMrlKnNiZpzfm2YbcTOXDVzmzOFPTH9VXTkFUksgWnq2ryNP59ioR_YQdGqbhYW-P3_viKY32sKkYIpPn_PJpYQkEm-G0DgNGHUQokjq-6EfKFagwr46pJ2Q"}}, "protected": "eyJub25jZSI6ICI2RF9ySjUtVXd5NnpOVUE4enpkSm51TzJGSkRMbUlXOGZFVmxhc3VscGhjIn0", "payload": "eyJrZXlBdXRob3JpemF0aW9uIjogImh3alFOdDdDdnBuZUZVNEhfSmJBN0owMkNPRmQ1MzJOOGdqaXl2WkpFQnMuTDRGa0ZIcVY3d3dOQ0k2TldLdlZwLW5fQUlDcGctQmpGb3djNlQzSlhoOCIsICJ0eXBlIjogImh0dHAtMDEiLCAicmVzb3VyY2UiOiAiY2hhbGxlbmdlIn0", "signature": "pBKxiReKmLOayS-xF6Ec2vJHa4ar1Jfqlf3d4Ijpif_d5CGGkBK3AbcJdDmjzHxbZlv47psYIwAnEu5t-FBXvqp-MIrqtWfnaUsPclMCRUSyWl97PU3VhHndPKDgsYVFKoTgsa3lDL494BC5Thwq6bRBP9BuJREj7ZYsga_q8_aveo-S1cj5ZG0Q9shzQdy6IpV5JC-j9p3HFdUrA78vOavQ13kMMJw0yfcBjkWVTC05hiwzvsdnMVpvd04Yp2C566KsHoQCK6BEfWxyFifQssSvZLqGZ42ZhvTIzDJwVCJ4rxpVz-rXc1FnWOwjhiY6Fi9CdmZ45GbJ_1hlsWukbg"}'}
2018-04-24 16:01:31,152:INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1):
2018-04-24 16:01:31,306:DEBUG:requests.packages.urllib3.connectionpool:"POST /acme/challenge/uD7SZFyeNoE6z5DptWwLfDB9QAcbJZU9AamsfBqJ5Q8/121848710 HTTP/1.1" 400 132
2018-04-24 16:01:31,308:DEBUG:root:Received <Response [400]>. Headers: {'Content-Length': '132', 'Expires': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Server': 'nginx', 'Connection': 'close', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Boulder-Requester': '1576757', 'Date': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Content-Type': 'application/problem+json', 'Replay-Nonce': 'l1MNeGwnP-jvUW8A8xq-VjJo-7SHMmu77BMjSj2zt7k'}. Content: '{\n  "type": "urn:acme:error:malformed",\n  "detail": "Unable to update challenge :: The challenge is not pending.",\n  "status": 400\n}'
2018-04-24 16:01:31,308:DEBUG:acme.client:Storing nonce: "\x97S\rxl'?\xe8\xefQo\x00\xf3\x1a\xbeV2h\xfb\xb4\x872k\xbb\xec\x13#J=\xb3\xb7\xb9"
2018-04-24 16:01:31,308:DEBUG:acme.client:Received response <Response [400]> (headers: {'Content-Length': '132', 'Expires': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Server': 'nginx', 'Connection': 'close', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Boulder-Requester': '1576757', 'Date': 'Tue, 24 Apr 2018 16:01:31 GMT', 'Content-Type': 'application/problem+json', 'Replay-Nonce': 'l1MNeGwnP-jvUW8A8xq-VjJo-7SHMmu77BMjSj2zt7k'}): '{\n  "type": "urn:acme:error:malformed",\n  "detail": "Unable to update challenge :: The challenge is not pending.",\n  "status": 400\n}'
2018-04-24 16:01:31,311:DEBUG:letsencrypt.cli:Exiting abnormally:

I wonder if a recent update to the staging server API no longer supports my older letsencrypt client?

Note: We must run all of our commands with the --staging flag first, and only if they succeed on staging do we then try the same command without the --staging flag. At our volume, failures happen enough that if we don’t catch them in staging, then it will use up our “newauthz” limit in production, shutting us down for a week at a time. So, we want to make sure all those failures are captured in staging instead.

@jsha, could you please take a look at this?

Some updated information:
Our process attempted the same command 9 times with the same result, as shown above.

However I noticed that on the 10th time, and all after that, I additionally got an error based on CAA

Failed authorization procedure. (http-01): urn:acme:error:caa :: CAA record for prevents issuance, (http-01): urn:acme:error:unauthorized :: The client lacks sufficient authorization :: Invalid response from

To be clear, I started getting this alongside the request message was malformed message.

Some pre-existing logic of ours notices these errors that begin with Failed authorization procedure and parses the hostname from the error and then re-tries without this problematic hostname. That filtration process means that we failed and failed until eventually running without any hostnames that have a CAA record issue. Amazingly, once that was solved, it also succeeded without the request message was malformed error.

However, I’m not at all confident that the original request message was malformed error was caused by these CAA record issues; the initial 9 failures had NO error messages about CAA record issues, and failed ONLY with the request message was malformed message.

Either this request message was malformed message was NOT caused by the CAA record issues, or else it was but sometimes fails to output an error message about CAA record issue.

Still thinking through what my next steps might be. At least for now we got a large bundle of our daily hostnames certed and the house is no longer burning down.

We haven't changed the API in ways that would break such an old client. However, if you are relying on Let's Encrypt to issue certificates for tens of thousands of domains, you should certainly be keeping your client up-to-date. In other words, if you're using letsencrypt, you should make the time to upgrade to the latest certbot.

Before we dig further into debugging, I'd like to pose the question @bytecamp asked back in October: Why are you bundling so many certificates together? HTTPS virtual hosting (through the ServerNameIndicator extension) is very widely supported these days, which means you can issue one certificate per hostname. This is generally a preferable approach, since it means problems with one customer's domain don't affect your other customers.

Another question: do you run multiple copies of letsencrypt in parallel?

You should make the time to upgrade to the latest certbot.

We're aware and want to. It's in our backlog. You know how it is, there's always so much high priority stuff competing for dev time. Everything seems like #1 top priority until you start comparing all the #1 top priorities to each other.

Why are you bundling so many certificates together?

We are using SNI, however our hosting service providing SSL termination sets a limit on total number of certs. We'll surpass that limit if not bundling 100 per cert this way. I believe eventually we'll re-negotiate that limit :stuck_out_tongue:

BTW thanks for your time and thoughts, as usual, guys. I'm surprised you remembered me from October haha

Good plan. :slight_smile: We've seen 100-SAN certs be a regular source of problems like this, and are encouraging more people to move towards single-name certs. Hosting provider limits on numbers of certificates are a common theme. If you're willing to PM me the name of your hosting provider, it is useful for us to know which hosting providers have these sorts of limits. We may be able to work with them to increase them for all customers.

Looking at this specific error, it looks like you probably had two letsencrypt processes running at the same time for the same certificate. I see two POSTs to /acme/challenge/uD7SZFyeNoE6z5DptWwLfDB9QAcbJZU9AamsfBqJ5Q8/121848710, at these times:

04/24/2018 16:01:31.259
04/24/2018 16:01:19.076

Boulder will reuse an existing pending authorization if you request the same name multiple times in a row. What seems to have happened here is that one process requested an authorization, and started the validation process (which eventually produced a connection timeout). While the validation process was still running, another instance of letsencrypt came along and tried to issue for the same name. It received the same authorization object, because that authorization was still pending, but by the time letsencrypt attempted to POST to that challenge, it had become "invalid" because the validation attempt had finally timed out. So you get this error. My recommendation is to only run one copy of letsencrypt at a time.

I don’t believe we’re running multiple instances of letsencrypt at once, at least I have no evidence that that would be true.

However, we DO do this:

  • Attempt to auth a SAN with 100 hostnames
  • Notice that letsencrypt failed, and the error tells us which hostname was the problem
  • Attempt again without that problem hostname, now with 99 hostnames

This is also very often when we see the request message was malformed message. Perhaps an auth process is still pending somewhere in the LE system when I fire up my sequential attempts after parsing out a failed hostname. I think I’ll implement some kind of pause/timeout between these attempts and see if that clears it up

Today we got it worse than usual

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.