Many challenges returning invalid starting yesterday

We run a hosting service that uses LetsEncrypt to create and renew certificates for many customers using our own server that speaks the ACME API.

Starting yesterday, many of our challenges suddenly started failing with an invalid status. The first one was around 2017-12-06T14:21:52Z and we have had over 6000 failures since then. I can share specific domain names if there is an appropriately private place to do so.

We use http-01 challenges. I do not believe anything changed in our infrastructure yesterday morning.

Not all of our creations/renewals are failing. In that same period we have had over 200 successes.

I note that many, but not all, of the failing certificates are for registered domains, and many, but not all, of the successful certificates are for subdomains.

It’s also imaginable that we are hitting rate limits, but would rate limits manifest as the challenge endpoint successfully returning an invalid status?

Do you have any more information besides “invalid”? This doesn’t really give us a whole lot to go on.

Additionally, it may or may not end up mattering (depending on what the actual failure condition is), but your customers’ domain names are already publicly exposed in the Certificate Transparency logs. You’re not keeping anything secret by not posting them here, only negatively impacting the ability of the community to offer meaningful support.

Hi @glasser,

Can you share the full response you’re receiving? I assume the requests were POSTs to specific challenges to initiate validation?

Domain names are published in certificate transparency logs already. It’s much easier if you can share the domains directly in-thread so that more people can help out & we can distribute the support load.

Sure, here’s a handful of the 92 domains that started failing:

choozle.mobi, allegiancecafe.com, ltnecuador.travel, villeneuvedascq.catholique.fr, public.azumuta.com, noted.is, www.tecnoreps.com, inprogress.spedvk.de

I believe these all are new certs, not renewals, though I haven’t checked them all. Our server validates that DNS (as seen by our server) is currently serving our servers before it tries to create the challenge, so we usually don’t get invalid challenges, and certainly not in this volume.

I don’t have the full response right now but I can try to instrument our service to return it. It definitely includes status: “invalid”.

Hi @glasser,

These all seem like normal every-day failures that can be explained by misconfiguration. All seem to be hitting the too many invalid authorizations rate limit now.

This one returned an HTML document instead of the ACME HTTP-01 challenge response. It includes “503 Service Unavailable: No healthy endpoints to handle the request”. It’s also getting rate limited presently for too many invalid authorization attempts.

Same story RE: rate limit for failed authzs. I see validation attempts returning an invalid HTML error page response to the challenge request here too:

"\u003c!DOCTYPE html\u003e\n\u003chtml\u003e\n\u003chead\u003e\n \u003clink rel=“stylesheet” type=“text/css” class=“meteor-css”

This one is: DNS problem: SERVFAIL looking up CAA for ltnecuador.travel

Another invalid response with HTML:

\u003c!DOCTYPE html\u003e\n\u003chtml\u003e\n\u003chead\u003e\n \u003clink rel=“stylesheet” type=“text/css” class=“meteor-css” href="/bf1deb548dce2ed3949ef85e09c"

Same invalid HTML response as above.

Same invalid HTML response as above.

This one is CAA again:

DNS problem: query timed out looking up CAA for tecnoreps.com

HTML document in response to the challenge request again.

Ok. I definitely recommend you start keeping & displaying more detail for these cases. For instance, the last request I mentioned for inprogress.spedvk.de returned a problem that would have explained the failure more clearly. You can see the full problem that was returned by CURLing the authz URL:

curl https://acme-v01.api.letsencrypt.org/acme/challenge/BM-NfyBJUzBOWBCQhTIp8uD3R3oHKTzvfSOrEsfihRQ/2668347028
{
  "type": "http-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:unauthorized",
    "detail": "Invalid response from http://inprogress.spedvk.de/.well-known/acme-challenge/YE1qeyPvvzBvAmXVdlRkiz4MfbMojtnGvL5P81lky5w: \"\u003c!DOCTYPE html\u003e\n\u003chtml\u003e\n\u003chead\u003e\n  \u003clink rel=\"stylesheet\" type=\"text/css\" class=\"__meteor-css__\" href=\"/99ba300cde8454acfde4f2ea2b8\"",
    "status": 403
  },
  "uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/BM-NfyBJUzBOWBCQhTIp8uD3R3oHKTzvfSOrEsfihRQ/2668347028",
  "token": "YE1qeyPvvzBvAmXVdlRkiz4MfbMojtnGvL5P81lky5w",
  "keyAuthorization": "YE1qeyPvvzBvAmXVdlRkiz4MfbMojtnGvL5P81lky5w.QRRvz3cNxWGJObT4gl6G9ZNx-4cXE2eK81kX5lpYzmo",
  "validationRecord": [
    {
      "url": "http://inprogress.spedvk.de/.well-known/acme-challenge/YE1qeyPvvzBvAmXVdlRkiz4MfbMojtnGvL5P81lky5w",
      "hostname": "inprogress.spedvk.de",
      "port": "80",
      "addressesResolved": [
        "81.169.145.72",
        "2a01:238:20a:202:1072::"
      ],
      "addressUsed": "2a01:238:20a:202:1072::",
      "addressesTried": []
    }
  ]
}
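As a rough sketch of keeping that detail on the client side, the Python below parses a challenge object shaped like the one above and pulls out the fields that explain a failure. The helper name `summarize_challenge` and the layout of the summary are assumptions made for illustration; they are not part of any ACME client library, and the sample body is a trimmed-down copy of the response above.

```python
import json


def summarize_challenge(body: str) -> dict:
    """Extract the failure-relevant fields from an ACME challenge JSON body.

    Hypothetical helper: the field names follow the response shown above.
    """
    challenge = json.loads(body)
    error = challenge.get("error") or {}
    records = challenge.get("validationRecord") or []
    return {
        "status": challenge.get("status"),
        "error_type": error.get("type"),
        "error_detail": error.get("detail"),
        "hostnames": [r.get("hostname") for r in records],
        "addresses_used": [r.get("addressUsed") for r in records],
    }


# Trimmed-down copy of the challenge response above (detail text shortened).
sample = json.dumps({
    "type": "http-01",
    "status": "invalid",
    "error": {
        "type": "urn:acme:error:unauthorized",
        "detail": "Invalid response from http://inprogress.spedvk.de/...",
        "status": 403,
    },
    "validationRecord": [
        {
            "hostname": "inprogress.spedvk.de",
            "port": "80",
            "addressesResolved": ["81.169.145.72", "2a01:238:20a:202:1072::"],
            "addressUsed": "2a01:238:20a:202:1072::",
        }
    ],
})

summary = summarize_challenge(sample)
print(summary["status"], summary["error_type"], summary["addresses_used"])
```

Logging a summary like this next to each failed challenge would show both the error type and which address the validation actually connected to.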

The response problem for POSTs to the challenge also indicates the rate limiting failure when it happens:

"Errors":["429 :: rateLimited :: Error creating new authz :: Too many failed authorizations recently."]

Notably that last authz was validated against an IPv6 address - does your system validate that the AAAA record is correct in addition to the A?
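A minimal sketch of a pre-validation check that covers both record types might look like the following. Everything here is an assumption for illustration: the helper names `resolve_all` and `stray_addresses`, and the addresses, which come from the reserved documentation ranges rather than from any real deployment. It also only reflects what the local resolver sees, which may differ from what the validation server sees.

```python
import ipaddress
import socket


def resolve_all(hostname: str) -> set:
    """Return every address (IPv4 and IPv6) the local resolver reports."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def stray_addresses(resolved, ours) -> list:
    """Return the resolved addresses that are not in our serving set.

    Addresses are parsed before comparison so that different spellings
    of the same IPv6 address still match.
    """
    ours_parsed = {ipaddress.ip_address(a) for a in ours}
    return sorted(
        a for a in resolved if ipaddress.ip_address(a) not in ours_parsed
    )


# Hypothetical serving addresses (documentation ranges, not real hosts).
OUR_ADDRESSES = {"192.0.2.10", "2001:db8::10"}

# A domain whose A record points at us but whose AAAA record does not.
resolved = ["192.0.2.10", "2001:db8::bad"]
strays = stray_addresses(resolved, OUR_ADDRESSES)
print(strays)
```

In a real check, `resolved` would come from `resolve_all(domain)`, and the challenge would only be created when `strays` is empty, since a single stray AAAA record is enough for validation to land on the wrong server.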

Hope this helps,

Thanks, this is helpful. I didn’t realize that the challenges would still be queryable at this later date.

IPv6 is an interesting point. I’ll look further into this on our end.

You don’t know of anything that changed in your system around the timestamp I gave above, though? Something definitely changed in our system from “always works” to “mostly fails”…

It isn’t possible that something changed yesterday such that http-01 challenges no longer follow redirects, is it?

For what it’s worth, it appears that these failures have been going on for longer than the period I mentioned, but merely stopped happening for a few days for some reason (perhaps other failures). I believe they are due to previous changes in our own pre-validation logic that we run before talking to LE, which went from being overly strict to overly lax. All on our end. Thanks for the help!

I really should send a PR to https://github.com/hlandau/acme/ to not drop the “error” field in challenges…

acmetool is a fantastic client!

You can get more detailed information by running it with a higher verbosity level:

acmetool --xlog.severity=debug reconcile

@_az We use the API directly, not the command-line tool.

There was a deploy roughly around the same time (You can subscribe to the status page to learn about these automatically). We include a changelog with each update.

As far as I’m aware nothing has changed with the semantics of HTTP-01 redirects since we fixed the error for non-80/443 redirects.

That would make sense! Glad you were able to determine a potential cause.

I also echo @_az’s enthusiasm for that client :slight_smile: I use it for my own personal certificate needs. It sounds like @_az’s suggestion (thanks!) to add the --xlog.severity parameter doesn’t help with your particular integration, so perhaps there is a documentation update worth requesting for the library instructions that would make it clearer that this sort of information should be collected & retained?

Good luck!
