Intermittent failures during finalize: Unable to meet CA SCT embedding requirements

Hi!

Our nightly regressions test our ACME client; one test creates and then revokes a certificate using LE Staging. The tests had run without errors for many weeks, and our client hasn't changed during that time.

Starting 16 April, we have been seeing intermittent failures during certificate creation while waiting for finalize, specifically this:

data:{
  "status": "invalid",
  "expires": "2023-04-23T19:38:53Z",
  "identifiers": [
    {
      "type": "dns",
      "value": "www.appoptimization.com"
    }
  ],
  "authorizations": [
    "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/5952795884"
  ],
  "finalize": "https://acme-staging-v02.api.letsencrypt.org/acme/finalize/18175989/8297400174",
  "error": {
    "type": "urn:ietf:params:acme:error:serverInternal",
    "detail": "Error finalizing order :: Unable to meet CA SCT embedding requirements",
    "status": 500
  }
} 

Most of the certificate operations executed by our nightly regressions against LE Staging pass; however, there have been 7 of these finalize-order failures since 2023-04-16 19:40:27 UTC. The client polls for the order status to change from "processing" to a completed status; in the failure cases it generally waits about 90 seconds before the 500 error is returned.
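
For reference, the polling logic is roughly the sketch below (simplified Python, not our actual client; fetch_order is just a stand-in for whatever performs the signed POST-as-GET to the order URL):

    import time

    def wait_for_order(fetch_order, order_url, poll_interval=5, timeout=300):
        # fetch_order is a callable that does the client's signed POST-as-GET to
        # order_url and returns the decoded JSON order object.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            order = fetch_order(order_url)
            if order["status"] != "processing":
                # "valid" on success; "invalid" with an embedded error in the failures above
                return order
            time.sleep(poll_interval)
        raise TimeoutError("order still processing after %d seconds" % timeout)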

Any ideas why these intermittent failures have shown in the past two days?

Thanks!

4 Likes

@lestaff: Throwing this one at you. :baseball:

5 Likes

I've identified the issue with CT submission in staging and will resolve this today.

7 Likes

thank you !!

3 Likes

This is fixed now.

The configuration for the "Sapling 2023h2" CT log was incorrect, which resulted in all our SCTs going to other test logs. Two of those other test logs became overloaded, causing slow staging finalizes and intermittent failures.

I've correctly configured Sapling 2023h2, and everything should be better now. This would have started around 2023-04-16, when newly issued 90-day staging certificates began expiring on or after 2023-07-15T00:00Z, the end of the Sapling 2023h1 log shard's temporal interval.
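
For anyone matching this up with the dates above, a quick back-of-the-envelope check in Python (assuming the usual 90-day lifetime of staging certificates) shows why 2023-04-16 was the cutover:

    from datetime import date, timedelta

    issued = date(2023, 4, 16)                # first day the failures were seen
    not_after = issued + timedelta(days=90)   # staging certs are 90-day certs (assumption)
    shard_end = date(2023, 7, 15)             # end of the Sapling 2023h1 temporal interval
    print(not_after, not_after >= shard_end)  # 2023-07-15 True -> needs the 2023h2 shard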

7 Likes

Kinda surprising that even staging certs are capable of overloading other logs.
How many certificates does LE staging sign per day?

3 Likes

I manually ran the regression test that failed as described above; it still fails, but much more quickly now. Once domain validation succeeds and the finalize order starts, the client receives:

data:{
  "status": "processing",
  "expires": "2023-04-24T17:11:44Z",
  "identifiers": [
    {
      "type": "dns",
      "value": "www.appoptimization.com"
    }
  ],
  "authorizations": [
    "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/5952795884"
  ],
  "finalize": "https://acme-staging-v02.api.letsencrypt.org/acme/finalize/18175989/8311141904"
} 

Then a retry 5 seconds later receives:

data:{
  "status": "invalid",
  "expires": "2023-04-24T17:11:44Z",
  "identifiers": [
    {
      "type": "dns",
      "value": "www.appoptimization.com"
    }
  ],
  "authorizations": [
    "https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/5952795884"
  ],
  "finalize": "https://acme-staging-v02.api.letsencrypt.org/acme/finalize/18175989/8311141904",
  "error": {
    "type": "urn:ietf:params:acme:error:serverInternal",
    "detail": "Error finalizing order",
    "status": 500
  }
} 

Any ideas? Thanks!

2 Likes

Yeah, it's still (basically) the same problem; it's just that we're now failing certificate linting after successfully submitting. I missed a spot in my config change and posted here a little too early. It will be a few more minutes.

5 Likes

np, and no rush, appreciate the attention you are giving this!

3 Likes

Sapling is also a staging CT log, so maybe LE has allocated fewer resources to it? Just speculating here though :slight_smile:

3 Likes

I had to restart a service to pick up the new configs; I think it's good now. I will verify.

Staging submits to three "Log Operators":

  1. Sapling, our public CT log. This is the one where we ran off the end of the valid configured shards, which I've now fixed (a quick reachability check is sketched after this list).
  2. Google's Testflume log. This one has been fine.
  3. Some internal-only test logs, based on boulder's ct-test-srv. These are the ones that started getting very slow. I'm not sure why yet.
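
For the curious, the public Sapling shard can be spot-checked with a plain RFC 6962 get-sth request. A minimal Python sketch (just a reachability check, not what Boulder does internally):

    import json, urllib.request

    # RFC 6962 get-sth returns the log's current signed tree head, which is a
    # cheap way to see that the shard is up and answering.
    log_url = "https://sapling.ct.letsencrypt.org/2023h2/"
    with urllib.request.urlopen(log_url + "ct/v1/get-sth", timeout=10) as resp:
        sth = json.load(resp)
    print(sth["tree_size"], sth["timestamp"])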

9 Likes

Three runs of the previously failing regressions, three PASSes now; it looks much better to me... thanks again!

6 Likes

Could you also post the public key of the Sapling 2023h2 log? I couldn't find it on ct-logs.

This is in the format of Google's CT log list JSON files:

        {
          "description": "Let's Encrypt 'Sapling2023h2' log",
          "log_id": "7audHd2Dc5Wf9SqI5Gu0vMPEzE12imDM/042LX+41mg=",
          "key": "MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEbdCykRsTPRgfjKVQvINRLJk3gy+2qNKOU48bo/sWO0ko75S92C+PBDxsqMEd0YpCYYLogCt2LAK/U4H7UwHsjA==",
          "url": "https://sapling.ct.letsencrypt.org/2023h2/",
          "mmd": 86400,
          "temporal_interval": {
            "start_inclusive": "2023-06-15T00:00:00Z",
            "end_exclusive": "2024-01-15T00:00:00Z"
          },
          "state": {
            "usable": {
              "timestamp": "2023-05-01T00:00:00Z"
            }
          }
        },

I think somebody has a ticket to update the website; it just hasn't happened yet.
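
If you want to sanity-check that key against the log_id above: per RFC 6962, the log ID is the SHA-256 of the DER-encoded public key, which is exactly what the base64 "key" field contains. A quick Python check (standard library only):

    import base64, hashlib

    # The "key" field is the base64 DER-encoded SubjectPublicKeyInfo; per RFC 6962
    # the log ID is the SHA-256 hash of that DER blob.
    key_b64 = ("MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEbdCykRsTPRgfjKVQvINRLJk3gy+2"
               "qNKOU48bo/sWO0ko75S92C+PBDxsqMEd0YpCYYLogCt2LAK/U4H7UwHsjA==")
    log_id = base64.b64encode(hashlib.sha256(base64.b64decode(key_b64)).digest()).decode()
    print(log_id)  # should match the log_id above if the two fields are consistent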

4 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.