"Unable to deactivate account :: only valid registrations can be deactivated" since 2024-12-18

Hello, long time no see.

I have a nightly CI job that tests a plugin of mine against the staging server.

It is as follows:

sudo certbot --staging register -n --no-eff-email \
  --agree-tos --email certbot-dns-multi-nightly@zorin.au
sudo certbot certonly --staging \
  -d "*.zorin.au" -a dns-multi \
  --dns-multi-credentials=/etc/letsencrypt/dns-multi.ini
sudo certbot --staging show_account -vvv
sudo certbot --staging unregister -n -vvv

During the unregister step, I get the following error:

{
  "type": "urn:ietf:params:acme:error:malformed",
  "detail": "Unable to deactivate account :: only valid registrations can be deactivated",
  "status": 400
}

I hadn't seen this error before, and it has happened on 100% of runs since it first occurred around 2024-12-18 04:31:37 UTC (I added the account email address after that, sorry).

In the show_account step, I get:

JWS payload:
b'{\n  "onlyReturnExisting": true,\n  "contact": []\n}'
Sending POST request to https://acme-staging-v02.api.letsencrypt.org/acme/new-acct:
{
  "protected": "***",
  "signature": "j_GFoWUDflb99-kBowQSjLxUI5-ZPGjdEpeEkNlnSc_bDVzzi4NfDkDKL4Gu0fc2MjEq7VE17nE7S2NZsFOSORI_62_bie5vMIkPCSnCjAGhVozbjFNn9yztcoTkBvhbcsQGAwbNjEuhSdRGa5xFgdetx6u6mc3epmN3W1v46JRTD2g69GNhFJroR6GW3bmCtpTxG5OewLbdOViIehp_gAmfTr0HOt2LrnInzcNLRdXz_Jk0w1IsmRYcn3t1f2I5uG7GZ-5BWtDa2-KmfVcRVfcEsr49rwhKX6A2BUfGisyK5nFETWsSvANnkK2wV5bV6lFk1CHxa0cR7SmH_fXiqQ",
  "payload": "ewogICJvbmx5UmV0dXJuRXhpc3RpbmciOiB0cnVlLAogICJjb250YWN0IjogW10KfQ"
}
https://acme-staging-v02.api.letsencrypt.org:443 "POST /acme/new-acct HTTP/1.1" 200 535
Received response:
HTTP 200
Server: nginx
Date: Wed, 18 Dec 2024 07:32:59 GMT
Content-Type: application/json
Content-Length: 535
Connection: keep-alive
Boulder-Requester: 176584664
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-staging-v02.api.letsencrypt.org/directory>;rel="index"
Location: https://acme-staging-v02.api.letsencrypt.org/acme/acct/176584664
Replay-Nonce: XCuYn79es3B1HR38m_Qg2dZyzmmIFZik8yq8gu9PKnMj_URmM88
X-Frame-Options: DENY
Strict-Transport-Security: max-age=604800

{
  "key": {
    "kty": "RSA",
    "n": "pHwmQh7xug9DWHLWNdaXQEVo0KKlRFVvFSJS9Ka4TwpEZdSzeRsIBISn4uD1nKrvDwAyDoVbpy04h3vWSza3nu8Ych8u6fFxya2n20KXkUeCHl_u-YtKopAx_N5h-hJgxeXd3kNupe5eUg5a5k5ap8g52NJ9jj3X0Bzq-XB-fDTV6FU3KVuVZSuYC79yo_dMfzjNSQ7Yb3qTxXD3f60_NNpIkXVEw_vPegBhsTS7Vksb5kAvAlOCmG_DH78OcL2_DOX_6ifzNxof8ULdbVEOE9dC1M2tmvsGIxX1PXfOZ84ITvUT0EqhdHbuXhfsnVKrCHA2cHafOBInIBnoMEc8Ew",
    "e": "AQAB"
  },
  "contact": [
    "mailto:certbot-dns-multi-nightly@zorin.au"
  ],
  "createdAt": "2024-12-18T07:31:52Z",
  "status": "valid"
}

So the account status reported by the server is "valid", but the subsequent deregistration step fails:

JWS payload:
b'{\n  "status": "deactivated"\n}'
Sending POST request to https://acme-staging-v02.api.letsencrypt.org/acme/acct/176584664:
{
  "protected": "***",
  "signature": "TT9TvxSTdaMIIw202kniqv09sQnp9r-VMZh43kQ9pHV7izNWY3kgmXP_tp5SdQ-L8bIfYH3_ZKsxzOk4TV5n6fFFTay98F_jlPkmYEJCgldXdLvhPLaOgJpPluvl70j-YpLGICW9kpslt2NorQ5S2sPhCVq7fvoQM-SVMnaTInvGQzQmGt_1roedtUyxe9qYl-RHux7t-ULnWHNeENV8kVS_RAeYUcpDNWlNvwatsLEUNvYK-SQN-JoKFbfTMPuJUjlD8ZUXmZalrCL4bdVR87rGIXkbP7YQKugeaYX7zLPJ3ViDmGKEfyXO5I9zIy_lmiXXy5RT7MDPctCpxNo44g",
  "payload": "ewogICJzdGF0dXMiOiAiZGVhY3RpdmF0ZWQiCn0"
}
https://acme-staging-v02.api.letsencrypt.org:443 "POST /acme/acct/176584664 HTTP/1.1" 400 160
Received response:
HTTP 400
Server: nginx
Date: Wed, 18 Dec 2024 07:33:00 GMT
Content-Type: application/problem+json
Content-Length: 160
Connection: keep-alive
Boulder-Requester: 176584664
Cache-Control: public, max-age=0, no-cache
Link: <https://acme-staging-v02.api.letsencrypt.org/directory>;rel="index"
Replay-Nonce: J6d90a-tWaq4ts2NmVA-2Ko0abSAxtU8UXJLUsqR0b8l4sGk_B4

{
  "type": "urn:ietf:params:acme:error:malformed",
  "detail": "Unable to deactivate account :: only valid registrations can be deactivated",
  "status": 400
}

It looks like I'm not getting a consistent read-after-write: show_account reports the account as "valid", yet the deactivation request sent immediately afterwards is rejected as if it weren't.

Any tips? I can always update the job to skip deregistration, but I don't want to clutter the staging server with a bunch of valid accounts. I'm not too keen on keeping long-term account credentials in the CI job either.
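
In case a minimal reproducer is useful: here's an untested sketch of the same register-then-deactivate flow using golang.org/x/crypto/acme instead of certbot. Per RFC 8555 §7.3.6, deactivation is just a POST of {"status": "deactivated"} to the account URL, which matches the request in the logs above.

package main

import (
	"context"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"log"

	"golang.org/x/crypto/acme"
)

func main() {
	// Throwaway account key, so nothing has to be stored after the run.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	client := &acme.Client{
		Key:          key,
		DirectoryURL: "https://acme-staging-v02.api.letsencrypt.org/directory",
	}
	ctx := context.Background()

	// Register a fresh account...
	if _, err := client.Register(ctx, &acme.Account{}, acme.AcceptTOS); err != nil {
		log.Fatalf("register: %v", err)
	}

	// ...and immediately deactivate it. This POSTs {"status": "deactivated"}
	// to the account URL; if the server-side issue is what it looks like,
	// this should currently fail with the same 400 as above.
	if err := client.DeactivateReg(ctx); err != nil {
		log.Fatalf("unregister: %v", err)
	}
}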

10 Likes

If I "Github blame" wfe.go where the error is produced, I see a recent PR wfe: Use separate UpdateRegistrationContact & UpdateRegistrationKey methods by jprenken · Pull Request #7827 · letsencrypt/boulder · GitHub which has modified some code, see Blaming boulder/wfe2/wfe.go at 11d543bd9878f0624b3929a691ae6aa4838d451b · letsencrypt/boulder · GitHub. Staging currently runs 11d543bd, and that PR is included.

Maybe the culprit?
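
To be concrete about the kind of regression I'm imagining, here's a purely hypothetical sketch (not Boulder's actual code, and all the names are made up) of how splitting a combined account-update handler can break a status guard:

package main

import (
	"errors"
	"fmt"
)

// Hypothetical bug pattern: the deactivation path compares the
// client-supplied status (always "deactivated" for this request)
// against "valid", instead of checking the account's stored status,
// so every deactivation fails with exactly this error.

type account struct{ Status string }

type updateRequest struct{ Status string }

func deactivateAccount(acct *account, req updateRequest) error {
	// The guard should inspect acct.Status; req.Status can never be "valid" here.
	if req.Status != "valid" {
		return errors.New("only valid registrations can be deactivated")
	}
	acct.Status = "deactivated"
	return nil
}

func main() {
	acct := &account{Status: "valid"} // the server-side record is fine
	err := deactivateAccount(acct, updateRequest{Status: "deactivated"})
	fmt.Println(err) // only valid registrations can be deactivated
}

No idea whether that's the actual mechanism, but it would match the symptoms.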

8 Likes

Staging was updated to 11d543bd around 2024-12-17 23:00Z, which is a bit earlier than when _az reported the first failure. I also had a look at CI, and there's a successful run from earlier on 2024-12-17, before the update, so the timeline fits quite well with this update causing the error.

8 Likes

Where did you get this info? :slight_smile: There's nothing on the LE status.io page about staging updates.

5 Likes

My own monitoring :smile:

https://germancoding.com/boulder-version-history/

9 Likes

Noice 

6 Likes

Good eye, and thanks for the heads up! I suspect the diagnosis here is exactly correct; we'll investigate, fix it, and make sure this doesn't go to prod.

12 Likes

The staging boulder has been rolled back downhill.

14 Likes

Just how long have you been waiting to use that pun? Or is it standard terminology amongst the Boulder devs? :slight_smile:

9 Likes

The magic flowed out of my fingers into the reply as I was typing. Like Michael Scott says, "Sometimes I'll start a sentence, and I don't even know where it's going. I just hope I find it along the way. Like an improv conversation. An improversation."

11 Likes

Speaking of... yes, and... :wink:

5 Likes

@Nummer378 Is it a deliberate choice to keep just a single entry per commit, even across rollbacks? I see you use "First seen" and "Last seen", but to me it would make more sense to add a new entry for a rollback, listing that commit a second time with its own fresh "From" and "To" timestamps.

Maybe even add an exclamation mark with a "rollback" tooltip, so it's clear it was rolled back :slight_smile:

3 Likes

I don't recall exactly, but I think it was easier this way to cover different scenarios. I do recall a time when LE did some type of A/B testing* where clients would frequently see the reported version flip-flop between two different Boulder builds. Describing every reappearance of an old version as a "rollback" would quickly overload the table, so I went with just a single entry per version. The drawback is, as you've noticed, that multiple rollbacks are very difficult to reconstruct from the data set.

*It may also have been different DCs running slightly different build IDs; I don't know for sure.

5 Likes

Hmm, but how often would that be the case?

Also, maybe some "grace period" might help there? Just thinking out loud :slight_smile:

That said, I suspect there would be a third commit after a rollback, so it should be possible to reconstruct where the rollback happened.

I.e., if consecutive commits were A -> B -> C, and B was rolled back to A with the reason for the rollback fixed in C, then B's "First seen" and "Last seen" times would fall between A's "First seen" and "Last seen": A's timestamps would bracket B's. Moreover, C's "First seen" would correspond to A's new "Last seen", while B's "First seen" would correspond to A's earlier, now-overwritten "Last seen". So it is possible to reconstruct the events.
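
A quick sketch of that reconstruction (the struct fields are made up; the real table only exposes a commit plus its two timestamps): treat each entry as an interval, and any commit whose window sits strictly inside another commit's window must have been rolled back:

package main

import (
	"fmt"
	"time"
)

type entry struct {
	Commit    string
	FirstSeen time.Time
	LastSeen  time.Time
}

// rollbacks reports every commit whose observation window is strictly
// contained in another commit's window, which can only happen if the
// outer commit was seen again after the inner one was withdrawn.
func rollbacks(history []entry) []string {
	var found []string
	for _, inner := range history {
		for _, outer := range history {
			if outer.Commit != inner.Commit &&
				outer.FirstSeen.Before(inner.FirstSeen) &&
				outer.LastSeen.After(inner.LastSeen) {
				found = append(found,
					fmt.Sprintf("%s was rolled back to %s", inner.Commit, outer.Commit))
			}
		}
	}
	return found
}

func main() {
	ts := func(s string) time.Time {
		t, _ := time.Parse(time.RFC3339, s)
		return t
	}
	history := []entry{
		// A's window brackets B's, because A was seen again after the rollback.
		{Commit: "A", FirstSeen: ts("2024-12-10T00:00:00Z"), LastSeen: ts("2024-12-18T22:00:00Z")},
		{Commit: "B", FirstSeen: ts("2024-12-17T23:00:00Z"), LastSeen: ts("2024-12-18T20:00:00Z")},
	}
	for _, r := range rollbacks(history) {
		fmt.Println(r) // prints: B was rolled back to A
	}
}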

4 Likes

We "Canary" a new release for a few moments and verify it's working properly before doing a full deploy, and the deploy itself is rolling, so there is often periods where two releases are running. Even the public endpoint is just the "WFE" component, not the full stack.

The deploy process for each DC is also independent, so there may be a period of time where two different releases are running in prod (we only have one staging environment).

Here's what it looks like in terms of running processes (though old processes that are still shutting down don't show up here, so it's not 100% accurate):

You can see the big vertical spike as the canaries come up, a (mostly) flat line as we wait out the canary period, and then a big yellow drop as we start swapping out the old release.

5 Likes

Looks like that's all done within the hour, perhaps even on a scale of minutes if the next value on the X-axis is "23:20".

2 Likes

Yes, those are times at the bottom, so that staging deploy started at 23:13 and finished at 23:18 UTC. So about 5 minutes. Double that in prod, assuming we do the DCs sequentially, plus any time we might wait in between.

5 Likes

Sounds like an hour as a "grace period" for @Nummer378's history application would be more than enough (probably overkill) to prevent erroneous entries in that history.

Assuming no A/B testing is going to happen any time soon without prior warning.


On another note (and more on topic): it seems @aarongable fixed this bug in Fix error when deactivating an account by aarongable · Pull Request #7899 · letsencrypt/boulder · GitHub. :partying_face:

4 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.