POST to new-order URL fails with HTTP 500 "error retrieving account"

My domain is: *.arnavion.dev (wildcard cert via dns-01)

I ran this command: N/A. Custom ACME client.

The current situation is that when my client posts to the new-order URL ( https://acme-v02.api.letsencrypt.org/acme/new-order as returned by the get-directory response), the server responds with HTTP 500 and the following body:

{
    "type": "urn:ietf:params:acme:error:serverInternal",
    "detail": "Error retrieving account \"https://acme-v02.api.letsencrypt.org/acme/acct/<redacted>\"",
    "status": 500
}
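
For readers following along, here is a minimal sketch of how a client might decide whether a response like the one above is worth retrying. It assumes the body is a JSON problem document of the shape shown; the function name `is_retryable` and the rule "retry on serverInternal or any 5xx status" are illustrative assumptions, not taken from the client discussed in this thread. The sample body uses a shortened `detail` string.

```python
import json

# Error type reported in the response body above.
SERVER_INTERNAL = "urn:ietf:params:acme:error:serverInternal"


def is_retryable(http_status: int, body: str) -> bool:
    """Return True if the error looks like a server-side failure (hypothetical rule)."""
    try:
        problem = json.loads(body)
    except json.JSONDecodeError:
        # Body is not JSON at all; fall back to the HTTP status alone.
        return http_status >= 500
    if not isinstance(problem, dict):
        return http_status >= 500
    return problem.get("type") == SERVER_INTERNAL or http_status >= 500


sample_body = """{
    "type": "urn:ietf:params:acme:error:serverInternal",
    "detail": "Error retrieving account",
    "status": 500
}"""
print(is_retryable(500, sample_body))
```

A rule like this only says that a retry is reasonable; it does not say how long to wait between attempts.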

The full history is:

2022-06-01 00:00:11 - Client initiates cert renewal -> POST to new-order URL fails with the above response

2022-06-02 00:00:02 - Client initiates cert renewal -> gets a pending order (presumably created behind the scenes on 2022-06-01) -> client proceeds with dns-01 challenge -> client waits for challenge to complete
2022-06-02 00:00:18 - Polling the challenge object fails with HTTP 500 and response body:

{
    "type": "urn:ietf:params:acme:error:serverInternal",
    "detail": "Error retrieving account \"https://acme-v02.api.letsencrypt.org/acme/acct/<redacted>\"",
    "status": 500
}

2022-06-03 00:00:04 - Client initiates cert renewal -> gets the same pending order -> attempts dns-01 challenge again -> client waits for challenge to complete -> polling fails with the same HTTP 500 as the one on 2022-06-02

2022-06-04 00:00:05 - Client initiates cert renewal -> gets the first HTTP 500 from POSTing to new-order URL, as above.

Given it has failed multiple times so far, I assume this isn't a temporary outage, so I'd appreciate LE folks checking on their end.

Note again that this is a custom client, not certbot etc, but given that it has been working unmodified for over a year already and given the error message, I assume this isn't a client issue.

Edit: Also to be clear, the client always validates that the account is correct before placing the order (as it's supposed to), so that isn't the problem either. Specifically https://acme-v02.api.letsencrypt.org/acme/acct/<redacted> is reported to be in the "valid" state.


My web server is (include version): N/A

The operating system my web server runs on is (include version): N/A

My hosting provider, if applicable, is: N/A

I can login to a root shell on my machine (yes or no, or I don't know): Yes.

I'm using a control panel to manage my site (no, or provide the name and version of the control panel): N/A

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot): N/A

Hm, weird error. It shouldn't happen if I read the code correctly.

Could you perhaps share the relevant code of your ACME client?

And we could ask @lestaff if they see an increased amount of internal server errors with "Error calling SA.GetRegistration" in their logs and/or an increased "JWSKeyIDLookupFailed" counter.

If you are not married to that account, you could delete it and try getting a new one.

Get the current account, and ensure it's in "valid" state in the process: https://github.com/Arnavion/acme-azure-function/blob/6d06d779252e47751f3957979727e1f94ab5f7d5/acme/src/lib.rs#L25

Create a new "pending" order / get the existing one: https://github.com/Arnavion/acme-azure-function/blob/6d06d779252e47751f3957979727e1f94ab5f7d5/acme/src/lib.rs#L150

Report the challenge as completed and wait for the order to become "ready": https://github.com/Arnavion/acme-azure-function/blob/6d06d779252e47751f3957979727e1f94ab5f7d5/acme/src/lib.rs#L301
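
As a rough illustration of the last step, here is a sketch of polling with backoff on server errors, written in Python rather than the Rust of the linked client. Everything in it is an assumption for illustration: `fetch` is a hypothetical callable standing in for the request that returns `(http_status, parsed_body)`, and the attempt count and delays are arbitrary. It is not how the linked code is written.

```python
import time
from typing import Callable, Tuple


def poll_until_ready(
    fetch: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll an order until it is "ready" (or "valid"), retrying on 5xx responses."""
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status < 500:
            if status >= 400:
                # Client-side errors are not retried in this sketch.
                raise RuntimeError(f"request rejected: {body.get('detail')}")
            if body.get("status") in ("ready", "valid"):
                return body
            if body.get("status") == "invalid":
                raise RuntimeError("order became invalid")
        # Either a 5xx response or a still-pending order: wait, then try again.
        if attempt + 1 < max_attempts:
            sleep(delay)
            delay *= 2
    raise TimeoutError("order did not become ready in time")


# Usage with canned responses instead of real network calls.
responses = iter([
    (500, {"type": "urn:ietf:params:acme:error:serverInternal"}),
    (200, {"status": "pending"}),
    (200, {"status": "ready"}),
])
waits = []
result = poll_until_ready(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)
```

Passing `sleep` as a parameter keeps the example testable without real delays.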

Yes, I did think of that. But the existing cert is still valid for another month, so I figured I'd keep the repro for now so that LE can investigate.

You could perhaps back up that account and try a new one.
If that one also fails, I think we would all be a lot more worried about this.
If it does work, however, you could always restore the other account and continue investigating.

I hope that even this one case of a data integrity problem in their backend is reason enough for LE to worry about this. If I need to have multiple accounts failing before LE investigates their backend issues, then I'm the one who is worried that I need to find another CA.

To be clear, the purpose of me posting here is not to find workarounds. I already know I can work around by using a different account key (and hoping that the new account isn't affected by the same problem) or by switching to a different CA. The purpose of me posting here is to point out to LE that something is wrong on their end. I've done my part of the investigation by providing the requests I made and the responses I got. Now I'm waiting for LE to do their part.

Incidentally, when the client retried on 2022-06-05 and 2022-06-06, the situation had gotten "worse". Now the initial request to get the account itself fails with:

{
    "type": "urn:ietf:params:acme:error:badPublicKey",
    "detail": "rpc error: code = Unknown desc = failed to select one blockedKeys: invalid connection",
    "status": 400
}

If you can find one that will do this much for less than FREE, then have at it - LOL

You need to realize that 9 times out of 10, these types of problems have nothing to do with LE.
If they stop what they are doing every time someone "thinks" the sky is falling...
So... it does help to add some more "proof" that the sky is actually falling before making such claims.

We all want to get to the bottom of this.
But I think you are the only one having this issue.

The Community Guidelines are being followed well, I see.

And you need to realize that an HTTP 500 response from a server cannot be the fault of the client. A server response indicating that the server is timing out while talking to its own database cannot be the fault of the client either. The fact that the client has been working perfectly for over a year and is compliant with the RFC is the cherry on top: it's likely not the fault of the client.

Maybe you're not familiar with the tech involved, and that's fine. But please understand that you can't group all failures into one "failure" bucket and treat them equally.

I already assumed that from the beginning, given that I'm the only one who's posted about it. But what does that have to do with anything?

Sorry, I don't speak Rust. And looking at the code, meine liebe, I don't wanna learn to speak Rust :stuck_out_tongue: I'll stick with easy languages such as Python. :slight_smile:

Not much.
I'm just saying that the unpaid volunteers here are not likely to be of much help to you.
And if you want LE to expedite things... My only suggestion was for you to add more fuel to your fire.
But you don't have to take any of my advice as gospel.

Well, I've already tagged LE staff, so we'd just have to wait I guess to see if they're willing to dive into the logs.

The client's attempt on 2022-06-07 (~2h ago) succeeded fine.

I'll leave this thread open in case anyone from LE does any investigation for the previous days.

Hello, and thank you for pointing this out @Arnavion. I am seeing some bursts of errors in the CA, which do indicate that something might not be right. They look like they originate from internal timeouts, so this may be some kind of database load or query issue.

They appear to be happening at the same time, but not limited to a single account or user agent.

A short update:

We appear to be having load issues right at midnight UTC. Presumably many people have jobs that run at that time. We’ve known about this already, but it has been getting worse recently. We are investigating and looking at how we can improve.

If at all possible, please don’t run right at midnight UTC. That will likely solve your problem.
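
One possible way to follow this advice is to derive a fixed per-account offset, so the job runs at a stable time of day away from 00:00 UTC. The following is only a sketch: the function name `daily_offset_seconds`, the 15-minute exclusion window around midnight, and the idea of hashing the account URL are all assumptions made for illustration, not a recommendation from LE.

```python
import hashlib

SECONDS_PER_DAY = 24 * 60 * 60


def daily_offset_seconds(seed: str, avoid_window: int = 15 * 60) -> int:
    """Map a seed to a stable second-of-day at least avoid_window away from midnight UTC."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    # Usable range excludes avoid_window seconds on both sides of midnight.
    span = SECONDS_PER_DAY - 2 * avoid_window
    return avoid_window + int.from_bytes(digest[:8], "big") % span


# The seed can be any stable string; the redacted account URL is used as a placeholder.
offset = daily_offset_seconds("https://acme-v02.api.letsencrypt.org/acme/acct/<redacted>")
hours, remainder = divmod(offset, 3600)
print(f"run daily at {hours:02d}:{remainder // 60:02d} UTC")
```

Deriving the offset from a hash rather than picking a round time such as 01:00 avoids simply moving the pile-up to another popular minute.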

Haha, I thought it might be that. I'll move it to another time. Thanks.
