Current official status page for API?

Hey folks, I've gotten a couple of reports from users whose renewals are failing overnight, and it seems to be a timeout talking to the Let's Encrypt production API.

What's the official status reporting page for LE?

Looking at https://letsencrypt.status.io/ doesn't show anything recent, but StatusGator's "Let's Encrypt Status" page shows a bunch of things.

2 Likes

Looks like this is the page I was after: Let's Encrypt Status

2 Likes

The official status page is https://letsencrypt.status.io

The Let’s Encrypt SRE team does their best to keep it up to date with maintenances and incidents. Usually, Let’s Encrypt knows about an incident from internal alerting, but it takes a bit to confirm, assess the impact, and update the page. The status is currently ‘Operational’ and our internal metrics and alerting confirm that. If you can get users to provide the specific errors, we can help assess the problem in #help.

1 Like

Thanks Jillian, I did a new release of my app yesterday and got complaints of renewal errors today, so I'm in firefighting mode currently. In particular during new certificate orders they didn't get any http challenges in the API response, which my code wasn't really expecting :slight_smile:

https://acme-v02.api.letsencrypt.org/acme/authz-v3/10663697360

I'm guessing this coincided with API maintenance, which is absolutely fine. I probably need to hook into the status.io API to see if I can inform users of maintenance dynamically.

3 Likes

This is EXACTLY why I kept pushing for an Under Maintenance page with a 503 return code for the directory endpoint during maintenance for both staging and production. When end-users see this on-screen and/or in their error logs, they will get the picture INSTANTLY and thus hopefully won't flood developers (and this community) with unnecessary help requests.

Edit: Thanks Let's Encrypt for implementing this!

1 Like

I guess what was interesting here was that the API was returning stuff, but it was apparently impaired (no http challenges), or at least that's my impression. Hard to tell without a trace of the http responses at the time (which I don't have).

Graceful degradation is a cool feature, but you have to be expecting it in order to build a client that copes with it (i.e. you can talk to the API, but all might not be well and you may not know that).
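One way a client could guard against this is to check the authorization for the challenge type it needs before going any further, and fail with a message that says what was actually offered. A minimal Python sketch, assuming the authorization has already been fetched and decoded into a dict with the usual ACME `challenges` array; `find_challenge` and `MissingChallengeError` are hypothetical names, not part of any particular ACME library:

```python
from typing import Any


class MissingChallengeError(Exception):
    """Raised when an authorization offers no challenge of the wanted type."""


def find_challenge(authz: dict, challenge_type: str = "http-01") -> Any:
    """Return the first challenge of the given type from a decoded authorization."""
    # A degraded response may omit "challenges" or return an empty list.
    challenges = authz.get("challenges") or []
    for challenge in challenges:
        if challenge.get("type") == challenge_type:
            return challenge
    # Report what the server did offer so the user-facing error is specific.
    offered = sorted({str(c.get("type", "?")) for c in challenges})
    raise MissingChallengeError(
        f"no {challenge_type} challenge offered; "
        f"server offered: {', '.join(offered) or 'none'}"
    )
```

With this, an order that comes back without any http challenges produces a clear "no http-01 challenge offered" error instead of an unexpected failure deeper in the client.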

2 Likes

For info, the current staging downtime did indeed return a 503, which is great.

StatusCode: 503, ReasonPhrase: 'Service Temporarily Unavailable'

With response body:

{
  "type": "urn:acme:error:serverInternal",
  "detail": "The service is down for maintenance or had an internal error. Check https://letsencrypt.status.io/ for more details."
}
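
For anyone wanting to surface this to users, a response like the one above could be turned into a readable message roughly as follows. This is only a sketch in Python: `parse_problem` and `maintenance_message` are made-up helper names, and it assumes the caller already has the status code, the Content-Type header, and the body as text:

```python
import json
from typing import Optional


def parse_problem(content_type: str, body: str) -> Optional[dict]:
    """Return the decoded problem document, or None if the response is not one."""
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/problem+json":
        return None
    try:
        problem = json.loads(body)
    except ValueError:
        # The body claimed to be JSON but could not be decoded.
        return None
    return problem if isinstance(problem, dict) else None


def maintenance_message(status_code: int, content_type: str, body: str) -> Optional[str]:
    """Return a user-facing message for a 503, or None for any other status."""
    if status_code != 503:
        return None
    problem = parse_problem(content_type, body)
    detail = problem.get("detail") if problem else None
    if not isinstance(detail, str):
        detail = None
    return "ACME server unavailable (HTTP 503): " + (detail or "no details provided")
```

Passing the server's `detail` text straight through means the user also sees the pointer to the status page that the response already includes.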
2 Likes

Awesome! :slightly_smiling_face:

That's precisely what I was hoping for.

:clap:

2 Likes

@webprofusion

I think CTW should be able to easily handle that return and convey the information to users, I suspect without modification. :slightly_smiling_face:

I know that CertSage (my own client) reflects this directly and in the response history.

2 Likes

While we do handle the error overall we could report it more specifically. Unfortunately the library we use hides that initial failure (fetching the directory), but not for long!

2 Likes

Ah... I ran into that too the other day.

@jillian

I see you down there. :wink:

Did Let's Encrypt recently change the Content-Type header away from "application/problem+json"? I noticed my detailed error-handling stopped working.

1 Like

The staging 503 error header included it:

{
  Connection: keep-alive
  Date: Sat, 20 Feb 2021 02:16:59 GMT
  ETag: "5f76372f-b2"
  Server: nginx
  Content-Length: 178
  Content-Type: application/problem+json
}
2 Likes

Let’s Encrypt will return a 503 when we are certain the infrastructure is unavailable, but this is only possible when our load balancers are still up and accessible. There will always be some maintenances where we make changes to our networking gear and our datacenter is essentially offline. This usually only affects Staging because we don’t have a secondary datacenter that we can fail over to. As a result, the errors returned to the users are from our CDN. We think this is ok because of how rarely we take Staging entirely offline and cannot serve a proper 503 response.

In production, we mostly do non-interruptive rolling restarts and rarely turn off all access to the API. On the occasions where we do stop Production API access, we make sure to return a 503 from our frontends whenever possible and provide a maintenance notice on the status page about the downtime.
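
From the client side, the cases described here could be told apart roughly like this. It is a hedged Python sketch, not how any existing client does it: `classify_failure` and its category strings are invented for illustration, and it assumes a `status_code` of `None` means the request never got an HTTP response at all:

```python
from typing import Optional


def classify_failure(status_code: Optional[int], content_type: Optional[str]) -> str:
    """Map a failed ACME request onto a coarse category for user-facing reporting."""
    if status_code is None:
        # No HTTP response at all: timeout, DNS failure, connection refused.
        return "unreachable"
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    is_problem = media_type == "application/problem+json"
    if status_code == 503 and is_problem:
        # A proper 503 with a problem document, served while frontends are up.
        return "maintenance"
    if is_problem:
        # Any other error that carries an ACME problem document.
        return "acme-error"
    if status_code >= 500:
        # A 5xx without a problem document, e.g. an error page from a CDN.
        return "intermediary-error"
    return "unexpected"
```

Both "maintenance" and "intermediary-error" are reasonable triggers for telling the user to check the status page and retry later, rather than reporting a client bug.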

3 Likes

Cool. Wonder why I didn't get that type during some of my testing for certain errors. :thinking: I'll need to look into this more. Osiris has been generously helping to integrate cPanel support into CertSage, so we've encountered a few odd things along the way (from our own doing).

1 Like

That sounds like an excellent logic and strategy. :slightly_smiling_face:

Comically, first I and then Osiris smacked right into the two unscheduled staging maintenances during testing for CertSage. His reaction was priceless. Basically mirrored my own. Just unfortunate timing in our testing. :yum:

1 Like
