Hey folks, I've gotten a couple of reports from users whose renewals are failing overnight, and it seems to be a timeout talking to the Let's Encrypt production API.
The Let’s Encrypt SRE team does their best to keep the status page up to date with maintenances and incidents. Usually, Let’s Encrypt knows about an incident from internal alerting, but it takes a bit to confirm, assess the impact, and update the page. The status is currently ‘Operational’ and our internal metrics and alerting confirm that. If you can get users to provide the specific errors, we can help assess the problem in Help.
Thanks Jillian, I did a new release of my app yesterday and got complaints of renewal errors today, so I'm in firefighting mode currently. In particular, during new certificate orders they didn't get any http challenges in the API response, which my code wasn't really expecting.
I'm guessing this coincided with API maintenance, which is absolutely fine. I probably need to hook into the status.io API to see if I can inform users of maintenance dynamically.
This is EXACTLY why I kept pushing for an Under Maintenance page with a 503 return code for the directory endpoint during maintenance for both staging and production. When end-users see this on-screen and/or in their error logs, they will get the picture INSTANTLY and thus hopefully won't flood developers (and this community) with unnecessary help requests.
I guess what was interesting here was that the API was returning stuff, but it was apparently impaired (no http challenges), or at least that's my impression. Hard to tell without a trace of the http responses at the time (which I don't have).
Graceful degradation is a cool feature, but you have to be expecting it in order to build a client that copes with it (i.e. you can talk to the API, but all might not be well and you may not know that).
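For what it's worth, one way a client could guard against that "API answers but no http challenge" case is to check the authorization object before using it, rather than assuming the challenge is there. This is only a rough sketch: it assumes the authorization has already been parsed into a dict with the standard ACME `challenges` array, and the function and exception names are made up for illustration, not taken from any particular client.

```python
from typing import Optional


class NoUsableChallengeError(Exception):
    """Hypothetical error raised when the expected challenge type is missing."""


def find_http01_challenge(authorization: dict) -> Optional[dict]:
    """Return the http-01 challenge from a parsed ACME authorization, or None."""
    # "challenges" may be missing or empty if the API is impaired.
    for challenge in authorization.get("challenges") or []:
        if isinstance(challenge, dict) and challenge.get("type") == "http-01":
            return challenge
    return None


def require_http01_challenge(authorization: dict) -> dict:
    """Like find_http01_challenge, but fail with a message a user can act on."""
    challenge = find_http01_challenge(authorization)
    if challenge is None:
        offered = sorted(
            str(c.get("type", "?"))
            for c in authorization.get("challenges") or []
            if isinstance(c, dict)
        )
        raise NoUsableChallengeError(
            "No http-01 challenge offered (got: %s). The CA may be impaired "
            "or under maintenance; try again later." % (", ".join(offered) or "none")
        )
    return challenge
```

That way the user sees "no http-01 challenge offered" in the log instead of whatever exception falls out of indexing into an empty list.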
{
"type": "urn:acme:error:serverInternal",
"detail": "The service is down for maintenance or had an internal error. Check https://letsencrypt.status.io/ for more details."
}
While we do handle the error overall we could report it more specifically. Unfortunately the library we use hides that initial failure (fetching the directory), but not for long!
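As a rough idea of what "more specifically" could look like, here is a sketch that sorts a failed directory fetch into a few buckets based on the status code and the problem document above. It assumes you can get at the raw HTTP status and body (which, as noted, the library may hide); the function name and the bucket labels are invented for the example. Both the older `urn:acme:error:` prefix shown above and the RFC 8555 `urn:ietf:params:acme:error:` prefix are checked.

```python
import json

# serverInternal under both the legacy prefix and the RFC 8555 prefix.
SERVER_INTERNAL_TYPES = {
    "urn:acme:error:serverInternal",
    "urn:ietf:params:acme:error:serverInternal",
}


def classify_directory_failure(status_code: int, body: str) -> str:
    """Classify a failed directory fetch (hypothetical helper, labels are made up)."""
    # A 503 is the explicit "service unavailable" signal, whatever the body is.
    if status_code == 503:
        return "maintenance_or_internal_error"

    try:
        problem = json.loads(body)
    except (ValueError, TypeError):
        problem = None

    if isinstance(problem, dict):
        if problem.get("type") in SERVER_INTERNAL_TYPES:
            return "maintenance_or_internal_error"
        return "other"

    # No ACME problem document: on a 5xx this is probably an error page
    # from something in front of the API (for example a CDN).
    if status_code >= 500:
        return "infrastructure_error"
    return "other"
```

For the first bucket a client could point the user at the status page instead of reporting a generic renewal failure.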
Let’s Encrypt will return a 503 when we are certain the infrastructure is unavailable, but this is only possible when our load balancers are still up and accessible. There will always be some maintenances where we make changes to our networking gear and our datacenter is essentially offline. This usually only affects Staging because we don’t have a secondary datacenter that we can fail over to. As a result, the errors returned to the users are from our CDN. We think this is ok because of how rarely we take Staging entirely offline and cannot serve a proper 503 response.
In production, we mostly do non-interruptive rolling restarts and rarely turn off all access to the API. On the occasions where we do stop Production API access, we make sure to return a 503 from our frontends whenever possible and provide a maintenance notice on the status page about the downtime.
Cool. Wonder why I didn't get that type during some of my testing for certain errors. I'll need to look into this more. Osiris has been generously helping to integrate cPanel support into CertSage, so we've encountered a few odd things along the way (from our own doing).
Comically, first I and then Osiris smacked right into the two unscheduled staging maintenances during testing for CertSage. His reaction was priceless. Basically mirrored my own. Just unfortunate timing in our testing.