CertSage 3.3.1 Release


  • Added a delay before downloading certificate to address errors coming from Let's Encrypt server caused by requesting certificate download too soon after completion of order finalization

Funny, we noticed this with Certify The Web as well: the order completes and the status is fine, but the certificate download URL isn't populated. Glad it's not just us!

Isn't that similar to what was talked about here: LE prod issue with download URL - #3 by rmbolger

Yep. We can work around it, and it forces clients to act more robustly, but it seems to be unique to LE and not definitively per spec: a valid order should have an issued certificate and, by implication of the entire context, a certificate URL.

Right. But from what I understood of that thread, the order object returned by the Finalize will be valid and have a valid cert URL (per jsha and Max).

It was only if you poll the order after the Finalize and don't check its status that it might not have a valid cert URL, as it might still be pending.

Is it the fact that the order status might "downgrade" from the valid seen after finalize to something else (pending) that is unique to LE?

Or have I misunderstood that other thread? I haven't had any problems using the cert URL in the Finalized Order but wonder if I should be adding extra mitigation. That's why I ask.

I can confirm that this is exactly what I've been seeing with CertSage. When the problem occurs, attempting to request the cert from said URL results in:

urn:ietf:params:acme:error:malformed
Requested certificate was not found

Ah yeah, for some reason I thought it was that the certificate URL wasn't set, but it could be that it doesn't point to a real result yet. I would have thought that would just be a 404. I haven't looked at the Boulder code to dig into it.

Due to how the database replicas work, there are two possible scenarios I can imagine. Haven't seen either of these on my client though:

  1. You finalize an order, get an order with status: valid back (this will also have the certificate url, always). Now you decide to poll the order again (unnecessary, but apparently some clients do this) and this time the order suddenly flips back to processing, and now the certificate url is missing. This happens because your poll hit a read replica that isn't yet updated. Since the status was already set to valid previously, this may confuse clients.
  2. You attempt to retrieve a freshly issued certificate from the final certificate url, and are returned a 404-style error. I believe this is the malformed ACME error we saw in the referenced thread. It also happens because your download request hits a read replica that doesn't yet know about that certificate.

Scenario two is what has happened with CertSage for a few people to my knowledge, including myself.
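
For scenario two, one possible client-side mitigation is to retry the download a few times with a growing delay. This is only a minimal sketch, not how CertSage does it: `AcmeResponse`, `download_certificate`, and the injected `fetch` callable are hypothetical names, and it assumes the "not found yet" case shows up as either an HTTP 404 or the malformed problem type quoted above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

MALFORMED = "urn:ietf:params:acme:error:malformed"


@dataclass
class AcmeResponse:
    """Hypothetical, simplified view of an HTTP response from an ACME server."""
    status: int
    body: str = ""
    problem_type: Optional[str] = None  # "type" of the problem document, if any


def download_certificate(
    fetch: Callable[[str], AcmeResponse],
    cert_url: str,
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch cert_url, retrying with exponential backoff while it looks not ready."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        resp = fetch(cert_url)
        if resp.status == 200:
            return resp.body
        # Assumption: a lagging read replica answers with a 404-style error.
        # Treating every "malformed" problem this way could mask a real bug,
        # so the number of attempts is kept small.
        not_ready = resp.status == 404 or resp.problem_type == MALFORMED
        if not not_ready or attempt == attempts:
            raise RuntimeError(
                f"certificate download failed: HTTP {resp.status} {resp.problem_type}"
            )
        sleep(delay)
        delay *= 2
    raise RuntimeError("attempts must be at least 1")
```

The `fetch` callable stands in for whatever authenticated request the client already makes, which keeps the retry logic separate from the HTTP and JWS details.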

Likely all of this is related to the fact that the order object's processing status had to be removed because there are a lot of broken clients out there. Maybe the order object should be kept in the pending state until all read replicas have the certificate ready?

I concur, @bruncsak. I definitely prefer polling on a non-errored state (e.g. pending) over polling on an errored state (e.g. malformed), so that error-handling mechanisms (e.g. try-catch) are reserved for actual error handling.
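
Polling on state rather than on errors could look roughly like the sketch below. It assumes the order is available as a plain dict with the `status` and `certificate` fields that ACME orders carry; `wait_for_certificate_url` and the injected `get_order` callable are hypothetical names, not part of any particular client.

```python
import time
from typing import Any, Callable, Dict


def wait_for_certificate_url(
    finalize_response: Dict[str, Any],
    get_order: Callable[[], Dict[str, Any]],
    attempts: int = 10,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the certificate URL once the order is valid, polling if needed."""
    # Start from the order returned by finalize: if it is already valid and
    # carries a certificate URL, no further polling happens at all.
    order = finalize_response
    for attempt in range(attempts):
        if attempt > 0:
            sleep(delay)
            order = get_order()
        status = order.get("status")
        if status == "valid" and order.get("certificate"):
            return order["certificate"]
        if status == "invalid":
            raise RuntimeError("order became invalid")
        # Any other status (pending, ready, processing) just means "not yet".
    raise TimeoutError("order did not become valid in time")
```

Because a valid finalize response is returned immediately, a client written this way never re-polls an order it has already seen as valid, which sidesteps the "flip back to processing" case described earlier.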

Yes, and jsha said the same in that thread I linked earlier. They tried to do that with async Finalize, which broke too many clients. See: LE prod issue with download URL - #4 by jsha

I don't like the idea of treating a specific error as a trigger to poll the order either, or, on seeing that error, to wait a bit and retry.

But, more broadly I've worked on other threads recently with the "404" problem repeating more than you'd expect. It looks like the read replicas are lagging more often. Obviously not for every request but enough to be surfacing related problems more often.

There might be a middle ground between the AsyncFinalize flag being just on or off. For known well-behaved clients (those that correctly handle the processing state of the order object) it should be on; for the known broken ones, off. The key should be the User-Agent string.
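
As a rough illustration of that idea only, and not a description of how Boulder works: the allowlist contents and the `async_finalize_enabled` function are made up for the example, and matching on a User-Agent prefix is just one assumed way of keying the decision.

```python
from typing import Optional

# Hypothetical allowlist of User-Agent prefixes for clients known to handle
# the "processing" order state correctly. These names are placeholders.
ASYNC_FINALIZE_ALLOWLIST = ("ExampleAcmeClient/", "AnotherGoodClient/")


def async_finalize_enabled(user_agent: Optional[str]) -> bool:
    """Decide per request whether finalize may return a "processing" order."""
    if not user_agent:
        # Unknown clients get the conservative, synchronous behavior.
        return False
    return user_agent.startswith(ASYNC_FINALIZE_ALLOWLIST)
```

Defaulting to off for unknown or missing User-Agent strings keeps the broken clients working while letting the allowlisted ones see the real order state.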

I've pulled this thread off topic. I wrote a long post about atomic key-value caching for "hot" information but discarded it. Inconsistent reads are an API bug, and how you fix that is an implementation detail. Clients don't need to know how LE works; it's just ACME.