Nope. You may find empirically there is an order that occurs more often but I would strongly caution against making any assumptions about order (or total # of validations).
If you want to maintain RFC 8555 compatibility you should make no assumptions about the state of an authorization based on observed validation requests. Use the authorization state reported by the ACME server.
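For clients, that advice boils down to polling the authorization object until the server reports a terminal status, rather than inferring state from observed validation requests. A minimal sketch (the URL is hypothetical, and the bare-GET `poll` default is a simplification; a real RFC 8555 client fetches the authorization with POST-as-GET using a signed JWS body):

```python
import json
import time
import urllib.request

def wait_for_authz(authz_url,
                   poll=lambda u: json.load(urllib.request.urlopen(u)),
                   interval=2.0, timeout=60.0):
    """Poll an RFC 8555 authorization object until it leaves 'pending'.

    Returns the terminal status string ('valid', 'invalid', 'expired', ...).
    `poll` is injectable so the fetch mechanism (and tests) can be swapped in.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        authz = poll(authz_url)
        if authz["status"] != "pending":
            return authz["status"]
        time.sleep(interval)
    raise TimeoutError("authorization still pending")
```

The point is that the only authoritative signal is the `status` field the ACME server reports, whatever validation traffic you happened to observe.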
Yes. The request from our primary validation authority in our own data centre (not a cloud provider) is always required to succeed. If the primary validation request fails, the authorization is marked invalid. This is important for compliance with our various requirements (CABF BRs, root programs, WebTrust audits, etc.). The new perspectives are “added value” but cannot override the primary result.
This is configurable on the Let’s Encrypt side based on a feature flag. We wanted this to be configurable because waiting for all of the validation requests to complete/timeout provides more data for debugging at the cost of overall validation latency.
In staging we presently have the MultiVAFullResults feature flag disabled (as of ~yesterday afternoon), so authorizations become invalid as soon as the primary validation request fails, or 2 of the remote validation requests fail.
We intend to launch in production on Feb 19th with MultiVAFullResults enabled, so authorizations will only become invalid once all of the validation requests have completed unsuccessfully or timed out. This mode makes it easier for us to collect data on differentials between perspectives (it’s currently enabled in prod for that reason).
Sometime after launch (TBD) we will disable MultiVAFullResults and the behaviour in prod will match staging. This will improve overall validation latency at the cost of collecting less data to help support debugging.
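The two modes can be modelled as the same decision rule with different stopping behaviour: the primary must succeed and fewer than two remote perspectives may fail; the flag only controls whether the outcome is reported as soon as that rule is decided or after every request completes. A toy sketch (not Boulder’s actual code; the function and event shape are invented for illustration):

```python
def decide(events, full_results):
    """Simulate authorization outcome over a stream of validation results.

    `events` is an ordered list of (source, ok) tuples, where source is
    'primary' or 'remote'. Returns (outcome, events_consumed) so the
    latency difference between the two modes is visible.
    """
    primary_ok = None
    remote_failures = 0
    for i, (source, ok) in enumerate(events, 1):
        if source == "primary":
            primary_ok = ok
        elif not ok:
            remote_failures += 1
        quorum_lost = (primary_ok is False) or remote_failures >= 2
        if quorum_lost and not full_results:
            # Short-circuit mode: fail as soon as the rule is decided.
            return "invalid", i
    # Full-results mode: decide only after every request has completed.
    outcome = "valid" if (primary_ok and remote_failures < 2) else "invalid"
    return outcome, len(events)
```

With the same failing inputs, both modes reach the same outcome; the short-circuit mode just reaches it after fewer completed requests.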
This is an example of why I recommend not making assumptions about authorization state client-side. Even if you aren’t interested in compatibility with other ACME CAs the implementation details at Let’s Encrypt are subject to change.
IMHO, in the use-case of DNS challenges it would be beneficial to short-circuit to failure as soon as the primary (required) LE server fails, and then again as soon as a second secondary server fails. That looks like it would be a new feature for Boulder though, as it’s a challenge-specific feature, not global behavior.
The reason for this is the nature of commercial DNS providers and how they handle their internal data-storage caching layers, before you even get to their DNS systems with their own caching and flushing, or how they distribute their DNS load. If the DNS challenge is misconfigured, the ‘all challenges’ approach is likely to wedge incorrect data into multiple DNS and application caches. I remember a few DNS vendors that supported a limited number of flush commands per day, had a 5-10 minute TTL, and used a read-through cache with a least-recently-used eviction policy, with no write-through cache functionality. An end-user could edit and flush endlessly, but the DNS servers would just reload the bad data from an application cache. (This caused me to adopt ACME-DNS.)
This is not an issue for http-01 validation, or for DNS delegated to ACME-DNS. It’s just a handful of commercial DNS providers.
Things may have changed since I last dealt with this, but you should have this on your radar in case it hasn’t.
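The failure mode described above can be modelled as a read-through LRU cache layered over an application cache with no write-through path, where the user-visible flush only clears the DNS layer. A toy illustration (class, key, and value names are all invented for the example, not any vendor’s real design):

```python
from collections import OrderedDict

class ReadThroughLRU:
    """Toy model of a DNS-layer LRU cache that reads through to an
    application cache on miss, with no write-through."""

    def __init__(self, backing, capacity=128):
        self.backing = backing        # the vendor's application cache
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        value = self.backing[name]    # read-through on miss
        self.cache[name] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # LRU eviction
        return value

    def flush(self, name):
        # The user-facing "flush" only clears this layer, not the backing.
        self.cache.pop(name, None)

# The user has fixed the origin zone, but the application cache is stale
# and not user-evictable:
app_cache = {"_acme-challenge.example.com": "stale-token"}
dns = ReadThroughLRU(app_cache)
dns.flush("_acme-challenge.example.com")
# Even after a flush, lookups reload the stale value from the app cache:
assert dns.get("_acme-challenge.example.com") == "stale-token"
```

Until something evicts or overwrites the application-cache entry, edit-and-flush cycles at the DNS layer can never surface the corrected record.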
Thanks for the insight @jvanasco. It’s valuable to hear about experience with larger commercial DNS providers.
Indeed, there’s no way to do this with Boulder as-is. I’m also not sure it would truly resolve the situation. I don’t think the problem is specific to multiple vantage point validation, though I agree it exacerbates the situation.
If there is a caching layer in front of the true authoritative DNS zone information and it caches incorrect data in a way that the user can’t evict, there will always be problems. Our primary validation requests are just as likely to see the bad cached data, and subsequent retries by the user would be subject to the same stale data.
Yikes. That sounds pretty dicey. For what it’s worth, our infrastructure clips the max TTL to 60s, though that likely does not help if the caches in front of the authoritative zone data enforce their own TTLs.
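The TTL clip is just a resolver-side cap on however long the upstream record claims to be cacheable. A one-line sketch (the 60 s default mirrors the figure above; purely illustrative):

```python
def effective_ttl(record_ttl, max_ttl=60):
    """Clamp an upstream record TTL to a resolver-side maximum."""
    return min(record_ttl, max_ttl)
```

A provider-side cache with its own longer TTL sits upstream of this clamp, which is why the clip doesn’t help in the scenario above.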