I ran some numbers to estimate the size of the impact, so far just from the validation logs. From 03-21 to 03-28, we had:
- 216M validation attempts
- 179M validation failures (83%)
- 97M of those failures came from accounts that had 0 validation successes over the course of that week. Removing those failures would leave about 82M, which is 38% of the current attempt volume (or about 69% of the 119M attempts that would remain).
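As a quick sanity check on those percentages, here is a minimal sketch using only the rounded figures from the list above (in millions). The variable names are mine, and the two "after removal" rates differ only in which denominator is assumed: the original attempt volume, or the attempts left once the removed failures are gone.

```python
# Rounded figures from the week of validation logs, in millions.
attempts = 216
failures = 179
total_failure_errors = 97  # failures from accounts with 0 successes that week

current_rate = failures / attempts
remaining_failures = failures - total_failure_errors

# Assumption 1: keep the original attempt volume as the denominator.
rate_vs_original = remaining_failures / attempts
# Assumption 2: the removed failures also disappear from the attempt count.
rate_vs_remaining = remaining_failures / (attempts - total_failure_errors)

print(f"current failure rate:                  {current_rate:.0%}")
print(f"after removal, vs. original attempts:  {rate_vs_original:.0%}")
print(f"after removal, vs. remaining attempts: {rate_vs_remaining:.0%}")
```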
Here are some numbers bucketed by how many validation attempts a given account had during the week. A "total failure" is an account that had 0 validation successes; these are the candidates for pausing (if they also had no issuances for X days). The "errors from total failures" column sums the error counts from those accounts.
bucket | accounts | validation attempts | errors | errors from total failures | error rate |
---|---|---|---|---|---|
1 | 786,802 | 786,802 | 185,146 | 185,146 | 0.23531 |
2-5 | 1,129,112 | 3,126,791 | 636,033 | 473,726 | 0.20341 |
6-25 | 547,993 | 6,865,347 | 4,637,927 | 3,622,346 | 0.67556 |
26-625 | 521,949 | 74,910,336 | 69,896,477 | 49,091,267 | 0.93307 |
626-3125 | 45,053 | 55,769,849 | 52,059,549 | 25,932,861 | 0.93347 |
3126-15625 | 6,053 | 35,594,192 | 32,814,899 | 13,122,730 | 0.92192 |
15626+ | 696 | 39,353,654 | 19,274,601 | 4,582,055 | 0.48978 |
Interestingly, the error rate starts out fairly low (20-24%) in the buckets with few attempts, climbs to 92-93% in the buckets between 26 and 15,625 attempts, and then drops back to 49% in the top bucket. Presumably the high error rates are a matter of clients that retry faster than they should.
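For anyone who wants to re-derive the table, here is a rough sketch that recomputes the error rate per bucket and also the share of each bucket's errors that came from total-failure accounts. The figures are copied from the table; the share column and all variable names are my own additions, not something taken from the logs directly.

```python
# (bucket, accounts, attempts, errors, errors_from_total_failures), copied from the table.
rows = [
    ("1", 786_802, 786_802, 185_146, 185_146),
    ("2-5", 1_129_112, 3_126_791, 636_033, 473_726),
    ("6-25", 547_993, 6_865_347, 4_637_927, 3_622_346),
    ("26-625", 521_949, 74_910_336, 69_896_477, 49_091_267),
    ("626-3125", 45_053, 55_769_849, 52_059_549, 25_932_861),
    ("3126-15625", 6_053, 35_594_192, 32_814_899, 13_122_730),
    (">15625", 696, 39_353_654, 19_274_601, 4_582_055),
]

# Errors divided by attempts, per bucket (the table's "error rate" column).
error_rates = [errors / attempts for _, _, attempts, errors, _ in rows]
# Fraction of each bucket's errors that came from accounts with 0 successes.
tf_shares = [tf_errors / errors for _, _, _, errors, tf_errors in rows]

for (bucket, *_), rate, share in zip(rows, error_rates, tf_shares):
    print(f"{bucket:>11}  error rate {rate:.5f}  total-failure share {share:.1%}")

# Column totals, which should line up with the headline numbers (216M / 179M / 97M).
total_attempts = sum(r[2] for r in rows)
total_errors = sum(r[3] for r in rows)
total_tf_errors = sum(r[4] for r in rows)
print(total_attempts, total_errors, total_tf_errors)
```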
Since these are actual validation attempts, they don't include requests that were stopped by the rate limits. So for instance, a client that retried a failed validation as fast as the rate limits allow (5 failed validations / hostname / hour) would have 840 attempts for a single hostname over the week (5 × 24 × 7).
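The 840 figure, and what it implies for the bigger buckets, can be sketched like this. It assumes the limit is enforced strictly as 5 failures per hostname per clock hour with no burst allowance; `min_hostnames` is a hypothetical helper of mine that gives a lower bound on how many distinct hostnames an account would need in order to log a given number of failures in a week.

```python
import math

FAILED_VALIDATIONS_PER_HOSTNAME_PER_HOUR = 5  # rate limit quoted above
HOURS_IN_WEEK = 24 * 7

# Most failed validations a single hostname can log in one week under that limit.
weekly_cap_per_hostname = FAILED_VALIDATIONS_PER_HOSTNAME_PER_HOUR * HOURS_IN_WEEK


def min_hostnames(failed_attempts: int) -> int:
    """Lower bound on distinct hostnames needed to log this many failures in a week."""
    return math.ceil(failed_attempts / weekly_cap_per_hostname)


print(weekly_cap_per_hostname)  # 840
print(min_hostnames(3126))      # 4
```

So under that assumption, an account with more than 840 failures in the week must be failing on more than one hostname.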