Do you get a more detailed response along with the HTTP 429? I’d be curious about what it says.
Looking at the crt.sh results for your domain, it doesn’t appear that you ran into the actual rate limits for Let’s Encrypt yet. However, even if that were the issue, it would be limited to one environment and should not affect staging. You should be able to verify if this is a rate-limiting error by looking at the full output rather than just the HTTP status code - this would also include which of the various rate limits you’re running into.
Akamai, the CDN that sits in front of Let’s Encrypt’s servers, might have some additional request-throttling logic that also uses HTTP 429 for traffic that looks like a DDoS attempt or similar. Both staging and production are behind Akamai, so this logic could have blocked you from both endpoints. If you’re not seeing a detailed error message in the response (with something like “urn:acme:error:rateLimited”), that’s likely what’s happening here. That said, I’m just speculating; I haven’t seen this happen before.
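One way to tell the two cases apart is to inspect the body of the 429 response rather than only its status code. The sketch below assumes the ACME server returns a JSON problem document with a `type` field (as in the urn:acme:error:rateLimited example above) and treats anything else as a possible CDN-level throttle. The function name and the sample bodies are made up for illustration.

```python
import json

def classify_429(body: str) -> str:
    """Guess where an HTTP 429 came from, based only on its response body."""
    try:
        problem = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page from a CDN.
        return "not-acme"
    if not isinstance(problem, dict):
        return "not-acme"
    error_type = str(problem.get("type", ""))
    if error_type == "urn:acme:error:rateLimited":
        return "acme-rate-limit"
    if error_type.startswith("urn:acme:error:"):
        return "other-acme-error"
    return "not-acme"

# Hypothetical sample bodies, not captured from a real server.
print(classify_429('{"type": "urn:acme:error:rateLimited", "detail": "..."}'))
print(classify_429("<html><body>Too Many Requests</body></html>"))
```

A "not-acme" result would not prove that the CDN is throttling, only that the response did not look like an ACME error.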
Here are the messages I’m receiving from kube-lego. They appear to be exactly the same as what I was getting when I was pointed at the Production endpoint, and the error appears to be coming back from Let’s Encrypt.
2016-10-06T02:40:37.001112032Z time="2016-10-06T02:40:37Z" level=debug msg="testing reachablity of http://api.polykube.io/.well-known/acme-challenge/_selftest" context=acme host=api.polykube.io
2016-10-06T02:40:37.003517028Z time="2016-10-06T02:40:37Z" level=debug msg="testing reachablity of http://polykube.io/.well-known/acme-challenge/_selftest" context=acme host=polykube.io
2016-10-06T02:40:37.005540909Z 2016/10/06 02:40:37 [INFO][api.polykube.io, polykube.io] acme: Obtaining bundled SAN certificate
2016-10-06T02:40:37.183483439Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Could not find solver for: tls-sni-01
2016-10-06T02:40:37.183514340Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Could not find solver for: dns-01
2016-10-06T02:40:37.183518140Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Trying to solve HTTP-01
2016-10-06T02:40:37.281825379Z 2016/10/06 02:40:37 [INFO][api.polykube.io] The server validated our request
2016-10-06T02:40:37.281855981Z 2016/10/06 02:40:37 [INFO][polykube.io] acme: Could not find solver for: dns-01
2016-10-06T02:40:37.281859681Z 2016/10/06 02:40:37 [INFO][polykube.io] acme: Trying to solve HTTP-01
2016-10-06T02:40:37.365698140Z 2016/10/06 02:40:37 [INFO][polykube.io] The server validated our request
2016-10-06T02:40:37.365726041Z 2016/10/06 02:40:37 [INFO][api.polykube.io, polykube.io] acme: Validations succeeded; requesting certificates
2016-10-06T02:40:37.747826952Z time="2016-10-06T02:40:37Z" level=warning msg="Error while obtaining certificate: Errors while obtaining cert: map[api.polykube.io:acme: Error 429 - urn:acme:error:rateLimited - Error creating new cert :: Too many certificates already issued for exact set of domains: api.polykube.io,polykube.io polykube.io:acme: Error 429 - urn:acme:error:rateLimited - Error creating new cert :: Too many certificates already issued for exact set of domains: api.polykube.io,polykube.io]" context=acme
Yep, that looks like a regular rate limiting error. I guess crt.sh is lagging behind a bit, I don’t see that many certificates yet.
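For reference, the error type can be pulled out of a log line like the one above with a small script. This is only a sketch: the regular expression is based on the format of that single kube-lego message and may not match other client versions, and the sample line is a shortened copy of the log entry.

```python
import re

# Matches e.g. "acme: Error 429 - urn:acme:error:rateLimited"
ACME_ERROR = re.compile(r"acme: Error (\d{3}) - (urn:acme:error:\w+)")

def acme_errors(log_line: str) -> list:
    """Return the distinct (status, error type) pairs found in a log line."""
    return sorted(set(ACME_ERROR.findall(log_line)))

# Shortened copy of the warning logged by kube-lego.
line = ('msg="Error while obtaining certificate: Errors while obtaining cert: '
        'map[api.polykube.io:acme: Error 429 - urn:acme:error:rateLimited - '
        'Error creating new cert :: Too many certificates already issued for '
        'exact set of domains: api.polykube.io,polykube.io]"')
print(acme_errors(line))
```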
Either way, this error should be limited to the environment in which you exceeded the rate limit. That would mean either:
You ran into the rate limits on the staging environment as well. Staging generally has higher rate limits, so this would be harder to pull off, plus it doesn’t sound like you’ve used staging at all so far.
The client isn’t actually talking to staging when you think it is. Not sure if the log mentions the ACME server it’s talking to somewhere? Can you double-check that LEGO_URL is pointing to acme-staging?
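A quick sanity check on the configured endpoint could look like the sketch below. It assumes the v1 directory hostnames (`acme-staging.api.letsencrypt.org` for staging, `acme-v01.api.letsencrypt.org` for production) and reads `LEGO_URL` from the environment. The function name is made up.

```python
import os
from urllib.parse import urlparse

# Assumed hostnames of the two Let's Encrypt ACME v1 environments.
KNOWN_HOSTS = {
    "acme-staging.api.letsencrypt.org": "staging",
    "acme-v01.api.letsencrypt.org": "production",
}

def acme_environment(url: str) -> str:
    """Map a directory URL to 'staging', 'production' or 'unknown'."""
    host = urlparse(url).hostname or ""
    return KNOWN_HOSTS.get(host, "unknown")

# Reports 'unknown' if LEGO_URL is unset or points somewhere else.
print(acme_environment(os.environ.get("LEGO_URL", "")))
```

This only checks what the client was configured with, not which server it ends up talking to.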
So if I look at crt.sh now… I see there are certs issued for polykube.io to:
OCSP - URI:http://ocsp.int-x3.letsencrypt.org/
CA Issuers - URI:http://cert.int-x3.letsencrypt.org/
Which makes me think I was hitting staging maybe? But I really don’t think I ever got a valid cert. When I realized I had been banned from Prod, I changed the config and watched logs and saw it immediately getting 429.
I think certificates issued by staging are logged to a special CT log for untrusted certificates (ct.googleapis.com/submariner), but that log is not monitored by crt.sh, so there’s no easy way to check. The ones that show up on crt.sh are all from production.
I guess you could also try requesting the certificate with a different client against both environments to see where the error occurs, in case adding logging is too cumbersome.
Looking at the issuer field in the certificate (with something like openssl x509 -in cert.pem -text, or using your browser's certificate UI), if it's "CN=Let's Encrypt Authority X3", that's a production certificate, otherwise it's from staging.
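That rule of thumb can be written down as a small helper. The sketch below takes the issuer line as text (for example from `openssl x509 -in cert.pem -noout -issuer`) and uses plain substring matching. Treating any "Fake LE" issuer as staging is an assumption based on the issuer name reported elsewhere in this thread, and the helper is more conservative than the rule above: anything it does not recognize is reported as unknown.

```python
def issuer_environment(issuer: str) -> str:
    """Classify a certificate by the text of its issuer field."""
    # Normalize a typographic apostrophe in case the text was copied from a web page.
    text = issuer.replace("\u2019", "'")
    if "Fake LE" in text:
        return "staging"
    if "Let's Encrypt Authority" in text:
        return "production"
    return "unknown"

print(issuer_environment("issuer= /C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3"))
print(issuer_environment("issuer= /CN=Fake LE Intermediate X1"))
```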
They're both hosted behind the same CDN, so that's fine. Traffic would be handled by different backend servers depending on the hostname (kind of like having multiple apache vhosts on the same server).
Yeah... when pointed at staging, I got a Prod cert.
You're sure that the staging host isn't supposed to point to api.letsencrypt.org.edgekey-staging.net (which is valid and seems to return responses similar to the production endpoint's)?
Using certbot with --staging resulted in a certificate with “Issuer: CN=Fake LE Intermediate X1”, without staging I got “Issuer: C=US, O=Let’s Encrypt, CN=Let’s Encrypt Authority X3”.
My new suspicion is that the initial requests are being sent to staging, but the staging endpoint is returning documents with links to production. That would explain why the initial request goes to staging while the certs are retrieved from Prod.
Which is, in itself, interesting. I would’ve expected the later request to Prod to have failed, since it was initiated against the Staging endpoint.
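That suspicion could be tested by fetching the directory document from the staging URL and checking whether every endpoint it advertises lives on the same host. A minimal sketch, assuming the directory is a flat JSON object mapping resource names to URLs. The helper name and the sample document are hypothetical: the mixed-up `new-cert` entry is constructed to show a mismatch, not something observed.

```python
from urllib.parse import urlparse

def foreign_links(directory_url: str, directory: dict) -> dict:
    """Return the directory entries whose host differs from the directory's own."""
    expected = urlparse(directory_url).hostname
    mismatched = {}
    for name, value in directory.items():
        if not isinstance(value, str):
            continue  # skip nested objects or other non-URL values
        host = urlparse(value).hostname
        if host is not None and host != expected:
            mismatched[name] = value
    return mismatched

staging = "https://acme-staging.api.letsencrypt.org/directory"
# Constructed example: one link points at the production host.
sample = {
    "new-reg": "https://acme-staging.api.letsencrypt.org/acme/new-reg",
    "new-authz": "https://acme-staging.api.letsencrypt.org/acme/new-authz",
    "new-cert": "https://acme-v01.api.letsencrypt.org/acme/new-cert",
}
print(foreign_links(staging, sample))
```

An empty result for the real staging directory would suggest the links are consistent and the mix-up is happening somewhere in the client or its configuration instead.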