Do bans from Prod endpoint affect Staging endpoint?

My domain requests are for [polykube.io, api.polykube.io].

I fudged some config and got myself temp-banned from the Production endpoint. (I see 429s in response).

I have since switched to pointing at the Staging endpoint, but I’m receiving the same error message, as if I were still hitting Prod.

I’m rather confident that I’m not still hitting the Prod endpoint. Is there a chance that the ban applies to both environments?

Do you get a more detailed response along with the HTTP 429? I’d be curious about what it says.

Looking at the crt.sh results for your domain, it doesn’t appear that you ran into the actual rate limits for Let’s Encrypt yet. However, even if that were the issue, it would be limited to one environment and should not affect staging. You should be able to verify if this is a rate-limiting error by looking at the full output rather than just the HTTP status code - this would also include which of the various rate limits you’re running into.

It might be possible that Akamai, the CDN that sits in front of Let’s Encrypt’s servers, has some additional request throttling logic that also uses HTTP 429, for traffic that looks like a DDoS attempt or something like that. Both staging and production are behind Akamai, so this logic could have banned you from both endpoints. If you’re not seeing a detailed error message in the response (with something like “urn:acme:error:rateLimited”), that’s likely what’s happening here. That being said, I’m just speculating here, I haven’t seen this happen before.
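A rough sketch of that check in Python: given the status code and body of a response, it reports whether a 429 carries the ACME rate-limit error type or looks like it came from something else, such as a CDN. The function name and the returned labels are made up for illustration; only the error type string is taken from the post above, and the sketch assumes the ACME server reports errors as a JSON document with a "type" field.

```python
import json

# Error type quoted in the discussion above.
ACME_RATE_LIMITED = "urn:acme:error:rateLimited"


def classify_429(status, body):
    """Return 'not-429', 'acme-rate-limit', or 'other-429' (hypothetical labels)."""
    if status != 429:
        return "not-429"
    try:
        problem = json.loads(body)
    except (ValueError, TypeError):
        # Body is missing or not JSON, so it is not an ACME error document.
        return "other-429"
    if isinstance(problem, dict) and problem.get("type") == ACME_RATE_LIMITED:
        return "acme-rate-limit"
    return "other-429"
```

An "other-429" result would be consistent with throttling by the CDN, although it does not prove it.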

Here are the messages I’m receiving from kube-lego. They look exactly like the ones I received when I was pointed at the Production endpoint, and the error appears to come back from Let’s Encrypt.

2016-10-06T02:40:37.001112032Z time="2016-10-06T02:40:37Z" level=debug msg="testing reachablity of http://api.polykube.io/.well-known/acme-challenge/_selftest" context=acme host=api.polykube.io 
2016-10-06T02:40:37.003517028Z time="2016-10-06T02:40:37Z" level=debug msg="testing reachablity of http://polykube.io/.well-known/acme-challenge/_selftest" context=acme host=polykube.io 
2016-10-06T02:40:37.005540909Z 2016/10/06 02:40:37 [INFO][api.polykube.io, polykube.io] acme: Obtaining bundled SAN certificate
2016-10-06T02:40:37.183483439Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Could not find solver for: tls-sni-01
2016-10-06T02:40:37.183514340Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Could not find solver for: dns-01
2016-10-06T02:40:37.183518140Z 2016/10/06 02:40:37 [INFO][api.polykube.io] acme: Trying to solve HTTP-01
2016-10-06T02:40:37.281825379Z 2016/10/06 02:40:37 [INFO][api.polykube.io] The server validated our request
2016-10-06T02:40:37.281855981Z 2016/10/06 02:40:37 [INFO][polykube.io] acme: Could not find solver for: dns-01
2016-10-06T02:40:37.281859681Z 2016/10/06 02:40:37 [INFO][polykube.io] acme: Trying to solve HTTP-01
2016-10-06T02:40:37.365698140Z 2016/10/06 02:40:37 [INFO][polykube.io] The server validated our request
2016-10-06T02:40:37.365726041Z 2016/10/06 02:40:37 [INFO][api.polykube.io, polykube.io] acme: Validations succeeded; requesting certificates
2016-10-06T02:40:37.747826952Z time="2016-10-06T02:40:37Z" level=warning msg="Error while obtaining certificate: Errors while obtaining cert: map[api.polykube.io:acme: Error 429 - urn:acme:error:rateLimited - Error creating new cert :: Too many certificates already issued for exact set of domains: api.polykube.io,polykube.io polykube.io:acme: Error 429 - urn:acme:error:rateLimited - Error creating new cert :: Too many certificates already issued for exact set of domains: api.polykube.io,polykube.io]" context=acme

Yep, that looks like a regular rate limiting error. I guess crt.sh is lagging behind a bit, I don’t see that many certificates yet.

Either way, this error should be limited to the environment in which you exceeded the rate limit. That would mean either:

  • You ran into the rate limits on the staging environment as well. Staging generally has higher rate limits, so this would be harder to pull off, plus it doesn’t sound like you’ve used staging at all so far.
  • The client isn’t actually talking to staging when you think it is. Not sure if the log mentions the ACME server it’s talking to somewhere? Can you double-check that LEGO_URL is pointing to acme-staging?
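One way to double-check the configured URL, sketched in Python: classify the value of LEGO_URL by its hostname. The function name and labels are hypothetical, and the sketch assumes the two Let's Encrypt hostnames acme-staging.api.letsencrypt.org and acme-v01.api.letsencrypt.org. It only inspects the configured value, not what the client actually contacts at runtime.

```python
import os
from urllib.parse import urlparse

# Assumed hostnames of the two Let's Encrypt environments.
STAGING_HOST = "acme-staging.api.letsencrypt.org"
PRODUCTION_HOST = "acme-v01.api.letsencrypt.org"


def acme_environment(url):
    """Classify an ACME directory URL by hostname (labels are illustrative)."""
    if not url:
        return "unset"
    host = urlparse(url).hostname
    if host == STAGING_HOST:
        return "staging"
    if host == PRODUCTION_HOST:
        return "production"
    return "unknown"


if __name__ == "__main__":
    # Reads the same variable the client is configured with.
    print(acme_environment(os.environ.get("LEGO_URL")))
```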

I came to the same conclusions as you. However, I’ve never gotten a valid cert from “staging” (assuming I’m actually hitting it).

The reasons why I don’t think I’m pointed at Prod:

  1. The code defaults to Staging if there is no environment variable set: https://github.com/jetstack/kube-lego/blob/73b73901112030a9daeae1cb1c46734563612571/pkg/kubelego/kubelego.go#L213

  2. The environment variable is being set to: “https://acme-staging.api.letsencrypt.org/directory”

Do staging certs show up in Certificate Transparency? I assumed not?

So if I look at crt.sh now… I see there are certs issued for polykube.io to:

            OCSP - URI:http://ocsp.int-x3.letsencrypt.org/
            CA Issuers - URI:http://cert.int-x3.letsencrypt.org/

Which makes me think I was hitting staging maybe? But I really don’t think I ever got a valid cert. When I realized I had been banned from Prod, I changed the config and watched logs and saw it immediately getting 429.

Actually, it looks like all the certs were issued that way, so still not proof I was hitting staging.

(I’m currently building a copy of kube-lego with extra logging…)

I think certificates issued by staging are logged to a special CT log for untrusted certificates (ct.googleapis.com/submariner), but that log is not monitored by crt.sh, so there’s no easy way to check. The ones that show up on crt.sh are all from production.

I guess you could also try requesting the certificate with a different client against both environments to see where the error occurs, in case adding logging is too cumbersome.

I’m pretty sure I’m hitting Staging.

After adding this change: https://github.com/colemickens/kube-lego/commit/74090126cf40471aff922d8e94a81ca7b8dcaf63

I see this before the failures:

time="2016-10-06T05:15:22Z" level=info msg="initializing lego acme connection to: %!(EXTRA string=https://acme-staging.api.letsencrypt.org/directory)" context=acme

I switched the domain names to another I control and got a cert. How can I check if the cert was issued by the Staging or Prod endpoint?

Are the two domains supposed to point at the same IP?

$ dig +short acme-v01.api.letsencrypt.org
api.letsencrypt.org.edgekey.net.
e981.dscb.akamaiedge.net.
23.72.237.126

$ dig +short acme-staging.api.letsencrypt.org
api.letsencrypt.org.edgekey.net.
e981.dscb.akamaiedge.net.
23.72.237.126

Looking at the issuer field in the certificate (with something like openssl x509 -in cert.pem -text, or using your browser's certificate UI), if it's "CN=Let's Encrypt Authority X3", that's a production certificate, otherwise it's from staging.

They're both hosted behind the same CDN, so that's fine. Traffic would be handled by different backend servers depending on the hostname (kind of like having multiple apache vhosts on the same server).

Yeah... when pointed at staging, I got a Prod cert.

You're sure that the staging host isn't supposed to point to: api.letsencrypt.org.edgekey-staging.net (which is valid and seems to return similar responses as the production endpoint...)?

There’s at least one other historical reference to the ...edgekey-staging... url: https://github.com/sludin/Protocol-ACME/blob/63b308fe12e589d032d63f5390edb582d3ce17e2/revoke.pl#L13

Did a quick test on a clean server, I got the following hostnames/IPs (the CDN uses different IPs depending on your location, that’s fine):

$ dig +short acme-v01.api.letsencrypt.org
api.letsencrypt.org.edgekey.net.
e981.dscb.akamaiedge.net.
172.229.221.121

$ dig +short acme-staging.api.letsencrypt.org
api.letsencrypt.org.edgekey.net.
e981.dscb.akamaiedge.net.
172.229.221.121

Using certbot with --staging resulted in a certificate with “Issuer: CN=Fake LE Intermediate X1”, without staging I got “Issuer: C=US, O=Let’s Encrypt, CN=Let’s Encrypt Authority X3”.
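The issuer comparison can be sketched as a small Python helper that takes the issuer string (for example, copied from the openssl x509 -text output) and guesses the environment. The function name and labels are hypothetical, and the two substrings are taken only from the issuers quoted above, so any other intermediate would come back as "unknown".

```python
def issuing_environment(issuer):
    """Guess the Let's Encrypt environment from a certificate issuer string."""
    # Forum software may turn the apostrophe into a curly quote; normalise it.
    normalized = issuer.replace("\u2019", "'")
    if "Fake LE" in normalized:
        return "staging"
    if "Let's Encrypt Authority" in normalized:
        return "production"
    return "unknown"
```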

:scream: Oh dear. Thank you for sanity checking me and your patience. I’ll try to investigate more on my side.

I’m making some progress over here: https://github.com/jetstack/kube-lego/issues/43

My new suspicion is that the initial requests are being sent to staging, but the staging endpoint is returning documents with links to production… hence why the initial request goes to staging, but the certs are retrieved from Prod.

Which is, in itself, interesting. I would’ve expected the later request to Prod to have failed since it was initiated against the Staging endpoint.

I just found the problem. kube-lego caches the user registration which contains the endpoints.

Since I was pointed at Prod, it cached a user registration that contained the Prod endpoints.
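A guard against this kind of stale cache could look roughly like the Python sketch below: compare the host of every cached URL with the host of the configured directory URL and flag the ones that differ. How kube-lego actually stores its cached registration is not shown in this thread, so the function simply takes a list of URLs; the function name and the example paths are illustrative assumptions.

```python
from urllib.parse import urlparse


def stale_cached_urls(directory_url, cached_urls):
    """Return the cached URLs whose host differs from the directory URL's host."""
    expected_host = urlparse(directory_url).hostname
    return [url for url in cached_urls if urlparse(url).hostname != expected_host]


# Example: client configured for staging, cache still holding a Prod URL.
# The paths below are placeholders, not taken from kube-lego.
stale = stale_cached_urls(
    "https://acme-staging.api.letsencrypt.org/directory",
    [
        "https://acme-v01.api.letsencrypt.org/acme/reg/0",
        "https://acme-staging.api.letsencrypt.org/acme/new-authz",
    ],
)
print(stale)
```

A non-empty result would suggest discarding the cached registration and registering again against the configured endpoint.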

Thanks again for the support and patience @pfg. I’ll follow up on kube-lego.
