Request to unblock IP address


#1

Our IP address, 35.185.20.115, has been flagged for ‘ridiculously excessive traffic’. I have restarted the server that handles certificate requests (to kill any long-running repeated requests), and we are investigating the root cause of why the server issued so many requests. Could I have some clarity on what the criteria for being unblocked are?


#2

Hi @bytesighs,

Can you provide more information about your ACME client? Are you using cert-manager or kube-lego? What version?


#3

We are using golang.org/x/crypto/acme/autocert directly – updated just after they added support for http-01 (after tls-sni-01 was disabled), commit 13931e2.


#4

Since I noticed several improvements to golang.org/x/crypto/acme and golang.org/x/crypto/acme/autocert, I have updated the server to the most recent master. In particular, some of these changes appear to be in response to your own interactions over cert-manager/kube-lego.


#5

Excellent, thanks @bytesighs! I’ll send a ticket to our Ops team to unblock.


#6

Thank you both for your assistance. I’ll avoid tagging you, as I’m sure you already get spammed enough (cpu, jsha).


#7

A post was split to a new topic: Request to unblock IP address (upgraded cert-manager)


#8

Hi @bytesighs,

A few followups:

  • It looks like your account still doesn’t have an email address associated with it. Could you please add one? Before this recent batch of blocked IP addresses we reached out via email. Adding a deliverable email address will help us reach you in the future and avoid needing to block your IP.
  • Your IP address is still sending traffic with a User-Agent header of “Go-http-client/1.1” instead of a more informative one. Older versions of golang.org/x/crypto/acme had this problem. I believe it’s fixed in newer versions. Perhaps your upgrade was unsuccessful?
  • I am still seeing fairly high traffic levels from your IP address, similar to what we saw before the block. In particular I’m noticing that failing domains seem to be retried 2-4 times per hour, with no backoff, and that sometimes the same domain is requested multiple times in one pass. I would recommend adding detailed logging to your client and looking for such duplicate and rapid-retry cases. Out of curiosity, is your client open source?

Since you’re working on the issue, we’ll leave your IP address unblocked. Definitely appreciate your attention to this.

Thanks,
Jacob


#9
  1. I’ll get an email added immediately; we just aren’t setting the field (not sure why). If you can let me know whether you now see an email associated with the account, that’d be great (it should be techops@prestigedigital.com).
  2. The HTTP client is just the Go standard library default, as used from acme.Client. I think it is plausible to modify the User-Agent by implementing a custom http.RoundTripper for the http.Client – is that what you mean?
  3. The client is https://golang.org/x/crypto/acme/autocert. The system around it works as follows: when our edge servers receive an HTTPS request for a host they do not have a TLS certificate for, they request it from the server running autocert. If that server doesn’t have a locally-managed certificate for the host, it hands off to autocert. Hosts are added to an ‘unconfigured’ pool by users in our application, and then moved to the pool that autocert.Manager.HostPolicy checks when we receive an HTTPS request.

#10

Yep, I see the email address now. Thanks!

I was under the impression that golang.org/x/crypto/acme had released a change so that acme.Client would use a more informative User-Agent by default. I filed an issue and saw that there was a changelist up, but it appears not to have been merged. I’ll ping the maintainers.

In general I recommend against this “issue-on-handshake” style of deployment. As the autocert docs say:

In practice we’ve noticed that clients implementing issue-on-handshake tend to run into a lot of problems with sending unnecessary requests. Consider the case where issuance is a bit slow. One handshake comes in for example.com. While issuance is still ongoing, a second handshake comes in for example.com. Does your system correctly handle this by ensuring there is only ever one inflight issuance request for example.com? Does it handle backoff appropriately? A system that fails to handle these tricky edge cases will wind up consistently over-requesting to no purpose, and filling up rate limits.

Additionally, an issue-on-handshake system has trouble handling renewals. You can design it to perpetually renew every certificate it has ever issued, but domains do expire. They also move to other hosting services. When that happens, how do you decide to give up on trying to renew the domain? Relatedly, how many times can someone attempt a TLS handshake for a domain that fails issuance before you give up on trying to issue for it?

Do you have a list of all sites you serve? Would it be possible to redesign around issuing only for those sites rather than issue-on-handshake? I think that would probably be more robust and make more efficient use of Let’s Encrypt’s resources. But I’m definitely interested to hear more about your use case and see if there’s some aspect I haven’t thought about yet!

Thanks,
Jacob


#11

Thank you, Jacob, you’ve added a lot to consider. I’ll try to address those points I think I have answers to below.

Does your system correctly handle [multiple requests that cause issuance] by ensuring there is only ever one inflight issuance request for example.com? Does it handle backoff appropriately?

We do not handle this directly, but I believe the internal state management in autocert should cover duplicated requests. The backoff on retry comes from acme.Client.post (in http.go in package golang.org/x/crypto/acme).

Additionally, an issue-on-handshake system has trouble handling renewals. You can design it to perpetually renew every certificate it has ever issued, but domains do expire. They also move to other hosting services. When that happens, how do you decide to give up on trying to renew the domain? Relatedly, how many times can someone attempt a TLS handshake for a domain that fails issuance before you give up on trying to issue for it?

When issuance fails, the domain is moved back to the unconfigured state, but any request to the system marks it as configured again (as it is treated as routable). Beyond the recent change of moving failing hosts back out of the configured list, no management of the configured list (the one used for the HostPolicy) is done. I think managing this list more actively/aggressively is a worthwhile improvement.

EDIT: Also, after a cursory review of the renewal code in golang.org/x/crypto/acme/autocert, it does not appear to consult autocert.Manager.HostPolicy to decide whether certificates should still be issued for a domain – but this only matters for renewals triggered by the timer, not when loading an expired certificate.

Do you have a list of all sites you serve? Would it be possible to redesign around issuing only for those sites rather than issue-on-handshake?

The HostPolicy is a list of the sites we serve, insofar as it reflects user intent. Practically, the list of sites (as it stands) can be inaccurate for all the reasons you mentioned (users change DNS, domains expire, etc.). But I am worried I have not communicated clearly: issue-on-handshake applies only to sites that have been registered by our users. The handshake was treated as a signal that the site is actually routable, though I can see the flaws you point out in that decision. When our users add a site it is marked as not configured. When we receive the first request for the certificate, we mark that site as configured (meaning, routed to us) and hand off to autocert.

In case it will be of assistance, I will try to describe the process/use case.

We provide a service for our users to create a specific niche of website. As part of that offering, the user can configure ‘custom’ domains – either via letting us manage their DNS, a CNAME record, or an A record. When the host in question resolves to our edge servers, we allow them to ‘enable SSL’.

Enabling SSL involves being added to the ‘not configured’ list of domains on the cert server. When we receive the first request for a certificate for the host, we move it to the ‘configured’ list of domains, and handoff to autocert. The ‘configured’ list is what the host policy in autocert is checking against.

If autocert has an error when actively called (either for a certificate, or for a status check), we move the host from ‘configured’ back to ‘not configured’. The only requirement to move back in the other direction is for a certificate request to come in from the edges (i.e., somebody asked the edge for the HTTPS version of the site). This is, however, the only management of the list that occurs after initial issuance.

Thank you for your help with this, we really want to be a good citizen – we couldn’t do what we do without Let’s Encrypt!


#12

Just a note that I have disabled issuing new certificates, and we are serving only from our local cache of previously issued certificates. This means there shouldn’t be any requests as of now – though the volume before I disabled the service was high enough that it may have triggered another block. As of this note, the only requests autocert is responding to are Let’s Encrypt http-01 challenges – which it will not have tokens for – and the certificate used for the server itself (a separate instance of autocert).


#13

Sounds good, and this is also what I hear from the autocert folks. I saw a bunch of new-authz requests for the same domain close together (1-2 seconds), but that doesn’t necessarily mean concurrent requests. It could be that a request started and failed within 1 second, and then another request was triggered shortly after. Sounds like autocert could use some additional state about failed names.

Ah, great! That makes me a lot less nervous.

This all sounds fairly reasonable. It sounds like the main issues are around what happens with failing domains: Since no state is remembered, those domains can get retried arbitrarily fast. It seems like one solution might be to keep such domains “configured,” with a “number of failures” field and “last failure” time field. That would allow you to implement backoff and eventually giving up past a certain number of failures. This would presumably also be a welcome change in autocert if you are up for it!

Thanks again for working on the issue.

Jacob


#14

Since no state is remembered, those domains can get retried arbitrarily fast. It seems like one solution might be to keep such domains “configured,” with a “number of failures” field and “last failure” time field. That would allow you to implement backoff and eventually giving up past a certain number of failures.

I have implemented essentially this scheme (with a fairly conservative backoff). Thank you so much for all your assistance with this issue; your input was invaluable. I hope I have adequately implemented the measures you described, and that we will now be well-behaved! At least we now have the email set up on the account if there are any further issues…

This would presumably also be a welcome change in autocert if you are up for it!

My employer offered me paid work time to do this, but I am not sure I am quite proficient enough with Go to actually bite the bullet. I shall have to revisit this with a bit more experience under my belt! I would certainly hope to be able to help out in the future.


#15

The traffic pattern is looking a lot better. I’m not seeing significant numbers of duplicate requests in the last 24 hours. I do note seven requests for 35.185.3.114. In terms of current traffic this is harmless, but it might indicate a bug you’d want to check out, so I figured I’d let you know.

Thanks so much for working on this and fixing the issues!


#16

Thanks for the heads up; it is appreciated. Out of curiosity, I investigated where the IP address request came from, and it seems another dev here decided to add it for reasons that remain known only unto himself (that is, he’d put it in the configured set). I’ve added checks for the next deployment to stop this insanity and prevent the bug from reappearing if/when the issuance and routing failure lists are reset. Fortunately, 35.185.3.114 has already crossed the threshold past which we won’t retry it without manual intervention.

Also, a big thank you that I forgot to give the other day: your failure-lists concept has also allowed me to simplify the code that was juggling configured and ‘unconfigured’ hosts, by adding another dimension to the lists for routing failures (these are checked before issuance is attempted).

I cannot thank you enough for your assistance here. I can only hope you put significant value on post likes!


#17

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.