Using BGP to Acquire Bogus TLS Certificates

Folks here will probably be interested in this paper recently presented at HotPETS: Using BGP to Acquire Bogus TLS Certificates. Essentially, BGP hijacking can be used to fool most domain validation processes, allowing an attacker to obtain a certificate for a domain they don’t control from nearly any CA that does Domain Validation. This weakness in the WebPKI has been known for a long time, but the authors documented and demonstrated it in particular detail, and presented some interesting countermeasures.

They were kind enough to show us an early draft of their paper, which kicked us into higher gear on implementing the multi-viewpoint validation feature we had been intending to write for a long time. Some code has landed in Boulder, and we’re working on the operational deployment side of things. We’ve also looked at the proposed route-age heuristic from the paper but think it’s not ready for implementation. Specifically, you would have to apply this heuristic not only to the IP address of the server in HTTP or TLS-SNI validation, but to the IP addresses of all nameservers contacted during validation. I’m assuming that including those IP addresses would increase the false positive rate unacceptably, but we don’t yet have data on that.
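
For intuition, here is a minimal sketch of the multi-perspective idea. This is illustrative only: the structure and names are made up, and it is not the actual Boulder code or its design. The point is just the agreement requirement; in a real deployment each "client" would egress from a different AS and geography.

```go
// Toy sketch of multi-perspective domain validation: run the same HTTP-01
// style check from several vantage points and require them all to agree
// before treating the challenge as satisfied. Not Boulder's actual code.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// checkHTTP01 fetches the well-known challenge path for domain and verifies
// that the body matches the expected key authorization.
func checkHTTP01(client *http.Client, domain, token, expected string) error {
	url := fmt.Sprintf("http://%s/.well-known/acme-challenge/%s", domain, token)
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if strings.TrimSpace(string(body)) != expected {
		return fmt.Errorf("unexpected challenge response from %s", domain)
	}
	return nil
}

func main() {
	// In a real deployment each client would sit at a different network
	// vantage point (different AS / geography); here they are placeholders.
	vantagePoints := map[string]*http.Client{
		"primary":   {Timeout: 10 * time.Second},
		"remote-eu": {Timeout: 10 * time.Second},
		"remote-us": {Timeout: 10 * time.Second},
	}

	agreed := 0
	for name, client := range vantagePoints {
		if err := checkHTTP01(client, "example.com", "token123", "token123.keyauth"); err != nil {
			fmt.Printf("%s: validation failed: %v\n", name, err)
			continue
		}
		agreed++
	}
	// Require every vantage point to agree before issuance; a BGP hijack
	// localized near one vantage point then fails the whole validation.
	if agreed == len(vantagePoints) {
		fmt.Println("all vantage points agree; validation passes")
	} else {
		fmt.Println("disagreement between vantage points; refusing to validate")
	}
}
```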

There is one line of defense against this style of attack: Let’s Encrypt and other CAs have a list of high-risk domains. For Let’s Encrypt, that includes many of the most popular online services. Boulder will refuse to issue for those domains because the risk of issuing for them based on a BGP hijack (or DNS poisoning, or registry hack, or MitM attack) is too high. Unfortunately this approach doesn’t scale to the whole Internet, so we’re interested in finding ways to increase the robustness of Domain Validation in general.
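
For illustration, the high-risk check is conceptually just a lookup against a curated list before validation is even attempted. The names and list below are invented for this sketch and are not Boulder’s actual policy code.

```go
// Conceptual sketch of a high-risk domain check; names and structure are
// invented for illustration and do not reflect Boulder's actual policy code.
package policy

import (
	"fmt"
	"strings"
)

// highRiskDomains would in practice be a curated, regularly reviewed list.
var highRiskDomains = map[string]bool{
	"example-bank.com": true,
	"popular-mail.com": true,
}

// CheckHighRisk refuses issuance for a name whose registered domain appears
// on the high-risk list, regardless of whether validation would succeed.
func CheckHighRisk(name string) error {
	labels := strings.Split(strings.ToLower(strings.TrimSuffix(name, ".")), ".")
	// Naive registered-domain extraction; real code would consult the Public
	// Suffix List rather than assuming a two-label registered domain.
	if len(labels) >= 2 {
		registered := strings.Join(labels[len(labels)-2:], ".")
		if highRiskDomains[registered] {
			return fmt.Errorf("issuance for %q is blocked: high-risk domain", name)
		}
	}
	return nil
}
```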

8 Likes

That attack was also mentioned in https://groups.google.com/d/msg/mozilla.dev.security.policy/ydxiw3S3gSw/qVCWNlzYBgAJ from the “Regarding CA requirements as to technical infrastructure utilized in automated domain validations, etc. (if any)” thread on the mozilla.dev.security.policy list.

1 Like

One thing that I think could help a lot would be involving the domain registrars in the process. They are regarded as the source of ground truth with respect to domain control for almost all purposes, so if they would directly assert it or deny it with respect to particular certificate applicants, we should be no worse off than we are today. And there are fewer of them, and it might be possible to get much stronger cryptographic validation of their identities.

Presumably the DNS challenge using records hosted with a registrar supporting DNSSEC (and CAA records) is immune to BGP IP hijacking and DNS poisoning attacks?

Yes, but there are two problems: First, the attacker gets to choose what challenges to use*. Second, many of the domains we would most like to protect don’t implement DNSSEC.

*There is work underway at the ACME WG to extend CAA to express limits on validation methods, but it’s not ready yet.

2 Likes

Full disclosure: I’m the guy who started the “Regarding CA requirements as to technical infrastructure utilized in automated domain validations, etc. (if any)” thread on moz.dev.security.policy. I am not related to the paper or presentation that jsha referenced, though I have started a dialogue with the lead author.

I am competent to speak to many of the vulnerabilities and possible mitigations in the general scope and nature discussed in said paper.

I believe that the most essential defense against this kind of attack is to have multiple vantage points, each attached at its own single point of interconnection and each in a distinct physical geography, with diversity of internet transit wherever possible.

What I am not competent to speak on is the physical / virtual / network security architecture of CA infrastructure. What I’ll say from this point forward is the union of what I’ve read and some assumptions I’ve made. Please correct any misconceptions I might have:

I presume that the primary Validation Agent is a distinct physical element which has limited privilege to communicate back and forth with the policy engine of the CA such that the policy engine can request certain validations be tested and that the validation agent can report back the results of that test.

I would assume that the primary validation agent is physically colocated with the other critical CA infrastructure and likely sits in a distinct firewall zone.

Presumably, having far-flung validation agents can be really, really cheap if they can be implemented as inexpensive VMs running on commodity infrastructure without significant security requirements.

Is it possible to construe the overall test result such that the primary validation agent must say yes, and then (and only if the primary says yes) the far-flung secondary validation agents form a quorum whose only job is to register an objection to the validation and stop issuance? If so, would that permit you to effectively define these remote agents as minimalist VMs (not even a shell interface) that boot up, phone home, and start taking jobs to test, returning results to be collated, all without having to provide specialized persistent environments and persistent security guarantees for these secondary validation agents? It seems to me that if it is at all possible to eliminate the special operational environment needs of the secondary validators, this becomes a lot more practical and economical to deploy.
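
To make that concrete, here is a rough sketch of what such a minimalist secondary agent could look like. The coordinator URL, endpoint paths, and job format are all invented for illustration and don’t reflect any CA’s actual internals.

```go
// Rough sketch of a minimalist secondary validation agent: boot, phone home,
// poll for validation jobs, perform the check from this vantage point, and
// report back. The coordinator URL, endpoints, and job format are invented.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

type job struct {
	ID       string `json:"id"`
	Domain   string `json:"domain"`
	Token    string `json:"token"`
	Expected string `json:"expected"`
}

type result struct {
	ID string `json:"id"`
	OK bool   `json:"ok"`
}

func main() {
	const home = "https://validation-coordinator.example/api" // hypothetical
	client := &http.Client{Timeout: 10 * time.Second}

	for {
		// Phone home and ask for work.
		resp, err := client.Get(home + "/next-job")
		if err != nil {
			time.Sleep(5 * time.Second)
			continue
		}
		var j job
		decodeErr := json.NewDecoder(resp.Body).Decode(&j)
		resp.Body.Close()
		if decodeErr != nil {
			time.Sleep(5 * time.Second)
			continue
		}

		// Perform the HTTP-01 style check from this vantage point.
		ok := false
		if r, err := client.Get(fmt.Sprintf(
			"http://%s/.well-known/acme-challenge/%s", j.Domain, j.Token)); err == nil {
			body, _ := io.ReadAll(r.Body)
			r.Body.Close()
			ok = strings.TrimSpace(string(body)) == j.Expected
		}

		// Report back. The coordinator treats a mismatch as an objection that
		// vetoes issuance, even if the primary validation agent said yes.
		buf, _ := json.Marshal(result{ID: j.ID, OK: ok})
		if postResp, err := client.Post(home+"/result", "application/json",
			bytes.NewReader(buf)); err == nil {
			postResp.Body.Close()
		}
	}
}
```

The point being that the agent holds no long-term secrets and keeps no state, so compromising one of these VMs gains an attacker nothing beyond the ability to lie about a single vantage point.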

I have some comments on others’ posts on this thread:

I believe that the domain registry and the domain registrars are of interest, but that this is primarily a matter between the party holding the domain and their registrar. There appears to be good security, in general, at the registry level, so the user’s choice of registrar and the configuration of their account there let them select an appropriate level of registration security for their domain.

DNSSEC and CAA together have great potential for eliminating the hijack vulnerability, at least as to the DNS challenge, but I think it is improbable that CAA alone will help.

It is far more likely that a party wishing to maliciously secure a certificate will hijack the authoritative DNS server addresses and elect DNS validation rather than HTTP or TLS-SNI validation. I make this assertion because so many websites today are hosted on CDN farms, where I might not know which IP address my target CA will get back when it resolves the web server’s name. It’s more likely that I’ll have a smaller set of IP space to hijack if I intervene at the DNS level.

As a result, if CAA alone, without DNSSEC, is used to set issuance criteria, it’s kind of pointless as a defense against the hijack attack. The attacker’s responder on the hijacked DNS server IP will present a rosier CAA picture, if any at all. You could catch that with DNSSEC, but not without it.
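
To illustrate what "CAA with DNSSEC" would mean mechanically, here is a sketch of a CAA lookup that only trusts an answer when a validating resolver sets the AD bit. This assumes the github.com/miekg/dns Go package and a local validating resolver; it is obviously not any CA’s production code.

```go
// Sketch of a CAA lookup that only trusts the answer if a validating resolver
// set the AD (authenticated data) bit, i.e. the zone is DNSSEC-signed and the
// response validated. Assumes the github.com/miekg/dns package and a local
// validating resolver at 127.0.0.1:53; illustrative only.
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func lookupCAA(domain, resolver string) ([]*dns.CAA, bool, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(domain), dns.TypeCAA)
	m.SetEdns0(4096, true) // request DNSSEC records (DO bit)
	m.AuthenticatedData = true

	c := new(dns.Client)
	resp, _, err := c.Exchange(m, resolver)
	if err != nil {
		return nil, false, err
	}

	var records []*dns.CAA
	for _, rr := range resp.Answer {
		if caa, ok := rr.(*dns.CAA); ok {
			records = append(records, caa)
		}
	}
	// resp.AuthenticatedData is only meaningful if the resolver validates
	// DNSSEC; without it the CAA answer could come from a hijacked nameserver.
	return records, resp.AuthenticatedData, nil
}

func main() {
	records, authenticated, err := lookupCAA("example.com", "127.0.0.1:53")
	if err != nil {
		log.Fatal(err)
	}
	if !authenticated {
		fmt.Println("CAA answer not DNSSEC-validated; treat with suspicion")
	}
	for _, caa := range records {
		fmt.Printf("CAA %d %s %q\n", caa.Flag, caa.Tag, caa.Value)
	}
}
```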

I’ve not yet read in detail the proposed mechanism for the route-age heuristic, but my mind immediately comes to several challenges that I hope their proposal addresses:

  1. You can’t use just any view of the global BGP routing table to reference for the age of the advertised prefix. This is because a good attacker will work to ensure that the hijacked prefix is only announced as close to the target CA’s network as possible. If the scope of the hijack can be contained, it will break far fewer things and greatly increase the odds that the hijack is never noticed.
  2. As a consequence of concern #1, a useful view would mean that the validation agent is able to access the live BGP view of either the CA’s routing infrastructure (do CAs even generally BGP peer?) or, if the CA runs a non-BGP routing environment, a gratuitous BGP session with the CA’s upstream provider at the same point of interconnection as the CA’s actual upstream internet link. Is it practical to bring such a feed into a regulated environment like a CA and rely upon it for decisions (at least, that is, for pushing a decision to the negative) on issuance?
  3. Have you ever watched a full routes stream live? Route advertisements in some instances are quite ephemeral. Some service providers dynamically rebalance traffic during the course of the day through modifying some of their advertisements. As a result, other ISPs across the net suddenly see a different best-path route to the same IP, even though connectivity was effectively continuous. I’m not sure that enough work has been done to link stability of a given prefix advertisement in one view of the global table to any particular security posture.

I reiterate again that the three points I’ve just made are naive as to the specific technique mentioned in the paper, as I’ve not yet read that section in detail.

Thanks,

Matt

1 Like

Hi Matt,

I am Henry Birge-Lee, the lead author of the paper; Matt and I have been exchanging emails.

I would like to briefly address the status of the route-age heuristic. It is not currently ready for implementation, and the HotPETS paper gives a very cursory overview of the heuristic, omitting some key details. In that short write-up we decided to focus more on attacks, because they help increase the demand for our research and are much more accessible to the broader community. I have written unpublished reports to my advisors that contain significantly more detail about this, along with the results of our preliminary false positive evaluations.

As for the points you bring up:

  1. We understand that vantage point selection is an essential part of the implementation of this heuristic. Our reference implementation so far has used a single AS vantage point via the CAIDA BGPStream software (which happens to have a 15 min delay on all the updates). This obviously must be changed before even being considered for implementation. Possible vantage point sources include new CAIDA vantage points that have around a 1 min delay, the BGPMon.io feed (which I have not looked at in depth, but which appears to be much closer to real time), and ideally real peering with a group of ASes. We are also considering vantage point location. We are assuming an adversarial model that can propagate an announcement to any subset of ASes it would like, so we can assume an adversary will launch an announcement that will not be seen at any vantage point. This seems hopeless, but the upside is that every vantage point network an adversary hides the announcement from is a network that will route packets to the original prefix owner (the victim). Thus, the internet can be thought of as having select ASes that will make the wrong (from an adversarial viewpoint) routing decision. If these ASes are large backbone networks like Cogent and Hurricane, the packets will likely not reach their destination and will get stuck in forwarding loops or get dropped. Thus, having vantage points (especially if the providers for ViaWest are included as vantage points) does help.

  2. Addressed in point 1.

  3. Yes, I have looked at live BGP updates and you are very right that they are turbulent. This is why we never very seriously considered the route-age heuristic in that simple form; updates are just too frequent. The hop-age heuristic I described in my email to you is a simplified explanation of our ongoing proposal. By looking at the age of individual hops (ASNs in the path) and at how long any route to that exact prefix has been in the routing tables, we can eliminate the effects of such load balancing. Even if the prefix owner is load balancing between two different providers, the final hop (the owner) and the age of any route to the prefix should still be very old. I have heard of load balancing at the prefix level (network operators will announce a subprefix as a form of load balancing), but this should be much less common than the path-based load balancing you are referring to.

As I mention in the email, we have been doing research to look at the false positive rate we expect. The results so far, when looked at from a hop-age point of view as opposed to route age, are fairly promising. It is also important to keep in mind that this is a completely tunable heuristic: we can change the time thresholds and the number of hops in the route that we consider so that the heuristic has minimal impact on normal issuance.
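
To make the shape of the check concrete, here is a toy sketch of the hop-age idea in code. It is a simplification for discussion purposes, with made-up thresholds, not our reference implementation.

```go
// Toy sketch of the hop-age idea: for the route currently used to reach the
// prefix, ask how long each hop (ASN) has appeared in *any* route for that
// exact prefix, and how long any route to the prefix has existed at all.
// A simplification for discussion, not the reference implementation.
package main

import (
	"fmt"
	"time"
)

// prefixHistory records when a prefix and each ASN were first observed.
type prefixHistory struct {
	firstSeen    time.Time            // first time any route to the prefix was seen
	asnFirstSeen map[uint32]time.Time // ASN -> first time it appeared in a path
}

// hopAgeOK checks the last `hops` ASNs of the current path (ending at the
// origin AS) against minimum-age thresholds. Both thresholds are tunable.
func hopAgeOK(h prefixHistory, currentPath []uint32, hops int,
	minHopAge, minPrefixAge time.Duration, now time.Time) bool {

	if now.Sub(h.firstSeen) < minPrefixAge {
		return false // the prefix itself only recently appeared
	}
	start := len(currentPath) - hops
	if start < 0 {
		start = 0
	}
	for _, asn := range currentPath[start:] {
		first, seen := h.asnFirstSeen[asn]
		if !seen || now.Sub(first) < minHopAge {
			return false // a hop near the origin is too new: possible hijack
		}
	}
	return true
}

func main() {
	now := time.Now()
	h := prefixHistory{
		firstSeen: now.Add(-90 * 24 * time.Hour),
		asnFirstSeen: map[uint32]time.Time{
			3356:  now.Add(-90 * 24 * time.Hour), // long-standing transit hop
			64501: now.Add(-90 * 24 * time.Hour), // long-standing origin AS
		},
	}
	// A path whose origin AS (64666) has never been seen for this prefix
	// should fail the check even if upstream hops are old.
	fmt.Println("legit path ok:",
		hopAgeOK(h, []uint32{3356, 64501}, 2, 24*time.Hour, 24*time.Hour, now))
	fmt.Println("hijack path ok:",
		hopAgeOK(h, []uint32{3356, 64666}, 2, 24*time.Hour, 24*time.Hour, now))
}
```

The thresholds and the number of trailing hops examined here are exactly the tunable knobs mentioned above.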

Also, this idea of not trusting new routes is not completely new. Jen Rexford worked on Pretty Good BGP (http://www.cs.princeton.edu/~jrex/papers/pgbgp.pdf), which looked at a similar idea of not using new routes, but at a router level rather than an end-host level. I understand that this work is cited as having flaws, including preventing legitimate new routes from owners from propagating (like those used to win traffic back from hijacks), but I would argue that CAs should accept some false positives to prevent routing attacks.

In addition, these countermeasures are not intended to operate individually or in a vacuum. The hop-age heuristic is intended to force an attacker’s announcement to be long lived, and the vantage points are intended to force it to be global. With these two requirements in place, we believe that network operators should be able to detect these attacks before certificates are issued.

Keep in mind that multiple vantage points will not do anything to prevent short-lived global attacks in the middle of the night. I do not believe it is reasonable to expect network operators to notice a 5 min long BGP event that happens while they are asleep, even if it is global. There really needs to be a time requirement of some sort. Making the issuance process take 24 hours and having multiple checks of the verification document at random times might be a better solution, but if we want to keep certificate issuance nearly instantaneous, historical BGP data is the next best option.
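
For example, here is a purely illustrative sketch of spreading checks of the verification document over a 24 hour window at random times, so that a short-lived hijack is unlikely to cover all of them.

```go
// Sketch of spreading several validation checks of the same challenge
// document over a 24 hour window at random times. Purely illustrative.
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

func main() {
	const window = 24 * time.Hour
	const checks = 5

	// Pick random offsets within the window and sort them.
	offsets := make([]time.Duration, checks)
	for i := range offsets {
		offsets[i] = time.Duration(rand.Int63n(int64(window)))
	}
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })

	start := time.Now()
	for i, off := range offsets {
		// In a real system each check would re-fetch the challenge document;
		// issuance would proceed only if every check succeeds.
		fmt.Printf("check %d scheduled at %s\n", i+1, start.Add(off).Format(time.RFC3339))
	}
}
```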

Best,
Henry

1 Like

Hi, Henry!

I’m excited about the hop-age heuristic and think it has real promise. I’m looking forward to examining the scenario as you described in the email. I think I may also be able to propose some useful refinements. If so, I hope you’ll run them against your test data set and see what differences might be exposed.

I agree that your approach of using the hop age to let a host validate network path stability is an entirely different perspective from using the mechanism to make instantaneous routing decisions, and as such I would regard this use case as altogether different from those which brought about criticism of “Pretty Good BGP”.

Intuitively, I believe the combination of multiple vantage points and the hop-age heuristic could substantially improve domain validation accuracy and quality in the face of IP hijack risks. Has thought been given yet to how one might objectively quantify the security improvement that derives from deploying this combination? I have a suspicion that a reasonable answer to that question will go far in helping to standardize a proposed solution.

I look forward to further discussion.

Matt

Hi Matt,

I look forward to future work. Ideally we will run them against our test data and compare the trade-offs of each one. Feel free to send us some ideas about algorithms and we can take a look at them.

I also wanted to mention ways we intend to quantify security improvements. There are three metrics we intend to use:

  1. Number of possible adversaries. This mostly applies to equally specific and stealthy attacks. We aim to measure the set of ASes that can perform a given attack type and show that it will decrease with security measures.

  2. Visibility of attack. What is the minimum number of ASes that will have to see an attack for it to fool a CA?

  3. Duration of an attack. How long must an attack be active before the adversary can get a certificate?

Multiple vantage points do a lot for the first two metrics (and we are working on research to demonstrate this), but in the current CA infrastructure attacks can be very short lived.
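
As a toy illustration of the second metric (my own sketch, not our actual evaluation methodology): with multiple vantage points, an equally specific, stealthy hijack only fools the CA if the bogus announcement reaches every vantage AS, which raises the minimum visibility of the attack.

```go
// Trivial simulation sketch for the "visibility of attack" metric: a stealthy
// hijack only fools the CA if every one of the CA's vantage ASes ends up
// routing toward the attacker. Illustrative only; ASNs are placeholders.
package main

import "fmt"

func foolsCA(vantageASes []uint32, seesBogusRoute map[uint32]bool) bool {
	for _, as := range vantageASes {
		if !seesBogusRoute[as] {
			// This vantage point still reaches the legitimate origin, so the
			// multi-perspective validation disagrees and issuance is refused.
			return false
		}
	}
	return true
}

func main() {
	vantage := []uint32{174, 3356, 2914} // hypothetical vantage ASes
	// Attacker limits propagation of the bogus announcement; only AS 174 sees it.
	localized := map[uint32]bool{174: true}
	fmt.Println("localized hijack fools CA:", foolsCA(vantage, localized))
	// To succeed, the announcement must reach all three vantage ASes, which
	// raises the minimum visibility of the attack.
	global := map[uint32]bool{174: true, 3356: true, 2914: true}
	fmt.Println("global hijack fools CA:", foolsCA(vantage, global))
}
```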

Best,
Henry

As an update, we just announced the deployment of this feature to the staging environment.

Belatedly, I would like to offer another word of caution about the wisdom of heuristic approaches in this domain.

It will nearly always be essential to compare false positive rates to true positive rates rather than considering them in absolute terms. This is the downfall of many otherwise interesting-looking ideas for airport security and medical diagnosis, where true positive rates are often incredibly low. If the true rate of domain validation fraud is, say, 1 per 4 million, then even a false positive rate of 1 in a thousand is terribly painful. And ironically, if the heuristic works and true positive rates fall, the false positives become even harder to justify.
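
To put rough numbers on those illustrative rates (a quick back-of-the-envelope sketch using the figures above):

```go
// Rough base-rate arithmetic using the illustrative numbers above: even a
// small false positive rate swamps a very rare true positive.
package main

import "fmt"

func main() {
	const validations = 4_000_000.0
	const trueFraudRate = 1.0 / 4_000_000.0 // ~1 fraudulent validation attempt
	const falsePositiveRate = 1.0 / 1_000.0

	truePositives := validations * trueFraudRate      // ≈ 1
	falsePositives := validations * falsePositiveRate // ≈ 4000

	fmt.Printf("expected true positives:  %.0f\n", truePositives)
	fmt.Printf("expected false positives: %.0f\n", falsePositives)
	fmt.Printf("precision: %.4f%%\n", 100*truePositives/(truePositives+falsePositives))
}
```

That is roughly 4000 blocked legitimate validations for every fraudulent one caught.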

Beyond that, a common problem is that false positives are a burden for some specific minority, who will be aggrieved even if the system is not deliberately targeting them. For example, we could imagine that Ethiopia might suffer network connectivity problems that look untrustworthy to the heuristic verifier; the effect, though not intended, is to deny service overwhelmingly to poor Africans. Effort spent justifying a technical measure which has such consequences is usually throwing good money after bad.

I applaud Let’s Encrypt’s approach so far. I don’t want to shoot down interesting research, but I think the route heuristic idea won’t turn out to be practical as an element of automated validation for the reasons described above.

1 Like

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.