I was going to wait until Dan Kaminsky announced more details about this flaw at the Black Hat Briefings on August 6th, but Halver Flake’s recent post as essentially squeezed the toothpaste out of the tube on this one. Just look at what Dan has to say.

I’m not going to talk about Dan’s decision not to release the details of this attack as soon as possible or the merits of full disclosure in computer security. Although interesting, it is less interesting to me than the flaw itself.

I know not everyone who reads this blog is technically oriented. To those people, I encourage you to try and make your way through this (long) post. I will try to keep things as simple as possible and I can guarantee you that a better understand of this particular problem will not only give you a better understanding of computer security, but also a better understanding of how the Internet really works.

Let me take a few moments to provide some background. The Domain Name System (DNS) is the protocol that translates a website’s domain name (e.g. somebank.com) into the corresponding IP Address (e.g. 192.168.1.1). IP Addresses are used by routers and network infrastructure to deliver information from one place to another on the Internet. DNS has been around since the mid 1980′s. It is a critical part of the infrastructure of the Internet. When you type in a domain name or use a bookmark to visit your bank’s website, you are trusting that the DNS protocol will take you to the correct server and not to a well-designed phishing website that looks just like your real bank.

The recent flaw in DNS is a protocol-level design flaw, not a software bug. A protocol is merely a pre-defined set of steps done to achieve some objective. For example, when Alice introduces two of her friends, Bob and Chris, to one another for the first time, she would follow a social protocol of introduction. She may introduce Bob to Chris as her co-worker from the Human Resources department, and she may follow this by immediately introducing Chris to Bob as her friend from church. If Alice forgot to introduce Bob to Chris and Bob eventually had to introduce himself to Chris while Alice was standing there, then that failure on Alice’s part is analogous to a failure in a single piece of software. If there were a flaw in this protocol, then every introduction performed based on this social protocol would fail. That is the difference between a protocol-level flaw and a software bug.

Now we have gotten to the crux of the issue. There is a protocol-level flaw in DNS that allows a phisher to take over the actual domain name of the site that it is trying to imitate. This is a serious problem that led to an astonishing collaboration to patch the entire Internet. Even patching the entire Internet isn’t going to “solve” this problem. Why? Because the patches are just that: patches. The problem still exists in the protocol.

What exactly is this problem? (And here’s where I may lose anyone who’s not technically oriented, but I’ll try and keep this simple.) When a DNS server doesn’t know how to translate a domain name into an IP address, it asks another, more trusted, DNS server for the information. Of course, this happens quite frequently since any given DNS server can’t store all the correct DNS translations for the entire Internet all the time (and since these translations can change).

Each time a DNS server has to ask a more trusted DNS server for a domain name to IP address translation, it does so by providing a number called a Query ID (QID). Now, there used to be a ton of attacks based on these QID since they were sequential. This class of attacks basically consisted of an evil doer asking a DNS server to perform a translation on a domain name that it didn’t already have. The evil doer would then start sending forged responses with sequentially increasing QIDs. If the evil doer got the right one, a bad domain name to IP address would be cached. Once a translation is cached, most DNS software implementations will ignore other updates to that domain’s information.

There are many ways to poison a DNS cache. This particular problem was patched (not solved) by just not using sequential QIDs. If a random QID is used, then it becomes very difficult for the evil doer to respond before the real response arrives.

Another interesting way to poison a DNS cache is to send a fake resource record. This attack works because of a chicken-ad-the-egg problem that I deftly avoided in my earlier description of DNS. I said that when a DNS server doesn’t know the proper translation for a domain name, it asks a more trusted DNS server. How? How does it know a more trusted DNS server? Basically, it only knows trusted DNS servers by their domain name. So it has to resolve a domain name for the next step in the hierarchy. Let me give a simple example.

Let’s say you’re a DNS server trying to resolve checking.somebank.com and you don’t know how. Who are you going to ask? Well, you’re going to ask whatever domain name server is controlling somebank.com since somebank.com is the next step in the hierarchy. If you don’t know that one, you’re going to ask the .com root server. Of course, you would like to learn how to ask somebank.com how to resolve all of it’s subdomains (e.g. checking.somebank.com, savings.somebank.com, etc…) since that would be efficient. This is done through a DNS Resource Record (RR).

Although there are many kinds of DNS Resource Records, for this attack all you need to know is that when you make a query for a DNS translation, you can receive back an answer as well as an additional resource record that is intended to help speed up future queries. Now, it used to be possible to poison DNS caches directly with this because there was a flaw in the protocol that allowed these resource records to be totally unrelated to the original request.

For example, let’s say you’re a DNS server and you just sent out a query about checking.somebank.com. It used to be possible that you would receive a domain name to IP address translation for checking.somebank.com and an addition resource record telling you that you should cache ns.evildoer.com as a name server for future queries. This was patched (not “solved”) by requiring the additional resource records be related to the query. (Thus, you would only be able to get a DNS RR for a somebank.com name server.)

The most recent DNS protocol-level flaw is related to both the QID problem and the DNS RR problem. Here’s how I believe it works (and these details are already available to anyone with access to google and a few minutes):

  1. Get a DNS server to look up a subdomain for the site that you want to compromise. For example, randomAAAAAA.somebank.com. The subdomain itself doesn’t really matter other than it shouldn’t exist.
  2. Since the DNS server doesn’t have this domain name to IP address translation it will have to look up the answer. Now, the evil doer can’t reliably predict the QID since random QIDs are used. The vast majority of these lookups will correctly be answered by ns.somebank.com as non-existent subdomains with the right QID. However, the evil doer can still try and race ns.somebank.com to guess an answer.
  3. The evil doer repeats step 2 and increments the random domain name every time. For example, the next domain name the evil doer might try could be randomAAAAAB.somebank.com. Since QIDs are just randomized and not cryptographically secure, the attacker may still have a mathematically reasonable chance at eventually guessing correctly and beating the real name server’s response. If that happens, then the real name server’s response is dropped and more importantly the attacker has earned the right to send a DNS Resource Record updating the name server for the bank. (i.e. The attacker gets to poison ns.somebank.com and make it point to their phishing site.)

It’s clever. It’s not easy to solve, so we’re going to play the patching game again and people are rushing to patch their DNS servers. Now, this post is not going to talk about the losing battle that is penetrate-and-patch. Although it would be fun to rant, that debate is no longer interesting since all the smart people are on the same team.

So why is the flaw (and perhaps computer security on the whole) interesting? The assumptions involved. Professor Spafford has a great quote about computer security and assumptions:

Finding vulnerabilities is simple; discover the assumptions a developer made, ad then violate those assumptions.

People have become accustomed to DNS working. They assume it will work. It’s not just users, but also developers that do this. Let’s take one example: OpenID.

For those who don’t know, OpenID is an identity system that enables users to store their identity information in one place. Instead of having usernames, passwords, addresses, and other account information stored separately at amazon.com, ebay.com, flickr.com, etc…, users would be able to store it (and update it) all in one place. It’s a really neat idea that could eventually provide useful services and save real people time. However, it was designed with the assumption that DNS just worked.

Kim Cameron points this out on his blog, but I think the best summary of the problem is by Tim Anderson:

Note that Cameron is not opposed to OpenID. Apart from anything else, he recognizes that this may well be the beginning of an identity revolution – part of a process, at the end of which we get a safer, less spam laden, less criminal-infested internet.

At the same time, he’s right. The whole OpenID structure hinges on the URL routing to the correct machine on the Internet. In other words, DNS. Now do some research on DNS poisoning. Scary.

Now, it strikes me that you can largely fix this by requiring SSL connections. In other words, have the OpenID URL be an https:// URL, and have the relying party (the website where you want to log in) check for a valid SSL certificate. Note thought that SSL must be used at every stage. OpenID lets you use your own URL as the identifier, but redirect to another OpenID identity provider. Both URLs must use SSL to maintain integrity.

Scary indeed. The OpenID developers have assumed reliable DNS. Now, Tim’s probably right that encryption is the solution to this problem, but I don’t think SSL would work. Even if there is a certificate for the site, most browsers fail to properly inform users what it means when an SSL certificate has changed or isn’t there now. Plus, people are trained to use the domain name and trust that it works.

So how can encryption help? Well, I think DNSSEC and IPSEC (or IPv6) would actually solve (not patch) the problem, but designing better protocols hasn’t been the real issue. DNSSEC and IPSEC have been around for a while. The problem is adoption. No one uses these protocols just like no one uses PGP for encrypting their email.

Metcalfe’s Law is holding most people back since they don’t want to be the only ones using the “other” network. This is another great example of why “road” or “highway” analogies don’t work for the Internet. If this were a pothole or even a collapsed bridge, we could fix the problem properly without really affecting most people. However, since this is the Internet, we can’t actually solve this unless everyone agrees to stop using DNS.

So we’re going to continue to see problems with old infrastructure protocols like DNS. As a result, phishing will continue to be a serious problem. The only way this will stop is if there is a problem so big that the monetary incentive to avoid the problem pushes everyone to change. Who wants to guess how big of a problem that would have to be?