In Praise of Randomness In Defense of Pseudorandomness

12Sep/112

Anonymity and Hash Functions

Posted by Yogesh Swami

Want to anonymize user location? Well, just take the SHA1 of GPS coordinates. Want to anonymize phone numbers? Just take the SHA1. Social Security Number? Just take the SHA1. IMEIs? Take the SHA1... You get the idea!

The practice of using hash functions for anonymizing data is so prevalent—and so tempting—that few question its efficacy. But how effective are hash functions and under what conditions do they perform well? That's focus of this post.

Let's take an example and explore. Suppose there are 100 students in a class with roll numbers 1 to 100. The school registrar wants to broadcast the steller grades of her students to whole world, but doesn't want to give away the roll-numbers itself. (That would cause too much embarrassment to those who did not do that well). So, she decides to simply replace the roll-numbers with its corresponding SHA1 value, and publish the grades against those SHA1 values. She reasons that since no one can find the role numbers from its SHA1 values, her students have sufficient anonymity. But do they?

Anonymity

Consider an "attacker" who knows there are 100 students in the class. (This attacker is really a slacker in the class, and wants to know who got the best grades so that he can steal their homework in future. Hitchcockish yet?) To find out which roll-number got what grades, he constructs a table of roll number ρ and its corresponding SHA1(ρ) for all ρ=1..100 (please see the table on right). Based on this table, he searches through the SHA1 values that the registrar published, and find the corresponding ρ (the roll-number), and the grade!

Clearly, the hash function is not very effective in this contrived example. (The registrar's mistake was that she assumed that non-invertibility of hash-functions is a sufficient condition for anonymity, while in reality, it's merely a necessary one. In fact, not even that!). At a foundational level, anonymity demands that some information present in the source data remain "hidden" from the anonymized data. In this example, however, the source data is easily predictable (enumerable?), and the SHA1s are merely a new encoding of the roll-numbers, and nothing more!

Seen from this angle, ciphers, not hash-functions, appear to be a more appropriate choice for anonymizing data. Unfortunately, ciphers being pseudorandom permutation, leak information about the source and are not appropriate for anonymizing data. (For example, a chipher will leak information about size in bits of the source data. This information can sometimes be quite significant, for example, in deciding if someone deposited a million or a billion dollars in bank?)

As a compromise, a keyed MAC (e.g., HMAC) with a randomly selected "secret key" is always a better choice than using either a cipher or a plan hash function for anonymizing data.

Please note that in many cases one can get away with a pure hash function if the data source itself is highly random. In this case, the "hidden" component necessary for anonymity is the randomness of the source itself and does not require an explicit secret. (That being said, it's important to emphasize that it's the randomness of the source, and not merely the bit-size of data source that is important. For example, in case of roll-numbers, the attack would be just as effective if the roll-numbers started from 1-billion and went up to 1-billion plus 100. On the other hand, if the roll numbers were picked randomly from 0 to 1-billion, the attack would have been far less effective. This distinction is vital, and sometimes easy to forget about.)

For all the use cases given in the first paragraph (GPS location, Phone Numbers, SSN, and IMEI), there is not enough randomness in the source that justifies the use of a hash function alone for anonymity. For example, it's tempting to think that GPS data is two dimensional, floating point numbers, and therefore sufficiently random. In reality, however, people drive on (nearly) linear roads, enter and exit from well known GPS co-ordinates, and drive within a limited range of speed. Based on this, an attacker can easily build a table of GPS locations on a given highway, and find the actual coordinates of a driver who uses hash function for anonymity. Similar arguments apply for IMEI numbers, Phone Numbers, and SSN.

In summary, when anonymizing data, it's better to use keyed MACs over hash functions and explicit secrets (e.g., keys) over implicit ones (e.g., inherent randomness of source).

Filed under: Anonymity 2 Comments
5Sep/110

Trust is an edge, not vertex, property

Posted by Yogesh Swami

I've been too passive for a very long time. It's time for me to come out of my self-imposed hibernation and talk about things I really love talking about. Earlier this year, I moved out of Nokia Research Center, and joined a wonderful startup called Lookout Mobile Security, where I work as a software engineer. (Lookout is probably one of the few security companies that has got usability and security right.) Although, I'm no longer doing active research, I do try to keep up with the research community as best as I can (old habits die hard as you know :-) ).

To kick things off, let's talk about the issue of trust. DigiNotar, a dutch Certification Authority, was recently hacked into issuing as many as 200 fraudulent SSL and EV-SSL (OV) Certificates, including fake certificates for Gmail, Yahoo, Mozilla, and Tor. Every time an incident like this occurs, the Internet gets abuzz with proposals and counter proposals on how to replace CAs with either a web-of-trust model, or with new perspectivespdf icon on how to make PKI more "distributed," (i.e., a SSH style leap-of-faith authentication, augmented with distributed SSL Observatories). Lost in this mechanism debate, however, is the real conceptual issue of what trust is, and what does it mean for end-host authentication.

The traditional view of trust is that it's a property of a vertex. That is, in the interconnection graph between humans, machines, organizations, etc., (please see figure on right; renders properly in Safari!), trust is inherently a property of the vertex itself. We assume that a person, a certificate, or an organization is inherently trustworthy—in isolation! It's this notion of trust that's the foundation of  PKI, that relies on granting certificates to named entities, as opposed to interaction between entities.

In reality, trust is really an accumulated measure of expectation fulfillment between vertices. That is, a user is trustworthy because of the way she interacts with others and never—a categorical never—an inherent property of the vertex (person) itself. Therefore, IMHO, the first step towards solving the PKI problem is to find ways to make inferences about vertices based on the information present in the edges. In fact, many systems outside the realm of cryptographic security, explicitly make use of edge-properties to make inferences about the trustworthiness of a vertex. A good example is the PageRank algorithm which, in contrast to TFIDF—a vertex only algorithm—uses link-structure to find relevant (i.e., trustworthy) web-sites.

You could argue that the web-of-trust model already takes edge interactions into account, and therefore already does what I'm suggesting. After all, in PGP, don't individuals accept or reject keys on the basis of their own interaction with others? (Personally, I have not met one single PGP/GPG user who has ever rejected a key! But that could be because I don't interact with too many people :-) )

Granted, the web-of-trust model does a better job than PKI in deciding who to trust, yet it still relies on an ad-hoc selection of vertices called "trusted introducers" to make trust decisions. In my personal opinion, what is needed is an independent algorithmic means to make inferences about all vertices based on edge properties (similar to computing the first dominant EigenVector of the entire Web Graph derived adjacency matrix in PageRank). In other words, you don't (and shouldn't) get to pick who is trustworthy or not; you only get to interact with others—and the system computes the trustworthiness of each vertex on the basis of these edge interactions. Fair enough?

The second issue is that trust is never static. PKI (X509, to be precise) addresses the dynamic nature of trust by its "Not Before" and "Not After" fields and by issuing Certificate Revocation Lists. Both these approaches are too coarse grained, and do not capture the moment to moment varying nature of trust. A good authentication system will dynamically update its trust measure as interaction between vertices evolve, and will not depend on a handful of entities in the system.

Finally, trust and identity should not be separated from one other. In X509, what you verify is the signature(s) and what you trust is the private key, by association. The identity in the certificate, however, is really just a blob of data added on as an afterthought. If we look at the physical world, however, trust and identity are inextricably interlinked. For example, regardless of what name a person assumes, the legal system still treats a "natural person" on the basis of her behavior, not just her name. Treating identity separate from trust inherently opens doors for numerous forms of attacks. In fact, this is one of the reasons why Identity Based Encryption algorithms are quite interesting, even though they are currently not  practical on a global scale (current IBE systems require a global namespace and a global KDC, which are not practical at the Internet scale).

Although it's a lot of fun analyzing new ways of augmenting PKI, I also believe that PKI and X509 are here to stay! The complexities of overhauling it are immense and the ROI too little.

21Aug/110

Testing, testing testing

Posted by Yogesh Swami

This better be good.

 

/* clang -o hello_wordpress -Wall -Wextra hello.c */
#include <stdio.h>

int main(int argc, char* argvp[]){
    printf("Hello wordpress!\n");
    return 0;
}
Filed under: Uncategorized No Comments
   

Switch to our mobile site