Anonymity and Hash Functions
Want to anonymize user location? Well, just take the SHA1 of GPS coordinates. Want to anonymize phone numbers? Just take the SHA1. Social Security Number? Just take the SHA1. IMEIs? Take the SHA1... You get the idea!
The practice of using hash functions for anonymizing data is so prevalent—and so tempting—that few question its efficacy. But how effective are hash functions and under what conditions do they perform well? That's focus of this post.
Let's take an example and explore. Suppose there are 100 students in a class with roll numbers 1 to 100. The school registrar wants to broadcast the steller grades of her students to whole world, but doesn't want to give away the roll-numbers itself. (That would cause too much embarrassment to those who did not do that well). So, she decides to simply replace the roll-numbers with its corresponding SHA1 value, and publish the grades against those SHA1 values. She reasons that since no one can find the role numbers from its SHA1 values, her students have sufficient anonymity. But do they?

Consider an "attacker" who knows there are 100 students in the class. (This attacker is really a slacker in the class, and wants to know who got the best grades so that he can steal their homework in future. Hitchcockish yet?) To find out which roll-number got what grades, he constructs a table of roll number ρ and its corresponding SHA1(ρ) for all ρ=1..100 (please see the table on right). Based on this table, he searches through the SHA1 values that the registrar published, and find the corresponding ρ (the roll-number), and the grade!
Clearly, the hash function is not very effective in this contrived example. (The registrar's mistake was that she assumed that non-invertibility of hash-functions is a sufficient condition for anonymity, while in reality, it's merely a necessary one. In fact, not even that!). At a foundational level, anonymity demands that some information present in the source data remain "hidden" from the anonymized data. In this example, however, the source data is easily predictable (enumerable?), and the SHA1s are merely a new encoding of the roll-numbers, and nothing more!
Seen from this angle, ciphers, not hash-functions, appear to be a more appropriate choice for anonymizing data. Unfortunately, ciphers being pseudorandom permutation, leak information about the source and are not appropriate for anonymizing data. (For example, a chipher will leak information about size in bits of the source data. This information can sometimes be quite significant, for example, in deciding if someone deposited a million or a billion dollars in bank?)
As a compromise, a keyed MAC (e.g., HMAC) with a randomly selected "secret key" is always a better choice than using either a cipher or a plan hash function for anonymizing data.
Please note that in many cases one can get away with a pure hash function if the data source itself is highly random. In this case, the "hidden" component necessary for anonymity is the randomness of the source itself and does not require an explicit secret. (That being said, it's important to emphasize that it's the randomness of the source, and not merely the bit-size of data source that is important. For example, in case of roll-numbers, the attack would be just as effective if the roll-numbers started from 1-billion and went up to 1-billion plus 100. On the other hand, if the roll numbers were picked randomly from 0 to 1-billion, the attack would have been far less effective. This distinction is vital, and sometimes easy to forget about.)
For all the use cases given in the first paragraph (GPS location, Phone Numbers, SSN, and IMEI), there is not enough randomness in the source that justifies the use of a hash function alone for anonymity. For example, it's tempting to think that GPS data is two dimensional, floating point numbers, and therefore sufficiently random. In reality, however, people drive on (nearly) linear roads, enter and exit from well known GPS co-ordinates, and drive within a limited range of speed. Based on this, an attacker can easily build a table of GPS locations on a given highway, and find the actual coordinates of a driver who uses hash function for anonymity. Similar arguments apply for IMEI numbers, Phone Numbers, and SSN.
In summary, when anonymizing data, it's better to use keyed MACs over hash functions and explicit secrets (e.g., keys) over implicit ones (e.g., inherent randomness of source).