So what do passwords look like?
The short (overly crude) answer: they all look the same.
The 2009 RockYou breach is perhaps the most famous (infamous, rather) of password leaks. It was substantial, at 32 million records, but what made it most distinctive, and so extensively studied, was that the passwords were stored in cleartext (or plaintext).
Let's re-hash (pun intended) some of the important findings regarding password distribution:
Password Length: 6 to 8 characters
More than two-thirds of all passwords were 8 characters or fewer, and of those, most were 6, 7, or 8 characters long.
Character Sets: one set or two?
We can group the characters on your keyboard into four groupings, or character sets:
- Lower x26
- Upper x26
- Digit x10
- Symbol x33 (or x34 depending upon the spacebar)
How might passwords distribute across this keyspace? It turns out that nearly all passwords are made up of only one character set (c. 42% Lower, 16% Digit) or two (c. 37%). Where we use two character sets, they are nearly always string-data or data-string. For example, 'abcdef' uses one set, while 'house14' and 'Thisword' each use two.
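To see how such a breakdown could be produced, here is a minimal sketch that counts how many character sets each password draws on. The function name `charsets_used` and the sample list are my own illustrations (the sample is made up, not RockYou data), and anything that is not an ASCII letter or digit is lumped into "symbol".

```python
import string
from collections import Counter

def charsets_used(password: str) -> frozenset:
    """Return which of the four character sets appear in the password."""
    classes = set()
    for ch in password:
        if ch in string.ascii_lowercase:
            classes.add("lower")
        elif ch in string.ascii_uppercase:
            classes.add("upper")
        elif ch in string.digits:
            classes.add("digit")
        else:
            # Assumption: everything else (space, punctuation, non-ASCII)
            # is treated as a symbol.
            classes.add("symbol")
    return frozenset(classes)

# Hypothetical sample for illustration only.
sample = ["abcdef", "house14", "Thisword", "123456", "p@ss W0rd"]

# Tally passwords by the number of character sets they use.
by_count = Counter(len(charsets_used(p)) for p in sample)
for n_sets, n_passwords in sorted(by_count.items()):
    print(f"{n_sets} set(s): {n_passwords} password(s)")
```

Run against a real password list, the same tally would give the one-set versus two-set split described above.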
Character Frequency: not fooled by randomness
Not surprisingly, users' choices don't deviate too far from the classic ETAOIN pattern seen in everyday English (tabulated by Mayzner, among others), since we tend to use dictionary words in our passwords. Unfortunately, in contrast, cyber security often assumes (as a base case anyway) that characters are chosen uniformly at random across the keyspace when it comes to calculating password entropy.
To explain further: if the user chooses only from the Lower set (say), it is assumed that each letter has an equal 1/26 probability of being chosen. Or to put it another way, the hacker is deemed to have to search through all 26 letters, in no particular order, to guess a single lowercase character.
Since we know that letters are not distributed evenly, that assumption breaks down and I can effectively shrink the keyspace. I won't search across the Lower set randomly but rather will search in order of decreasing probability and save myself time.
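A rough sketch of why the ordered search pays off. The skewed distribution below is a toy I made up (weights 26, 25, ..., 1), not measured letter frequencies, and the function names are hypothetical. Searching in random order costs (n + 1) / 2 guesses on average whatever the distribution; searching most-likely-first costs less whenever the distribution is uneven.

```python
import math

def expected_guesses(probs_in_search_order):
    """Average number of guesses when candidates are tried in this order."""
    return sum(rank * p
               for rank, p in enumerate(probs_in_search_order, start=1))

def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a single character choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Base-case assumption: every lowercase letter equally likely.
uniform = [1 / 26] * 26

# Toy skewed distribution (an assumption for illustration, not real data).
weights = [26 - i for i in range(26)]
skewed = [w / sum(weights) for w in weights]

uniform_cost = expected_guesses(uniform)                        # (26 + 1) / 2
ordered_cost = expected_guesses(sorted(skewed, reverse=True))   # likeliest first

print(f"uniform: {uniform_cost:.2f} guesses, "
      f"{shannon_entropy_bits(uniform):.2f} bits")
print(f"skewed:  {ordered_cost:.2f} guesses, "
      f"{shannon_entropy_bits(skewed):.2f} bits")
```

The more lopsided the real distribution, the bigger the gap between the two costs, and the more the textbook entropy figure overstates the work the attacker actually faces.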
Let's make this more concrete: the 20 most popular characters are the only characters seen in nearly 4.8 million records (nearly 1/6th of the database). As users, it seems, we overuse some sections of our keyboard and underuse others. From the hacker's perspective this translates as follows: why search across 95 characters when 20 will do as a rough cut?
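To put numbers on that rough cut, here is a small sketch comparing the brute-force search space for the full 95-character keyboard against a 20-character subset. The helper name `keyspace` is mine, and the 95 is simply the sum of the four character sets listed earlier (26 + 26 + 10 + 33).

```python
FULL_ALPHABET = 26 + 26 + 10 + 33   # the four character sets: 95 characters
REDUCED_ALPHABET = 20               # the 20 most popular characters

def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of exactly this length."""
    return alphabet_size ** length

# Compare the two search spaces at the most common password lengths.
for length in (6, 7, 8):
    full = keyspace(FULL_ALPHABET, length)
    reduced = keyspace(REDUCED_ALPHABET, length)
    print(f"length {length}: {full:,} vs {reduced:,} "
          f"(about {full // reduced:,}x smaller)")
```

The reduced search only reaches the passwords built entirely from those 20 characters, so it is a first pass rather than a complete attack, but it is a very cheap first pass.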
Herding: little variety
Herding across the passwords themselves was also very strong: the top 10 passwords were used in more than 2 in 100 cases. That's 10 choices selected, from a theoretically near-infinite keyspace, by 670,000 users. This, you might agree, is fairly mind-blowing.
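For the curious, a minimal sketch of how a "top N" share like this could be measured on any password list. The function name `top_n_share` and the sample list are hypothetical (the sample is invented, not taken from RockYou). As a sanity check on the figures above, 670,000 out of 32 million is roughly 2.1%.

```python
from collections import Counter

def top_n_share(passwords, n):
    """Fraction of all entries accounted for by the n most common passwords."""
    if not passwords:
        return 0.0
    counts = Counter(passwords)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(passwords)

# Invented sample for illustration only.
sample = (["123456"] * 5 + ["password"] * 3 + ["iloveyou"] * 2
          + ["xk7#Qp", "house14", "Thisword"])

share = top_n_share(sample, 2)
print(f"top 2 passwords cover {share:.1%} of this sample")

# The article's own figures: 670,000 users out of 32 million records.
print(f"{670_000 / 32_000_000:.2%}")
```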
When considered in conjunction with all the other knowledge we've gleaned, we might even say that these behaviours add up to something rather dangerous for one's cyber security.