How much do humans herd?

Choosing Ice-Creams: can we measure a lack of uniformity?
There are 16 ice-creams at the parlour. If we witness the first 16 customers all choosing vanilla, predicting customer choice 17 is pretty straightforward; not so if those first 16 customers each choose a different flavour.

This, in (vanilla) essence, is the problem afore the hacker.

Choosing Passwords: can hackers measure a lack of uniformity?
Unlike with ice-creams, when choosing passwords the variety in front of users is nearly infinite - we can each choose whatever keystrokes we wish for our password - but, despite this unfettered freedom, as a group do we really choose in a random way, or do we all tend to plump for certain flavours/keywords? In other words, just how much of password keyspace do we use up?

Since we instinctively know the answer (we aren't as individual as we like to believe, and we do all clump), a better question is: can we measure the extent of our human herding?

Measuring Disorder
Entropy (usually labelled 'H') is the term used to describe the state of disorder (i.e. randomness, messiness, lack of structure) in a system, and it can be applied to the hacker's dilemma.

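For cyber security / password purposes, Entropy can be measured by applying a fairly simple formula ( P x Log2(P) where P is a probability ) to each possible choice and adding up the results. Because P is never more than 1, each term is zero or negative, so the totals below carry a minus sign (the textbook definition flips the sign to give a positive number of bits); it is the size of the number that matters. The effect of the formula is to let us figure out just how spread out the choices actually made are compared with the field of possibilities.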
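As a rough illustration of the calculation described above, here is a minimal Python sketch. The function name `entropy_bits` and the two sample lists are made up for this example; the function uses the textbook sign convention, so it returns a positive number of bits whose size matches the negative figures quoted in this article.

```python
from collections import Counter
from math import log2

def entropy_bits(choices):
    """Shannon entropy, in bits, of a list of observed choices."""
    counts = Counter(choices)
    total = sum(counts.values())
    # Each distinct choice contributes P * log2(1 / P), which is the same
    # as -P * log2(P), where P is that choice's share of all observations.
    return sum((n / total) * log2(total / n) for n in counts.values())

# The two ice-cream extremes: everyone picks vanilla,
# or each of the 16 customers picks a different flavour.
all_vanilla = ["vanilla"] * 16
all_different = [f"flavour_{i}" for i in range(16)]

print(entropy_bits(all_vanilla))    # no spread at all
print(entropy_bits(all_different))  # maximum spread over 16 flavours
```

The all-vanilla queue scores zero bits (perfect herding), while sixteen different orders score four bits, the most that 16 flavours allow.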

In the case of the 16 ice creams chosen equally, this is easy. Since one customer in every 16 orders any given flavour, each of the 16 flavours has the same probability of being ordered (1/16), and this is all we need to calculate H:

	Flavours	Probability	Entropy, H
1	Vanilla	1 / 16	1/16 Log₂(1/16)
2	Chocolate	1 / 16	…
3	Strawberry	1 / 16	…
…	…	…	…
16	Rum Raisin	1 / 16	…
Total		1.00	Log₂(1/16)
Hsystem = Hmax			- 4.00

For this optimal case we can shortcut: look at the number of choices (16) and take -Log2(Choices).

Hmax = - Log2(16) = - 4.00
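Why the shortcut works can be written out in a couple of lines. This is a sketch in the article's own sign convention (summing P x Log2(P) without flipping the sign); the symbol n for the number of equally likely choices is introduced here for illustration.

```latex
\begin{aligned}
H_{\max} &= \sum_{i=1}^{n} \frac{1}{n}\log_2\frac{1}{n}
          = n \cdot \frac{1}{n}\log_2\frac{1}{n}
          = \log_2\frac{1}{n}
          = -\log_2 n \\
n = 16:\quad H_{\max} &= -\log_2 16 = -4
\end{aligned}
```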

The Hmax number itself doesn't really matter. What does matter is that no other distribution will ever produce an entropy score that exceeds this number; it represents maximum dispersion (and minimum herding).
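To see the ceiling in action, the sketch below compares an even spread of orders with a lopsided one. The order counts are hypothetical, chosen only for illustration, and the helper `entropy_bits` is an assumed name; it reports positive bits (the textbook convention), so compare sizes rather than signs with the figures in this article.

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy, in bits, from a list of per-choice counts."""
    total = sum(counts)
    # Skip choices nobody made; they contribute nothing to the sum.
    return sum((n / total) * log2(total / n) for n in counts if n > 0)

# Hypothetical order counts for 16 flavours (160 orders in each case).
uniform = [10] * 16        # every flavour ordered equally often
skewed = [85] + [5] * 15   # vanilla dominates, the rest trail behind

h_max = log2(16)           # the shortcut: 4 bits for 16 choices
h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)

print(f"Hmax    = {h_max:.2f} bits")
print(f"uniform = {h_uniform:.2f} bits")
print(f"skewed  = {h_skewed:.2f} bits")
```

The even spread hits the ceiling exactly, while the vanilla-heavy parlour falls well short of it; any herding at all pulls the score below Hmax.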

Rockstart: the hackers' ice-cream parlour
A quick revisit. There were 32.6 million accounts opened and only 14.3 million distinct passwords used. Off the bat we already know that we won't attain Hmax.

But as we said above, the manner in which those passwords were chosen also matters. We want to know just how skewed the choices were to particular passwords, and by using Entropy scoring we can get an idea of this.

Using our shortcut method we can find Hmax -

Hmax = - Log2(32.6 million) = - 25 (approx.)
(so a score of 25, ignoring the sign, represents 100% dispersion: every account with its own password)

And to get H for the user choices made, Hsystem, just as with our ice-creams, we take every password we come across in the dataset and figure out how frequently it occurs. We then apply our Entropy calculation to each password and add up the results, which gives us a definitive measure of human herding:

Rank	Password	Probability	Entropy, H
1	123456	0.89%	-0.06
2	12345	0.24%	-0.02
3	123456789	0.23%	-0.02
…	…	…	…
32,600,000	…	…	…
Hsystem		100.00%	-21.07
Hmax			-24.96
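The individual rows of the table can be checked with a few lines of Python. This sketch uses only the figures quoted above (the three top shares and the 32.6 million accounts); the names `row_term` and `top_shares` are invented for the example. It cannot reproduce the Hsystem total, since that needs the frequency of every password in the dataset.

```python
from math import log2

def row_term(p):
    """One table row in the article's sign convention: P * log2(P)."""
    return p * log2(p)

accounts = 32_600_000  # account total quoted in the article

# Shares of the three most popular passwords, as quoted in the table.
top_shares = {"123456": 0.0089, "12345": 0.0024, "123456789": 0.0023}

for password, p in top_shares.items():
    print(f"{password:>10}  {p:.2%}  {row_term(p):.2f}")

# Shortcut for the ceiling: every account with a different password.
h_max = log2(accounts)
print(f"Hmax magnitude: {h_max:.2f} bits")
```

Each row is tiny on its own; the -21.07 total only emerges once the terms for all of the distinct passwords are added together.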

Human Herding: Human Hacking
At first glance 21 out of 25 (let's just round) doesn't look bad, but it isn't as good as you suspect either.

We might think of it as an "84% perfect randomness score", but entropy is counted in bits, and every bit lost halves the space our choices effectively fill. That missing 4 (between 21 and 25) means that as a group we password users take up just 1 / (2 * 2 * 2 * 2), i.e. 1/16, of all of the keyspace available to us -

Herded Keyspace = 1/16
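The arithmetic behind that fraction can be sketched as follows, using the two scores from the table (taken as positive bit counts). The variable names are made up for this example. Working with the unrounded gap of 3.89 bits gives a fraction a little larger than 1/16, roughly 1/15, so the rounded figure is a fair shorthand.

```python
h_system = 21.07   # size of the measured score, in bits
h_max = 24.96      # size of the ceiling, in bits

missing_bits = h_max - h_system

# Every missing bit halves the share of the space actually in use.
exact_fraction = 2 ** -missing_bits
rounded_fraction = 2 ** -round(missing_bits)   # the article's 1/16

# Number of equally likely passwords that would give the same score.
effective_passwords = 2 ** h_system

print(f"missing bits:        {missing_bits:.2f}")
print(f"exact fraction:      1/{1 / exact_fraction:.1f}")
print(f"rounded fraction:    1/{1 / rounded_fraction:.0f}")
print(f"effective passwords: {effective_passwords:,.0f}")
```

Read this way, 32.6 million accounts behave like a crowd choosing evenly among only a couple of million passwords.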

Our hacker - who has been carefully watching the Rockstart parlour - sees that 'customer' choices aren't truly random and rejoices in the fact that we've made her job that much easier.

Hacking, much like predicting customer choice # 17, comes down to clever guesswork. The more technical expression for that guesswork is "searching the keyspace" and it's a useful one. It helps us easily envisage the idea that we've made the hacker's guessing search across cyberspace that much easier by limiting our choices to one corner that occupies only 1/16 of the total volume available.

Looks like we all do like vanilla passwords after all...

We can measure this.