QlikView hash functions and collisions

By Barry Harmsen

HashFunctionI’m currently updating my materials for the upcoming Masters Summit for QlikView in Chicago, and thought I’d share a little bit with you. In my session on data modeling, I explain how you can deal with various types of Slowly Changing Dimensions in QlikView. One of the techniques I explain is using hash functions to detect changes in (historical) records. During the previous events, this always lead to two questions from the audience:

  • What exactly are hash functions and hashes?
  • And, from those who already know the answer to the first question: Aren’t you worried about hash collisions?

Today I will answer both questions and hopefully give you some insight into hash functions, their usefulness in QlikView and the risks of hash collisions.

Hash functions

A hash function is an algorithm that maps data of arbitrary length to data of a fixed length. The value returned by a hash function is like a fingerprint of the input value, and is called a hash value or simply hash. For example, all of the text above can be translated into the following MD5 hash357799131ceffdd43cc0fe9f52b36eeb.

You will notice that this hash is much shorter than the original string used to generate it.Besides that, if only a single character in the text is changed this will lead to a completely different hash. This property makes hash functions very useful to compare things, for example files, but also historical versions of a record.

A hash function is deterministic, meaning that the same input value should always lead to the same hash value. Typically, a hash function is a one-way function, you cannot ‘decode’ the original input value based on the hash value alone. Besides that, a good hash function is also uniform, which means that each hash value should have the same probability of being picked. The image at the top of this post illustrates a very simple hash function. Each of the four input values is mapped to a unique output value.

Hash functions in QlikView

In QlikView, the following hash functions are available:

  • Hash128(): a 128 bit hash function that returns a 22 character string.
  • Hash160(): a 160 bit hash function that returns 27 character string.
  • Hash256(): a 256 bit hash function that returns a 43 character string.

The number of bits determines the output range of the function. A 128 bit hash can store 2^128 (or, 340.282.366.920.938.000.000.000.000.000.000.000.000) different combinations. 160 and 256 bit can store even more combinations (2^160 and 2^256 respectively).

Besides these functions, QlikView also has the AutoNumberHash128() and AutoNumberHash256() functions. These functions basically take the output of the Hash128() and Hash256() function and passes it through the AutoNumber() function. While I think they have a nicer syntax than the regular AutoNumber(), you can supply a comma-separated list of fields instead of a concatenated string, the usefulness of these functions eludes me.

Detecting changed records

Consider a QlikView application containing the following Employee table:

Historic

Now, assume we get some new, changed data and want to quickly determine which rows have changed:

New

As you can see, Jim has moved to another office. How can we detect that this row has changed? We could compare each field in the table to each previous version of the field, but as we are only interested in detecting if the row has changed, using a hash function is a more elegant solution. Using Hash128(Name, Position, Office) we can calculate a hash value for each row:

Hashed

The hash value for Dwight’s record hasn’t changed, because the record hasn’t changed either. Jim’s changed record however does have another hash value than the previous one. Once we’ve detected this we can do further processing on these records. This will be the topic of a future blog post. Or, if you don’t want to wait for that, my data modeling session at the Masters Summit for QlikView.

Hash collisions

As noted before, a hash function is an algorithm that maps data of arbitrary length to data of a fixed length. When different input values lead to the same output hash value, this is known as a hash collision. Consider the following, simplified hash function:

Collision

In this example, both Michael and Toby get the same hash value of 2. It’s easy to see what the problem is here, there are 5 input values and only 4 possible hash values. The input domain is greater than the output range.

Now, you may think “this isn’t a problem for me, the number of input values I deal with is much less than 2^128, let alone 2^256”. It’s a simple assumption to make, but also a wrong one as hash collisions can occur long before the number of input values reaches the range of the hash function.

The birthday problem

Imagine you’re in a room with a group of people. How many people do you think need to be in that room before the probability of two people sharing the same birthday reaches 50%? There are (excluding leap years) 365 days in a year, so maybe 185? 200?

The answer is 23. Surprising, isn’t it? If we raise the number of people to 75, the probability of at least two people sharing a birthday raises to 99,95%. This is known as the birthday problem.

As this is a QlikView blog and not a math blog, I won’t go through the complete solution and proof. Basically, instead of calculating the probability that two people in a group share a birthday, the trick is to calculate the probability that no one in the group shares a birthday. This is much easier to calculate. The result is then subtracted from 1, which gives the probability that at least two people in the group share a birthday.

Calculating the probability of a hash collision

If you looked closely at the previous example, you may see that the people can be considered input values and that their birthdays can be considered hash values. When two people share the same birthday it’s a hash collision! If we understand this, then we can apply the same logic to determine the probability of a hash collision in our data sets. To calculate the approximate probability of a hash collision we can use the following formula:

Approximation

I created a small Excel workbook to calculate the probability of a hash collision. Now, it’s good to realize that Excel only uses 30 significant digits. As these probabilities are very small, this means that Excel is unable to calculate probabilities for very small input values. So, in the example below, I calculated the probability that 1 quadrillion (that’s a 1 with 15 zeroes) input values could lead to a hash collision when using a 128 bit hash.

Probability

The probability of this happening are around 1 in 680 million. Or, to put it in perspective:

Perspective

Now, there is a small caveat with this calculation. It assumes the hash functions used in QlikView leads to a uniform output, meaning each value has the same probability. This may not be the case.

On the other hand, we are not comparing a quadrillion records, we are only comparing two. When calculating the probability of a hash collision with just 2 records and a 128 bit hash using an online high precision calculator, the result is 2.938735877055718769922E-39 (1 in 2.9 Duodecillion). Or, to put it in perspective again, this is less likely than a single person winning the lottery, getting hit by a meteorite, getting attacked by a shark -and- becoming president of the USA in their lifetime.

Switch to a 160 bit hash and  the likelihood of a collision becomes lower than the combined probability of all events in the chart above. Now, just because it is very unlikely doesn’t mean that it can’t happen (see: Law of large numbers), but I like those odds!

TL;DR: when using hashes to check for changed records, don’t worry about collisions.

If you want to experiment with your own calculations, the workbook I used can be downloaded below.

Download the hash collision workbook