I’m currently updating my materials for the upcoming Masters Summit for QlikView in Chicago, and thought I’d share a little bit with you. In my session on data modeling, I explain how you can deal with various types of Slowly Changing Dimensions in QlikView. One of the techniques I explain is using hash functions to detect changes in (historical) records. During the previous events, this always lead to two questions from the audience:

- What exactly are hash functions and hashes?
- And, from those who already know the answer to the first question: Aren’t you worried about hash collisions?

Today I will answer both questions and hopefully give you some insight into hash functions, their usefulness in QlikView and the risks of hash collisions.

### Hash functions

A hash function is an algorithm that maps data of arbitrary length to data of a fixed length. The value returned by a hash function is like a fingerprint of the input value, and is called a hash value or simply hash. For example, all of the text above can be translated into the following MD5 hash:Β *357799131ceffdd43cc0fe9f52b36eeb.*

You will notice that this hash is much shorter than the original string used to generate it.Besides that, if only a single character in the text is changed this will lead to a completely different hash. This property makes hash functions very useful to compare things, for example files, but also historical versions of a record.

A hash function is **deterministic**, meaning that the same input value should always lead to the same hash value. Typically, a hash function is a **one-way function**, you cannot ‘decode’ the original input value based on the hash value alone. Besides that, a good hash function is also **uniform**, which means that each hash value should have the same probability of being picked. The image at the top of this post illustrates a very simple hash function. Each of the four input values is mapped to a unique output value.

### Hash functions in QlikView

In QlikView, the following hash functions are available:

**Hash128()**: a 128 bit hash function that returns a 22 character string.**Hash160()**: a 160 bit hash function that returns 27 character string.**Hash256()**: a 256 bit hash function that returns a 43 character string.

The number of bits determines the **output range** of the function. A 128 bit hash can store 2^128 (or, 340.282.366.920.938.000.000.000.000.000.000.000.000) different combinations. 160 and 256 bit can store even more combinations (2^160 and 2^256 respectively).

Besides these functions, QlikView also has the **AutoNumberHash128()** and **AutoNumberHash256()** functions. These functions basically take the output of the **Hash128()** and **Hash256()** function and passes it through the **AutoNumber()** function. While I think they have a nicer syntax than the regular **AutoNumber()**, you can supply a comma-separated list of fields instead of a concatenated string, the usefulness of these functions eludes me.

### Detecting changed records

Consider a QlikView application containing the following Employee table:

Now, assume we get some new, changed data and want to quickly determine which rows have changed:

As you can see, Jim has moved to another office. How can we detect that this row has changed? We could compare each field in the table to each previous version of the field, but as we are only interested in detecting if the row has changed, using a hash function is a more elegant solution. Using **Hash128(Name, Position, Office)** we can calculate a hash value for each row:

The hash value for Dwight’s record hasn’t changed, because the record hasn’t changed either. Jim’s changed record however does have another hash value than the previous one. Once we’ve detected this we can do further processing on these records. This will be the topic of a future blog post. Or, if you don’t want to wait for that, my data modeling session at the Masters Summit for QlikView.

### Hash collisions

As noted before, a hash function is an algorithm that maps data of arbitrary length to data of a fixed length. When different input values lead to the same output hash value, this is known as a **hash collision**. Consider the following, simplified hash function:

In this example, both **Michael** and **Tob****y** get the same hash value of **2**. It’s easy to see what the problem is here, there are 5 input values and only 4 possible hash values. The input domain is greater than the output range.

Now, you may think “this isn’t a problem for me, the number of input values I deal with is much less than 2^128, let alone 2^256”. It’s a simple assumption to make, but also a wrong one as hash collisions can occur long before the number of input values reaches the range of the hash function.

### The birthday problem

Imagine you’re in a room with a group of people. How many people do you think need to be in that room before the probability of two people sharing the same birthday reaches 50%? There are (excluding leap years) 365 days in a year, so maybe 185? 200?

The answer is **23**. Surprising, isn’t it? If we raise the number of people to 75, the probability of at least two people sharing a birthday raises to **99,95%**. This is known as the birthday problem.

As this is a QlikView blog and not a math blog, I won’t go through the complete solution and proof. Basically, instead of calculating the probability that two people in a group share a birthday, the trick is to calculate the probability that **no one** in the group shares a birthday. This is much easier to calculate. The result is then subtracted from 1, which gives the probability that at least two people in the group share a birthday.

### Calculating the probability of a hash collision

If you looked closely at the previous example, you may see that the people can be considered input values and that their birthdays can be considered hash values. When two people share the same birthday it’s a hash collision! If we understand this, then we can apply the same logic to determine the probability of a hash collision in our data sets. To calculate the approximate probability of a hash collision we can use the following formula:

I created a small Excel workbook to calculate the probability of a hash collision. Now, it’s good to realize that Excel only uses 30 significant digits. As these probabilities are very small, this means that Excel is unable to calculate probabilities for very small input values. So, in the example below, I calculated the probability that 1 quadrillion (that’s a 1 with 15 zeroes) input values could lead to a hash collision when using a 128 bit hash.

The probability of this happening are around **1 in 680 million**. Or, to put it in perspective:

Now, there is a small caveat with this calculation. It assumes the hash functions used in QlikView leads to a uniform output, meaning each value has the same probability. This may not be the case.

On the other hand, we are not comparing a quadrillion records, we are only comparing two. When calculating the probability of a hash collision with just 2 records and a 128 bit hash using an online high precision calculator, the result is 2.938735877055718769922E-39 (1 in 2.9 Duodecillion). Or, to put it in perspective again, this is less likely than a single person winning the lottery, getting hit by a meteorite, getting attacked by a shark -and- becoming president of the USA in their lifetime.

Switch to a 160 bit hash andΒ the likelihood of a collision becomes lower than the combined probability of all events in the chart above. Now, just because it is very unlikely doesn’t mean that it can’t happen (see: Law of large numbers), but I like those odds!

**TL;DR: when using hashes to check for changed records, don’t worry about collisions.**

If you want to experiment with your own calculations, the workbook I used can be downloaded below.

## 26 Comments

Nice post Barry!

But, I do not get why to store an additional hash value (and use for comparison) if you can access and use the historical dimensional data itself?

awesome post, Barry! but can you elaborate a bit more on differences between autonumber() and hash() functions? which one is better from a performance perspective once loaded (i guess a numeric one) and each one is more reliable.

Also VERY IMPORTANT thing here is concatenation of the fields and replacing blanks with some characters. People who do not concatenate fields properly, often surprised to see the same hash assigned to two different records. Here is one example that will get the same hash value if you do not replace blanks

name age status

john 9

john 9

if you do

autonumber(name & age & status), both records will get the same hash

but if you do

autonumber(name & ‘|’ & age & ‘|’ & status), they will get different hash values.

Hi Borys,

Just did a quick check. Though concatenating without a separator can give different records the same hash, this doesn’t seem to be the case for the Hash128, 160 and 256 functions:

https://dl.dropboxusercontent.com/u/6795646/QlikFix/Hashing/NullHash.png

The performance difference is for another time π

Kind regards,

Barry

good to know, thanks for testing this, Barry. Great job again on the post!

So I was curious about the performance difference of hashing vs AutoNumber() and decided to perform a small test. I used a QVD containing 10 million random records in 25 fields (3GB on disk) and loaded them twice, once using Hash128() to create a fingerprint of the record and once using AutoNumber(). The performance difference was very noticeable. On my laptop (i7-3720QM, 16GB RAM) Hash128() took 44 seconds, while AutoNumber() took 5:09.

(note that I do not advise using hashes as keys, in this case I’m only using them to detect changed records.)

interesting…I work with SQL Server every day and it can also hash millions of rows in no time. I wonder if autonumber has to build a reference table on a side to “remember” the counters while hash is processed on one value and deterministic.

now you probably need to explain why you would not use hash as keys π

thanks again man, I learned a lot from you blog and a book!

Hi Barry,

I have also performed a same kind of test to compare Autonumber and the result is for me very suprising.

and this is my result.

regular reload Duration: 04:30 Size qvw: 961.423

reload with autonumber Duration: 16.19 Size qvw : 919.812

reload with autonumberhash256 Duration: 11:54 Size qvw : 711.161

So yes the reload with autonumberhash is faster compared with the autonumber but still a lot slower compared to the regular reload.

On the other hand the size of the qvw decreased allot.

So now my questions is the same as Borys.

Why should we not use autonumberhash as a keyfield?

Thanks in advance

What Barry Harmsen is trying to say is that he doesn’t recommend using Hashes ONLY ans keys, not that you can’t use the autonumberhash as key.

I think the point here is that qlikview uses numeric keys much faster than text ones (as a Hash would be)

So final conclusion, IF, you NEED to use an autonumber, use the hashed version of the autonumber since it seems to work a lot faster, and (as Jeroen Jordaan experiment) seems to yield a smaller Qvw size (although I would be cautios about this last assumptiom because, the stored value is an integer just as the one that autonumber yields)

Great Post Barry,

Knowing my luck, I’d probably suffer a hash collision instead of the much more likely (albeit unlikely) event of winning the lottery.

π

Alan

I suddenly have an idea for a Hash Collision Lottery: your ticket number is a hash, if someone else gets the same hash you both win π

Hi Barry,

Very good and clear post. I use autonumberhash256 to turn my ‘complex’ combined key into a numeric key. What I see is that the reload with autonumberhash is much slower then without. Will this be the same with hash?

Hi Barry,

As always a very good and clear post.

I do have a question.

I use autonumberhash256 to change my ‘relative’ complex composed into a faster and smaller number key field.

The only thing is that the reload duration with autonumberhash is much longer then without. Does the same counts for Hash256?

Excellent! Looking forward to part 2.

Hi Barry,

Thanks for the great post.

I have a question on input field and the way it’s stored and hoping you could help clarify please.

We have an input field with 52 rows (values). A user created a bookmark awhile back that includes this inputfield. I then applied this bookmark and created a bookmark as my own (without any changes from the original bookmark). When I viewed both bookmarks using powertool, I can see the inputfield has 52 values and as expected, both bookmarks have the same values for these 52 rows. however, the packed hashed values are different between the two bookmarks. I can’t explain how the difference occurs. Any suggestions please?

Thanks

Elly

I did another experiment:

LOAD AutoNumberHash128(field1, field2) as The_Autonumbered_Hash, field1, field2

Inline [

field1, field2

a, aa

aa, a

1, 11

11, 1

try,

tr, y

t, ry

];

So as to see if it would treat the fields as in concatenation, or it has some code in it to cover that.

I got:

The_Autonumbered_Hash field1 field2

1 a aa

2 aa a

7 t ry

6 tr y

5 try

3 1 11

4 11 1

Soooo, it seems to work… I would be inclined to think that autonumberhash is faster due to not having to calculate the usual Concatenation, BUT, it has to form a hash of the input, so,,, Everything comes down to who is faster, the “concatenation” or the hashing algorithm.

(My money would be on the hashing since its a mathematical algorithm, and not a string operation) π

Yet… I couldn’t resist the question in my head…

I did This:

Table1:

LOAD

IterNo() as iteration,

AutoNumber(IterNo() & ‘_’ & ‘a test string’) as Autonumbered // as normal

AutoGenerate 1

While IterNo() 91 s

Then I tryed:

Table1:

LOAD

IterNo() as iteration,

AutoNumberHash256(IterNo(), ‘a test string’) as Autonumbered // as would be done

AutoGenerate 1

While IterNo() <= 1e7;

and got 28 s!!

So hashing IS more efficient than Concatenating the strings with the underscores!

the autonumber one yielded a 13719KB Qvw

as did the Autonumberhash256

So my theory on the size of the resulting Qvw seems to be standing B)

I think that maybe Jeroen Jordaan droped a few fields in the process and maybe didn’t notice them.

Also!, I think It’s unfair to compare the load of the “regular reload” to the other autonumber and autonumberhash256 because, hey, If you could do it WITHOUT an autonumber, why on earth would you place one there anyways? The right scenario is to compare both solutions in which you end with a decent numeric key.

thanks.

ps: I’ve invested so much keyboard on these comments that I’m going to post this to my blog as well, xD maybe in spanish or something like that, I’ll link this anyways, c ya

Hi Barry, excellent post and of course make people hungry for the complete solution π

Funny to see when people are talking about performance they are talking about script execution. To me this is only really important when the execution time ‘gets in the way’.

The most important thing that always matters is: How is my application going to perform to the users (charts calculation time, etc).

Using a hash function as a key (not the AutoNumber hash variant) leaves your model with keys that are pretty large (128 bit, or worse). This will give a huge performance drop!

So I recommend to never use the hash functions for keys, but use 1 of the AutoNumber functions. They will both generate an integer sequence for your key values.

Of course: After your done with the hash fields (you have determined

what records changed), drop those fields too!

Diego

Hi all, and great post Barry!

Just an additional note to consider the symbol tables and the way that QlikView stores data.

The Autonumber() and AutonumberHash() functions will take up virtually no space in the symbol tables, where as the Hash() functions will.

Hash..() is more efficient in the load than autonumber…() as autonumber…() is having to map the values to distinct sequential integers.

However in the application, once the data is loaded, the autonumber() functions take up less memory.

Also, to bear in mind, the Hash functions will take up MORE memory than a concatenated value that is smaller in size.

EG values of the type “XYZ_12345” that are given a 256 bit hash will take up more memory in the symbol tables as the hash is bigger. This is significant as Hashes are often used to create keys – of which there are many distinct values!

So it depends on whether the optimisation is required in the load, or the application, and your data model in general.

Erica

Hi Erica,

Thanks for pointing this out. I only use the hash functions for change detection, they never make it to the final data model. Of course, they do use some additional memory while the script is running, so the memory peak might be higher. It depends on your environment if you find this acceptable, but generally I do not find it to be an issues (and I’ve used this on >=100 million row tables).

Cheers,

Barry

Just as an extra, these are the memory statistics for the symbol table for fields that were generated:

Row = rownumber

Value = random number using rand()

There are 1 million rows of data.

.tg {border-collapse:collapse;border-spacing:0;}

.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}

.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}

.tg .tg-e3zv{font-weight:bold}

FieldName

Bytes

Count

Row

0

1000000

Value

7998128

999766

Auto_Value

0

999766

Auto_Value128

0

999766

Auto_Value256

0

999766

Hash_Value128

27993448

999766

Hash_Value256

48988534

999766

as you can see, any field that comprises sequential integers takes up 0 symbol table space – including “Row”. The hash values actually take up more room than the original values as QlikView’s generated random number is <128 bit

Erica

My understanding of autonumber() and autonumberhash() is different from what other commenters wrote (and from yours, Barry). It might be wrong, but here it is:

The hash() functions just return hashes and therefore allow collisions. They have performance O(1).

The autonumberhash() functions use internal hash tables to return a unique ID for every value. They do NOT have collisions. See https://en.wikipedia.org/wiki/Hash_table to understand how hash tables deal with collisions. A lookup through a hash table has O(n) performance and therefore is fast. I would assume that 128-bit hash tables take less RAM, but become ineffective on very large volumes of data as they have to resolve many collisions per bucket and therefore their performance drops. Thus for very large amounts of data the 256-bit version of the function is more preferable.

autonumber() function returns a unique ID but it’s internal structure in not a hash table. Most probably it’s either a sorted list or array. Therefore it’s slower — probably something O(n log n). But since it’s internal structure doesn’t use long hashes, it takes even less RAM.

All internal structures take RAM only during loading script execution. They are disposed once it’s complete. hash() functions do not use internal structures.

ResumeAll hash functions are NOT suitable for generating primary keys.

All autonumber functions are suitable for generating primary keys.

You can use autonumber() for small amounts of data. But honestly I don’t see any sense in using it at all. I guess it was created in earlier versions. Then, because its performance was unsatisfactory, Qlik added autonumberhash() functions in later versions, and autonumber() was left for compatibility.

Use autonumberhash128() on large amounts of data.

Use autonumberhash256() on VERY large amounts of data to increase speed at the cost of RAM.

Assumption that autonumberhash() is the same as autonumber( hash()) is not correct.

PS. Designing EasyMorph has taught me a lot about hash tables. They effectively are the fastest way to do deduplication, mapping, aggregation, etc.

Hello,

I recently posted a solution to create a slowly changing dimension (using AutoNumberHash256), curious to know if you have the same algorithm.

https://community.qlik.com/docs/DOC-16551

Hi Barry,

I am looking for source code or algorithm of hash256() function in order to validate a value sent from QV to Java App via WebPageViewer2. Do you know if it is possible to get it and from where?

The idea is to sent some user’s selection in link. This link should contain also hashed value of the selection. Java App then validates this link and if hash value is the same on Java side it lets user in. If the hash is not found then user doesn’t get access.

Thanks in advance!

Hi Barry

I have a strange QlikView issue

I am using Hash to detect that some data exported in excel has not been modified manually after export.

I do that on application obtained from a reduce&distribute task

From time to time I detect FALSE modifications.

I investigated on the matter and discoverd the following:

On a distributed application, on the table that I use for export,

If I duplicate the column containing the hash, I obtain SOMETIMES different hash values in the duplicated column?! How can it be?

Hi Barry,

I attended the Masters Summit this past fall in Philadelphia and truly enjoyed the experience. Since my return to the office, I am working with a document that uses autnumberhash128, but I am confused on what is occurring. The data from two qvds is being placed into one table in which the second qvd data is concatenating to the first. As the data loads, a record can have three rows due to the Source being PreIT, Post, and NCRx. Each of the sources have a different source flag number which is 1, 2, 3 respectively. After the table is created, the data is loaded into a new table and includes the creation of the autonumberhash128. Once this step is complete, the first two records with source of PreIT and Post contain a hash value; however, the third row with source of NCRx does not. Why would the third row not contain a hash value? Please let me know your thoughts on the above as it is greatly appreciated. Thanks!