QlikView hash functions and collisions

By Barry Harmsen

I’m currently updating my materials for the upcoming Masters Summit for QlikView in Chicago, and thought I’d share a little bit with you. In my session on data modeling, I explain how you can deal with various types of Slowly Changing Dimensions in QlikView. One of the techniques I explain is using hash functions to detect changes in (historical) records. During the previous events, this always led to two questions from the audience:

  • What exactly are hash functions and hashes?
  • And, from those who already know the answer to the first question: Aren’t you worried about hash collisions?

Today I will answer both questions and hopefully give you some insight into hash functions, their usefulness in QlikView and the risks of hash collisions.

Hash functions

A hash function is an algorithm that maps data of arbitrary length to data of a fixed length. The value returned by a hash function is like a fingerprint of the input value, and is called a hash value or simply hash. For example, all of the text above can be translated into the following MD5 hash: 357799131ceffdd43cc0fe9f52b36eeb.

You will notice that this hash is much shorter than the original string used to generate it. Besides that, if only a single character in the text is changed, this will lead to a completely different hash. This property makes hash functions very useful for comparing things, for example files, but also historical versions of a record.

A hash function is deterministic, meaning that the same input value always leads to the same hash value. Typically, a hash function is also a one-way function: you cannot ‘decode’ the original input value from the hash value alone. Besides that, a good hash function is uniform, which means that each hash value in the output range has the same probability of being generated. The image at the top of this post illustrates a very simple hash function: each of the four input values is mapped to a unique output value.
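QlikView’s hash functions only exist inside the load script, but these properties are easy to demonstrate with any general-purpose hash function. A small Python sketch using MD5 (purely illustrative; QlikView’s Hash128() is not MD5):

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 hash of a string as a 32-character hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Deterministic: the same input always yields the same hash.
assert md5_hex("QlikView") == md5_hex("QlikView")

# Fixed length: the digest is 32 hex characters,
# no matter how long the input is.
assert len(md5_hex("a" * 10_000)) == 32

# Changing a single character yields a completely different hash.
print(md5_hex("QlikView"))
print(md5_hex("Qlikview"))
assert md5_hex("QlikView") != md5_hex("Qlikview")
```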

Hash functions in QlikView

In QlikView, the following hash functions are available:

  • Hash128(): a 128 bit hash function that returns a 22 character string.
  • Hash160(): a 160 bit hash function that returns a 27 character string.
  • Hash256(): a 256 bit hash function that returns a 43 character string.

The number of bits determines the output range of the function. A 128 bit hash can store 2^128 (roughly 3.4 × 10^38) different combinations. 160 and 256 bit hashes can store even more combinations (2^160 and 2^256 respectively).

Besides these functions, QlikView also has the AutoNumberHash128() and AutoNumberHash256() functions. These functions basically take the output of the Hash128() or Hash256() function and pass it through the AutoNumber() function. While I think they have a nicer syntax than the regular AutoNumber() (you can supply a comma-separated list of fields instead of a concatenated string), the usefulness of these functions eludes me.

Detecting changed records

Consider a QlikView application containing the following Employee table:

[Image: the historical Employee table]

Now, assume we get some new, changed data and want to quickly determine which rows have changed:

[Image: the new Employee data]

As you can see, Jim has moved to another office. How can we detect that this row has changed? We could compare each field in the table to each previous version of the field, but as we are only interested in detecting if the row has changed, using a hash function is a more elegant solution. Using Hash128(Name, Position, Office) we can calculate a hash value for each row:

[Image: the Employee table with hash values]

The hash value for Dwight’s record hasn’t changed, because the record itself hasn’t changed. Jim’s changed record, however, does have a different hash value than the previous one. Once we’ve detected this we can do further processing on these records. This will be the topic of a future blog post (or, if you don’t want to wait for that, of my data modeling session at the Masters Summit for QlikView).
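The same change-detection idea can be sketched outside QlikView. Here is a minimal Python version, using hashlib’s SHA-256 as a stand-in for Hash128() (the office values are invented for illustration):

```python
import hashlib

def row_hash(*fields: str) -> str:
    # Join with a separator so ("ab", "c") and ("a", "bc")
    # cannot produce the same concatenated string.
    joined = "|".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hashes of the historical records, keyed by employee name.
history = {
    "Dwight": row_hash("Dwight", "Salesman", "Scranton"),
    "Jim":    row_hash("Jim", "Salesman", "Scranton"),
}

# Newly delivered data; Jim has moved to another office.
new_rows = [
    ("Dwight", "Salesman", "Scranton"),
    ("Jim",    "Salesman", "Stamford"),
]

# A row has changed when its hash differs from the stored one.
changed = [name for name, position, office in new_rows
           if row_hash(name, position, office) != history.get(name)]
print(changed)  # ['Jim']
```

Only one hash per row needs to be stored and compared, regardless of how many columns the table has.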

Hash collisions

As noted before, a hash function is an algorithm that maps data of arbitrary length to data of a fixed length. When different input values lead to the same output hash value, this is known as a hash collision. Consider the following, simplified hash function:

[Image: a simplified hash function with a collision]

In this example, both Michael and Toby get the same hash value of 2. It’s easy to see what the problem is here: there are 5 input values and only 4 possible hash values. The input domain is greater than the output range.

Now, you may think “this isn’t a problem for me, the number of input values I deal with is much lower than 2^128, let alone 2^256”. It’s an easy assumption to make, but also a wrong one, as hash collisions can occur long before the number of input values reaches the range of the hash function.

The birthday problem

Imagine you’re in a room with a group of people. How many people do you think need to be in that room before the probability of two people sharing the same birthday reaches 50%? There are (excluding leap years) 365 days in a year, so maybe 185? 200?

The answer is 23. Surprising, isn’t it? If we raise the number of people to 75, the probability of at least two people sharing a birthday rises to over 99.9%. This is known as the birthday problem.

As this is a QlikView blog and not a math blog, I won’t go through the complete solution and proof. Basically, instead of calculating the probability that two people in a group share a birthday, the trick is to calculate the probability that no one in the group shares a birthday. This is much easier to calculate. The result is then subtracted from 1, which gives the probability that at least two people in the group share a birthday.
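The reasoning above translates directly into code; a quick Python check of the numbers:

```python
def p_shared_birthday(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    # First the probability that *nobody* shares a birthday:
    # the second person must avoid 1 taken day, the third 2, and so on.
    p_none = 1.0
    for i in range(people):
        p_none *= (days - i) / days
    # Subtracting from 1 gives the probability of at least one match.
    return 1.0 - p_none

print(round(p_shared_birthday(23), 4))   # 0.5073 -> just past 50%
print(round(p_shared_birthday(75), 4))
```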

Calculating the probability of a hash collision

If you looked closely at the previous example, you may have noticed that the people can be considered input values and their birthdays hash values. When two people share the same birthday, that is a hash collision! With this insight, we can apply the same logic to determine the probability of a hash collision in our own data sets. To calculate the approximate probability of a hash collision we can use the following formula:

[Image: the approximation formula]

I created a small Excel workbook to calculate the probability of a hash collision. Now, it’s good to realize that Excel only uses 30 significant digits. As these probabilities are very small, this means that Excel is unable to calculate probabilities for small numbers of input values. So, in the example below, I calculated the probability that 1 quadrillion (that’s a 1 with 15 zeroes) input values could lead to a hash collision when using a 128 bit hash.

[Image: the collision probability calculated in Excel]

The probability of this happening is around 1 in 680 million. Or, to put it in perspective:

[Image: the probability in perspective]

Now, there is a small caveat with this calculation: it assumes that the hash functions used in QlikView lead to a uniform output, meaning each hash value has the same probability. This may not be the case.
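For those who want to check the numbers without Excel: the formula in the ‘Approximation’ image is the usual birthday bound, which I read as p ≈ 1 - exp(-n(n-1)/2^(b+1)) for n input values and a b-bit hash. A Python sketch of that approximation reproduces the figures in this post:

```python
import math

def collision_probability(n: float, bits: int) -> float:
    """Birthday-bound approximation of the probability that n random
    inputs produce at least one collision in a bits-bit hash."""
    # p ~ 1 - exp(-n(n-1) / 2^(bits+1)); math.expm1 keeps the result
    # accurate even when the probability is vanishingly small.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

# One quadrillion input values into a 128 bit hash:
p = collision_probability(1e15, 128)
print(f"1 in {1 / p:,.0f}")              # roughly 1 in 680 million

# Just two input values into a 128 bit hash:
print(collision_probability(2, 128))     # about 2.94e-39
```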

On the other hand, we are not comparing a quadrillion records, we are only comparing two. When calculating the probability of a hash collision with just 2 records and a 128 bit hash using an online high precision calculator, the result is 2.938735877055718769922E-39, or roughly 1 in 3.4 × 10^38. Or, to put it in perspective again, this is less likely than a single person winning the lottery, getting hit by a meteorite, getting attacked by a shark -and- becoming president of the USA in their lifetime.

Switch to a 160 bit hash and the likelihood of a collision becomes lower than the combined probability of all the events in the chart above. Now, just because it is very unlikely doesn’t mean that it can’t happen (see: the law of truly large numbers), but I like those odds!

TL;DR: when using hashes to check for changed records, don’t worry about collisions.

If you want to experiment with your own calculations, the workbook I used can be downloaded below.

Download the hash collision workbook

About The Author

Barry Harmsen

Hi there, I'm Barry and I'm a Business Intelligence Consultant at Bitmetric and based in the Netherlands. Originally from a background of 'traditional' Data Warehousing, Business Intelligence and Performance Management, for the past few years I have been specializing in QlikView and a more user-centric form of BI. I have done numerous QlikView implementations in many different roles and industries. In 2012 I co-authored the book QlikView 11 for Developers. You can follow me on Twitter at @meneerharmsen.

24 Comments

  • 1
    March 11, 2014 - 13:22 | Permalink

    Nice post Barry!

    But I don’t get why you would store an additional hash value (and use it for comparison) when you can access and use the historical dimension data itself?

  • 2
    Borys Tyukin
    March 11, 2014 - 16:16 | Permalink

    awesome post, Barry! But can you elaborate a bit more on the differences between the autonumber() and hash() functions? Which one is better from a performance perspective once loaded (I guess the numeric one), and which one is more reliable?

    Also, a VERY IMPORTANT thing here is the concatenation of the fields and replacing blanks with some character. People who do not concatenate fields properly are often surprised to see the same hash assigned to two different records. Here is one example that will get the same hash value if you do not replace blanks:

    name | age | status
    john | 9   |
    john |     | 9

    if you do
    autonumber(name & age & status), both records will get the same hash

    but if you do
    autonumber(name & '|' & age & '|' & status), they will get different hash values.
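The effect is easy to reproduce with plain string concatenation; a quick Python sketch (illustrative only, not QlikView syntax):

```python
# Two different records: in the first the 9 is the age,
# in the second it is the status.
rec1 = ("john", "9", "")
rec2 = ("john", "", "9")

def no_separator(fields):
    return "".join(fields)

def with_separator(fields):
    return "|".join(fields)

# Without a separator both records concatenate to "john9",
# so any hash of that string will also be identical.
print(no_separator(rec1) == no_separator(rec2))      # True

# With a separator the concatenations differ, and so will the hashes.
print(with_separator(rec1) == with_separator(rec2))  # False
```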

    • 3
      March 11, 2014 - 20:49 | Permalink

      Hi Borys,

      Just did a quick check. Though concatenating without a separator can give different records the same hash, this doesn’t seem to be the case for the Hash128, 160 and 256 functions:

      https://dl.dropboxusercontent.com/u/6795646/QlikFix/Hashing/NullHash.png

      The performance difference is for another time 😉

      Kind regards,
      Barry

      • 4
        Borys Tyukin
        March 11, 2014 - 23:09 | Permalink

        good to know, thanks for testing this, Barry. Great job again on the post!

        • 5
          March 12, 2014 - 21:20 | Permalink

          So I was curious about the performance difference of hashing vs AutoNumber() and decided to perform a small test. I used a QVD containing 10 million random records in 25 fields (3GB on disk) and loaded them twice, once using Hash128() to create a fingerprint of the record and once using AutoNumber(). The performance difference was very noticeable. On my laptop (i7-3720QM, 16GB RAM) Hash128() took 44 seconds, while AutoNumber() took 5:09.

          (Note that I do not advise using hashes as keys; in this case I’m only using them to detect changed records.)

          • 6
            Borys Tyukin
            March 12, 2014 - 21:59 | Permalink

            interesting… I work with SQL Server every day and it can also hash millions of rows in no time. I wonder if autonumber has to build a reference table on the side to “remember” the counters, while a hash is computed from the value itself and is deterministic.

            now you probably need to explain why you would not use hash as keys 🙂

            thanks again man, I learned a lot from your blog and book!

          • 7
            Jeroen Jordaan
            March 12, 2014 - 22:08 | Permalink

            Hi Barry,

            I have also performed a similar test to compare AutoNumber, and the result is, for me, very surprising:

            regular reload:                duration 04:30, qvw size 961.423
            reload with autonumber:        duration 16:19, qvw size 919.812
            reload with autonumberhash256: duration 11:54, qvw size 711.161

            So yes, the reload with autonumberhash is faster than with autonumber, but still a lot slower than the regular reload. On the other hand, the size of the qvw decreased a lot.

            So now my question is the same as Borys’s: why should we not use autonumberhash as a key field?

            Thanks in advance

          • 8
            July 29, 2014 - 01:04 | Permalink

            What Barry Harmsen is trying to say is that he doesn’t recommend using hashes ONLY as keys, not that you can’t use autonumberhash as a key.

            I think the point here is that QlikView handles numeric keys much faster than text ones (as a hash would be).

            So, final conclusion: IF you NEED to use an autonumber, use the hashed version, since it seems to work a lot faster and (as in Jeroen Jordaan’s experiment) seems to yield a smaller qvw size (although I would be cautious about this last assumption, because the stored value is an integer just like the one autonumber yields).

  • 9
    Alan Farrell
    March 11, 2014 - 22:28 | Permalink

    Great Post Barry,

    Knowing my luck, I’d probably suffer a hash collision instead of the much more likely (albeit unlikely) event of winning the lottery.

    🙂

    Alan

    • 10
      March 11, 2014 - 23:17 | Permalink

      I suddenly have an idea for a Hash Collision Lottery: your ticket number is a hash, if someone else gets the same hash you both win 🙂

  • 11
    Jeroen Jordaan
    March 12, 2014 - 00:45 | Permalink

    Hi Barry,

    Very good and clear post. I use autonumberhash256 to turn my ‘complex’ combined key into a numeric key. What I see is that the reload with autonumberhash is much slower than without. Will this be the same with hash?

  • 12
    Jeroen Jordaan
    March 12, 2014 - 10:14 | Permalink

    Hi Barry,

    As always a very good and clear post.

    I do have a question.
    I use autonumberhash256 to change my relatively complex composite key into a faster and smaller numeric key field.

    The only thing is that the reload duration with autonumberhash is much longer than without. Does the same go for Hash256?

  • 13
    Brian Garland
    March 12, 2014 - 17:15 | Permalink

    Excellent! Looking forward to part 2.

  • 14
    Elly
    July 27, 2014 - 16:08 | Permalink

    Hi Barry,
    Thanks for the great post.
    I have a question on input fields and the way they are stored, and I’m hoping you could help clarify.
    We have an input field with 52 rows (values). A user created a bookmark a while back that includes this input field. I applied that bookmark and then created a bookmark of my own (without any changes from the original). When I view both bookmarks using PowerTools, I can see the input field has 52 values and, as expected, both bookmarks have the same values for these 52 rows. However, the packed hash values are different between the two bookmarks. I can’t explain how the difference occurs. Any suggestions please?
    Thanks
    Elly

  • 15
    July 29, 2014 - 01:10 | Permalink

    I did another experiment:

    LOAD AutoNumberHash128(field1, field2) as The_Autonumbered_Hash, field1, field2
    Inline [
    field1, field2
    a, aa
    aa, a
    1, 11
    11, 1
    try,
    tr, y
    t, ry
    ];

    This was to see whether it would treat the fields as a plain concatenation, or whether it has some code in it to cover that.

    I got:
    The_Autonumbered_Hash  field1  field2
    1                      a       aa
    2                      aa      a
    7                      t       ry
    6                      tr      y
    5                      try
    3                      1       11
    4                      11      1

    Soooo, it seems to work… I would be inclined to think that autonumberhash is faster because it doesn’t have to calculate the usual concatenation, BUT it has to form a hash of the input, so everything comes down to which is faster: the concatenation or the hashing algorithm.

    (My money would be on the hashing, since it’s a mathematical algorithm and not a string operation) 😉

  • 16
    July 29, 2014 - 01:25 | Permalink

    Yet… I couldn’t resist the question in my head…

    I did This:

    Table1:
    LOAD
    IterNo() as iteration,
    AutoNumber(IterNo() & '_' & 'a test string') as Autonumbered // as normal
    AutoGenerate 1
    While IterNo() <= 1e7;

    This took 91 s.

    Then I tried:

    Table1:
    LOAD
    IterNo() as iteration,
    AutoNumberHash256(IterNo(), 'a test string') as Autonumbered // as would be done
    AutoGenerate 1
    While IterNo() <= 1e7;

    and got 28 s!!

    So hashing IS more efficient than concatenating the strings with underscores!

    • 17
      July 29, 2014 - 01:35 | Permalink

      the autonumber one yielded a 13719KB qvw

      as did the AutoNumberHash256

      So my theory on the size of the resulting qvw seems to be standing B)

      I think that maybe Jeroen Jordaan dropped a few fields in the process and didn’t notice.
      Also, I think it’s unfair to compare the “regular reload” to the autonumber and autonumberhash256 reloads because, hey, if you could do it WITHOUT an autonumber, why on earth would you place one there anyway? The right scenario is to compare two solutions that both end up with a decent numeric key.

      thanks.

      ps: I’ve invested so much keyboard in these comments that I’m going to post this to my blog as well xD, maybe in Spanish or something like that. I’ll link this anyway, c ya

  • 18
    Diego
    September 17, 2014 - 10:25 | Permalink

    Hi Barry, excellent post, and of course it makes people hungry for the complete solution 😉

    Funny to see that when people talk about performance, they talk about script execution time. To me this is only really important when the execution time ‘gets in the way’.

    The thing that always matters most is: how is my application going to perform for the users (chart calculation times, etc.)?

    Using a hash function as a key (not the AutoNumber hash variant) leaves your model with keys that are pretty large (128 bit, or worse). This will give a huge performance drop!

    So I recommend never using the hash functions for keys; use one of the AutoNumber functions instead. They will both generate an integer sequence for your key values.

    Of course: after you’re done with the hash fields (you have determined which records changed), drop those fields too!

    Diego

  • 19
    Erica Whalley
    December 30, 2014 - 14:22 | Permalink

    Hi all, and great post Barry!

    Just an additional note to consider the symbol tables and the way that QlikView stores data.

    The Autonumber() and AutonumberHash() functions will take up virtually no space in the symbol tables, whereas the Hash() functions will.

    Hash..() is more efficient in the load than autonumber…() as autonumber…() is having to map the values to distinct sequential integers.

    However in the application, once the data is loaded, the autonumber() functions take up less memory.

    Also bear in mind that the Hash functions will take up MORE memory than a concatenated value that is smaller in size.

    E.g. values of the type “XYZ_12345” that are given a 256 bit hash will take up more memory in the symbol tables, as the hash is bigger. This is significant, as hashes are often used to create keys, of which there are many distinct values!

    So it depends on whether the optimisation is required in the load, or the application, and your data model in general.

    Erica

    • 20
      December 30, 2014 - 17:23 | Permalink

      Hi Erica,

      Thanks for pointing this out. I only use the hash functions for change detection; they never make it to the final data model. Of course, they do use some additional memory while the script is running, so the memory peak might be higher. It depends on your environment whether you find this acceptable, but generally I do not find it to be an issue (and I’ve used this on tables of 100+ million rows).

      Cheers,
      Barry

  • 21
    Erica Whalley
    December 30, 2014 - 14:29 | Permalink

    Just as an extra, these are the memory statistics for the symbol table for fields that were generated:
    Row = rownumber
    Value = random number using rand()

    There are 1 million rows of data.

    FieldName       Bytes      Count
    Row             0          1000000
    Value           7998128    999766
    Auto_Value      0          999766
    Auto_Value128   0          999766
    Auto_Value256   0          999766
    Hash_Value128   27993448   999766
    Hash_Value256   48988534   999766
    As you can see, any field that consists of sequential integers takes up 0 bytes of symbol table space, including “Row”. The hash values actually take up more room than the original values, as QlikView’s generated random number is <128 bit.

    Erica

  • 22
    August 12, 2015 - 05:12 | Permalink

    My understanding of the autonumber() and autonumberhash() functions is different from what other commenters wrote (and from yours, Barry). It might be wrong, but here it is:

    The hash() functions just return hashes and therefore allow collisions. They have performance O(1).

    The autonumberhash() functions use internal hash tables to return a unique ID for every value. They do NOT have collisions. See https://en.wikipedia.org/wiki/Hash_table to understand how hash tables deal with collisions. A lookup through a hash table has O(1) performance and therefore is fast. I would assume that 128-bit hash tables take less RAM, but become ineffective on very large volumes of data, as they have to resolve many collisions per bucket and their performance drops. Thus for very large amounts of data the 256-bit version of the function is preferable.

    The autonumber() function also returns a unique ID, but its internal structure is not a hash table. Most probably it’s either a sorted list or an array. Therefore it’s slower, probably something like O(n log n). But since its internal structure doesn’t use long hashes, it takes even less RAM.

    All internal structures take RAM only during loading script execution. They are disposed once it’s complete. hash() functions do not use internal structures.

    Summary
    All hash functions are NOT suitable for generating primary keys.

    All autonumber functions are suitable for generating primary keys.

    You can use autonumber() for small amounts of data. But honestly, I don’t see much point in using it at all. I guess it was created in earlier versions; then, because its performance was unsatisfactory, Qlik added the autonumberhash() functions in later versions, and autonumber() was left for compatibility.

    Use autonumberhash128() on large amounts of data.

    Use autonumberhash256() on VERY large amounts of data to increase speed at the cost of RAM.

    The assumption that autonumberhash() is the same as autonumber(hash()) is not correct.

    PS. Designing EasyMorph has taught me a lot about hash tables. They effectively are the fastest way to do deduplication, mapping, aggregation, etc.

  • 23
    apierens
    April 27, 2016 - 15:22 | Permalink

    Hello,

    I recently posted a solution for creating a slowly changing dimension (using AutoNumberHash256), and I’m curious to know if you use the same algorithm.

    https://community.qlik.com/docs/DOC-16551

  • 24
    Michal
    January 25, 2017 - 11:26 | Permalink

    Hi Barry,
    I am looking for the source code or algorithm of the hash256() function, in order to validate a value sent from QV to a Java app via WebPageViewer2. Do you know if it is possible to get it, and from where?
    The idea is to send some of the user’s selections in a link. This link should also contain a hash value of the selections. The Java app then validates this link, and if the hash value is the same on the Java side it lets the user in. If the hash doesn’t match, the user doesn’t get access.
    Thanks in advance!
