DISTINCT can be deceiving

By Barry Harmsen

DISTINCT can be deceiving

Never think you know it all! Today I have for you a little ‘gotcha!’ about the LOAD DISTINCT statement that I picked up from Oleg Troyansky at the Masters Summit for QlikView.

Common knowledge about LOAD DISTINCT is that it will remove duplicate records from the input table, similar to how the DISTINCT clause works in SQL. For example:

Regular LOAD DISTINCT

However, when used in combination with the JOIN or CONCATENATE statement, the results can be surprising, and even counterintuitive.

The spoiler

When combined with the JOIN or CONCATENATE statement, LOAD DISTINCT will not only remove all duplicates from the input table, but also from the resulting target table.

The case

Consider the following two tables; Sales and Customers:

Sales and Customers tables

You will notice that both tables contain duplicate records, the Sales table contains identical records for ID’s 2 and 3, and the Customers table contains two records for Mickey Mouse. If we JOIN these two tables together using the following script:

Regular LOAD script joining Sales and Customers.

Then the result is the following table:

Sales and Customer tables joined without the DISTINCT clause.

There are no surprises here. There are 4 rows for Mickey Mouse with a Sales Amount of 300, as both tables contained two duplicate rows and 2 x 2, the number of ways in which both records can be joined, equals 4. There are also 2 rows for Donald Duck with a Sales Amount of 400, as the Sales table contained duplicate rows for ID 3.

The DISTINCT LOAD

Now we add one small change to the script, we use LOAD DISTINCT to load the second table:

LOAD DISTINCT

Since I already revealed the spoiler at the start of this post, that using LOAD DISTINCT with the JOIN or CONCATENATE statements will also remove all duplicates from the target table, you have probably guessed that the previous piece of script results in the following table:

DISTINCT LOAD resulting table

The twist

Now that we’ve seen how LOAD DISTINCT impacts the resulting table, let’s make one small change to the script; instead of using LOAD DISTINCT on the second table we’ll use it on the first table:

LOAD DISTINCT on the first table.

Can you guess what the resulting table is going to look like? Think about it for a second before scrolling down further. When deduplicating the first table and then joining the second table (which also contains duplicates), the result must be something like the one shown below, right?

Using a LOAD DISTINCT on the first table.

Gotcha!

Once loaded DISTINCT a table stays distinct.

Oh, and by the way, you’re actually a ghost.

In an M. Night Shyamalan-esque twist (ok, so not really), the result is actually exactly the same as when we used LOAD DISTINCT on the second table:

DISTINCT LOAD resulting table

The conclusion

Whenever the DISTINCT clause is used on a table, the table’s records will stay distinct, no matter what you JOIN or CONCATENATE to the table. This can have all sorts of unwelcome consequences, such as duplicate-but-correct facts disappearing from your fact table. Of course, it can also have positive effects; this behavior is very useful when you need your link table to only contain distinct combinations

The advice

Be aware of this behavior in QlikView, if you need to do a DISTINCT LOAD combined with a JOIN or CONCATENATE statement consider if you might be losing duplicate rows, and if this is something that is desirable. If it’s not, then you should probably load the data into a temporary, resident table before joining it to your other data.

The example app

The example application for this post lets you load the example tables in a number of ways and displays the associated script and resulting table.

Download the example application.
Download the example application

Got any other QlikView gotchas? I’d be very interested to hear about them, so please feel free to leave a comment.

About The Author

Barry Harmsen

Hi there, I'm Barry and I'm a Business Intelligence Consultant at Bitmetric and based in the Netherlands. Originally from a background of 'traditional' Data Warehousing, Business Intelligence and Performance Management, for the past 10 years I have been specializing in Qlik and a more user-centric form of BI. I have done numerous QlikView and Qlik Sense implementations in many different roles and industries. In 2012 I co-authored the book QlikView 11 for Developers. You can follow me on Twitter at @meneerharmsen.

19 Comments

  • 1
    July 30, 2013 - 15:36 | Permalink

    Wow!

    Never ran into that one before (that I know about!)

    Could be very interesting to make it happen on purpose!

    • 2
      July 30, 2013 - 23:37 | Permalink

      I was very surprised about this as well. I get how joining/concatenating a distinct table to another table might make the target table distinct as well (though that’s already quite counter-intuitive when you’re used to SQL), but that joining duplicate records to a table that was loaded distinct earlier still leads to a distinct table is really strange. It puzzles me why it would be implemented like this.

  • 3
    Frnak
    July 30, 2013 - 16:01 | Permalink

    Good article! I have to check all my old code now, thanks for that 😉

  • 4
    July 30, 2013 - 19:46 | Permalink

    Mmmmmh…

    Me too I have to check some old code…

    Many thanks for your well done explanation…;-)

    Giulio

  • 5
    August 2, 2013 - 01:42 | Permalink

    As a “source” for this gotcha, let me add my two cents to the story…

    What’s important to remember is this: “DISTINCT will always work on the end result”. Whether you use it in a single LOAD or a series of auto-concatenating LOADs, or as part of a JOIN or a Forced-Concatenated LOAD, DISTINCT will always affect the final result.

    So, sometimes it’s good. For example, when you are building a classical Link Table, you load a DISTINCT list of key fields from multiple data sources. Intuitively, after 2-3 loads from different sources, you want to reload the final table once again, to ensure that no keys were loaded twice. No need. Why? Because DISTINCT is always applied to the end result. Even if some of the keys exist in multiple data sources, that final DISTINCT in the last load will make the whole table DISTINCT.

    Many times, it’s causing disasters, and that’s very well described in this article by Barry.

    cheers

  • 6
    Code
    February 19, 2014 - 12:12 | Permalink

    Thanks mate, you really saved my ass!

  • 7
    Julian
    February 12, 2015 - 18:01 | Permalink

    Nice explanation there, thanks!

    Do you know if its possible to exclude certain fields from the distinct expression?

  • 8
    Bruno Paula Cardoso
    May 6, 2015 - 10:49 | Permalink

    you cant overcame this problem by using:

    Load RowNo() as ID_2,
    … (then the other fields).

    This way you always have a diferent number on the ID_2.

    🙂

  • 9
    Juraj
    October 16, 2015 - 14:14 | Permalink

    Hello Barry,
    coming back to this older post. Today my colleague did a distinct load then droped a field from table and then stored the table to a QVD. BUT, rowcount in QlikView was different from rowcount in QVD. QVD records were not distinct, but data in QV were distinct. Try code below:

    Temp:
    LOAD * INLINE [
    A, B
    1, 1
    1, 2
    ];

    A: NOConcatenate LOAD Distinct A, B Resident Temp; DROP Table Temp;

    DROP Field B;

    STORE A into A.qvd (qvd);

    It seems as if DISTINCT was applied only at the end of script processing, so it did not affect data when the table was stored.

  • 10
    sachin
    January 7, 2016 - 09:48 | Permalink

    good post..

    But blind distinct deceives but identifying the fields and combining them and then having distinct will provide the soln many times.Thanks

  • 11
    April 6, 2016 - 12:42 | Permalink

    Hello Sir,

    How can we take a distinct count of distinct… That is
    If jan1 = 10 ids
    Jan 2 = 20 ids (5 ids common between jan1 and jan2)
    Jan 3 = 30 ids ( 7 ids common between jan1, jan2 and jan3)
    Then i should be able to remove the common ids.
    Hence in the YTD graph it should come as
    Jan 1 = 10
    Jan 2 = 15
    Jan 3 = 23
    Please help me know how can I achieve this.

  • 12
    May 27, 2016 - 09:28 | Permalink

    So, I want to show Duplicate values

    I have Example like this

    LOAD * INLINE [
    Country, Sales
    India, 5000
    Us, 2500
    Uk, 4000
    India, 5000
    Us, 5500
    India, 4500
    ];

    I want to show one straight table with India data like

    India 5000
    India 5000
    India 4500

  • 13
    geetha
    May 30, 2016 - 19:35 | Permalink

    Hi every one,

    I have the data like

    keyID, country

    101 FR
    101 IT
    101 –
    102 –
    102 –
    102 US

    The Output should look like

    101 FR,IT,-
    102 -,-,US

    can someone help me how to write the script in QlikView.

  • 14
    Amanda Sundberg
    September 13, 2016 - 09:26 | Permalink

    Hi, thanks for a great post. I just discovered it myself when trying to concatenate a table where I had used DISTINCT and found my desired field completely disappeared from the target table! However, I am not able to solve this through making a temporary table and then concatenating it, which I understand is your suggestion.

  • 15
    Sue
    January 30, 2018 - 15:25 | Permalink

    Excellent post! I couldn’t understand why duplicate records were getting dropped until after reading your article. Thank you so much!

  • Leave a Reply

    Your email address will not be published. Required fields are marked *

    Read previous post:
    QlikView for Developers Cookbook + Foreword

    As you may have read or seen elsewhere already, the 24th of June saw a new addition to the QlikView...

    Close