QlikFix

Testing the performance implications of variables and label referencing versus direct expressions After my previous post about variables, I got an interesting question from DV. He asked me about the reuse of chart expressions by referencing the label of another expression (“label referencing”), and what the performance implications of using variables and label referencing versus direct expressions are.

I use variables and label referencing extensively in my applications, but I never really tested what this means for performance. I have always assumed that using variables instead of direct expressions would have a slight impact on performance. I also suspected that using label referencing would result in significantly better performance (I will explain this later).

But was this really true? Triggered by DV’s question, I set up a small experiment to test my assumptions.

For those of you that are in a hurry or do not want to read a lot of text before reaching an unexciting conclusion: My experiment seems to indicate that there is a very, very small performance cost when using variables or label referencing instead of direct expressions. For all intents and purposes this difference is negligible however, resulting in only a roughly 0.3% slower calculation for the slowest option.

For those of you that want to see how I did it, and maybe pick up some techniques for identifying and resolving performance issues in your QlikView applications: Read on!

The set-up

As performance issues with calculation of objects are usually not an issue with small data volumes, I decided that I would need a moderate-to-large dataset to test with. I generated a 500 million row CSV file containing the following data:

Customer: random customer names between AAA and ZZZ, resulting in 17,576 unique values;
Product: random products named A to Z, 26 unique values;
Year: random year from 2008 to 2011;
Actual revenue: random number between 0 and 1,000,000;
Target revenue: random number between 0 and 1,000,000.

This resulted in a 12.5 GB CSV file, that I then loaded into QlikView. (It also turned into a 4 GB QVD file, but that is the subject for another post)

I then created a very big pivot table containing actual and targeted revenue for each customer by product and year. Additionally, I created a calculated field to compare actual versus targeted revenue.

Very big pivot table, containing 500 million records

The idea was to create an object that contained many records and calculations, so that any differences in calculation speed between the various calculation options would be amplified.

The calculation options

For the expressions, I tested three different options to see how the calculation time would be affected:

Direct expression: this is the straightforward option, directly typing the expression into the expression definition:
- Actual revenue: sum(Actual)
- Target revenue: sum(Target)
- Actual vs Target: sum(Actual) / sum(Target)
Label referencing: this option is the same for the first two expressions, Actual and Target revenue. However, for the third expression (Actual vs Target) calculation, instead of using the Sum() function we reference the earlier two results by their expression label names. My impression was that this might calculate faster, because for the last expression QlikView does not need to calculate the sums of the actual and target revenue again. Instead it can use the results that were calculated in the other two expressions:
- Actual revenue: sum(Actual)
- Target revenue: sum(Target)
- Actual vs Target: [Actual revenue] / [Target revenue]
Variables: Instead of using direct expressions, this option uses variables to store the calculations. I suspected that this method might be a little bit slower, as QlikView would also have to perform Dollar-sign expansion:
- Actual revenue: $(vActual)
- Target revenue: $(vTarget)
- Actual vs Target: $(vAT)

The first test

My initial idea was to create three copies of the pivot table, each containing a different calculation option. After making a selection, I would check the CalcTime for each of the pivot tables via the Objects tab in the Sheet properties dialog.

Sheet objects

There was a problem however. It looks as if the CalcTime is not showing the individual calculation time for an object, but rather the cumulative calculation time when the object finished updating. So the first pivot table to be calculated would show a calculation time of 30 seconds, while the second would show 60 seconds. It would actually not take 60 seconds for the second table, but since it had to wait until the first table was calculated that time was being added as well.

The second test

As the three-in-one approach was not going to work, I decided to create three identical applications with only the calculations being different.

I then proceeded to make the same 25 selections (in the same order) in each of the three applications. Afterwards I exported the memory statistics via the General tab in the Document Properties dialog.

Export memory statistics

This function exports a text file containing memory usage statistics for the various parts of the QlikView document. These .mem files can then be loaded and analyzed in the QlikView Optimizer application. This application originally came with the training materials for QlikView 8.5, but is still compatible with version 10. You can download it by clicking here.

QlikView Optimizer result for the pivot tables

As you can see in the result above, the direct expression has the lowest average calculation time. The variable ends in second place. Label referencing turned out to be the slowest option, which surprised me. Overall the difference between the fastest and slowest option was 0.33%.

Conclusion

So, what did I learn? Well, my suspicion that using variables instead of direct expressions impacts calculation time seems to be confirmed. However, the difference is so small that the it is virtually unnoticeable to the user. I was surprised that label referencing turned out to be the slowest option. Although at 0.33% slower, this is hardly a problem.

While I did not gain any shocking insights from this experiment, it is nice to know that performance is not an issue when picking which option to use. You are free to pick the option that best fits the situation.

For those of you interested in generating random data, here is the script I used.

Download the data generation script

As always, if you have any comments, questions or critiques on the unscientific method I used ;), feel free to drop them in the comments below.

Testing the performance implications of variables and label referencing versus direct expressions

By Barry Harmsen

The set-up

The calculation options

The first test

The second test

Conclusion