As Ometis’ Service Delivery Manager, I work on a diverse range of projects, but a recent exercise involved extensive use of the Group By clause in Qlik Sense, with some interesting results.
It’s not one of the most commonly used features – often any grouping is done as part of the load from the source data, or you avoid it altogether so that Qlik gets the full transactional detail it thrives on. Neither applied in this case: the source data was a set of large exported CSV files, and we needed to calculate something like the average revenue for each product group across a range of transactions and then map that value onto individual transactions for further calculations.
I had the calculations working fine across the ~2 million rows of data but, as is common with projects, the data and functionality grew over time until the load was taking several hours. This is never ideal, so it was time to look at optimisation.
The first step was to gather some data – I logged the start and end time of each section of the script to a table so we could see the slowest components. This is so useful that I wish Qlik Sense had a built-in feature for it – this information is hard to glean from the script logs, and a summary of the start and end times of each script section would be great!
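As a rough sketch of the approach (this is my own illustration rather than a built-in Qlik feature – the section and variable names are made up):

// Empty table to collect the timings
SectionLog:
Load * Inline [
Section, StartTime, EndTime
];

// At the top of each script section:
Let vSectionStart = Now();

// ...the section's load statements run here...

// At the end of the section, append a row to the log:
Concatenate (SectionLog)
Load
    'Transactions' as Section,
    '$(vSectionStart)' as StartTime,
    Now() as EndTime
AutoGenerate 1;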
Next, I applied some of the common optimisation techniques: reducing the data as much as possible; ensuring optimised QVD loads where possible, including using Where Exists; and storing to QVD and loading from there rather than using resident reloads. There were good improvements, but I was left with a range of statements that were really holding things up. The statements included:
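A representative example of the pattern (field and file names here are illustrative stand-ins for the customer's real ones):

ProductAvgMap:
Mapping Load
    ProductGroup,
    Avg(Revenue) as AvgGroupRevenue
From [Transactions.qvd] (qvd)
Group By ProductGroup;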
These were taking 2–3 minutes to load our 2 million rows of data, which seemed very slow, especially considering only 2 fields were involved. Given that I had around 100 of these statements, it was high on the list of candidates for optimisation. As you can probably see, there are various components here to consider:
• Is a Mapping Load different from a normal Load in terms of performance?
• Does Group By perform better when loading from QVD vs Resident vs Preceding?
• How does the where clause affect performance in these different scenarios?
• Why does it seem that group by in the script is much slower than the equivalent functionality in the front end? (A question the customer specifically asked me)
• How does the number of columns in the source table affect performance?
I set out to test a few things and learn some answers – what I found surprised me so I thought it was worth sharing.
Early Findings (a.k.a. the TL;DR version)
From some rough tests, it quickly became clear that Group By performs much better on a resident table than when loading from source data, even if it’s a QVD – in my tests it was at least 20x faster (even factoring in the time to load from the QVD to get the table resident).
So, the code quickly became:
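In sketch form, using the same illustrative names as above:

// Step 1: an optimised load of just the two fields needed
TempRevenue:
Load
    ProductGroup,
    Revenue
From [Transactions.qvd] (qvd);

// Step 2: the Group By now runs against the resident table
ProductAvgMap:
Mapping Load
    ProductGroup,
    Avg(Revenue) as AvgGroupRevenue
Resident TempRevenue
Group By ProductGroup;

// Step 3: tidy up the temporary table
Drop Table TempRevenue;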
This means the first (normal) load can be optimised and the second (mapping) load is resident to the first. Across a sample of the data, this reduced the load time from 2m13s to 6s – a speed-up of over 22x! This was a big win and cut almost 2 hours off our load time once applied to all the relevant places.
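For completeness, the mapping table is then applied to the individual transactions in the usual way – something along these lines (again, illustrative names):

Transactions:
// Note: the ApplyMap call means this QVD load is no longer optimised
Load
    *,
    ApplyMap('ProductAvgMap', ProductGroup, Null()) as AvgGroupRevenue
From [Transactions.qvd] (qvd);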
More detailed exploration
I then wanted to analyse this further to see what factors affected the load time in what way, so I set up a range of test scenarios. I ran variations of a load with the following variables:
• Loading from a QVD with ~100 fields and ~2 million rows, or from a QVD with the same number of rows but only 6 columns
• Normal Loads and Mapping Loads
• With and without Group By
• With no Where clause, with a standard Where clause, and with a Where Exists clause
• Loading from QVD, using Resident Loads or Preceding Loads
Using those tests, let me revisit the questions above one at a time.
Is a Mapping Load different from a normal Load in terms of performance?
Load * from x.qvd; on our data took a tenth of the time of Mapping Load a, b from x.qvd; (8s vs 82s) – and that’s loading all the data, not just the 2 fields required for the Mapping Load.
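Side by side, the two statements looked like this (x.qvd being a placeholder for the real file):

// Optimised load of the full file: ~8s
AllData:
Load * From [x.qvd] (qvd);

// Mapping Load of just two fields: ~82s
TestMap:
Mapping Load a, b From [x.qvd] (qvd);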
Conclusion: Yes, Mapping Load seems not to do an optimised load from QVD even when no functions or filters are used in the load statement.
Does Group By perform better when loading from QVD vs Resident vs Preceding?
As mentioned above, Resident was by far the quickest way of doing this. To produce the same 2-column mapping output:
• Doing it directly from the QVD took 86s
• Using a preceding load took 87s – basically the same as going directly from the QVD
• Using a resident load took 3s – a speed-up of nearly 30x! (This includes loading the 2 columns from QVD and dropping the temporary table afterwards – see the sample code above)
Conclusion: You’re far better off loading the data from source into a temporary table and then running your Group By using a Resident Load from that table than doing the Group By directly on the source data.
How does the where clause affect performance in these different scenarios?
Unsurprisingly, reducing the data that you’re loading does speed things up – when we filtered down from just over 2 years of data to 1 year, our load time dropped from 82s to 71s, a reduction of roughly 13%.
Common wisdom says to use Where Exists where possible, but in this set of tests we didn’t see any difference in performance – perhaps because the filter we were using was a simple numeric comparison of a Year field against a static value.
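For reference, the three variants tested looked roughly like this (field names and the filter value are illustrative):

// No Where clause – a fully optimised QVD load
Load ProductGroup, Revenue From [Transactions.qvd] (qvd);

// Standard Where clause – this breaks the optimised load
Load ProductGroup, Revenue From [Transactions.qvd] (qvd)
Where Year >= 2020;

// Where Exists – a single-field Exists() keeps the QVD load optimised,
// provided the Year values have already been loaded elsewhere in the script
Load ProductGroup, Revenue, Year From [Transactions.qvd] (qvd)
Where Exists(Year);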
Conclusion: The choice of Where clause doesn’t seem to make a big difference, but reducing the data you load is still a worthwhile gain.
Why does it seem that group by in the script is much slower than the equivalent functionality in the front end?
This was a great question my customer asked, and it’s what really led me down this path of investigation in the first place. If we put a dimension of Group Key in a chart with an expression of Sum(Revenue), it would calculate in seconds, not minutes – essentially the same aggregation as the Group By we did above… so why is the script so much slower?
When you think about what Qlik Sense is doing, this makes some sense – when loading directly from a data source, it (presumably) groups the data as it streams through, so it’s not all in memory to start with. There’s also the chance that a Where clause or some other function running on the data will change which rows end up being grouped.
However, when you load the data into a temporary table first and run the Group By afterwards, everything is already in Qlik’s memory, so it can use a faster way of grouping.
That said, I always thought that a preceding load was basically like a resident load. People like Henric Cronström (Qlik employee #3) have suggested it should usually be quicker than even a resident load. At least in this scenario, it was much, much worse – I’m guessing it’s again to do with how the data flow works in the context of a preceding load.
This is all conjecture – I’d be interested to know from Qlik directly why this is.
Conclusion: It isn’t slower – as long as the data you’re grouping is already in memory!
How does the number of columns in the source table affect performance?
This made a significant difference: for instance, the code below took over 2 minutes against a file with >100 fields in it, versus 10 seconds against a file with the same rows but containing only the 2 required fields.
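The statement in question, again with illustrative names:

// Against the wide QVD (>100 fields): over 2 minutes
ProductAvgMap:
Mapping Load
    ProductGroup,
    Avg(Revenue) as AvgGroupRevenue
From [WideTransactions.qvd] (qvd)
Group By ProductGroup;

// The identical statement against a QVD containing only ProductGroup and
// Revenue (same row count) took around 10 seconds.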
Conclusion: As per the TL;DR version above – generally speaking, it’s quicker to do a Group By or Mapping Load on a resident table with as few columns as possible than to do them directly from source, even if no other transformations are required.
By Alex Walker