We shall be discussing about the two prominent techniques to help reduce the data loaded into Import models. Remember, import models are loaded with data compressed and optimized and then stored to disk by the VertiPaq storage engine. You will learn about data reduction techniques now.
We shall now discuss the primary data reduction techniques used to reduce the data loaded into the import models.
Optimize data types
Considering the VertiPaq storage engine uses separate data structures for each column. Some of the features of optimizing column data types include –
- Firstly, by design, these data structures achieve the highest optimizations for numeric column data, which use-value encoding.
- Also, text and other non-numeric data, uses hash encoding.
- Further, it requires the storage engine to assign a numeric identifier to each unique text value contained in the column.
- Moreover, it is the numeric identifier, then, that is then stored in the data structure, requiring a hash lookup during storage and querying.
Moreover, In some instances, you can convert source text data into numeric values. For instance, a sales order number may be consistently prefixed by a text value (e.g. “SD123456”). The prefix could be removed, and the order number value converted to whole number. For large tables, it can result in significant data reduction, especially when the column contains unique or high cardinality values.
Group by and summarize
This is possibly one of the most effective techniques used to reduce a model size is to load pre-summarized data.
- Firstly, this technique can be used to raise the grain of fact-type tables.
- Secondly, there is a distinct trade-off, however, resulting in loss of detail.
- Moreover, deciding to summarize fact-type data always involves tradeoffs.
- Lastly, tradeoff could be mitigated by a Mixed model design, and this option is described in the Switch to Mixed mode technique.
For instance, a source sales fact table stores one row per order line. However, significant data reduction could be attained by summarizing all sales metrics, grouping by date, customer, and product. Consider, then, that an even more significant data reduction could be achieved by grouping by date at monthly level. Now, this could achieve a possible 99% reduction in model size, but reporting at day level or individual order level is not possible anymore.