Optimize Azure Synapse Analytics


The Microsoft DP-200 exam is getting retired on June 30, 2021. A replacement exam, Data Engineering on Microsoft Azure (DP-203), is available.

In this tutorial, we will learn about a collection of best practices to help you achieve optimal performance from your dedicated SQL pool (formerly SQL DW) deployment. The purpose of this tutorial is to give you basic guidance, highlight important areas of focus, and show how to optimize Azure Synapse Analytics.

Maintain statistics

Dedicated SQL pool (formerly SQL DW) can be configured to automatically detect and create statistics on columns. However, the query plans created by the optimizer are only as good as the available statistics. It is therefore recommended that you enable AUTO_CREATE_STATISTICS for your databases and keep statistics updated daily.
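
As a minimal sketch, enabling automatic statistics creation and refreshing the statistics on a table might look like the following (the database name MyDW and the table dbo.FactSales are placeholders):

-- Let the engine create single-column statistics automatically.
ALTER DATABASE MyDW
SET AUTO_CREATE_STATISTICS ON;

-- Refresh all statistics on a frequently queried table as part of a daily job.
UPDATE STATISTICS dbo.FactSales;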

Group INSERT statements into batches

A one-time load to a small table with an INSERT statement, or even a periodic reload of a look-up, may perform well for your needs with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1'). However, if you need to load thousands or millions of rows throughout the day, you might find that singleton INSERTs just can't keep up. Instead, develop your processes so that they write to a file, and another process periodically comes along and loads this file.
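
One way to implement this pattern is to stage the rows in a file and load it periodically with a COPY statement. This is only a sketch; the storage URL, file layout, and table name are assumptions:

-- Bulk-load the accumulated rows from a staged file instead of issuing singleton INSERTs.
COPY INTO dbo.MyLookup
FROM 'https://myaccount.blob.core.windows.net/staging/mylookup.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2  -- skip the header row
);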

Use PolyBase to load and export data quickly

Dedicated SQL pool (formerly SQL DW) supports loading and exporting data through several tools, including Azure Data Factory, PolyBase, and BCP. For small amounts of data where performance isn't critical, any tool may be sufficient for your needs. However, when you are loading or exporting large volumes of data, or when fast performance is needed, PolyBase is the best choice. PolyBase is designed to leverage the distributed nature of the system and will load and export data orders of magnitude faster than any other tool. PolyBase loads can be run using CTAS or INSERT INTO.
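
For example, a PolyBase load via CTAS might look like this sketch, where the external table ext.SalesStage and the distribution column are assumptions:

-- Load from an external (PolyBase) table into a distributed local table in one minimally logged operation.
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM ext.SalesStage;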

Load then query external tables

While PolyBase, also known as external tables, can be the fastest way to load data, it is not optimal for queries. PolyBase tables currently only support Azure blob files and Azure Data Lake Storage, and these files do not have any compute resources backing them. As a result, dedicated SQL pool cannot offload this work and must read the entire file by loading it into tempdb in order to read the data. Therefore, if you have several queries that will be querying this data, it is better to load this data once and have the queries use the local table.
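
The load-once pattern might look like the following sketch (ext.SalesStage and the column names are assumptions):

-- Anti-pattern: every query against the external table re-reads the underlying files through tempdb.
SELECT COUNT(*) FROM ext.SalesStage WHERE OrderDate >= '2021-01-01';

-- Better: materialize the data once into a local table, then point repeated queries at it.
CREATE TABLE dbo.SalesLocal
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.SalesStage;

SELECT COUNT(*) FROM dbo.SalesLocal WHERE OrderDate >= '2021-01-01';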

Hash distribute large tables

By default, tables are Round Robin distributed. This makes it easy for users to get started creating tables without having to decide how their tables should be distributed. Round Robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better.
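
As a sketch (table and column names are illustrative), the difference is a single clause in the table definition; choose a high-cardinality column that is frequently used in joins or aggregations:

-- Round robin (the default) spreads rows evenly but often forces data movement at query time.
CREATE TABLE dbo.FactOrders_RR
(OrderKey BIGINT, CustomerKey INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = ROUND_ROBIN);

-- Hash-distributing large tables on a frequently joined column usually performs much better.
CREATE TABLE dbo.FactOrders
(OrderKey BIGINT, CustomerKey INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH(CustomerKey));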

Do not over-partition

While partitioning data can be effective for maintaining your data through partition switching or for optimizing scans with partition elimination, having too many partitions can slow down your queries. A high-granularity partitioning strategy that works well on SQL Server may not work well in dedicated SQL pool (formerly SQL DW), because each table is already divided into 60 distributions behind the scenes; a table with 100 partitions effectively has 6,000 partitions.
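
A coarse-grained scheme is therefore usually enough, as in this sketch (names and boundary dates are illustrative):

-- Four boundary values give five partitions, which already means 5 x 60 = 300 physical partitions across distributions.
CREATE TABLE dbo.FactOrdersPartitioned
(OrderKey BIGINT, OrderDate DATE, Amount DECIMAL(18,2))
WITH (
    DISTRIBUTION = HASH(OrderKey),
    PARTITION (OrderDate RANGE RIGHT FOR VALUES
        ('2018-01-01', '2019-01-01', '2020-01-01', '2021-01-01'))
);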

Minimize transaction sizes

INSERT, UPDATE, and DELETE statements run in a transaction, and when they fail they must be rolled back. To minimize the potential for a long rollback, minimize transaction sizes whenever possible. This can be done by dividing INSERT, UPDATE, and DELETE statements into parts. Also leverage special minimal-logging cases, like CTAS, TRUNCATE, DROP TABLE, or INSERT to empty tables, to reduce rollback risk.
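
For example, instead of one large DELETE, a minimally logged alternative is to CTAS the rows you want to keep and swap the tables. This is a sketch; the table names and the filter are assumptions:

-- CTAS the rows to keep (minimally logged) instead of deleting the rest in one big transaction.
CREATE TABLE dbo.FactOrders_New
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactOrders WHERE OrderDate >= '2020-01-01';

-- Swap the tables with metadata-only operations, then drop the old data.
RENAME OBJECT dbo.FactOrders TO FactOrders_Old;
RENAME OBJECT dbo.FactOrders_New TO FactOrders;
DROP TABLE dbo.FactOrders_Old;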

Optimize clustered columnstore tables

Clustered columnstore indexes are one of the most efficient ways you can store your data in dedicated SQL pool, and by default, tables in dedicated SQL pool are created as clustered columnstore. To get the best performance for queries on columnstore tables, good segment quality is important. When rows are written to columnstore tables under memory pressure, segment quality may suffer. Segment quality can be measured by the number of rows in a compressed row group.
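
A sketch for inspecting compressed row-group sizes follows; the catalog view is the dedicated SQL pool variant, and the exact column names and the table to rebuild should be verified against your environment:

-- Many COMPRESSED row groups far below ~1 million rows suggest poor segment quality.
SELECT pdw_node_id, state_description, total_rows
FROM sys.pdw_nodes_column_store_row_groups
WHERE state_description = 'COMPRESSED'
ORDER BY total_rows ASC;

-- Rebuilding the index re-compresses the data and can restore segment quality.
ALTER INDEX ALL ON dbo.FactOrders REBUILD;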

Columnstore tables generally won't push data into a compressed columnstore segment until there are more than 1 million rows per segment, and each dedicated SQL pool table is partitioned into 60 distributions. As a rule of thumb, columnstore tables therefore won't benefit a query unless the table has more than 60 million rows. For a table with fewer than 60 million rows, a columnstore index may not make sense; it also may not hurt.
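
For such smaller tables, a heap (here combined with a replicated distribution, as is common for small look-up tables; the names are illustrative) can be a reasonable alternative:

-- For tables well under ~60 million rows, a heap avoids under-filled columnstore row groups.
CREATE TABLE dbo.DimRegion
(RegionKey INT NOT NULL, RegionName NVARCHAR(100))
WITH (DISTRIBUTION = REPLICATE, HEAP);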

Use larger resource class to improve query performance

Dedicated SQL pool uses resource classes as a way to allocate memory to queries. Out of the box, all users are assigned to the small resource class, which grants 100 MB of memory per distribution. Since there are always 60 distributions, and each distribution is given a minimum of 100 MB, the system-wide total memory allocation is 6,000 MB, or just under 6 GB. Utilizing a larger resource class improves query performance, but larger resource classes reduce concurrency.
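
Resource classes are assigned through role membership, as in this sketch (the user name loaduser is an assumption):

-- Add a user to the large resource class so their queries receive more memory per distribution.
EXEC sp_addrolemember 'largerc', 'loaduser';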

Use Smaller Resource Class to Increase Concurrency

If you notice that user queries seem to have a long delay, it could be that your users are running in larger resource classes and are consuming many concurrency slots, causing other queries to queue up. To see if user queries are queued, run SELECT * FROM sys.dm_pdw_waits to see if any rows are returned.
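
A sketch for checking the queue and then moving a user back to a smaller resource class (the user name reportuser is an assumption):

-- Any rows returned here indicate queued, waiting requests.
SELECT * FROM sys.dm_pdw_waits;

-- Remove a user from the large resource class to free concurrency slots.
EXEC sp_droprolemember 'largerc', 'reportuser';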

Reference: Microsoft Documentation
