Data partitioning strategies


Go back to DP-200 Tutorials

In this tutorial, we will describe some strategies for partitioning data in various Azure data stores.

Partitioning Azure SQL Database

  • A single SQL database has a limit to the volume of data that it can contain. Throughput is constrained by architectural factors and the number of concurrent connections that it supports. However, elastic pools support horizontal scaling for a SQL database. Using elastic pools, you can partition your data into shards that are spread across multiple SQL databases. You can also add or remove shards as the volume of data that you need to handle grows and shrinks.
  • Secondly, each shard is implemented as a SQL database. A shard can hold more than one dataset (called a shardlet). Each database maintains metadata that describes the shardlets that it contains. A shardlet can be a single data item, or a group of items that share the same shardlet key.
  • Lastly, client applications are responsible for associating a dataset with a shardlet key. A separate SQL database acts as a global shard map manager. This database has a list of all the shards and shardlets in the system. The application connects to the shard map manager database to obtain a copy of the shard map.
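The shard map manager pattern above can be sketched in a few lines. This is an illustrative in-memory model, not the real Elastic Database client library API; the class and connection strings are assumptions for the sketch.

```python
# Minimal sketch of a shard map manager: the global map associates a
# shardlet key with the shard (SQL database) that holds its data.
class ShardMapManager:
    def __init__(self):
        self._map = {}  # shardlet key -> shard connection string

    def register(self, shardlet_key, shard_connection):
        self._map[shardlet_key] = shard_connection

    def resolve(self, shardlet_key):
        # The client application calls this to learn which shard
        # to connect to for a given dataset.
        return self._map[shardlet_key]

manager = ShardMapManager()
manager.register("tenant-001", "Server=shard1;Database=ShardDb1")
manager.register("tenant-002", "Server=shard2;Database=ShardDb2")

print(manager.resolve("tenant-002"))  # Server=shard2;Database=ShardDb2
```

In the real Elastic Database tools, the client caches a copy of the shard map locally so that routing does not require a round trip to the shard map manager database on every request.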
Further, Elastic Database provides two schemes for mapping data to shardlets and storing them in shards:
  • Firstly, a list shard map associates a single key to a shardlet. For example, in a multitenant system, the data for each tenant can be associated with a unique key and stored in its own shardlet. To guarantee isolation, each shardlet can be held within its own shard.
[Figure: Using a list shard map to store tenant data in separate shards. Image Source: Microsoft]
  • Secondly, a range shard map associates a set of contiguous key values to a shardlet. For example, you can group the data for a set of tenants (each with their own key) within the same shardlet. This scheme is less expensive than the first because tenants share data storage, but it offers less isolation.
[Figure: Using a range shard map to store data for a range of tenants in a shard. Image Source: Microsoft]
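The difference between the two mapping schemes can be modeled as follows. The key values, shard names, and range boundaries here are hypothetical examples, not part of the Elastic Database API.

```python
import bisect

# List shard map: each individual key maps directly to a shardlet/shard.
list_shard_map = {
    "tenant-a": "shard-1",
    "tenant-b": "shard-2",
}

# Range shard map: contiguous key ranges map to a shardlet/shard.
# Each boundary is the inclusive lower bound of its range.
range_boundaries = [0, 100, 200]           # [0,100), [100,200), [200,...)
range_shards = ["shard-1", "shard-2", "shard-3"]

def resolve_range(key):
    # Find the range whose lower bound is the largest one <= key.
    idx = bisect.bisect_right(range_boundaries, key) - 1
    return range_shards[idx]

print(list_shard_map["tenant-a"])  # shard-1
print(resolve_range(150))          # shard-2
```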

Also, elastic pools make it possible to add and remove shards as the volume of data shrinks and grows. Client applications can create and delete shards dynamically, and transparently update the shard map manager. However, removing a shard is a destructive operation that also requires deleting all the data in that shard.

Partitioning Azure table storage

Azure table storage is a key-value store that’s designed around partitioning. All entities are stored in a partition, and partitions are managed internally by Azure table storage. Each entity stored in a table must provide a two-part key that includes:

  • Firstly, the partition key. This is a string value that determines the partition where Azure table storage will place the entity. All entities with the same partition key are stored in the same partition.
  • Secondly, the row key. This is a string value that identifies the entity within the partition. All entities within a partition are sorted lexically, in ascending order, by this key. The partition key/row key combination must be unique for each entity and cannot exceed 1 KB in length.
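The two-part key can be modeled with a nested dictionary; this sketch uses hypothetical partition and row keys to show the uniqueness constraint and the per-partition sort order.

```python
# In-memory model of a table: partition key -> {row key -> entity}.
table = {}

def insert_entity(partition_key, row_key, entity):
    partition = table.setdefault(partition_key, {})
    if row_key in partition:
        # The partition key/row key combination must be unique.
        raise ValueError("duplicate partition key/row key combination")
    partition[row_key] = entity

def scan_partition(partition_key):
    # Entities within a partition are sorted lexically by row key.
    partition = table[partition_key]
    return [partition[rk] for rk in sorted(partition)]

insert_entity("Redmond", "cust-0002", {"name": "Bob"})
insert_entity("Redmond", "cust-0001", {"name": "Alice"})
insert_entity("Seattle", "cust-0003", {"name": "Carol"})

print([e["name"] for e in scan_partition("Redmond")])  # ['Alice', 'Bob']
```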

The following diagram shows the logical structure of an example storage account containing three tables: Customer Info, Product Info, and Order Info.

[Figure: The tables and partitions in an example storage account. Image Source: Microsoft]

Each table has multiple partitions.

  • Firstly, in the Customer Info table, the data is partitioned according to the city where the customer is located. The row key contains the customer ID.
  • Secondly, in the Product Info table, products are partitioned by product category, and the row key contains the product number.
  • Lastly, in the Order Info table, the orders are partitioned by order date, and the row key specifies the time the order was received. All data is ordered by the row key in each partition.

Partitioning Azure blob storage

  • Firstly, Azure blob storage makes it possible to hold large binary objects. Use block blobs in scenarios when you need to upload or download large volumes of data quickly.
  • Secondly, each blob (either block or page) is held in a container in an Azure storage account. You can use containers to group related blobs that have the same security requirements.
  • Thirdly, the partition key for a blob is account name + container name + blob name. The partition key is used to partition data into ranges and these ranges are load-balanced across the system.
  • Next, if your naming scheme uses timestamps or numerical identifiers, it can lead to excessive traffic going to one partition, preventing the system from load balancing effectively. For instance, if you have daily operations that use a blob object with a timestamp such as yyyy-mm-dd, all the traffic for that operation would go to a single partition server. Instead, consider prefixing the name with a three-digit hash.
  • Lastly, the actions of writing a single block or page are atomic, but operations that span blocks, pages, or blobs are not.
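The three-digit hash prefix suggested above can be sketched as follows; the choice of MD5 and the `prefix-name` layout are assumptions for the illustration, not a prescribed Azure naming API.

```python
import hashlib

def prefixed_blob_name(name):
    # Hash the original name and keep a stable three-digit bucket,
    # so timestamp-based names spread across the key space instead
    # of landing on one partition server.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % 1000
    return f"{prefix:03d}-{name}"

print(prefixed_blob_name("2023-05-01-daily-report"))
```

Because the prefix is derived from the name itself, the mapping is deterministic: the application can recompute the full blob name at read time without storing a lookup table.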

Partitioning Azure storage queues

  • Firstly, Azure storage queues enable you to implement asynchronous messaging between processes. An Azure storage account can contain any number of queues, and each queue can contain any number of messages; the only limit is the space available in the storage account. The maximum size of an individual message is 64 KB.
  • Secondly, each storage queue has a unique name within the storage account that contains it. Azure partitions queues based on the name. All messages for the same queue are stored in the same partition, which is controlled by a single server.
  • Thirdly, in a large-scale application, don’t use the same storage queue for all instances of the application because this approach might cause the server that’s hosting the queue to become a hot spot. Instead, use different queues for different functional areas of the application.
  • Lastly, an Azure storage queue can handle up to 2,000 messages per second. If you need to process messages at a greater rate than this, consider creating multiple queues.
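Spreading load across multiple queues can be sketched like this; the queue names, queue count, and hash-based routing are illustrative assumptions, not a storage SDK feature.

```python
import hashlib

QUEUE_COUNT = 4
# In-memory stand-ins for four Azure storage queues.
queues = {f"orders-{i}": [] for i in range(QUEUE_COUNT)}

def enqueue(message_id, payload):
    # Hash the message id so a given id always routes to the same
    # queue, spreading throughput across all queue partitions.
    bucket = int(hashlib.md5(message_id.encode()).hexdigest(), 16) % QUEUE_COUNT
    name = f"orders-{bucket}"
    queues[name].append(payload)
    return name

for i in range(8):
    enqueue(f"msg-{i}", {"order": i})

print(sum(len(q) for q in queues.values()))  # 8
```

Each consumer process can then drain one queue, so the aggregate rate scales with the number of queues rather than being capped by a single queue's limit.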

Partitioning Cosmos DB

  • Firstly, Azure Cosmos DB is a NoSQL database that can store JSON documents using the Azure Cosmos DB SQL API. A document in a Cosmos DB database is a JSON-serialized representation of an object or other piece of data.
  • Secondly, documents are organized into collections. You can group related documents together in a collection. For example, in a system that maintains blog postings, you can store the contents of each blog post as a document in a collection.
  • Next, Cosmos DB supports automatic partitioning of data based on an application-defined partition key. A logical partition stores all the data for a single partition key value, so all documents that share the same partition key value are placed in the same logical partition. Cosmos DB distributes data according to the hash of the partition key. A logical partition has a maximum size of 10 GB.
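The hash-based placement described above can be modeled as follows. The hash function and physical partition count are assumptions for the sketch; Cosmos DB manages both internally.

```python
import hashlib

PHYSICAL_PARTITIONS = 4

def physical_partition(partition_key_value):
    # Documents sharing a partition key value form one logical
    # partition; the hash decides which physical partition hosts it.
    h = int(hashlib.sha256(partition_key_value.encode()).hexdigest(), 16)
    return h % PHYSICAL_PARTITIONS

docs = [
    {"id": "1", "tenant": "contoso"},
    {"id": "2", "tenant": "contoso"},
    {"id": "3", "tenant": "fabrikam"},
]

# Both "contoso" documents land in the same logical partition, so
# they are guaranteed to share a physical partition as well.
placements = {d["id"]: physical_partition(d["tenant"]) for d in docs}
```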

If the partitioning mechanism that Cosmos DB provides is not sufficient, you may need to shard the data at the application level. Document collections provide a natural mechanism for partitioning data within a single database, and the simplest way to implement sharding is to create a collection for each shard. Containers are logical resources and can span one or more servers. Fixed-size containers have a maximum size of 10 GB and a throughput limit of 10,000 RU/s; unlimited containers have no maximum storage size but must specify a partition key. With application sharding, the client application must direct requests to the appropriate shard, usually through its own mapping mechanism based on the attributes of the data that define the shard key.


Reference: Microsoft Documentation

