Sharding pattern

In this tutorial, we will learn about the sharding pattern: dividing a data store into a set of horizontal partitions, or shards. Sharding can improve scalability when you store and access large volumes of data.

A data store hosted by a single server might be subject to the following limitations:

  • Firstly, Storage space. A data store for a large-scale cloud application is expected to contain a huge volume of data that could increase significantly over time. A server typically provides only a finite amount of disk storage. You can replace existing disks with larger ones, or add further disks as data volumes grow, but the machine will eventually reach a hard limit.
  • Secondly, Computing resources. A cloud application might be required to support a large number of concurrent users, each of whom runs queries that retrieve information from the data store. A single server might not be able to provide the computing power needed to support this load.
  • Thirdly, Network bandwidth. Ultimately, the performance of a data store running on a single server is governed by the rate at which the server can receive requests and send replies.
  • Lastly, Geography. It might be necessary to store data generated by specific users in the same region as those users for legal, compliance, or performance reasons, or to reduce latency of data access.

Solution

Divide the data store into horizontal partitions, or shards. Each shard has the same schema as the others, but it holds its own distinct subset of the data. A shard is a data store in its own right (it can contain data for many different types of entity), running on a server that acts as a storage node.

This pattern has the following benefits:

  • Firstly, you can scale the system out by adding further shards running on additional storage nodes.
  • Secondly, a system can use off-the-shelf hardware rather than specialized and expensive computers for each storage node.
  • Thirdly, you can reduce contention and improve performance by balancing the workload across shards.
  • Lastly, in the cloud, shards can be located physically close to the users that’ll access the data.

When partitioning a data store into shards, decide which data should be placed in each shard. A shard typically contains items that fall within a range determined by one or more attributes of the data. These attributes form the shard key (sometimes referred to as the partition key). The shard key should be static; it shouldn't be based on data that might change.

To achieve the best performance and scalability, partition the data in a way that matches the types of queries the application performs. In most cases it's unlikely that the sharding scheme will exactly match the requirements of every query. If your queries frequently retrieve data using a combination of attribute values, you can most likely define a composite shard key by linking the attributes together, as in the sketch below.
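
For illustration, the following C# sketch builds a composite shard key from two stable attributes. The CustomerRecord type and the Region and TenantId attribute names are assumptions made for this example, not part of any specific library.

    // Minimal sketch: a composite shard key built from two stable attributes.
    // CustomerRecord, Region and TenantId are illustrative assumptions.
    public sealed record CustomerRecord(string TenantId, string Region, string CustomerId);

    public static class ShardKey
    {
        // Concatenate the attributes that queries filter on most often.
        // The values must be static: they should never change for an existing item.
        public static string For(CustomerRecord record) =>
            $"{record.Region}:{record.TenantId}";
    }

A key such as "EU:tenant-42" then determines which shard an item belongs to, and any query that supplies both attributes can be routed to a single shard.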

Sharding strategies

There are three common strategies for determining the shard key and deciding how data should be distributed among the shards; a minimal routing sketch covering all three appears after the list below.

  • Firstly, the Lookup strategy. In this strategy, the sharding logic implements a map that uses the shard key to route a request for data to the shard that contains that data. In a multi-tenant application, multiple tenants might share the same shard, but the data for a single tenant won't be spread across multiple shards. The figure illustrates sharding tenant data based on tenant IDs.
Sharding tenant data based on tenant IDs
Image Source: Microsoft
  • Secondly, the Range strategy. This strategy groups related items together in the same shard and orders them by shard key (the shard keys are sequential). It's useful for applications that frequently retrieve sets of items using range queries, that is, queries that return a set of data items for a shard key that falls within a given range. The next figure illustrates storing sequential sets (ranges) of data in shards.
Storing sequential sets (ranges) of data in shards
Image Source: Microsoft
  • Lastly, the Hash strategy. The purpose of this strategy is to reduce the chance of hotspots (shards that receive a disproportionate amount of load). It distributes the data across the shards in a way that achieves a balance between the size of each shard and the average load that each shard will encounter. The sharding logic computes the shard to store an item in based on a hash of one or more attributes of the data. The next figure illustrates sharding tenant data based on a hash of tenant IDs.
Sharding tenant data based on a hash of tenant IDs
Image Source: Microsoft
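
The routing logic behind each strategy can be summarised in a few lines. The following C# sketch is illustrative only: it assumes four physical shards, an in-memory dictionary for the lookup map, and hard-coded range boundaries; in a real system the shard map would live in a highly available store.

    using System.Collections.Generic;

    public static class ShardRouting
    {
        const int ShardCount = 4; // assumption: four physical shards

        // Lookup strategy: an explicit map from shard key (here a tenant ID) to a shard.
        static readonly Dictionary<string, int> TenantToShard = new()
        {
            ["tenant-a"] = 0,
            ["tenant-b"] = 1,
            ["tenant-c"] = 1, // several tenants can share one shard
        };

        public static int LookupShard(string tenantId) => TenantToShard[tenantId];

        // Range strategy: sequential keys are grouped, so a range check picks the shard
        // (the boundaries are illustrative).
        public static int RangeShard(int sequentialKey) => sequentialKey switch
        {
            < 1000 => 0,
            < 2000 => 1,
            < 3000 => 2,
            _ => 3
        };

        // Hash strategy: hash the shard key and take the remainder modulo the shard count.
        // string.GetHashCode is not stable across processes, so a deterministic hash
        // (FNV-1a here) is used instead.
        public static int HashShard(string shardKey)
        {
            uint hash = 2166136261;
            foreach (char c in shardKey)
            {
                hash = (hash ^ c) * 16777619u;
            }
            return (int)(hash % ShardCount);
        }
    }
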
The three sharding strategies have the following advantages and considerations:
  • Firstly, Lookup. This offers more control over the way that shards are configured and used. Using virtual shards reduces the impact when rebalancing data because new physical partitions can be added to even out the workload. The mapping between a virtual shard and the physical partitions that implement the shard can be modified without affecting application code that uses a shard key to store and retrieve data. However, looking up shard locations can impose an additional overhead.
  • Secondly, Range. This is easy to implement and works well with range queries because they can often fetch multiple data items from a single shard in a single operation. This strategy offers easier data management. However, this strategy doesn’t provide optimal balancing between shards. Rebalancing shards is difficult and might not resolve the problem of uneven load if the majority of activity is for adjacent shard keys.
  • Lastly, Hash. This strategy offers a better chance of more even data and load distribution.

Scaling and data movement operations

  • Firstly, each of the sharding strategies implies different capabilities and levels of complexity for managing scale in, scale out, data movement, and maintaining state.
  • Secondly, the Lookup strategy permits scaling and data movement operations to be carried out at the user level, either online or offline. The technique is to suspend some or all user activity (perhaps during off-peak periods), move the data to the new virtual partition or physical shard, change the mappings, invalidate or refresh any caches that hold this data, and then allow user activity to resume.
  • Thirdly, the Range strategy imposes some limitations on scaling and data movement operations. This must typically be carried out when a part or all of the data store is offline because the data must be split and merged across the shards. Moving the data to rebalance shards might not resolve the problem of uneven load if the majority of activity is for adjacent shard keys or data identifiers that are within the same range.
  • Lastly, the Hash strategy makes scaling and data movement operations more complex because the partition keys are hashes of the shard keys or data identifiers.
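
As a concrete illustration of the Lookup rebalancing sequence, the sketch below strings the steps together. The IShardMap, IDataMover and IShardMapCache interfaces are hypothetical placeholders invented for this example, and a real implementation would also need to handle failures partway through the move.

    using System.Threading.Tasks;

    // Hypothetical interfaces, invented for this sketch.
    public interface IShardMap
    {
        Task MarkOfflineAsync(string virtualShard);                       // suspend activity
        Task UpdateMappingAsync(string virtualShard, string newPhysical); // change the mapping
        Task MarkOnlineAsync(string virtualShard);                        // resume activity
    }

    public interface IDataMover
    {
        Task CopyAsync(string virtualShard, string fromPhysical, string toPhysical);
    }

    public interface IShardMapCache
    {
        Task InvalidateAsync(string virtualShard);
    }

    public static class ShardRebalancer
    {
        // Move one virtual shard to a new physical partition, following the steps in
        // the text: suspend, move the data, change the mapping, invalidate caches, resume.
        public static async Task MoveVirtualShardAsync(
            IShardMap map, IDataMover mover, IShardMapCache cache,
            string virtualShard, string fromPhysical, string toPhysical)
        {
            await map.MarkOfflineAsync(virtualShard);
            await mover.CopyAsync(virtualShard, fromPhysical, toPhysical);
            await map.UpdateMappingAsync(virtualShard, toPhysical);
            await cache.InvalidateAsync(virtualShard);
            await map.MarkOnlineAsync(virtualShard);
        }
    }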

When to use this pattern

Use this pattern when a data store is likely to need to scale beyond the resources available to a single storage node, or when you want to improve performance by reducing contention in the data store.

Example

  • The following C# example uses a sharded set of SQL Server databases. Each database holds a subset of an application's data.
  • Using its own sharding logic, the application retrieves data that is distributed across the shards (this is an example of a fan-out query).
  • A method named GetShards returns the details of the shards.
  • This method returns an enumerable list of ShardInformation objects, where each object contains an identifier for a shard and the SQL Server connection string that an application should use to connect to that shard.
Sharding pattern
Image Source: Microsoft
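
The referenced code isn't reproduced here, but based on the description it might look roughly like the sketch below. The ShardInformation type and the GetShards method come from the bullets above; the connection strings, table and column names are placeholder assumptions.

    using System.Collections.Generic;
    using Microsoft.Data.SqlClient;

    public record ShardInformation(string ShardId, string ConnectionString);

    public static class ShardCatalog
    {
        // Returns the details of every shard. Hard-coded here for brevity; the list
        // could equally come from configuration or a shard-map database.
        public static IEnumerable<ShardInformation> GetShards() => new[]
        {
            new ShardInformation("Shard0", "Server=shard0.example.net;Database=AppShard0;Integrated Security=true"),
            new ShardInformation("Shard1", "Server=shard1.example.net;Database=AppShard1;Integrated Security=true"),
        };
    }

    public static class FanOutQuery
    {
        // Runs the same query against every shard and concatenates the results
        // (a simple fan-out query; the Orders table and OrderId column are placeholders).
        public static IEnumerable<string> GetAllOrderIds()
        {
            foreach (var shard in ShardCatalog.GetShards())
            {
                using var connection = new SqlConnection(shard.ConnectionString);
                connection.Open();

                using var command = new SqlCommand("SELECT OrderId FROM Orders", connection);
                using var reader = command.ExecuteReader();
                while (reader.Read())
                {
                    yield return reader.GetString(0); // assumes OrderId is stored as a string
                }
            }
        }
    }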

Reference: Microsoft Documentation
