Cluster size and Autoscaling

The Microsoft DP-200 exam is being retired on June 30, 2021. A replacement exam, Data Engineering on Microsoft Azure Beta (DP-203), is available.

In this tutorial, we will look at cluster size and autoscaling. When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers. When you provide a fixed-size cluster, Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is known as autoscaling.

Using autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds workers during these phases of your job. Autoscaling also makes it easier to achieve high cluster utilization, because you don't need to provision the cluster to match a workload.
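
As an illustration, here is a minimal sketch of an autoscaling cluster as it might be defined in a Clusters API create request; the cluster name, node type, and worker counts are placeholder values. A fixed-size cluster would instead specify a single num_workers value, as in the local disk encryption example later in this tutorial.

JSON
{
  "cluster_name": "autoscaling-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}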

Autoscaling types

Azure Databricks offers two types of cluster node autoscaling: standard and optimized. Automated (job) clusters always use optimized autoscaling.

  • However, the type of autoscaling performed on all-purpose clusters depends on the workspace configuration.
  • Secondly, standard autoscaling is used by all-purpose clusters in workspaces in the Standard pricing tier; all-purpose clusters in Premium tier workspaces use optimized autoscaling.

How autoscaling behaves

Autoscaling behaves differently depending on whether it is optimized or standard and whether it is applied to an all-purpose or a job cluster.

Optimized autoscaling
  • Firstly, scales up from min to max in 2 steps.
  • Secondly, can scale down even if the cluster is not idle by looking at shuffle file state.
  • Thirdly, scales down based on a percentage of current nodes.
  • And, on job clusters, scales down if the cluster is underutilized over the last 40 seconds.
  • Lastly, on all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.

Standard autoscaling
  • Firstly, starts with adding 8 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max.
  • Secondly, scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes.
  • Lastly, scales down exponentially, starting with 1 node.

Enable and configure autoscaling

To allow Azure Databricks to resize your cluster automatically, you enable autoscaling for the cluster and provide the min and max range of workers.

  • Firstly, enable autoscaling.
    • All-Purpose cluster – On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.
    • Job cluster – On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.
  • Secondly, configure the min and max workers.

Autoscaling local storage

  • Firstly, it can often be difficult to estimate how much disk space a particular job will take. To save you from having to estimate how many gigabytes of managed disk to attach to your cluster at creation time, Azure Databricks automatically enables autoscaling local storage on all Azure Databricks clusters.
  • Secondly, with autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new managed disk to the worker before it runs out of disk space.
  • Lastly, the managed disks attached to a virtual machine are detached only when the virtual machine is returned to Azure. That is, managed disks are never detached from a virtual machine as long as it is part of a running cluster.

Spark configuration

To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.

  • Firstly, on the cluster configuration page, click the Advanced Options toggle.
  • Secondly, click the Spark tab.

And, when you configure a cluster using the Clusters API, set Spark properties in the spark_conf field in the Create cluster request or Edit cluster request.
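
For example, a minimal sketch of a Create cluster request that sets a single Spark property through spark_conf might look like this; the cluster name, node type, and worker count are placeholder values.

JSON
{
  "cluster_name": "single-conf-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "num_workers": 2,
  "spark_conf": {
    "spark.sql.sources.partitionOverwriteMode": "DYNAMIC"
  }
}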

However, to set Spark properties for all clusters, create a global init script:

Scala
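// Write a global init script to DBFS; the script sets a Spark driver default that applies to every cluster in the workspace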
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh", """
  |#!/bin/bash
  |
  |cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
  |[driver] {
  |  "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
  |}
  |EOF
  """.stripMargin, true)

Enable local disk encryption

Some instance types you use to run clusters may have locally attached disks. Azure Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local disks, you can enable local disk encryption.

However, when local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to each cluster node and is used to encrypt all data stored on local disks. The scope of the key is local to each cluster node and is destroyed along with the cluster node itself.

Further, to enable local disk encryption, you must use the Clusters API. During cluster creation or edit, set:

JSON
{
  "enable_local_disk_encryption": true
}

Here is an example of a cluster create call that enables local disk encryption:

JSON
{
  "cluster_name": "my-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "enable_local_disk_encryption": true,
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 25
}

Environment variables

You can set environment variables that scripts running on a cluster can access; a Clusters API sketch follows the steps below.

  • Firstly, on the cluster configuration page, click the Advanced Options toggle.
  • Then, click the Spark tab.
  • Lastly, set the environment variables in the Environment Variables field.
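
If you create the cluster through the Clusters API instead, the same variables can be supplied in the spark_env_vars field. A minimal sketch, with a placeholder variable name and value:

JSON
{
  "cluster_name": "env-var-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "num_workers": 2,
  "spark_env_vars": {
    "MY_ENV_VAR": "some-value"
  }
}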

Cluster tags

Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. Moreover, you can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes. For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId. In addition, on job clusters, Azure Databricks applies two default tags: RunName and JobId.

Further, you can add custom tags when you create a cluster; a Clusters API sketch follows the steps below. To configure cluster tags:

  • Firstly, on the cluster configuration page, click the Advanced Options toggle.
  • Secondly, at the bottom of the page, click the Tags tab.
  • Then, add a key-value pair for each custom tag. You can add up to 43 custom tags.
  • Lastly, custom tags are displayed on Azure bills and updated whenever you add, edit, or delete a custom tag.
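
Through the Clusters API, the same tags go in the custom_tags field. A minimal sketch, with placeholder tag keys and values:

JSON
{
  "cluster_name": "tagged-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "num_workers": 2,
  "custom_tags": {
    "CostCenter": "1234",
    "Team": "data-engineering"
  }
}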

Cluster log delivery

When you create a cluster, you can specify a location to deliver Spark driver, worker, and event logs. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated. The destination of the logs depends on the cluster ID: for example, if the specified destination is dbfs:/cluster-log-delivery, the logs for a given cluster are delivered to dbfs:/cluster-log-delivery/<cluster-id>.

Further, to configure the log delivery location (a Clusters API sketch follows the steps below):

  • Firstly, on the cluster configuration page, click the Advanced Options toggle.
  • Secondly, at the bottom of the page, click the Logging tab.
  • Then, select a destination type.
  • Lastly, enter the cluster log path.
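
Equivalently, through the Clusters API the log location can be supplied in the cluster_log_conf field. A minimal sketch, with a placeholder DBFS path:

JSON
{
  "cluster_name": "logging-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "num_workers": 2,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}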

Reference: Microsoft Documentation
