Data storage allocation

In this tutorial, we will learn about data storage allocation best practices for Google Cloud Storage.

Naming

  • Firstly, if you need a lot of buckets, use GUIDs or an equivalent for bucket names, put retry logic in your code to handle name collisions, and keep a list to cross-reference your buckets. Another option is to use domain-named buckets and manage the bucket names as sub-domains.
  • Secondly, don’t use user IDs, email addresses, project names, project numbers, or any personally identifiable information (PII) in bucket names. This is because anyone can probe for the existence of a bucket. Similarly, be very careful with putting PII in your object names, because object names appear in URLs for the object.
  • Lastly, avoid using sequential filenames, such as timestamp-based filenames, if you are uploading many files in parallel. Files with sequential names are stored consecutively, so they are likely to hit the same backend server; when this happens, throughput is constrained. To achieve optimal throughput, add a hash of the sequence number to the filename to make it non-sequential, as in the sketch after this list.
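
As a rough illustration of both naming points, here is a minimal sketch using the google-cloud-storage Python client (the client library is an assumption of this sketch, and the bkt- prefix, .log suffix, and retry count are made-up illustration values, not part of the guidance):

    import hashlib
    import uuid

    from google.api_core.exceptions import Conflict
    from google.cloud import storage

    client = storage.Client()

    # GUID-style bucket name: globally unique and free of PII. Keep your own
    # cross-reference list of which GUID bucket is used for what.
    for _ in range(3):
        bucket_name = f"bkt-{uuid.uuid4().hex}"
        try:
            bucket = client.create_bucket(bucket_name)
            break
        except Conflict:
            continue  # name collision: generate a new GUID and retry
    else:
        raise RuntimeError("could not find an unused bucket name")

    def non_sequential_name(sequence_number: int, suffix: str = ".log") -> str:
        """Prefix a sequential name with a short hash so parallel uploads
        spread across backend servers instead of hitting the same one."""
        digest = hashlib.md5(str(sequence_number).encode()).hexdigest()[:6]
        return f"{digest}-{sequence_number}{suffix}"

    print(non_sequential_name(1))  # e.g. 'c4ca42-1.log'

With the hash prefix, files uploaded in parallel no longer sort consecutively, so they spread across backend servers.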

Traffic

  • Firstly, perform a back-of-the-envelope estimation of the amount of traffic that will be sent to Cloud Storage. Specifically, think about:
    • Operations per second. How many operations per second do you expect, for both buckets and objects, and for create, update, and delete operations?
    • Bandwidth. How much data will be sent, over what time frame?
    • Cache control. Specifying the Cache-Control metadata on objects improves read latency for hot or frequently accessed objects (see the Cache-Control sketch after this list).
  • Secondly, design your application to minimize spikes in traffic. If there are clients of your application doing updates, spread them out throughout the day.
  • Thirdly, while Cloud Storage has no upper bound on the request rate, for the best performance when scaling to high request rates, follow the Request Rate and Access Distribution Guidelines.
  • Next, be aware that there are rate limits for certain operations and design your application accordingly.
If you do get an error:
  1. Firstly, use exponential backoff as part of your retry strategy to avoid problems due to large traffic bursts (see the backoff sketch after this list).
  2. Secondly, retry using a new connection and possibly re-resolve the domain name.
  • After that, if your application is latency-sensitive, use hedged requests. Hedged requests allow you to retry faster and cut down on tail latency; a hedging sketch follows this list.
  • Lastly, understand the performance level customers expect from your application. This information helps you choose a storage option and region when creating new buckets.
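
For the Cache-Control point above, a minimal sketch with the google-cloud-storage Python client (an assumed client library; the bucket, object name, and max-age value are hypothetical):

    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("my-bucket").blob("hot/logo.png")  # hypothetical names

    # Let caches serve this object for up to an hour, improving read latency
    # on frequently accessed objects.
    blob.cache_control = "public, max-age=3600"
    blob.patch()  # push the metadata change to Cloud Storage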
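
For the exponential backoff recommendation, here is one possible sketch, again assuming the google-cloud-storage Python client; the retried exceptions and attempt count are illustrative choices, not prescribed by the tutorial:

    import random
    import time

    from google.api_core import exceptions
    from google.cloud import storage

    def upload_with_backoff(bucket_name: str, blob_name: str, path: str,
                            max_attempts: int = 5) -> None:
        """Retry transient errors with exponential backoff plus random jitter."""
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        for attempt in range(max_attempts):
            try:
                blob.upload_from_filename(path)
                return
            except (exceptions.TooManyRequests, exceptions.ServiceUnavailable):
                if attempt == max_attempts - 1:
                    raise
                # Wait 1s, 2s, 4s, ... plus jitter before retrying.
                time.sleep(2 ** attempt + random.random())

Note that recent versions of the client library already retry some transient errors by default, so an explicit loop like this is mainly useful when you need custom backoff behaviour.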
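
And for hedged requests, a rough sketch of the idea: duplicate a slow, idempotent read and take whichever response arrives first. The 200 ms hedge threshold and names are arbitrary illustration values:

    import concurrent.futures

    from google.cloud import storage

    def hedged_download(bucket_name: str, blob_name: str,
                        hedge_after: float = 0.2) -> bytes:
        """Issue a duplicate read if the first is still running after
        `hedge_after` seconds, and return whichever response arrives first."""
        client = storage.Client()

        def fetch() -> bytes:
            return client.bucket(bucket_name).blob(blob_name).download_as_bytes()

        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        first = pool.submit(fetch)
        done, _ = concurrent.futures.wait([first], timeout=hedge_after)
        if not done:
            # Hedge: send an identical second request and race the two.
            second = pool.submit(fetch)
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED)
        result = done.pop().result()
        pool.shutdown(wait=False)  # let the losing request finish in the background
        return result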

Regions & data storage options

  • Firstly, data that will be served at a high rate with high availability should use the Standard Storage class. This class provides the best availability with the trade-off of a higher price.
  • Secondly, data that will be infrequently accessed and can tolerate slightly lower availability can be stored using the Nearline Storage, Coldline Storage, or Archive Storage class.
  • Thirdly, store your data in the region closest to your application’s users. For instance, for EU data you might choose an EU bucket, and for US data you might choose a US bucket (a bucket-creation sketch follows this list).
  • Lastly, keep compliance requirements in mind when choosing a location for user data. Are there legal requirements around the locations in which your users will be providing data?
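
Storage class and location are chosen at bucket creation; a minimal sketch with the google-cloud-storage Python client (an assumed library; the bucket name is hypothetical):

    from google.cloud import storage

    client = storage.Client()

    # Hypothetical bucket for infrequently accessed EU user data.
    bucket = storage.Bucket(client, name="example-eu-archive")
    bucket.storage_class = "NEARLINE"  # STANDARD for hot data; COLDLINE/ARCHIVE for colder
    client.create_bucket(bucket, location="EU")  # keep the data close to EU users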

Uploading data

  • Firstly, if you use XMLHttpRequest (XHR) callbacks to get progress updates, do not close and re-open the connection if you detect that progress has stalled. Doing so creates a bad positive feedback loop during times of network congestion.
  • Secondly, for upload traffic, we recommend setting reasonably long timeouts. For a good end-user experience, you can set a client-side timer that updates the client status window with a message when your application hasn’t received an XHR callback for a long time.
  • Thirdly, if you use Compute Engine instances with processes that POST to Cloud Storage to initiate a resumable upload, use Compute Engine instances in the same locations as your Cloud Storage buckets. You can then use a geo IP service to pick the Compute Engine region to which you route customer requests, which helps keep traffic localized to a geo-region.
  • For resumable uploads, the resumable session should stay in the region in which it was created.
  • After that, avoid breaking a transfer into smaller chunks if possible and instead upload the entire content in a single chunk. Avoiding chunking removes added latency costs from committed offset queries for each chunk and improves throughput.
  • Lastly, an easy and convenient way to reduce the bandwidth needed for each request is to enable gzip compression. Although this requires additional CPU time to uncompress the results, the trade-off with network costs usually makes it very worthwhile (see the sketch after this list).
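
A sketch of the gzip point, assuming the google-cloud-storage Python client; the object name and payload are made up, and blob.chunk_size is left at its default so the client does not split this upload into client-side chunks:

    import gzip
    import json

    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("my-bucket").blob("reports/daily.json")  # hypothetical names

    payload = json.dumps({"rows": list(range(1000))}).encode("utf-8")

    # Store the object gzip-compressed; Cloud Storage records the encoding and
    # can serve it decompressed to clients that do not accept gzip.
    blob.content_encoding = "gzip"
    blob.upload_from_string(gzip.compress(payload), content_type="application/json")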

Deleting data

If you are concerned that your application software or users might erroneously delete or replace objects at some point, Cloud Storage has features that help you protect your data:

  • Firstly, a retention policy that specifies a retention period can be placed on a bucket. An object in the bucket cannot be deleted or replaced until it reaches the specified age.
  • Secondly, an object hold can be placed on individual objects to prevent anyone from deleting or replacing the object until the hold is removed.
  • Then, object versioning can be enabled on a bucket in order to retain older versions of objects. When the live version of an object is deleted or replaced, it becomes noncurrent if versioning is enabled on the bucket. If you accidentally delete a live object version, you can copy the noncurrent version back to the live version.
  • Lastly, Object Versioning increases storage costs, but this can be partially mitigated by configuring Object Lifecycle Management to delete older object versions. A sketch of these protection features follows this list.
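
These protection features map to a few metadata updates in the google-cloud-storage Python client (an assumed library; bucket names, object names, the retention period, and the generation number are hypothetical; note that a retention policy and object versioning cannot be combined on the same bucket):

    from google.cloud import storage

    client = storage.Client()

    # Retention policy: objects in this bucket cannot be deleted or replaced
    # until they are 30 days old.
    locked = client.get_bucket("example-locked-bucket")  # hypothetical bucket
    locked.retention_period = 30 * 24 * 60 * 60  # seconds
    locked.patch()

    # Object hold: block deletion or replacement of one object until released.
    blob = locked.blob("invoices/2021-01.pdf")
    blob.temporary_hold = True
    blob.patch()

    # Object versioning (on a different bucket, since versioning and retention
    # policies are mutually exclusive): keep noncurrent versions on overwrite.
    versioned = client.get_bucket("example-versioned-bucket")
    versioned.versioning_enabled = True
    versioned.patch()

    # Restore an accidentally deleted object by copying a noncurrent
    # generation back over the live name.
    noncurrent = versioned.blob("reports/summary.csv")
    versioned.copy_blob(noncurrent, versioned, "reports/summary.csv",
                        source_generation=1610000000000000)  # hypothetical generation
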
If you want to bulk delete a hundred thousand or more objects, avoid using gsutil, as the process takes a long time to complete. Instead, use one of the following options:
  • Firstly, the Cloud Console can bulk delete up to several million objects and does so in the background. The Cloud Console can also be used to bulk delete only those objects that share a common prefix.
  • Secondly, Object Lifecycle Management can bulk delete any number of objects. To bulk delete objects in your bucket, set a lifecycle configuration rule on your bucket where the condition has Age set to 0 days and the action is set to Delete (see the sketch after this list).
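
A minimal sketch of the Age = 0 bulk-delete rule with the google-cloud-storage Python client (an assumed library; the bucket name is hypothetical):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-bucket")  # hypothetical bucket

    # Age = 0 days + Delete action: lifecycle management bulk deletes every
    # object in the bucket in the background.
    bucket.add_lifecycle_delete_rule(age=0)
    bucket.patch()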

Object listing

Object listing can become temporarily very slow and, as a result, you may experience an increased rate of 5xx errors after deleting millions of objects in a bucket. This can happen whether you send the deletion requests yourself or delete through Object Lifecycle Management rules. The behavior occurs because the deleted records are not purged from the underlying storage system immediately, so object listing has to skip over the deleted records when finding the objects to return. Eventually the deleted records are removed from the underlying storage system and object listing performance returns to normal. This typically takes a few hours, but in some cases may take a few days.

Reference: Google Documentation
