The Migration Process – Google Professional Data Engineer (GCP)


 

A migration is a journey that involves several phases, with multiple options for reaching the destination.


There are four phases of migration:

  • Assess
    • Perform assessment and discovery of the existing environment
    • Understand the app and environment inventory
    • Identify app dependencies and requirements
    • Perform total cost of ownership (TCO) and app performance benchmarks
  • Plan
    • Create the basic cloud infrastructure for workloads
    • Plan how to move apps
    • Planning covers identity management, organization and project structure, networking, sorting apps, and a prioritized migration strategy
  • Deploy
    • Design, implement, and execute the migration
    • Refine cloud resources as needed
  • Optimize
    • Analyze and optimize cloud resource utilization
    • Reduce costs
    • Implement automation, ML, and AI services

Assess Phase

  • Build an inventory of apps – work with the teams that own each workload in the current environment.
  • The inventory should include:
    • Apps
    • Dependencies of each app
    • Services supporting the app infrastructure
    • Server configurations
    • Network devices, firewalls, and other dedicated hardware
  • For each item, gather:
    • Source code location
    • Deployment method
    • Network restrictions or security requirements
    • Licensing requirements
  • Categorize apps
    • Categorize to prioritize which apps to migrate first
    • Also understand the complexity and risk involved
    • A catalog matrix is used for this purpose (see the sketch below)
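For illustration only, a catalog matrix can be kept as simple structured data and sorted to pick the first migration wave. The app names, scores, and thresholds below are hypothetical assumptions, not part of any official guidance.

```python
# Minimal sketch of a catalog matrix used to prioritize migration candidates.
# App names, scores, and thresholds are hypothetical examples.

apps = [
    # name, business criticality (1-5), migration complexity (1-5), dependencies
    {"name": "reporting-ui", "criticality": 2, "complexity": 1, "dependencies": ["reports-db"]},
    {"name": "billing-api", "criticality": 5, "complexity": 4, "dependencies": ["billing-db", "ldap"]},
    {"name": "batch-etl", "criticality": 3, "complexity": 2, "dependencies": ["warehouse"]},
]

def categorize(app):
    """Assign a rough migration category from complexity and criticality."""
    if app["complexity"] <= 2 and app["criticality"] <= 3:
        return "quick win: migrate first"
    if app["complexity"] >= 4 and app["criticality"] >= 4:
        return "high risk: migrate last, plan carefully"
    return "standard: migrate in a middle wave"

# Sort so the simplest, lowest-risk apps come out first.
for app in sorted(apps, key=lambda a: (a["complexity"], -a["criticality"])):
    print(f"{app['name']:12} -> {categorize(app)}")
```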

Transferring large datasets

Transferring large datasets involves several steps:

  • building the right team
  • planning early
  • testing transfer plan before implementing

Data transfer

  • The process of moving data without transforming it
  • It involves:
    • Making a transfer plan to decide on a transfer option and obtain approvals
    • Coordinating the team that executes the transfer
    • Choosing the right transfer tool based on resources, cost, and time
    • Overcoming data transfer challenges, such as insufficient bandwidth, moving actively used datasets, protecting and monitoring the data during transfer, and ensuring a successful transfer
  • Other types of data transfer projects:
    • For ETL transformations, use Dataflow.
    • To migrate a database and related apps, use Cloud Spanner.
    • For virtual machine (VM) instance migration, use Migrate for Compute Engine.

 

Step 1: Assembling the team

Planning a transfer typically requires personnel with the following roles and responsibilities:

  • Storage, IT, and network admins to execute the transfer
  • Data owners or governors and legal staff to approve the transfer

Step 2: Collecting requirements and available resources

  • Identify the datasets to move.
    • Use Data Catalog to organize data into logical groupings.
    • Work with teams to update these groupings.
  • Identify the datasets you can move.
    • Check whether any regulatory, security, or other factors prohibit the transfer.
    • Remove sensitive data or reorganize data as needed, using Dataflow, Cloud Data Fusion, or Cloud Composer.
  • For movable datasets, decide where to transfer each dataset.
    • Select a storage option for the data.
    • Understand the data access policies to maintain after migration.
    • Note any region- or geography-specific requirements.
    • Define the data structure in the destination.
    • Decide whether the transfer is ongoing or one-off.
  • For movable datasets, also record the following:
    • Time: when to transfer
    • Cost: the budget available
    • People: who will execute the transfer
    • Bandwidth: the bandwidth available (for online transfers)

Step 3: Evaluating transfer options

Data transfer options are selected based on the following factors:

  • Cost
  • Time
  • Offline versus online transfer options
  • Transfer tools and technologies
  • Security

 

Cost:

It includes the following (a rough estimate sketch follows this list):

  • Networking costs
    • Egress charges, if any
    • Bandwidth charges for the transfer
  • Storage and operation costs for Cloud Storage during and after the data transfer
  • Personnel costs for support
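As a rough illustration of how these cost items add up, the sketch below multiplies data volume by per-GB rates. The volumes and rates are placeholder assumptions, not actual GCP or ISP pricing.

```python
# Rough cost estimate for a transfer; all rates are placeholder assumptions.
# Check current GCP and ISP pricing before planning.

data_tb = 50                      # data volume to move, in TB
egress_rate_per_gb = 0.08         # assumed source-side egress charge, USD/GB
storage_rate_per_gb_month = 0.02  # assumed Cloud Storage rate, USD/GB/month
months_stored = 1

data_gb = data_tb * 1024
egress_cost = data_gb * egress_rate_per_gb
storage_cost = data_gb * storage_rate_per_gb_month * months_stored

print(f"Egress (source side): ${egress_cost:,.0f}")
print(f"Cloud Storage ({months_stored} month): ${storage_cost:,.0f}")
print(f"Total (excluding personnel): ${egress_cost + storage_cost:,.0f}")
```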

 

Time:

  • Time needed for the transfer (see the estimate sketch after this list)
  • When to undertake the transfer
  • Connection options for data transfer between a private data center and GCP:
    • A public internet connection by using a public API
    • Direct Peering by using a public API
    • Cloud Interconnect by using a private API
Connecting with a public internet connection –

  • Less predictable
  • Dependent on ISP capacity
  • Low cost
  • Google offers peering arrangements where applicable

 

Connecting with Direct Peering –

  • Access the GCP network with fewer network hops
  • Direct Peering connects the ISP network and Google’s Edge Points of Presence (PoPs)
  • A registered Autonomous System (AS) number must be set up, along with around-the-clock contact with a network operations center.

Connecting with Cloud Interconnect –

  • Cloud Interconnect is a direct connection to GCP through Cloud Interconnect service providers.
  • No need to send data over the public internet
  • More consistent throughput for large data transfers
  • SLAs for network availability and performance

Online versus offline transfer –

  • Decide whether to transfer data over a network (online) or by shipping storage hardware (offline).

 

Deciding among Google’s transfer options

Factors for choosing a transfer option – where the data is moving from, the scenario, and the suggested product (a decision sketch follows the list below):

  • Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud – Storage Transfer Service
  • Cloud Storage to Cloud Storage (two different buckets) – Storage Transfer Service
  • Private data center to Google Cloud, enough bandwidth to meet the project deadline, less than a few TB of data – gsutil
  • Private data center to Google Cloud, enough bandwidth to meet the project deadline, more than a few TB of data – Storage Transfer Service for on-premises data
  • Private data center to Google Cloud, not enough bandwidth to meet the project deadline – Transfer Appliance
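The decision logic above can also be expressed as a small helper function. The "few TB" threshold and the scenario flags below are illustrative assumptions rather than official guidance.

```python
# Sketch of the decision logic in the list above. The "few TB" threshold and
# the bandwidth check are illustrative assumptions.

def suggest_transfer_product(source, data_tb=0.0, enough_bandwidth=True):
    """Suggest a transfer product for a given source and scenario."""
    if source in ("aws", "azure", "other_cloud"):
        return "Storage Transfer Service"
    if source == "cloud_storage":
        return "Storage Transfer Service"  # bucket-to-bucket copy
    if source == "private_data_center":
        if not enough_bandwidth:
            return "Transfer Appliance"
        if data_tb < 3:                    # "less than a few TB"
            return "gsutil"
        return "Storage Transfer Service for on-premises data"
    raise ValueError(f"unknown source: {source}")

print(suggest_transfer_product("private_data_center", data_tb=200,
                               enough_bandwidth=False))
# -> Transfer Appliance
```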

 

gsutil

  • Suitable for smaller transfers of on-premises data (less than a few TB)
  • Included in the default path when using Cloud Shell
  • Provided by default with the Cloud SDK
  • Manages Cloud Storage buckets and objects
  • Functions provided:
    • Copying data to and from the local file system and Cloud Storage
    • Moving and renaming objects
    • Performing real-time incremental syncs
  • Usage scenarios:
    • Transfers on an as-needed basis, or during command-line sessions by users
    • Transferring only a few files, very large files, or both
    • Consuming the output of a program, such as streaming output to Cloud Storage
    • Watching and syncing a directory with a moderate number of files
  • To use gsutil, create a Cloud Storage bucket and copy data to it.
  • For security, use HTTPS.
  • For large dataset transfers (see the sketch after this list):
    • Use gsutil -m for multi-threaded transfers.
    • Use composite transfers for a single large file; this breaks the file into smaller chunks to increase transfer speed.
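A minimal sketch of driving gsutil from a script for a large transfer. The bucket name, local paths, and the 150 MB composite-upload threshold are example assumptions.

```python
# Minimal sketch: drive gsutil from Python for a large transfer.
# Bucket name, local paths, and the 150M threshold are example assumptions.
import subprocess

BUCKET = "gs://example-migration-bucket"

# Multi-threaded/multi-process copy of a whole directory (-m).
subprocess.run(["gsutil", "-m", "cp", "-r", "/data/exports", BUCKET], check=True)

# Parallel composite upload for one very large file: above the threshold,
# gsutil splits the file into chunks and uploads them in parallel.
subprocess.run(
    ["gsutil",
     "-o", "GSUtil:parallel_composite_upload_threshold=150M",
     "cp", "/data/dump/huge_table.csv", BUCKET],
    check=True,
)

# Keep a local directory in sync with the bucket (incremental transfers).
subprocess.run(["gsutil", "-m", "rsync", "-r", "/data/exports", BUCKET], check=True)
```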

 

Storage Transfer Service

  • For large transfers of on-premises data
  • Designed for large-scale transfers (up to petabytes of data or billions of files)
  • Supports full copies or incremental copies
  • Offers a graphical user interface
  • Usage scenarios:
    • Sufficient bandwidth is available to move the data volumes
    • A large base of internal users who cannot use gsutil
    • Error reporting and a record of all data moved are needed
    • The impact of transfers on other workloads must be limited
    • Recurring transfers need to run on a schedule
  • To use Storage Transfer Service for on-premises data, install agents on-premises.
  • Agents run in Docker containers; run them directly or orchestrate them with Kubernetes.
  • After setup, start transfers by providing the following (a programmatic sketch follows this list):
    • A source directory
    • A destination bucket
    • A time or schedule
  • Storage Transfer Service recursively crawls subdirectories and files in the source directory and creates objects with corresponding names in Cloud Storage.
  • It automatically retries a transfer after transient errors.
  • You can monitor files moved and the overall transfer speed.
  • After the transfer, a tab-delimited (TSV) file lists all files transferred and any error messages.
  • Best practices:
    • Use an identical agent setup on every machine.
    • More agents mean more speed, so deploy as many agents as possible.
    • Bandwidth caps can protect other workloads.
    • Plan time for reviewing errors.
    • Set up Cloud Monitoring for long-running transfers.
 

Transfer Appliance –

  • Used for larger transfers when network bandwidth is limited or costly
  • Usage scenarios:
    • Data is at a remote location with limited or no bandwidth.
    • The required bandwidth is not available.
  • Involves receiving the hardware and shipping it back
  • It is Google-owned hardware.
  • Available only in specific countries.
  • The factors for choosing it are cost and speed.
  • Request an appliance in the Cloud Console, detailing the data to transfer.
  • The approximate turnaround time for an appliance to be shipped, loaded with data, shipped back, and rehydrated on Google Cloud is 50 days.
  • The cost for the 480 TB device process is less than $3,000.

 

Storage Transfer Service for cloud-to-cloud transfers –

  • Storage Transfer Service is a fully managed, highly scalable data transfer service.
  • It automates transfers from other public clouds into Cloud Storage.
  • It supports transfers into Cloud Storage from Amazon S3 and HTTP sources.
  • For Amazon S3 (see the sketch after this list):
    • An access key and S3 bucket details are needed.
    • Daily copies of any modified objects are also supported.
    • Transfers to Amazon S3 are not supported.
  • For HTTP, a list of public URLs in a specified format is needed.
  • A script is needed to produce the list, with the size of each file in bytes and a Base64-encoded MD5 hash of the file contents.
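A rough sketch of setting up a daily S3-to-Cloud-Storage transfer with the google-cloud-storage-transfer Python client. Project, bucket names, and credentials are placeholders, and the fields should be checked against the current library.

```python
# Sketch: create an S3 -> Cloud Storage transfer job.
# Project, bucket names, dates, and credentials are placeholders.
from google.cloud import storage_transfer

def create_s3_to_gcs_job(project_id: str, s3_bucket: str, gcs_bucket: str,
                         aws_access_key_id: str, aws_secret_access_key: str):
    client = storage_transfer.StorageTransferServiceClient()
    request = storage_transfer.CreateTransferJobRequest({
        "transfer_job": {
            "project_id": project_id,
            "description": "Daily copy of modified objects from S3",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            # With only a start date, the job typically recurs daily.
            "schedule": {
                "schedule_start_date": {"year": 2024, "month": 1, "day": 1},
            },
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": s3_bucket,
                    "aws_access_key": {
                        "access_key_id": aws_access_key_id,
                        "secret_access_key": aws_secret_access_key,
                    },
                },
                "gcs_data_sink": {"bucket_name": gcs_bucket},
            },
        }
    })
    job = client.create_transfer_job(request)
    print("Created transfer job:", job.name)
    return job
```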

 

Security

  • A primary focus during transfer
  • GCP offers different levels of security.
  • Consider protection of:
    • Data at rest (authorization and access to the source and destination storage systems)
    • Data in transit
    • Access to the transfer product

 

Security offered by each product:

  • Transfer Appliance
    • Data at rest: all data is encrypted.
    • Data in transit: protected with keys managed by the customer.
    • Access to the transfer product: anyone can order an appliance, but to use it they need access to the data source.
  • gsutil
    • Data at rest: access keys are required to access Cloud Storage, which is encrypted at rest.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: anyone can download and run gsutil; they must have permissions to buckets and local files in order to move data.
  • Storage Transfer Service for on-premises data
    • Data at rest: access keys are required to access Cloud Storage, which is encrypted at rest. The agent process can access local files as OS permissions allow.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: you must have object editor permissions to access Cloud Storage buckets.
  • Storage Transfer Service
    • Data at rest: access keys are required for non-Google Cloud resources (for example, Amazon S3). Access keys are required to access Cloud Storage, which is encrypted at rest.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: you must have Cloud IAM permissions for the service account to access the source, and object editor permissions for any Cloud Storage buckets.

 

Step 4: Preparing for transfer

Steps involved:

  • Pricing and ROI estimation
  • Functional testing: confirm that the product is set up and that network connectivity works.
    • Confirm installation and operation of the transfer.
    • List issues that block data movement.
    • List operations needed, such as training.
  • Performance testing: run a transfer on a large sample of data to confirm speed and fix bottlenecks.

 

Step 5: Ensuring the integrity of the transfer

  • Enable versioning and backup on the destination to circumvent accidental deletes.
  • Validate data before removing the source data (see the sketch below).
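One way to validate a file before deleting the source is to compare its local MD5 with the checksum Cloud Storage records on the object. The bucket and file names below are placeholders.

```python
# Sketch: verify a local file against its Cloud Storage copy before deleting
# the source. Bucket and file names are placeholder assumptions.
# Note: composite objects may have no MD5; compare blob.crc32c in that case.
import base64
import hashlib

from google.cloud import storage

def matches_gcs_md5(bucket_name: str, blob_name: str, local_path: str) -> bool:
    """Compare a local file's MD5 with the base64 MD5 stored on the object."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)  # fetches metadata
    if blob is None:
        return False

    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    local_md5_b64 = base64.b64encode(md5.digest()).decode("ascii")

    return blob.md5_hash == local_md5_b64

if matches_gcs_md5("example-migration-bucket", "exports/huge_table.csv",
                   "/data/exports/huge_table.csv"):
    print("Checksums match; safe to remove the source copy.")
```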