The Migration Process – Google Professional Data Engineer (GCP)


 

A migration is a journey that involves several phases, with multiple options for reaching the destination.


There are four phases of migration:

  • Assess
    • Perform assessment and discovery of the existing environment
    • Understand the app and environment inventory
    • Identify app dependencies and requirements
    • Perform total cost of ownership (TCO) and app performance benchmarks
  • Plan
    • Create the basic cloud infrastructure for workloads
    • Plan how to move apps
    • Planning covers identity management, organization and project structure, networking, sorting apps, and a prioritized migration strategy
  • Deploy
    • Design, implement, and execute the migration
    • Refine cloud resources as needed
  • Optimize
    • Analyze and optimize cloud resource utilization
    • Reduce costs
    • Implement automation, ML, and AI services

Assess Phase

  • Build an inventory of apps – work with the teams that own each workload in the current environment.
  • The inventory should include:
    • Apps
    • Dependencies of each app
    • Services supporting the app infrastructure
    • Server configurations
    • Network devices, firewalls, and other dedicated hardware
  • For each item, gather:
    • Source code location
    • Deployment method
    • Network restrictions or security requirements
    • Licensing requirements
  • Categorize apps
    • Categorize to prioritize which apps to migrate first
    • Also understand the complexity and risk involved
    • A catalog matrix is used for this purpose (see the sketch below)
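For illustration only, a catalog matrix can be kept as simple structured data and sorted to pick the first migration wave. The app names, scores, and thresholds below are hypothetical assumptions, not part of any official guidance.

```python
# Minimal sketch of a catalog matrix used to prioritize migration candidates.
# App names, scores, and thresholds are hypothetical examples.

apps = [
    # name, business criticality (1-5), migration complexity (1-5), dependencies
    {"name": "reporting-ui", "criticality": 2, "complexity": 1, "dependencies": ["reports-db"]},
    {"name": "billing-api", "criticality": 5, "complexity": 4, "dependencies": ["billing-db", "ldap"]},
    {"name": "batch-etl", "criticality": 3, "complexity": 2, "dependencies": ["warehouse"]},
]

def categorize(app):
    """Assign a rough migration category from complexity and criticality."""
    if app["complexity"] <= 2 and app["criticality"] <= 3:
        return "quick win: migrate first"
    if app["complexity"] >= 4 and app["criticality"] >= 4:
        return "high risk: migrate last, plan carefully"
    return "standard: migrate in a middle wave"

# Sort so the simplest, lowest-risk apps come out first.
for app in sorted(apps, key=lambda a: (a["complexity"], -a["criticality"])):
    print(f"{app['name']:12} -> {categorize(app)}")
```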

Transferring large datasets

Transferring large datasets involves several steps:

  • building the right team
  • planning early
  • testing transfer plan before implementing

Data transfer

  • The process of moving data without transforming it
  • It involves:
    • Making a transfer plan to decide on a transfer option and obtain approvals
    • Coordinating the team that executes the transfer
    • Choosing the right transfer tool based on resources, cost, and time
    • Overcoming data transfer challenges, such as insufficient bandwidth, moving actively used datasets, protecting and monitoring the data during transfer, and ensuring a successful transfer
  • Other types of data transfer projects:
    • For ETL transformations, use Dataflow.
    • To migrate a database and related apps, use Cloud Spanner.
    • For virtual machine (VM) instance migration, use Migrate for Compute Engine.

 

Step 1: Assembling the team

Planning a transfer typically requires personnel with the following roles and responsibilities:

  • Storage, IT, and network admins to execute the transfer
  • Data owners or governors and legal staff to approve the transfer

Step 2: Collecting requirements and available resources

  • Identify the datasets to move.
    • Use Data Catalog to organize data into logical groupings.
    • Work with teams to update these groupings.
  • Identify the datasets you can move.
    • Check whether any regulatory, security, or other factors prohibit the transfer.
    • Remove sensitive data or reorganize data as needed, using Dataflow, Cloud Data Fusion, or Cloud Composer.
  • For movable datasets, decide where to transfer each dataset.
    • Select a storage option for the data.
    • Understand the data access policies to maintain after migration.
    • Note any region- or geography-specific requirements.
    • Define the data structure in the destination.
    • Decide whether the transfer is ongoing or one-off.
  • For movable datasets, also record the following:
    • Time: when to transfer
    • Cost: the budget available
    • People: who will execute the transfer
    • Bandwidth: the bandwidth available (for online transfers)

Step 3: Evaluating transfer options

Data transfer options are selected based on the following factors:

  • Cost
  • Time
  • Offline versus online transfer options
  • Transfer tools and technologies
  • Security

 

Cost:

It includes the following (a rough estimate sketch follows this list):

  • Networking costs
    • Egress charges, if any
    • Bandwidth charges for the transfer
  • Storage and operation costs for Cloud Storage during and after the data transfer
  • Personnel costs for support
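As a rough illustration of how these cost items add up, the sketch below multiplies data volume by per-GB rates. The volumes and rates are placeholder assumptions, not actual GCP or ISP pricing.

```python
# Rough cost estimate for a transfer; all rates are placeholder assumptions.
# Check current GCP and ISP pricing before planning.

data_tb = 50                      # data volume to move, in TB
egress_rate_per_gb = 0.08         # assumed source-side egress charge, USD/GB
storage_rate_per_gb_month = 0.02  # assumed Cloud Storage rate, USD/GB/month
months_stored = 1

data_gb = data_tb * 1024
egress_cost = data_gb * egress_rate_per_gb
storage_cost = data_gb * storage_rate_per_gb_month * months_stored

print(f"Egress (source side): ${egress_cost:,.0f}")
print(f"Cloud Storage ({months_stored} month): ${storage_cost:,.0f}")
print(f"Total (excluding personnel): ${egress_cost + storage_cost:,.0f}")
```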

 

Time:

  • Time needed for the transfer (see the estimate sketch after this list)
  • When to undertake the transfer
  • Connection options for data transfer between a private data center and GCP:
    • A public internet connection by using a public API
    • Direct Peering by using a public API
    • Cloud Interconnect by using a private API
Connecting with a public internet connection –

  • Less predictable
  • Dependent on ISP capacity
  • Low cost
  • Google offers peering arrangements where applicable

 

Connecting with Direct Peering –

  • Access the GCP network with fewer network hops
  • Direct Peering connects the ISP network and Google’s Edge Points of Presence (PoPs)
  • A registered Autonomous System (AS) number must be set up, along with around-the-clock contact with a network operations center.

Connecting with Cloud Interconnect –

  • Cloud Interconnect is a direct connection to GCP through Cloud Interconnect service providers.
  • No need to send data over the public internet
  • More consistent throughput for large data transfers
  • SLAs for network availability and performance

Online versus offline transfer –

  • Decide whether to transfer data over a network (online) or by shipping storage hardware (offline).

 

Deciding among Google’s transfer options

Factors for choosing a transfer option – where the data is moving from, the scenario, and the suggested product (a decision sketch follows the list below):

  • Another cloud provider (for example, Amazon Web Services or Microsoft Azure) to Google Cloud – Storage Transfer Service
  • Cloud Storage to Cloud Storage (two different buckets) – Storage Transfer Service
  • Private data center to Google Cloud, enough bandwidth to meet the project deadline, less than a few TB of data – gsutil
  • Private data center to Google Cloud, enough bandwidth to meet the project deadline, more than a few TB of data – Storage Transfer Service for on-premises data
  • Private data center to Google Cloud, not enough bandwidth to meet the project deadline – Transfer Appliance
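The decision logic above can also be expressed as a small helper function. The "few TB" threshold and the scenario flags below are illustrative assumptions rather than official guidance.

```python
# Sketch of the decision logic in the list above. The "few TB" threshold and
# the bandwidth check are illustrative assumptions.

def suggest_transfer_product(source, data_tb=0.0, enough_bandwidth=True):
    """Suggest a transfer product for a given source and scenario."""
    if source in ("aws", "azure", "other_cloud"):
        return "Storage Transfer Service"
    if source == "cloud_storage":
        return "Storage Transfer Service"  # bucket-to-bucket copy
    if source == "private_data_center":
        if not enough_bandwidth:
            return "Transfer Appliance"
        if data_tb < 3:                    # "less than a few TB"
            return "gsutil"
        return "Storage Transfer Service for on-premises data"
    raise ValueError(f"unknown source: {source}")

print(suggest_transfer_product("private_data_center", data_tb=200,
                               enough_bandwidth=False))
# -> Transfer Appliance
```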

 

gsutil

  • Suitable for smaller transfers of on-premises data (less than a few TB)
  • Included in the default path when using Cloud Shell
  • Provided by default with the Cloud SDK
  • Manages Cloud Storage buckets and objects
  • Functions provided:
    • Copying data to and from the local file system and Cloud Storage
    • Moving and renaming objects
    • Performing real-time incremental syncs
  • Usage scenarios:
    • Transfers on an as-needed basis, or during command-line sessions by users
    • Transferring only a few files, very large files, or both
    • Consuming the output of a program, such as streaming output to Cloud Storage
    • Watching and syncing a directory with a moderate number of files
  • To use gsutil, create a Cloud Storage bucket and copy data to it.
  • For security, use HTTPS.
  • For large dataset transfers (see the sketch after this list):
    • Use gsutil -m for multi-threaded transfers.
    • Use composite transfers for a single large file; this breaks the file into smaller chunks to increase transfer speed.
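A minimal sketch of driving gsutil from a script for a large transfer. The bucket name, local paths, and the 150 MB composite-upload threshold are example assumptions.

```python
# Minimal sketch: drive gsutil from Python for a large transfer.
# Bucket name, local paths, and the 150M threshold are example assumptions.
import subprocess

BUCKET = "gs://example-migration-bucket"

# Multi-threaded/multi-process copy of a whole directory (-m).
subprocess.run(["gsutil", "-m", "cp", "-r", "/data/exports", BUCKET], check=True)

# Parallel composite upload for one very large file: above the threshold,
# gsutil splits the file into chunks and uploads them in parallel.
subprocess.run(
    ["gsutil",
     "-o", "GSUtil:parallel_composite_upload_threshold=150M",
     "cp", "/data/dump/huge_table.csv", BUCKET],
    check=True,
)

# Keep a local directory in sync with the bucket (incremental transfers).
subprocess.run(["gsutil", "-m", "rsync", "-r", "/data/exports", BUCKET], check=True)
```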

 

Storage Transfer Service

  • For large transfers of on-premises data
  • Designed for large-scale transfers (up to petabytes of data or billions of files)
  • Supports full copies or incremental copies
  • Offers a graphical user interface
  • Usage scenarios:
    • Sufficient bandwidth is available to move the data volumes
    • A large base of internal users who cannot use gsutil
    • Error reporting and a record of all data moved are needed
    • The impact of transfers on other workloads must be limited
    • Recurring transfers need to run on a schedule
  • To use Storage Transfer Service for on-premises data, install agents on-premises.
  • Agents run in Docker containers; run them directly or orchestrate them with Kubernetes.
  • After setup, start transfers by providing the following (a programmatic sketch follows this list):
    • A source directory
    • A destination bucket
    • A time or schedule
  • Storage Transfer Service recursively crawls subdirectories and files in the source directory and creates objects with corresponding names in Cloud Storage.
  • It automatically retries a transfer after transient errors.
  • You can monitor files moved and the overall transfer speed.
  • After the transfer, a tab-delimited (TSV) file lists all files transferred and any error messages.
  • Best practices:
    • Use an identical agent setup on every machine.
    • More agents mean more speed, so deploy as many agents as possible.
    • Bandwidth caps can protect other workloads.
    • Plan time for reviewing errors.
    • Set up Cloud Monitoring for long-running transfers.
 

Transfer Appliance –

  • Used for larger transfers when network bandwidth is limited or costly
  • Usage scenarios:
    • Data is at a remote location with limited or no bandwidth.
    • The required bandwidth is not available.
  • Involves receiving the hardware and shipping it back
  • It is Google-owned hardware.
  • Available only in specific countries.
  • The factors for choosing it are cost and speed.
  • Request an appliance in the Cloud Console, detailing the data to transfer.
  • The approximate turnaround time for an appliance to be shipped, loaded with data, shipped back, and rehydrated on Google Cloud is 50 days.
  • The cost for the 480 TB device process is less than $3,000.

 

Storage Transfer Service for cloud-to-cloud transfers –

  • Storage Transfer Service is a fully managed, highly scalable data transfer service.
  • It automates transfers from other public clouds into Cloud Storage.
  • It supports transfers into Cloud Storage from Amazon S3 and HTTP sources.
  • For Amazon S3 (see the sketch after this list):
    • An access key and S3 bucket details are needed.
    • Daily copies of any modified objects are also supported.
    • Transfers to Amazon S3 are not supported.
  • For HTTP, a list of public URLs in a specified format is needed.
  • A script is needed to produce the list, with the size of each file in bytes and a Base64-encoded MD5 hash of the file contents.
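A rough sketch of setting up a daily S3-to-Cloud-Storage transfer with the google-cloud-storage-transfer Python client. Project, bucket names, and credentials are placeholders, and the fields should be checked against the current library.

```python
# Sketch: create an S3 -> Cloud Storage transfer job.
# Project, bucket names, dates, and credentials are placeholders.
from google.cloud import storage_transfer

def create_s3_to_gcs_job(project_id: str, s3_bucket: str, gcs_bucket: str,
                         aws_access_key_id: str, aws_secret_access_key: str):
    client = storage_transfer.StorageTransferServiceClient()
    request = storage_transfer.CreateTransferJobRequest({
        "transfer_job": {
            "project_id": project_id,
            "description": "Daily copy of modified objects from S3",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            # With only a start date, the job typically recurs daily.
            "schedule": {
                "schedule_start_date": {"year": 2024, "month": 1, "day": 1},
            },
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": s3_bucket,
                    "aws_access_key": {
                        "access_key_id": aws_access_key_id,
                        "secret_access_key": aws_secret_access_key,
                    },
                },
                "gcs_data_sink": {"bucket_name": gcs_bucket},
            },
        }
    })
    job = client.create_transfer_job(request)
    print("Created transfer job:", job.name)
    return job
```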

 

Security

  • A primary focus during transfer
  • GCP offers different levels of security.
  • Consider protection of:
    • Data at rest (authorization and access to the source and destination storage systems)
    • Data in transit
    • Access to the transfer product

 

Security offered by each product:

  • Transfer Appliance
    • Data at rest: all data is encrypted.
    • Data in transit: protected with keys managed by the customer.
    • Access to the transfer product: anyone can order an appliance, but to use it they need access to the data source.
  • gsutil
    • Data at rest: access keys are required to access Cloud Storage, which is encrypted at rest.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: anyone can download and run gsutil; they must have permissions to buckets and local files in order to move data.
  • Storage Transfer Service for on-premises data
    • Data at rest: access keys are required to access Cloud Storage, which is encrypted at rest. The agent process can access local files as OS permissions allow.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: you must have object editor permissions to access Cloud Storage buckets.
  • Storage Transfer Service
    • Data at rest: access keys are required for non-Google Cloud resources (for example, Amazon S3). Access keys are required to access Cloud Storage, which is encrypted at rest.
    • Data in transit: data is sent over HTTPS and encrypted in transit.
    • Access to the transfer product: you must have Cloud IAM permissions for the service account to access the source, and object editor permissions for any Cloud Storage buckets.

 

Step 4: Preparing for transfer

Steps involved:

  • Pricing and ROI estimation
  • Functional testing: confirm that the product is set up and that network connectivity works.
    • Confirm installation and operation of the transfer.
    • List issues that block data movement.
    • List operations needed, such as training.
  • Performance testing: run a transfer on a large sample of data to confirm speed and fix bottlenecks.

 

Step 5: Ensuring the integrity of the transfer

  • Enable versioning and backup on the destination to circumvent accidental deletes.
  • Validate data before removing the source data (see the sketch below).
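One way to validate a file before deleting the source is to compare its local MD5 with the checksum Cloud Storage records on the object. The bucket and file names below are placeholders.

```python
# Sketch: verify a local file against its Cloud Storage copy before deleting
# the source. Bucket and file names are placeholder assumptions.
# Note: composite objects may have no MD5; compare blob.crc32c in that case.
import base64
import hashlib

from google.cloud import storage

def matches_gcs_md5(bucket_name: str, blob_name: str, local_path: str) -> bool:
    """Compare a local file's MD5 with the base64 MD5 stored on the object."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)  # fetches metadata
    if blob is None:
        return False

    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    local_md5_b64 = base64.b64encode(md5.digest()).decode("ascii")

    return blob.md5_hash == local_md5_b64

if matches_gcs_md5("example-migration-bucket", "exports/huge_table.csv",
                   "/data/exports/huge_table.csv"):
    print("Checksums match; safe to remove the source copy.")
```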