AI Platform Training Google Professional Data Engineer GCP

  • AI Platform Training provides the dependencies required to train ML models using hosted frameworks.
  • It can also run training jobs in custom containers.


How training works

  • You can train a built-in algorithm (beta) against your dataset without writing a training application.
  • Alternatively, you can create a training application to run on AI Platform Training.
  • Create a Python application that trains your model.
  • Get the training and verification data into a source that AI Platform Training can access.
  • When the application is ready, package it and transfer it to a Cloud Storage bucket that your project can access. The gcloud command-line tool automates this step.
  • The training service then:
    • Sets up resources for the job.
    • Allocates one or more virtual machines (called training instances) according to the job configuration.
  • Each training instance is set up by:
    • Applying the standard machine image for the version of AI Platform Training that the job uses.
    • Loading the application package and installing it with pip.
    • Installing any additional packages that you specify as dependencies.
  • The training service runs the application.
  • You can get information about a running job through:
    • Cloud Logging.
    • The gcloud command-line tool.
    • Programmatic status requests to the training service.
  • When the training job succeeds or encounters an unrecoverable error, AI Platform Training halts all job processes and cleans up the resources.
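
The monitoring step above can be sketched as a small polling loop. This is only an illustration: `get_state` is a hypothetical callable standing in for whatever status request you make to the training service, and the poll interval and limit are arbitrary. The state names follow the job states reported by the AI Platform Training jobs API.

```python
import time

# States after which the job will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Poll until the job reaches a terminal state and return that state.

    get_state is a hypothetical zero-argument callable that returns the
    current job state as a string (for example, a thin wrapper around a
    status request to the training service).
    """
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


if __name__ == "__main__":
    # Simulated state sequence standing in for real status requests.
    fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
    final = wait_for_job(lambda: next(fake_states), poll_seconds=0,
                         sleep=lambda seconds: None)
    print(final)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in practice the default `time.sleep` would be used.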


Distributed training structure

For a distributed TensorFlow job, each replica in the training cluster is given a single role or task:

  • Master: Exactly one replica is designated the master. It manages the others and reports status for the job as a whole.
  • Workers: One or more replicas that do their portion of the work as specified in the job configuration.
  • Parameter servers: One or more replicas that coordinate shared model state between the workers.

Distributed training strategies

There are three basic strategies to train a model with multiple nodes:

  • Data-parallel training with synchronous updates.
  • Data-parallel training with asynchronous updates.
  • Model-parallel training.
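
To make the first strategy concrete, here is a minimal sketch of data-parallel training with synchronous updates, simulated in a single process with plain Python. The linear model, the learning rate, and the function names are all illustrative assumptions, not part of AI Platform Training. Each "worker" computes a gradient on its own data shard, and the parameters are updated only after all gradients have been averaged.

```python
def gradient(w, b, shard):
    """Mean-squared-error gradient of y ~ w*x + b over one data shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def sync_data_parallel_step(w, b, shards, lr):
    """One synchronous update: all worker gradients are averaged first."""
    grads = [gradient(w, b, shard) for shard in shards]  # one per worker
    gw = sum(g[0] for g in grads) / len(grads)
    gb = sum(g[1] for g in grads) / len(grads)
    return w - lr * gw, b - lr * gb


def train(data, num_workers=2, steps=2000, lr=0.05):
    """Split data across simulated workers and run synchronous updates."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = sync_data_parallel_step(w, b, shards, lr)
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 3x + 1.
    data = [(float(x), 3.0 * x + 1.0) for x in range(4)]
    print(train(data))
```

With asynchronous updates, each worker would instead apply its gradient as soon as it is ready, without waiting for the others. In model-parallel training, the model itself, rather than the data, is split across nodes.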

Input data

Data for a training job must meet the following requirements:

  • Must be in a format that you can read and feed to your training code.
  • Must be in a location that your code can access.

Output data

  • Training applications output data, including checkpoints during training and a saved model when training is complete.
  • An application can also output other data as needed.
  • It is easiest to save output files to a Cloud Storage bucket.


Training and prediction for a neural network on AI Platform can be done with:

  • Keras
  • TensorFlow Estimator


Training job

  • AI Platform Training provides model training as an asynchronous (batch) service.
  • You must specify the number and types of machines you need.
  • You can pick from a set of predefined cluster specifications called scale tiers.
  • Otherwise, choose the custom tier and specify the machine types yourself.
  • You can monitor several aspects of the job while it runs.
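
The choice between a predefined scale tier and the custom tier can be sketched as a helper that builds the `trainingInput` portion of a job request as a plain dictionary. The field names (`scaleTier`, `masterType`, `workerType`, `workerCount`, and so on) follow the AI Platform Training jobs API, but the helper itself, the bucket path, the module name, and the machine type shown in the usage are hypothetical. A real request also needs other fields, such as a runtime version, which are omitted here.

```python
PREDEFINED_TIERS = {"BASIC", "STANDARD_1", "PREMIUM_1", "BASIC_GPU", "BASIC_TPU"}


def build_training_input(region, package_uris, python_module,
                         scale_tier="BASIC", master_type=None,
                         worker_type=None, worker_count=0,
                         parameter_server_type=None, parameter_server_count=0):
    """Build a trainingInput dict; machine types are used only for CUSTOM."""
    if scale_tier != "CUSTOM" and scale_tier not in PREDEFINED_TIERS:
        raise ValueError(f"unknown scale tier: {scale_tier}")
    spec = {
        "scaleTier": scale_tier,
        "region": region,
        "packageUris": list(package_uris),
        "pythonModule": python_module,
    }
    if scale_tier == "CUSTOM":
        # The custom tier requires an explicit machine type for the master.
        if not master_type:
            raise ValueError("the CUSTOM tier requires a master machine type")
        spec["masterType"] = master_type
        if worker_count > 0:
            if not worker_type:
                raise ValueError("workers need a machine type")
            spec["workerType"] = worker_type
            spec["workerCount"] = worker_count
        if parameter_server_count > 0:
            if not parameter_server_type:
                raise ValueError("parameter servers need a machine type")
            spec["parameterServerType"] = parameter_server_type
            spec["parameterServerCount"] = parameter_server_count
    return spec


if __name__ == "__main__":
    # Hypothetical bucket, package, and module names.
    print(build_training_input(
        "us-central1", ["gs://my-bucket/trainer-0.1.tar.gz"], "trainer.task",
        scale_tier="CUSTOM", master_type="n1-standard-8",
        worker_type="n1-standard-8", worker_count=2))
```

Keeping the specification as plain data makes it easy to review or serialize before anything is submitted.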



  • AI Platform Training sets an environment variable called TF_CONFIG on each VM instance that is part of the job.
  • Training code can use it to access details about the training job and the role of the VM it is running on.
  • This facilitates distributed training.
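
As a rough illustration of how training code might read this variable, the sketch below parses TF_CONFIG as JSON and extracts the cluster description and the task assigned to the current VM. The helper name and the host names in the sample value are made up for the example. The `cluster` and `task` keys follow the TF_CONFIG format used by TensorFlow.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG.

    If TF_CONFIG is unset or empty, as can happen when running locally,
    the result is ({}, None, None).
    """
    config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


if __name__ == "__main__":
    # Sample value with hypothetical host names, for illustration only.
    sample = {
        "TF_CONFIG": json.dumps({
            "cluster": {
                "master": ["master-0:2222"],
                "worker": ["worker-0:2222", "worker-1:2222"],
                "ps": ["ps-0:2222"],
            },
            "task": {"type": "worker", "index": 1},
        })
    }
    cluster, task_type, task_index = read_tf_config(sample)
    print(task_type, task_index, sorted(cluster))
```

Code that branches on the task type can then, for example, let only the master replica write the final saved model.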


Access Control

  • AI Platform Training uses IAM to manage access to resources.


  • Hyperparameters contain the data that govern the training process itself.


  • AI Platform Training uses Cloud Audit Logs to generate logs for API operations.
  • Audit logs track how resources are modified and accessed within GCP.
  • Cloud Audit Logs includes the following types of logs:
    • Admin Activity logs: Contain log entries for operations that modify the configuration or metadata of an AI Platform Training resource.
    • Data Access logs: Contain log entries for read-only operations that do not modify any data, such as get and list.