AI Platform Training Google Professional Data Engineer GCP

How training works

You can train a built-in algorithm (beta) against your dataset without writing a training application.
Alternatively, you can create your own training application to run on AI Platform Training.
Create a Python application that trains the model.
Get the training and verification data into a source that AI Platform Training can access.
When the application is ready, package it and transfer it to a Cloud Storage bucket that the project can access. The gcloud command-line tool automates this step.
The training service:
- Sets up resources for the job.
- Allocates one or more virtual machines (called training instances) according to the job configuration.
Each training instance is set up by:
- Applying the standard machine image for the version of AI Platform Training that the job uses.
- Loading the application package and installing it with pip.
- Installing any additional packages that you specify as dependencies.
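
The setup steps above correspond to fields in the job configuration: the runtime version selects the machine image, and the package URI points at the application that is installed with pip. As a rough sketch, a request body for a training job could be assembled as below. The field names are assumed to follow the `projects.jobs.create` REST resource; the bucket, package, module, and version values are hypothetical placeholders, not recommendations.

```python
def build_training_job(job_id, package_uri, python_module, job_dir,
                       region="us-central1", runtime_version="2.11",
                       python_version="3.7", args=None):
    """Assemble a training job request body as a plain dict.

    All default values are illustrative assumptions; check the versions
    and regions that your project actually supports.
    """
    # The package and the job directory are expected to live in Cloud Storage.
    for name, uri in (("package_uri", package_uri), ("job_dir", job_dir)):
        if not uri.startswith("gs://"):
            raise ValueError(f"{name} must be a Cloud Storage URI, got {uri!r}")

    return {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "BASIC",              # single training instance
            "packageUris": [package_uri],      # application package installed with pip
            "pythonModule": python_module,     # module to run after installation
            "args": list(args or []),          # extra arguments for the application
            "region": region,
            "jobDir": job_dir,
            "runtimeVersion": runtime_version, # selects the standard machine image
            "pythonVersion": python_version,
        },
    }


# Hypothetical names, for illustration only.
job = build_training_job(
    job_id="census_training_1",
    package_uri="gs://my-bucket/packages/trainer-0.1.tar.gz",
    python_module="trainer.task",
    job_dir="gs://my-bucket/jobs/census_training_1",
)
print(job["trainingInput"]["pythonModule"])
```

A dict like this is roughly what the gcloud tool builds from its flags when it submits a job.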
The training service runs the application.
You can get information about a running job through:
- Cloud Logging.
- The gcloud command-line tool.
- Programmatic status requests to the training service.
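
The programmatic option can be sketched as a polling loop. In this minimal sketch, `get_job` is a hypothetical callable that returns the job resource as a dict; with the Google API Python client it might wrap a `projects.jobs.get` request, but that wiring is an assumption and is left out so the example stays self-contained. The state names are assumed to match the job states reported by the service.

```python
import time

# States after which the service no longer changes the job (assumed names).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_job, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call get_job() repeatedly until the job reaches a terminal state.

    get_job: hypothetical callable returning a dict with a "state" key.
    sleep:   injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        job = get_job()
        state = job.get("state", "STATE_UNSPECIFIED")
        if state in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Simulated status responses, standing in for real status requests.
responses = iter([
    {"state": "QUEUED"},
    {"state": "PREPARING"},
    {"state": "RUNNING"},
    {"state": "SUCCEEDED"},
])
final = wait_for_job(lambda: next(responses), sleep=lambda _seconds: None)
print(final["state"])
```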
When the training job succeeds or encounters an unrecoverable error, AI Platform Training halts all job processes and cleans up the resources.

Distributed training structure

For a distributed TensorFlow job, each replica in the training cluster is given a single role or task in distributed training:

Master: Exactly one replica is designated as the master. It manages the others and reports status for the job as a whole.
Workers: These replicas do their part of the work as specified in the job configuration.
Parameter servers: These replicas coordinate shared model state between the workers.
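
To make the three roles concrete, the sketch below builds the cluster part of a job configuration and lists the replicas it would produce. The field names (`masterType`, `workerCount`, and so on) are assumed to follow the `trainingInput` resource with a `CUSTOM` scale tier; the machine type and the helper names are hypothetical choices for illustration.

```python
def build_cluster_config(worker_count, parameter_server_count,
                         machine_type="n1-standard-4"):
    """Return the cluster fields of a trainingInput for a custom scale tier."""
    if worker_count < 0 or parameter_server_count < 0:
        raise ValueError("replica counts must not be negative")

    config = {"scaleTier": "CUSTOM", "masterType": machine_type}
    if worker_count:
        config["workerType"] = machine_type
        config["workerCount"] = worker_count
    if parameter_server_count:
        config["parameterServerType"] = machine_type
        config["parameterServerCount"] = parameter_server_count
    return config


def replica_roles(config):
    """List (role, index) for every replica implied by the config."""
    roles = [("master", 0)]  # exactly one master per job
    roles += [("worker", i) for i in range(config.get("workerCount", 0))]
    roles += [("ps", i) for i in range(config.get("parameterServerCount", 0))]
    return roles


cluster = build_cluster_config(worker_count=2, parameter_server_count=1)
for role, index in replica_roles(cluster):
    print(role, index)
```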

Distributed training strategies

There are three basic strategies to train a model with multiple nodes:

Input data

Data for training job should follow

Output data

The application writes output data, including checkpoints during training and a saved model when training is complete.
It can also write any other data that the application needs.
The easiest option is to save output files to a Cloud Storage bucket.
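
One way to organize these outputs is to derive every path from a single job directory in Cloud Storage. The sketch below assumes the application receives that directory as a `--job-dir` argument; the subdirectory names `checkpoints` and `export` and the bucket name are hypothetical.

```python
import argparse
import posixpath


def parse_args(argv=None):
    """Parse the job directory and ignore arguments this sketch does not know."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", required=True,
                        help="Cloud Storage location for job output")
    args, _unknown = parser.parse_known_args(argv)
    return args


def output_paths(job_dir):
    """Derive output locations under the job directory."""
    base = job_dir.rstrip("/")
    return {
        # Written periodically during training.
        "checkpoints": posixpath.join(base, "checkpoints"),
        # Written once, when training is complete.
        "saved_model": posixpath.join(base, "export"),
    }


args = parse_args(["--job-dir", "gs://my-bucket/jobs/run_1"])
for name, path in output_paths(args.job_dir).items():
    print(name, path)
```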

Training and Prediction of a neural network on AI Platform can be done with

Training job

TF_CONFIG

AI Platform Training sets an environment variable called TF_CONFIG on each VM instance that is part of the job.
The application can use it to access details about the training job and the role of the VM it is running on.
This information facilitates distributed training.
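
A training application can read this variable to find out its own role. The sketch below assumes TF_CONFIG holds a JSON object with `cluster` and `task` keys, which is the usual TensorFlow layout; the host names in the sample are made up, and treating both `master` and `chief` as the coordinating role is a simplifying assumption.

```python
import json
import os


def read_task(environ=None):
    """Return the role of this replica as described by TF_CONFIG."""
    environ = os.environ if environ is None else environ
    # Fall back to an empty config so the code also runs outside a training job.
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})

    task_type = task.get("type", "master")
    return {
        "type": task_type,
        "index": int(task.get("index", 0)),
        "is_chief": task_type in ("master", "chief"),
        "num_workers": len(cluster.get("worker", [])),
    }


# Sample value with hypothetical host names, as a worker replica might see it.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["master-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
            "ps": ["ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}
print(read_task(sample_env))
```

Typically only the chief replica would write the final saved model, which is why the sketch exposes `is_chief`.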

Access Control

Hyperparameter

Logs

AI Platform Training uses Cloud Audit Logs to generate logs for API operations.
These logs track how resources are modified and accessed within GCP.
Cloud Audit Logs includes the following types of logs:
- Admin Activity logs: Contain log entries for operations that modify the configuration or metadata of an AI Platform Training resource.
- Data Access logs: Contain log entries for read-only operations that do not modify any data, such as get and list.
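
To look at one of these log types in Cloud Logging, you can filter on the log name and the service. The sketch below only builds the filter string. It assumes the standard Cloud Audit Logs log IDs and that AI Platform Training reports under the service name `ml.googleapis.com`; the project ID is a hypothetical placeholder.

```python
# Log IDs used by Cloud Audit Logs, URL-encoded as they appear in logName.
AUDIT_LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com%2Factivity",
    "data_access": "cloudaudit.googleapis.com%2Fdata_access",
}


def audit_log_filter(project_id, log_type, service_name="ml.googleapis.com"):
    """Build a Cloud Logging filter for one audit log type of one service."""
    if log_type not in AUDIT_LOG_IDS:
        raise ValueError(f"unknown audit log type: {log_type!r}")
    log_name = f"projects/{project_id}/logs/{AUDIT_LOG_IDS[log_type]}"
    return (f'logName="{log_name}" '
            f'AND protoPayload.serviceName="{service_name}"')


print(audit_log_filter("my-project", "admin_activity"))
```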