Machine Learning Algorithms Google Professional Data Engineer GCP

Supervised learning

Two categories of supervised learning:

  • Classification: used to label and group related data items.
  • Regression: used when the output is a continuous value, e.g., forecasting the value of a stock.

List of Techniques

Algorithm Name | Description | Type
Linear regression | Correlates each feature to the output to help predict future values. | Regression
Logistic regression | An extension of linear regression in which the output variable is binary (e.g., only black or white) rather than continuous. | Classification
Decision tree | Splits data-feature values into branches at decision nodes until a final output is reached. | Regression, Classification
Naive Bayes | Uses Bayes' theorem to update prior knowledge of an event with the independent probability of each feature that can affect the event. | Regression, Classification
Support vector machine | Finds a hyperplane that optimally divides the classes. | Regression, Classification
Random forest | Built upon decision trees: generates many simple decision trees and uses a 'majority vote' to decide which label to return. For a classification task, the final prediction is the label with the most votes; for a regression task, it is the average prediction of all the trees. | Regression, Classification
AdaBoost | Uses a multitude of models to come to a decision, but weighs them based on their accuracy in predicting the outcome. | Regression, Classification
Gradient-boosting trees | Focuses on the errors made by the previous trees and tries to correct them. | Regression, Classification
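
To make the table concrete, here is a minimal classification sketch using scikit-learn (assumed installed); the tiny dataset and its meaning are invented purely for illustration:

    # A minimal supervised-classification sketch with scikit-learn.
    from sklearn.linear_model import LogisticRegression

    # Features: [hours_studied, hours_slept]; label: 1 = passed, 0 = failed.
    X = [[2, 9], [1, 5], [5, 1], [6, 8], [7, 6], [3, 4]]
    y = [1, 0, 0, 1, 1, 0]

    clf = LogisticRegression()    # binary output, so a classification model
    clf.fit(X, y)                 # learn the feature-to-label relationship
    print(clf.predict([[4, 7]]))  # predict the label for a new example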

Unsupervised learning

List of techniques

Algorithm | Description | Type
K-means clustering | Puts data into k groups, each containing data with similar characteristics (as determined by the model, not in advance by humans). | Clustering
Gaussian mixture model | A generalization of k-means clustering that provides more flexibility in the size and shape of the groups (clusters). | Clustering
Hierarchical clustering | Splits clusters along a hierarchical tree to form a classification system. | Clustering
Recommender system | Helps to define the relevant data for making a recommendation. | Clustering
PCA/t-SNE | Decreases the dimensionality of the data, reducing the number of features to the 3 or 4 vectors with the highest variance. | Dimension reduction
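
A minimal k-means sketch with scikit-learn (assumed installed); the 2-D points are invented for illustration:

    # The model discovers the groups itself; no human-provided labels.
    from sklearn.cluster import KMeans

    X = [[1, 2], [1, 4], [1, 0],       # one loose blob near x = 1
         [10, 2], [10, 4], [10, 0]]    # another blob near x = 10

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)              # cluster assignment for each point
    print(kmeans.cluster_centers_)     # the k learned group centres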

Under-fitting & Over-fitting

We try to make the machine learning algorithm fit the input data by increasing or decreasing the model's capacity. In regression problems, we do this by increasing or decreasing the degree of the polynomial, as the sketch after the list below illustrates.

  • Under-fitting: when the model has too few features and is therefore not able to learn from the data very well. Such a model has high bias.
  • Over-fitting: when the model uses overly complex functions and is therefore able to fit the training data very well, but is not able to generalize to predict new data. Such a model has high variance.
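
A sketch of varying model capacity via polynomial degree (scikit-learn and NumPy assumed installed; the noisy quadratic data is generated just for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Noisy samples of y = x^2, invented for the demonstration.
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
    y = X.ravel() ** 2 + rng.normal(scale=1.0, size=40)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for degree in (1, 2, 15):   # under-fit, reasonable fit, over-fit
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        # High bias: poor score on both splits; high variance: good on
        # the training split but poor on the held-out split.
        print(degree, model.score(X_train, y_train), model.score(X_test, y_test))

A degree-1 model scores poorly on both splits (high bias), while a degree-15 model scores much better on the training split than on the test split (high variance).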


Machine Learning Terminology


Labels

  • A label is the thing we're predicting: the y variable in simple linear regression.
  • Examples: a future price, the kind of animal shown in a picture, the meaning of an audio clip, etc.

Features

  • A feature is an input variable: the x variable in simple linear regression.
  • An ML project can have a single feature or millions of features, specified as: x1, x2, …, xn
  • In a spam detector, the features could include the following:
    • words in the email text
    • sender's address
    • time of day the email was sent
    • whether the email contains the phrase "one weird trick"

Examples

  • An example is a particular instance of data, x.
  • Examples fall into two categories: labelled examples and unlabeled examples.
  • A labelled example includes both the feature(s) and the label: {features, label}: (x, y)
  • Labelled examples are used to train the model.
  • An unlabeled example contains features but not the label: {features, ?}: (x, ?)
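
In code, labelled and unlabeled examples can be represented as plainly as Python dictionaries and tuples; the feature names and values below are hypothetical, continuing the spam-detector example:

    # Hypothetical labelled and unlabeled examples for a spam detector;
    # all feature names, values, and labels are invented for illustration.

    # Labelled examples: {features, label} -> (x, y); used for training.
    labelled = [
        ({"has_one_weird_trick": True,  "sent_hour": 3},  "spam"),
        ({"has_one_weird_trick": False, "sent_hour": 14}, "not spam"),
    ]

    # Unlabeled examples: {features, ?} -> (x, ?); the label is unknown.
    unlabeled = [
        {"has_one_weird_trick": True, "sent_hour": 2},
    ]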


Models

  • A model defines the relationship between features and label.
  • There are two phases in a model's life:
    • Training means creating or learning the model: the model is shown labelled examples and gradually learns the relationships between features and label.
    • Inference means applying the trained model to unlabeled examples.
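
A minimal sketch of the two phases, using scikit-learn (assumed installed) with a toy labelled dataset invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Training phase: show the model labelled examples.
    X_train = [[0, 1], [1, 1], [1, 0], [0, 0]]  # features
    y_train = [1, 1, 0, 0]                      # labels
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Inference phase: apply the trained model to an unlabeled example.
    print(model.predict([[1, 0.5]]))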


Bias

  • An ML model has low bias if its predictions are, on average, close to the true values.
  • Hence, it makes fewer systematic mistakes when working on a dataset.


Cross-validation

  • A technique that provides a more accurate measure of an ML model's performance by training and evaluating it on several different splits of the data.
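
A minimal sketch with scikit-learn (assumed installed), scoring a model across five folds of a built-in dataset:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # Each fold is held out once while the model trains on the rest;
    # averaging the fold scores gives a more reliable performance estimate
    # than a single train/test split.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())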

Epoch

  • One complete traversal through the entire training dataset.
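
A schematic training loop makes the term concrete; train_step below is a hypothetical stand-in for the actual parameter update:

    # Schematic only: one epoch = one full pass over the training set,
    # here processed in mini-batches.
    def train(dataset, batch_size, n_epochs, train_step):
        for epoch in range(n_epochs):
            for start in range(0, len(dataset), batch_size):
                batch = dataset[start:start + batch_size]
                train_step(batch)  # update the model parameters on this batch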

Underfitting

  • If an ML model is not able to predict with a decent level of accuracy, the model underfits.
  • This can be due to:
    • not selecting the correct features for the prediction
    • the problem being too complex for the chosen model


Overfitting

  • Overfitting occurs when the model fits the training data too well.
  • An overfitting model learns the detail and noise in the training data, to the point that it negatively impacts performance on new data.
  • It can be addressed by decreasing the number of features/inputs or by increasing the number of training examples.


3 ML approaches

  • TensorFlow
  • Cloud ML Engine
  • Pre-built models


TensorFlow

  • Has an Estimator API
  • A model abstraction layer (layers, error functions)
  • A low-level Python library
  • A low-level C++ library
  • Hardware support (CPU, GPU, TPU)
  • Execution
    • There are build and run stages.
    • The build stage builds the computation graph.
    • The run stage executes the tensors, possibly in a distributed manner.
  • Placeholders with feed_dict are a dynamic way of passing values at run time (see the sketch after this list).
  • Data types are inferred implicitly.
  • Eager execution allows you to avoid the separate build-then-run stages; it is used mainly for testing and debugging purposes.
  • Model types:
    • LinearRegressor
    • DNNRegressor (deep neural network)
    • LinearClassifier
    • DNNClassifier
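
The build-then-run flow and feed_dict can be seen in a minimal sketch, assuming the TensorFlow 1.x API that this section describes (tf.placeholder and tf.Session do not exist in 2.x):

    import tensorflow as tf

    # Build stage: define the graph; no computation happens yet. The
    # placeholder's values will be supplied later through feed_dict.
    x = tf.placeholder(tf.float32, shape=[None], name="x")
    y = x * 2.0 + 1.0

    # Run stage: execute the graph, feeding concrete values for x.
    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # [3. 5. 7.]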


ML Engine

  • Does both training and prediction
  • Supports TensorFlow, XGBoost, scikit-learn, and Keras (in beta)


Other terms

  • Training example: a sample from x together with its output from the target function.
  • Target function: the mapping function f from x to f(x).
  • Hypothesis: an approximation of f; a candidate function.
  • Concept: a boolean target function, with positive and negative examples for the 1/0 class values.
  • Classifier: the output of the learning program; it can be used to classify new examples.
  • Learner: the process that creates the classifier.
  • Hypothesis space: the set of possible approximations of f that the algorithm can create.
  • Version space: the subset of the hypothesis space that is consistent with the observed data.