Machine Learning Algorithms Google Professional Data Engineer GCP

Supervised learning

Two categories of supervised learning:

  • Classification: used to label and group related data items.
  • Regression: used when the output is a continuous value, e.g., forecasting the value of a stock.

List of Techniques

Algorithm Name | Description | Type
Linear regression | Correlates each feature to the output to help predict future values. | Regression
Logistic regression | An extension of linear regression in which the output variable is binary (e.g., only black or white) rather than continuous. | Classification
Decision tree | Splits data-feature values into branches at decision nodes until a final output is reached. | Regression, Classification
Naive Bayes | Uses Bayes' theorem to update prior knowledge of an event with the independent probability of each feature that can affect the event. | Regression, Classification
Support vector machine | Finds a hyperplane that optimally divides the classes. | Regression, Classification
Random forest | Built upon decision trees: generates many simple decision trees and uses a 'majority vote' to decide which label to return. For a classification task, the final prediction is the label with the most votes; for a regression task, it is the average prediction of all the trees. | Regression, Classification
AdaBoost | Uses a multitude of models to come to a decision, but weighs them based on their accuracy in predicting the outcome. | Regression, Classification
Gradient-boosting trees | Focuses on the errors made by the previous trees and tries to correct them. | Regression, Classification
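
To make the table concrete, here is a minimal classification sketch using scikit-learn (assumed installed); the tiny dataset and its meaning are invented purely for illustration:

    # A minimal supervised-classification sketch with scikit-learn.
    from sklearn.linear_model import LogisticRegression

    # Features: [hours_studied, hours_slept]; label: 1 = passed, 0 = failed.
    X = [[2, 9], [1, 5], [5, 1], [6, 8], [7, 6], [3, 4]]
    y = [1, 0, 0, 1, 1, 0]

    clf = LogisticRegression()    # binary output, so a classification model
    clf.fit(X, y)                 # learn the feature-to-label relationship
    print(clf.predict([[4, 7]]))  # predict the label for a new example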

Unsupervised learning

List of techniques

Algorithm | Description | Type
K-means clustering | Puts data into k groups, each containing data with similar characteristics (as determined by the model, not in advance by humans). | Clustering
Gaussian mixture model | A generalization of k-means clustering that provides more flexibility in the size and shape of the groups (clusters). | Clustering
Hierarchical clustering | Splits clusters along a hierarchical tree to form a classification system. | Clustering
Recommender system | Helps to define the relevant data for making a recommendation. | Clustering
PCA/t-SNE | Decreases the dimensionality of the data, reducing the number of features to the 3 or 4 vectors with the highest variance. | Dimension reduction
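
A minimal k-means sketch with scikit-learn (assumed installed); the 2-D points are invented for illustration:

    # The model discovers the groups itself; no human-provided labels.
    from sklearn.cluster import KMeans

    X = [[1, 2], [1, 4], [1, 0],       # one loose blob near x = 1
         [10, 2], [10, 4], [10, 0]]    # another blob near x = 10

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)              # cluster assignment for each point
    print(kmeans.cluster_centers_)     # the k learned group centres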

Under-fitting & Over-fitting

We try to make the machine learning algorithm fit the input data by increasing or decreasing the model's capacity. In regression problems, we do this by increasing or decreasing the degree of the polynomial, as the sketch after the list below illustrates.

  • Under-fitting: when the model has too few features and is therefore not able to learn from the data very well. Such a model has high bias.
  • Over-fitting: when the model uses overly complex functions and is therefore able to fit the training data very well, but is not able to generalize to predict new data. Such a model has high variance.
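
A sketch of varying model capacity via polynomial degree (scikit-learn and NumPy assumed installed; the noisy quadratic data is generated just for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Noisy samples of y = x^2, invented for the demonstration.
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
    y = X.ravel() ** 2 + rng.normal(scale=1.0, size=40)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for degree in (1, 2, 15):   # under-fit, reasonable fit, over-fit
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        # High bias: poor score on both splits; high variance: good on
        # the training split but poor on the held-out split.
        print(degree, model.score(X_train, y_train), model.score(X_test, y_test))

A degree-1 model scores poorly on both splits (high bias), while a degree-15 model scores much better on the training split than on the test split (high variance).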


Machine Learning Terminology


Labels

  • A label is the thing we're predicting: the y variable in simple linear regression.
  • Examples: a future price, the kind of animal shown in a picture, the meaning of an audio clip, etc.

Features

  • A feature is an input variable: the x variable in simple linear regression.
  • An ML project can have a single feature or millions of features, specified as: x1, x2, …, xn
  • In a spam detector, the features could include the following:
    • words in the email text
    • sender's address
    • time of day the email was sent
    • whether the email contains the phrase "one weird trick"

Examples

  • An example is a particular instance of data, x.
  • Examples fall into two categories: labelled examples and unlabeled examples.
  • A labelled example includes both the feature(s) and the label: {features, label}: (x, y)
  • Labelled examples are used to train the model.
  • An unlabeled example contains features but not the label: {features, ?}: (x, ?)
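
In code, labelled and unlabeled examples can be represented as plainly as Python dictionaries and tuples; the feature names and values below are hypothetical, continuing the spam-detector example:

    # Hypothetical labelled and unlabeled examples for a spam detector;
    # all feature names, values, and labels are invented for illustration.

    # Labelled examples: {features, label} -> (x, y); used for training.
    labelled = [
        ({"has_one_weird_trick": True,  "sent_hour": 3},  "spam"),
        ({"has_one_weird_trick": False, "sent_hour": 14}, "not spam"),
    ]

    # Unlabeled examples: {features, ?} -> (x, ?); the label is unknown.
    unlabeled = [
        {"has_one_weird_trick": True, "sent_hour": 2},
    ]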


Models

  • A model defines the relationship between features and label.
  • There are two phases in a model's life:
    • Training means creating or learning the model: the model is shown labelled examples and gradually learns the relationships between features and label.
    • Inference means applying the trained model to unlabeled examples.
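
A minimal sketch of the two phases, using scikit-learn (assumed installed) with a toy labelled dataset invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Training phase: show the model labelled examples.
    X_train = [[0, 1], [1, 1], [1, 0], [0, 0]]  # features
    y_train = [1, 1, 0, 0]                      # labels
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Inference phase: apply the trained model to an unlabeled example.
    print(model.predict([[1, 0.5]]))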


Bias

  • An ML model has low bias if its predictions are, on average, close to the true values.
  • Hence, it makes fewer systematic mistakes when working on a dataset.


Cross-validation

  • A technique that provides a more accurate measure of an ML model's performance by training and evaluating it on several different splits of the data.
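
A minimal sketch with scikit-learn (assumed installed), scoring a model across five folds of a built-in dataset:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # Each fold is held out once while the model trains on the rest;
    # averaging the fold scores gives a more reliable performance estimate
    # than a single train/test split.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())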

Epoch

  • One complete traversal through the entire training dataset.
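
A schematic training loop makes the term concrete; train_step below is a hypothetical stand-in for the actual parameter update:

    # Schematic only: one epoch = one full pass over the training set,
    # here processed in mini-batches.
    def train(dataset, batch_size, n_epochs, train_step):
        for epoch in range(n_epochs):
            for start in range(0, len(dataset), batch_size):
                batch = dataset[start:start + batch_size]
                train_step(batch)  # update the model parameters on this batch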

Underfitting

  • If an ML model is not able to predict with a decent level of accuracy, the model underfits.
  • This can be due to:
    • not selecting the correct features for the prediction
    • the problem being too complex for the chosen model


Overfitting

  • Overfitting occurs when the model fits the training data too well.
  • An overfitting model learns the detail and noise in the training data, to the point that it negatively impacts performance on new data.
  • It can be addressed by decreasing the number of features/inputs or by increasing the number of training examples.


3 ML approaches

  • TensorFlow
  • Cloud ML Engine
  • Pre-built models


TensorFlow

  • Has an Estimator API
  • A model abstraction layer (layers, error functions)
  • A low-level Python library
  • A low-level C++ library
  • Hardware support (CPU, GPU, TPU)
  • Execution
    • There are build and run stages.
    • The build stage builds the computation graph.
    • The run stage executes the tensors, possibly in a distributed manner.
  • Placeholders with feed_dict are a dynamic way of passing values at run time (see the sketch after this list).
  • Data types are inferred implicitly.
  • Eager execution allows you to avoid the separate build-then-run stages; it is used mainly for testing and debugging purposes.
  • Model types:
    • LinearRegressor
    • DNNRegressor (deep neural network)
    • LinearClassifier
    • DNNClassifier
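
The build-then-run flow and feed_dict can be seen in a minimal sketch, assuming the TensorFlow 1.x API that this section describes (tf.placeholder and tf.Session do not exist in 2.x):

    import tensorflow as tf

    # Build stage: define the graph; no computation happens yet. The
    # placeholder's values will be supplied later through feed_dict.
    x = tf.placeholder(tf.float32, shape=[None], name="x")
    y = x * 2.0 + 1.0

    # Run stage: execute the graph, feeding concrete values for x.
    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # [3. 5. 7.]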


ML Engine

  • Does both training and prediction
  • Supports TensorFlow, XGBoost, scikit-learn, and Keras (in beta)


Other terms

  • Training example: a sample from x together with its output from the target function.
  • Target function: the mapping function f from x to f(x).
  • Hypothesis: an approximation of f; a candidate function.
  • Concept: a boolean target function, with positive and negative examples for the 1/0 class values.
  • Classifier: the output of the learning program; it can be used to classify new examples.
  • Learner: the process that creates the classifier.
  • Hypothesis space: the set of possible approximations of f that the algorithm can create.
  • Version space: the subset of the hypothesis space that is consistent with the observed data.