Loading Data Google Professional Data Engineer GCP

  • You can load data:
    • From Cloud Storage
    • From other Google services
    • From a readable data source (such as your local machine)
    • By inserting individual records using streaming inserts
    • Using DML statements to perform bulk inserts
    • Using a BigQuery I/O transform in a Dataflow pipeline to write data to BigQuery
  • You can load data into a new table or partition.
  • You can also append data to an existing table or partition.
  • You can overwrite a table or partition.
  • Methods for ingesting data into BigQuery:
    • the BigQuery Jobs API
    • streaming writes
    • writing query results into a table
    • loading CSV files from Cloud Storage
    • using BigQuery as a Cloud Dataflow sink
  • The default source format for loading data is CSV.
  • BigQuery also supports streaming inserts through the BigQuery API; records are buffered briefly before insertion (see the sketch below).
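As a rough illustration of the two most common paths, the sketch below uses the google-cloud-bigquery Python client to run a load job from Cloud Storage with an explicit write disposition, and then to stream individual rows; all project, dataset, table, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical destination table

# Load job: append CSV files from Cloud Storage to an existing table.
# Swap in WRITE_TRUNCATE to overwrite the table (or target partition) instead.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data/*.csv", table_id, job_config=job_config
)
load_job.result()  # block until the load job finishes

# Streaming insert: rows land in a streaming buffer before being
# committed to columnar storage.
errors = client.insert_rows_json(table_id, [{"name": "alice", "score": 42}])
if errors:
    print("Streaming insert errors:", errors)
```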

 

You can load data from Cloud Storage and from readable data sources in the following formats:

Cloud Storage:

  • Avro
  • CSV
  • JSON (newline delimited only)
  • ORC
  • Parquet
  • Datastore exports
  • Firestore exports

Readable data sources, such as your local machine (see the example after this list):

  • Avro
  • CSV
  • JSON (newline delimited only)
  • ORC
  • Parquet
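Loading from a readable data source looks much the same, except the bytes are uploaded from the local machine rather than referenced by URI. A minimal sketch, assuming a hypothetical newline-delimited JSON file and an existing destination table with a matching schema:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# Upload a local file (a "readable data source") into BigQuery.
with open("events.json", "rb") as source_file:  # hypothetical local file
    job = client.load_table_from_file(
        source_file, "my-project.my_dataset.events", job_config=job_config
    )
job.result()  # raises if the load failed
```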

 

Choosing a data ingestion format

  • You can load data in a variety of formats.
  • Loaded data is converted into columnar format for Capacitor (BigQuery’s storage format).
  • When loading, choose a data ingestion format based on (see the sketch after this list):
    • your data’s schema
    • embedded newlines
    • external limitations
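The choice surfaces in the load job configuration as the source format. Self-describing formats such as Parquet carry their own schema, so none needs to be supplied; a minimal sketch with hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Parquet embeds its schema, so no explicit schema or auto-detection is needed.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_uri(
    "gs://my-bucket/exports/data.parquet",
    "my-project.my_dataset.parquet_table",
    job_config=job_config,
).result()
```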

Loading encoded data

  • BigQuery supports UTF-8 encoding for both nested or repeated and flat data.
  • ISO-8859-1 encoding is supported for flat data only, and only for CSV files (see the sketch below).
  • By default, BigQuery expects all source data to be UTF-8 encoded.
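If a flat CSV file is not UTF-8, the encoding has to be declared on the load job so BigQuery can transcode it. A minimal sketch with hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Declare the ISO-8859-1 (Latin-1) encoding; without this, BigQuery
# assumes UTF-8 and may mangle accented characters.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",
)
client.load_table_from_uri(
    "gs://my-bucket/legacy/latin1.csv",
    "my-project.my_dataset.legacy_table",
    job_config=job_config,
).result()
```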

Loading compressed and uncompressed data

  • The Avro binary format is the preferred format for loading both compressed and uncompressed data.

Loading denormalized, nested, and repeated data

  • BigQuery performs best when data is denormalized, using nested and repeated fields rather than a fully relational schema (see the sketch below).
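For instance, instead of keeping orders and line items in separate tables and joining them, a denormalized table can embed the line items as a repeated nested record. A sketch with hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Each order row carries its line items in a repeated RECORD column,
# avoiding a join against a separate items table at query time.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField(
        "items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.my_dataset.orders", schema=schema))
```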

 

Schema auto-detection

  • Available when you load data into BigQuery.
  • Also available when you query an external data source.
  • Steps:
    • BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a sample.
    • It then examines each field and attempts to assign a data type to that field based on the values in the sample.
  • Use schema auto-detection for JSON or CSV files (see the sketch below).
  • Not available for Avro files, ORC files, Parquet files, Datastore exports, or Firestore exports, since those formats are self-describing.
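In the Python client, auto-detection is a single flag on the job configuration; a minimal sketch with hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# autodetect=True asks BigQuery to sample the CSV and infer column types,
# so no schema is supplied.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    skip_leading_rows=1,
)
client.load_table_from_uri(
    "gs://my-bucket/new_data.csv",
    "my-project.my_dataset.autodetected_table",
    job_config=job_config,
).result()
```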

 

BigQuery Data Transfer Service

It automates loading data into BigQuery from these services (a transfer-config sketch follows the lists):

Google Software as a Service (SaaS) apps

  • Campaign Manager
  • Cloud Storage
  • Google Ad Manager
  • Google Ads
  • Google Merchant Center (beta)
  • Google Play
  • Search Ads 360 (beta)
  • YouTube Channel reports
  • YouTube Content Owner reports

External cloud storage providers

  • Amazon S3

Data warehouses

  • Teradata
  • Amazon Redshift
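Transfers are configured once and then run on a schedule. Below is a sketch using the google-cloud-bigquery-datatransfer Python client for a recurring Cloud Storage transfer; the names and the params keys shown are illustrative assumptions, and each data source documents its own parameter set:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Recurring transfer from Cloud Storage into an existing table.
# params keys vary by data source (e.g. Amazon S3 needs credentials and
# a data path); the keys below are assumed for the Cloud Storage source.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="Daily GCS load",  # hypothetical name
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://my-bucket/exports/*.csv",
        "destination_table_name_template": "events",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)
transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)
```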