Loading Data into BigQuery - Google Professional Data Engineer (GCP)
- Ways to load data into BigQuery:
  - From Cloud Storage
  - From other Google services
  - From a readable data source
  - By inserting individual records using streaming inserts
  - Using DML statements to perform bulk inserts
  - Using a BigQuery I/O transform in a Dataflow pipeline to write data to BigQuery
 
- You can load data into a new table or partition, append data to an existing table or partition, or overwrite a table or partition.
- Methods for ingesting data into BigQuery (a minimal load-job sketch follows this list):
  - the BigQuery Jobs API
  - streaming writes
  - writing query results into a table
  - loading CSV files from Cloud Storage
  - using BigQuery as a Cloud Dataflow sink
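
As a quick illustration of the Jobs API path, here is a minimal sketch (using the google-cloud-bigquery Python client) that loads a CSV file from Cloud Storage; the project, dataset, table, and bucket names are placeholders, and the write disposition controls whether the load appends to or overwrites the destination table or partition:

```python
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"   # placeholder destination table
uri = "gs://my-bucket/path/data.csv"          # placeholder CSV file in Cloud Storage

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("visits", "INTEGER"),
    ],
    # WRITE_APPEND adds rows to an existing table or partition,
    # WRITE_TRUNCATE overwrites it, WRITE_EMPTY only writes if it is empty.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}.")
```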
 
- The default source format for loading data is CSV.
- BigQuery also supports streaming inserts through the BigQuery API; streamed records are buffered briefly before they are committed to storage (see the streaming sketch below).
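
A minimal streaming-insert sketch using the Python client's insert_rows_json; the table ID and rows are illustrative, and the destination table must already exist:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder; table must already exist

rows = [
    {"user_id": "u1", "event": "click"},
    {"user_id": "u2", "event": "purchase"},
]

# Rows are buffered by BigQuery briefly before being committed to storage.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Some rows failed to insert: {errors}")
```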
 
Load data from Cloud Storage and readable data sources in the following formats:
Cloud Storage:
- Avro
- CSV
- JSON (newline delimited only)
- ORC
- Parquet
- Datastore exports
- Firestore exports
 
Readable data source (such as local machine; a local-file load sketch follows this list):
- Avro
- CSV
- JSON (newline delimited only)
- ORC
- Parquet
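
For a readable data source such as a local file, the client's load_table_from_file uploads the bytes and runs the load job in one step; a rough sketch with placeholder names, using Parquet since it carries its own schema:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"   # placeholder destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # Parquet files are self-describing
)

# Upload a local file and load it in a single call.
with open("data.parquet", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # wait for the upload and load to finish
```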
 
Choosing a data ingestion format
- BigQuery can load data in a variety of formats.
- Loaded data is converted into a columnar format for Capacitor (BigQuery's storage format).
- When loading, choose a data ingestion format based on (see the format sketch below):
  - your data's schema
  - embedded newlines
  - external limitations
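
In the Python client the format choice is expressed through source_format; as a rough illustration, data containing embedded newlines is usually easier to load as newline-delimited JSON, while CSV needs quoted newlines enabled explicitly:

```python
from google.cloud import bigquery

# Newline-delimited JSON handles nested fields and embedded newlines naturally.
json_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# CSV can carry embedded newlines only inside quoted fields, and the load
# must opt in to allow them.
csv_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    allow_quoted_newlines=True,
)
```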
 
 
Loading encoded data
- BigQuery supports UTF-8 encoding for both nested or repeated and flat data.
- ISO-8859-1 encoding is supported for flat data only, and only for CSV files.
- By default, the BigQuery service expects all source data to be UTF-8 encoded (a CSV encoding sketch follows).
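
A short sketch of overriding the default encoding when loading a flat, Latin-1 encoded CSV file; the value maps to the load job's encoding setting, and other formats should remain UTF-8:

```python
from google.cloud import bigquery

# ISO-8859-1 is only valid for flat CSV data; everything else should be UTF-8.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",
    skip_leading_rows=1,
)
```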
 
Loading compressed and uncompressed data
- The Avro binary format is the preferred format for loading both compressed and uncompressed data.
 
Loading denormalized, nested, and repeated data
- BigQuery performs best when your data is denormalized; instead of preserving a relational schema, use nested and repeated fields (see the schema sketch below).
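
As a rough sketch of what a denormalized table can look like, the schema below uses a nested, repeated RECORD field in place of a separate joined table; all field and table names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    # A repeated RECORD keeps the line items inside the order row itself.
    bigquery.SchemaField(
        "line_items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.my_dataset.orders", schema=schema)  # placeholder ID
client.create_table(table)
```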
 
Schema auto-detection
- Available when you load data into BigQuery.
- Also available when you query an external data source.
- Steps:
  - BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample.
  - It then examines each field and attempts to assign a data type to that field based on the values in the sample.
- Use schema auto-detection for JSON or CSV files (see the sketch below).
- Not available for Avro files, ORC files, Parquet files, Datastore exports, or Firestore exports, because their schemas are self-describing.
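
A minimal auto-detection sketch for a CSV load (the same autodetect flag works for newline-delimited JSON); the URI and table ID are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # let BigQuery infer column names and types from a sample
    skip_leading_rows=1,
)

job = client.load_table_from_uri(
    "gs://my-bucket/sample.csv",             # placeholder source file
    "my-project.my_dataset.auto_table",      # placeholder destination table
    job_config=job_config,
)
job.result()
```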
 
BigQuery Data Transfer Service
It automates loading data into BigQuery from these services:
Google Software as a Service (SaaS) apps:
- Campaign Manager
- Cloud Storage
- Google Ad Manager
- Google Ads
- Google Merchant Center (beta)
- Google Play
- Search Ads 360 (beta)
- YouTube Channel reports
- YouTube Content Owner reports
 
External cloud storage providers:
- Amazon S3
 
Data warehouses:
- Teradata
- Amazon Redshift
 