Non-relational Schema Design in Cloud Bigtable: GCP Data Engineer

In this tutorial, we will learn the basics of non-relational schema design in Cloud Bigtable.

The following concepts are crucial:
  • Each table has only one index, the row key.
  • There are no secondary indices.
  • Rows are sorted lexicographically by row key, from the lowest to the highest byte string.
  • Row keys are sorted in big-endian byte order (sometimes called network byte order), the binary equivalent of alphabetical order.
  • Columns are grouped by column family and sorted in lexicographic order within the column family.
  • All operations are atomic at the row level.
    • Avoid schema designs that require atomicity across rows.
    • In general, keep all information for an entity in a single row. An entity that doesn’t need atomic updates and reads can be split across multiple rows.
  • Reads and writes should be distributed evenly across the row space of the table.
  • Store related entities in adjacent rows, for efficient reads.
  • Cloud Bigtable tables are sparse.
  • Empty columns don’t take up any space.
  • It’s better to have a few large tables than many small tables.
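
The sort-order rules above can be illustrated without a Bigtable client. The following is a minimal sketch in plain Python that assumes row keys compare the same way Python compares `bytes` objects (byte by byte, lowest first). The key names and the helpers `encode_be` and `encode_le` are hypothetical and exist only for this illustration.

```python
import struct

# Row keys are compared as raw byte strings, so numbers stored as text
# do not sort numerically unless they are zero-padded to a fixed width.
unpadded = [b"user#2", b"user#10", b"user#1"]
padded = [b"user#0002", b"user#0010", b"user#0001"]

sorted_unpadded = sorted(unpadded)  # b"user#10" lands before b"user#2"
sorted_padded = sorted(padded)      # numeric order is preserved


def encode_be(n: int) -> bytes:
    """Encode an unsigned 64-bit integer in big-endian byte order."""
    return struct.pack(">Q", n)


def encode_le(n: int) -> bytes:
    """Encode an unsigned 64-bit integer in little-endian byte order."""
    return struct.pack("<Q", n)


ids = [2, 10, 1, 256]

# Big-endian encodings sort bytewise in the same order as the integers.
by_big_endian = sorted(ids, key=encode_be)

# Little-endian encodings do not: 256 starts with a 0x00 byte and sorts first.
by_little_endian = sorted(ids, key=encode_le)

print(sorted_unpadded)
print(sorted_padded)
print(by_big_endian)
print(by_little_endian)
```

If numeric identifiers appear in a row key, zero-padding or a fixed-width big-endian encoding keeps numerically adjacent entities in adjacent rows.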

Size limits

  • Cloud Bigtable places size limits on the data you store within tables.
  • As a best practice, store a maximum of 10 MB in a single cell and 100 MB in a single row.
  • Size limits are measured in binary megabytes (MB), where 1 MB is 2^20 bytes. This unit is also called a mebibyte (MiB).
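
The arithmetic behind these limits can be checked with a few lines of Python. This is a sketch only: the constants come from the figures above, while the helper names `fits_in_cell` and `fits_in_row` are hypothetical, and the row check sums cell values only, ignoring the bytes used by row keys, qualifiers, and timestamps.

```python
from typing import Dict

MIB = 2 ** 20                  # one binary megabyte (mebibyte) in bytes
MAX_CELL_BYTES = 10 * MIB      # 10 MB per cell  -> 10,485,760 bytes
MAX_ROW_BYTES = 100 * MIB      # 100 MB per row  -> 104,857,600 bytes


def fits_in_cell(value: bytes) -> bool:
    """Return True if a single cell value is within the per-cell limit."""
    return len(value) <= MAX_CELL_BYTES


def fits_in_row(cells: Dict[str, bytes]) -> bool:
    """Return True if every cell fits and the summed values fit in a row.

    Simplification: only cell values are counted toward the row size.
    """
    if not all(fits_in_cell(v) for v in cells.values()):
        return False
    return sum(len(v) for v in cells.values()) <= MAX_ROW_BYTES


print(MAX_CELL_BYTES, MAX_ROW_BYTES)
print(fits_in_cell(b"x" * 1024))
```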

Choosing a row key

  • Think carefully about how you compose the row key.
  • Efficient queries use the row key, a row key prefix, or a row range to retrieve the data.
  • Other types of queries trigger a full table scan, which is much less efficient.
  • Factors that influence row key selection:
    • User information needed
    • User-generated content
    • Time series data requirements
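
To show why prefix and range lookups are cheaper than a full scan, here is a small in-memory sketch. It models a table as a sorted list of row keys and uses binary search to find the range for a prefix. It does not call the Bigtable API; the function names `prefix_end`, `prefix_scan`, and `full_scan`, and the sample keys, are assumptions made for the illustration.

```python
from bisect import bisect_left
from typing import List


def prefix_end(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key with this prefix.

    An empty result means the range has no upper bound.
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return b""
    return stripped[:-1] + bytes([stripped[-1] + 1])


def prefix_scan(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Locate the contiguous block of keys sharing a prefix via binary search."""
    start = bisect_left(sorted_keys, prefix)
    end_key = prefix_end(prefix)
    end = bisect_left(sorted_keys, end_key) if end_key else len(sorted_keys)
    return sorted_keys[start:end]


def full_scan(sorted_keys: List[bytes], suffix: bytes) -> List[bytes]:
    """A query that is not on a key prefix must inspect every row."""
    return [k for k in sorted_keys if k.endswith(suffix)]


keys = sorted([
    b"tenantA#user1",
    b"tenantA#user2",
    b"tenantB#user1",
    b"tenantC#user9",
])

print(prefix_scan(keys, b"tenantA#"))  # touches only the matching range
print(full_scan(keys, b"user1"))       # has to look at every key
```

The prefix lookup needs only two binary searches plus the matching rows, whereas the suffix filter examines every key in the table.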

Types of row keys

  • Keep row keys reasonably short.
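  • Long row keys take up additional memory and storage and increase the time it takes to get responses.
  • Reverse domain names – If you are storing data about entities that can be represented as domain names, use a reverse domain name (for example, com.company.product) as the row key. This is especially effective if each row’s data tends to overlap with adjacent rows. It works best when the data is spread across many different reverse domain names.
  • String identifiers – If entities can be identified by a simple string (for example, a user ID), use the string identifier as the row key or as part of it. Use human-readable values.
  • Timestamps – Including a timestamp as part of the row key is useful if you often need to retrieve data based on the time when it was recorded. Using a timestamp by itself as the row key, or placing it at the start of the row key, is not recommended, because most writes would be pushed onto a single node.
  • Multiple values in a single row key – Because the row key is the only index, it is often useful to include multiple identifiers in the row key.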
  • Row key prefixes
    • A row key prefix is the first value in a multi-value row key.
    • Storing related data in contiguous rows lets you read it as a range of rows instead of running inefficient table scans.
    • Row key prefixes provide a scalable solution for a “multi-tenancy” use case.
  • Row keys to avoid
    • Standard (non-reversed) domain names
    • Sequential numeric IDs
    • Frequently updated identifiers
    • Hashed values
  • Column families
    • A Cloud Bigtable table can use up to about 100 column families while maintaining good performance.
    • If a row contains multiple values that are related to one another, group those values into the same column family. Column families let you retrieve data from a specific family instead of retrieving all of the data in each row.
    • Keep column family names short, because they are included in the data transferred for each request.
  • Column qualifiers
    • You can create as many column qualifiers as you need in each row.
    • There is no space penalty for empty cells in a row.
    • Avoid splitting data across more column qualifiers than necessary.
    • Keep the names of column qualifiers short, because they are included in the data transferred for each request.
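
The row key patterns above (reverse domain names and multi-value keys with a leading prefix) can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed implementation: the `#` delimiter, the helper names `reverse_domain` and `make_row_key`, and the sample tenant, sensor, and date values are all hypothetical.

```python
from typing import List


def reverse_domain(domain: str) -> str:
    """Turn 'product.company.com' into 'com.company.product'."""
    return ".".join(reversed(domain.split(".")))


def make_row_key(*parts: str, delimiter: str = "#") -> bytes:
    """Join several identifiers into one row key.

    The first part acts as the row key prefix. The delimiter is an
    assumption of this sketch; parts must not contain it.
    """
    for part in parts:
        if delimiter in part:
            raise ValueError(f"part {part!r} contains the delimiter")
    return delimiter.join(parts).encode("utf-8")


# Reversed names place entries from the same organization in adjacent rows.
domains: List[str] = ["maps.example.com", "shop.other.org", "mail.example.com"]
reversed_sorted = sorted(reverse_domain(d) for d in domains)

# Tenant prefix first, timestamp last, so writes are not concentrated on a
# single time-ordered range at the start of the key space.
key = make_row_key("tenantA", "sensor42", "20240101")

print(reversed_sorted)
print(key)
```

Placing the tenant identifier first means all rows for one tenant form a contiguous range that can be read with a single prefix lookup.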