• Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service in AWS
  • It is an RDBMS
    • built specifically for OLAP
    • optimized for high-performance analysis and reporting
    • for very large datasets
  • Provides fast querying capabilities over very large datasets
  • Uses standard SQL commands for interactive queries
  • Supports connectivity via ODBC or JDBC
  • It is based on PostgreSQL
  • It also automatically monitors nodes and drives to support recovery from any failures.
  • Has a massively parallel processing (MPP) architecture to parallelize and distribute SQL operations
  • Stores three copies of your data: all data written to a node in the cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3.
  • Automated snapshots are incremental and continuous, and are stored for a user-defined retention period (1-35 days)
  • Manual snapshots can be created and are retained until deleted.
  • Continuously monitors health of cluster
  • Automatically re-replicates data from failed drives and replaces nodes as necessary.
  • has three pricing components:
    • data warehouse node hours – total number of hours run across all the compute nodes
    • backup storage – storage cost for automated and manual snapshots
    • data transfer
      • There is no data transfer charge for data transferred to or from Amazon Redshift outside of Amazon VPC
      • Data transfer to or from Amazon Redshift in Amazon VPC accrues standard AWS data transfer charges.
  • The number of nodes can be easily scaled up or down as demand changes

Clusters and Nodes

  • A cluster is composed of a leader node and one or more compute nodes.
  • A client application interacts directly only with the leader node
  • Compute nodes are transparent to external applications.
  • Redshift provides support for six different node types and each has a different mix of CPU, memory, and storage.
  • node types are grouped into categories
    • Dense Compute – supports clusters up to 326 TB using fast SSDs
    • Dense Storage – supports clusters up to 2 PB using large magnetic disks.

Data Types

  • Redshift columns support a wide range of data types
  • Supported data types are
    • numeric – INTEGER, DECIMAL, and DOUBLE PRECISION
    • text – CHAR and VARCHAR
    • date – DATE and TIMESTAMP
  • Additional columns can be added to a table using the ALTER TABLE command
  • Existing columns cannot be modified.
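As a sketch of the two points above (table and column names are hypothetical), a column can be added but not altered in place, so the usual workaround is add-copy-drop-rename:

```sql
-- Add a new column to an existing table (hypothetical names).
ALTER TABLE sales ADD COLUMN discount DECIMAL(8,2) DEFAULT 0;

-- Existing columns cannot be modified in place; a common workaround is to
-- add a replacement column, copy the data, then drop and rename:
ALTER TABLE sales ADD COLUMN order_note_new VARCHAR(512);
UPDATE sales SET order_note_new = order_note;
ALTER TABLE sales DROP COLUMN order_note;
ALTER TABLE sales RENAME COLUMN order_note_new TO order_note;
```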

Compression Encoding

  • Data compression is one of Redshift’s key performance optimizations
  • When data is loaded into an empty table, Redshift samples the data and selects the best compression scheme for each column.
  • Users can also specify the compression encoding on a per-column basis with the CREATE TABLE command.
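A minimal sketch of per-column encodings, assuming a hypothetical events table (the encodings chosen here are illustrative, not recommendations):

```sql
-- Explicit per-column compression encodings (hypothetical table).
CREATE TABLE events (
    event_id   BIGINT        ENCODE delta,
    event_type CHAR(8)       ENCODE bytedict,
    payload    VARCHAR(1024) ENCODE lzo,
    created_at TIMESTAMP     ENCODE delta32k
);

-- Inspect the encodings Redshift chose or was given:
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'events';
```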

Distribution Strategy

  • Distribution strategy determines how records are distributed across the nodes and slices in a cluster
  • The distribution style of a table configures how its data is partitioned to match expected query patterns
  • When a query runs, the optimizer redistributes rows across nodes as needed to perform joins and aggregates
  • The goals of choosing a table distribution style are
    • to minimize the impact of the redistribution step
    • to put data where it needs to be before the query is performed.
  • The best strategy for each table balances uniform data distribution against the locality needs of its queries
  • When creating a table, you can choose between three distribution styles: EVEN, KEY, or ALL.
  • EVEN distribution – the default option; data is distributed across slices in a uniform fashion regardless of the data.
  • KEY distribution – rows are distributed according to the values in one column. The leader node stores matching values close together, increasing query performance for joins.
  • ALL distribution – a full copy of the entire table is distributed to every node. Useful for small tables that are not updated frequently.
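The KEY and ALL styles above can be sketched as follows (table and column names are hypothetical):

```sql
-- KEY distribution: rows with the same customer_id land on the same slice,
-- so joins on customer_id can avoid the redistribution step.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    total       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL distribution: a small, rarely updated dimension table
-- copied in full to every node.
CREATE TABLE regions (
    region_id   INTEGER,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
```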

Loading Data 

  • For bulk operations, Redshift provides the COPY command
  • This avoids repeatedly calling INSERT for bulk uploads
  • When loading data from S3, the COPY command can read from multiple files at the same time.
  • Redshift can distribute the workload across nodes and perform the load process in parallel.
  • Enable parallel processing by using a cluster with multiple nodes and multiple input files instead of a single large file
  • After a bulk data load, run the VACUUM command to reorganize the data and reclaim space after deletes.
  • It is also suggested to run the ANALYZE command to update table statistics.
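The load-then-maintain sequence above looks roughly like this (bucket, prefix, IAM role ARN, and table name are hypothetical):

```sql
-- Bulk load from S3; COPY reads the multiple files under the prefix
-- in parallel across the compute nodes.
COPY orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER '|'
GZIP;

-- After the load (and after large deletes), reorganize the table
-- and refresh the statistics used by the query planner:
VACUUM orders;
ANALYZE orders;
```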

Querying Data

  • Redshift offers standard SQL commands to query tables
  • It supports SELECT to query and join tables
  • For complex queries, the query plan can be analyzed to optimize the access pattern.
  • Cluster and query performance can be monitored using CloudWatch and the Redshift web console.
  • To support many concurrent users, configure Workload Management (WLM) to queue and prioritize queries.
  • WLM defines multiple queues and sets a concurrency level for each queue.
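Analyzing a query plan can be sketched with EXPLAIN (tables are hypothetical); redistribution steps show up as data-movement operators in the plan:

```sql
-- Inspect the query plan before optimizing; look for expensive
-- data redistribution steps in joins and aggregates.
EXPLAIN
SELECT r.region_name, SUM(o.total)
FROM orders o
JOIN regions r ON r.region_id = o.region_id
GROUP BY r.region_name;
```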

Snapshots

  • Point-in-time snapshots of a Redshift cluster can be created
  • They can be used to restore a copy or create a clone of the original cluster
  • They are durably stored internally in S3 by Redshift.
  • Redshift supports automated and manual snapshots
  • Automated snapshots – Redshift periodically takes snapshots of the cluster and keeps each copy for a configurable retention period.
  • Manual snapshots – can be created and shared across regions or with other AWS accounts; they must be explicitly deleted.

Security

  • Securing a Redshift cluster is similar to securing other databases in the cloud.
  • security plan must include controls to protect
    • infrastructure resources
    • database schema
    • records in table
    • network access.
  • Security must be addressed at every level to securely operate a Redshift data warehouse.
  • First layer – the infrastructure level, using IAM policies to limit actions such as
    • to create and manage lifecycle of a cluster
    • scaling
    • backup
    • recovery operations.
  • Deploy clusters within the private IP address space of a VPC to restrict network connectivity
  • Achieve fine-grained network access control with security groups and network ACLs at the subnet level.
  • At the database level, create a master user account that can create more users and groups.
  • Each database user can be granted permission to
    • Schemas
    • Tables
    • other database objects
  • These permissions are independent from IAM policies
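Database-level access control can be sketched as below (user, group, schema, and password are hypothetical); note these GRANTs live inside the database and are independent of IAM policies:

```sql
-- Create a user and group, then grant schema- and table-level access.
CREATE USER analyst PASSWORD 'Str0ngPassw0rd';
CREATE GROUP reporting;
ALTER GROUP reporting ADD USER analyst;

GRANT USAGE ON SCHEMA analytics TO GROUP reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP reporting;
```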