Database and Replication

Understand the concepts of Database and Replication.

Cloud continues to drive down the cost of storage and compute
New applications need databases to store terabytes to petabytes of storage
new types of data is to be stored
Access to the data with millisecond latency is needed
process millions of requests per second
Scale to support millions of users anywhere in the world
Both relational and non-relational databases are needed
AWS offers broadest range of databases.

AWS Managed database services

removes constraints that come with licensing costs
supports diverse database engines
Access to the information stored on these databases is the main purpose of cloud computing.

There are three different categories of databases to keep in mind while architecting:

Relational databases – Data is normalized into tables and also provided with
- powerful query language
- flexible indexing capabilities
- strong integrity controls
- ability to combine data from multiple tables in a fast and efficient manner.
- Can be scaled vertically and are highly available during failovers.
NoSQL databases– They have flexible data model that seamlessly scales horizontally. NoSQL databases utilize a variety of data models, including graphs, key-value pairs, and JSON documents. They offer
- ease of development
- scalable performance
- high availability
- resilience
Data warehouse – A specialized type of relational database, optimized for analysis and reporting of large amounts of data. It can be used to combine transactional data from disparate sources making them available for analysis and decision-making.
- Horizontal-scaling is usually based on the partitioning of the data i.e. each node contains only part of the data
- Vertical-scaling the data resides on a single node and scaling is done through multi-core i.e. spreading the load between the CPU and RAM resources of that machine.

Database Engines

Amazon Aurora
PostgreSQL
MySQL
MariaDB
Oracle
Microsoft SQL Server

Relational Database

Relational databases store data with pre-defined schema
relationships between them are defined
designed for supporting ACID transactions
maintaining referential integrity
maintain data consistency
Used for
- Traditional applications
- ERP
- CRM
- e-commerce
AWS Offerings
- Amazon Aurora
- MySQL, PostgreSQL
- Amazon RDS
- MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
- Amazon Redshift

Key-value databses

Optimized to store and retrieve key-value pairs
large data volumes with milliseconds retrieval
without the performance overhead
scale limitations of relational databases
Used for
- Internet-scale applications
- real-time bidding
- shopping carts
- customer preferences
AWS Offering
- Amazon DynamoDB

Document Databases

Designed to store semi-structured data
Stores documents
Data is typically represented as a readable document
Used for
- Content management
- Personalization
- mobile applications
AWS Offering
- Amazon DocumentDB (with MongoDB compatibility)

In-memory Databases

Used for applications that require real time access to data.
By storing data directly in memory, microsecond latency is provided
Used for
- Caching
- gaming leaderboards
- real-time analytics
AWS Offerings:
- Amazon ElastiCache for Redis
- Amazon ElastiCache for Memcached

Graph Databases

Used for applications needing to query and navigate relationships between highly connected, graph datasets with millisecond latency.
Used for
- Fraud detection
- social networking
- recommendation engines
AWS Offering
- Amazon Neptune

Time Series Databases

Used to efficiently
- Collect
- Synthesize
- derive insights from enormous amounts of data

that changes over time (known as time-series data)

Used for
- IoT applications
- DevOps
- industrial telemetry
AWS Offering:
- Amazon Timestream

Ledger Databases

Used for a centralized, trusted authority to maintain a scalable, complete and cryptographically verifiable record of transactions.
Used for
- Systems of record
- supply chain
- registrations
- banking transactions
AWS Offering:
- Amazon Quantum Ledger Database (QLDB)

Amazon RDS

Expands to Relational Database Service
Eliminates much of relational database management
Can be scaled independently
- CPU
- memory
- storage
- IOPS
AWS manages
- backups
- software patching
- automatic failure detection
- recovery
Can trigger, manual or automated backups

It provides high availability with primary instance which if fails, switch to secondary instance
Has a soft limit of 40 Amazon RDS DB instances per account
From 40, up to 10 can be Oracle or Microsoft SQL Server DB instances under the License Included model.
Customers can Bring Own License (BYOL) model to have all 40 DB instances for Oracle or Microsoft SQL Server
Supports database engines
- MySQL
- MariaDB
- PostgreSQL
- Oracle
- Microsoft SQL Server
- MySQL-compatible Amazon Aurora
- AWS IAM controls AWS resources access to Amazon RDS databases.

For security:

put database in an Amazon Virtual Private Cloud (Amazon VPC)
using Secure Sockets Layer (SSL) for data in transit
Using encryption for data at rest

Further,

RDS APIs and the AWS Management Console provide a management interface that allows you to create, delete, modify, and terminate RDS DB instances; to create DB snapshots; and to perform point-in-time restores
There is no AWS data API for Amazon RDS.
Once a database is created, RDS provides a DNS endpoint for the database which can be used to connect to the database.
Endpoint does not change over the lifetime of the instance even during the failover in case of Multi-AZ configuration
RDS leverages Amazon EBS volumes as its data store
RDS provides database backups, which are replicated across multiple AZ’s
Then, RDS Multi AZ’s feature synchronously replicate data between a primary RDS DB instance and a standby instance in another Availability Zone, which prevents data loss,
RDS provides a DNS endpoint and in case of an failure on the primary, it automatically fails over to the standby instance
Next, RDS also allows Read replicas for the supported databases, which are replicated asynchronously
RDS Provisioned IOPS, where the IOPS can be specified when the instance is launched and is guaranteed over the life of the instance

Resource	Default Limit
Cross-region snapshots copy requests	5
DB Instances	40
Event subscriptions	20
Manual snapshots	100
Option groups	20
Parameter groups	50
Read replicas per master	5
Reserved instances	40
Rules per DB security group	20
Rules per VPC security group	50 inbound 50 outbound
DB Security groups	25
VPC Security groups	5
Subnet groups	50
Subnets per subnet group	20
Tags per resource	50
Total storage for all DB instances	100 TiB

Amazon DynamoDB

It is a fully managed NoSQL database service
It provides fast and predictable performance with seamless scalability.
Offload the administrative burdens of operating and scaling a distributed database
It also offers encryption at rest
Create database tables that can store and retrieve any amount of data
Serve any level of request traffic.
Scale up or down tables’ throughput capacity without downtime or performance degradation
Use the AWS Management Console to monitor resource utilization.
Also, provides on-demand backup capability with full backups of tables for long-term retention and archival.
tables, items, and attributes are the core components
A table is a collection of items
item is a collection of attributes.
primary keys are used to uniquely identify each item in a table
secondary indexes provide more querying flexibility
DynamoDB Streams capture data modification events in DynamoDB tables.

DynamoDB components

Tables

Similar to other database systems, DynamoDB stores data in tables.
A table is a collection of data
Example, see the example table called People, is listed below, to store personal contact information about friends, family, or anyone else of interest. You could also have a Cars table to store information about vehicles that people drive.

Items

Each table contains zero or more items
An item is a group of attributes that is uniquely identifiable among all of the other items.
In a People table, each item represents a person. For a Cars table, each item represents one vehicle.
Items are similar in many ways to rows, or tuples in other database systems.
There is no limit to the number of items you can store in a table.

Attributes

Each item is composed of one or more attributes.
It is a fundamental data element, and is not to be broken down any further.
For example, an item in a People table contains attributes called PersonID, LastName, FirstName, and so on.
Attributes are similar to fields or columns in other database systems.

The following diagram shows a table named People with some example items and attributes.

Note the following about the People table:

Each item in the table has a unique identifier, or primary key, that distinguishes the item from all of the others in the table. In the People table, the primary key consists of one attribute (PersonID).
Other than the primary key, the People table is schemaless, which means that neither the attributes nor their data types need to be defined beforehand. Each item can have its own distinct attributes.
Most of the attributes are scalar, which means that they can have only one value. Strings and numbers are common examples of scalars.
Some of the items have a nested attribute (Address). DynamoDB supports nested attributes up to 32 levels deep.

DynamoDB supports two different kinds of primary keys:

Partition key – A simple primary key, composed of one attribute known as the partition key. Partition key’s value is input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. In a table that has only a partition key, no two items can have the same partition key value. The People table has a simple primary key (PersonID) to access any item in the table directly by providing it.
Partition key and sort key – Called as a composite primary key. Has two attributes – first is the partition key, and second is the sort key. Partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key value are stored together, in sorted order by sort key value.

Secondary Indexes –

Can create one or more secondary indexes on a table.
Query the table using an alternate key by secondary index, instead of primary key.
DynamoDB doesn’t require index, but index give more flexibility during querying data.
With a secondary index on a table, read data from index similar as, from the table.

DynamoDB index types

Global secondary index – Partition key and sort key that can be different from those on the table.
Local secondary index –Same partition key as the table, but a different sort key.

DynamoDB supports eventually consistent and strongly consistent reads.

Eventually Consistent Reads – When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat read request after a short time, the response should return the latest data.
Strongly Consistent Reads – When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful. A strongly consistent read might not be available if there is a network delay or outage. Consistent reads are not supported on global secondary indexes (GSI).

Amazon Redshift

It is a fast, scalable data warehouse
Used to analyze all data across data warehouse and data lake.
Integrates with machine learning, parallel query execution, and columnar storage.
Setup and deploy a new data warehouse in minutes
Run queries across petabytes in Redshift, and exabytes in data lake.
uses columnar storage, data compression, and zone maps to reduce I/O to perform queries.
Has a massively parallel processing (MPP) architecture to parallelize and distribute SQL operations
stores three copies of data — all data written to a node in cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3.
Snapshots are automated, incremental, and continuous and stored for a user-defined period (1-35 days)
Manual snapshots can be created and are retained until deleted.
Continuously monitors health of cluster
Automatically re-replicates data from failed drives and replaces nodes as necessary.

Three pricing components:

data warehouse node hours – total number of hours run across all the compute node
backup storage – storage cost for automated and manual snapshots
data transfer

number of nodes can be easily scaled as per demand

Monitoring Amazon RDS

Amazon RDS Events – Subscribe to Amazon RDS events to be notified when changes occur with a DB instance, DB cluster, DB snapshot, DB cluster snapshot, DB parameter group, or DB security group.
Database log files – View, download, or watch database log files using the Amazon RDS console or Amazon RDS API actions. You can also query some database log files that are loaded into database tables.
Amazon RDS Enhanced Monitoring — Look at metrics in real-time for the operating system that DB instance or DB cluster runs on.
Amazon CloudWatch Metrics – Amazon RDS automatically sends metrics to CloudWatch every minute for each active database instance and cluster.
Then, Amazon CloudWatch Alarms – Watch a single Amazon RDS metric over a specific time period, and perform actions based on the value of the metric relative to a threshold you set.
Amazon CloudWatch Logs – MariaDB, MySQL, and Aurora MySQL enable you to monitor, store, and access database log files in CloudWatch Logs.