Analyze the Business Problem

Look at the business problem objectively:

  • Identify whether it is genuinely a Big Data problem.
  • Sheer volume or cost alone may not be the deciding factor.
  • Multiple criteria, such as velocity, variety, challenges with the current system, and the time taken for processing, should be considered as well.

Some Common Use Cases:

  • Data Archival / Data Offload – Archive huge amounts of data spanning years at very low cost while keeping it active and queryable, as an alternative to tape storage.
  • Process Offload – Offload jobs that consume expensive MIPS cycles or extensive CPU cycles on current systems.
  • Data Lake Implementation – Helps in storing and processing massive amounts of data.
  • Unstructured Data Processing – Provides capabilities to store and process any amount of unstructured data natively.
  • Data Warehouse Modernization – Integrate Big Data and data warehouse capabilities to increase operational efficiency.

Capacity Planning

Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are:

  • Data volume for one-time historical load
  • Daily data ingestion volume
  • Retention period of data
  • HDFS replication factor, based on the criticality of the data
  • Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
  • Multi-datacenter deployment
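
As a rough illustration of how these factors combine, the sketch below estimates required HDFS capacity. Every figure is an assumed placeholder chosen for the example, not a recommendation:

```python
# Illustrative capacity estimate; all figures are assumed placeholders.
historical_tb = 40        # one-time historical load (TB)
daily_ingest_tb = 0.5     # daily ingestion volume (TB)
retention_days = 365      # retention period, sized for ~1 year
replication = 3           # HDFS replication factor
headroom = 1.25           # ~25% extra for temp/intermediate data

raw_tb = historical_tb + daily_ingest_tb * retention_days
required_tb = raw_tb * replication * headroom
print(f"Raw data: {raw_tb:.1f} TB -> HDFS capacity needed: {required_tb:.1f} TB")
```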

Hindsight, Insight, or Foresight

Hindsight, insight, and foresight correspond to three questions that come to mind when dealing with data: knowing what happened, understanding why it happened, and predicting what will happen. Hindsight is possible with aggregations and applied statistics. You can aggregate data by different groups and compare those results using statistical techniques, such as confidence intervals and statistical tests. A key component is data visualization, which shows related data in context.
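
As a small illustration of hindsight, the sketch below aggregates two groups and compares them with a statistical test. The sample figures are invented for the example:

```python
# Hindsight sketch: aggregate by group, then compare groups statistically.
import numpy as np
from scipy import stats

region_a = np.array([120, 135, 128, 140, 125])  # assumed daily sales, region A
region_b = np.array([110, 115, 122, 118, 121])  # assumed daily sales, region B

print("mean A:", region_a.mean(), "mean B:", region_b.mean())

# A two-sample t-test asks whether the difference in means is significant.
t, p = stats.ttest_ind(region_a, region_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```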

Insight and foresight require machine learning and data mining: finding patterns, modeling current behavior, predicting future outcomes, and detecting anomalies. Refer to data science and machine learning tools (e.g., R, Apache Spark MLlib, WSO2 Machine Learner, GraphLab) for a deeper understanding.
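
As a minimal sketch of foresight with one of the tools named above (Apache Spark MLlib), the example below trains a simple classifier on historical data to predict a future outcome. The dataset path, column names, and the churn use case are all assumptions:

```python
# A minimal foresight sketch with Spark MLlib (schema and path are assumed).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("foresight-sketch").getOrCreate()

# Hypothetical historical data with a binary numeric outcome column "churned".
df = spark.read.parquet("hdfs:///data/customers")  # assumed path

features = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],  # assumed
    outputCol="features",
)
train = features.transform(df).withColumnRenamed("churned", "label")

model = LogisticRegression(maxIter=20).fit(train)  # model current behavior
predictions = model.transform(train).select("label", "prediction")
predictions.show(5)
```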

Steps to Design and Architect an Analytical Solution

Source Profiling

  • The most important step in deciding the architecture.
  • It involves:
    • identifying the different source systems
    • categorizing them based on their nature and type

Important considerations

  • Identify the internal and external source systems
  • Make a high-level estimate of the amount of data ingested from each source
  • Identify the mechanism used to get data – push or pull
  • Determine the type of data source – database, file, web service, stream, etc.
  • Determine the type of data – structured, semi-structured, or unstructured
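
One lightweight way to capture these considerations is a simple source inventory. The sketch below uses hypothetical sources and attributes mirroring the list above:

```python
# A hypothetical source inventory capturing the attributes listed above.
sources = [
    {"name": "orders_db",    "origin": "internal", "type": "database",
     "data": "structured",      "mode": "pull", "daily_volume_gb": 50},
    {"name": "clickstream",  "origin": "internal", "type": "stream",
     "data": "semi-structured", "mode": "push", "daily_volume_gb": 200},
    {"name": "partner_feed", "origin": "external", "type": "file",
     "data": "unstructured",    "mode": "pull", "daily_volume_gb": 5},
]

# Categorize sources by nature and type, as source profiling requires.
by_mode = {}
for s in sources:
    by_mode.setdefault(s["mode"], []).append(s["name"])
print(by_mode)  # {'pull': ['orders_db', 'partner_feed'], 'push': ['clickstream']}
```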

Ingestion Strategy and Acquisition

Important considerations

  • Determine the frequency at which data would be ingested from each source
  • Is there a need to change the semantics of the data (append, replace, etc.)?
  • Is there any data validation or transformation required before ingestion (Pre-processing)?
  • Segregate the data sources based on mode of ingestion – Batch or real-time
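
The sketch below shows the batch/real-time segregation in practice using Spark, one possible choice; paths, the Kafka broker, and topic names are assumptions:

```python
# A sketch of segregating sources by ingestion mode (paths/topics are assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: pull a daily file drop into the lake.
batch_df = spark.read.json("hdfs:///landing/partner_feed/2024-01-01/")  # assumed
batch_df.write.mode("append").parquet("hdfs:///raw/partner_feed/")

# Real-time ingestion: subscribe to a Kafka stream
# (requires the spark-sql-kafka connector package).
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "clickstream")                # assumed topic
    .load())
(stream_df.writeStream
    .format("parquet")
    .option("path", "hdfs:///raw/clickstream/")
    .option("checkpointLocation", "hdfs:///chk/clickstream/")
    .start())
```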

Storage

Storage requirements

  • Ability to store large amounts of data
  • Ability to store any type of data
  • Ability to scale as needed
  • The number of IOPS (input/output operations per second) it can provide

Two types of analytical requirements

  • Synchronous – Data is analyzed in real time or near real time; storage should be optimized for low latency.
  • Asynchronous – Data is captured, recorded, and analyzed in batch.

Important considerations

  • Type of data (Historical or Incremental)
  • Format of data (structured, semi-structured, or unstructured)
  • Compression requirements
  • Frequency of incoming data
  • Query pattern on the data
  • Consumers of the data
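
As one way to act on these considerations, the sketch below writes incoming data in a compressed columnar format suited to asynchronous (batch) analysis; the paths and partition column are assumptions:

```python
# A storage sketch: compressed columnar format, partitioned by arrival date.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()
df = spark.read.json("hdfs:///raw/clickstream/")  # assumed raw landing path

# Columnar format with compression suits analytical query patterns;
# partitioning by ingest date matches the incremental arrival of the data.
(df.write
   .mode("append")
   .option("compression", "snappy")
   .partitionBy("ingest_date")  # assumed column
   .parquet("hdfs:///warehouse/clickstream/"))
```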

Processing

  • Earlier, data was small enough to be held in RAM; due to today's volumes, it is stored across multiple disks.
  • Processing is now taken closer to the data to reduce network I/O.
  • The processing methodology is driven by business requirements.
  • Based on the SLA, it can be categorized into:
    • Batch
    • Real-time
    • Hybrid

Batch Processing

  • Collect input for a specified interval of time.
  • Run transformations on it on a schedule.
  • A historical data load is a typical batch operation.
  • Technologies used: MapReduce, Hive, Pig
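
A minimal batch-job sketch, written here in Spark for illustration (Spark also runs such scheduled batch workloads alongside the options above); paths, columns, and the revenue use case are assumptions:

```python
# A minimal batch transformation sketch (paths and columns are assumed).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-sketch").getOrCreate()

# Collect a day's worth of input, transform it, and write the result on a
# schedule (the scheduling itself would live in e.g. Oozie, Airflow, or cron).
orders = spark.read.parquet("hdfs:///warehouse/orders/ingest_date=2024-01-01/")
daily_revenue = orders.groupBy("product_id").agg(F.sum("amount").alias("revenue"))
daily_revenue.write.mode("overwrite").parquet("hdfs:///marts/daily_revenue/2024-01-01/")
```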

Real-time Processing

Involves running transformations as and when data is acquired.

Technologies used: Impala, Spark, Spark SQL, Tez, Apache Drill
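
A sketch of a real-time transformation using Spark Structured Streaming, one of the technologies listed above; the broker, topic, and window size are assumptions:

```python
# Transform data as it arrives: count events per 1-minute window.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "clickstream")                # assumed topic
    .load())

# The Kafka source exposes a "timestamp" column for each record.
counts = events.groupBy(F.window(F.col("timestamp"), "1 minute")).count()

(counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination())
```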

Hybrid Processing

  • Combination of batch and real-time processing needs.
  • Example: the Lambda architecture.
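
A rough sketch of the Lambda idea in Spark: a serving view merges a precomputed batch view with a fresh speed-layer view. The paths and schemas are assumptions:

```python
# Lambda-style serving sketch: merge batch and speed views (paths assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

batch_view = spark.read.parquet("hdfs:///marts/daily_revenue/")      # batch layer
speed_view = spark.read.parquet("hdfs:///speed/revenue_last_hour/")  # speed layer

# The serving layer unions both views (assumed to share a schema) so that
# consumers see complete historical results plus the freshest data.
serving = batch_view.unionByName(speed_view)
serving.groupBy("product_id").sum("revenue").show(10)
```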

Consuming Data

  • Involves consuming the output provided by the processing layer.
  • Different users consume data in different formats.

Data consumption forms

  • Export Datasets – Generate datasets for third parties, either using Hive export or directly from HDFS.
  • Reporting and Visualization – Reporting and visualization tools can connect to Hadoop or to a database service.
  • Data Exploration – Data scientists build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster or a separate schema within the same cluster that contains a subset of the actual data.
  • Ad hoc Querying – Ad hoc or interactive querying can be supported using Hive, Impala, or Spark SQL.
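
A sketch of ad hoc querying with Spark SQL, one of the options above, assuming a hypothetical Hive-registered table warehouse.orders:

```python
# Ad hoc querying sketch over a Hive-registered table (name is assumed).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("adhoc-sketch")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("""
    SELECT product_id, SUM(amount) AS revenue
    FROM warehouse.orders
    WHERE ingest_date >= '2024-01-01'
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 10
""").show()
```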

Important considerations

  • Dynamics of the use case: A number of scenarios need to be considered while designing the architecture:
    • form and frequency of data
    • type of data
    • type of processing and analytics required

  • Myriad of technologies: Multiple technologies offer similar features, each claiming to be better than the others.
