IS2021

Part 5 - Developing Big Data Applications


Applications Development Milestones

  • System Analysis
    • Understanding users’ requirements.
  • System Design
    • Designing the system according to the users’ requirements and the problem-solving algorithm.
  • Implementation
    • Coding or packaged methods.
  • Testing
    • Checking the whole completed system.
  • Review
    • Summarizing what has been done and what has been missed.

Systems Analysis

This milestone is crucial to the success of the whole application development project.

Excellent communication skills are a must. A systematic understanding of users’ existing and future application requirements is mandatory.

Conceptual modeling is used throughout the system analysis milestone.

Soft Systems Methodology (SSM)

  1. The problem situation: unstructured.
  2. The problem situation: expressed.
  3. Root definition of relevant systems.
  4. Conceptual models.
  5. Comparison of models to the problem situation.
  6. Definition of feasible desirable changes.
  7. Action to solve the problem or improve the situation.

Figure 1. Soft Systems Methodology (SSM) Diagram.

Conceptual Modeling

Conceptual modeling involves capturing various aspects of the real world, and representing them in the form of a model that can be used for communication.

It focuses on “capturing and representing human perceptions of the real world” in such a manner that they can be included in an information system. The outcome of the conceptual modeling activity is usually a diagram or model that can then be translated into a relational or some other logical model.

The adequacy of a conceptual model is based on how well it is able to promote a common understanding among human users.

Advantages and Disadvantages of Conceptual Modeling
  • Advantages
    • Establishes Entities
      • Helps ensure that there are fewer surprises down the road, where entities or relationships might otherwise have been neglected or forgotten.
    • Defines Project Scope
      • Assists with time management and scheduling.
    • Base Model for Other Models
      • Less abstract models will need to be generated beyond the rough concepts.
    • High-Level Understanding
      • Beneficial for managers and executives, who may not be dealing directly with coding or implementation but require a solid understanding of the system and the relationships therein.
  • Disadvantages
    • Creation Requires Deep Understanding
      • Requires a fundamental and robust understanding of the project, along with all associated entities and relationships.
    • Potential Time Sink
      • Improper modeling of entities or relationships within a conceptual model may lead to massive time waste and potential sunk costs.
    • Possible System Clashes
      • A clash simply indicates that one component may conflict with another component, somewhere down the line.
    • Implementation Challenge Scales with Size
      • Challenging to develop and maintain a proper conceptual model for particularly complex projects, as the number of potential issues, or clashes, grows exponentially as the system size increases.

Systems Design

This milestone covers hardware and software design.

Hardware design refers to hardware systems selection and integration. Software design refers to the system logic/interactions and the problem-solving algorithm, which is one of the critical success factors for big data analytics.

Implementation

A good development framework will simplify the process of developing, executing, testing and debugging new application code.

There are many tools for BDA implementation; we need to find the most appropriate one through comparison and analysis before use.

MapReduce Programming Model

MapReduce consists of two basic operations that are applied to sets or lists of data value pairs:

  • Map
    • Describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs.
  • Reduce
    • Combines the set of values associated with the intermediate key/value pairs output by the Map operation to produce the results.

A MapReduce application is envisioned as a series of basic operations applied in a sequence to small sets of many data items. These data items are logically organized in a way that enables the MapReduce execution model to allocate tasks that can be executed in parallel.

The data items are indexed using a defined key into <key, value> pairs, in which the key represents some grouping criterion associated with a computed value.
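
As a rough illustration only (plain Python, not any particular framework’s API), the sketch below applies a Map and a Reduce operation to <key, value> pairs for the “maximum temperature of each city” example used later in these notes; the record layout and function names are assumptions.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit an intermediate <key, value> pair for each input record."""
    for city, temperature in records:
        yield (city, temperature)

def reduce_phase(intermediate_pairs):
    """Reduce: combine all values sharing the same key into one result."""
    grouped = defaultdict(list)
    for city, temperature in intermediate_pairs:
        grouped[city].append(temperature)
    return {city: max(temps) for city, temps in grouped.items()}

# Illustrative input records (assumed, not from the notes).
records = [("Hong Kong", 31), ("Tokyo", 28), ("Hong Kong", 33), ("Tokyo", 30)]
print(reduce_phase(map_phase(records)))  # {'Hong Kong': 33, 'Tokyo': 30}
```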

Composite Key & Multiple Value

A key can be composed of more than one data item; in this case it is called a composite key, and its components are joined by a + sign, e.g. “Product code + Vendor code”.

A value can likewise be composed of more than one data item; in this case it is called a multiple value, also represented with a + sign.

Note that the key expression and the value expression must be separated by a comma.

There are 4 types of key-value pair (a sketch of the composite-key, multiple-value case follows the list):

  1. Simple key, single value
    • E.g. Maximum temperature of each city.
  2. Composite key, single value
    • E.g. Minimum temperature of each city in each month.
  3. Simple key, multiple value
    • E.g. Average temperature of each city.
  4. Composite key, multiple value
    • E.g. Average temperature of each city in each month.
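
Below is a minimal plain-Python sketch, assuming the “average temperature of each city in each month” example (type 4 above): the composite key is “city+month” and the multiple value is a (sum, count) pair. The record values are illustrative assumptions.

```python
from collections import defaultdict

readings = [
    ("Hong Kong", "Jan", 18), ("Hong Kong", "Jan", 20),
    ("Hong Kong", "Feb", 21), ("Tokyo", "Jan", 6), ("Tokyo", "Jan", 8),
]

# Map: emit <composite key, multiple value> pairs, i.e. ("city+month", (temperature, 1)).
pairs = [(f"{city}+{month}", (temp, 1)) for city, month, temp in readings]

# Reduce: add up the (sum, count) values per composite key, then divide to get the average.
totals = defaultdict(lambda: (0, 0))
for key, (temp, count) in pairs:
    s, c = totals[key]
    totals[key] = (s + temp, c + count)

averages = {key: s / c for key, (s, c) in totals.items()}
print(averages)  # {'Hong Kong+Jan': 19.0, 'Hong Kong+Feb': 21.0, 'Tokyo+Jan': 7.0}
```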

MapReduce (cont.)

For some applications applied to massive datasets, the premise is that the computations applied during the Map phase to each input key/value pair are independent of one another.

Combining both data and computational independence means that both data and computations can be distributed across multiple storage and processing units and automatically parallelized.

This parallelizability allows the programmer to exploit scalable massively parallel processing resources for increased speed and performance.

The MapReduce programming model consists of five basic operations (sketched after the list below):

  • Split
    • Input data is distributed to the available data nodes equally.
  • Map
    • Distributed data in each data node is transformed into {key, value} pairs. If a composite key is required, the + sign is used to concatenate the keys.
  • Sort/shuffle
    • Each {key, value} pair is shuffled to the bin with the same key.
  • Reduce
    • All {key, value} pairs inside each bin are reduced according to the required mission.
  • Result
    • The target {key, value} pair(s) are taken out as the final results.
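
The sketch below walks the five operations on a single machine in plain Python; the “data nodes” are simply lists here, whereas a real cluster would distribute them across machines. The input values are illustrative assumptions.

```python
from collections import defaultdict

readings = [("Hong Kong", 31), ("Tokyo", 28), ("Hong Kong", 33),
            ("Tokyo", 30), ("Singapore", 32), ("Singapore", 29)]

# Split: distribute the input roughly equally over the available "data nodes".
num_nodes = 3
splits = [readings[i::num_nodes] for i in range(num_nodes)]

# Map: each node transforms its records into {key, value} pairs.
mapped = [[(city, temp) for city, temp in split] for split in splits]

# Sort/shuffle: every pair is routed to the bin holding its key.
bins = defaultdict(list)
for node_output in mapped:
    for city, temp in node_output:
        bins[city].append(temp)

# Reduce: pairs inside each bin are combined according to the required mission (maximum here).
reduced = {city: max(temps) for city, temps in bins.items()}

# Result: the target {key, value} pairs are taken out as the final output.
print(reduced)  # {'Hong Kong': 33, 'Tokyo': 30, 'Singapore': 32}
```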

Other Big Data Development Frameworks

Enterprise Control Language (ECL) is a data-centric programming language for a different open-source big data platform called High Performance Computing Cluster (HPCC).

ECL is a declarative programming language that describes what is supposed to happen to the data but does not specify how it is done.

HPCC

Born from the deep data analytics history of LexisNexis Risk Solutions, HPCC Systems provides high-performance, parallel processing and delivery for applications using big data.

The open-source platform incorporates a software architecture implemented on commodity computing clusters for resilience and scalability. It is configurable to support both parallel batch data processing and high-performance data delivery applications using indexed data files.

The platform includes a high-level, implicitly parallel data-centric declarative programming language that adds to its flexibility and efficiency.

Comparison of Hadoop and HPCC

| Dimension | Hadoop | HPCC |
| --- | --- | --- |
| Data Storage | Data distributed across nodes | Centralized storage |
| Data Processing | Compute moves to data | Data moved to compute (like traditional computing systems) |
| Hardware Infrastructure | Commodity hardware (relatively cheaper) | InfiniBand network connectivity required for high throughput and low latency |
| File System Storage | HDFS file system (open source) | Lustre file system (open source) |
| Cluster Resource Management | YARN | SLURM |
| Programming Languages | Java, Scala, Python | C, C++, Fortran |
| Business Applications | Focused on analytics use cases and commercial applications | Primarily used for scientific research applications, e.g. life sciences, healthcare and mining sectors |

The declarative approach presumes that the programmer expects the compiler and execution scheme to automate the parallelization of the declarative statements, thus simplifying the programming process.

ECL provides a collection of primitive capabilities that are typical for data analysis, such as sorting, aggregation and deduplication.

With ECL, the declarative model is the source of task parallelism in which discrete and small units of work can be farmed out to waiting processing units in a cluster and executed in parallel.

In fact, each of the ECL programming constructs can be executed in parallel.
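
ECL syntax itself is not reproduced here; as a loose analogue in plain Python, the sketch below only illustrates the kind of data-centric primitives the text mentions (sorting, aggregation, deduplication) written declaratively, i.e. stating what result is wanted rather than how it is parallelized. The data and names are assumptions.

```python
from itertools import groupby
from statistics import mean

readings = [("Tokyo", 30), ("Hong Kong", 31), ("Tokyo", 28), ("Hong Kong", 31)]

# Sorting: order the records by key (city).
sorted_readings = sorted(readings, key=lambda r: r[0])

# Deduplication: drop exact duplicate records.
deduplicated = sorted(set(readings))

# Aggregation: average temperature per city.
averages = {city: mean(t for _, t in group)
            for city, group in groupby(sorted_readings, key=lambda r: r[0])}

print(deduplicated)  # [('Hong Kong', 31), ('Tokyo', 28), ('Tokyo', 30)]
print(averages)      # {'Hong Kong': 31, 'Tokyo': 29}
```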

Testing

This milestone is a must for the whole application development project because there may be undetected errors remaining from the systems analysis, design and implementation milestones.

A reasonable amount of test data must be designed to walk through part of or the whole project; the test data should therefore be randomized with the widest possible coverage, to try to reflect reality.
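
As one possible way to follow this advice (a sketch only, reusing the temperature example from the MapReduce sections; the city list and value ranges are assumptions), the snippet below generates randomized test records while explicitly including boundary values to widen coverage.

```python
import random

# Illustrative, assumed categories and value ranges.
cities = ["Hong Kong", "Tokyo", "Singapore"]
months = ["Jan", "Feb", "Mar"]

def generate_test_readings(n, seed=None):
    """Return n random (city, month, temperature) records plus explicit extremes,
    so the test data is randomized yet still covers the boundary cases."""
    rng = random.Random(seed)
    records = [(rng.choice(cities), rng.choice(months), rng.randint(-10, 45))
               for _ in range(n)]
    # Append edge cases explicitly to widen coverage beyond what randomness alone gives.
    records += [(cities[0], months[0], -10), (cities[-1], months[-1], 45)]
    return records

print(generate_test_readings(5, seed=42))
```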