IS2021

Part 3 - High-Performance Applications

Introduction

Big Data Analytics combines the means for developing and implementing algorithms that access, consume and manage data.

Some algorithms expect massive amounts of data to be immediately available, necessitating large amounts of core memory. Others need numerous iterative exchanges of data between computing nodes, which requires high-speed networks.

A Big Data technology ecosystem may include:

  • Scalable storage systems:

    • Used for capturing, manipulating and analyzing massive datasets.
  • A computing platform:

    • Configured specifically for large-scale analytics, often composed of multiple processing nodes connected via a high-speed network to memory and disk storage subsystems.
  • A data management environment:

    • It may range from a traditional database management system (DBMS) scaled to massive parallelism to databases configured with alternative distributions and layouts, to newer graph-based or other NoSQL data management schemes.
  • An application development framework:

    • To simplify the process of developing, executing, testing and debugging new application code.
  • Packaged methods of scalable analytics:

    • That can be configured by the analysts and other business consumers to help improve the ability to design and build analytical and predictive models.
  • Oversight and management processes and tools:

    • That are necessary to ensure alignment with the enterprise analytics infrastructure and collaboration among developers, analysts and other business users.

Storage Considerations

Variables to consider for storage:

  • Scalability:
    • The ability to add or remove storage resources, i.e. scale up or down.
  • Extensibility:
    • How flexible the storage architecture is, e.g. how easily its configuration can be changed.
  • Accessibility:
    • Simultaneous access without compromising performance.
  • Fault tolerance:
    • Recover from intermittent failures.
  • High-speed I/O Capacity:
    • Demanding timing requirements for absorbing, storing and sharing large data volumes.
  • Integrability:
    • E.g. how well the storage system can be integrated into the production environment.

Parallelism

One of the key objectives of using a multiprocessing node environment is to speed application execution by breaking up large chunks of work into much smaller ones that can be farmed out to a pool of available processing nodes.

Datasets to be consumed and analyzed are also distributed across a pool of storage nodes, i.e. data parallelism (the same task runs on different, independent subsets of the data).

As long as no dependency forces one specific task to wait for another to finish, these smaller tasks can be executed at the same time, i.e. task parallelism (different, independent tasks operating on the same dataset).
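
To make the distinction concrete, here is a minimal sketch using Python's standard multiprocessing module; the dataset and the tasks are invented purely for illustration.

```python
# Data parallelism vs. task parallelism with Python's multiprocessing Pool.
from multiprocessing import Pool


def mean(chunk):
    # The same task, applied to one independent chunk of the data.
    return sum(chunk) / len(chunk)


def count_rows(data):
    return len(data)


def find_max(data):
    return max(data)


if __name__ == "__main__":
    data = list(range(100_000))

    # Data parallelism: split the dataset into chunks and farm the same
    # task out to a pool of worker processes.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_means = pool.map(mean, chunks)

    # Task parallelism: run different, independent tasks against the
    # same dataset at the same time.
    with Pool(processes=2) as pool:
        rows = pool.apply_async(count_rows, (data,))
        largest = pool.apply_async(find_max, (data,))
        print(partial_means, rows.get(), largest.get())
```

Either form only pays off when the per-task work outweighs the cost of distributing the data, which is exactly why the storage and interconnect considerations above matter.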

Hardware and Software Tuned for Analytics

A high-performance analytics environment is designed at both the hardware level and the software level.

Let’s take a look at the hardware aspect first.

Hardware Appliances

  • Often configured as multiprocessor systems.
  • CPU/core configurations, cache memory, core memory, flash memory, temporary disk storage areas, persistent disk storage.
  • Symmetric multiprocessor (SMP) systems, massively parallel processing (MPP).
    • A massively parallel processing (MPP) architecture consists of nodes, each with its own processor, memory and I/O subsystem.
    • An independent OS runs on each node.
  • Multiple processing nodes, multiple storage nodes linked via high-speed interconnect.

Software Appliances

We will go into more detail later, but for now:

  • Database management software with a high-performance execution engine (computing platform) to support and take advantage of parallelization and data distribution.
  • Application development tools and analytics capabilities, as well as support for direct user tuning, e.g. alternate data layouts for improving performance.

Architectural Choices

Designing a high-performance system depends on a few architectural choices; let’s take a look at the approaches one can take.

  • Shared everything approach:
    • Persistent storage and memory components are all shared by the different processing units.
  • Shared-disk approach:
    • Isolated processors, each with its own memory, but the persistent storage on disk is still shared across the system.
  • Shared-nothing approach:
    • Each processor has its own dedicated disk and memory storage.
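
The shared-nothing approach is the one most commonly associated with MPP-style analytics, so here is a minimal, hedged sketch of the idea: each “node” owns its own partition of the data and answers queries only over what it stores locally. The Node class and the hash-partitioning scheme are illustrative assumptions, not any particular product’s design.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    rows: list = field(default_factory=list)  # this node's private storage

    def local_count(self, predicate):
        # Each node scans only the rows it owns.
        return sum(1 for row in self.rows if predicate(row))


def owner(key, num_nodes):
    # Hash partitioning decides which node stores a given row.
    return hash(key) % num_nodes


nodes = [Node(i) for i in range(4)]
for customer_id in range(1_000):
    nodes[owner(customer_id, len(nodes))].rows.append(customer_id)

# A query fans out to every node; each works against its own memory and
# disk, and a coordinator merges the partial results.
total = sum(node.local_count(lambda cid: cid % 2 == 0) for node in nodes)
print(total)  # 500
```

Because nothing is shared, nodes never contend for the same disk or memory, which is what lets this approach scale out well across a cluster.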

A computer cluster consists of a number of computers linked together and working in close coordination; in many ways, the cluster behaves as a single computer. Generally, the component computers are connected to each other through fast local area networks (LANs).

Clustering technologies can help to achieve scalability, availability, reliability, and fault tolerance.

The advantage of computer clusters over single computers is that they usually improve overall performance (and availability) greatly, while still being cheaper than individual high-performance computers.

A node is the name used for one unit (usually one computer) in a computer cluster. Generally, this computer will have one or two CPUs, each normally with more than one core. The memory is always shared between cores on the same CPU, but generally not between the CPUs.

Cloud Computing

Lastly, let’s talk about cloud computing.

The complex IT infrastructure, and the skills needed to manage it, are owned by the cloud computing provider.

The client can simply access a smoothly running IT infrastructure over a fast internet connection.

It is a cost-effective and flexible mode of delivering IT infrastructure over the internet as a service (XaaS) to clients on a metered basis (pay-as-you-go).

The IT usage can be scaled up or down in minutes.
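
As a hedged illustration of that elasticity, the sketch below asks a cloud provider for more worker instances using AWS’s boto3 SDK; the region, group name and target capacity are placeholders, and the call assumes an existing Auto Scaling group and valid credentials.

```python
import boto3

# Client for the AWS Auto Scaling service (region is a placeholder).
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Ask the provider for more (or fewer) instances; provisioning, placement
# and metered billing are handled on the provider's side.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="analytics-workers",  # hypothetical group name
    DesiredCapacity=8,
    HonorCooldown=False,
)
```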

Characteristics of cloud computing:

  • Flexible Capacity:
    • Scale up or down rapidly.
  • Attractive payment method:
    • Pay-as-you-go.
  • Resiliency and security:
    • Achieved largely through virtualization.

There are of course different types of cloud services:

  • Public cloud:
    • A large shared infrastructure available to public users (multi-tenancy model).
  • Private cloud:
    • In-house IT infrastructure.
  • Hybrid cloud:
    • Combines the flexible capacity of the public cloud with in-house control over key aspects.

Considering Platform Alternatives

The benefits of using hardware appliances for big data lie mainly in engineering and integration.

They are engineered for high-performance reporting and analytics, yet have a flexible architecture allowing integrated components to be configured to meet specific application needs.

The benefit of using software appliances is that they can take advantage of low-cost commodity hardware components.

In addition, the reliance on commodity hardware allows a software appliance to be elastic and extensible.