This post is an introduction to big data.
A cluster is a set of processing devices or computers, which are called nodes in this context.
Big data refers to the management of very large datasets on a cluster. It is tightly related to parallel and distributed systems.
Big data relies on NoSQL databases, as they follow a distributed approach. You can read this post about NoSQL databases.
V’s of Big Data
Five V’s of Big Data:
- Volume
- Velocity
- Variety
- Veracity
- Value
Originally, there were only the first three. They were proposed by Doug Laney in his 2001 research note “3D Data Management: Controlling Data Volume, Velocity and Variety”.
Big Data Standards
Big data standards:
- ITU-T Y.3605
ISO 8000 is about data quality, so it is somewhat related.
ITU-T Y.3605
ITU-T Y.3605 has the title “Big data – Reference architecture”. It was published in 2020.
ITU-T Y.3600 supposedly has the title “Big data – Cloud computing based requirements and capabilities”, but I have never found this document.
Big Data Knowledge Technology
Big Data knowledge technology:
- KDD
- SEMMA methodology
- CRISP-DM methodology
Knowledge Discovery in Databases (KDD)
SEMMA (Sample, Explore, Modify, Model, Assess)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Big Data Programming Models
Big data programming model featured in this post:
- MapReduce
MapReduce
MapReduce is a programming model or paradigm to process and generate large amounts of data, relying on parallel and distributed algorithms.
It takes its name from the original implementation by Google, which is no longer in use. Google’s MapReduce, introduced around 2003, was the first successful approach to big data and was used to enhance the efficiency of Google’s PageRank computations.
There are MapReduce libraries available for different programming languages.
The Map function processes an input pair (key, value) to produce a list of intermediate key-value pairs. The output may contain multiple pairs with the same or different keys.
The Reduce function groups all intermediate pairs by key and aggregates the values associated with each key into a single output value (or a list of values), typically producing one output per key.
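To make the model concrete, here is a minimal, purely in-memory word-count sketch in Python. It is only an illustration of the paradigm, not of any particular framework: the names map_fn, reduce_fn and mapreduce are invented for the example, and a real framework would run the map and reduce phases in parallel across the nodes of the cluster.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: aggregate all values emitted for the same key into one output.
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce: one output value per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = {"d1": "big data is big", "d2": "data everywhere"}
print(mapreduce(docs))  # {'big': 2, 'data': 2, 'is': 1, 'everywhere': 1}
```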
Hadoop-based Big Data Tools
Big data tools aid in managing large volumes of data. They all have in common that they handle data in distributed systems.
Big data tool categories featured in this post, in bottom-up order:
- Big Data Distributed File Systems
- HDFS
- Alluxio
- Big Data Serialization
- Apache Avro
- Big Data Processing Frameworks
- Apache Hadoop
- Apache Spark
- Big Data Streaming
- Apache Kafka
- RabbitMQ
- Apache Pulsar
- Big Data Real-Time Processing
- Apache Storm
- Apache Flink
- Apache Spark Streaming
- Big Data Workflow and Resource Management
- Apache Oozie
- Apache Zookeeper
- Apache Ambari
- Apache NiFi
- Big Data Querying
- Apache Pig
- Apache Hive
- Big Data Monitoring
- Apache Flume
- Apache Chukwa
- Graph Processing
- Apache Giraph
- Machine Learning
- Apache Mahout
Big Data Storage and Serialization
Big data storage models may be complemented or substituted by NoSQL databases.
Big data storage and serialization:
- HDFS
- Alluxio
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system.
It is included within the Apache Hadoop software, and it is the foundation of most Apache Hadoop tools.
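Files in HDFS can also be accessed from outside the Java world through the WebHDFS REST interface. As a hedged sketch, the example below uses the third-party hdfs Python package (a WebHDFS client); the NameNode URL, user and paths are placeholders, not values from this post.

```python
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (URL and user are placeholders).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file into the distributed file system.
client.write("/tmp/example.txt", data=b"hello hdfs", overwrite=True)

# List a directory and read the file back.
print(client.list("/tmp"))
with client.read("/tmp/example.txt") as reader:
    print(reader.read())
```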
Alluxio
Alluxio is a virtual distributed file system.
Big Data Serialization
Big data serialization:
- Apache Avro
Apache Avro
Apache Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project.
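As a small illustration of serialization against a schema, here is a sketch using the third-party fastavro library (one of several Avro implementations for Python); the schema, the records and the file name are made up for the example.

```python
from fastavro import parse_schema, reader, writer

# An Avro schema is defined in JSON; records are validated against it.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize the records into an Avro container file.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize: the schema travels with the file, so no external schema is needed.
with open("users.avro", "rb") as avro_file:
    for record in reader(avro_file):
        print(record)
```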
Big Data Coordination
- YARN (included in Apache Hadoop)
- Apache Mesos
Big Data Core Processing Frameworks
The most popular big data processing frameworks are Apache Hadoop and Apache Spark.
Big data processing frameworks featured in this post:
- Apache Hadoop
- Apache Spark
Apache Hadoop
Apache Hadoop facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
Apache Hadoop components, in bottom-up order:
- Storage: Hadoop Distributed File System (HDFS)
- Cluster resource management: YARN
- Data processing module (MapReduce by default, but also Apache Tez)
Yet Another Resource Negotiator (YARN) makes it possible to work with different big data processing engines such as MapReduce or Apache Tez. YARN was introduced in Hadoop 2.0; in Hadoop 1.0 there was only MapReduce.
Apache Hadoop was first released in 2006. It is written in Java.
It is FOSS, under the Apache License 2.0.
YARN’s data processing modules:
- Hadoop’s MapReduce
- Apache Tez
Apache Spark
Apache Spark is a unified engine to process large amounts of data. It can run on top of HDFS, although it is not tied to it. Unlike Hadoop, it does not use YARN by default: it can run standalone or on cluster managers such as YARN, Mesos or Kubernetes.
It was first released in 2014. It is written in Scala.
It is FOSS, under the Apache License 2.0.
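As a hedged illustration of the unified API, here is a minimal PySpark sketch; the file name events.csv and the column user_id are invented for the example.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("example").getOrCreate()

# Load a CSV file into a DataFrame (file and columns are placeholders).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple distributed aggregation: count events per user.
df.groupBy("user_id").count().orderBy("count", ascending=False).show()

spark.stop()
```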
Big Data Processing Components
Big data processing components complement the processing function of a processing framework.
Big Data Processing Components:
- Hadoop’s MapReduce
- Apache Tez
Hadoop’s MapReduce
Hadoop’s MapReduce implementation is one of the components of Apache Hadoop. It is built on top of YARN.
As noted above, it takes its name from Google’s original MapReduce implementation, which is no longer in use.
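Although Hadoop’s MapReduce is implemented in Java, the Hadoop Streaming interface lets any executable act as mapper or reducer through standard input and output. Below is a rough Python sketch that sums an amount per product key; the file names, the CSV field layout and the data are assumptions made for this example, not part of the original post.

```python
# mapper.py -- reads raw CSV lines from stdin and emits "key<TAB>value" pairs.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) >= 2:
        product, amount = fields[0], fields[1]
        print(f"{product}\t{amount}")
```

```python
# reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
# so all values for a given key arrive consecutively.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

The two scripts would typically be submitted with the hadoop-streaming JAR, passed through the -mapper and -reducer options together with -input and -output HDFS paths.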
Apache Tez
Apache Tez is an alternative to MapReduce when using YARN in Apache Hadoop.
Tez can be considered a more flexible and efficient processing engine than MapReduce.
Big Data Streaming
Distributed messaging and streaming platforms focus on the transport and distribution of data streams.
The most popular message-oriented middleware (MOM) for big data are:
- RabbitMQ
- Apache Kafka
- Apache Pulsar
You can read this post about message-oriented middleware.
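As a rough sketch of the publish/subscribe model that these brokers share, here is a producer and a consumer using the third-party kafka-python client for Apache Kafka; the broker address and the topic name are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to a topic (broker address and topic are placeholders).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", f"event-{i}".encode("utf-8"))
producer.flush()

# Consume the messages back from the same topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)
```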
Big Data Real-Time Processing
Big data real-time processing platforms:
- Apache Storm
- Apache Flink
- Apache Spark Streaming
Apache Storm
Apache Storm is a distributed stream processing computation framework.
Apache Flink
Apache Flink is a distributed stream-processing and batch-processing framework.
Apache Spark Streaming
Apache Spark Streaming is a stream-processing framework within Apache Spark.
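As a hedged sketch, the following PySpark Structured Streaming example counts words arriving on a local TCP socket; the host and port are placeholders, and the example mirrors the classic streaming word count rather than anything specific to this post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Treat lines arriving on a TCP socket as an unbounded table.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Running word count over the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit updated counts to the console as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```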
Big Data Workflow
Big data workflow:
- Apache Oozie
- Apache Zookeeper
- Apache Ambari
- Apache NiFi
They are all widely used in big data deployments, although not all of them are specific to big data.
Apache Oozie
Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.
Apache Zookeeper
Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications.
It started as a sub-project of Hadoop, but it became a project on its own.
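As a small hedged sketch of the coordination primitives, this uses the third-party kazoo Python client; the connection string and the znode paths are placeholders.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Znodes form a hierarchical namespace used for configuration and coordination.
zk.ensure_path("/app/config")
zk.create("/app/config/workers", b"3")  # raises NodeExistsError if already present

# Read the value and its version back.
data, stat = zk.get("/app/config/workers")
print(data, stat.version)

zk.stop()
```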
Apache Ambari
Apache Ambari aims to simplify Hadoop management by providing software for provisioning, managing, and monitoring Apache Hadoop clusters.
Apache NiFi
Apache NiFi has the purpose of automating the flow of data between software systems.
It does not seem to be related to the Hadoop project.
Big Data Querying
Big data querying tools featured in this post:
- Apache Pig
- Apache Hive
Apache Pig
Apache Pig is built on top of MapReduce within Apache Hadoop.
It uses the Pig Latin programming language. It is convenient when the data is not very structured.
It is FOSS, under the Apache License 2.0.
Apache Hive
Apache Hive is built on top of MapReduce within Apache Hadoop.
It uses an SQL-like syntax (HiveQL) that is translated into MapReduce jobs, so it is convenient for big data with structured data.
It is FOSS, under the Apache License 2.0.
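As a rough illustration of that SQL-like access, here is a sketch using the third-party PyHive client against a HiveServer2 instance; the host, database, table and columns are invented for the example.

```python
from pyhive import hive

# Connect to HiveServer2 (host, port and database are placeholders).
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into MapReduce (or Tez) jobs.
cursor.execute("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")

for row in cursor.fetchall():
    print(row)
```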
Big Data Monitoring
Big data monitoring tools:
- Apache Flume
- Apache Chukwa
Apache Flume
Apache Flume is a large-scale log aggregation framework.
Apache Chukwa
Apache Chukwa was a data collection system for monitoring large distributed systems.
The project has been retired.
Big Data Graphs
Big-data graphs:
- Apache Giraph
Apache Giraph
Apache Giraph leverages Apache Hadoop to process graph data.
It is FOSS, under the Apache License 2.0.
Big Data Machine Learning
Apache Mahout
Apache Mahout is a distributed linear algebra framework that is also used for machine learning.
It first made use of Apache Hadoop and more recently moved to Apache Spark, phasing out Hadoop.
It is FOSS, under the Apache License 2.0.
Big Data Architectures
Big data architectures:
- Data lake architecture
- Lambda architecture
- Kappa architecture
Data Lake Architecture
The data lake architecture, or Hadoop-centric architecture, is focused on Hadoop tools. It is suited to batch processing.
Data lake architecture layers:
- Distributed File System
- Processing Layer
- Data Management and Workflow Tools
The distributed file system could be HDFS.
The processing layer would cover processing engines like Apache Hadoop (with MapReduce or Tez) or Apache Spark, in addition to query tools such as Pig or Hive.
Data management and workflow tools could cover Apache Oozie for workflow scheduling, Apache ZooKeeper for distributed coordination, and Apache Ambari for cluster management; together they help support complex data processing pipelines within Hadoop.
Lambda Architecture
The Lambda architecture is used when both real-time processing and historical data are needed.
It has 3 layers:
- Batch
- Speed
- Serving
The batch layer stores and processes large, historical datasets. It is typically implemented with tools like Apache Hadoop or Apache Spark. The batch layer is optimized for high throughput but has higher latency.
The speed layer processes data in real-time as it arrives, providing low-latency insights. Tools like Apache Storm, Apache Kafka, and Apache Spark Streaming are often used here.
The serving layer combines the results from both the batch and speed layers, allowing for fast queries. Data in this layer may be stored in NoSQL databases optimized for fast retrieval (such as Cassandra, HBase, or Elasticsearch).
Kappa Architecture
The Kappa architecture is a simplification of the Lambda architecture. It is specific to real-time streaming without historical data needs.
There is no batch layer, and there is a single processing layer akin to the speed layer.
According to the Astic source, it includes Kafka for messaging, Spark for processing, NoSQL databases and the Scala programming language.
Big Data Platforms
Big data platforms:
- Cloudera
- Hortonworks
- MapR
- Oracle Stream Analytics
Oracle Stream Analytics is a big data streaming tool. It would compete against Apache Kafka.
Big Data in the Cloud
AWS offers Amazon Elastic MapReduce (EMR), which runs Hadoop and Spark.
Azure offers the proprietary HDInsight, based on Hadoop components from the Hortonworks Data Platform (HDP).
Google Cloud Platform (GCP) offers Cloud Dataflow.
External References
- cloudlytics; “Hadoop vs Spark: A Comparative Study”; cloudlytics
- Shivansh Yadav; “MapReduce vs Tez”; dev.to, 2024-07-07