This post is an introduction to big data.
A cluster is a set of processing devices or computers, which are called nodes in this context.
Big data refers to the management of very large datasets on a cluster. It is tightly related to parallel and distributed systems.
Big data relies on NoSQL databases, as they follow a distributed approach. You can read this post about NoSQL databases.
V’s of Big Data
Five V’s of Big Data:
- Volume
- Velocity
- Variety
- Veracity
- Value
Originally, there were only the first three. They were proposed by Doug Laney in his 2001 research note “3D Data Management: Controlling Data Volume, Velocity and Variety”.
Big Data Standards
Big data standards:
- ITU-T Y.3605
ISO 8000 is about data quality, so it is somewhat related.
ITU-T Y.3605
ITU-T Y.3605 has the title “Big data – Reference architecture”. It was published in 2020.
ITU-T Y.3600 supposedly has the title “Big data – Cloud computing based requirements and capabilities”, but I have never found this document.
Big Data Knowledge Technology
Big Data knowledge technology:
- KDD
- SEMMA methodology
- CRISP-DM methodology
Knowledge Discovery in Databases (KDD)
SEMMA (Sample, Explore, Modify, Model, Assess)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Big Data Programming Models
Big data programming model featured in this post:
- MapReduce
MapReduce
MapReduce is a programming model or paradigm to process and generate large amounts of data, relying on parallel and distributed algorithms.
It takes its name from the original implementation by Google, which is no longer in use. Google’s MapReduce, introduced around 2003, was the first successful approach to big data and was used to enhance the efficiency of Google’s PageRank computations.
There are MapReduce libraries available for different programming languages.
The Map function processes an input pair (key, value) to produce a list of intermediate key-value pairs. The output may contain multiple pairs with the same or different keys.
The Reduce function groups all intermediate pairs by key and aggregates the values associated with each key into a single output value (or a list of values), typically producing one output per key.
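To make the model concrete, here is a minimal, purely in-memory word-count sketch in Python. It is only an illustration of the paradigm, not of any particular framework: the names map_fn, reduce_fn and mapreduce are invented for the example, and a real framework would run the map and reduce phases in parallel across the nodes of the cluster.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: aggregate all values emitted for the same key into one output.
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce: one output value per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = {"d1": "big data is big", "d2": "data everywhere"}
print(mapreduce(docs))  # {'big': 2, 'data': 2, 'is': 1, 'everywhere': 1}
```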
Hadoop-based Big Data Tools
Big data tools aid in managing large volumes of data. They all have in common that they handle data in distributed systems.
Big data tool categories featured in this post, in bottom-up order:
- Big Data Distributed File Systems
- HDFS
- Alluxio
- Big Data Serialization
- Apache Avro
- Big Data Processing Frameworks
- Apache Hadoop
- Apache Spark
- Big Data Streaming
- Apache Kafka
- RabbitMQ
- Apache Pulsar
- Big Data Real-Time Processing
- Apache Storm
- Apache Flink
- Apache Spark Streaming
- Big Data Workflow and Resource Management
- Apache Oozie
- Apache Zookeeper
- Apache Ambari
- Apache NiFi
- Big Data Querying
- Apache Pig
- Apache Hive
- Big Data Monitoring
- Apache Flume
- Apache Chukwa
- Graph Processing
- Apache Giraph
- Machine Learning
- Apache Mahout
Big Data Storage and Serialization
Big data storage models may be complemented or substituted by NoSQL databases.
Big data storage and serialization:
- HDFS
- Alluxio
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system.
It is included within the Apache Hadoop software, and it is the foundation of most Apache Hadoop tools.
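Files in HDFS can also be accessed from outside the Java world through the WebHDFS REST interface. As a hedged sketch, the example below uses the third-party hdfs Python package (a WebHDFS client); the NameNode URL, user and paths are placeholders, not values from this post.

```python
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (URL and user are placeholders).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file into the distributed file system.
client.write("/tmp/example.txt", data=b"hello hdfs", overwrite=True)

# List a directory and read the file back.
print(client.list("/tmp"))
with client.read("/tmp/example.txt") as reader:
    print(reader.read())
```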
Alluxio
Alluxio is a virtual distributed file system.
Big Data Serialization
Big data serialization:
- Apache Avro
Apache Avro
Apache Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project.
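As a small illustration of serialization against a schema, here is a sketch using the third-party fastavro library (one of several Avro implementations for Python); the schema, the records and the file name are made up for the example.

```python
from fastavro import parse_schema, reader, writer

# An Avro schema is defined in JSON; records are validated against it.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize the records into an Avro container file.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize: the schema travels with the file, so no external schema is needed.
with open("users.avro", "rb") as avro_file:
    for record in reader(avro_file):
        print(record)
```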
Big Data Coordination
- YARN (included in Apache Hadoop)
- Apache Mesos
Big Data Core Processing Frameworks
The most popular big data processing frameworks are Apache Hadoop and Apache Spark.
Big data processing frameworks featured in this post:
- Apache Hadoop
- Apache Spark
Apache Hadoop
Apache Hadoop facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
Apache Hadoop components, in bottom-up order:
- Storage: Hadoop Distributed File System (HDFS)
- Cluster resource management: YARN
- Data processing module (MapReduce by default, but also Apache Tez)
Yet Another Resource Negotiator (YARN) makes it possible to work with different big data processing engines such as MapReduce or Apache Tez. YARN was introduced in Hadoop 2.0; in Hadoop 1.0 there was only MapReduce.
Apache Hadoop was first released in 2006. It is written in Java.
It is FOSS, under the Apache License 2.0.
YARN’s data processing modules:
- Hadoop’s MapReduce
- Apache Tez
Apache Spark
Apache Spark is a unified engine to process large amounts of data. It can run on top of HDFS, although it is not tied to it. Unlike Hadoop, it does not use YARN by default: it can run standalone or on cluster managers such as YARN, Mesos or Kubernetes.
It was first released in 2014. It is written in Scala.
It is FOSS, under the Apache License 2.0.
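As a hedged illustration of the unified API, here is a minimal PySpark sketch; the file name events.csv and the column user_id are invented for the example.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("example").getOrCreate()

# Load a CSV file into a DataFrame (file and columns are placeholders).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple distributed aggregation: count events per user.
df.groupBy("user_id").count().orderBy("count", ascending=False).show()

spark.stop()
```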
Big Data Processing Components
Big data processing components complement the processing function of a processing framework.
Big Data Processing Components:
- Hadoop’s MapReduce
- Apache Tez
Hadoop’s MapReduce
Hadoop’s MapReduce implementation is one of the components of Apache Hadoop. It is built on top of YARN.
As noted above, it takes its name from Google’s original MapReduce implementation, which is no longer in use.
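Although Hadoop’s MapReduce is implemented in Java, the Hadoop Streaming interface lets any executable act as mapper or reducer through standard input and output. Below is a rough Python sketch that sums an amount per product key; the file names, the CSV field layout and the data are assumptions made for this example, not part of the original post.

```python
# mapper.py -- reads raw CSV lines from stdin and emits "key<TAB>value" pairs.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) >= 2:
        product, amount = fields[0], fields[1]
        print(f"{product}\t{amount}")
```

```python
# reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
# so all values for a given key arrive consecutively.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

The two scripts would typically be submitted with the hadoop-streaming JAR, passed through the -mapper and -reducer options together with -input and -output HDFS paths.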
Apache Tez
Apache Tez is an alternative to MapReduce when using YARN in Apache Hadoop.
Tez can be considered a more flexible and efficient processing engine than MapReduce.
Big Data Streaming
Distributed messaging and streaming platforms focus on the transport and distribution of data streams.
The most popular message-oriented middleware (MOM) for big data are:
- RabbitMQ
- Apache Kafka
- Apache Pulsar
You can read this post about message-oriented middleware.
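As a rough sketch of the publish/subscribe model that these brokers share, here is a producer and a consumer using the third-party kafka-python client for Apache Kafka; the broker address and the topic name are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to a topic (broker address and topic are placeholders).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", f"event-{i}".encode("utf-8"))
producer.flush()

# Consume the messages back from the same topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)
```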
Big Data Real-Time Processing
Big data real-time processing platforms:
- Apache Storm
- Apache Flink
- Apache Spark Streaming
Apache Storm
Apache Storm is a distributed stream processing computation framework.
Apache Flink
Apache Flink is a distributed stream-processing and batch-processing framework.
Apache Spark Streaming
Apache Spark Streaming is a stream-processing framework within Apache Spark.
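As a hedged sketch, the following PySpark Structured Streaming example counts words arriving on a local TCP socket; the host and port are placeholders, and the example mirrors the classic streaming word count rather than anything specific to this post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Treat lines arriving on a TCP socket as an unbounded table.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Running word count over the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit updated counts to the console as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```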
Big Data Workflow
Big data workflow:
- Apache Oozie
- Apache Zookeeper
- Apache Ambari
- Apache NiFi
They are all widely used in big data deployments, although not all of them are specific to big data.
Apache Oozie
Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.
Apache Zookeeper
Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications.
It started as a sub-project of Hadoop, but it became a project on its own.
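As a small hedged sketch of the coordination primitives, this uses the third-party kazoo Python client; the connection string and the znode paths are placeholders.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Znodes form a hierarchical namespace used for configuration and coordination.
zk.ensure_path("/app/config")
zk.create("/app/config/workers", b"3")  # raises NodeExistsError if already present

# Read the value and its version back.
data, stat = zk.get("/app/config/workers")
print(data, stat.version)

zk.stop()
```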
Apache Ambari
Apache Ambari aims to simplify Hadoop management by providing software for provisioning, managing, and monitoring Apache Hadoop clusters.
Apache NiFi
Apache NiFi has the purpose of automating the flow of data between software systems.
It does not seem to be related to the Hadoop project.
Big Data Querying
Big data querying tools featured in this post:
- Apache Pig
- Apache Hive
Apache Pig
Apache Pig is built on top of MapReduce within Apache Hadoop.
It uses the Pig Latin programming language. It is convenient when the data is not very structured.
It is FOSS, under the Apache License 2.0.
Apache Hive
Apache Hive is built on top of MapReduce within Apache Hadoop.
It uses an SQL-like syntax (HiveQL) that is translated into MapReduce jobs, so it is convenient for big data with structured data.
It is FOSS, under the Apache License 2.0.
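As a rough illustration of that SQL-like access, here is a sketch using the third-party PyHive client against a HiveServer2 instance; the host, database, table and columns are invented for the example.

```python
from pyhive import hive

# Connect to HiveServer2 (host, port and database are placeholders).
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into MapReduce (or Tez) jobs.
cursor.execute("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")

for row in cursor.fetchall():
    print(row)
```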
Big Data Monitoring
Big data monitoring tools:
- Apache Flume
- Apache Chukwa
Apache Flume
Apache Flume is a large-scale log aggregation framework.
Apache Chukwa
Apache Chukwa was a data collection system for monitoring large distributed systems.
The project has been retired.
Big Data Graphs
Big-data graphs:
- Apache Giraph
Apache Giraph
Apache Giraph leverages Apache Hadoop to process graph data.
It is FOSS, under the Apache License 2.0.
Big Data Machine Learning
Apache Mahout
Apache Mahout is a distributed linear algebra framework that is also used for machine learning.
It first made use of Apache Hadoop and more recently moved to Apache Spark, phasing out Hadoop.
It is FOSS, under the Apache License 2.0.
Big Data Architectures
Big data architectures:
- Data lake architecture
- Lambda architecture
- Kappa architecture
Data Lake Architecture
The data lake architecture, or Hadoop-centric architecture, is focused on Hadoop tools. It is suited to batch processing.
Data lake architecture layers:
- Distributed File System
- Processing Layer
- Data Management and Workflow Tools
The distributed file system could be HDFS.
The processing layer would cover processing engines like Apache Hadoop (with MapReduce or Tez) or Apache Spark, in addition to query tools such as Pig or Hive.
Data management and workflow tools could cover Apache Oozie for workflow scheduling, Apache ZooKeeper for distributed coordination, and Apache Ambari for cluster management; together they help support complex data processing pipelines within Hadoop.
Lambda Architecture
The Lambda architecture is used when both real-time processing and historical data are needed.
It has 3 layers:
- Batch
- Speed
- Serving
The batch layer stores and processes large, historical datasets. It is typically implemented with tools like Apache Hadoop or Apache Spark. The batch layer is optimized for high throughput but has higher latency.
The speed layer processes data in real-time as it arrives, providing low-latency insights. Tools like Apache Storm, Apache Kafka, and Apache Spark Streaming are often used here.
The serving layer combines the results from both the batch and speed layers, allowing for fast queries. Data in this layer may be stored in NoSQL databases optimized for fast retrieval (such as Cassandra, HBase, or Elasticsearch).
Kappa Architecture
The Kappa architecture is a simplification of the Lambda architecture. It is specific to real-time streaming without historical data needs.
There is no batch layer, and there is a single processing layer akin to the speed layer.
According to the Astic source, it includes Kafka for messaging, Spark for processing, NoSQL databases and the Scala programming language.
Big Data Platforms
Big data platforms:
- Cloudera
- Hortonworks
- MapR
- Oracle Stream Analytics
Oracle Stream Analytics is a big data streaming tool. It would compete against Apache Kafka.
Big Data in the Cloud
AWS offers Amazon Elastic MapReduce (EMR), which runs Hadoop and Spark.
Azure offers the proprietary HDInsight, based on Hadoop components from the Hortonworks Data Platform (HDP).
Google Cloud Platform (GCP) offers Cloud Dataflow.
External References
- cloudlytics; “Hadoop vs Spark: A Comparative Study”; cloudlytics
- Shivansh Yadav; “MapReduce vs Tez”; dev.to, 2024-07-07