Big Data

This post is an introduction to big data.

A cluster is a set of processing devices or computers, which are called nodes in this context.

Big data refers to the management of large sets of data on a cluster. It is tightly related to parallel and distributed systems.

Big data relies on NoSQL databases, as they have a distributed approach. You can read this post about NoSQL databases.

V’s of Big Data

Five V’s of Big Data:

  1. Volume
  2. Velocity
  3. Variety
  4. Veracity
  5. Value

Originally, there were only the first 3. They were proposed by Doug Laney in his research paper “3D Data Management: Controlling Data Volume, Velocity and Variety”.

Big Data Standards

Big data standards:

  • ITU-T Y.3605

ISO 8000 is about data quality, so it is somewhat related.

ITU-T Y.3605

ITU-T Y.3605 has the title “Big data – Reference architecture”. It was published in 2020.

ITU-T Y.3605 in PDF format

ITU-T Y.3600 is supposed to have the title “Big data – Cloud computing based requirements and capabilities”, but I have never found this paper.

Big Data Knowledge Technology

Big Data knowledge technology:

  • KDD
  • SEMMA Methodology
  • CRISP-DM methodology

Knowledge Discovery in Databases (KDD)

SEMMA (Sample, Explore, Modify, Model, Assess)

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Big Data Programming Models

Big data programming model featured on this post:

  • MapReduce

MapReduce

MapReduce is a programming model or paradigm to process and generate large amounts of data, relying on parallel and distributed algorithms.

It takes its name from the original implementation by Google, which is no longer in active use. Google’s MapReduce, introduced in 2003, was the first successful approach to big data, used to enhance the efficiency of Google’s PageRank algorithm.

There are MapReduce libraries available for different programming languages.

The Map function processes an input pair (key, value) to produce a list of intermediate key-value pairs. The output may contain multiple pairs with the same or different keys.

The Reduce function groups all intermediate pairs by key and aggregates the values associated with each key into a single output value (or a list of values), typically producing one output per key.
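
As an illustration, here is a minimal, single-process Python sketch of the word-count example that is typically used to explain MapReduce. It only simulates the model on one machine; a real MapReduce framework distributes the map, shuffle and reduce steps across the nodes of the cluster.

  from collections import defaultdict

  # Map: emit an intermediate (word, 1) pair for every word in a line.
  def map_function(line):
      for word in line.split():
          yield (word, 1)

  # Reduce: aggregate all values emitted for a given key into one output.
  def reduce_function(key, values):
      return (key, sum(values))

  lines = ["big data on a cluster", "big data and parallel systems"]

  # Shuffle: group intermediate pairs by key (done by the framework in practice).
  grouped = defaultdict(list)
  for line in lines:
      for key, value in map_function(line):
          grouped[key].append(value)

  results = [reduce_function(key, values) for key, values in grouped.items()]
  print(results)  # e.g. [('big', 2), ('data', 2), ('on', 1), ...]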

Hadoop-based Big Data Tools

Big data tools aid in managing large volumes of data. They all have in common that they handle data in distributed systems.

Big data tool categories featured on this post, in a bottom-up approach:

  • Big Data Distributed File Systems
    • HDFS
    • Alluxio
  • Big Data Serialization
    • Apache Avro
  • Big Data Processing Frameworks
    • Apache Hadoop
    • Apache Spark
  • Big Data Streaming
    • Apache Kafka
    • RabbitMQ
    • Apache Pulsar
  • Big Data Real-Time Processing
    • Apache Storm
    • Apache Flink
    • Apache Spark Streaming
  • Big Data Workflow and Resource Management
    • Apache Oozie
    • Apache Zookeeper
    • Apache Ambari
    • Apache NiFi
  • Big Data Querying
    • Apache Pig
    • Apache Hive
  • Big Data Monitoring
    • Apache Flume
    • Apache Chukwa
  • Graph Processing
    • Apache Giraph
  • Machine Learning
    • Apache Mahout

Big Data Storage and Serialization

Big data storage models may be complemented or substituted by NoSQL databases.

Big data storage and serialization:

  • HDFS
  • Alluxio

HDFS

Hadoop Distributed File System (HDFS) is a distributed file system.

It is included within Apache Hadoop software, and it is the foundation of all Apache Hadoop tools.

Alluxio

Alluxio is a virtual distributed file system.

Big Data Serialization

Big data serialization:

  • Apache Avro

Apache Avro

Apache Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project.

Apache Avro official website

Apache Avro at Wikipedia
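
As a sketch, the following Python snippet writes and reads a few records in Avro format. It assumes the third-party fastavro library is installed (the official avro Python package offers similar functionality), and the "User" schema is only an illustrative example.

  from fastavro import writer, reader, parse_schema

  # Illustrative schema: a record with two fields.
  schema = parse_schema({
      "type": "record",
      "name": "User",
      "fields": [
          {"name": "name", "type": "string"},
          {"name": "age", "type": "int"},
      ],
  })

  records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

  # Serialize the records to an Avro container file.
  with open("users.avro", "wb") as out:
      writer(out, schema, records)

  # Deserialize them back.
  with open("users.avro", "rb") as f:
      for record in reader(f):
          print(record)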

Big Data Coordination

Big data coordination and cluster resource management tools:

  • YARN (included in Apache Hadoop)
  • Apache Mesos

Big Data Core Processing Frameworks

The most popular big data processing frameworks are Apache Hadoop and Apache Spark.

Big data processing frameworks featured on this post:

  • Apache Hadoop
  • Apache Spark

Apache Hadoop

Apache Hadoop facilitates using a network of many computers to solve problems involving massive amounts of data and computation.

Apache Hadoop components, in bottom-up order:

  1. Storage: Hadoop Distributed File System (HDFS)
  2. Cluster resource management: YARN
  3. Data processing: MapReduce by default, with Apache Tez as an alternative

Yet Another Resource Negotiator (YARN) allows working with different big data processing paradigms, such as MapReduce or Apache Tez. YARN was introduced in Hadoop 2.0; Hadoop 1.0 supported only MapReduce.

Apache Hadoop was first released in 2006. It is written in Java.

It is FOSS, under an Apache License 2.0.

YARN’s data processing modules:

  • Hadoop’s MapReduce
  • Apache Tez

Apache Spark

Apache Spark is a unified engine to process large amounts of data. It is often deployed on top of HDFS, although it can use other storage systems. Unlike Hadoop, it does not use YARN by default; it ships with its own standalone cluster manager, although it can also run on YARN.

It was first released in 2014. It is written in Scala.

It is FOSS, under an Apache License 2.0.
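
As a sketch, the following PySpark program counts words in a text file. It assumes the pyspark package is installed and runs Spark in local mode; input.txt is just a placeholder file name.

  from pyspark.sql import SparkSession

  # Start a local Spark session (no cluster needed for this sketch).
  spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

  lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])

  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

  for word, count in counts.collect():
      print(word, count)

  spark.stop()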

Apache Spark official website

Big Data Processing Components

Big data processing components complement the processing function of a processing framework.

Big Data Processing Components:

  • Hadoop’s MapReduce
  • Apache Tez

Hadoop’s MapReduce

Hadoop’s MapReduce implementation is one of the components of Apache Hadoop. It is built on top of YARN.

Its origin and general characteristics are described above in the MapReduce programming model section.

Apache Tez

Apache Tez is an alternative to MapReduce when using YARN in Apache Hadoop.

Tez can be considered a more flexible and efficient processing engine than MapReduce.

Apache Tez official website

Big Data Streaming

Distributed messaging and streaming platforms focus on the transport and distribution of data streams.

The most popular message-oriented middleware (MOM) for big data are:

  • RabbitMQ
  • Apache Kafka
  • Apache Pulsar

You can read this post about message-oriented middleware.
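
As a sketch, the following Python snippet publishes and consumes messages on an Apache Kafka topic. It assumes the third-party kafka-python library and a broker listening on localhost:9092; the topic name "events" is just an example.

  from kafka import KafkaProducer, KafkaConsumer

  # Publish a message to the "events" topic.
  producer = KafkaProducer(bootstrap_servers="localhost:9092")
  producer.send("events", b"hello big data")
  producer.flush()

  # Read messages from the beginning of the topic, giving up after 5 seconds.
  consumer = KafkaConsumer("events",
                           bootstrap_servers="localhost:9092",
                           auto_offset_reset="earliest",
                           consumer_timeout_ms=5000)
  for message in consumer:
      print(message.value)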

Big Data Real-Time Processing

Big data real-time processing platforms:

  • Apache Storm
  • Apache Flink
  • Apache Spark Streaming

Apache Storm

Apache Storm is a distributed stream processing computation framework.

Apache Storm at Wikipedia

Apache Flink

Apache Flink is a distributed stream-processing and batch-processing framework.

Apache Flink at Wikipedia

Apache Spark Streaming

Apache Spark Streaming is a stream-processing framework within Apache Spark.
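
As a sketch, here is the classic Spark Streaming word count over a TCP socket, using one-second micro-batches. It assumes pyspark is installed and that something (for example, netcat) is writing lines to localhost:9999; note that newer Spark versions favour Structured Streaming, while the DStream API below is the one traditionally associated with Spark Streaming.

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext("local[2]", "NetworkWordCount")
  ssc = StreamingContext(sc, 1)  # 1-second micro-batches

  # Each micro-batch becomes an RDD of the lines received during that second.
  lines = ssc.socketTextStream("localhost", 9999)
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
  counts.pprint()

  ssc.start()
  ssc.awaitTermination()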

Big Data Workflow

Big data workflow:

  • Apache Oozie
  • Apache Zookeeper
  • Apache Ambari
  • Apache NiFi

Most of them originated in the Hadoop ecosystem, although not all of them are specific to big data.

Apache Oozie

Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.

Apache Oozie at Wikipedia

Apache Zookeeper

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications.

It started as a sub-project of Hadoop, but it became a project on its own.
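
As a sketch, the following Python snippet stores and reads a small piece of coordination data in ZooKeeper. It assumes the third-party kazoo client library and a ZooKeeper server on localhost:2181; the znode path and its content are just illustrative examples.

  from kazoo.client import KazooClient

  # Connect to a local ZooKeeper server.
  zk = KazooClient(hosts="127.0.0.1:2181")
  zk.start()

  # Create a znode holding a small piece of shared configuration.
  zk.ensure_path("/app")
  if not zk.exists("/app/config"):
      zk.create("/app/config", b"max_workers=4")

  data, stat = zk.get("/app/config")
  print(data.decode(), "version:", stat.version)

  zk.stop()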

Apache Zookeeper at Wikipedia

Apache Ambari

Apache Ambari simplifies Hadoop management by providing software for provisioning, managing, and monitoring Apache Hadoop clusters.

Apache NiFi

Apache NiFi has the purpose of automating the flow of data between software systems.

It does not seem to be related to the Hadoop project.

Apache NiFi at Wikipedia

Big Data Querying

Big data querying tools featured on this post:

  • Apache Pig
  • Apache Hive

Apache Pig

Apache Pig is built on top of MapReduce within Apache Hadoop.

It uses the Pig Latin programming language. It is convenient when the data is not very structured.

It is FOSS under an Apache License 2.0.

Apache Pig official website

Apache Hive

Apache Hive is built on top of MapReduce within Apache Hadoop.

It uses an SQL-like syntax (HiveQL) to access MapReduce, so it is convenient for structured big data.

It is FOSS under an Apache License 2.0.

Apache Hive official website

Big Data Monitoring

Big data monitoring tools:

  • Apache Flume
  • Apache Chukwa

Apache Flume

Apache Flume is a large-scale log aggregation framework.

Apache Chukwa

Apache Chukwa was a data collection system for monitoring large distributed systems.

The project has been retired.

Apache Chukwa website

Big Data Graphs

Big-data graphs:

  • Apache Giraph

Apache Giraph

Apache Giraph leverages Apache Hadoop to process graph data.

It is FOSS, under an Apache License 2.0.

Big Data Machine Learning

Apache Mahout

Apache Mahout leverages Apache Hadoop and Apache Spark.

It initially made use of Hadoop, but more recently it moved to Spark and phased out Hadoop. Its main focus is distributed linear algebra, and it is also used for machine learning.

It is FOSS, under an Apache License 2.0.

Big Data Architectures

Big data architectures:

  • Data lake architecture
  • Lambda architecture
  • Kappa architecture

Data Lake Architecture

Data lake architecture, or Hadoop-centric architecture, is focused on Hadoop tools. It is suited for batch processing.

Data lake architecture layers:

  1. Distributed File System
  2. Processing Layer
  3. Data Management and Workflow Tools

The distributed file system could be HDFS.

The processing layer covers processing engines such as Apache Hadoop (with MapReduce or Tez) or Apache Spark, in addition to query tools like Pig or Hive.

Data management and workflow tools could include Apache Oozie for workflow scheduling, Apache ZooKeeper for distributed coordination, and Apache Ambari for cluster management; together they help support complex data processing pipelines within Hadoop.

Lambda Architecture

Lambda architecture is used when there are both real-time processing and historical data needs.

It has 3 layers:

  • Batch
  • Speed
  • Serving

The batch layer stores and processes large, historical datasets. It is typically implemented with tools like Apache Hadoop or Apache Spark. The batch layer is optimized for high throughput but has higher latency.

The speed layer processes data in real-time as it arrives, providing low-latency insights. Tools like Apache Storm, Apache Kafka, and Apache Spark Streaming are often used here.

The serving layer combines the results from both the batch and speed layers, allowing for fast queries. Data in this layer may be stored in NoSQL databases optimized for fast retrieval (such as Cassandra, HBase, or Elasticsearch).
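
As a very rough illustration, the following Python sketch shows how a serving layer could merge a precomputed batch view with a real-time view; both views are plain dictionaries here instead of the actual stores and engines named above.

  # Precomputed by the batch layer (e.g. a nightly Hadoop/Spark job).
  batch_view = {"page_a": 10_000, "page_b": 7_500}

  # Maintained by the speed layer from events that arrived after the last batch run.
  realtime_view = {"page_a": 42, "page_c": 3}

  # Serving layer: answer queries by combining both views.
  def total_count(page):
      return batch_view.get(page, 0) + realtime_view.get(page, 0)

  print(total_count("page_a"))  # 10042
  print(total_count("page_c"))  # 3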

Kappa Architecture

Kappa architecture is a simplification of the lambda architecture. It is specific to real-time streaming without historical data needs.

There is no batch layer, and there is a single processing layer akin to the speed layer.

According to the Astic source, it includes Kafka for messaging, Spark for processing, NoSQL databases, and the Scala programming language.

Big Data Platforms

Big data platforms:

  • Cloudera
  • Hortonworks
  • MapR
  • Oracle Stream Analytics

Oracle Stream Analytics is a big data streaming tool. It would compete against Apache Kafka.

Big Data in Cloud

AWS offers Amazon Elastic MapReduce (EMR), which runs Hadoop and Spark.

Azure offers the proprietary HDInsight, based on Hadoop components from the Hortonworks Data Platform (HDP).

Google Cloud Platform (GCP) offers Cloud Dataflow.
