Data Science and Engineering

This post is about data science and engineering, which can be considered one of the main fields of information technology.

You can read this post about information technology.

Types of Data according to its Structure

Types of data:

  • Structured
  • Semi-structured
  • Unstructured

An example of structured data is a relational database.

An example of semi-structured data is an XML or JSON file.

An example of unstructured data is a video, a photo, an email or a text file.
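The three types can be compared side by side in a minimal Python sketch. The `customer` table, the JSON record and the sample sentence are hypothetical data made up for illustration.

```python
import json
import sqlite3

# Structured: rows that follow a fixed schema in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
structured_row = conn.execute("SELECT id, name FROM customer").fetchone()

# Semi-structured: self-describing, but fields may vary between records.
semi_structured = json.loads('{"id": 1, "name": "Ada", "tags": ["vip"]}')

# Unstructured: free text with no schema at all.
unstructured = "Ada called on Monday to ask about her invoice."
```

The structured row can be queried by column, the JSON record by key, while the free text would need parsing or text analysis before any field could be extracted from it.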

Fields of Data Science and Engineering

Data Science and Engineering can be divided into:

  • Data Management
  • Data Analytics

Data Management

Data management:

  • Databases
  • Data lake
  • Data warehousing
  • OLAP
  • Big data

Databases

You can read this post about databases.

Data Lake

A data lake is a repository of data stored in its raw format.

It may consist of structured data (relational databases), semi-structured data (XML, JSON) or unstructured data (text, images).

Data Warehousing

Data warehousing is the process of collecting, storing, and organizing large volumes of structured data from various sources within a centralized repository. These sources are typically OLTP (Online Transaction Processing) systems.

Data warehousing relies on batch processing, holds historical, long-term data and serves analytical purposes. OLTP, in contrast, relies on real-time processing, holds transactional data and serves operational purposes.

Data warehousing provides the infrastructure for large volumes of data, and it is closely related to Extract, Transform and Load (ETL) processes and data consolidation.

A staging area is a storage area where only the OLTP data that is relevant to the data warehouse is gathered, prior to any transformation.
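The flow from an OLTP source through a staging area into a warehouse can be sketched as follows. This is a minimal illustration, not a production ETL design: the `orders` and `customer_sales` tables, the sample rows and the rule "only paid orders are relevant" are all assumptions made for the example.

```python
import sqlite3

# Hypothetical OLTP source: one row per order.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "ada", 1250, "paid"), (2, "bob", 990, "cancelled"), (3, "ada", 300, "paid")],
)

def extract(source):
    """Extract: copy only the relevant rows (paid orders) into staging."""
    return source.execute(
        "SELECT customer, amount_cents FROM orders WHERE status = 'paid'"
    ).fetchall()

def transform(staged_rows):
    """Transform: consolidate staged rows into one total per customer."""
    totals = {}
    for customer, amount_cents in staged_rows:
        totals[customer] = totals.get(customer, 0) + amount_cents
    return sorted(totals.items())

def load(target, rows):
    """Load: write the consolidated rows into the warehouse table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS customer_sales (customer TEXT, total_cents INTEGER)"
    )
    target.executemany("INSERT INTO customer_sales VALUES (?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
staging = extract(oltp)               # staging area: untransformed, relevant data
load(warehouse, transform(staging))   # batch load into the warehouse
result = warehouse.execute(
    "SELECT customer, total_cents FROM customer_sales ORDER BY customer"
).fetchall()
```

Here the staging list holds the two paid orders as they came from the source, and the warehouse ends up with a single consolidated row for the customer.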

An operational data store (ODS) is an interim logical area for a data warehouse.

A data mart is an aggregation of relevant and transformed data regarding a department or section within an organization. The aggregation of data marts forms a single data warehouse.

I have not identified any standard specific to data warehousing, but an author belonging to Astic considers the Spanish quality management standard UNE 66175:2003 relevant.

W. H. Inmon (1945-) is considered by many the father of data warehousing.

OLAP

Online Analytical Processing (OLAP) is a technology in which data is structured in multidimensional cubes or hypercubes to provide quick access to summarized data.

OLAP technology is frequently built on top of data warehouses, and enables data exploitation.

OLAP provides the analytical layer for data warehousing.

OLAP technologies:

  • Relational OLAP (ROLAP)
  • Multidimensional OLAP (MOLAP)
  • Hybrid OLAP (HOLAP)

In ROLAP, aggregations are calculated at query time, after the query is issued.

In MOLAP, aggregations are precalculated before the query.
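The difference between the two approaches can be sketched in plain Python. This is only an illustration of the timing of the aggregation, not of how real OLAP engines store data; the `facts` rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, sales)
facts = [
    ("north", 2023, 100),
    ("north", 2024, 150),
    ("south", 2023, 80),
    ("south", 2024, 120),
]

def rolap_total(region):
    """ROLAP style: scan the detailed rows when the query arrives."""
    return sum(sales for r, _, sales in facts if r == region)

def build_cube(rows):
    """MOLAP style: precompute every aggregate once, before any query."""
    cube = defaultdict(int)
    for region, year, sales in rows:
        cube[(region, year)] += sales
        cube[(region, None)] += sales  # None stands for "all years"
        cube[(None, year)] += sales    # None stands for "all regions"
        cube[(None, None)] += sales    # grand total
    return dict(cube)

cube = build_cube(facts)
molap_total = cube[("north", None)]    # a lookup, no scan at query time
```

Both paths return the same total for the north region. The ROLAP-style function pays the cost on every query, while the MOLAP-style cube pays it once up front and needs extra storage for the precomputed cells.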

An OLAP model consists of:

  • Dimension table
  • Fact table
  • Indicator

An indicator is an aggregation of a certain fact based on given dimensions.
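As a rough sketch, these three elements can be modelled with SQLite from Python's standard library. The table names (`dim_product`, `dim_date`, `fact_sales`) and the sample rows are assumptions for illustration. The layout, with one fact table joined directly to its dimension tables, corresponds to a star schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables describe the axes of analysis.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
-- The fact table stores the measured events, keyed by dimension.
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount INTEGER);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20), (1, 11, 30), (2, 11, 50), (2, 11, 5);
""")

# Indicator: total sales (the fact) aggregated by category and year (the dimensions).
indicator = db.execute("""
SELECT p.category, d.year, SUM(f.amount) AS total_sales
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()
```

Changing the `GROUP BY` columns yields a different indicator over the same facts, for example total sales per year only.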

OLAP architecture schemas:

  • Star
  • Snowflake
  • Galaxy

Query manager operations:

  • Drill-down
  • Drill-up
  • Drill-across
  • Roll-across
  • Pivot
  • Page
  • Drill-through
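Three of these operations (drill-up, drill-down and pivot) can be illustrated with a small sketch over in-memory rows. The `sales` data and the `aggregate` helper are hypothetical, and real OLAP tools expose these operations through their own query interfaces.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, amount)
sales = [
    (2024, "Q1", "north", 10),
    (2024, "Q1", "south", 20),
    (2024, "Q2", "north", 30),
    (2024, "Q2", "south", 40),
]

def aggregate(rows, key):
    """Sum the amount column, grouped by the value that key(row) returns."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-up: summarize at the coarser year level.
by_year = aggregate(sales, key=lambda r: r[0])

# Drill-down: break the same figure down to the finer quarter level.
by_quarter = aggregate(sales, key=lambda r: (r[0], r[1]))

# Pivot: swap the axes of a quarter-by-region view so regions come first.
quarter_by_region = aggregate(sales, key=lambda r: (r[1], r[2]))
region_by_quarter = {
    (region, quarter): total
    for (quarter, region), total in quarter_by_region.items()
}
```

Drill-up and drill-down move along the hierarchy of one dimension (year, then quarter), while pivot keeps the same cells and only changes how they are arranged.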

Big Data

Big data deals with managing high volumes of data. You can read this post about big data.

Data Analytics

Data analytics includes:

  • Business Intelligence
  • Data mining

Other fields:

  • Diagnostic analytics
  • Predictive analytics
  • Prescriptive analytics

Modern Data Analytics makes extensive use of Artificial Intelligence technologies, including machine learning (ML).

Business Intelligence

Business Intelligence (BI) is considered part of data analytics.

You can read this post about business intelligence.

Some tools:

  • Reports and queries
  • Online Analytical Processing (OLAP)
  • Dashboards
  • Executive Information Systems (EIS)

Data mining

Data mining aims to find patterns within large volumes of data. It can be regarded as a field at the intersection of statistics and information systems.

Some of these patterns are:

  • Data groups (cluster analysis)
  • Unusual records (anomaly detection)
  • Dependencies (association rule mining)
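Two of these patterns can be sketched with Python's standard library. The example uses a simple z-score rule for anomaly detection and the confidence measure for an association rule. The data, the threshold of 2.0 and the function names are assumptions chosen for illustration, and practical data mining tools use more robust methods.

```python
from statistics import mean, pstdev

# Hypothetical daily order counts; the last value looks unusual.
daily_orders = [102, 98, 101, 97, 103, 99, 100, 250]

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score exceeds the (assumed) threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical shopping baskets for association rule mining.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    matches = sum(1 for t in with_antecedent if consequent <= t)
    return matches / len(with_antecedent)

anomalies = find_anomalies(daily_orders)
bread_implies_milk = confidence({"bread"}, {"milk"})
```

In this sample only the value 250 is flagged as an anomaly, and two of the three baskets that contain bread also contain milk.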

It combines statistics, artificial intelligence, machine learning and database management systems.

A related term is Knowledge Discovery in Databases (KDD).
