Data Science and Engineering

This post is about data science and engineering, which can be considered one of the main fields of information technology.

You can read this post about information technology.

DIKW Pyramid

DIKW ordered from top to bottom:

  • Wisdom
  • Knowledge
  • Information
  • Data

Data Structures

Data structures are really part of theoretical computer science rather than data science.

You can read this post about data structures.

Data Representation

You can read this post about data representation.

Data Preservation

You can read this post about data preservation.

Data Migration

You can read this post about data migration.

Fields of Data Science and Engineering

Data Science and Engineering can be divided into:

  • Data Management
  • Data Analytics

Data Management

Data management:

  • Databases
  • Data lake
  • Data warehousing
  • OLAP
  • Big data

Databases

You can read this post about databases.

Content Management System

You can read this post about content management systems (CMS).

Data Lake

A data lake is a repository of data stored in its raw format.

It may consist of structured data (relational databases), semi-structured data (XML, JSON) or unstructured data (text, images).
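As a rough illustration of the three kinds of data, the sketch below keeps every object as raw bytes and only interprets it when it is read. The paths, contents and the `read_object` helper are made up for this example and do not come from any particular data lake product.

```python
import csv
import io
import json

# Hypothetical raw objects as they might sit in a data lake, keyed by path.
raw_objects = {
    "sales/2024.csv": b"id,amount\n1,10.5\n2,20.0\n",             # structured
    "events/click.json": b'{"user": "ana", "tags": ["a", "b"]}',  # semi-structured
    "notes/readme.txt": b"Free text with no fixed schema.",       # unstructured
}


def read_object(path: str):
    """Interpret a raw object only when it is read, based on its extension."""
    text = raw_objects[path].decode("utf-8")
    if path.endswith(".csv"):
        # Each row becomes a dict; values stay as strings
        return list(csv.DictReader(io.StringIO(text)))
    if path.endswith(".json"):
        return json.loads(text)
    return text  # unstructured: returned as plain text
```

The stored bytes are never modified, so the same object can later be read in a different way if needed.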

Data Warehousing

Data warehousing is the process of collecting, storing, and organizing large volumes of structured data from various sources, such as an OLTP (Online Transaction Processing) system, within a centralized repository. These sources are called .

Data warehousing relies on batch processing of historical, long-term data for analytical purposes, in contrast to OLTP, which relies on real-time processing of transactional data for operational purposes.

Data warehousing provides the infrastructure for large volumes of data, and it is closely related to Extract, Transform and Load (ETL) processes and data consolidation.

A staging area is a storage area that gathers only the OLTP data that is relevant to the data warehouse, prior to any transformation.
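The flow from an OLTP source through a staging area into the warehouse can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch, not a production ETL pipeline: the table names (`oltp_orders`, `staging_orders`, `dw_sales`), the columns and the "paid orders only" rule are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical OLTP source table
    CREATE TABLE oltp_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT);
    INSERT INTO oltp_orders VALUES
        (1, 'ana', 1050, 'paid'),
        (2, 'ben', 2000, 'cancelled'),
        (3, 'ana', 500, 'paid');
    -- Staging area: untransformed copy of the relevant rows
    CREATE TABLE staging_orders (id INTEGER, customer TEXT, amount_cents INTEGER);
    -- Warehouse table: transformed, analysis-ready data
    CREATE TABLE dw_sales (customer TEXT, total_amount REAL);
""")

# Extract: copy only the relevant (paid) orders into staging
conn.execute("""
    INSERT INTO staging_orders
    SELECT id, customer, amount_cents FROM oltp_orders WHERE status = 'paid'
""")

# Transform and load: convert cents to currency units and aggregate per customer
conn.execute("""
    INSERT INTO dw_sales
    SELECT customer, SUM(amount_cents) / 100.0 FROM staging_orders GROUP BY customer
""")

staged_count = conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
warehouse_rows = conn.execute(
    "SELECT customer, total_amount FROM dw_sales ORDER BY customer"
).fetchall()
```

In this sketch the staging table holds two rows and the warehouse ends up with one aggregated row per customer.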

An operational data store (ODS) is an interim logical area for a data warehouse.

A data mart is an aggregation of relevant and transformed data regarding a department or section within an organization. The aggregation of data marts forms a single data warehouse.

I have not identified any standard related to data warehousing, but an author belonging to Astic considers Spain’s quality management standard UNE 66175:2003 to be relevant.

W. H. Inmon (1945-) is considered by many to be the father of data warehousing.

OLAP

Online Analytical Processing (OLAP) is a technology in which data is structured in multidimensional cubes or hypercubes to provide quick access to summarized data.

OLAP technology is frequently built on top of data warehouses, and enables data exploitation.

OLAP provides the analytical layer to data warehousing.

OLAP technologies:

  • Relational OLAP (ROLAP)
  • Multidimensional OLAP (MOLAP)
  • Hybrid OLAP (HOLAP)

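Relational OLAP (ROLAP) computes its aggregations at query time, directly from the relational tables.

Multidimensional OLAP (MOLAP) precomputes its aggregations before the query and stores them in multidimensional cubes.

Hybrid OLAP (HOLAP) combines the strengths of ROLAP and MOLAP.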
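The difference between calculating after the query and precalculating before it can be shown with a toy example. This is a deliberately simplified sketch in plain Python: real ROLAP engines issue SQL against relational tables, and real MOLAP engines use specialized cube storage. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, year, amount)
facts = [
    ("north", 2023, 100),
    ("north", 2024, 150),
    ("south", 2023, 80),
    ("south", 2024, 120),
]


def rolap_total(region: str) -> int:
    """ROLAP style: scan the detail rows and aggregate at query time."""
    return sum(amount for r, _, amount in facts if r == region)


def build_cube() -> dict:
    """MOLAP style: aggregate once, ahead of any query."""
    cube = defaultdict(int)
    for region, year, amount in facts:
        cube[(region, year)] += amount  # finest cell
        cube[(region, None)] += amount  # total per region, all years
        cube[(None, year)] += amount    # total per year, all regions
        cube[(None, None)] += amount    # grand total
    return dict(cube)


cube = build_cube()


def molap_total(region: str) -> int:
    """MOLAP style: a query is a lookup in the precomputed cube."""
    return cube[(region, None)]
```

Both functions return the same totals. The trade-off is that the ROLAP-style query does its work on every call, while the MOLAP-style cube costs time and space up front.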

OLAP consists of:

  • Dimension table
  • Fact table
  • Indicator

A dimension table contains descriptive values.

A fact table contains numeric values.

An indicator is an aggregation of a certain fact based on given dimensions.
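The three elements can be illustrated with a small sketch in plain Python. The product and sales data, and the `revenue_by_category` function, are hypothetical and only serve to show how an indicator is derived from a fact and a dimension.

```python
from collections import defaultdict

# Dimension table: descriptive values, keyed by an id (hypothetical data)
dim_product = {
    1: {"name": "Laptop", "category": "Hardware"},
    2: {"name": "Mouse", "category": "Hardware"},
    3: {"name": "Antivirus", "category": "Software"},
}

# Fact table: numeric measures plus a reference to the dimension
fact_sales = [
    {"product_id": 1, "units": 2, "revenue": 2000.0},
    {"product_id": 2, "units": 5, "revenue": 50.0},
    {"product_id": 3, "units": 3, "revenue": 300.0},
    {"product_id": 1, "units": 1, "revenue": 1000.0},
]


def revenue_by_category() -> dict:
    """Indicator: total revenue (fact) aggregated by category (dimension)."""
    totals = defaultdict(float)
    for sale in fact_sales:
        category = dim_product[sale["product_id"]]["category"]
        totals[category] += sale["revenue"]
    return dict(totals)
```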

OLAP architecture schemas:

  • Star
  • Snowflake
  • Galaxy

Star has a central fact table that is connected to the other dimension tables, resembling a star. Dimension tables are often denormalized and there is data redundancy.

Snowflake also has a central fact table, but its dimension tables are in turn connected to other dimension tables, resembling a snowflake.

It is usually slower than a star schema because it requires more joins, but on the other hand there is less redundancy, which reduces size and improves scalability.
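To make the extra join concrete, the following sketch builds the same tiny data set in both layouts using Python's built-in `sqlite3` module. All table and column names (`star_dim_product`, `snow_dim_category`, and so on) are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: the product dimension is denormalized (category repeated per row)
    CREATE TABLE star_dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Snowflake: the category is normalized into its own dimension table
    CREATE TABLE snow_dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE snow_dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES snow_dim_category (category_id)
    );
    -- Fact table shared by both layouts in this sketch
    CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);

    INSERT INTO star_dim_product VALUES
        (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Antivirus', 'Software');
    INSERT INTO snow_dim_category VALUES (10, 'Hardware'), (20, 'Software');
    INSERT INTO snow_dim_product VALUES
        (1, 'Laptop', 10), (2, 'Mouse', 10), (3, 'Antivirus', 20);
    INSERT INTO fact_sales VALUES (1, 2000.0), (2, 50.0), (3, 300.0), (1, 1000.0);
""")

# Star: one join from the fact table to the dimension
STAR_QUERY = """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN star_dim_product p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
"""

# Snowflake: a second join is needed to reach the category name
SNOWFLAKE_QUERY = """
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN snow_dim_product p ON p.product_id = f.product_id
    JOIN snow_dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()
```

Both queries return the same totals. The star layout stores the text 'Hardware' twice, while the snowflake layout stores it once at the cost of an additional join.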

Galaxy or constellation has two or more fact tables that share the same dimension table. It may have optional hierarchies in the fact tables.

Query manager operations:

  • Drill-down
  • Drill-up
  • Drill-across
  • Roll-across
  • Pivot
  • Page
  • Drill-through
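Three of these operations (drill-up, drill-down and pivot) can be sketched in plain Python as different groupings of the same rows. This is only a toy illustration with hypothetical data and a hypothetical `aggregate` helper. It does not cover the remaining operations, and it is not how an OLAP query manager is actually implemented.

```python
from collections import defaultdict

# Hypothetical sales rows with a time hierarchy (year > quarter) and a region
sales = [
    {"year": 2024, "quarter": "Q1", "region": "north", "amount": 100},
    {"year": 2024, "quarter": "Q1", "region": "south", "amount": 80},
    {"year": 2024, "quarter": "Q2", "region": "north", "amount": 150},
    {"year": 2024, "quarter": "Q2", "region": "south", "amount": 120},
]


def aggregate(rows, dims):
    """Sum the amount for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)


# Drill-up: summarize at the coarsest level of the time hierarchy
by_year = aggregate(sales, ["year"])

# Drill-down: move from year to the more detailed quarter level
by_quarter = aggregate(sales, ["year", "quarter"])

# Pivot: same cells, but with the two axes swapped
quarter_by_region = aggregate(sales, ["quarter", "region"])
region_by_quarter = {(r, q): v for (q, r), v in quarter_by_region.items()}
```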

Big Data

Big data deals with managing high volumes of data. You can read this post about big data.

Data Analytics

Data analytics includes:

  • Business Intelligence
  • Data mining

Other fields:

  • Diagnostic analytics
  • Predictive analytics
  • Prescriptive analytics

Modern Data Analytics makes extensive use of Artificial Intelligence technologies, including machine learning (ML).

Business Intelligence

Business Intelligence (BI) is considered part of data analytics.

You can read this post about business intelligence.

Some tools:

  • Reports and queries
  • Online Analytical Processing (OLAP)
  • Dashboards
  • Executive Information Systems (EIS)

Data mining

Data mining has the objective of finding patterns within large volumes of data. It can be considered a field of statistics and information systems.

Some of these patterns are:

  • Data groups (cluster analysis)
  • Unusual records (anomaly detection)
  • Dependencies (association rule mining)

It combines statistics, artificial intelligence, machine learning and database management systems.

A related term is Knowledge Discovery in Databases (KDD).
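As a small example of one of the patterns listed above, the sketch below flags unusual records using a z-score, which is one simple statistical approach to anomaly detection among many. The `find_anomalies` function, the threshold of 2.0 and the sample readings are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold (assumed to be 2.0)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier
readings = [10, 11, 9, 10, 12, 11, 10, 95]
anomalies = find_anomalies(readings)
```

A z-score works reasonably well for roughly bell-shaped data, but a large outlier also inflates the mean and standard deviation, so more robust methods are often preferred in practice.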

Data Migration

Data migration modes:

  • Refronting
  • Replacing
  • Rehosting
  • Rearchitecting
  • Retirement

These categories were found on this external link, and were mentioned in Spain’s GSI A1 2024 exam.

Data Analysis Frameworks

Data analysis frameworks and libraries featured in this post:

  • Pandas

Pandas

Pandas is a data analysis and manipulation library written in Python.

It is FOSS, under a BSD-3-clause license.

It may be used together with the NumPy library, which provides support for large, multi-dimensional arrays and matrices.
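Here is a short sketch of how the two libraries are typically combined, assuming both `pandas` and `numpy` are installed. The sample data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data; the amount column is backed by a NumPy array
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount": np.array([100.0, 150.0, 80.0, 120.0]),
})

# Group and aggregate: total amount per region
total_by_region = sales.groupby("region")["amount"].sum()

# Pivot: regions as rows, quarters as columns
pivot = sales.pivot_table(index="region", columns="quarter", values="amount", aggfunc="sum")

# Hand the column to NumPy for array-level math
log_amounts = np.log(sales["amount"].to_numpy())
```

Pandas handles the labelled, tabular side (grouping, pivoting), while NumPy provides the numerical operations on the underlying arrays.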

Pandas official website

Pandas code repository

Programming for Data Science and Engineering

The typical programming languages used in the fields of data science and engineering are Python and R.

Books on Programming for Data Science and Engineering

Books:

  • “Python for Data Analysis”, by Wes McKinney
  • “Numerical Python: A Practical Techniques Approach for Industry”, by Robert Johansson
  • “Python Data Science Handbook”, by Jake VanderPlas

“Python for Data Analysis” at the Open Library

“Numerical Python” at the Open Library

“Python Data Science Handbook” at the Open Library

