This post is about data science and engineering, which can be considered one of the main fields of information technology.
You can read this post about information technology.
DIKW Pyramid
DIKW is ordered from top to bottom:
- Wisdom
- Knowledge
- Information
- Data
Data Structures
Data structures are really part of theoretical computer science rather than data science.
You can read this post about data structures.
Data Representation
You can read this post about data representation.
Data Preservation
You can read this post about data preservation.
Data Migration
You can read this post about data migration.
Fields of Data Science and Engineering
Data Science and Engineering can be divided into:
- Data Management
- Data Analytics
Data Management
Data management covers:
- Databases
- Data lake
- Data warehousing
- OLAP
- Big data
Databases
You can read this post about databases.
Enterprise Content Management
Enterprise Content Management (ECM) systems are designed to manage, store, and collaborate on digital content.
ECM must not be confused with a content management system (CMS), which focuses on website content.
You can read this post about enterprise content management (ECM).
Data Lake
A data lake is a repository of data stored in its raw format.
It may consist of structured data (relational databases), semi-structured data (XML, JSON) or unstructured data (text, images).
Data Warehousing
Data warehousing is the process of collecting, storing, and organizing large volumes of structured data from various sources, such as an OLTP (Online Transaction Processing) system, within a centralized repository. These sources are called .
Data warehousing is oriented toward batch processing, historical or long-term data, and analytical purposes, in contrast to OLTP, which is oriented toward real-time processing, transactional data, and operational purposes.
Data warehousing provides the infrastructure for large volumes of data, and it is closely related to Extract, Transform and Load (ETL) processes and data consolidation.
A staging area is a storage area where only the OLTP data that is relevant to the data warehouse is gathered, prior to any transformation.
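As a rough illustration of the flow described above, the following sketch uses Python's built-in sqlite3 module with a single in-memory database. The table names (oltp_orders, staging_orders, dw_daily_sales), the columns, and the sample rows are hypothetical; a real pipeline would read from a separate OLTP system and run as a scheduled batch.

```python
import sqlite3

# One in-memory database stands in for both the OLTP source and the warehouse
# (an assumption made only to keep the sketch self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oltp_orders (id INTEGER PRIMARY KEY, customer TEXT,
                              order_date TEXT, amount_cents INTEGER, status TEXT);
    CREATE TABLE staging_orders (id INTEGER, order_date TEXT, amount_cents INTEGER);
    CREATE TABLE dw_daily_sales (order_date TEXT PRIMARY KEY, total_cents INTEGER);
""")
conn.executemany(
    "INSERT INTO oltp_orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", "2024-01-01", 1000, "paid"),
        (2, "bob", "2024-01-01", 2500, "paid"),
        (3, "carol", "2024-01-02", 4000, "cancelled"),
        (4, "dave", "2024-01-02", 1500, "paid"),
    ],
)

# Extract: copy only the relevant rows and columns into the staging area,
# without transforming them yet.
conn.execute("""
    INSERT INTO staging_orders (id, order_date, amount_cents)
    SELECT id, order_date, amount_cents FROM oltp_orders WHERE status = 'paid'
""")

# Transform and load: aggregate the staged rows per day into the warehouse table.
conn.execute("""
    INSERT INTO dw_daily_sales (order_date, total_cents)
    SELECT order_date, SUM(amount_cents) FROM staging_orders GROUP BY order_date
""")

staged_count = conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
daily_sales = conn.execute(
    "SELECT order_date, total_cents FROM dw_daily_sales ORDER BY order_date"
).fetchall()
print(staged_count, daily_sales)
```

The cancelled order never reaches the staging table, so the warehouse only sees the three paid orders summed per day.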
An operational data store (ODS) is an interim logical area for a data warehouse.
A data mart is an aggregation of relevant and transformed data regarding a department or section within an organization. The aggregation of data marts forms a single data warehouse.
I have not identified any standard related to data warehousing, but an author belonging to Astic considers Spain’s quality management standard UNE 66175:2003 to be relevant.
W. H. Inmon (1945-) is considered by many the father of data warehousing.
OLAP
Online Analytical Processing (OLAP) is a technology in which data is structured in multidimensional cubes or hypercubes to provide quick access to summarized data.
OLAP technology is frequently built on top of data warehouses and enables data exploitation.
OLAP provides the analytical layer for data warehousing.
OLAP technologies:
- Relational OLAP (ROLAP)
- Multidimensional OLAP (MOLAP)
- Hybrid OLAP (HOLAP)
In Relational OLAP (ROLAP), aggregations are calculated at query time.
In Multidimensional OLAP (MOLAP), aggregations are precalculated before the query.
Hybrid OLAP (HOLAP) combines the strengths of ROLAP and MOLAP.
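The difference between calculating after the query and precalculating can be sketched in plain Python. This is only an analogy, not how any OLAP engine is actually implemented; the sample rows and the names rolap_total and build_cube are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sales rows: (region, year, amount).
rows = [
    ("north", 2023, 100),
    ("north", 2024, 150),
    ("south", 2023, 80),
    ("south", 2024, 120),
    ("north", 2024, 50),
]

def rolap_total(rows, region, year):
    """ROLAP-style: scan the detail rows and aggregate when the query arrives."""
    return sum(a for r, y, a in rows if r == region and y == year)

def build_cube(rows):
    """MOLAP-style: aggregate every (region, year) cell once, before any query."""
    cube = defaultdict(int)
    for region, year, amount in rows:
        cube[(region, year)] += amount
    return dict(cube)

cube = build_cube(rows)

# Both approaches return the same figure; they differ in when the work is done.
print(rolap_total(rows, "north", 2024))  # → 200
print(cube[("north", 2024)])             # → 200
```

The precomputed cube answers with a dictionary lookup but has to be rebuilt when the detail rows change, while the query-time version always reflects the current rows at the cost of a scan.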
An OLAP model consists of:
- Dimension table
- Fact table
- Indicator
A dimension table contains descriptive values.
A fact table contains numeric values.
An indicator is an aggregation of a certain fact based on given dimensions.
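The three elements above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table names (dim_product, fact_sales), the columns, and the figures are assumptions chosen for the example, not part of any standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive values.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Fact table: numeric values plus a key pointing at the dimension.
    CREATE TABLE fact_sales (product_id INTEGER REFERENCES dim_product(product_id),
                             units INTEGER, revenue_cents INTEGER);
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'),
                                   (2, 'Mouse', 'Hardware'),
                                   (3, 'Antivirus', 'Software');
    INSERT INTO fact_sales VALUES (1, 2, 200000), (2, 10, 15000),
                                  (3, 5, 25000), (1, 1, 100000);
""")

# Indicator: total revenue (a fact) aggregated by product category (a dimension).
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.revenue_cents)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(revenue_by_category)
```

Changing the GROUP BY column to d.name would give the same fact aggregated along a different dimension attribute, which is a different indicator.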
OLAP architecture schemas:
- Star
- Snowflake
- Galaxy
A star schema has a central fact table connected to the dimension tables, resembling a star. Dimension tables are often denormalized, so there is data redundancy.
A snowflake schema also has a central fact table, but its dimension tables are in turn connected to other dimension tables, resembling a snowflake.
It is usually slower than a star schema because it requires more joins, but on the other hand there is less redundancy, which reduces size and improves scalability.
A galaxy (or constellation) schema has two or more fact tables that share dimension tables. It may have optional hierarchies in the fact tables.
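To make the trade-off between star and snowflake concrete, the sketch below models the same product dimension both ways with Python's built-in sqlite3 module. All table and column names (star_product, snow_product, snow_category, and so on) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);
    INSERT INTO fact_sales VALUES (1, 100), (2, 40), (3, 60);

    -- Star: one denormalized dimension; the category name repeats on every row.
    CREATE TABLE star_product (product_id INTEGER PRIMARY KEY, name TEXT,
                               category_name TEXT);
    INSERT INTO star_product VALUES (1, 'Laptop', 'Hardware'),
                                    (2, 'Mouse', 'Hardware'),
                                    (3, 'Antivirus', 'Software');

    -- Snowflake: the category is normalized into its own table.
    CREATE TABLE snow_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE snow_product (product_id INTEGER PRIMARY KEY, name TEXT,
                               category_id INTEGER);
    INSERT INTO snow_category VALUES (10, 'Hardware'), (20, 'Software');
    INSERT INTO snow_product VALUES (1, 'Laptop', 10), (2, 'Mouse', 10),
                                    (3, 'Antivirus', 20);
""")

# Star: a single join from the fact table to the dimension.
star_result = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN star_product p ON p.product_id = f.product_id
    GROUP BY p.category_name ORDER BY p.category_name
""").fetchall()

# Snowflake: the same question needs one more join.
snowflake_result = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN snow_product p ON p.product_id = f.product_id
    JOIN snow_category c ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()

print(star_result)
print(star_result == snowflake_result)
```

Both queries return the same totals; the snowflake version stores each category name once but pays for it with the additional join.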
Query manager operations:
- Drill-down
- Drill-up
- Drill-across
- Roll-across
- Pivot
- Page
- Drill-through
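Three of these operations (drill-up, drill-down and pivot) can be sketched in plain Python as different groupings of the same facts. The sample data and the helper name aggregate are invented for illustration, and the sketch does not attempt to cover the remaining operations.

```python
from collections import defaultdict

# Hypothetical facts: (year, quarter, region, amount).
facts = [
    (2024, "Q1", "north", 10),
    (2024, "Q1", "south", 20),
    (2024, "Q2", "north", 30),
    (2024, "Q2", "south", 40),
]

def aggregate(facts, key):
    """Sum the amount for each group produced by the key function."""
    totals = defaultdict(int)
    for row in facts:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-up: summarize at the coarser level of the time hierarchy (year only).
by_year = aggregate(facts, lambda r: r[0])

# Drill-down: move to a finer level of the same hierarchy (year, then quarter).
by_quarter = aggregate(facts, lambda r: (r[0], r[1]))

# Pivot: same cells, with the axes swapped (region first, quarter second).
by_quarter_region = aggregate(facts, lambda r: (r[1], r[2]))
pivoted = {(region, quarter): total
           for (quarter, region), total in by_quarter_region.items()}

print(by_year)
print(by_quarter)
print(pivoted)
```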
Big Data
Big data deals with managing high volumes of data. You can read this post about big data.
Data Analytics
Data analytics includes:
- Business Intelligence
- Data mining
Other fields:
- Diagnostic analytics
- Predictive analytics
- Prescriptive analytics
Modern Data Analytics makes extensive use of Artificial Intelligence technologies, including machine learning (ML).
Business Intelligence
Business Intelligence (BI) is considered part of data analytics.
You can read this post about business intelligence.
Some tools:
- Reports and queries
- Online Analytical Processing (OLAP)
- Dashboards
- Executive Information Systems (EIS)
Data Mining
Data mining aims to find patterns within large volumes of data. It can be considered a field of statistics and information systems.
Some of these patterns are:
- Data groups (cluster analysis)
- Unusual records (anomaly detection)
- Dependencies (association rule mining)
It combines statistics, artificial intelligence, machine learning and database management systems.
A related term is Knowledge Discovery in Databases (KDD).
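Two of the patterns listed above can be illustrated with very small, standard-library-only sketches: support and confidence for association rules, and a z-score check for anomalies. The transactions, the readings, the threshold of 2.0 and the function names are assumptions made for the example; real data mining tools use far more elaborate algorithms.

```python
from statistics import mean, pstdev

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def z_score_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical shopping baskets for association rule mining.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})

# Hypothetical sensor readings for anomaly detection.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = z_score_outliers(readings)

print(rule_support, rule_confidence, outliers)
```

Here the rule "bread implies butter" holds in two of the three baskets that contain bread, and only the reading of 50 is far enough from the mean to be flagged.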
Data Migration
Data migration modes:
- Refronting
- Replacing
- Rehosting
- Rearchitecting
- Retirement
These categories were found on this external link and were mentioned in Spain’s GSI A1 2024 exam.
External References
- Data warehouse schemas
- Stefano Meloccaro; “Data Warehouse Schemas Explained”; Medium