Data Serialization Language

This post explains data serialization formats: markup languages that focus on the data itself rather than on text formatting.

Do not confuse them with document markup languages, which focus on text formatting.

List of Data Serialization Languages

A data serialization language may include:

  • Interface definition language
  • Serialization format

A schema or interface definition language (IDL) describes structured data (messages, fields, types, etc.).

A serialization format converts in-memory data structures into a serialized representation (and back).
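
As a minimal sketch of that round trip, the snippet below uses Python's standard-library json module as the serialization format. The `reading` record and its fields are made up for illustration.

```python
import json

# Hypothetical in-memory data structure (illustrative only).
reading = {"sensor": "t1", "celsius": 21.5, "tags": ["lab", "indoor"]}

# Serialize: in-memory structure -> text representation.
encoded = json.dumps(reading)

# Deserialize: text representation -> in-memory structure.
decoded = json.loads(encoded)

print(encoded)
print(decoded == reading)
```

The same two steps (encode, then decode) apply to every format covered in this post; only the representation in the middle changes.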

Data serialization languages can be classified as:

  • Binary
  • Text-based

Binary Data Serialization Languages

A binary data serialization language is used for communication between machines.

Binary data serialization languages featured in this post:

  • ASN.1
  • MessagePack
  • Protobuf
  • FlatBuffers
  • Apache Thrift
  • Apache Avro
  • Cap’n Proto
  • XOP

ASN.1

ASN.1 is a framework rather than just a data serialization format.

You can read this post about ASN.1.

MessagePack

It is more compact than JSON, but less compact than Protobuf, FlatBuffers and Cap’n Proto.

It doesn’t require a schema. It works similarly to a binary JSON.

It is community-driven.

It is used when the simplicity of JSON is sought together with better performance.
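
To give a rough idea of why it behaves like a "binary JSON", here is a hand-written sketch that encodes a small subset of the MessagePack format (nil, booleans, integers 0 to 127, short strings, short lists and maps). It is not the official msgpack library; the `pack` function and the sample `record` are illustrative assumptions.

```python
import json

def pack(obj):
    """Encode a tiny subset of MessagePack: None, bool, ints 0..127,
    str up to 31 UTF-8 bytes, and lists/dicts with up to 15 entries."""
    if obj is None:
        return b"\xc0"                              # nil
    if obj is True:
        return b"\xc3"                              # true
    if obj is False:
        return b"\xc2"                              # false
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                         # positive fixint
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw   # fixstr
    if isinstance(obj, list) and len(obj) <= 15:
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])              # fixmap
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise ValueError(f"unsupported by this sketch: {obj!r}")

# Hypothetical record, encoded without any schema.
record = {"id": 7, "ok": True, "tags": ["a", "b"]}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(len(packed), len(as_json))  # 19 35
```

Keys and types travel inside the message, as in JSON, which is why no schema is needed; the saving comes from one-byte type/length prefixes instead of quotes, colons and brackets.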

Protobuf

Protocol Buffers (protobuf) is developed by the American company Google. It is a general-purpose serialization language.

Data is defined in .proto schema files.

The serialized messages are compact compared to other serialization methods.

It requires compiling the plain-text schema, using the protoc compiler.

It is used in gRPC and for storage.

It requires parsing/deserialization.
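
The compactness and the parsing step can be illustrated with a hand-rolled sketch of how Protobuf encodes a single integer field on the wire (a tag followed by a base-128 varint). This is not the code that protoc generates; the helper names are assumptions made for this example.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, pos=0):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_int_field(field_number, value):
    """Encode one integer field: tag (field number, wire type 0), then value."""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

message = encode_int_field(1, 150)
print(message.hex())  # 089601

# Reading the value back requires walking the bytes (parsing).
tag, pos = decode_varint(message)
value, _ = decode_varint(message, pos)
print(tag >> 3, value)  # 1 150
```

Field names never appear in the message, only field numbers, which is one reason both sides need the .proto schema.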

FlatBuffers

FlatBuffers is developed by the American company Google. It is aimed at high-performance applications.

Data is defined in .fbs schema files.

It requires compiling the plain-text schema, using the flatc compiler.

Data is serialized into a binary format that can be accessed directly (zero-copy).

It doesn’t require parsing/deserialization.
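
The zero-copy idea can be sketched in a few lines of Python: a field is read straight out of the buffer at a known offset, without first decoding the whole message into objects. The fixed layout below is a made-up assumption for illustration and is not the real FlatBuffers wire format.

```python
import struct

# Hypothetical fixed layout (NOT the real FlatBuffers format):
# offset 0: uint32 id, offset 4: float32 x, offset 8: float32 y
LAYOUT = struct.Struct("<Iff")

buffer = bytearray(LAYOUT.size)
LAYOUT.pack_into(buffer, 0, 42, 1.5, -2.0)

# A memoryview references the same bytes; nothing is copied.
view = memoryview(buffer)

def read_y(buf):
    """Read only the y field directly from the buffer at its known offset."""
    return struct.unpack_from("<f", buf, 8)[0]

print(read_y(view))  # -2.0
```

Because only the requested field is touched, the same approach works on a memory-mapped file or a network buffer without a separate deserialization pass.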

It is used in video games, real-time and memory-mapped applications.

It is NOT designed as a full RPC system.

Cap’n Proto

Cap’n Proto is developed by Kenton Varda, a former Google developer who worked on Protobuf.

It requires a schema definition.

Data is serialized into a binary format that can be accessed directly (zero-copy).

It doesn’t require parsing/deserialization.

It is designed as a full RPC system.

Apache Thrift

Apache Thrift was originally developed by Facebook and then donated to the Apache Software Foundation.

Apache Avro

Apache Avro is part of the Hadoop ecosystem.

XOP

XML-binary Optimized Packaging (XOP) uses XOP packages.

Text-based Data Serialization Languages

Text-based data serialization languages aim to stay human-readable, though they are still used for communication between machines.

Data serialization formats featured in this post:

  • XML
  • JSON
  • YAML
  • TOML
  • SGML

The most popular are XML and JSON.

XML

You can read this post about XML.

JSON

JavaScript Object Notation (JSON) removes some of the redundancy of XML.
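
A small comparison makes the redundancy visible. The sketch below parses the same made-up record from XML and from JSON using only the Python standard library; the `user` record is an illustrative assumption.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats (illustrative data).
xml_text = "<user><name>Ada</name><age>36</age></user>"
json_text = '{"name":"Ada","age":36}'

root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}
from_json = json.loads(json_text)

print(from_xml)    # XML element text is always a string
print(from_json)   # JSON keeps 36 as a number
print(len(xml_text), len(json_text))  # 42 23
```

XML repeats each name in an opening and a closing tag, while JSON states each key once and also distinguishes numbers from strings.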

YAML

YAML Ain’t Markup Language (YAML) is more human-readable than JSON because it uses indentation and line breaks instead of square brackets and curly braces. On the other hand, it is slower to parse and not as universal or popular as JSON.

PyYAML is a popular Python library for YAML; it is installed as pyyaml and imported as yaml.

TOML

Tom’s Obvious, Minimal Language (TOML) is oriented toward configuration files.

TOML code repository

SGML

Standard Generalized Markup Language (SGML).

HTML is based on SGML.

XPDL

XPDL is a serialization language for BPMN diagrams. It is defined by the Workflow Management Coalition (WfMC).

Data storage formats

Binary storage formats

Binary storage formats are file formats to store binary data.

Binary storage formats featured in this post:

  • HDF5
  • Feather
  • ORC
  • Apache Parquet

HDF5 is used in the scientific domain.

Feather is column-oriented and uses Apache Arrow.

ORC is used in the big data domain.

Apache Parquet uses Apache Thrift as its internal binary data serialization language.
