This post explains data serialization formats, which are markup languages that focus on the data itself rather than on text formatting.
Do not confuse them with document markup languages, which focus on text formatting.
List of Data Serialization Languages
A data serialization language may include:
- Interface definition language
- Serialization format
A schema or interface definition language (IDL) describes structured data (messages, fields, types, etc.).
A serialization format converts in-memory data structures into a serialized representation (and back).
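As a minimal sketch of that round trip, the snippet below uses Python's standard json module. The record and its field names are made up for illustration only.

```python
import json

# Hypothetical in-memory record, used only for illustration.
record = {"id": 7, "name": "sensor-a", "readings": [1.5, 2.0]}

# Serialize: in-memory structure -> text representation.
serialized = json.dumps(record)

# Deserialize: text representation -> in-memory structure.
restored = json.loads(serialized)

print(serialized)
print(restored == record)
```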
Data serialization languages can be classified as:
- Binary
- Text-based
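To make the two categories concrete, here is a small sketch that writes the same hypothetical sensor reading once as text (JSON) and once as a fixed binary layout built with Python's struct module. The field names and the layout are assumptions for illustration, not part of any of the formats listed in this post.

```python
import json
import struct

# Hypothetical reading: a 32-bit sensor id and a 64-bit float temperature.
sensor_id, temperature = 1234, 21.5

# Text-based: human-readable, field names travel with the data.
as_text = json.dumps({"sensor_id": sensor_id, "temperature": temperature}).encode("utf-8")

# Binary: fixed layout agreed on in advance
# ("<" little-endian without padding, "I" uint32, "d" float64).
as_binary = struct.pack("<Id", sensor_id, temperature)

print(as_text, len(as_text))
print(as_binary.hex(), len(as_binary))
```

The binary form is shorter, but it cannot be read without knowing the layout, which is the role a schema plays in several of the formats below.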
Binary Data Serialization Languages
A binary data serialization language is used for communication between machines.
Binary data serialization languages featured in this post:
- ASN.1
- MessagePack
- Protobuf
- FlatBuffers
- Apache Thrift
- Apache Avro
- Cap’n Proto
- XOP
ASN.1
ASN.1 is a framework rather than just a data serialization format.
You can read this post about ASN.1.
MessagePack
MessagePack is more compact than JSON, but less compact than Protobuf, FlatBuffers and Cap’n Proto.
It doesn’t require a schema. It works similarly to a binary JSON.
It is community-driven.
It is used when the simplicity of JSON is sought together with better performance.
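The "binary JSON" idea can be sketched with a hand-rolled encoder for a tiny subset of MessagePack (booleans, small non-negative integers, short strings and small maps). This is an illustration only, not the official msgpack library; the function name pack is hypothetical.

```python
import json

def pack(value):
    """Encode a tiny subset of MessagePack (illustrative sketch only)."""
    if value is True:
        return b"\xc3"                             # true
    if value is False:
        return b"\xc2"                             # false
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                      # positive fixint
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])           # fixmap
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    raise ValueError(f"unsupported in this sketch: {value!r}")

doc = {"compact": True, "schema": 0}
packed = pack(doc)
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")

print(packed.hex())
print(len(packed), len(as_json))
```

No schema is involved: the type of each value is carried by a prefix byte, just as JSON carries it in its syntax.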
Protobuf
Protocol Buffers (protobuf) is developed by American company Google. It is a general-purpose serialization language.
Data is defined in .proto schema files.
The files are compact compared to other serialization methods.
It requires compiling the plain-text schema with the protoc compiler.
It is used in gRPC and for storage.
It requires parsing/deserialization.
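Part of the compactness comes from how integers are written on the wire. The sketch below hand-encodes a single integer field using the varint scheme of the Protobuf wire format. It is an illustration that handles non-negative integers only, not the code that protoc generates; the function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)

def encode_varint_field(field_number: int, value: int) -> bytes:
    """Tag (field number and wire type 0 = varint) followed by the value."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)

# Field number 1 holding the value 150 takes three bytes.
encoded = encode_varint_field(1, 150)
print(encoded.hex())
```

The field name never appears in the output, only its number, which is why the reader needs the .proto schema to interpret the bytes.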
FlatBuffers
FlatBuffers is developed by American company Google. It is aimed at high-performance applications.
Data is defined in .fbs schema files.
It requires compiling the plain-text schema with the flatc compiler.
Data is serialized into a binary format that can be accessed directly (zero-copy).
It doesn’t require parsing/deserialization.
It is used in video games, real-time and memory-mapped applications.
It is NOT designed as a full RPC system.
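The zero-copy idea mentioned above can be illustrated with plain Python: a field is read directly from a known offset in the buffer, without decoding the rest. The layout below is a made-up fixed layout, not the real FlatBuffers format (which locates fields through offset tables); it only shows the access pattern.

```python
import struct

# Hypothetical fixed layout, NOT the real FlatBuffers wire format:
# offset 0: uint32 id, offset 4: float32 x, offset 8: float32 y
buffer = struct.pack("<Iff", 42, 1.5, -2.0)

# A memoryview references the underlying bytes without copying them.
view = memoryview(buffer)

def read_y(buf) -> float:
    # Read one field straight from its offset; the other fields are never touched.
    return struct.unpack_from("<f", buf, 8)[0]

print(read_y(view))
```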
Cap’n Proto
Cap’n Proto is developed by Kenton Varda, a former Google developer who worked on Protobuf.
It requires a schema definition.
Data is serialized into a binary format that can be accessed directly (zero-copy).
It doesn’t require parsing/deserialization.
It is designed as a full RPC system.
Apache Thrift
Apache Thrift was originally developed by Facebook and then donated to the Apache Software Foundation.
Apache Avro
Apache Avro is part of the Hadoop ecosystem.
XOP
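XML-binary Optimized Packaging (XOP) is a W3C recommendation that carries the binary content of an XML document separately from the XML itself, inside a so-called XOP package.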
Text-based Data Serialization Languages
Text-based data serialization languages aim to stay human-readable, though they are still used for communication between machines.
Text-based data serialization languages featured in this post:
- XML
- JSON
- YAML
- TOML
- SGML
The most popular are XML and JSON.
XML
You can read this post about XML.
JSON
JavaScript Object Notation (JSON) removes some of the redundancy added by XML.
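The redundancy in question is mostly the closing tags. A quick sketch with Python's standard library serializes the same hypothetical record both ways; the record and the element names are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, used only for illustration.
record = {"name": "Ada", "language": "Python"}

# JSON: each key appears once.
as_json = json.dumps(record)

# XML: each key appears twice, as an opening and a closing tag.
root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json, len(as_json))
print(as_xml, len(as_xml))
```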
YAML
YAML Ain’t Markup Language (YAML) is more human-readable than JSON because it uses indentation and line breaks instead of square brackets and curly braces. On the other hand, it is slower to parse and not as universal or popular as JSON.
A popular Python library for YAML is PyYAML (installed as pyyaml and imported as yaml).
TOML
Tom’s Obvious, Minimal Language (TOML) is aimed at configuration files.
SGML
Standard Generalized Markup Language (SGML) is the markup language on which HTML is based.
XPDL
XPDL is a serialization language for BPMN diagrams. It is defined by the Workflow Management Coalition (WfMC).
Data storage formats
Binary storage formats
Binary storage formats are file formats to store binary data.
Binary storage formats featured in this post:
- HDF5
- Feather
- ORC
- Apache Parquet
HDF5 is used in the scientific domain.
Feather is column-oriented and uses Apache Arrow.
ORC is used in the big data domain.
Apache Parquet uses Apache Thrift to serialize its internal file metadata.
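"Column-oriented", as mentioned for Feather above, means that all values of one column are stored together instead of row by row. The sketch below shows the idea with plain Python lists; it does not use Arrow or any of the formats above, and the data is made up.

```python
# Row-oriented: one record after another (hypothetical data).
rows = [
    {"city": "Oslo", "temp": 3.5},
    {"city": "Lima", "temp": 19.0},
    {"city": "Cairo", "temp": 28.5},
]

# Column-oriented: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one column only needs to read that column.
mean_temp = sum(columns["temp"]) / len(columns["temp"])

print(columns)
print(mean_temp)
```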