This post explains data serialization formats, which are languages that describe the data itself rather than how text is formatted.
Do not confuse them with document markup languages, which focus on text formatting.
List of Data Serialization Languages
Data serialization languages can be classified as:
- Binary
- Text-based
Binary formats are used for communication between machines.
Text-based formats aim to stay human-readable, although they are also used for communication.
Binary Data Serialization Languages
Binary data serialization languages featured in this post:
- ASN.1
- MessagePack
- Protobuf
- FlatBuffers
- Apache Thrift
- Apache Avro
- Cap’n Proto
- XOP
ASN.1
ASN.1 is a framework rather than just a data serialization format.
You can read this post about ASN.1.
MessagePack
It is more compact than JSON, but less compact than Protobuf, FlatBuffers and Cap’n Proto.
It doesn’t require a schema. It works like a binary form of JSON.
It is community-driven.
It is used when the simplicity of JSON is sought together with better performance.
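To make the "binary JSON" idea concrete, here is a minimal sketch that hand-encodes a tiny subset of the MessagePack format (small non-negative integers, short strings and small maps) and compares the size with JSON. The `pack` helper is hypothetical and written only for illustration; it is not the API of any MessagePack library, and real code would use one of those libraries instead.

```python
import json


def pack(value):
    """Encode a tiny subset of MessagePack (illustrative helper, not a library API)."""
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 0x7F:
        return bytes([value])                      # positive fixint: the value itself
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: 0xA0 + length, then bytes
    if isinstance(value, dict) and len(value) < 16:
        out = bytes([0x80 | len(value)])           # fixmap: 0x80 + number of pairs
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    raise ValueError(f"unsupported by this sketch: {value!r}")


record = {"id": 7, "name": "Ada"}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(record)

print(len(as_json), len(as_msgpack))  # 21 14
```

No schema is involved: the type of each value is carried by a prefix byte, which is also why the result is larger than what schema-based formats can produce.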
Protobuf
Protocol Buffers (Protobuf) is developed by the American company Google. It is a general-purpose serialization language.
Data is defined in .proto schema files.
The serialized messages are compact compared to text-based serialization formats.
The plain-text schema must be compiled with the protoc compiler.
It is used in gRPC and for data storage.
It requires parsing/deserialization.
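As a rough illustration of why Protobuf messages are compact and why they have to be parsed before use, the sketch below hand-encodes a single integer field using the varint wire encoding. The helper names are made up for this example, only non-negative integers are handled, and in practice the code generated by protoc does this work.

```python
def encode_varint(n):
    """Base-128 varint: 7 bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)


def encode_int_field(field_number, value):
    """Encode one field as tag + value; wire type 0 means varint."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


def decode_int_field(data):
    """Parse one varint field and return (field_number, value)."""
    def read_varint(pos):
        result = shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, pos
            shift += 7

    tag, pos = read_varint(0)
    value, _ = read_varint(pos)
    return tag >> 3, value


message = encode_int_field(1, 150)
print(message.hex())              # 089601
print(decode_int_field(message))  # (1, 150)
```

The field name never appears in the message, only its number from the schema, and the value cannot be read without walking the bytes.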
FlatBuffers
FlatBuffers is developed by the American company Google. It is aimed at high-performance applications.
Data is defined in .fbs schema files.
The plain-text schema must be compiled with the flatc compiler.
Data is serialized into a binary format that can be accessed directly (zero-copy).
It doesn’t require parsing/deserialization.
It is used in video games, real-time and memory-mapped applications.
It is NOT designed as a full RPC system.
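The zero-copy idea mentioned above can be sketched with the standard library alone. The layout below is a hypothetical fixed record invented for this example, NOT the actual FlatBuffers wire format; it only shows how one field can be read straight from the buffer at a known offset without decoding the whole message.

```python
import struct

# Hypothetical layout (an assumption for this sketch, not FlatBuffers itself):
#   offset 0: uint32 id, offset 4: float32 x, offset 8: float32 y
LAYOUT = struct.Struct("<Iff")

buffer = LAYOUT.pack(42, 1.5, -2.0)
view = memoryview(buffer)  # a view over the bytes, no copy is made


def read_y(buf):
    """Read only the 'y' field directly at its offset."""
    return struct.unpack_from("<f", buf, 8)[0]


print(read_y(view))  # -2.0
```

The same approach works on a memory-mapped file, which is one reason this family of formats suits memory-mapped applications.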
Cap’n Proto
Cap’n Proto is developed by Kenton Varda, an ex-Google developer who worked on Protobuf.
It requires a schema definition.
Data is serialized into a binary format that can be accessed directly (zero-copy).
It doesn’t require parsing/deserialization.
It is designed as a full RPC system.
Apache Thrift
Apache Thrift was originally developed by Facebook and then donated to the Apache Software Foundation.
Apache Avro
Apache Avro is part of the Hadoop ecosystem.
XOP
XML-binary Optimized Packaging (XOP) uses XOP packages.
Text-based Data Serialization Languages
Text-based data serialization languages featured in this post:
- XML
- JSON
- YAML
- TOML
- SGML
The most popular are XML and JSON.
XML
Extensible Markup Language (XML)
XML can be defined based on:
- Document type definition (DTD)
- XML Schema
Document Type Definition (DTD) has a limited set of data types and does not allow new types to be created. DTD is not extensible.
XML Schema is newer than DTD. It is strongly typed and written in XML syntax.
The definition file is optional. Without one, an XML document with correct syntax is just well-formed. When it additionally conforms to a definition, it is valid.
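The well-formed half of that distinction can be checked with Python's standard library, as in the minimal sketch below. The `is_well_formed` helper is a name chosen for this example. Note that `xml.etree.ElementTree` only checks syntax; it does not validate against a DTD or an XML Schema, so checking validity would need a separate tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; no DTD or schema validation is done."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<book><title>Dune</title></book>"))  # True
print(is_well_formed("<book><title>Dune</book>"))          # False: <title> is never closed
```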
JSON
JavaScript Object Notation (JSON) removes some of the redundancy of XML.
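A small sketch of that redundancy, using only the standard library: the same record is written once as XML and once as JSON. The record and the element names are made up for the example.

```python
import json
import xml.etree.ElementTree as ET

record = {"name": "Ada", "language": "Python"}

# XML: every field name appears twice, in the opening and the closing tag.
root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

# JSON: every field name appears once.
as_json = json.dumps(record)

print(as_xml)   # <person><name>Ada</name><language>Python</language></person>
print(as_json)  # {"name": "Ada", "language": "Python"}
```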
YAML
YAML Ain’t Markup Language (YAML) is more human-readable than JSON because it uses indentation and line breaks instead of square brackets and braces. On the other hand, it is slower to parse and not as universal or popular as JSON.
A popular Python library for YAML is PyYAML (installed as pyyaml).
TOML
Tom’s Obvious, Minimal Language (TOML) is oriented toward configuration files.
SGML
Standard Generalized Markup Language (SGML).
HTML is based on SGML.
XPDL
XPDL is a serialization language for BPMN diagrams. It is defined by the Workflow Management Coalition (WfMC).