Serialization
Serialization is primarily provided through the rbuf interface, which is a custom IDL format. A typical rbuf file looks something like this:
namespace ark::example;

schema MyMessage
{
    uint32 identifier;
    string payload;
}
The IDL format supports a handful of features:
- primitives (see below for a full list)
- define “schemas” (objects)
- define enumerations (with varying sizes, such as 8-bit or 32-bit)
- reference other schemas or enums
- specify a namespace
- create dynamic or fixed-size lists of objects
- create dictionaries of objects
- create variants of schemas
- optional fields
- include other rbuf files
- define global constant strings/integers
- define constant primitive types within “schemas”
- use forwards/backwards compatibility with “groups” and field retiring
Additional functionality will be added as-needed. The rbuf compiler will generate C++, JavaScript, or Python code that includes both the structure definition and the code to serialize/deserialize the structure.
In addition to basic serialization, Ark also uses rbuf internally as the backend for ROS serialization, along with the ability to deserialize YAML-based configuration into rbuf-generated structures.
cmake
You can specify one or more rbuf files to be compiled into a messages library with this cmake definition:
add_rbuf(my_message_lib my_message1.rbuf my_message2.rbuf)
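Consuming the generated library is then just a normal cmake link. A minimal sketch, assuming add_rbuf creates a library target named after its first argument (the my_app target and its source file are hypothetical):

# Hypothetical consumer target linking against the generated messages library.
add_executable(my_app main.cc)
target_link_libraries(my_app PRIVATE my_message_lib)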
Supported Field Types
The normal primitives are all supported (uint8, uint16, uint32, uint64, int8, int16, int32, int64, float, double, and bool). Additionally, a few other types are supported:
- string - A std::string equivalent.
- guid - A core::Guid equivalent.
- duration - A std::chrono::nanoseconds equivalent.
- steady_time_point - A std::chrono::steady_clock::now equivalent.
- system_time_point - A std::chrono::system_clock::now equivalent.
- byte_buffer - A core::ByteBuffer equivalent.
- arraylist<T> - A std::vector<T> equivalent.
- dictionary<K, V> - A std::map<K, V> equivalent.
- array<T, S> - A fixed-size array (std::array<T, S>) equivalent.
- variant<T1, T2, ... Tn> - One of type T (a std::variant equivalent).
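As a purely illustrative sketch, a hypothetical schema exercising several of these types might look like:

schema TypeShowcase
{
    guid id;
    duration elapsed;
    system_time_point captured_at;
    byte_buffer raw_data;
    arraylist<uint32> readings;
    dictionary<string, double> named_values;
    array<float, 3> position;
}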
Constant Fields
You can also include constant primitives (including string) in a schema.
schema MyObject
{
    const string NAME = "name";
    const float VALUE = 4.2;
    string str;
}
These values are stored in the schema and not serialized into individual messages.
NOTE: Floating point values must include a number before and after the decimal.
Enumerations
You can also create enumerations, and make use of those in your schema definitions. For example:
enum MyEnum
{
    Empty = 0;
    GoodValue = 1;
    BadValue = 2;
}

schema MyObject
{
    MyEnum value;
}
Enums resolve to 32-bit integers, and use enum classes in C++.
You can also specify the underlying storage type (just like C++ enum classes):
enum MyEnum : int16
{
    Negative = -100;
    Positive = 100;
}
Class types supported are all signed and unsigned integers (8-bit through 64-bit).
When using reflection, you will need to use std::holds_alternative to discover the underlying type (or look at the schema).
Note that the class type (storage type) is not versioned (nor are values). To preserve binary compatibility, you cannot change the size or existing values after you have written data to disk.
You can fetch all potential values an enum can supply by using the ark::serialization::values() API. For example, you can use ark::serialization::values<MyEnum>() from the example above, and you will receive a std::vector<int16_t> which contains the values -100 and 100.
The ark::serialization::count() API is provided to return a count of the number of values contained in an enum. This is only the count of the enum as compiled by rbuf. Forwards/backwards compatible messages may have values outside this range, if enumerations were added/removed. In the example above, ark::serialization::count<MyEnum>() would yield the value 2.
The ark::serialization::max() API is provided to return the maximum value seen in an enumeration, as a convenience over defining it manually. This is only the maximum value of the enum as compiled by rbuf. Forwards/backwards compatible messages may have values outside this range, if enumerations were added/removed. In the example above, ark::serialization::max<MyEnum>() would yield the value 100.
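A short C++ sketch tying the three APIs together for the MyEnum example above (the generated header name is hypothetical):

#include "my_enum.rbuf.hh"  // hypothetical generated header for MyEnum

#include <vector>

// All enumerator values compiled into the rbuf: {-100, 100}.
std::vector<int16_t> values = ark::serialization::values<MyEnum>();

// Number of enumerators known at compile time: 2.
auto count = ark::serialization::count<MyEnum>();

// Largest enumerator value known at compile time: 100.
auto max_value = ark::serialization::max<MyEnum>();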
Bitmasks
You can mark an enumeration as a bitmask, which causes the code generators to emit some additional functionality for string conversion and the combination operators (|, &, |=, &=, and ~).
enum bitmask MyBitmask : uint16
{
    FirstSlot = 0x0001;
    SecondSlot = 0x0002;
    TopSlot = 0x8000;
}
You can use the combination operators to set or unset bits. Note that it is possible to get into a situation (particularly with ~) where some bits that are set are not actually defined in your bitmask.
You can then use code like this:
// Set bitmask equal to 0x03
MyBitmask bitmask = MyBitmask::FirstSlot | MyBitmask::SecondSlot;
// Remove the first bit, so it is now 0x02.
bitmask &= ~MyBitmask::FirstSlot;
// And with 0x8000, so it is now zero.
bitmask &= MyBitmask::TopSlot;
Using the to_string or from_string APIs would yield strings that contain all of the values, broken down, such as "FirstSlot, TopSlot", which would be converted back to a mask of MyBitmask::FirstSlot | MyBitmask::TopSlot.
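A sketch of that round trip; the exact spellings of the generated to_string/from_string helpers are assumptions:

MyBitmask mask = MyBitmask::FirstSlot | MyBitmask::TopSlot;

// Hypothetical generated helpers: yields "FirstSlot, TopSlot" and parses it back.
std::string text = to_string(mask);
MyBitmask parsed = from_string<MyBitmask>(text);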
Nested Objects
Schemas can reference each other. As an example:
schema FirstObject
{
    uint64 identifier;
}

schema SecondObject
{
    FirstObject primary;
    arraylist<FirstObject> additional;
}

schema ThirdObject
{
    dictionary<string, SecondObject> name_to_object_mapping;
}
This will get compiled into three separate structures, with the later objects referencing the earlier ones as you'd expect.
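A brief C++ sketch, assuming the generated structs expose the schema fields as same-named public members:

SecondObject second;
second.primary.identifier = 42;
second.additional.push_back(FirstObject{.identifier = 43});

ThirdObject third;
third.name_to_object_mapping["example"] = second;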
Variants
A variant allows you to specify a type that contains exactly one of a list of potential types. It's equivalent to the C++ std::variant class.
When specifying types for the variant, you must tag them with an integer ‘index’. This ‘index’ must be unique over the lifetime of the schema – it is intended to allow for backwards and forwards compatibility, so you can remove or add variant members at later times without breaking message compatibility.
For example:
schema ObjectA
{
    string content;
}

schema ObjectB
{
    uint64 content;
}

schema Message
{
    variant<ObjectA = 1, ObjectB = 2> object;
}
This creates a variant with two types, ObjectA and ObjectB. The former has a tag of 1, while the latter has a tag of 2.
This will yield a C++ type of std::variant<ObjectA, ObjectB>, and serialize only the object that is actually used. In this example, this allows you to store either a string or a uint64 depending on your needs.
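Since the generated member is a std::variant, the usual standard-library machinery applies. A sketch, assuming the field is exposed as a public member named object:

#include <iostream>
#include <variant>

Message message;
message.object = ObjectA{.content = "hello"};

// Check which alternative is held before reading it.
if (std::holds_alternative<ObjectA>(message.object))
{
    std::cout << std::get<ObjectA>(message.object).content << std::endl;
}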
You could then remove ObjectA and add an ObjectC like so:

schema Message
{
    variant<ObjectB = 2, ObjectC = 3> object;
}
Variants in rbuf can only contain “object” types – they cannot contain lists, maps, or primitives like bare strings, byte buffers, or integers. In those cases, you should wrap the type in an object and make use of that.
For readability, some people prefer to specify each subtype on its own line. This is also permitted:
schema Message
{
    variant<
        ObjectA = 1,
        ObjectB = 2,
        ObjectC = 3
    > object;
}
Normal whitespace rules apply within the <> subtype list.
Arrays
You can use either the arraylist or array keyword to generate an array in rbuf. The arraylist keyword creates a dynamically-sized array, whereas the array keyword creates a fixed-size array.
You should consider using fixed-size arrays only if you know the size of the array ahead of time, and you expect that size to never change. If you change the size of the array, it will break backwards compatibility. The primary benefit is that the on-wire size will be a bit smaller, as the size of the array does not need to be written out.
Examples:
schema MySchema
{
    arraylist<uint32> dynamic_primitive_array;
    arraylist<Object> dynamic_object_array;
    array<uint32, 8> fixed_primitive_array;
    array<Object, 3> fixed_object_array;
}
Including
The rbufc compiler will obey include search paths (-I on the command line). You can then use include statements in your code:
include "path/to/my.rbuf"
schema MySchema
{
    ObjectInMyRbuf object;
}
This assumes ObjectInMyRbuf is defined in the my.rbuf file. The generated code will handle emitting the proper C++ includes. Normal cmake rules apply: if the rbuf you are including is in a different library, you will need to define that in your CMakeLists.txt file explicitly or you may get non-deterministic build errors.
For example, if you have an rbuf in the rbuf library project::perception_messages that includes an rbuf from the rbuf library project::geometry_messages, you would add this to your CMakeLists.txt:
target_link_libraries(
    project_perception_messages
    PUBLIC project::geometry_messages)
Versioning
rbuf supports both forwards and backwards compatible serialization, with some caveats.
The language structures messages into “groups”, such that any group may or may not be present when deserializing the structure.
For example:
schema VersionedObject
{
    uint32 identifier;
    string text;

    group 0
    {
        arraylist<string> metadata;
    }
}
In this example, the fields identifier and text are present in every version of the object. The field metadata is only set if your software both knows about group #0 and if the incoming data has group #0 in it.
Older software (without knowledge of group #0) can still read the object, but the metadata field will not be available to them.
If a group is removed, older software will see default values for all of the fields contained within that group. The fields are default-initialized with C++ logic (containers and strings are empty, primitives are zero initialized, variants contain the first element).
You cannot change the fields that are present (either the "base" fields of a structure, or the fields that are in an established group). You may retire groups (just remove them from the schema) without issue.
You can use group identifiers 0 through 255.
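For example, a later revision of VersionedObject could add a new group while leaving the base fields and group 0 untouched (the group 1 field here is hypothetical):

schema VersionedObject
{
    uint32 identifier;
    string text;

    group 0
    {
        arraylist<string> metadata;
    }

    group 1
    {
        string comment;
    }
}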
If you wish to omit versioning information altogether, you can apply the [[final]] attribute to your schema. This will prevent you from using groups at all, and your schema must never change (or it will break backwards compatibility).
An example:
[[final]]
schema FinalObject
{
    array<float32, 3> position;
    array<float32, 4> colors;
}
The upside to this is that the on-wire representation is more succinct, which can make a big difference if a particular schema is never expected to change and is repeated in lists many times. The above example will only consume 28 bytes on the wire (whereas if you used dynamic lists and non-final, it would consume around 37 bytes).
Renaming
In binary serialization, you can rename your fields to anything you like without worrying about versioning. The only thing that is relevant is the field type and the ordering.
For JSON or YAML (config) serialization, you can rename fields, but any existing plain-text serialized data will drop data from the renamed field.
In other words, if you had:
config:
  default:
    my_field_value: 10
And renamed my_field_value to new_name, then new_name would be empty when deserializing the above YAML, rather than containing the value 10. You would need to manually rename the fields in all existing configuration structures.
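In this case, the existing YAML would need to be updated by hand to use the new field name:

config:
  default:
    new_name: 10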
Removing Fields
Right now, you can safely remove any group without hindering binary compatibility. The entire group must be removed.
If you wish to remove an individual field, you must leave it in your rbuf definition, but prefix it with the [[removed]] attribute, like so:
schema Message
{
    [[removed]] arraylist<double> my_old_array;
    arraylist<double> my_new_array;
}
This will prevent my_old_array from being a public member in your structure. Data will be skipped from that structure when deserializing, and when serializing, it will be as if the default value is serialized. This allows you to maintain forwards/backwards compatibility.
Optional Fields
Any field can be marked as 'optional' with the [[optional]] attribute. Note that changing the optional attribute will break forwards/backwards compatibility. Once something is marked as optional, it is always optional (or vice-versa).
For example:
schema Message
{
    [[optional]] string name;
    [[optional]] string address;
}
These two fields are entirely optional. In C++, they will be generated with the std::optional type wrapping the string (so, std::optional<std::string>). In JavaScript the fields are undefined if they are not present, and in Python they are set to None.
When emitted as JSON, the field is entirely missing if it is not set. Similarly, if it is missing from the JSON, the optional is not set in the parsing language.
You can use this to reduce storage costs for types if they are not present, as the type is not emitted at all if it is optional and not set.
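A short C++ sketch using the Message schema above, assuming the generated members carry the schema field names:

#include <iostream>

Message message;
message.name = "ark";  // address is left unset

// Only read the optional if it was actually set.
if (message.address.has_value())
{
    std::cout << *message.address << std::endl;
}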
Combining Attributes
You can combine attributes with a comma operator, like so: [[removed, optional]].
Constant Strings
It can be useful to define a constant string, so that you can reference things like channel names in a more type-safe way. For example:
const CAMERA_CHANNEL = "camera_image/raw";
This would generate C++ code that allows you to reference the variable CAMERA_CHANNEL as a name.
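A sketch of referencing it, assuming the constant is emitted into the file's namespace (ark::example here) as a string constant:

std::cout << "Opening channel " << ark::example::CAMERA_CHANNEL << std::endl;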
Constant Expressions
You can do very basic constant expressions when assigning constant values to either variables or enum values. For example:
const BASE_VALUE = 0x1000;
const FAULT_CODE_1 = BASE_VALUE | 0x0001;
const FAULT_CODE_2 = BASE_VALUE | 0x0002;
enum FaultCodes : int32
{
    Code1 = FAULT_CODE_1 + 0x100;
    Code2 = FAULT_CODE_2 + 0x200;
}
In this case, FAULT_CODE_1 would become 0x1001 and FaultCodes::Code1 would become 0x1101. Today, only single-operation expressions are supported. Expressions may reference other constants or enum values (from within the rbuf or an included rbuf). Standard namespace resolution rules apply.
Serializing/Deserializing
You can serialize and deserialize with one-liners if you like:
#include "ark/serialization/helpers.hh"
// Create a structure and serialize it.
MySerializableStruct structure;
auto bytes = ark::serialization::serialize(structure);
// Roundtrip it by deserializing into a different structure.
auto roundtrip = ark::serialization::deserialize<MySerializableStruct>(bytes);
At this point, structure and roundtrip should be identical.
Under the hood, these helpers are actually instantiating and managing input and output streams for you. You can manually manage these streams yourself, if that makes it easier:
// Using the "structure" object from the previous example.
ark::serialization::OutputStream output;
output.write(structure);

MySerializableStruct roundtrip;
ark::serialization::InputStream input(output);
input.read(roundtrip);
You can instantiate input streams against arrays of bytes, vectors, byte buffers, or other output streams. They simply wrap existing data; they don't take ownership of that data or copy it, so the source must not fall out of scope.
JSON
Any rbuf type can be serialized to or from JSON automatically. You can call these helper routines:
- to_json - Converts the object into a JsonBlob
- from_json - Converts a JsonBlob into the destination object
The JsonBlob type exists to abstract out the use of JSON libraries internally. You can call str() on it to get a string representation of the JSON.
Example code for serializing to and from JSON:
MySerializableStruct structure;
auto json = ark::serialization::to_json(structure);
MySerializableStruct roundtrip;
ark::serialization::from_json(json, roundtrip);
std::cout << json.str();
At this point structure and roundtrip should be identical, and you should have a JSON string printed to stdout.
There’s also a one-liner to make moving from JSON easier:
#include "ark/serialization/helpers.hh"
auto message = ark::serialization::from_json<MySerializableStruct>("... my json...");
That will treat the string passed in as JSON, and deserialize it into the type specified in your template.
Reflection
Every rbuf object has an associated Schema. This schema is a structure that you can use to serialize and deserialize an object without knowing its type at compile time.
This works by storing the structure of the rbuf in a programmatic structure, and then parsing input/output streams manually.
This is not a very performant way to serialize/deserialize objects, but it has an advantage in that you can runtime deserialize types without knowing (at compile time) what they are.
For example, logging and ark-spy can simply fetch the schemas live to deserialize content for you, rather than needing to dynamically load C++ code.
An example of using this:
// Assuming you have a "registry" from a source...
ark::serialization::Object object(registry, "MySerializableStruct");
object.deserialize(input_bitstream, registry);

// You can now look up fields...
auto name = object.field<uint32_t>("name");

// Or output JSON:
std::cout << object.json().str() << std::endl;
Schema provides all of the internal fields in its fields member. These are made up of FieldDescriptors which contain enough metadata to understand the name and type of a field.
You can retrieve nested objects by getting their ObjectPtr:
auto nested = object.field<ark::serialization::ObjectPtr>("nested_object");
std::cout << nested.json().str() << std::endl;
That code will retrieve a nested object, and then print out its JSON contents.
Finally, a SchemaRegistry exists as a mechanism to group many schemas together into one place. This is both convenient and necessary for deserializing objects that may reference multiple object types.
You can typically get a SchemaRegistry by querying the log or the HTTP server: these will return a JSON-serialized registry, which can be deserialized into a concrete SchemaRegistry.
You can also get a serialized SchemaRegistry for any type by using this API:
ark::serialization::schema<MySerializableStruct>();
This provides a registry instead of a schema, and that registry will consist of all of the schemas necessary to serialize/deserialize that type.
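Combined with the reflection Object shown earlier, this lets you reflect over a type that is compiled in. A sketch (input_bitstream is assumed to exist):

// Registry containing MySerializableStruct plus every schema it references.
auto registry = ark::serialization::schema<MySerializableStruct>();

// Reflectively deserialize without naming the C++ type at the call site.
ark::serialization::Object object(registry, "MySerializableStruct");
object.deserialize(input_bitstream, registry);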
Supported Languages
The rbufc compiler supports compiling rbuf files and emitting code in C++, JavaScript, and Python. In all languages, you are capable of fully serializing and deserializing objects using native code (pure C++, pure JavaScript, or pure Python).
Note that Python serialization/deserialization is achieved without the use of C bindings. This means that it is typically slower than just using something like pickle or msgpack (for simple benchmarks, it is typically 1.6x slower). As an example, a small structure with a few strings and primitives takes about 3us to serialize/deserialize in pickle, and about 4.5us to do the same in rbuf. The C++ and JavaScript code is considerably faster.
Use the --language command line option to rbufc to generate in a particular language (C++ is chosen by default). See --help for more details.
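A plausible invocation might look like the following (the exact argument layout is an assumption; consult --help):

rbufc --language python -I path/to/includes my_message.rbuf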
Writing Files
You can write rbuf objects out to files (and read them back in) using the APIs in the ark/serialization/file.hh header. For example, say you wanted to record your MySerializableStruct to disk:
ark::serialization::write_serializable_type_to_file("test.rbin", my_rbuf_type);
auto content = ark::serialization::read_serializable_type_from_file<MySerializableStruct>("test.rbin");
At this point, content and my_rbuf_type will be identical. Note that you can pass along the ark::core::WriteOptions flags to the write_serializable_type_to_file API, so you can write files out atomically if desired.
This API writes out the binary form of the rbuf to disk, and also includes additional schema information and other data, allowing you to read an rbuf file even if you don't know the type:
auto content = ark::serialization::read_generic_object_from_serialized_file("test.rbin");
std::cout << content->json().str();
This will cause whatever is in test.rbin to be deserialized and output as a JSON object. There is a commandline tool, ark-cat-rbuf-file, which will do this for you.
Note that ark::serialization::read_serializable_type_from_file can also read rbuf files that were simply written to disk directly (say, using ark::core::write_string_to_disk APIs). In that case, you cannot invoke the read_generic_object_from_serialized_file APIs, as the schema information was not written out.
The opposite is also true – files written out with the write_serializable_type_to_file API can be read in and deserialized using standard APIs. The serialized data is written out first, followed by a header block, which would be ignored.
Writing to Parquet
You can write simple rbuf files out as rows in a Parquet file. As long as your rbuf only contains basic (non-nested) fields, such as fundamental types, timestamps, strings, or GUIDs, you can write it out as a row in a parquet file.
For example:
#include "ark/serialization/parquet/parquet_writer.hh"
#include "my_rbuf_type.hh"
auto writer = ParquetWriter::Create<MyRbufType>("path/to/file.parquet");
for (size_t row = 0; row < 10; ++row)
{
    MyRbufType type{.my_value = row};
    writer.write(type);
}

writer.close();
Please note that you can also provide a compression type to the constructor, allowing you to create compressed parquet files if desired.
rbuf Intermediate Representation
For some use cases (such as language servers), it is possible to get an intermediate representation (IR) of the parsed rbuf. You can do this programmatically (by using the ark::serialization::parse_rbuf function), or through rbufc.
If using rbufc, pass the option --language ir, and it will emit a JSON document containing the intermediate representation. This is not considered to be a stable interface, and may change over time. However, it does include things like line numbers, source files, schema definitions, and field names and types, all of which could be useful for further code generation or language servers.