On-Disk Format

This document describes the on-disk format of Ark Logs. In general, it is preferable to use the existing APIs for reading/writing logs.

Logs consist of two major file types:

the manifest, which contains metadata and indicates which split files comprise a ‘log’
split files, which contain the raw message data

All of the data structures used within these two files are serialized as rbufs, with one exception: the index entries in a split file, which are described below.

The directory structure of a log typically looks like this:

<root directory>
 |
 |-> manifests/<guid>
 |-> splits/<guid>

The “manifest GUID” is the identifier of the log itself. The “split GUID” is the identifier of each individual split file.

It is worth noting that a split file may be referenced by multiple manifests. For example, you may create a ‘snippet’ or ‘subindex’ of a log, which would generate a new manifest that references existing split files.

Split File Format

Every split file must be at least 1024 bytes long. It begins with a serialized header, which is defined by the ark::logging::SplitFileHeader rbuf structure located in ark/logging/split_header.rbuf. The current version is reproduced here:

///
/// A structure that maps GUIDs of object identifiers into
/// short integers for internal writing.
///
schema SplitTypeMappingInfo
{
    /// The original object identifier.
    guid object_identifier;

    /// The short (8-bit) identifier we map to.
    uint8 short_identifier;
}

///
/// The definition for the header that is written out to every
/// split file.
///
schema SplitFileHeader
{
    /// Indicates the format of this split file. This value should
    /// always be one.
    uint16 format_version;

    /// Original GUID of this split file.
    guid split_identifier;

    /// The offset of the index within the file.
    uint32 index_offset;

    /// The number of entries in the index.
    uint32 indexed_item_count;

    /// The maximum number of items that are stored in this index.
    uint32 maximum_item_count;

    /// Information mapping type information to short types.
    arraylist<SplitTypeMappingInfo> type_mappings;
}

On disk, this would be represented as:

1 byte for the rbuf bitstream header (should be 0xd1)
2 bytes for format_version
16 bytes for split_identifier
4 bytes for index_offset
4 bytes for indexed_item_count
4 bytes for maximum_item_count
4 bytes to indicate the length of the type_mappings list
For each item in type_mappings:
- 1 byte for the rbuf bitstream header (should be 0xd1)
- 16 bytes for the object_identifier
- 1 byte for the short_identifier
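
The layout above can be read with a small hand-written parser. The following is a minimal sketch rather than the Ark implementation: ByteReader, TypeMapping, SplitHeader and parse_split_header are hypothetical names, and it assumes that integers are stored little-endian, that a guid is 16 raw bytes, and that the 4-byte type_mappings length is an item count.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Guid = std::array<uint8_t, 16>;

// Hypothetical bounds-checked cursor over the file bytes.
// Assumption: integers are stored little-endian.
class ByteReader
{
public:
    explicit ByteReader(const std::vector<uint8_t> &data) : data_(data) {}

    uint64_t read_uint(size_t width)
    {
        assert(width <= 8);
        require(width);
        uint64_t value = 0;
        for (size_t i = 0; i < width; ++i)
            value |= static_cast<uint64_t>(data_[pos_ + i]) << (8 * i);
        pos_ += width;
        return value;
    }

    Guid read_guid()
    {
        require(16);
        Guid guid;
        std::copy_n(data_.begin() + static_cast<std::ptrdiff_t>(pos_), 16, guid.begin());
        pos_ += 16;
        return guid;
    }

private:
    void require(size_t count) const
    {
        if (count > data_.size() - pos_)
            throw std::runtime_error("split header is truncated");
    }

    const std::vector<uint8_t> &data_;
    size_t pos_ = 0;
};

// In-memory mirrors of the rbuf schemas (hypothetical names).
struct TypeMapping
{
    Guid object_identifier;
    uint8_t short_identifier;
};

struct SplitHeader
{
    uint16_t format_version;
    Guid split_identifier;
    uint32_t index_offset;
    uint32_t indexed_item_count;
    uint32_t maximum_item_count;
    std::vector<TypeMapping> type_mappings;
};

SplitHeader parse_split_header(const std::vector<uint8_t> &data)
{
    ByteReader reader(data);
    if (reader.read_uint(1) != 0xd1)
        throw std::runtime_error("unexpected rbuf bitstream header");

    SplitHeader header;
    header.format_version = static_cast<uint16_t>(reader.read_uint(2));
    header.split_identifier = reader.read_guid();
    header.index_offset = static_cast<uint32_t>(reader.read_uint(4));
    header.indexed_item_count = static_cast<uint32_t>(reader.read_uint(4));
    header.maximum_item_count = static_cast<uint32_t>(reader.read_uint(4));

    // Assumption: the list length is the number of items, not a byte count.
    const uint32_t mapping_count = static_cast<uint32_t>(reader.read_uint(4));
    for (uint32_t i = 0; i < mapping_count; ++i)
    {
        if (reader.read_uint(1) != 0xd1)
            throw std::runtime_error("unexpected rbuf bitstream header in type mapping");
        TypeMapping mapping;
        mapping.object_identifier = reader.read_guid();
        mapping.short_identifier = static_cast<uint8_t>(reader.read_uint(1));
        header.type_mappings.push_back(mapping);
    }
    return header;
}
```

A caller would load the start of the split file into a byte vector and pass it to parse_split_header. The reader throws instead of reading past the end of the buffer, so a truncated file is reported rather than silently misparsed.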

The actual index contents are represented by the structure in ark/logging/wire_structure.hh:

struct SplitIndexEntry
{
    /// The timestamp of the object.
    uint64_t timestamp_ns;

    /// The offset of the object within the split.
    uint32_t offset;

    /// The size of the object within the split.
    uint32_t size;

    /// A reference to the identifier that is used internally
    /// to the split.
    uint8_t object_identifier;

    /// Additional flags (reserved for now).
    uint8_t flags;
} __attribute__((packed));

To read the index, seek to the file offset index_offset (from the SplitFileHeader) and read a flat array of indexed_item_count SplitIndexEntry items. The index is already sorted in timestamp order.
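
As a sketch of that read, the following copies the index out of a split file that has already been loaded into memory, and uses the timestamp ordering to binary-search it. read_split_index and first_entry_at_or_after are hypothetical helpers, not Ark APIs, and the code assumes it runs on a machine with the same byte order as the one that wrote the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Copied from the structure above so that this sketch is self-contained.
struct SplitIndexEntry
{
    uint64_t timestamp_ns;
    uint32_t offset;
    uint32_t size;
    uint8_t object_identifier;
    uint8_t flags;
} __attribute__((packed));

// 8 + 4 + 4 + 1 + 1 bytes, with no padding because the struct is packed.
static_assert(sizeof(SplitIndexEntry) == 18, "unexpected index entry size");

// Copies indexed_item_count entries starting at index_offset.
std::vector<SplitIndexEntry> read_split_index(const std::vector<uint8_t> &file,
                                              uint32_t index_offset,
                                              uint32_t indexed_item_count)
{
    const uint64_t index_bytes = uint64_t{indexed_item_count} * sizeof(SplitIndexEntry);
    if (index_offset > file.size() || index_bytes > file.size() - index_offset)
        throw std::runtime_error("index extends past the end of the split file");

    std::vector<SplitIndexEntry> entries(indexed_item_count);
    if (index_bytes > 0)
        std::memcpy(entries.data(), file.data() + index_offset, static_cast<size_t>(index_bytes));
    return entries;
}

// Returns the position of the first entry whose timestamp is >= timestamp_ns,
// or entries.size() if there is none. Relies on the index being sorted.
size_t first_entry_at_or_after(const std::vector<SplitIndexEntry> &entries, uint64_t timestamp_ns)
{
    const auto it = std::lower_bound(
        entries.begin(), entries.end(), timestamp_ns,
        [](const SplitIndexEntry &entry, uint64_t ts) { return entry.timestamp_ns < ts; });
    return static_cast<size_t>(it - entries.begin());
}
```

Because the entries are sorted by timestamp, a time-range query only needs two binary searches over this array instead of a scan.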

The offset in each index entry is the exact file offset where that message’s data begins. The size is the (compressed) size of that data.
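
A minimal sketch of extracting one message's stored bytes from an in-memory split file follows. read_message_bytes is a hypothetical helper, not an Ark API. The bytes it returns are exactly what is stored on disk, so they are still compressed if the split uses compression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Copied from the structure above so that this sketch is self-contained.
struct SplitIndexEntry
{
    uint64_t timestamp_ns;
    uint32_t offset;
    uint32_t size;
    uint8_t object_identifier;
    uint8_t flags;
} __attribute__((packed));

// Returns the stored bytes for one index entry, checking that the
// [offset, offset + size) range lies inside the file.
std::vector<uint8_t> read_message_bytes(const std::vector<uint8_t> &file,
                                        const SplitIndexEntry &entry)
{
    const uint64_t begin = entry.offset;
    const uint64_t end = begin + entry.size;
    if (end > file.size())
        throw std::runtime_error("message extends past the end of the split file");

    return std::vector<uint8_t>(file.begin() + static_cast<std::ptrdiff_t>(begin),
                                file.begin() + static_cast<std::ptrdiff_t>(end));
}
```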

The object_identifier is a short (8-bit) identifier. Look it up in the type_mappings list of the SplitFileHeader to retrieve the full object identifier GUID, which is needed to retrieve the object name and metadata information from the manifest.
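
The lookup itself is a search of the header's type_mappings list. A minimal sketch, assuming the list has already been parsed into memory (TypeMapping and find_object_guid are hypothetical names):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Guid = std::array<uint8_t, 16>;

// In-memory mirror of SplitTypeMappingInfo (hypothetical name).
struct TypeMapping
{
    Guid object_identifier;
    uint8_t short_identifier;
};

// Returns the full GUID for an index entry's 8-bit identifier,
// or nullptr if the header has no mapping for it.
const Guid *find_object_guid(const std::vector<TypeMapping> &type_mappings,
                             uint8_t short_identifier)
{
    for (const TypeMapping &mapping : type_mappings)
    {
        if (mapping.short_identifier == short_identifier)
            return &mapping.object_identifier;
    }
    return nullptr;
}
```

A linear search is enough here because an 8-bit identifier allows at most 256 mappings per split file.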

Finally, the message data may be compressed, in which case you will need to decompress it before it can be read. The compression format (if any) is specified in the manifest, in the compression_type field of the column that contains the split. Typically we use Zstd or LZ4 compression for data.

Manifest File Format

A manifest file consists of a single serialized rbuf object: the ark::logging::Manifest object defined in the ark/logging/manifest.rbuf file.

At a top level, it is defined as:

schema Manifest
{
    /// The identifier of this log. This allows you to uniquely identify
    /// a particular view into a set of data.
    guid log_identifier;

    /// An identifier common to all logs within a "family" (ie, all logs that
    /// share some data).
    guid common_identifier;

    /// The human-readable name of this manifest.
    string name;

    /// The list of columns. This is deprecated and will be removed
    /// at some point -- it's just in here for backwards compatibility
    /// with our few existing logs and does not appear in the wild.
    [[removed]] arraylist<ColumnDeclaration> v1_columns;

    group 0 
    {
        arraylist<ManifestSection> sections;
    }

    group 1
    {
        ///
        /// Metadata associated with the log, stored as key-value pairs,
        /// where the value is in JSON format.
        ///
        LogMetadataEntries metadata;
    }
}

Note that this makes use of ‘rbuf groups’. The order of the sections is significant and must be maintained. Typically, a ‘base log’ consists of a single section. Additional sections may be layered on to form ‘amendments’, which restrict what data in the original split files is visible (or add data to it).

The section format is defined as:

schema ManifestSection
{
    /// The list of columns contained within this section.
    arraylist<ColumnDeclaration> columns;

    /// The original log identifier that this section came
    /// from.
    guid original_log_identifier;

    /// Acceptance filter -- each section parsed after this
    /// one has this query applied to it.
    Query acceptance_filter;
}

The acceptance_filter object consists of a serialized query. The query applies to all following sections, and can be used to ‘hide’ data such as objects or channels (or to restrict time ranges).

Each column is represented by a ColumnDeclaration:

schema ColumnDeclaration
{
    /// The human-readable name of this column.
    string name;

    /// Compression type for all splits in this column.
    string compression_type;

    /// All of the objects tracked in this column.
    arraylist<ObjectDeclaration> object_types;

    /// All of the splits tracked in this column.
    arraylist<SplitDeclaration> splits;
}

A column defines a set of split files that share common characteristics (such as object types or compression). Note that compression_type applies to all objects contained within that column.

The splits field is a list of split declarations. Each one has a GUID identifying a split file that is part of the column, and a list of object statistics that indicate how much data was written for each object in that file. The object_types field is a list of object declarations. Each one defines the channel name, the type identifier (e.g. ark::image::Image), and the object identifier, along with the schema registry (a list of the message definitions needed to deserialize that object type) and an indication of whether the data is latched.

An object declaration is defined as:

schema ObjectDeclaration
{
    /// The human readable name of this object.
    string name;

    /// The type identifier of this object.
    string type_identifier;

    /// The object identifier of this object.
    guid identifier;

    /// The schema registry of the object, containing everything
    /// necessary to reconstruct this data.
    string schema_registry;

    group 0
    {
        /// If this object is "latched" or not -- this means that the most
        /// recently received message of this type will be returned at
        /// the beginning of any query.
        bool latched;
    }
}

Fully parsing the manifest requires at least a simple rbuf parser. As a simple example, the top level of the manifest is laid out like this:

1 byte for the rbuf bitstream header (should be 0xd1) and may have bit 0x04 set
16 bytes for the log identifier
16 bytes for the common identifier
4 bytes for the name length (in bytes), followed by the name (in UTF-8)
4 bytes, always zero
If bit 0x08 in the bitstream header is set, groups are present, and if so, read:
- one byte for the header, which should be 0xe0, and will have bit 0x04 set if additional groups are present
- one byte for the group header number (0, 1, etc)
- four bytes for the group size
- if the group header number is zero, then:
  - read 4 bytes for the length of manifest sections
  - for each manifest section, continue on
- if the group header number is one, then:
  - read 1 byte for the rbuf bitstream header (should be 0xd1)
  - read 4 bytes for the number of metadata pairs, for each pair, read:
    - 4 bytes for the key length (in bytes), followed by the key (in UTF-8)
    - 4 bytes for the value length (in bytes), followed by the value (in UTF-8)
- if any other number, advance by group_size bytes
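
The steps above can be sketched as follows. This is an illustration only: it records where each top-level group's payload starts instead of decoding it, the names (GroupSpan, ManifestTopLevel, parse_manifest_top_level) are hypothetical, and it assumes little-endian integers and a manifest that is already loaded into memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Location of one top-level group (hypothetical helper type).
struct GroupSpan
{
    uint8_t number;   // 0 = manifest sections, 1 = metadata
    size_t offset;    // where the group's payload starts
    uint32_t size;    // payload size taken from the group header
};

struct ManifestTopLevel
{
    std::string name;
    std::vector<GroupSpan> groups;
};

static void require(const std::vector<uint8_t> &data, size_t pos, size_t count)
{
    if (pos > data.size() || count > data.size() - pos)
        throw std::runtime_error("manifest is truncated");
}

static uint8_t read_u8(const std::vector<uint8_t> &data, size_t &pos)
{
    require(data, pos, 1);
    return data[pos++];
}

// Assumption: integers are stored little-endian.
static uint32_t read_u32(const std::vector<uint8_t> &data, size_t &pos)
{
    require(data, pos, 4);
    uint32_t value = 0;
    for (size_t i = 0; i < 4; ++i)
        value |= static_cast<uint32_t>(data[pos + i]) << (8 * i);
    pos += 4;
    return value;
}

ManifestTopLevel parse_manifest_top_level(const std::vector<uint8_t> &data)
{
    size_t pos = 0;
    const uint8_t header = read_u8(data, pos);
    // Ignore the 0x04 and 0x08 flag bits when checking for 0xd1.
    if ((header & 0xf3) != 0xd1)
        throw std::runtime_error("unexpected rbuf bitstream header");

    require(data, pos, 32);
    pos += 32;  // skip the log identifier and the common identifier

    ManifestTopLevel manifest;
    const uint32_t name_length = read_u32(data, pos);
    require(data, pos, name_length);
    manifest.name.assign(reinterpret_cast<const char *>(data.data() + pos), name_length);
    pos += name_length;

    if (read_u32(data, pos) != 0)
        throw std::runtime_error("expected four zero bytes after the name");

    // Groups are present if bit 0x08 is set, as in the walkthrough above.
    bool more_groups = (header & 0x08) != 0;
    while (more_groups)
    {
        const uint8_t group_header = read_u8(data, pos);
        if ((group_header & 0xfb) != 0xe0)
            throw std::runtime_error("unexpected group header");

        GroupSpan group;
        group.number = read_u8(data, pos);
        group.size = read_u32(data, pos);
        group.offset = pos;
        require(data, pos, group.size);
        pos += group.size;  // skip the payload; decode it separately if needed
        manifest.groups.push_back(group);

        more_groups = (group_header & 0x04) != 0;
    }
    return manifest;
}
```

Skipping each payload by its size means unknown group numbers are handled the same way as known ones, and any trailer bytes after the last group are never touched.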

Please see the rbuf definitions to further process the manifest file. Note that the manifest is actually stored in the ‘rbuf serializable file’ format (which can be decoded with the ark-cat-rbuf tool), and contains a trailer that holds metadata such as content size, version info, and schema metadata. It is fine to ignore this trailer.