Fault Stage
The Fault stage lets you capture insights from the defined metrics and status of your system. This is useful for getting an overall view of the “health” of your system.
Configuration
Configure each fault with the message type, field, and value that define the fault condition. Be sure to include a description of the fault as you would like it to appear on the UI. This can all be done through the FaultStageConfig structure, which can be populated with YAML:
---
config:
  # Rate at which the FaultReport should be published
  publish_report_rate_s: 2
  # List of configuration for faults
  fault_configs:
    # Fault configured to activate if the boolean `logging` state in `/logger/statistics` is in the `false` state.
    - channel_name: "/logger/statistics"
      field_string: "logging"
      field_type: "bool"
      operator_type: EQUAL
      limit_threshold_value_str: "false"
      fault_description: "System is not logging. Please enable the log writer system."
    # Fault configured to activate if the double `value` in `/your/metric` is greater than 3.14.
    - channel_name: "/your/metric"
      field_string: "value"
      field_type: "double"
      operator_type: GREATER_THAN
      limit_threshold_value_str: "3.14"
      fault_description: "The metric value has exceeded the maximum limit."
This configures two faults. The first fault activates if the boolean logging state in /logger/statistics is false. The second fault activates if the double value in /your/metric is greater than 3.14. The operator_type comparison is exclusive of the limit_threshold_value_str value, so a value of exactly 3.14 does not activate the second fault.
The field_type string must be in the form of a FieldType string, which is the same string mapping type used in the configuration system. See the configuration system documentation for details.
The currently supported operator_type values are:
EQUAL - Value matches threshold
NOT_EQUAL - Value does not match threshold
GREATER_THAN - Value is greater than threshold
LESS_THAN - Value is less than threshold
BITMASK_AND - The bitwise AND of the value and the threshold bitmask is equal to the bitmask
STRING_CONTAINS - The field is a string (rather than a double/integer/bool) and contains the threshold string
STRING_EQUALS - The field is a string (rather than a double/integer/bool) and is exactly equal to the threshold string
STRING_NOT_EQUAL - The field is a string (rather than a double/integer/bool) and does not equal the threshold string
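The operator semantics can be sketched in a few lines of Python. This is an illustration only, not the stage's implementation: the PARSERS table, the "int32" and "string" type names, and the is_fault_active helper are hypothetical, and the BITMASK_AND rule is one reading of the description above.

```python
import operator

# Hypothetical mapping from field_type strings to parsers for the threshold.
# The real FieldType strings are defined by the configuration system.
PARSERS = {
    "bool": lambda s: s.strip().lower() == "true",
    "double": float,
    "float": float,
    "int32": int,
    "string": str,
}

# Assumed semantics for each operator_type, following the list above.
OPERATORS = {
    "EQUAL": operator.eq,
    "NOT_EQUAL": operator.ne,
    "GREATER_THAN": operator.gt,  # strict: value == threshold is not a fault
    "LESS_THAN": operator.lt,     # strict
    "BITMASK_AND": lambda value, mask: (value & mask) == mask,
    "STRING_CONTAINS": lambda value, needle: needle in value,
    "STRING_EQUALS": operator.eq,
    "STRING_NOT_EQUAL": operator.ne,
}


def is_fault_active(value, field_type, operator_type, limit_threshold_value_str):
    """Parse the threshold string, then compare the field value against it."""
    threshold = PARSERS[field_type](limit_threshold_value_str)
    return OPERATORS[operator_type](value, threshold)


# The two faults from the YAML example above.
logging_fault = is_fault_active(False, "bool", "EQUAL", "false")
metric_fault = is_fault_active(3.5, "double", "GREATER_THAN", "3.14")
```

Because the threshold is always supplied as a string, it has to be parsed according to field_type before the comparison is made.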
You can transform the value before the comparison is made by setting the transform_type field. This can be one of two values:
NO_TRANSFORM - The default; compare the value of the field directly
DIFFERENCE - Compare the difference between two successive values to the threshold
For example, if you have an incrementing counter, like packet_receive_count, you could use the DIFFERENCE transform to ensure that at least 20 packets are received between updates.
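A minimal Python sketch of the DIFFERENCE idea follows. The DifferenceTransform class is hypothetical, and it assumes that no comparison is made until two values have been seen; how the stage itself treats the first message is not specified here.

```python
class DifferenceTransform:
    """Returns the change between successive values (None for the first value)."""

    def __init__(self):
        self._previous = None

    def apply(self, value):
        previous, self._previous = self._previous, value
        if previous is None:
            return None  # assumption: nothing to compare against yet
        return value - previous


transform = DifferenceTransform()
packet_receive_counts = [100, 130, 145, 170]
deltas = [transform.apply(count) for count in packet_receive_counts]

# A LESS_THAN fault with a threshold of 20 fires when fewer than
# 20 packets arrived since the previous update.
faults = [delta is not None and delta < 20 for delta in deltas]
```

In this run only the third update, where the counter advanced by 15, would activate the fault.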
Configuration: Channel Pattern Matching
You can use the channel_pattern option to match channel names against a regular expression, so that a single fault configuration applies to every matching channel. This is helpful when capturing faults from multiple instances of a stage running in different namespaced environments.
# Fault configured to activate if the `total_packet_count` difference between two messages is less than 1.
# Note that this configuration is for channel_pattern, which allows a regex pattern evaluation for the channel name.
# This allows any of the /lidar/*/sweep_stats channels to be subscribed to this fault configuration
- channel_pattern: "/lidar/.*/sweep_stats"
  field_string: "total_packet_count"
  field_type: "float"
  operator_type: LESS_THAN
  transform_type: DIFFERENCE
  priority: "Normal"
  limit_threshold_value_str: "1"
  fault_description: "No Lidar data is being received."
Suppose data is being emitted on the channels /lidar/forward/sweep_stats and /lidar/rear/sweep_stats.
This configuration captures both of the “sweep_stats” channels and monitors them for faults. The fault examines the difference in total_packet_count between consecutive messages.
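To check which channels a pattern would select, it can help to try the expression outside the system first. The Python sketch below assumes the pattern must match the whole channel name (re.fullmatch); whether the stage uses full or partial matching is an assumption here, and the channel list is made up for illustration.

```python
import re

channel_pattern = re.compile(r"/lidar/.*/sweep_stats")

# Hypothetical channel names for illustration.
channels = [
    "/lidar/forward/sweep_stats",
    "/lidar/rear/sweep_stats",
    "/lidar/forward/points",
    "/radar/front/sweep_stats",
]

# Keep only the channels whose full name matches the pattern.
matched = [name for name in channels if channel_pattern.fullmatch(name)]
```

Note that `.*` also matches the `/` character, so a deeper name such as /lidar/a/b/sweep_stats would match as well.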
Configuration: Field String
There are two syntax formats supported for accessing values in the fault structures.
Assuming our metric structure is:
schema ArrayValue
{
    string str;
}
schema NestedMetric
{
    int32 value;
    arraylist<int32> primitives;
    arraylist<ArrayValue> structs;
}
schema Metric
{
    int32 value;
    NestedMetric nested;
}
Dot
The dot format uses the same syntax you would use to access the field in C++.
value
nested.value
nested.primitives[1]
nested.structs[2].str
JSON Pointer
This is the standard JSON Pointer syntax (RFC 6901) for accessing fields.
/value
/nested/value
/nested/primitives/1
/nested/structs/2/str
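Both formats address the same fields, which the following Python sketch illustrates by resolving each against a plain dictionary shaped like the Metric schema. The helper functions and the sample data are hypothetical and are not part of the fault stage.

```python
import re


def dot_to_tokens(path):
    """'nested.structs[2].str' -> ['nested', 'structs', 2, 'str']"""
    tokens = []
    for part in path.split("."):
        name, *indices = re.findall(r"[^\[\]]+", part)
        tokens.append(name)
        tokens.extend(int(index) for index in indices)
    return tokens


def pointer_to_tokens(pointer):
    """'/nested/structs/2/str' -> ['nested', 'structs', '2', 'str']"""
    # RFC 6901 escapes: '~1' stands for '/', '~0' stands for '~'.
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]]


def resolve(message, tokens):
    """Walk nested dicts and lists, one token at a time."""
    current = message
    for token in tokens:
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Sample data mirroring the Metric schema.
metric = {
    "value": 7,
    "nested": {
        "value": 42,
        "primitives": [10, 20, 30],
        "structs": [{"str": "a"}, {"str": "b"}, {"str": "c"}],
    },
}

from_dot = resolve(metric, dot_to_tokens("nested.primitives[1]"))
from_pointer = resolve(metric, pointer_to_tokens("/nested/primitives/1"))
```

Both lookups land on the second element of primitives, so either format can be used in field_string for the same field.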
Interacting
As the metrics messages are transmitted through the system, the fault stage will process them to determine if the fault condition is active or not. Once configured, there is no interaction with the FaultStage.
Metrics / Output
The fault stage produces a fault report at a configured rate that shows an aggregated view of all of the configured faults in the system. The report is transmitted on the /fault_report channel.
The /fault_state_changes channel carries a FaultState message, which indicates when a particular fault has changed state (either becoming active or becoming inactive). Messages are published on this channel as soon as the state changes, so there is a low-latency path for downstream stages to “handle” faults as they occur.
Because reports are only emitted at (relatively infrequent) intervals, they may not reflect the current fault state. Any fault handling stage is therefore advised to listen on the /fault_state_changes channel.
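The difference between the two outputs can be sketched as follows. The FaultTracker class is hypothetical and does not reflect the actual FaultState or FaultReport message layouts; it assumes every fault starts out inactive and keys faults by their description.

```python
class FaultTracker:
    """Tracks fault states, emitting an event only when a state flips."""

    def __init__(self):
        self._active = {}  # fault description -> currently active?

    def update(self, description, active):
        """Record the latest evaluation; return a change event or None."""
        previous = self._active.get(description, False)  # assumed initial state
        self._active[description] = active
        if active != previous:
            return (description, active)  # analogous to a state-change message
        return None

    def report(self):
        """Aggregated view, analogous to the periodic fault report."""
        return sorted(d for d, is_active in self._active.items() if is_active)


tracker = FaultTracker()
events = [
    tracker.update("No Lidar data is being received.", True),
    tracker.update("No Lidar data is being received.", True),
    tracker.update("No Lidar data is being received.", False),
]
snapshot = tracker.report()
```

A consumer that only reads the periodic snapshot here would see no active faults and miss the activation entirely, while a consumer of the change events sees both the activation and the recovery.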