Fault Stage

The Fault stage allows you to capture insights through the defined metrics and status of your system. This is very useful for gaining an overall view of the “health” of your system.

Configuration

You need to configure each fault with the message type, field, and value for the fault condition. Be sure to include a description of the fault as you would like it to appear in the UI. This is all done through the FaultStageConfig structure, which can be populated with YAML:

---
config:
  # Rate at which the FaultReport should be published
  publish_report_rate_s: 2

  # List of configuration for faults
  fault_configs: 

    # Fault configured to activate if the boolean `logging` state in `/logger/statistics` is in the `false` state. 
    - channel_name: "/logger/statistics"
      field_string: "logging"
      field_type: "bool"
      operator_type: EQUAL
      limit_threshold_value_str: "false"
      fault_description: "System is not logging.  Please enable the log writer system."

    # Fault configured to activate if the double `value` in `/your/metric` is greater than 3.14.
    - channel_name: "/your/metric"
      field_string: "value"
      field_type: "double"
      operator_type: GREATER_THAN
      limit_threshold_value_str: "3.14"  
      fault_description: "The metric value has exceeded the maximum limit."

This configures two faults. The first fault activates if the boolean logging state in /logger/statistics is false. The second fault activates if the double value in /your/metric is greater than 3.14. The operator_type comparisons are exclusive of the limit_threshold_value_str value itself; for example, with GREATER_THAN a value exactly equal to 3.14 does not activate the fault.

The field_type string must be a FieldType string, the same string-to-type mapping used in the configuration system. See the configuration system documentation for details.

The currently supported operator_type values are:

  • EQUAL - Value matches threshold
  • NOT_EQUAL - Value does not match threshold
  • GREATER_THAN - Value is greater than threshold
  • LESS_THAN - Value is less than threshold
  • BITMASK_AND - The bitwise AND of the value and the threshold is equal to the threshold, i.e. every bit set in the threshold is also set in the value (see the sketch after this list)
  • STRING_CONTAINS - The field is a string (rather than a double/integer/bool) and contains the threshold string.
  • STRING_EQUALS - The field is a string (rather than a double/integer/bool) and is exactly equal to the threshold string.
  • STRING_NOT_EQUAL - The field is a string (rather than a double/integer/bool) and does not equal the threshold string.
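
For example, a BITMASK_AND fault might be sketched as below. The channel name, field name, and the use of "int32" as the FieldType string are illustrative assumptions, not part of the stage itself:

    # Hypothetical sketch: activate if bit 2 (decimal 4) is set in the integer `error_flags` field,
    # i.e. when (error_flags & 4) == 4. The channel/field names and "int32" FieldType are assumptions.
    - channel_name: "/your/status"
      field_string: "error_flags"
      field_type: "int32"
      operator_type: BITMASK_AND
      limit_threshold_value_str: "4"
      fault_description: "The device has reported an error flag."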

You can transform the value before the comparison is made by setting the transform_type field. This can be one of two values:

  • NO_TRANSFORM - The default, just compare the value of the field
  • DIFFERENCE - Compare the difference between two successive values to the threshold

For example, if you have an incrementing counter, like packet_receive_count, you could use the DIFFERENCE transform to ensure that at least 20 packets are received between updates, as sketched below.
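
A minimal sketch of such a configuration follows; the channel name is an illustrative assumption:

    # Hypothetical sketch: activate if fewer than 20 packets are received between consecutive updates.
    # The channel name here is an assumption for illustration.
    - channel_name: "/network/statistics"
      field_string: "packet_receive_count"
      field_type: "float"
      operator_type: LESS_THAN
      transform_type: DIFFERENCE
      limit_threshold_value_str: "20"
      fault_description: "Fewer than 20 packets were received between updates."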

Configuration: Channel Pattern Matching

You can use the channel_pattern option to configure pattern matching (as a regular expression) so that all matching channels are covered by the fault configuration. This is very helpful when capturing faults from multiple instances of a stage running in different namespaced environments.


    # Fault configured to activate if the `total_packet_count` difference between two messages is less than 1.
    # Note that this configuration is for channel_pattern, which allows a regex pattern evaluation for the channel name.
    # This allows any of the /lidar/*/sweep_stats channels to be subscribed to this fault configuration
    - channel_pattern: "/lidar/.*/sweep_stats"
      field_string: "total_packet_count"
      field_type: "float"
      operator_type: LESS_THAN
      transform_type: DIFFERENCE
      priority: "Normal"
      limit_threshold_value_str: "1"
      fault_description: "No Lidar data is being received."

Let’s say that data is being emitted on the channels /lidar/forward/sweep_stats and /lidar/rear/sweep_stats.

This configuration demonstrates capturing both of the “sweep_stats” channels and monitoring them for faults. The fault is configured to examine the difference in total_packet_count between consecutive messages.

Configuration: Field String

Two syntax formats are supported for the field_string used to access values in the monitored message structures.

Assuming our metric structure is:

schema ArrayValue
{
    string str;
}
schema NestedMetric
{
    int32 value;
    arraylist<int32> primitives;
    arraylist<ArrayValue> structs;
}

schema Metric
{
    int32 value;
    NestedMetric nested;
}

Dot

The dot format is the same as if you were accessing the field in C++.

value
nested.value
nested.primitives[1]
nested.structs[2].str

JSON Pointer

This is the official JSON Pointer syntax for accessing fields.

/value
/nested/value
/nested/primitives/1
/nested/structs/2/str
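
For example, a fault configuration could reference the nested value using either form. This is a sketch only; the channel name, the threshold, and the use of "int32" as the FieldType string are illustrative assumptions:

    # Hypothetical sketch against the Metric schema above; the field_string could
    # equivalently be written as the JSON pointer "/nested/value".
    - channel_name: "/your/metric"
      field_string: "nested.value"
      field_type: "int32"
      operator_type: GREATER_THAN
      limit_threshold_value_str: "100"
      fault_description: "The nested metric value has exceeded the maximum limit."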

Interacting

As metrics messages are transmitted through the system, the fault stage processes them to determine whether each fault condition is active. Once configured, no further interaction with the FaultStage is required.

Metrics / Output

The fault stage produces a fault report at a configured rate that shows an aggregated view of all of the configured Faults in the system. The report is transmitted on the /fault_report channel.

The channel /fault_state_changes carries a message (FaultState) indicating when a particular fault has changed state (either to become active or to become inactive). Messages are published on this channel as soon as the state changes, providing a low-latency path for downstream stages to “handle” faults as they occur.

Because reports are only emitted at (relatively infrequent) intervals and may not reflect the current fault state, any fault-handling stage should listen on the /fault_state_changes channel.