Ingest Jobs

When new logs are offloaded, or artifacts are added to the system, the scheduler kicks off “ingest” batch jobs against those artifacts. You can configure any number of batch jobs to run when a new log is added.

These jobs are responsible for tasks like extracting metadata, creating videos, outputting pose data, or validating data. They can even be used for non-catalog purposes, such as running your own checks or ETL jobs.

You can check on an ingest job’s progress by browsing to your log:

Ingest Status

Clicking on any of those jobs will show its recorded standard output, to help you debug any errors:

Ingest Stdout

While the job is running, this output will automatically update every few seconds.

Job Rules

Each job has rules that determine what artifacts it will run against. The first such rule is simply the organization the job was created under.

A job will only run against artifacts that are within the same organization as the job. The exception is ‘global’ ingest jobs, defined by an organization GUID of 77777777-7777-4777-7777-777777777777. These require administrator access to create, and will run against all ingested logs.

You can also define finer-grained rules. These come in the form of a series of individual rules of the form <LHS> <COMPARATOR> <RHS>, where the left-hand side and right-hand side are expressions, and the comparator can be ‘equals’ or ‘does not equal’.

A typical rule might be "${ARTIFACT_TYPE}" "Equals" "Log", which means the job only runs on log artifacts.

Rules can be defined to look at artifact types, identifiers, or attributes. For example, if your artifact/log has an attribute called ROBOT, you can create a rule "${ROBOT}" "Equals" "TestRobot". This would only run on artifacts related to TestRobot.

You can have multiple rules specified. They are all “anded” together (in other words, they must all be true).
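For example, combining the two rules above would restrict a job to logs from a specific robot. A minimal sketch, using the rules schema from the YAML job template shown later in this page (the ROBOT attribute and its value are hypothetical):

```yaml
# Hypothetical rule set: run only on Log artifacts whose ROBOT
# attribute is "TestRobot". Both rules must match (rules are ANDed).
rules:
  rules:
    - left_type: "${ARTIFACT_TYPE}"
      comparator: "Equals"
      right_type: "Log"
    - left_type: "${ROBOT}"
      comparator: "Equals"
      right_type: "TestRobot"
```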

Adding New Jobs (GUI)

You can add custom jobs to execute each time a new artifact is ingested. By default, adding a new job will not execute against older artifacts. You must manually request that those artifacts be reingested so the job runs against them (“backfilled”).

You can use jobs to process artifacts as they come in, extracting useful data for other downstream tools to make use of.

To add a new job through the web interface, open the ‘Settings’ page from the top-right corner, click “Ingest Jobs”, then click “Add New Job”.

You should see a screen that looks like this:

Add New Job

Fill in the details as appropriate, then click ‘Add Job’.

Adding New Jobs (CLI)

You can use the ark-catalog-ingest-tool to add new ingest jobs to the system. This is done by creating a template of your ingest job, such as:

config:
  default:
    name: "Example Job"
    organization_id: "<organization GUID>"
    container_image: "<container url>"
    cpu_count: 2
    memory_size_mb: 4096
    timeout_s: 3600
    command:
      - "${ARTIFACT_ID}"
    rules:
      rules:
        - left_type: "${ARTIFACT_TYPE}"
          comparator: "Equals"
          right_type: "Log"

You can see here that this job is called ‘Example Job’, requests 2 cores and 4GB of RAM, and will execute against any log artifact.

To actually create this job, use the ark-catalog-ingest-tool:

~/ark$ ./build/ark-catalog-ingest-tool --create-ingest-job "path/to/yaml"

This will return success or failure based on whether your job is valid and you have permission to add such a job.

Listing Jobs

You can view high level details about ingest jobs from the Catalog frontend. If you want to see additional details, use the ark-catalog-ingest-tool:

~/ark$ ./build/ark-catalog-ingest-tool --list-ingest-jobs
Extract Camera Video
  Identifier:      2
  Organization:    77777777-7777-4777-7777-777777777777
  Container Image: 095412845506.dkr.ecr.us-east-1.amazonaws.com/2ea25035-6aa4-42fb-9955-92288ea1b972/catalog/extract-video
  Core Count:      2
  Memory Size:     2048MB
  Timeout:         3600s
  Command:         ${ARTIFACT_ID}
  Rules:
    '${ARTIFACT_TYPE}' Equals 'Log'

From here, you can see that there is a single ingest job called ‘Extract Camera Video’. It has an organization GUID of 77777777-7777-4777-7777-777777777777, which means it applies to logs across all organizations.

Updating Jobs

You can update existing jobs similarly to creating jobs. Again, use the ark-catalog-ingest-tool, and pass it the path to an ingest job YAML file.

One key difference from creating ingest jobs is that you must pass in the identifier of the job definition you wish to update. For example:

~/ark$ ./build/ark-catalog-ingest-tool --update-ingest-job "path/to/yaml" --ingest-job-definition-id "job-id"

This will return success or failure based on whether the job was updated. All of the same rules apply for updating a job as for creating one.

Deleting Jobs

You can completely delete a job with the ark-catalog-ingest-tool. You can only delete jobs that were created in your organization (and you must be an admin to delete global jobs).

When a job is deleted, not only will it no longer run against new artifacts, but all previously-run job state will also be removed from artifacts it had already run against. Any attachments will remain; only the standard output of the job run (and the job state) will be removed.

To delete a job, you must supply the job definition identifier:

./build/ark-catalog-ingest-tool --delete-ingest-job "job-id"

Execution

The ‘container image’ references a Docker registry or ECR that contains the container you wish to execute. It is assumed that the container has an ENTRYPOINT defined with the binary that you wish to launch.

You provide the command line that should be given to the container’s entry point. A typical example is ${ARTIFACT_ID} which will resolve to the GUID of the artifact that is being ingested.
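Putting those two pieces together, a container for an ingest job might be built like the following. This is a hedged sketch: the base image and binary name (my-ingest-binary) are hypothetical, and the only requirement stated by this page is that the container defines an ENTRYPOINT:

```dockerfile
# Hypothetical ingest container. At run time, the resolved command
# line (e.g. the artifact GUID from ${ARTIFACT_ID}) is appended as
# arguments to the ENTRYPOINT binary.
FROM ubuntu:22.04
COPY my-ingest-binary /usr/local/bin/my-ingest-binary
ENTRYPOINT ["/usr/local/bin/my-ingest-binary"]
```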

The machines that run the container image are hosted on m5a.xlarge instances on AWS.

Reingesting Artifacts

You can reingest a particular artifact with the ark-catalog-ingest-tool as follows:

~/ark$ ./build/ark-catalog-ingest-tool --reingest-artifact <GUID>

As-is, that command will re-run all matching ingest jobs on the given artifact (in other words, any ingest job whose rules match the artifact will be re-run).

You can also use the --filter-job-definition command line argument to re-run only one particular ingest job on an artifact.

Finally, if you are an administrator, you can use the --reingest-all-artifacts flag to reingest all of the artifacts in the catalog. It’s recommended that you further filter this with --filter-job-definition or --filter-artifact-type to avoid creating a large number of jobs.
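For instance, a filtered bulk reingest might look like the following (the job identifier and artifact type below are placeholders):

```
~/ark$ ./build/ark-catalog-ingest-tool --reingest-all-artifacts \
    --filter-job-definition "job-id" \
    --filter-artifact-type "Log"
```

This limits the backfill to a single job definition and a single artifact type, rather than re-running every matching job on every artifact.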