History based data checks

When thinking about automating documents (skipping human review), we should keep in mind that there are two categories of fields. One category includes fields that change on every document from a particular vendor, such as Invoice ID, Issue Date, or Due Date. The second category contains fields that almost never change, such as the bank account number, IBAN, or the vendor's VAT ID.

Knowing that some fields almost never change, we have enabled the option to compare captured data from a new document with the data from already exported documents. If a value has already been seen for a given vendor many times, we can say with very high confidence that the newly captured value is correct and it can be automated.

Below, we explain in detail how the history based data checks work, how they can be configured on a Queue to fit your needs, and how they affect the automation process.

Step 1: Verifying data against a relevant group of documents

When verifying data on newly processed documents, we should compare them only to a relevant set of exported documents. In the Accounts Payable domain, the relevant set of documents would be the ones issued by the same vendor.

Let's say you have the following fields defined in your Extraction schema:

[
  {
    "rir_field_names": [
      "account_num"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "account_number",
    "label": "Account number",
    "hidden": false,
    "type": "string",
    "can_export": false
  },
  {
    "rir_field_names": [
      "sender_ic"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_id",
    "label": "Vendor company ID",
    "hidden": false,
    "type": "string",
    "can_export": false
  },
  {
    "rir_field_names": [
      "sender_vat_id",
      "sender_dic"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_vat_id",
    "label": "Vendor VAT number",
    "type": "string"
  },
  {
    "rir_field_names": [
      "sender_name"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_name",
    "label": "Vendor name",
    "type": "string"
  }
]

Now, imagine that a new document comes in and Rossum's AI Engine automatically captures the following values:

  • account_number=12345678
  • vendor_id=CZ444444
  • vendor_name=ABC Company

The fields vendor_id and vendor_vat_id can be used to uniquely identify a specific vendor, so these are good candidates for finding documents that were sent by the same vendor. Additionally, the Data field "account_number" can partially help identify the vendor, since bank account numbers are unique within one banking institution (but the same account number can appear in multiple banks).

Rossum lets you find the relevant documents by defining a list of AI Engine outputs that populate your fields (available on Queue.settings). Rossum searches all exported documents in your queue for fields initialized by the defined rir_field_names; if the values of those fields match the values on the newly imported document, Rossum considers the document a relevant candidate for data verification.

{
  "autopilot": {
    "enabled": true,
    "search_history": {
      "rir_field_names": ["sender_ic", "sender_dic", "account_num"]
    }
  }
}
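
Since this configuration lives in Queue.settings, you can apply it programmatically. The sketch below is a minimal, hypothetical example using Python and the requests library; the base URL, queue ID, and token are placeholders, and the GET-then-merge step is only a precaution so that updating "settings" does not drop unrelated keys. Adapt it to your own account and authentication setup.

import requests

BASE_URL = "https://api.elis.rossum.ai/v1"  # placeholder: your API base URL may differ
QUEUE_ID = 12345                            # placeholder queue ID
TOKEN = "your-api-token"                    # placeholder token

# Assumption: token-based auth header; adjust to the auth scheme you use.
headers = {"Authorization": f"token {TOKEN}"}

# Fetch the current settings first and merge, so the update keeps unrelated keys.
queue = requests.get(f"{BASE_URL}/queues/{QUEUE_ID}", headers=headers).json()
settings = queue.get("settings", {})

settings["autopilot"] = {
    "enabled": True,
    "search_history": {
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"]
    },
}

response = requests.patch(
    f"{BASE_URL}/queues/{QUEUE_ID}",
    headers=headers,
    json={"settings": settings},
)
response.raise_for_status()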

Stricter rules for selecting the relevant documents

You might decide to select the relevant group of documents by the sender_ic rir_field_name alone. However, Rossum can occasionally misread the vendor ID, e.g. when some digits are OCRed incorrectly, capturing "CZ444244" instead of "CZ444444". Naturally, you don't want to verify data against an incorrect group of documents. Therefore, in addition to having multiple unique fields identifying the vendor, you can set a minimum number of Data field matches between the newly captured document and the already exported documents.

{
  "autopilot": {
    "enabled": true,
    "search_history": {
      "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
      "matching_fields_threshold": 2
    }
  }
}

Such a restriction results in a well-verified set of relevant documents to check the new data against. With this setting, at least two of the values extracted under the specified rir_field_names must match between the newly imported document and an already exported document for that document to be used for data verification.
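
To make the threshold concrete, here is an illustrative Python sketch (not Rossum's actual implementation) of how a matching_fields_threshold of 2 behaves when one of the fields was misread:

SEARCH_FIELDS = ["sender_ic", "sender_dic", "account_num"]
MATCHING_FIELDS_THRESHOLD = 2

def is_relevant(new_doc, exported_doc):
    """Count how many of the searched fields match and compare to the threshold."""
    matches = sum(
        1
        for field in SEARCH_FIELDS
        if field in new_doc
        and field in exported_doc
        and new_doc[field] == exported_doc[field]
    )
    return matches >= MATCHING_FIELDS_THRESHOLD

# The vendor ID was OCRed incorrectly, but the VAT ID and account number still match:
new_doc = {"sender_ic": "CZ444244", "sender_dic": "CZ444444", "account_num": "12345678"}
exported_doc = {"sender_ic": "CZ444444", "sender_dic": "CZ444444", "account_num": "12345678"}
print(is_relevant(new_doc, exported_doc))  # True -- 2 of the 3 fields match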

Step 2: Automating export based on previously confirmed values

Let's assume there are 20 already exported documents with both account_number=12345678 and vendor_id=CZ444444 in your Rossum document history.

Now, we can look at selected fields on the newly imported document and on the selected exported documents, and decide whether some of the field values have already appeared so many times that we are confident the value is correct.

First, we decide which fields the automation should be based on. We have already pre-selected the relevant documents in step 1. Now, we define the list of rir_field_names that will be used for verifying the newly extracted values. Under the "automate_fields" key, you can define the list of AI Engine outputs. Every Data field initialized by one of the rir_field_names within the "automate_fields" object will be considered for automation by the History based data checks.

{
  "autopilot": {
    "enabled": true,
    "search_history": {
      "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
      "matching_fields_threshold": 2
    },
    "automate_fields": {
      "rir_field_names": [
        "account_num",
        "sender_ic",
        "sender_name"
      ]
    }
  }
}

In this example, in addition to the fields initialized by account_num and sender_ic, we also try to verify the value initialized by the "sender_name" AI output. Unlike fields such as Issue date and Due date, whose values can easily be additionally checked against a set of custom rules or built-in checks, a vendor name cannot be validated by a formula constructed from the data appearing on the document. In such cases, the autopilot feature comes in handy, verifying the correctness of the data based on the already reviewed documents.
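
To illustrate how the automate_fields list maps back to the example extraction schema above, here is a small hypothetical Python sketch (plain data, not a Rossum API call) resolving which schema fields would be considered for automation:

# rir_field_names per schema field, taken from the example extraction schema above
SCHEMA_FIELDS = {
    "account_number": ["account_num"],
    "vendor_id": ["sender_ic"],
    "vendor_vat_id": ["sender_vat_id", "sender_dic"],
    "vendor_name": ["sender_name"],
}
AUTOMATE_RIR_FIELDS = {"account_num", "sender_ic", "sender_name"}

automated = [
    schema_id
    for schema_id, rir_names in SCHEMA_FIELDS.items()
    if AUTOMATE_RIR_FIELDS.intersection(rir_names)
]
print(automated)  # ['account_number', 'vendor_id', 'vendor_name']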

Stricter rules for automating a field

Similarly to restricting the relevant documents, you can tell Rossum to verify a value only if it was seen at least N times among the relevant documents.

{
  "autopilot": {
    "enabled": true,
    "search_history": {
      "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
      "matching_fields_threshold": 2
    },
    "automate_fields": {
      "rir_field_names": [
        "account_num",
        "sender_ic",
        "sender_name"
      ],
      "field_repeated_min": 5
    }
  }
}

In this example, all three values (account number, vendor ID and vendor name) have to appear at least 5 times in the set of relevant documents to be verified on the newly imported document.
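
The following sketch (again illustrative only, not Rossum's internal code) shows how the field_repeated_min check can be thought of: a newly captured value counts as verified only if it has already been confirmed at least five times among the relevant documents.

from collections import Counter

FIELD_REPEATED_MIN = 5

def is_verified(new_value, historical_values):
    """True if the value appeared at least FIELD_REPEATED_MIN times in history."""
    return Counter(historical_values)[new_value] >= FIELD_REPEATED_MIN

# 20 relevant exported documents, all confirmed with the same vendor name:
history = ["ABC Company"] * 20
print(is_verified("ABC Company", history))  # True -- can be automated
print(is_verified("ABC Cmpany", history))   # False -- an unseen value still needs review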

How does this help to automate the documents?

If a field has been successfully verified against the set of relevant documents, the validation_sources of the field's datapoint will contain the "history" value.

"validation_sources": ["history"]

Thanks to this check, the Confident Automation feature considers the field to be verified and automates it regardless of the field's confidence score.
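
If you want to inspect this yourself, for example in a webhook or an extension, you can walk the annotation content tree and collect the datapoints that carry the "history" validation source. The sketch below assumes the usual nested structure with "category", "schema_id", "validation_sources" and "children" keys; adjust it to the exact payload you receive.

def history_verified_fields(content_nodes):
    """Collect schema_ids of datapoints verified by the history based checks."""
    verified = []
    for node in content_nodes:
        if node.get("category") == "datapoint" and "history" in node.get("validation_sources", []):
            verified.append(node.get("schema_id"))
        # sections, multivalues and tuples nest their datapoints under "children"
        verified.extend(history_verified_fields(node.get("children", [])))
    return verified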
