History based data checks

When considering the automation of documents (skipping human review), we should remember that there are two field categories. One category includes fields that change on every document from a particular vendor, such as Invoice ID, Issue Date, or Due Date. The second category contains fields that almost never change. Such fields include Bank account number, IBAN, or the vendor's VAT ID.

Knowing that some fields rarely change, we have enabled the option to compare the data captured from a new document with the data from already exported documents. If a value has already been seen many times for a given vendor, we can say with very high confidence that the newly captured value is correct and that the field can be automated.

Below, we explain in detail how the history-based data checks work, how they can be configured on a Queue to suit your needs, and how they affect the automation process.

Step 1: Verifying data against a relevant group of documents

When verifying data on newly processed documents, we should compare them only to a relevant set of exported documents. In the Accounts Payable domain, the relevant set consists of documents issued by the same vendor.

Suppose you have the following fields defined in your Extraction schema:

[
  {
    "rir_field_names": [
      "account_num"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "account_number",
    "label": "Account number",
    "hidden": false,
    "type": "string",
    "can_export": false
  },
  {
    "rir_field_names": [
      "sender_ic"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_id",
    "label": "Vendor company ID",
    "hidden": false,
    "type": "string",
    "can_export": false
  },
  {
    "rir_field_names": [
      "sender_vat_id",
      "sender_dic"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_vat_id",
    "label": "Vendor VAT number",
    "type": "string"
  },
  {
    "rir_field_names": [
      "sender_name"
    ],
    "constraints": {
      "required": false
    },
    "default_value": null,
    "category": "datapoint",
    "id": "vendor_name",
    "label": "Vendor name",
    "type": "string"
  }
]

Now, imagine that a new document comes in, and Rossum's AI Engine automatically captures the following values:

  • account_number=12345678
  • vendor_id=CZ444444
  • vendor_name=ABC Company

The fields vendor_id and vendor_vat_id each identify a specific vendor uniquely, so they are good candidates for finding documents sent by the same vendor. The Data field "account_number" can help as well, but only partially: a bank account number is unique within one banking institution, yet the same number can appear at multiple banks.

Rossum lets you find the relevant documents by defining a list of AI Engine outputs that populate your fields (the list is set in Queue.settings). Rossum searches all exported documents in your queue for fields that were initialized by the defined rir_field_names. If the values of those fields match the values on the newly imported document, Rossum treats the exported document as a relevant candidate for data verification.

{
  "autopilot": {
    "enabled": true,
    "search_history": {
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"]
    }
  }
}

More strict rules for getting the relevant documents

You might decide to select the relevant group of documents by the sender_ic rir_field_name alone. However, Rossum might sometimes misread the vendor ID, for example when some digits are OCRed incorrectly, and capture the value "CZ444244" instead of "CZ444444". Of course, you don't want to verify data against an incorrect group of documents. Therefore, in addition to listing multiple unique fields that define a vendor, you can set the minimum number of Data fields that must match between the newly captured document and an already exported document.

{
  "autopilot": {
    "enabled": true,

    "search_history": {
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
        "matching_fields_threshold": 2
    }
  }
}

With this restriction, an exported document is used for data verification only if at least two of the values extracted under the specified rir_field_names match between it and the newly imported document. The result is a well-verified set of relevant documents to check the new data against.
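
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not Rossum's implementation: the function names and the dictionary representation of a document (rir_field_name mapped to captured value) are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical representation: each document maps an rir_field_name
# to the value that was captured for it.
Document = Dict[str, str]


def count_matching_fields(new_doc: Document, exported_doc: Document,
                          rir_field_names: List[str]) -> int:
    """Count configured fields that are present on the new document and
    carry the same value on the exported document."""
    return sum(
        1
        for name in rir_field_names
        if name in new_doc and new_doc[name] == exported_doc.get(name)
    )


def find_relevant_documents(new_doc: Document, exported_docs: List[Document],
                            rir_field_names: List[str],
                            matching_fields_threshold: int) -> List[Document]:
    """Keep exported documents that agree on at least the threshold number of fields."""
    return [
        doc for doc in exported_docs
        if count_matching_fields(new_doc, doc, rir_field_names)
        >= matching_fields_threshold
    ]


search_history = {
    "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
    "matching_fields_threshold": 2,
}
exported = [
    {"sender_ic": "CZ444444", "account_num": "12345678"},
    {"sender_ic": "CZ444244", "account_num": "99999999"},
]

# Vendor ID read correctly: the first exported document matches on two fields.
good_read = {"sender_ic": "CZ444444", "account_num": "12345678"}
relevant_good = find_relevant_documents(good_read, exported, **search_history)

# Vendor ID misread as CZ444244: each exported document matches on one field only.
bad_read = {"sender_ic": "CZ444244", "account_num": "12345678"}
relevant_bad = find_relevant_documents(bad_read, exported, **search_history)

print(len(relevant_good), len(relevant_bad))
```

With a threshold of 2, the misread vendor ID yields no relevant documents at all, so the new data is not verified against the wrong vendor's history.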

Step 2: Automating export based on previously confirmed values

Let's assume there are 20 already exported documents with both account_number=12345678 and vendor_id=CZ444444 in your Rossum document history.

Next, we compare selected fields on the newly imported document with the same fields on the relevant exported documents. If a value has appeared often enough, we can be very confident that it is correct.

First, decide which fields should be verified this way. The relevant documents were already pre-selected in step 1. Now define the list of rir_field_names used for verifying the newly extracted values: under the key "automate_fields", list the AI Engine outputs. Every Data field initialized by one of the rir_field_names in the "automate_fields" object will be considered for automation by the History based data checks.

{
  "autopilot": {
    "enabled": true,

    "search_history":{
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
        "matching_fields_threshold": 2
    },
    "automate_fields":{
        "rir_field_names": [
            "account_num",
            "sender_ic",
            "sender_name"
        ]
    } 
  }
}

In this example, in addition to the fields initialized with account_num and sender_ic, we also try to verify the value initialized by the "sender_name" AI Engine output. Values such as Issue date and Due date can easily be checked against custom rules or built-in checks. A vendor name, by contrast, cannot be easily validated by a formula built from data appearing on the document. In such cases the autopilot feature comes in handy, because it verifies the data against documents that were already reviewed.

More strict rules for automating a field

Just as when restricting the relevant documents, you can tell Rossum to verify a value only if it was seen at least N times among the relevant documents.

{
  "autopilot": {
    "enabled": true,

    "search_history":{
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
        "matching_fields_threshold": 2
    },
    "automate_fields":{
        "rir_field_names": [
            "account_num",
            "sender_ic",
            "sender_name"
        ],
        "field_repeated_min": 5
    } 
  }
}

In this example, each of the three values (account number, vendor ID and vendor name) has to appear at least 5 times in the set of relevant documents to be verified on the newly imported document.
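
The repetition rule can be illustrated with a short Python sketch. It is a simplified model of the behaviour described above, not Rossum's implementation: the function name, the dictionary representation of a document, and the sample vendor names are invented for the example.

```python
from typing import Dict, List

# Hypothetical representation: rir_field_name -> captured value.
Document = Dict[str, str]


def fields_verified_by_history(new_doc: Document, relevant_docs: List[Document],
                               rir_field_names: List[str],
                               field_repeated_min: int) -> List[str]:
    """Return the configured fields whose new value was already seen at least
    field_repeated_min times among the relevant documents."""
    verified = []
    for name in rir_field_names:
        value = new_doc.get(name)
        if value is None:
            continue  # nothing was captured for this field
        times_seen = sum(1 for doc in relevant_docs if doc.get(name) == value)
        if times_seen >= field_repeated_min:
            verified.append(name)
    return verified


automate_fields = {
    "rir_field_names": ["account_num", "sender_ic", "sender_name"],
    "field_repeated_min": 5,
}

# 20 relevant documents; the name was spelled "ABC Company" on only 4 of them.
relevant_docs = [
    {
        "account_num": "12345678",
        "sender_ic": "CZ444444",
        "sender_name": "ABC Company" if i < 4 else "ABC Company Ltd.",
    }
    for i in range(20)
]
new_doc = {
    "account_num": "12345678",
    "sender_ic": "CZ444444",
    "sender_name": "ABC Company",
}

verified = fields_verified_by_history(new_doc, relevant_docs, **automate_fields)
print(verified)  # ['account_num', 'sender_ic']
```

The account number and vendor ID were each seen 20 times, so they pass. The vendor name was seen only 4 times in that exact spelling, which is below the minimum of 5, so it is left for other checks or human review.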

How to configure the History based data checks?

The history based data checks can currently be configured only over the API. Fill in the HTTP request below with your custom parameters to update the configuration.

curl -X PATCH -H 'Authorization: token <YOUR_TOKEN>' -H 'Content-Type: application/json' \
  -d '{"settings": {
  "autopilot": {
    "enabled": true,
    "search_history": {
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"]
    }
  }
}}' \
  'https://example.rossum.app/api/v1/queues/<YOUR_QUEUE_ID>'
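
If you prefer to build the request from a script, serializing the configuration with a JSON library avoids quoting and bracket mistakes in a hand-written body. The sketch below uses only the Python standard library and prepares the request without sending it. It assumes that the settings value is sent as a nested JSON object; the helper name, the queue ID 12345 and the token string are placeholders.

```python
import json
import urllib.request


def build_settings_patch(base_url: str, queue_id: str, token: str,
                         autopilot: dict) -> urllib.request.Request:
    """Prepare (but do not send) a PATCH request that updates the autopilot settings."""
    body = json.dumps({"settings": {"autopilot": autopilot}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/queues/{queue_id}",
        data=body,
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


autopilot = {
    "enabled": True,
    "search_history": {
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
    },
}

# Placeholder values: replace the queue ID and token with your own.
request = build_settings_patch(
    "https://example.rossum.app/api/v1", "12345", "YOUR_TOKEN", autopilot
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send the request, pass it to urllib.request.urlopen(request).
```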

How does this help to automate the documents?

If a field has been successfully verified against the set of relevant documents, the validation_sources of the field's datapoint will contain the value "history".

validation_sources:["history"]

Thanks to this check, the Confident Automation feature considers the field verified and automates it regardless of the field's confidence score.
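
When inspecting a processed document, you can list the fields that the check verified by looking for "history" in each datapoint's validation_sources. The sketch below assumes, for illustration only, that each datapoint is available as a dictionary with a "schema_id" key and a "validation_sources" list. The sample data is invented.

```python
from typing import Dict, List


def verified_by_history(datapoints: List[Dict]) -> List[str]:
    """Return the schema IDs of datapoints whose validation_sources contain "history"."""
    return [
        dp["schema_id"]
        for dp in datapoints
        if "history" in dp.get("validation_sources", [])
    ]


# Invented sample: two fields verified by history, one not verified at all.
datapoints = [
    {"schema_id": "account_number", "validation_sources": ["history"]},
    {"schema_id": "vendor_id", "validation_sources": ["history"]},
    {"schema_id": "vendor_name", "validation_sources": []},
]

print(verified_by_history(datapoints))  # ['account_number', 'vendor_id']
```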

Combining history based data checks with other automation components

Learn how the different automation components interact to automate documents, and how automation works with hidden and required fields.

How are history based data checks configured by default?

History based data checks are enabled on your queue from day zero. The default configuration is as follows:

{
  "autopilot": {
    "enabled": true,

    "search_history":{
        "rir_field_names": ["sender_ic", "sender_dic", "account_num"],
        "matching_fields_threshold": 2
    },
    "automate_fields":{
        "rir_field_names": [
            "account_num",
            "bank_num",
            "iban",
            "bic",
            "sender_dic",
            "sender_ic",
            "recipient_dic",
            "recipient_ic",
            "const_sym"
        ],
        "field_repeated_min": 5
    } 
  }
}

As you can see, the default setting for finding documents from the same vendor is already fairly conservative. Over the API, you can update the list of rir_field_names and the other parameters to match your use case.

Patch the Queue.settings with the updated configuration in order to use it on newly imported documents.