Extracting Data from Email Content with a Custom Function

When using Rossum's email inbox, your vendors might include the values of specific fields in the email body or subject, such as the invoice ID, PO number, or invoice category.

You can extend Rossum's default behavior to capture these values from the email content and pre-fill them into the selected fields of the extraction schema.

Define the mapping of email fields to the extraction schema

Rossum can use several values from an email to fill fields on the validation screen. Have a look at how to import such documents via email and use the fields in the extraction schema.

Rossum can also be extended with a custom extension that listens to the email.received event action, receives the email metadata, and tries to extract values from the email content. The example below shows how to tell Rossum's extraction schema that a field should be filled with the email value called email:category, which is listed in rir_field_names.

{
  "rir_field_names": [
    "email:category"
  ],
  "constraints": {
    "required": false
  },
  "default_value": null,
  "category": "datapoint",
  "id": "category",
  "label": "Category",
  "type": "string"
}
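
If you map several email values, the same datapoint shape repeats for each of them. The sketch below generates it from an email field name. The helper email_datapoint is hypothetical and only mirrors the JSON above. Deriving the schema id from the part after "email:" is an assumption, so pick ids that do not clash with fields already present in your schema.

```python
import json


def email_datapoint(email_field, label):
    """Hypothetical helper: build a schema datapoint pre-filled from an email value."""
    prefix, _, name = email_field.partition(":")
    if prefix != "email" or not name:
        raise ValueError("expected a name like 'email:category', got %r" % email_field)
    return {
        "rir_field_names": [email_field],
        "constraints": {"required": False},
        "default_value": None,
        "category": "datapoint",
        "id": name,  # assumption: reuse the part after "email:" as the schema id
        "label": label,
        "type": "string",
    }


# Print the datapoint for the second field used in this article.
print(json.dumps(email_datapoint("email:invoice_id", "Invoice ID"), indent=2))
```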

Example function for parsing values from email content

To extract custom values from the email content and propagate them to the annotation content, you should:

  • Create a custom function with the code shown below
  • Assign the function to the email.received event action
  • Assign the function to a specific queue

import re

settings = {
    "email_fields": [
        {"id": "email:category", "regexps": ["HEX[0-9]+"]},
        {"id": "email:invoice_id", "regexps": ["ID-[0-9]+"]}
    ]
}

def rossum_hook_request_handler(payload):
    """
    The rossum_hook_request_handler is the obligatory main function that accepts
    the input and produces the output of the Rossum custom function hook.
    :param payload: see https://api.elis.rossum.ai/docs/#annotation-content-event-data-format
    :return: dict with files to be processed
    """
    if payload["event"] == "email" and payload["action"] == "received":
        try:
            files = main(payload)
        except Exception as e:
            # On failure, return the files unchanged, in the same dict shape as on success.
            print("Serverless function exception: {0}".format(e))
            return {"files": payload["files"]}

        return {"files": files}


def main(payload):
    """
    Try to pass parsed values from email content to each of the documents to be processed.
    :param payload: dict representing the payload
    :return: list of files, each extended with the parsed "values" when any were found
    """
    incoming_files = payload["files"]
    email_subject = payload["headers"]["subject"]
    email_body = payload["body"]["body_text_plain"]

    accepted_files = []

    for file in incoming_files:
        
        print("Processing file: {0}".format(file))
        for field in settings["email_fields"]:
            
            print("Looking for field: {0}".format(field["id"]))
            # Join subject and body with a newline so that a match cannot span both.
            parsed_values = parse_values_from_text(email_subject + "\n" + email_body, field["regexps"])

            if parsed_values:
                
                if "values" not in file:
                    file["values"] = []
                
                print("Parsed values: {0}".format(parsed_values))
                file["values"].append({"id": field["id"], "value": ",".join(parsed_values)})

        accepted_files.append(file)
        
    print(accepted_files)

    return accepted_files

def parse_values_from_text(text, regexps):
    """
    Find all occurrences of the field's values defined by regular expressions.
    :param text: text to be searched
    :param regexps: list of regular expression patterns to search for
    :return: list of found values
    """
    values = []

    for regexp in regexps:
        matches = re.findall(regexp, text)

        values += matches

    return values
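
To see what the two patterns from settings capture, the short sketch below runs them against the subject and body of the sample email used in this article. It uses only the standard re module. The variable names are illustrative and not part of the function above.

```python
import re

# Subject and body copied from the sample email in this article.
subject = "Invoice ABC from email"
body = "This is my invoice for categories HEX10, HEX30. And the Invoice ID is ID-123456 "
text = subject + "\n" + body

# re.findall returns every non-overlapping match, in order of appearance.
categories = re.findall(r"HEX[0-9]+", text)
invoice_ids = re.findall(r"ID-[0-9]+", text)

# Multiple matches for one field are joined into a single comma-separated value.
category_value = ",".join(categories)
invoice_id_value = ",".join(invoice_ids)

print(category_value)    # HEX10,HEX30
print(invoice_id_value)  # ID-123456
```

Note that the patterns are not anchored, so ID-[0-9]+ would also match inside a longer token such as PAID-2020. If that is a concern for your vendors' emails, one option is to add a word boundary, for example \bID-[0-9]+.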

Testing Input

You can use the sample input below for testing your custom function in Rossum's developer UI.

{
  "request_id": "ae7bc8dd-73bd-489b-a3d2-f5214b209591",
  "timestamp": "2020-01-01T00:00:00.000000Z",
  "hook": "https://example.rossum.app/api/v1/hooks/781",
  "action": "received",
  "event": "email",
  "files": [
    {
      "id": "1",
      "filename": "image.png",
      "mime_type": "image/png",
      "n_pages": 1,
      "height_px": 50,
      "width_px": 150
    },
    {
      "id": "2",
      "filename": "MS word.docx",
      "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "n_pages": 30,
      "height_px": null,
      "width_px": null
    },
    {
      "id": "3",
      "filename": "agreement pdf.pdf",
      "mime_type": "application/pdf",
      "n_pages": 3,
      "height_px": 3510,
      "width_px": 2480
    },
    {
      "id": "4",
      "filename": "unknown_file",
      "mime_type": "application/pdf",
      "n_pages": 1,
      "height_px": null,
      "width_px": null
    }
  ],
  "headers": {
    "from": "[email protected]",
    "to": "[email protected]",
    "subject": "Invoice ABC from email",
    "date": "Mon, 04 May 2020 11:01:32 +0200",
    "message-id": "15909e7e68e4b5f56fd78a3b4263c4765df6cc4d"
  },
  "body": {
    "body_text_plain": "This is my invoice for categories HEX10, HEX30. And the Invoice ID is ID-123456 "
  }
}
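
If you want to check the expected result outside of Rossum first, the sketch below runs the same parsing logic locally against a trimmed copy of the sample input (one file, subject, and body only). The function enrich_files is a hypothetical, condensed stand-in for the main function of this article and not a Rossum API. Unlike main, it returns copies of the files instead of modifying them in place.

```python
import json
import re

# Trimmed copy of the sample input: one file plus the subject and body.
SAMPLE_PAYLOAD = json.loads("""
{
  "action": "received",
  "event": "email",
  "files": [{"id": "3", "filename": "agreement pdf.pdf", "mime_type": "application/pdf"}],
  "headers": {"subject": "Invoice ABC from email"},
  "body": {"body_text_plain": "This is my invoice for categories HEX10, HEX30. And the Invoice ID is ID-123456 "}
}
""")

EMAIL_FIELDS = [
    {"id": "email:category", "regexps": ["HEX[0-9]+"]},
    {"id": "email:invoice_id", "regexps": ["ID-[0-9]+"]},
]


def enrich_files(payload, email_fields):
    """Hypothetical condensed stand-in for main(): attach parsed values to each file."""
    text = payload["headers"]["subject"] + "\n" + payload["body"]["body_text_plain"]
    enriched = []
    for file in payload["files"]:
        values = []
        for field in email_fields:
            # Collect the matches of every pattern configured for this field.
            found = [m for rx in field["regexps"] for m in re.findall(rx, text)]
            if found:
                values.append({"id": field["id"], "value": ",".join(found)})
        # Only add the "values" key when something was actually parsed.
        enriched.append({**file, "values": values} if values else dict(file))
    return {"files": enriched}


response = enrich_files(SAMPLE_PAYLOAD, EMAIL_FIELDS)
print(json.dumps(response, indent=2))
```

For the sample email, the single file should come back with two values: email:category set to HEX10,HEX30 and email:invoice_id set to ID-123456.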