Email filtering by mime type, page size and filename with a function

If you use the Rossum email inbox, you may find that Rossum processes many irrelevant files, such as:

  • logos
  • documents with unnecessary mime types
  • files where you can decide by filename whether to process them or not

Imagine you could build your own smart email inbox filtering. The email.received hook event action makes this possible: it gives you access to the email headers and to the list of files that would be sent to Rossum.

You can implement custom logic for filtering the documents sent to Rossum over email either with a webhook or with a custom function. An example of such a smart email filtering function is shown below.

For the purpose of this article, create a new custom function.

Data payload sent to email received event action

The data payload sent to the email.received hook action is straightforward. The parts you will be most interested in are the list of files found in the email and the email header metadata.

{
  "request_id": "ae7bc8dd-73bd-489b-a3d2-f5214b209591",
  "timestamp": "2020-01-01T00:00:00.000000Z",
  "hook": "https://api.elis.rossum.ai/v1/hooks/781",
  "action": "received",
  "event": "email",
  "files": [
    {
      "id": "1",
      "filename": "image.png",
      "mime_type": "image/png",
      "n_pages": 1,
      "height_px": 100.0,
      "width_px": 150.0
    },
    {
      "id": "2",
      "filename": "MS word.docx",
      "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "n_pages": 1,
      "height_px": null,
      "width_px": null
    },
    {
      "id": "3",
      "filename": "A4 pdf.pdf",
      "mime_type": "application/pdf",
      "n_pages": 3,
      "height_px": 3510.0,
      "width_px": 2480.0
    },
    {
      "id": "4",
      "filename": "unknown_file",
      "mime_type": "text/xml",
      "n_pages": 1,
      "height_px": null,
      "width_px": null
    }
  ],
  "headers": {
    "from": "[email protected]",
    "to": "[email protected]",
    "subject": "Some subject",
    "date": "Mon, 04 May 2020 11:01:32 +0200",
    "message-id": "15909e7e68e4b5f56fd78a3b4263c4765df6cc4d"
  }
}

To test the function, you can use the function editor, which opens after you create a new function. You can also copy the data payload above and save it in the testing framework of the function editor, as shown in this article.

Example of email filtering function

The example function below filters documents from incoming emails by mime type, page size and filename. The comments in the function explain each step.

// This custom function example can be used for the email.received hook action and
// is called whenever a new email with attachments is received in the Rossum email inbox.
// Such an email inbox is always linked to a specific queue (https://api.elis.rossum.ai/docs/#inbox).

// The function below shows how to:
// 1. Filter out documents with mime types that should not be processed by Rossum
// 2. Filter out documents where page height and width is too low
// 3. Filter out documents where filename matches specific patterns

// The rejected files will not appear in Rossum at all.

// --- ROSSUM HOOK REQUEST HANDLER ---

// The rossum_hook_request_handler is an obligatory main function that accepts
// input and produces output of the Rossum custom function hook. Currently,
// the only available programming language is JavaScript executed in a Node.js 12 environment.
// @param {Array} files - List of attachments received in the email.
//   See https://api.elis.rossum.ai/docs/#email-received-event-data-format
//   where sample data sent to the email hook is shown in the right panel.
// @param {Object} headers - Dictionary of header fields found in the received email.
// @returns {Object} - Dictionary containing files that should be processed by Rossum.

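exports.rossum_hook_request_handler = ({
  files,
  headers
}) => {
    // Arguments are positional: (files, allowedMimeTypes)
    const allowedMimeTypes = ['application/pdf', 'image/png', 'image/jpeg'];
    files = filterByMimeTypes(files, allowedMimeTypes);

    // Arguments are positional: (files, minimal height in px, minimal width in px)
    files = filterBySize(files, 100, 100);

    // Arguments are positional: (files, rejectRegexRules)
    const rejectRegexRules = ['.*flier.*', '.*contract.*'];
    files = filterFilenameByRegexRules(files, rejectRegexRules);

    // Return list of files to be processed by Rossum
    return {
        "files": files
    };
};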

// --- HELPER FUNCTIONS ---

// Return files that match the given mime types
// @param {Array} files - list of files where each file is represented by a dictionary
// @param {Array} allowedMimeTypes - list of mime types Rossum should process
// @returns {Array} - list of files Rossum should process

const filterByMimeTypes = (files, allowedMimeTypes) => {
    return files.filter(file => allowedMimeTypes.includes(file.mime_type))
}

// Return files that comply with the specified height and width.
// Files without dimensions (height_px or width_px is null) are rejected as well,
// because null >= n is false for any positive n.
// @param {Array} files - list of files where each file is represented by a dictionary
// @param {Number} height_px - Minimal height of documents in pixels to be processed by Rossum
// @param {Number} width_px - Minimal width of documents in pixels to be processed by Rossum
// @returns {Array} - list of files Rossum should process
const filterBySize = (files, height_px, width_px) => {
    return files.filter(file => (file.height_px >= height_px) && (file.width_px >= width_px));
}

// Return files whose filenames do not match the reject rules
// @param {Array} files - list of files where each file is represented by a dictionary
// @param {Array} rejectRegexRules - list of reject rules for files that should not be processed by Rossum. A rule is represented by regular expression pattern 
// @returns {Array} - list of files Rossum should process
const filterFilenameByRegexRules = (files, rejectRegexRules) => {
    return files.filter(file => !rejectRegexRules.some(regex => new RegExp(regex).test(file.filename)))
}
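
Since rejected files never appear in Rossum, it can be hard to tell afterwards why an attachment went missing. The sketch below is one possible way to trace that, assuming the same three rules as the example function. The names rules and rejectionReason are hypothetical, and files 5 and 6 are invented test data added so that each rule is exercised at least once.

```javascript
// Rule values copied from the example function above.
const rules = {
  allowedMimeTypes: ['application/pdf', 'image/png', 'image/jpeg'],
  minHeightPx: 100,
  minWidthPx: 100,
  rejectRegexRules: ['.*flier.*', '.*contract.*'],
};

// Hypothetical helper: returns null when the file passes all rules,
// or a short reason naming the first rule that rejects it.
const rejectionReason = (file, rules) => {
  if (!rules.allowedMimeTypes.includes(file.mime_type)) {
    return 'mime type not allowed';
  }
  // null >= 100 is false, so files without dimensions fail this check too
  if (!(file.height_px >= rules.minHeightPx && file.width_px >= rules.minWidthPx)) {
    return 'page too small';
  }
  if (rules.rejectRegexRules.some((rule) => new RegExp(rule).test(file.filename))) {
    return 'filename matches a reject rule';
  }
  return null;
};

// Files 1 to 4 come from the sample payload; 5 and 6 are invented for illustration.
const sampleFiles = [
  { id: '1', filename: 'image.png', mime_type: 'image/png', height_px: 100.0, width_px: 150.0 },
  {
    id: '2',
    filename: 'MS word.docx',
    mime_type: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    height_px: null,
    width_px: null,
  },
  { id: '3', filename: 'A4 pdf.pdf', mime_type: 'application/pdf', height_px: 3510.0, width_px: 2480.0 },
  { id: '4', filename: 'unknown_file', mime_type: 'text/xml', height_px: null, width_px: null },
  { id: '5', filename: 'summer flier.pdf', mime_type: 'application/pdf', height_px: 3510.0, width_px: 2480.0 },
  { id: '6', filename: 'logo.png', mime_type: 'image/png', height_px: 40.0, width_px: 40.0 },
];

// One entry per file: the id and either null (accepted) or the rejection reason.
const rejectionReport = sampleFiles.map((file) => ({
  id: file.id,
  reason: rejectionReason(file, rules),
}));

// The files that would continue to Rossum.
const acceptedIds = rejectionReport
  .filter((entry) => entry.reason === null)
  .map((entry) => entry.id);

console.log(rejectionReport);
console.log(acceptedIds);
```

With these rules, only files 1 and 3 from the sample payload pass. Note that image.png sits exactly on the 100 px height limit and is kept because the comparison is inclusive.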

Setting up email received event action on a hook

As you have probably noticed, the email.received event action is not yet available in the UI when creating a new hook. However, you can set the "email.received" event on a hook over the API: https://api.elis.rossum.ai/docs/#update-part-of-a-hook.
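
As a rough illustration of that API call, the sketch below only builds the PATCH request and does not send it. It rests on two assumptions that should be checked against the API reference linked above: that the hook object exposes an "events" list, and that the API accepts an "Authorization: token <key>" header. The helper name buildEmailHookPatch and the placeholder token are hypothetical; the hook URL is the one from the sample payload.

```javascript
// Hypothetical helper: builds the request that subscribes a hook to email.received.
// Assumed, not confirmed by this article: the "events" field name and the
// "token <key>" authorization scheme.
const buildEmailHookPatch = (hookUrl, apiToken) => ({
  url: hookUrl,
  options: {
    method: 'PATCH',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `token ${apiToken}`,
    },
    // A PATCH that sets a list field typically replaces the whole list,
    // so include any other events the hook should keep.
    body: JSON.stringify({ events: ['email.received'] }),
  },
});

const patchRequest = buildEmailHookPatch(
  'https://api.elis.rossum.ai/v1/hooks/781',
  'YOUR_API_TOKEN' // placeholder, replace with a real API token
);

console.log(patchRequest.options.method, patchRequest.url);
console.log(patchRequest.options.body);
```

The resulting url and options can be passed to any HTTP client, for example fetch(patchRequest.url, patchRequest.options) on a Node.js version that ships fetch.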

Once the new extension is assigned to a queue, send a new email with various attachments to the Rossum inbox and see what happens!

What if the email hook fails

If the email hook fails, the documents from that email are still sent to Rossum so that you do not lose any data on the way. In such a case, Rossum applies very simple logic for document filtering: all mime types are accepted by default and very small logos are filtered out.

Do I always have to implement an email hook

If you only need simple document filtering by mime type and no other custom logic, you can set the list of allowed mime types on a queue, as explained here.