Understanding Invoices with Document AI

At AppFolio, we specialize in helping Property Management Companies automate their most repetitive, time-consuming, and tedious processes. Accounting-related tasks often check all of these boxes, and one of them is maintaining an accurate system of record for accounts payable and accounts receivable. This is a critical task due to strict federal and state regulations, and to ensure it is carried out accurately, our customers use AppFolio’s Property Management (APM) software to enter invoices. However, these bills must be entered manually, which is a time-consuming and error-prone process.

To help, we created Smart Bill Entry, a tool powered by state-of-the-art Machine Learning (ML) models and our Document AI technology. Smart Bill Entry automatically extracts the most important information from the multitude of invoices a Property Manager receives in a given month. In 2022 alone, we processed over 10 million invoices. Here we will explore the tools enabling this technology and explain how it works behind the scenes.

Smart Bill Entry

Smart Bill Entry enables Property Managers to upload invoices in PDF format to APM directly via drag and drop, or by forwarding an email to a specific inbox. Once the invoice is uploaded, our ML models automatically extract key information, such as the amount, vendor name, the property’s address, and the invoice number associated with the bill.

We know Property Managers place a heavy emphasis on accuracy to ensure that bills get paid correctly. By our estimates, humans are roughly 95-99% accurate when manually entering invoices, depending on the field. Our models are designed to mirror this accuracy, and we compute calibrated confidence scores for each field so that the predictions we show are as close to human accuracy as possible.

The model outputs a number between 0 and 1, where a score closer to 1 indicates that the model is confident that its prediction is correct. However, ML models tend to be over- or under-confident in their predictions, meaning that a given score doesn’t necessarily correspond to the probability that the prediction is correct, as illustrated in the diagram below.

ML models tend to be over- or under-confident, such that their confidence scores don’t exactly correspond to the probability that the prediction is correct. To account for this, we fit additional calibration models so that we can make consistent decisions about whether or not to show predictions. Image credit: “On Calibration of Modern Neural Networks”

Calibration is the process of transforming the confidence score predicted by the model into a reliable probability that can be used to make decisions, like showing or not showing the prediction to the user. We do this as an added layer of precision. If our calibrated model is returning a confidence score of 0.95 for the “Bill Amount” field, we can expect that the model will be correct 95 times out of 100 predictions.
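
To make the idea concrete, here is a minimal sketch of one simple calibration technique, histogram binning. The post does not say which calibration method Smart Bill Entry uses, so the method, the function names, the bin count, and the 0.95 display threshold below are all assumptions chosen for illustration.

```python
def _bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width bins."""
    return min(int(score * n_bins), n_bins - 1)


def fit_histogram_binning(scores, correct, n_bins=10):
    """Learn the empirical accuracy of each score bin from held-out data.

    scores  -- raw model confidences in [0, 1]
    correct -- booleans, True where the prediction was right
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, is_correct in zip(scores, correct):
        b = _bin_index(score, n_bins)
        hits[b] += int(is_correct)
        counts[b] += 1
    # Empty bins fall back to their midpoint (an arbitrary choice for this sketch).
    return [
        hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, bin_accuracy):
    """Replace a raw score with the observed accuracy of its bin."""
    return bin_accuracy[_bin_index(score, len(bin_accuracy))]


def should_show(score, bin_accuracy, threshold=0.95):
    """Show a prediction only if its calibrated probability clears the threshold."""
    return calibrate(score, bin_accuracy) >= threshold


# Toy held-out set from an over-confident model: scores above 0.9,
# but only 3 of those 4 predictions were actually right.
held_out_scores = [0.92, 0.95, 0.97, 0.99, 0.55, 0.52]
held_out_correct = [True, True, True, False, True, False]
table = fit_histogram_binning(held_out_scores, held_out_correct)
print(calibrate(0.96, table), should_show(0.96, table))
```

In this toy data a raw score of 0.96 maps to a calibrated probability of 0.75, so the prediction would be held back even though the raw score looks high.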

Document AI

The technology driving Smart Bill Entry’s ability to extract information from PDFs is Document AI. Document AI is a set of Machine Learning tools that are specifically designed for extracting information from a PDF, independent of the layout. The input data used to train our Document AI models includes two types of information: 

  1. The Optical Character Recognition (OCR) information of the documents. This is the typed or written text on a document converted into machine-encoded text. Since we are only interested in fields that are critical to the invoice, we use datasets that zero in on this information.

  2. The bounding box coordinates and content of the target fields we are looking to extract from the documents. We derive labels from these coordinates. To get the labels (e.g. “Vendor Name”, “Property Name”, etc.) for the content of each bounding box, human supervision is necessary: the operators who assess the content of the invoices assign the appropriate label.
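
As a rough illustration of how labels might be derived from these two inputs, the sketch below gives each OCR token the label of the annotated field whose bounding box contains the token's center. The class names, the coordinate convention, and the "O" label for unlabeled tokens are assumptions made for this example, not a description of our actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class OcrToken:
    """One piece of machine-encoded text and where it sits on the page."""
    text: str
    box: Box


@dataclass(frozen=True)
class TargetField:
    """A human-annotated field: its label and bounding box."""
    label: str
    box: Box


def label_tokens(tokens, fields, default="O"):
    """Give each token the label of the first field containing its center."""
    labels = []
    for token in tokens:
        cx = (token.box.x0 + token.box.x1) / 2
        cy = (token.box.y0 + token.box.y1) / 2
        label = next(
            (f.label for f in fields if f.box.contains(cx, cy)), default
        )
        labels.append(label)
    return labels


tokens = [
    OcrToken("Acme", Box(10, 10, 50, 20)),
    OcrToken("Plumbing", Box(55, 10, 110, 20)),
    OcrToken("Total", Box(10, 200, 40, 210)),
    OcrToken("$120.00", Box(45, 200, 90, 210)),
]
fields = [
    TargetField("Vendor Name", Box(5, 5, 115, 25)),
    TargetField("Amount", Box(43, 198, 95, 212)),
]
print(list(zip((t.text for t in tokens), label_tokens(tokens, fields))))
```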

The image below is a visual representation of an invoice, the OCR information identified by the model for that invoice, and what labels would be generated for each object that is present on the invoice. 

Original Invoice vs OCR Information vs Labels

After collecting the OCR and bounding box dataset, we then apply different techniques that combine Computer Vision, Natural Language Processing, and Tabular Features. In the next section, we’ll discuss two of our approaches to extracting the above information, depending on our customer’s specific needs.  

Traditional vs Deep Learning

Smart Bill Entry combines two approaches to extracting information from invoices: traditional Machine Learning solutions and advanced Deep Learning (DL) solutions. For example, when extracting the “Property Address” and “Vendor Name” fields, we use tree-based models customized for each Property Manager. When we extract generic fields, such as the “Amount” and “Invoice Number”, we use powerful DL models that can take advantage of layout and text using state-of-the-art Transformer architectures.

Traditional Machine Learning 

We extract the “Property Address” and “Vendor Name” fields from an invoice using traditional ML models. The input data for training the model is a simple Bag-of-words representation of the invoices, together with other engineered features relating to the layout. After extensive benchmarking, we landed on a multi-class Random Forest Classifier as our base estimator. It fits several decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control overfitting.
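
The feature side of this setup can be sketched in a few lines. The example below builds a Bag-of-words vector and appends layout features to it; the whitespace tokenization, the function names, and the single layout feature are assumptions for illustration, and the classifier itself is left out. In practice, vectors like these would be passed to the Random Forest for training.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase word to a column index, in sorted order."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def featurize(text, vocabulary, layout_features=()):
    """Return word counts in vocabulary order, followed by layout features.

    Words outside the vocabulary are ignored. layout_features is any
    sequence of numbers (hypothetical here, e.g. a normalized position).
    """
    counts = Counter(text.lower().split())
    bag_of_words = [counts.get(word, 0) for word in vocabulary]
    return bag_of_words + list(layout_features)


training_texts = ["Acme Plumbing invoice", "Bright Electric invoice invoice"]
vocabulary = build_vocabulary(training_texts)
# Columns: acme, bright, electric, invoice, plumbing, then one layout feature.
print(featurize("Acme invoice invoice unknown", vocabulary, layout_features=(0.1,)))
```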

Because AppFolio is a Property Management software used by thousands of companies (each with a large number of vendors and properties in their database), training only one multi-class Random Forest Classifier for all the vendors and properties in our database is a very challenging feat due to the high cardinality in the target variable and the challenge of deduplicating entities.

We train separate Random Forest Classifiers for fields that have a unique set of target classes for each customer.

To tackle this, we decided to train one model per Property Management company. Not surprisingly, the drawback of this approach is that it requires a comprehensive and robust ML infrastructure to efficiently maintain and deploy thousands of models to production. It also involves a short cold start phase in which each model learns how to correctly map to specific vendor and property entities, and it means that each individual model needs to be relatively lightweight and fast to train.

At AppFolio, we combined in-house solutions with third-party MLOps solutions to create a state-of-the-art ML infrastructure that helps every team quickly train, deploy, and monitor ML models at scale in production. We will talk more about our infrastructure in future posts on this blog.
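
As a toy illustration of the one-model-per-company idea and its cold start phase, the sketch below keeps a lookup from company to model and returns no prediction for a company that does not yet have enough training data. The class name, the minimum-example cutoff, and the use of plain callables as models are all hypothetical and do not describe our production infrastructure.

```python
class PerCompanyModels:
    """Route each prediction request to that company's own model."""

    def __init__(self, min_examples=50):
        # Hypothetical cutoff: below this, a company is still in cold start.
        self.min_examples = min_examples
        self._models = {}

    def register(self, company_id, model, n_training_examples):
        """Store a trained model only once it has seen enough examples."""
        if n_training_examples >= self.min_examples:
            self._models[company_id] = model

    def predict(self, company_id, features):
        """Return the company model's prediction, or None during cold start."""
        model = self._models.get(company_id)
        if model is None:
            return None
        return model(features)


registry = PerCompanyModels(min_examples=2)
registry.register("company-a", lambda features: "Acme Plumbing", 5)
registry.register("company-b", lambda features: "Bright Electric", 1)
print(registry.predict("company-a", [1, 0, 2]))  # has a model
print(registry.predict("company-b", [0, 1, 1]))  # still in cold start
```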

Deep Learning

For fields that have well-defined labels across all customers, such as the “Amount” and “Invoice Number”, we opted for a single Deep Learning model shared across all Property Management companies, as opposed to one model per company. This approach generalizes well to unseen layouts and eliminates cold start issues.

Because the Deep Learning model has a very high learning capacity, we were able to leverage almost all available training data in our database, which makes its predictions much more accurate than those of traditional ML models. We also benchmarked against off-the-shelf Document AI solutions and found that our models significantly outperform them when evaluated on our holdout data.

Deep Learning Architecture

When deciding which DL architecture to implement for our model, we tried numerous approaches including: 

  • Computer Vision models that use the image of an invoice as their input and output a bounding box for each class

  • Natural Language Processing models that start from the OCR and classify each bounding box according to one of the given classes, and

  • Multimodal models that use as input both the OCR and the image of the invoice. 

We implemented our models in such a way that we can exchange the architecture without modifying the input data and output format. In the image below, we show how different architectures (the yellow boxes in the diagram) can be exchanged without affecting the first and last layers of the model. This gives us the flexibility and agility to test different solutions and optimize our metrics. 

By keeping the input and output fixed, we can easily switch between Deep Learning architectures to optimize metrics such as accuracy, training time, and inference time. This also gives us the flexibility to quickly try out new approaches as the field advances.
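
The fixed-input, fixed-output design can be sketched with a small interface. In the example below, every backbone takes the same list of OCR tokens and returns one score per label for each token, so the surrounding extractor never changes when the backbone is swapped. The two backbones are deliberately trivial stand-ins, and all names here are assumptions for illustration, not our actual model code.

```python
from typing import List, Protocol, Sequence

LABELS = ["Other", "Amount"]


class Backbone(Protocol):
    """Anything that turns tokens into one score per label, per token."""

    def score(self, tokens: Sequence[str]) -> List[List[float]]:
        ...


class DigitBackbone:
    """Toy stand-in: favors "Amount" for any token containing a digit."""

    def score(self, tokens):
        return [
            [0.2, 0.8] if any(ch.isdigit() for ch in t) else [0.9, 0.1]
            for t in tokens
        ]


class CurrencyBackbone:
    """Toy stand-in: favors "Amount" only for tokens starting with "$"."""

    def score(self, tokens):
        return [[0.1, 0.9] if t.startswith("$") else [0.9, 0.1] for t in tokens]


class FieldExtractor:
    """Fixed input and output handling around an exchangeable backbone."""

    def __init__(self, backbone: Backbone):
        self.backbone = backbone

    def predict(self, tokens):
        scores = self.backbone.score(tokens)
        # Output layer: pick the highest-scoring label for each token.
        return [
            LABELS[max(range(len(LABELS)), key=row.__getitem__)]
            for row in scores
        ]


tokens = ["Invoice", "42", "$120.00"]
for backbone in (DigitBackbone(), CurrencyBackbone()):
    print(type(backbone).__name__, FieldExtractor(backbone).predict(tokens))
```

Because both backbones satisfy the same interface, comparing them is a one-line change to the constructor call rather than a change to the data pipeline.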

When considering which architecture to deploy to production, we chose the solution that best balances accuracy with training time and inference speed. We use two processes to evaluate the performance of each model:

1. Offline Evaluation
After training a new model, we compare its performance against a frozen and static dataset that we use as a benchmark.

2. Online Evaluation
After training a new model, we deploy it in shadow-mode, where we do not show the predictions to the user, but rather just record them and compare metrics. 
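
A minimal sketch of the shadow-mode comparison follows. It assumes that the value the user finally enters serves as ground truth and that accuracy is the metric being compared; both are assumptions for this example, as are the class and method names.

```python
def accuracy(predictions, truths):
    """Fraction of predictions that exactly match the ground truth."""
    if not predictions or len(predictions) != len(truths):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == t for p, t in zip(predictions, truths))
    return matches / len(predictions)


class ShadowLog:
    """Record production and candidate predictions side by side.

    Only the production prediction would be shown to the user; the
    candidate's prediction is stored silently for later comparison.
    """

    def __init__(self):
        self.records = []

    def record(self, production_pred, candidate_pred, user_value):
        self.records.append((production_pred, candidate_pred, user_value))

    def compare(self):
        """Return the accuracy of each model against user-entered values."""
        truths = [r[2] for r in self.records]
        return {
            "production": accuracy([r[0] for r in self.records], truths),
            "candidate": accuracy([r[1] for r in self.records], truths),
        }


log = ShadowLog()
log.record("120.00", "120.00", "120.00")
log.record("75.10", "75.00", "75.00")
log.record("9.99", "9.99", "9.99")
log.record("41.00", "40.00", "40.00")
print(log.compare())
```

The same accuracy function could be applied to the frozen benchmark dataset for the offline evaluation step.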

Stay tuned for a future blog post where we will discuss the details of our ML infrastructure, and how we train, evaluate, deploy and monitor each model!

Authors and contributors: Ezequiel Esposito, Ari Polakof, Christfried Focke, Tony Froccaro