Case study: PDF data extraction

Category

Case Studies

Author

Wissen Team

Date

April 28, 2023

Business need

Our client, a leading global manufacturing firm, needed to automate the extraction of data from purchase orders. Many of the firm’s customers were loading their purchase orders, in PDF format, into the firm’s CRM system. These PDFs came in various regional languages and in different formats. Extracting the data from these PDFs and creating the purchase orders was a manual step, and it existed in all geographies. According to the firm’s research, manual data extraction could cost up to 9 EUR per document. The firm was looking for an ingenious solution that would automate this data extraction process across all formats and languages, with minimal manual intervention.

Solution

Wissen’s Machine Learning expertise and experience were used to create a novel solution for this problem. The solution would “learn” from a few examples of each format and would then be able to extract the data from purchase orders in that format. The system used spatial relations and typographical information in the PDF to learn how to extract data from the required fields. The main feature of the solution was that once it had learned from a few examples of a particular format, it could automatically extract the data from other documents of the same format going forward. The system was language agnostic and scalable. It could extract complex information both from the header, such as the order number, and from the line items, such as item number, price and quantity. Wissen also has a separate Optical Character Recognition (OCR) solution that can extract data from printed text in scanned images. This OCR solution could be combined with the PDF Data Extractor solution to extract data from scanned purchase orders.
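The idea of learning a format from a few examples can be illustrated with a small sketch. This is not Wissen's implementation, which the case study does not describe in detail. It is a deliberately simplified illustration that assumes the PDF text has already been parsed into tokens with page coordinates and a bold flag. The `Token` class, the field names and the sample German purchase orders are all hypothetical. The sketch uses a bold label near each annotated value as an anchor (typographical information), learns the average offset from label to value (spatial relation), and reuses that offset on a new document of the same format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Token:
    """Hypothetical parsed PDF token: text, page position, and a typographic cue."""
    text: str
    x: float
    y: float
    bold: bool = False


def sq_dist(ax, ay, bx, by):
    return (ax - bx) ** 2 + (ay - by) ** 2


def learn_template(examples):
    """Learn, per field, an anchor label and the mean offset from label to value.

    examples: list of (tokens, {field_name: value_token}) pairs for ONE format.
    """
    observations = {}
    for tokens, labels in examples:
        bold_tokens = [t for t in tokens if t.bold]
        for field, value in labels.items():
            # Assumption: the closest bold token is the label for this value.
            anchor = min(bold_tokens,
                         key=lambda t: sq_dist(t.x, t.y, value.x, value.y))
            observations.setdefault(field, []).append(
                (anchor.text, value.x - anchor.x, value.y - anchor.y))

    template = {}
    for field, obs in observations.items():
        anchor_text = obs[0][0]
        same = [o for o in obs if o[0] == anchor_text]
        template[field] = (anchor_text,
                           mean(o[1] for o in same),
                           mean(o[2] for o in same))
    return template


def extract(template, tokens):
    """Apply a learned template to a new document of the same format."""
    values = [t for t in tokens if not t.bold]
    result = {}
    for field, (anchor_text, dx, dy) in template.items():
        anchor = next((t for t in tokens
                       if t.bold and t.text == anchor_text), None)
        if anchor is None or not values:
            continue  # label not found: leave the field for manual review
        ex, ey = anchor.x + dx, anchor.y + dy
        best = min(values, key=lambda t: sq_dist(t.x, t.y, ex, ey))
        result[field] = best.text
    return result


def make_doc(order_no, date, shift):
    """Build a toy purchase order; `shift` mimics small layout differences."""
    tokens = [
        Token("Bestellnummer", 50, 40 + shift, bold=True),
        Token(order_no, 150 + shift, 40 + shift),
        Token("Datum", 50, 60 + shift, bold=True),
        Token(date, 150 + shift, 60 + shift),
    ]
    return tokens, {"order_number": tokens[1], "order_date": tokens[3]}


# Two annotated examples of the same format, then one unseen document.
examples = [make_doc("PO-1001", "2023-01-15", 0),
            make_doc("PO-1002", "2023-02-01", 2)]
template = learn_template(examples)

new_doc, _ = make_doc("PO-2077", "2023-03-09", 1)
result = extract(template, new_doc)
print(result)
```

Because the template matches labels by their exact text and positions rather than by their meaning, the same code would work unchanged for labels in any language, which is one plausible reading of "language agnostic". A production system would need far more than this sketch: tolerance for label variations, handling of repeating line-item rows, and a confidence score to decide when to fall back to manual review.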