PDF Tables and Converting PDF to Excel
Counterintuitive as it may seem, identifying and extracting data from PDF tables is a more complex problem than simply converting PDF to Excel.
Converting PDF to Excel or PDF to XML is necessary whenever we want to extract information from PDF files in an editable or machine-readable form. Most often, however, the data we actually want to extract from a PDF file into Excel is nothing other than PDF tables.
Identifying PDF tables, and tables in general, is a complex computer science problem. The reason is that what is apparent to the human eye is not apparent to a computer. In a software algorithm, every element of human cognition has to be broken down into logical steps and implemented as a repeatable process.
The challenges in converting PDF to Excel when PDF tables are present in the document fall into three categories:
- Identifying where the table is within the document
- Identifying the document contour
- Correctly identifying the elements of the PDF table
  - The meaning of “,” in a number
  - All the elements of a number
  - The separation between units and number
  - The separation between numbers and labels, e.g. $, Eu…
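To make the last group of challenges concrete, here is a minimal Python sketch of how a single table cell might be split into a number and its label. The names `parse_cell`, `ParsedCell` and `decimal_comma` are hypothetical and chosen only for illustration, and the rule used to guess the meaning of “,” is a simple assumption, not a complete solution.

```python
import re
from typing import NamedTuple, Optional


class ParsedCell(NamedTuple):
    value: float
    label: str  # currency symbol or unit text around the number, "" if none


# A run of digits, dots and commas, optionally preceded by a sign.
_NUMBER = re.compile(r"[-+]?\d[\d.,]*")


def _guess_decimal_comma(digits: str) -> bool:
    """Guess whether ',' is the decimal mark (assumed rule, not exhaustive)."""
    last_comma = digits.rfind(",")
    last_dot = digits.rfind(".")
    if last_comma == -1:
        return False
    if last_dot != -1:
        # Both separators present: the rightmost one is the decimal mark.
        return last_comma > last_dot
    # Only commas: a single comma not followed by exactly three digits
    # is treated as a decimal mark; anything else as a thousands separator.
    tail = digits[last_comma + 1:]
    return digits.count(",") == 1 and len(tail) != 3


def parse_cell(text: str, decimal_comma: Optional[bool] = None) -> Optional[ParsedCell]:
    """Split a cell such as '$1,234.56' into a numeric value and a label."""
    match = _NUMBER.search(text)
    if match is None:
        return None  # no number in this cell
    # Everything outside the numeric run is kept as the label.
    label = (text[:match.start()] + text[match.end():]).strip()
    raw = match.group().rstrip(".,")
    sign = -1.0 if raw.startswith("-") else 1.0
    digits = raw.lstrip("+-")
    if decimal_comma is None:
        decimal_comma = _guess_decimal_comma(digits)
    if decimal_comma:
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    try:
        return ParsedCell(sign * float(digits), label)
    except ValueError:
        return None  # e.g. several decimal marks left over


print(parse_cell("$1,234.56"))
print(parse_cell("1.234,56 EUR"))
print(parse_cell("12,5 kg"))
```

A value such as "1,234" stays ambiguous under any such rule. The sketch reads it as one thousand two hundred thirty-four unless the caller passes `decimal_comma=True`, which is the kind of document-level knowledge a real converter would need to infer.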
Computer scientists have addressed these problems from various perspectives and have notably taken three general approaches:
- The computer vision approach, which seeks to identify patterns in the text color with the purpose of detecting the regular variations typical of a PDF table.
- The heuristic approach, which seeks to build rectangular structures of different sizes in a PDF document and then minimizes the error associated with a certain distance.
- The machine learning approach, which seeks to identify the regular presence of characters and borders within document subsets and then leverages classification rules to reconstruct the table.
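As a rough illustration of the heuristic family only, the Python sketch below groups word bounding boxes into rows and columns using whitespace gaps. It is a deliberately simplified stand-in, not the error-minimizing method described above. It assumes the word boxes have already been extracted by some PDF library, and the `Word` type, the function names and the tolerance values are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    x0: float   # left edge of the word box
    x1: float   # right edge of the word box
    top: float  # distance from the top of the page


def group_rows(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Put words into the same row when their tops are within y_tol
    of the first word of that row."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(rows[-1][0].top - w.top) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(row, key=lambda w: w.x0) for row in rows]


def column_spans(words: List[Word], min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Merge horizontal extents of all words; a gap of at least min_gap
    between merged extents is treated as a column separator."""
    merged: List[List[float]] = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if merged and x0 - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    return [(a, b) for a, b in merged]


def to_grid(words: List[Word], y_tol: float = 3.0, min_gap: float = 10.0) -> List[List[str]]:
    """Build a row-by-column grid of cell strings from word boxes."""
    cols = column_spans(words, min_gap)
    grid = []
    for row in group_rows(words, y_tol):
        cells = [""] * len(cols)
        for w in row:
            # Each word lies inside exactly one merged column span.
            idx = next(i for i, (a, b) in enumerate(cols) if a <= w.x0 <= b)
            cells[idx] = (cells[idx] + " " + w.text).strip()
        grid.append(cells)
    return grid


sample = [
    Word("Item", 10, 40, 100), Word("Price", 120, 160, 100),
    Word("Green", 10, 45, 115.5), Word("tea", 48, 65, 116),
    Word("$4.50", 120, 155, 115),
]
print(to_grid(sample))
```

Fixed tolerances like `y_tol` and `min_gap` are exactly where such heuristics break down: a table with narrow gutters, wrapped cells or spanning headers would need different values or a different strategy altogether.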
The fact is that none of these approaches achieves 100% precision, because the type and position of PDF tables can vary considerably from document to document.
In fact, the detection of PDF tables and the subsequent conversion of PDF to Excel is still a subject of academic research today. For this reason, developers who want to incorporate a PDF library capable of precisely converting PDF to Excel rely on specialized SDKs or PDF APIs.