PDF Tables and Converting PDF to Excel

 In Blog

Pdf tables are encountered in a number of documents including, invoices, listings,  financial reports, bill of landing, certificate of origin, scientific papers and many others.

Although counterintuitive, the problem to identify and extract data from pdf tables is more complex than just converting pdf to excel.

Converting pdf to excel and pdf to xml is necessary each time we want to extract information from pdf files in an editable or machine readable form. However, most often the data we want actually extract from a dpf file to excel are nothing less than PDF tables.

Identifying PDF tables and tables in general is a complex computer science problem. The reason is that what is apparent to the human eye it is not for a computer. In a software algorithm each and every element of the human cognition needs to be broken in logic steps and implemented within the framework of repeatable process.

The challenges in converting pdf to excel when pdf tables are present in the document can be categorized in three items

  • Identifying where the table is within the document
  • Identifying the document contour
  • Correctly identifying the elements of the pdf table
    • Meaning of “,” in a number
    • All the elements of a number
    • The separation between units and number
    • The separation between numbers and labels e.g. $. Eu…

Computer scientists have addressed these multiple problems from various perspectives and notably taken three general approaches:  

  • The computer vision approach that seeks to identify different patterns in the text color with the purpose of identifying regular variations typical of a pdf table.
  • The Euristic approach that seeks to build rectangular structures in a pdf document of different sizes and then minimizes the error associated to a certain distance.
  • The machine learning approach that seeks to identify regular presence of characters and borders within document subsets and then leverages classification rules to reconstruct the table.

The fact is that none of this approaches achieve 100% precision because the type and positions of pdf tables within a pdf document can differ and considerably vary from document to document.

In fact the detection of PDF tables with the subsequent conversion of pdf to excel is still object of academic research today. For this reasons developers interested to incorporate a pdf library capable of precisely converting pdf to excel relies on specialized sdk or pdf APIs.

Third parties pdf table detection algorithms are often a combination of various approaches and can be a powerful resource for those in search of automation in converting pdf to excel or to xml.

This post is also available in: Spanish, Portuguese (Brazil)

Recommended Posts
Showing 3 comments
  • Seetmus

    A formidable share, I just given this onto a colleague who was doing slightly evaluation on this. And he the truth is bought me breakfast as a result of I found it for him.. smile. So let me reword that: Thnx for the treat! However yeah Thnkx for spending the time to debate this, I feel strongly about it and love reading more on this topic. If doable, as you change into expertise, would you thoughts updating your blog with more details? It is extremely useful for me. Large thumb up for this weblog post!

  • Deetmus

    Interesting post on pdf tables extraction. Where can I learn more about the specific algorithms you use for your pdf to xml online and your image extraction?

    • Tabex Editor

      Hi Deetmus, we do not disclose the exact algorithms behind our PDF to excel and pdf to XML API. However, there are several papers available on google scholar. Take a look under PDF table extraction and you’ll find lot of useful articles.

Leave a Comment