I'm trying to extract a table from a PDF using Python 3.6. [pyPDF2][1] seems to fail, and [pdfminer][2] is not compatible with 3.x, so I tried a Python wrapper for [tabula][3].

import tabula
file_list = get_pdf_list()

text = tabula.read_pdf(file_list[0])
print(text)

tabula.convert_into(file_list[0], "test.json", output_format="json")

Both read_pdf and convert_into return empty results. PyPDF2 had the same issue. The script runs without any errors.

I'm starting to think it has to do with the format of my pdf. Anyone have more experience? I'm trying to extract a value from a table in a pdf.

score:0

Using tabula-py:

from tabula import convert_into
table_file = r"pdf_path"
o1_csv = r"file12.csv"
o2_csv = r"file13.csv"
# convert_into writes its result to the output file and returns None
df = convert_into(table_file, o1_csv, output_format='csv', lattice=False, stream=True, pages=1)
df1 = convert_into(table_file, o2_csv, output_format='csv', lattice=True, stream=False, pages=1)
print(df)
print(df1)

Output:
print(df)  : None
print(df1) : None

But the CSV files were not empty.

The call with stream=True produced file12.csv, and the call with lattice=True and stream=False produced file13.csv.
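
Since convert_into only writes to disk, one way to check what was extracted is to read the CSV back. Here is a minimal sketch using only the standard library; the helper name load_csv_rows is my own, and it assumes the file is UTF-8 encoded:

```python
import csv


def load_csv_rows(csv_path):
    """Return the rows of a CSV file as lists of strings, skipping blank rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # A row counts as blank when every cell is empty or whitespace.
        return [row for row in csv.reader(handle)
                if any(cell.strip() for cell in row)]
```

For example, load_csv_rows("file12.csv") should give the rows written by the stream=True call, so an empty list would mean nothing was extracted in that mode.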

Maybe the table has no ruled boundaries to set it apart from the surrounding text. That is the case tabula-py's two modes are meant to handle:

  1. stream: if true, looks for the rows and columns of a table based on how the text is arranged
  2. lattice: if true, looks for explicit boundary lines that define the rows and columns of a table
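
If you don't know in advance which mode suits your PDF, you could try both and keep the first one that finds something. This is only a sketch: the helper name read_tables_both_modes and the reader parameter are my own additions (reader is there so the logic can be tried without Java installed), and it assumes read_pdf returns a list of DataFrames when multiple_tables=True.

```python
def read_tables_both_modes(pdf_path, pages=1, reader=None):
    """Try stream mode, then lattice mode.

    Returns (mode, tables) for the first mode that yields at least one
    non-empty table, or (None, []) if neither mode finds anything.
    """
    if reader is None:
        # Imported lazily so the helper can be exercised with a stand-in reader.
        from tabula import read_pdf as reader

    for mode in ("stream", "lattice"):
        tables = reader(
            pdf_path,
            pages=pages,
            multiple_tables=True,
            stream=(mode == "stream"),
            lattice=(mode == "lattice"),
        )
        # Keep only tables that actually contain rows.
        non_empty = [t for t in (tables or []) if len(t) > 0]
        if non_empty:
            return mode, non_empty
    return None, []
```

Usage would be something like mode, tables = read_tables_both_modes("MyPDF.pdf", pages="all"). If it comes back as (None, []), neither mode found a table on those pages.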

score:1

You may already have your answer, but here is my code anyway. In my experience tabula is one of the better PDF table extractors; I ran into a lot of issues with camelot.

Install the latest tabula-py package:

pip install tabula-py

And the code:

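import os
from tabula import read_pdf  # read_pdf is exported at the package top level

# os.path.abspath alone does not change the working directory, so build the full path instead
pdf_path = os.path.join("E:/Documents/myPy/", "MyPDF.pdf")
tables = read_pdf(pdf_path, multiple_tables=True, pages='all')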

# Write each extracted table to its own Excel file, numbered from 1
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)

Try this out!