How to extract a table from a PDF in Python? [duplicate]

I have thousands of PDF files, composed only of tables, with this structure:
[screenshot of the PDF table layout]
However, despite being fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the extracted text comes out completely garbled.
import PyPDF2

# Open the PDF and print the raw text of the first page
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables)
from tabula import read_pdf

pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')  # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)  # Option 2: reads only the first header and a few lines of content
Any thoughts?

After struggling a little bit, I found a way.
For each page of the file, I had to pass the area of the table and the column boundaries explicitly to tabula's read_pdf function.
Here is the working code:
import pypdf
from tabula import read_pdf
# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)
# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
)
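A minimal sketch of extending this to every page, assuming each page shares the same table area and column boundaries (pdf_file is the same path variable as above; read_pdf returns a list of DataFrames, so the per-page results can be concatenated with pandas):

import pandas as pd
import pypdf
from tabula import read_pdf

pdf_file = "pdf_file.pdf"  # hypothetical path

# Count the pages so we can iterate over all of them
n_pages = len(pypdf.PdfReader(pdf_file).pages)

tables = []
for page in range(1, n_pages + 1):  # tabula page numbers are 1-indexed
    dfs = read_pdf(
        pdf_file,
        guess=False,
        pages=page,
        stream=True,
        encoding="utf-8",
        area=(96, 24, 558, 750),  # same area as above; assumed identical on every page
        columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
    )
    tables.extend(dfs)  # read_pdf returns a list of DataFrames

# Stack all pages into one DataFrame
full_table = pd.concat(tables, ignore_index=True)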

Try this:
pip install tabula-py
from tabula import read_pdf
df = read_pdf("file_name.pdf")

Use the tabula-py library:
pip install tabula-py
then extract the tables:
import tabula

# this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)
# if you want to read all pages
dfs = tabula.read_pdf(url, pages="all")
dfs[1]
By the way, I also tried reading PDF files another way, and it worked better than tabula. I will post it soon.

@fmarques
You could also try a new Python package (SLICEmyPDF) developed by StatCan specifically for extracting tabular data from PDFs:
https://github.com/StatCan/SLICEmyPDF
In my experience, SLICEmyPDF outperforms other free Python or R packages.
The catch is that it requires installing a few extra pieces of free software. The installation instructions can be found at
https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html

Related

Is there a way I can split a huge CSV files into multiple PDF?

I'm trying to split a big CSV file into several smaller PDF files and need some help generating the PDFs.
I can split it into multiple CSV or HTML files, but I'm not sure whether there's a way to convert a dataframe directly to PDF, or to convert HTML to PDF. Here's where I am:
import pandas as pd
import glob

path = r'C:\Users\ZhangZ01\Desktop\test\NT_combine.csv'
csv = glob.glob(path + "/*.csv")
df = pd.read_csv(path, index_col=None, header=0)

## Split data by "CUSTOMER_ID"
for i, g in df.groupby('CUSTOMER_ID'):
    g.to_html(r'C:\Users\ZhangZ01\Desktop\test\{}.html'.format(i), header=True, index_names=False)
I did some search online and some people say I could use pdfKit but seems that is not available for Windows user.
How do I solve the problem?
pdfkit is available for Windows too; all you need to do is:
1: pip install pdfkit
2: then download the suitable version of wkhtmltopdf, which pdfkit needs in order to work
3: add PATH_OF_wkhtmltopdf/bin to your system PATH variable
and in your Python script add the following line:
pdfkit.from_url('your-url.html', 'your_pdf.pdf')
Don't forget to import pdfkit.
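Tying this back to the question's per-customer HTML files, a minimal sketch (the customer id 12345 and the wkhtmltopdf path are hypothetical; the configuration call is only needed if wkhtmltopdf isn't on your PATH):

import pdfkit

# Optional: tell pdfkit where wkhtmltopdf lives if it isn't on PATH
config = pdfkit.configuration(wkhtmltopdf=r'C:\path\to\wkhtmltopdf\bin\wkhtmltopdf.exe')

# Convert one of the HTML files produced by the groupby loop in the question
pdfkit.from_file(r'C:\Users\ZhangZ01\Desktop\test\12345.html',
                 r'C:\Users\ZhangZ01\Desktop\test\12345.pdf',
                 configuration=config)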
I don't know if you absolutely need to convert from html but if not, you may be able to use fpdf:
from fpdf import FPDF

data = [
    ["hello there", 3, 12],
    ["something", 312, 66],
    ["earsfg", 303, 95],
    ["earsfg", 303, 95],
    ["earsfg", 303, 95],
]

# prepare pdf
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=18)

# write some data
for idx, line in enumerate(data):
    lineStr = "[{0}] {1}, {2}".format(line[0], line[1], line[2])
    pdf.cell(200, 8, txt=lineStr, ln=1, align="L")

pdf.output("output.pdf")
A more in-depth tutorial is available in the fpdf documentation.
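If converting through HTML isn't required, the same idea extends to the question's per-customer split; a minimal sketch combining the groupby from the question with fpdf (one text line per DataFrame row is an assumed formatting choice, and the paths are the hypothetical ones from the question):

import pandas as pd
from fpdf import FPDF

df = pd.read_csv(r'C:\Users\ZhangZ01\Desktop\test\NT_combine.csv')

for customer_id, group in df.groupby('CUSTOMER_ID'):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=10)
    # one text line per DataFrame row
    for _, row in group.iterrows():
        pdf.cell(200, 6, txt=", ".join(str(v) for v in row.values), ln=1, align="L")
    pdf.output(r'C:\Users\ZhangZ01\Desktop\test\{}.pdf'.format(customer_id))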

Download excel file using python

I have a web link which downloads an Excel file directly. It opens a page saying "your file is downloading" and starts downloading the file.
Is there any way I can automate this using the requests module?
I am able to do it with Selenium, but I want it to run in the background, so I was wondering if I could use requests instead.
I have tried requests.get, but it simply returns the text "your file is downloading"; somehow I am not able to get the file itself.
This Python 3 code downloads a file from the web into memory:

import requests
from io import BytesIO

url = 'your.link/path'

def get_file_data(url):
    response = requests.get(url)
    f = BytesIO()
    # stream the response body into an in-memory buffer
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
    f.seek(0)
    return f

data = get_file_data(url)
You can use the following code to read the Excel file:
import pandas as pd
xlsx = pd.read_excel(data, skiprows=0)
print(xlsx)
It sounds like you don't actually have a direct URL to the file, and instead need to engage with some javascript. Perhaps there is an underlying network call that you can find by inspecting the page traffic in your browser that shows a direct URL for downloading the file. With that you can actually just read the excel file URL directly with pandas:
import pandas as pd
url = "https://example.com/some_file.xlsx"
df = pd.read_excel(url)
print(df)
This is nice and tidy, but if you really want to use requests (or avoid pandas) you can download the raw file content as shown in the first answer above, and then use the pyexcel_xlsx package's get_xlsx function to read it without any pandas involvement.
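A minimal sketch of that pandas-free route (the exact pyexcel_xlsx entry point is an assumption: current releases document get_data, while the answer above mentions get_xlsx, so verify the name against your installed version):

import requests
from io import BytesIO
from pyexcel_xlsx import get_data  # assumed entry point; verify against your version

url = "https://example.com/some_file.xlsx"  # hypothetical URL
response = requests.get(url)

# get_data returns an OrderedDict mapping sheet names to lists of rows
sheets = get_data(BytesIO(response.content), file_type="xlsx")
for name, rows in sheets.items():
    print(name, rows[:2])  # first two rows of each sheet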

Loop through PDF and save all pages into a DataFrame

I am fairly new to Python and am trying the PyPDF2 package for the first time. I simply want to loop through my PDF document (66 pages) and extract all the text into a DataFrame.
I have followed some blog posts (http://echrislynch.com/2018/07/13/turning-a-pdf-into-a-pandas-dataframe/) and have the following code. Unlike the blog post, I am not interested in any data cleansing or transformation at this point; I simply want the pages stored in a dataframe:
import PyPDF2
import os
import pandas as pd

# Open PDF as an object and read it into PyPDF2
pdfFileObj = open('MyReport.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# loop through pages
pages = list()
for i in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    page = pageObj.extractText()
    page = page[0:]
    pages.append(page[0:])

for i in range(1, len(page)):
    pages = [page[2:] for page in pages]

# Create dataframe
page_df = pd.DataFrame([page])

# Concat with dbn_df
MyNewReport = pd.DataFrame([page])
page_df = page_df.iloc[0:]
MyNewReport = pd.concat([MyNewReport, page_df], axis=0,
                        ignore_index=True, sort=False)
I am encountering an error:

File "<ipython-input-78-729b84e346f9>", line 16, in <module>
    page[i] = page[i][2:]
TypeError: 'str' object does not support item assignment
So I know the issue lies within my loop, although, looking at the variable explorer, my dataframe contains the text from the last page of my PDF... so it is looping through something!
Can anyone help, or recommend some further reading to understand the error and its resolution?
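The TypeError comes from assigning into a string (page[i] = page[i][2:]): Python strings are immutable. One way to sidestep the loop juggling entirely is to collect each page's text into a list and build the DataFrame once at the end. A minimal sketch using the same legacy PyPDF2 API as the question (the one-row-per-page DataFrame shape is an assumption; note the range starts at 0 so the first page is included):

import PyPDF2
import pandas as pd

pdfFileObj = open('MyReport.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Extract every page's text into a list (page indices are 0-based)
pages = [pdfReader.getPage(i).extractText() for i in range(pdfReader.numPages)]

# One row per page
MyNewReport = pd.DataFrame({'page_text': pages})

pdfFileObj.close()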

How to download first line of xlsx file via url python

I used to use the requests lib to load a single line via URL:
import requests

def get_line(url):
    resp = requests.get(url, stream=True)
    for line in resp.iter_lines(decode_unicode=True):
        yield line

line = get_line(url)
print(next(line))
Text files load perfectly, but if I want to load an .xlsx file, the result looks like unprintable symbols:
PK [symbols] [Content_Types].xml [symbols]
Is there a way to load a single row of cells?
You can't just read the raw HTTP response and seek out the particular Excel data. To get the xlsx file contents in a proper format, you need to use an appropriate library.
One of the common libraries is xlrd; you can install it with pip:
sudo pip3 install xlrd
(Note: xlrd releases from 2.0 onward dropped .xlsx support, so this approach needs an older xlrd or a different library such as openpyxl.)
Example:
import requests
import xlrd
example_url = 'http://www.excel-easy.com/examples/excel-files/fibonacci-sequence.xlsx'
r = requests.get(example_url) # make an HTTP request
workbook = xlrd.open_workbook(file_contents=r.content) # open workbook
worksheet = workbook.sheet_by_index(0) # get first sheet
first_row = worksheet.row(0) # you can iterate over rows of a worksheet as well
print(first_row) # list of cells
xlrd documentation
If you want to be able to read your data line by line, you should switch to a simpler data representation format, like .csv or plain text files.
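If the file must stay .xlsx, another option (a swapped-in library, not part of the answer above) is openpyxl in read-only mode; the whole file is still downloaded, only the parsing is lazy. A minimal sketch pulling just the first row of cells:

import requests
from io import BytesIO
from openpyxl import load_workbook

url = 'http://www.excel-easy.com/examples/excel-files/fibonacci-sequence.xlsx'
r = requests.get(url)

# read_only mode parses the sheet lazily instead of loading it all at once
wb = load_workbook(BytesIO(r.content), read_only=True)
ws = wb.worksheets[0]

first_row = next(ws.iter_rows(max_row=1, values_only=True))
print(first_row)  # tuple of cell values from row 1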

How to download a Excel file from behind a paywall into a pandas dataframe?

I have this website that requires a login to access data.
import pandas as pd
import requests
r = requests.get(my_url, cookies=my_cookies) # my_cookies are imported from a selenium session.
df = pd.io.excel.read_excel(r.content, sheetname=0)
Response:
IOError: [Errno 2] No such file or directory: 'Ticker\tAction\tName\tShares\tPrice\...
Apparently, the string is processed as a filename. Is there a way to process it as a file instead? Alternatively, can we pass cookies to pd.read_html?
EDIT: After further processing we can now see that this is actually a csv file. The content of the downloaded file is:
In [201]: r.content
Out [201]: 'Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\nBRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\nCOHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\nUNTD\tBUY\tUnited Online Inc\t560.0\t15.15\t-1.00\t-8,484\t0.00\nFLXS\tBUY\tFlexsteel Industries Inc\t210.0\t40.31\t-1.00\t-8,465\t0.00\nUPRO\tCOVER\tProShares UltraPro S&P500\t17.0\t71.02\t-0.00\t-1,207\t0.00\n'
Notice that it is tab delimited. Still, trying:
# csv version 1
df = pd.read_csv(r.content)
# Returns error, file does not exist. Apparently read_csv() is also trying to read it as a file.
# csv version 2
fh = io.BytesIO(r.content)
df = pd.read_csv(fh) # ValueError: No columns to parse from file.
# csv version 3
s = StringIO(r.content)
df = pd.read_csv(s)
# No error, but the resulting df is not parsed properly; \t's show up in the text of the dataframe.
Simply wrap the file contents in a BytesIO:

with io.BytesIO(r.content) as fh:
    df = pd.io.excel.read_excel(fh, sheetname=0)
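Given the EDIT showing the payload is really tab-separated text rather than an Excel file, the question's "csv version 3" attempt only lacked the separator; a minimal sketch, assuming r.content holds the bytes shown above:

import io
import pandas as pd

# Parse the tab-delimited payload; sep='\t' is what the earlier attempts were missing
df = pd.read_csv(io.BytesIO(r.content), sep='\t')
print(df.head())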
This functionality was included in an update from 2014. According to the documentation, it is as simple as providing the URL:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
Based on the code you've provided, it looks like you are using pandas 0.13.x. If you can upgrade to a newer version (the code below is tested with 0.16.x), you can get this to work without additionally using the requests library; this was added in 0.14.1:
data2 = pd.read_excel(data_url)
As an example of a full script (with the example XLS document taken from the original bug report stating that read_excel didn't accept a URL):
import pandas as pd
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
data = pd.read_excel(data_url, "Data 1", skiprows=2)
