Extracting Tables from PDFs Using Tabula - python

I came across a great library called Tabula, and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to the documentation, you can specify the page area you want to extract from. However, the useless area is only on the first page of my PDF file, so for all subsequent pages Tabula will miss the top section. Is there a way to apply the area condition only to the first page of the PDF?
from tabula import read_pdf
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')

I'm trying to work on something similar (parsing bank statements) and had the same issue. The only way I have found to solve this so far is to parse each page individually.
The only problem is that this requires knowing in advance how many pages your file has. For the moment I have not found a way to do this directly with Tabula, so I've decided to use the pyPdf module to get the number of pages.
import pyPdf
from tabula import read_pdf

path = r"C:\Users\riley\Desktop\Bank Statements\50340.pdf"

# get the number of pages with pyPdf
reader = pyPdf.PdfFileReader(open(path, mode='rb'))
n = reader.getNumPages()

# parse each page individually, restricting the area only on page 1
df = []
for page in [str(i + 1) for i in range(n)]:
    if page == "1":
        df.append(read_pdf(path, area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf(path, pages=page))
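If you would rather end up with a single DataFrame than a list, you can concatenate the per-page results afterwards. A small follow-up sketch, assuming each read_pdf call returns one DataFrame (as in older tabula-py versions):

import pandas as pd

# stack the per-page tables into one DataFrame
full_statement = pd.concat(df, ignore_index=True)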
Notice that there are some known and open issues when reading each page individually, or all at the same time.
Good luck!
08/03/2017 EDIT:
Found a simpler way to count the pages of the PDF without going through pyPdf:
import re

def count_pdf_pages(file_path):
    # count "/Type /Page" objects (excluding "/Type /Pages") in the raw PDF bytes
    rxcountpages = re.compile(br"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))
where file_path is the path to your file of course
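For example, this count can then feed the per-page loop from the first snippet:

# reuse the path from the first snippet; the loop above then runs unchanged with this n
n = count_pdf_pages(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf")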

The code below may help you:
import os
from tabula import wrapper

os.chdir("E:/Documents/myPy/")  # directory containing the PDF
tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# write each extracted table to its own Excel file
i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)
    i = i + 1

The parameter guess=False will solve the problem.
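For context, guess=True (the default) lets tabula-py auto-detect the table region, which can override an explicit area. A minimal sketch, reusing the path and coordinates from the question:

from tabula import read_pdf

# guess=False stops tabula from guessing the table region, so the explicit area is honoured
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf",
              area=(530, 12.75, 790.5, 561), guess=False, pages='all')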

Extracting Tables from PDFs Using Tabula
pip install tabula-py
pip install tabulate
from tabula import read_pdf
from tabulate import tabulate

# reads tables from the pdf file, skipping the useless first page
df = read_pdf("abc.pdf", pages="2-3")  # address of pdf file; adjust the page range to your file's last page
print(tabulate(df))
Parameters:
pages (str, int, list of int, optional)
An optional value specifying the pages to extract from. It accepts a str, an int, or a list of ints. Default: 1
Examples: '1-2,3', 'all', [1,2]
Since the first page is useless, drop it and read from the second page up to the last page.

Related

How to extract a column of data from an online PDF page?

I am interested in extracting the 'Company Name' column from this link:
https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf
I was able to achieve something similar with this solution: How do I decode text from a pdf online with Requests?
However, I was wondering how I would go about extracting only the company name column, since that solution returns all of the text in an unstructured format. Thanks in advance, as I am fairly new to Python and having difficulties.
You get the error because the server is preventing bots from web scraping, or something similar. I don't quite understand it either, but I found a fix, which is to download the file locally first and then use tabula to get the data, like so:
import requests
from tabula import read_pdf

url = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
r = requests.get(url, allow_redirects=True)

# save the PDF locally, then let tabula read it from disk
with open('data.pdf', 'wb') as f:
    f.write(r.content)

tables = read_pdf("data.pdf", pages="all", multiple_tables=True)
You may then get the following message:
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
To fix it, follow the steps from this thread:
`java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`
and everything should be working.
There is a Python library named tabula-py.
You can install it using "pip install tabula-py".
You can use it as follows:
import tabula
file = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
You can use this to convert the table to a csv file
tabula.convert_into(file, "table.csv")
Then you can use the csv library to get the specific columns you want.
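For instance, after convert_into you could pull the column out of the csv like this (the exact header text is an assumption about how tabula labels it):

import csv

# read the converted csv and keep only the "Company Name" column
with open("table.csv", newline="") as f:
    reader = csv.DictReader(f)
    company_names = [row["Company Name"] for row in reader if row.get("Company Name")]

print(company_names)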

Extracted tables from PDF returned incorrect data - Python

I've tried many times to find a way to import this data from this PDF.
(http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf) It's a report from an agriculture department in Brazil. I need just the first table.
My mission is to develop a program that gets some specific points of this report and build a paragraph with it.
The thing is that I couldn't find a way to import the table correctly.
I've tried to use tabula-py, but it didn't work very well.
Does anyone know how I can import it?
Python 3.6 / macOS High Sierra
PS: It needs to be done just with Python, because this code will be uploaded to Heroku, so I can't install software there. (BTW, I think even tabula-py might not work there, as I need to have Java installed... but I will try anyway.)
Here what I tried:
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
response = requests.get(url)
df = tabula.read_pdf(url)
tabula.convert_into("teste.pdf", "output.csv", output_format="csv", area=(67.14, 23.54,284.12, 558.01)) #I tried also without area.
I think tabula expects a file, not a URL. Try this:
#!/usr/bin/env python3
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
filename = "16032018194928.pdf"
response = requests.get(url)
with open(filename, 'wb') as f:
f.write(response.content)
df = tabula.read_pdf(filename)
print(df)
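If you only need the first table, you can also restrict the extraction area once the file is local. The coordinates below are the ones from the question and may need adjusting:

# restrict extraction to the region the question measured for the first table
df_first = tabula.read_pdf(filename, area=(67.14, 23.54, 284.12, 558.01), guess=False, pages=1)
print(df_first)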

Python 3 html table data

I'm new to Python and I need to get the data from a table on a webpage and put it into a list.
I've tried everything, and the best I got is:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')

rows = list()
for tr in soup.findAll('table'):
    rows.append(tr)
Any suggestions?
You're not that far!
First, make sure to install the proper version of BeautifulSoup, which is BeautifulSoup4, by doing apt-get install python3-bs4 (assuming you're on Ubuntu or Debian and running Python 3).
Then isolate the td elements of the html table and clean the data a bit: for example, remove the first 3 elements of the list, which are useless, and remove the ugly '\n':
import urllib.request
from bs4 import BeautifulSoup

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')

# collect the text of every cell in every table
rows = list()
for tr in soup.findAll('table'):
    for td in tr:
        rows.append(td.string)

# drop the first 3 useless elements and the stray newlines
temp_list = rows[3:]
final_list = [element for element in temp_list if element != '\n']
I don't know which data you want to extract precisely. Now you need to work on your Python list (called final_list here)!
Hope it's clear.
There is a Download option at the end of the webpage. If you can download the file manually, you are good to go.
If you want to access different dates automatically, and since the page is rendered with JavaScript, I suggest using Selenium to download the xlsx files through Python.
With the xlsx file you can use pandas.read_excel (or openpyxl) to read the data and do what you want, as in the sketch below.
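A rough sketch of that flow; the download-link id and the saved file name are pure assumptions about the page, so inspect it to find the real locator:

import pandas as pd
from selenium import webdriver

# open the page and click the download link (the element id here is hypothetical)
driver = webdriver.Chrome()
driver.get("http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR")
driver.find_element_by_id("download_link").click()

# once the xlsx has landed on disk, pandas can read it (needs openpyxl or xlrd installed)
rates = pd.read_excel("taxas_referenciais.xlsx")
print(rates.head())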

How to extract tables from websites in Python

Here,
http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500
There is a table. My goal is to extract the table and save it to a csv file. I wrote this code:
import urllib
import os
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
web.close()
ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()
I'm lost from here. Can anyone help with this? Thanks!
Pandas can do this right out of the box, saving you from having to parse the html yourself. read_html() extracts all tables from your html and puts them in a list of dataframes. to_csv() can be used to convert each dataframe to a csv file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.
import requests
import pandas as pd
url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
It's simple enough to do in one line, if you prefer:
pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
P.S. Just make sure you have lxml, html5lib, and BeautifulSoup4 packages installed in advance.
So essentially you want to parse the html file to get elements out of it. You can use BeautifulSoup or lxml for this task.
You already have solutions using BeautifulSoup. I'll post a solution using lxml:
from lxml import etree
import urllib.request
web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
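To finish the original task (saving to CSV), the header and rows extracted above can be written out with the csv module, for example:

import csv

# write the extracted header and rows to a csv file
with open("report_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(td_content)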
I would recommend BeautifulSoup as it has the most functionality. I modified a table parser I found online so it can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastebin link.
http://pastebin.com/RPNbtX8Q
You could use it as follows:
import csv
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)

tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table
my_table = tp.get_tables()[0]

# write the chosen table out as csv
filename = 'table_as_csv.csv'
with open(filename, 'wb') as f:
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)
The code above is an outline, but if you use the table parser from the pastebin link you should be able to get to where you want to go.
You need to parse the table into an internal data structure and then output it in CSV form.
Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8, which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).
Once you have the table in a data structure (probably a list of lists in this case) you can write it out with csv.writer.
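For example, a minimal sketch of that last step (the rows themselves are placeholders for whatever you parse out of the table):

import csv

rows = [["header1", "header2"], ["value1", "value2"]]  # placeholder list of lists from the parsed table

with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)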
Look at the BeautifulSoup module. In the documentation you will find many examples of parsing html.
Also for csv you have a ready-made solution - the csv module.
It should be quite easy.
Look at this answer about parsing a table with BeautifulSoup and writing it to a text file.
Also try googling "python beautifulsoup".

Extracting titles from PDF files?

I want to write a script to rename downloaded papers with their titles automatically, and I'm wondering if there is any library or trick I can make use of. The PDFs are all generated by TeX and should have some 'formal' structure.
You could try to use pyPdf and this example.
for example:
from pyPdf import PdfFileWriter, PdfFileReader

def get_pdf_title(pdf_file_path):
    # read the Title entry from the PDF's document info dictionary
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')
Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
Once you have the arXiv reference number (and have done a pip install arxiv), you can get the title using
import arxiv

paper_ref = '1501.00730'
title = arxiv.query(id_list=[paper_ref])[0].title
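A rough sketch of the whole rename flow, assuming the id appears in the first page's text as something like "arXiv:1501.00730"; the regex and the filename cleanup are my own assumptions, not part of the arxiv package:

import os
import re
import arxiv
from pyPdf import PdfFileReader

def rename_by_arxiv_title(pdf_path):
    # pull the text of the first page and look for an id such as "arXiv:1501.00730"
    with open(pdf_path, 'rb') as f:
        text = PdfFileReader(f).getPage(0).extractText()
    match = re.search(r'arXiv:(\d{4}\.\d{4,5})', text)
    if not match:
        return  # no id found; leave the file alone
    title = arxiv.query(id_list=[match.group(1)])[0].title
    safe_title = re.sub(r'[^\w\- ]', '', title)  # drop characters that are awkward in filenames
    os.rename(pdf_path, os.path.join(os.path.dirname(pdf_path), safe_title + '.pdf'))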
I would probably start with perl (seeing as it's always the first thing I reach for). There are several modules for handling PDFs. If you have a consistent structure, you could use regex to snag the titles.
You can try using iText with Jython
