How to extract a column of data from an online PDF page? - python

I am interested in extracting the 'Company Name' column from this link:
https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf
I was able to achieve something similar with this solution: How do I decode text from a pdf online with Requests?
However, I was wondering how I would go about extracting only the company name column, since that solution returns all of the text in an unstructured format. Thanks in advance; I am fairly new to Python and having difficulties.

You get the error because the server is blocking requests that look like bots. I don't fully understand the mechanism either, but a workaround is to download the file locally first and then use tabula to read the data, like so:
import requests
from tabula import read_pdf

url = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"

# download the PDF locally first, then hand the local file to tabula
r = requests.get(url, allow_redirects=True)
with open('data.pdf', 'wb') as f:
    f.write(r.content)

tables = read_pdf("data.pdf", pages="all", multiple_tables=True)
You may then get the following message:
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`
To fix it, follow the steps from this thread:
`java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`
and everything should work.
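Once read_pdf returns, you can pull out just the one column. A minimal sketch, assuming tabula detected the header row and named the column 'Company Name' (inspect tables[0].columns first if the names differ):

import pandas as pd

# each entry in `tables` is a pandas DataFrame, one per detected table;
# keep only the 'Company Name' column from each table that has it
company_names = pd.concat(
    [t['Company Name'] for t in tables if 'Company Name' in t.columns],
    ignore_index=True,
)
print(company_names.tolist())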

There is a Python library named tabula-py
You can install it using "pip install tabula-py"
You can use it as follows:
import tabula
file = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
You can use this to convert the table to a csv file
tabula.convert_into(file, "table.csv")
Then you can use the csv library (or pandas) to pull out the columns you want.
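For example, a minimal sketch with the csv module, assuming convert_into wrote a header row and one column is literally named 'Company Name':

import csv

with open('table.csv', newline='') as f:
    reader = csv.DictReader(f)
    # collect just the one column, row by row
    company_names = [row['Company Name'] for row in reader]
print(company_names)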

Related

Read a csv file from bitbucket using Python and convert it to a df

I am trying to read a CSV file from bitbucket by URL and load it into a df using Python. Also, for the work I am doing I cannot read it locally; it has to come from bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames = ['project_id', 'project_name', 'gourmet_url']
df7 = pd.read_csv(url, names=colnames)
However, the output is not correct; it's not the df content being output, it's some bad data.
You have multiple options, but your question is actually 2 separate questions.
How to get a file (.csv in this case) from a remote location.
How to load a csv into a "df" which is a pandas data frame.
For #2, you simply import pandas and use the df = pandas.read_csv() function call. See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv')
For #1, the CSV is on a server somewhere; in this case it happens to be on bitbucket's servers, accessed from their website. You can fetch it and save it locally, then read it; you can fetch it to a temporary location, read it into pandas, and discard it; you could even read the file's contents into Python as a string. However, having a lot of options doesn't mean they are all useful; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into read_csv(): you can pass in a URL directly, as long as it uses a supported scheme, where, per pandas,
"Valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can use pandas once again, via the .to_csv() method of a DataFrame.
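A minimal sketch of both steps (the URL here is a placeholder, not a real file):

import pandas as pd

# pandas fetches the remote CSV itself when given an http(s) URL
df = pd.read_csv('https://example.com/raw/myfile.csv')

# save a local copy using the DataFrame's to_csv method
df.to_csv('myfile_local.csv', index=False)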
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on bitbucket. Get the link to the raw file, and pass that in. The link used to view the file on your web browser is not the direct link to the raw file by default, it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on ... and pressing 'Open raw' we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail, the link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/
then comes the same commit pointer (it looks random, but it is the same letters and numbers in both links)
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server and served from src/: it is the webpage view. The second link, the actual link to the raw file, goes through raw/ and ends with .csv
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
You may also download the entire file first: make the request with requests, write it to disk, and then read it as a local file with pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web

When I run the code, it runs without errors, but the csv file is not created, why?

I found a tutorial and I'm trying to run this script; I haven't worked with Python before.
tutorial
I've already checked what is happening via logging.debug, verified that it connects to Google, and tried creating a csv file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
Locates and opens your searches.txt file.
Uses those keywords and searches the first page of Google for each result.
Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved
Google added a captcha because I made too many requests.
It works when I use mobile internet.
Assuming the links variable is populated and contains data (please verify).
If it is empty, test the API call itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines on file handling in Python.
In my view, you need to make sure the file gets created first:
Start with a script that is only able to create a file.
After that, enhance the script so it can write to and append to the file.
From there I think you are good to go and can continue with your script.
Also, you would probably prefer opening the file only once instead of on each loop iteration (see the sketch below); it can mean much faster execution.
Let me know if something is not clear.
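A minimal sketch of that last point, restructured so the output file is opened once before the loop (the rows list here stands in for whatever your scraping loop builds):

import csv

csvfile = '/Users/Work/Desktop/data.csv'
rows_to_write = [['query', 'url', 'title']]  # hypothetical rows from your loop

# open once, write many times, instead of reopening the file per row
with open(csvfile, 'a', newline='') as data:
    writer = csv.writer(data)
    for csvRow in rows_to_write:
        writer.writerow(csvRow)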

Extracted tables from PDF returned incorrect data - Python

I've tried many times to find a way to import the data from this PDF.
(http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf) It's a report from an agricultural department in Brazil. I need just the first table.
My mission is to develop a program that grabs some specific points of this report and builds a paragraph with them.
The thing is that I couldn't find a way to import the table correctly.
I've tried to use tabula-py, but it didn't work very well.
Does anyone know how I can import it?
Python 3.6 / macOS High Sierra
ps: It needs to be done with Python only, because this code will be deployed to Heroku, so I can't install software there. (BTW, I suspect even tabula-py won't work there, since it needs Java installed... but I will try anyway.)
Here's what I tried:
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
response = requests.get(url)
df = tabula.read_pdf(url)
tabula.convert_into("teste.pdf", "output.csv", output_format="csv", area=(67.14, 23.54,284.12, 558.01)) #I tried also without area.
I think tabula expects a file, not a URL. Try this:
#!/usr/bin/env python3
import tabula
import requests

url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
filename = "16032018194928.pdf"

response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)

df = tabula.read_pdf(filename)
print(df)

How can I add a header to urllib.request.urlretrieve keeping my variables?

I'm trying to download a file from a website, but it looks like the site detects urllib and blocks the download (I'm getting the error "urllib.error.HTTPError: HTTP Error 403: Forbidden").
How can I fix this? I found online that I have to add a header, but the answers I found didn't fit my case (they used Request, and I found no header argument for urllib.request.urlretrieve()).
I'm using Python 3.6
Here's the code:
import urllib.request
filelink = 'https://randomwebsite.com/changelog.txt'
filename = filelink.rsplit('/', 1)
filename = str(filename[1])
urllib.request.urlretrieve(filelink, filename)
I want to include a header to get permission to download the file, but I need to keep a line like the last one, using the two variables (one for the link to the file and one for the name, which is derived from the link).
Thanks in advance for your help!
Check the link below:
https://stackoverflow.com/a/7244263/5903276
The most correct way to do this is to use the urllib.request.urlopen function to get a file-like object that represents the HTTP response, and copy it to a real file using shutil.copyfileobj.
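A minimal sketch of that approach, keeping the asker's two variables (the User-Agent value is just an example header; adjust it to whatever the server accepts):

import shutil
import urllib.request

filelink = 'https://randomwebsite.com/changelog.txt'
filename = filelink.rsplit('/', 1)[1]

# attach a browser-like User-Agent so the server doesn't answer 403
req = urllib.request.Request(filelink, headers={'User-Agent': 'Mozilla/5.0'})

with urllib.request.urlopen(req) as response, open(filename, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)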

Extracting Tables from PDFs Using Tabula

I came across a great library called Tabula and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to the documentation, you can specify the page area you want to extract from. However, the useless area is only on the first page of my PDF file, so for all subsequent pages Tabula will miss the top section. Is there a way to apply the area restriction only to the first page of the PDF?
from tabula import read_pdf
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')
I'm trying to work on something similar (parsing bank statements) and had the same issue. The only way I have found to solve this so far is to parse each page individually.
The only problem is that this requires knowing in advance how many pages the file has. For the moment I have not found a way to do this directly with Tabula, so I've decided to use the pyPdf module to get the number of pages.
import pyPdf
from tabula import read_pdf

# use pyPdf only to count the pages (note the raw string for the Windows path)
reader = pyPdf.PdfFileReader(open(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", mode='rb'))
n = reader.getNumPages()

df = []
for page in [str(i + 1) for i in range(n)]:
    if page == "1":
        # restrict the extraction area only on the first page
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", pages=page))
Notice that there are some known and open issues when reading each page individually, or all at the same time.
Good luck!
08/03/2017 EDIT:
Found a simpler way to count the pages of the PDF without going through pyPdf:
import re

def count_pdf_pages(file_path):
    # the file is opened in binary mode, so the pattern must be a bytes regex
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))

where file_path is the path to your file, of course
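For example, it can replace the pyPdf call in the loop above:

n = count_pdf_pages(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf")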
The code below may help you:
import os
from tabula import wrapper

# note: os.path.abspath only computes a path and discards it here;
# use os.chdir("E:/Documents/myPy/") if you mean to change directory
os.path.abspath("E:/Documents/myPy/")

tables = wrapper.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# write each extracted table to its own Excel file
i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)
    i = i + 1
The parameter guess=False will solve the problem.
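A minimal sketch (the file name is a placeholder): with guess=False, tabula stops trying to auto-detect the table region and reads the whole page, or exactly the area you specify.

from tabula import read_pdf

# guess=False disables tabula's automatic table-region detection
df = read_pdf("50340.pdf", pages="all", guess=False)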
pip install tabula-py
pip install tabulate

from tabula import read_pdf
from tabulate import tabulate

# reads tables from the pdf file; tabula's pages option takes a str, int,
# or list of ints — a Python slice like [2:] is not valid syntax here
df = read_pdf("abc.pdf", pages="2-3")  # e.g. pages 2 through 3; use a list like [2, 3] or 'all' as needed
print(tabulate(df))
Parameters:
pages (str, int, list of int, optional)
An optional value specifying the pages to extract from. It allows str, int, or a list of int. Default: 1
Examples: '1-2,3', 'all', [1, 2]
Since the first page is useless here, we drop it and read from the second page up to the last.
