Trying to collect data from local files using BeautifulSoup - python

I want to run a python script to parse html files and collect a list of all the links with a target="_blank" attribute.
I've tried the following but it's not getting anything from bs4. SoupStrainer says in the docs it'll take args in the same way as findAll etc, should this work? Am I missing some stupid error?
import os
import sys
from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path
def main():
ROOT = Path(os.path.realpath(__file__)).ancestor(3)
src = ROOT.child("src")
templatedir = src.child("templates")
for (dirpath, dirs, files) in os.walk(templatedir):
for path in (Path(dirpath, f) for f in files):
if path.endswith(".html"):
for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
print link
if __name__ == "__main__":
sys.exit(main())

I think you need something like this
if path.endswith(".html"):
htmlfile = open(dirpath)
for link in BeautifulSoup(htmlfile,parse_only=SoupStrainer(target="_blank")):
print link

The usage BeautifulSoup is OK but you should pass in the html string, not just the path of the html file. BeautifulSoup accepts the html string as argument, not the file path. It will not open it and then read the content automatically. You should do it yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>. This is not the content of the file. Surely there is no links. You should use BeautifulSoup(open(path).read(), ...).
edit:
It also accepts the file descriptor. BeautifulSoup(open(path), ...) is enough.

Related

Web Scraping FileNotFoundError When Saving to PDF

I copy some Python code in order to download data from a website. Here is my specific website:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1
Here is the code which I copied:
import requests
from bs4 import BeautifulSoup
def _getUrls_(res):
hrefs = []
soup = BeautifulSoup(res.text, 'lxml')
main_content = soup.find('div',{'id' : 'content-core'})
table = main_content.find("table")
for a in table.findAll('a', href=True):
hrefs.append(a['href'])
return(hrefs)
bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)
def _getPdfs_(hrefs, basedir):
for i in range(len(hrefs)):
print(hrefs[i])
respdf = requests.get(hrefs[i])
pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
try:
with open(pdffile, 'wb') as p:
p.write(respdf.content)
p.close()
except FileNotFoundError:
print("No PDF produced")
basedir= "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)
The code runs successfully, but it did not download anything at all, even though there is no Filenotfounderror obviously.
I tried the following two URLs:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125
However both of these URLs return >>> No PDF produced.
The thing is that the code worked and downloaded successfully for other people, but not me.
Your code works I just tested. You need to make sure the basedir exists, you want to add this to your code:
if not os.path.exists(basedir):
os.makedirs(basedir)
I used this exact (indented) code but replaced the basedir with my own dir and it worked only after I made sure that the path actually exists. This code does not create the folder in case it does not exist.
As others have pointed out, you need to create basedir beforehand. The user running the script may not have the directory created. Make sure you insert this code at the beginning of the script, before the main logic.
Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the users %USERPROFILE% enviorment variable:
from os import envioron
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
Which would be the same as C:\Users\blah\Desktop\pdf_dot.
However, the above enivorment variable only works for Windows. If you want it to work Linux, you will have to use os.environ["HOME"] instead.
If you need to transfer between both systems, then you can use os.name:
from os import name
from os import environ
# Windows
if name == 'nt':
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
# Linux
elif name == 'posix':
basedir = join(environ["HOME"], "Desktop", "pdf_dot")
You don't need to specify the directory or create any folder manually. All you need do is run the following script. When the execution is done, you should get a folder named pdf_dot in your desktop containing the pdf files you wish to grab.
import requests
from bs4 import BeautifulSoup
import os
URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
dirf = os.environ['USERPROFILE'] + '\Desktop\pdf_dot'
if not os.path.exists(dirf):os.makedirs(dirf)
os.chdir(dirf)
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a",{"data-linktype":"internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
filename = f'{pdflink.split("/")[-1]}{".pdf"}'
with open(filename, 'wb') as f:
f.write(requests.get(pdflink).content)

Parsing the file name from list of url links

Ok so I am using a script that is downloading a files from urls listed in a urls.txt.
import urllib.request
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to lack of second argument in my urllib.request.urlretrieve function. As there are thousand of links in my text file naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&d‌​ocumentId=XXXXXX&xsl‌​FileName=rher2xml.xs‌​l&outputFileName=XXX‌​X_2017_06_25_4.xls where the name of the file comes after outputFileName=
Is there an easy way to parse the file names and then use them in urllib.request.urlretrieve function as secondary argument? I was thinking of extracting those names in excel and placing them in another text file that would be read in similar fashion as urls.txt but I'm not sure how to implement it in Python. Or is there a way to make it exclusively in python without using excel?
You could parse the link on the go.
Example using a regular expression:
import re
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
regexp = '((?<=\?outputFileName=)|(?<=\&outputFileName=))[^&]+'
match = re.search(regexp, link.rstrip())
if match is None:
# Make the user aware that something went wrong, e.g. raise exception
# and/or just print something
print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
else:
file_name = match.group(0)
urllib.request.urlretrieve(link, file_name)
You can use urlparse and parse_qs to get the query string
from urlparse import urlparse,parse_qs
parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0]) # prints Python

urllib: Get name of file from direct download link

Python 3. Probably need to use urllib to do this,
I need to know how to send a request to a direct download link, and get the name of the file it attempts to save.
(As an example, a KSP mod from CurseForge: https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download)
Of course, the file ID (2355387) will be changed. It could be from any project, but always on CurseForge. (If that makes a difference on the way it's downloaded.)
That example link results in the file:
How can I return that file name in Python?
Edit: I should note that I want to avoid saving the file, reading the name, then deleting it if possible. That seems like the worst way to do this.
Using urllib.request, when you request a response from a url, the response contains a reference to the url you are downloading.
>>> from urllib.request import urlopen
>>> url = 'https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download'
>>> response = urlopen(url)
>>> response.url
'https://addons-origin.cursecdn.com/files/2355/387/MechJeb2-2.6.0.0.zip'
You can use os.path.basename to get the filename:
>>> from os.path import basename
>>> basename(response.url)
'MechJeb2-2.6.0.0.zip'
from urllib import request
url = 'file download link'
filename = request.urlopen(request.Request(url)).info().get_filename()

Download csv file through python (url)

I work on a project and I want to download a csv file from a url. I did some research on the site but none of the solutions presented worked for me.
The url offers you directly to download or open the file of the blow I do not know how to say a python to save the file (it would be nice if I could also rename it)
But when I open the url with this code nothing happens.
import urllib
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
testfile = urllib.request.urlopen(url)
Any ideas?
Try this. Change "folder" to a folder on your machine
import os
import requests
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
response = requests.get(url)
with open(os.path.join("folder", "file"), 'wb') as f:
f.write(response.content)
You can adapt an example from the docs
import urllib.request
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
with urllib.request.urlopen(url) as testfile, open('dataset.csv', 'w') as f:
f.write(testfile.read().decode())

Metaprogramming Python Script for e-mail Capture

How can I modify the code below to capture all e-mails instead of images:
import urllib2
import re
from os.path import basename
from urlparse import urlsplit
url = "URL WITH IMAGES"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)
# download all images
for imgUrl in imgUrls:
try:
imgData = urllib2.urlopen(imgUrl).read()
fileName = basename(urlsplit(imgUrl)[2])
output = open(fileName,'wb')
output.write(imgData)
output.close()
except:
pass
Need to get a directory from an array of websites. I'm using C++ to create code for Unix by calling the .py file multiple times and then appending it to an existing file each time.
Parsing/validating email address requires a strong regex. You can look for those on google. I am showing you a simple email address parsing regex.
emails = re.findall('([a-zA-Z0-9\.]+#[a-zA-Z0-9]+\.[a-zA-Z]{2,3})', urlContent)
This is just a rudimentary example. You need to use a powerful one.

Categories

Resources