I learned recently that you can use wget -r -P ./pdfs -A pdf http://example.com/ to recursively download pdf files from a website. However this is not cross-platform as Windows doesn't have wget. I want to use Python to achieve the same thing. The only solutions I've seen are non-recursive - e.g. https://stackoverflow.com/a/54618327/3042018
I would also like to be able to just get the names of the files without downloading so I can check if a file has already been downloaded.
There are so many tools available in Python. What is a good solution here? Should I use one of the "mainstream" packages like scrapy or selenium or maybe just requests? Which is the most suitable for this task please, and how do I implement it?
You can try several more ways, maybe you can find the right one for you. Here's an example.
If it's just a single download, you can use the following methods.
from simplified_scrapy import req, utils
res = req.get("http://example.com/xxx.pdf")
path = "./pdfs/xxx.pdf"
utils.saveResponseAsFile(res, path)
If you need to download the page first, and then extract the PDF link from the page, you can use the following method。
import os, sys
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils
class MySpider(Spider):
name = 'download_pdf'
start_urls = ["http://example.com/"] # Entry page
def __init__(self):
Spider.__init__(self, self.name) #necessary
if (not os.path.exists('./pdfs')):
os.mkdir('./pdfs')
def afterResponse(self, response, url, error=None, extra=None):
try:
path = './pdfs' + url[url.rindex('/'):]
index = path.find('?')
if index > 0: path = path[:index]
flag = utils.saveResponseAsFile(response, path, fileType="pdf")
if flag:
return None
else: # If it's not a pdf, leave it to the frame
return Spider.afterResponse(self, response, url, error)
except Exception as err:
print(err)
def extract(self, url, html, models, modelNames):
doc = SimplifiedDoc(html)
lst = doc.selects('a').containsReg(".*.pdf", attr="href")
for a in lst:
a["url"] = utils.absoluteUrl(url.url, a["href"])
return {"Urls": lst, "Data": None}
SimplifiedMain.startThread(MySpider()) # Start download
Related
I'm trying to do an NLP task. For that purpose I need a considerable amount of Readme.md files from GitHub. This is what I am trying to do:
For a given number n, I want to list the first n GitHub repositories (And Their URLs) based on the number of their stars.
I want to download the Readme.md file from those URLs.
I want to save the Readme.md Files on my hard drive, each in a separate folder. The folder name should be the name of the repository.
I'm not acquainted with crawling and web scraping, but I am relatively good with python. I'll be thankful if you can give me some help on how to accomplish this steps. Any help would be appreciated.
My effort: I've searched a little, and I found a website (gitstar-ranking.com) that ranks GitHub repos based on their stars. But that does not solve my problem because it is again a scraping task to get the name or the URL of those repos from this website.
Here's my attempt using the suggestion from #Luke. I changed the minimum stars to 500 since we don't need 5 million results (>500 still yields 66513 results).
You might not need the ssl workaround on lines 29-30, but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase but nothing else. It saves the file as README.md (uppercase) but this can be adjusted by using the actual filename.
import urllib.request
import json
import ssl
import os
import time
n = 5 # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:%3E500&sort=stars'
request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)
repos = api_json['items'][:n]
for repo in repos:
full_name = repo['full_name']
print('fetching readme from', full_name)
# find readme url (case senitive)
contents_url = repo['url'] + '/contents'
request = urllib.request.urlopen(contents_url)
page = request.read().decode()
contents_json = contents_json = json.loads(page)
readme_url = [file['download_url'] for file in contents_json if file['name'].lower() == 'readme.md'][0]
# download readme contents
try:
context = ssl._create_unverified_context() # prevent ssl problems
request = urllib.request.urlopen(readme_url, context=context)
except urllib.error.HTTPError as error:
print(error)
continue # if the url can't be opened, there's no use to try to download anything
readme = request.read().decode()
# create folder named after repo's name and save readme.md there
try:
os.mkdir(repo['name'])
except OSError as error:
print(error)
f = open(repo['name'] + '/README.md', 'w', encoding="utf-8")
f.write(readme)
print('ok')
# only 10 requests per min for unauthenticated requests
if n >= 9: # n + 1 initial request
time.sleep(6)
Python - 2.7.5
Google Chrome
First off I am self taught coder and will accept any critique and/or suggestions to any of my posted codes below. This issue has been a joy to work through because I love challenging myself but I am afraid I have hit a brick wall and need some guidance. I will be as detailed as possible below to fully explain the overall picture of my script and then show where I am at with the actual issue that is explained in the title.
I am putting together a script that will go out and download data automatically, upzip, and export to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing large amount of public data that we have to go out and search and update for our end users. Most of our data is updated monthly by local government entities and we have to go out and search for the data manually, download, unzip, QAQC, etc. I am wanting to put a script a together that will automate the first part of this process by going out and downloading all my data for me and exporting to a local GDB, from there I can QAQC everything and upload to our SDE for our users to access.
The process has been pretty straight forward so far until I got to this issue I have before me. My script will search a webpage for specific keywords and find the relevant link and begin the download. For this post I will use two examples, one that works and one that is currently giving me issues. What works is my function for searching and downloading the Metro GIS dataset and below shows my current process for finding this. So far all http websites I have included will use the posted function below. Like Metro is being shown I plan on having a defined function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup
arcpy.env.overwriteOutput = True
workPath = -- #The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"
class global_DataFinder(object):
def __init__(self):
object.__init__(self)
self.gdbSetup()
self.metro()
def gdbSetup(self):
arcpy.CreateFileGDB_management(workPath, gdbName)
def fileDownload(self, key, url, dlPath, dsName):
page = urllib2.urlopen(url).read()
urlList = []
soup = BeautifulSoup(page)
soup.prettify()
for link in soup.findAll('a', href = True):
if not 'http://' in link['href']:
if urlparse.urljoin(url, link['href']) not in urlList:
zipDL = urlparse.urljoin(url, link['href'])
if zipDL.endswith(".zip"):
if key in zipDL:
urlList.append(zipDL)
for x in urlList:
print x
r = requests.get(x, stream=True)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
z.extractall(dlPath)
arcpy.CreateFeatureDataset_management(gdbPath, dsName)
arcpy.env.workspace = dlPath
shpList = []
for shp in arcpy.ListFeatureClasses():
shpList.append(shp)
arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
del shpList[:]
def metro(self):
key = "METRO_GIS_Data_Layers"
url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
dlPath = -- *#Where my zipfiles output to*
dsName = "Metro"
self.fileDownload(key, url, dlPath, dsName)
global_DataFinder()
As you can see above this is the method I started with using Metro as my first testing point and this is currently working great. I was hoping all my sites going forward would like this but when I got to FEMA I ran into an issue.
The website National Flood Hazard Layer (NFHL) Status hosts floodplain data for many counties across the country is available for free to any who wish to use it. When arriving at the website you will see that you can search for the county you want, then the table queries out the search, then you can simply click and download the county you desire. When checking the source this is what I came across and noticed its in an iframe.
When accessing the iframe source link through Chrome and checking the png source url this is what you get - https://hazards.fema.gov/femaportal/NFHL/searchResult
Now here is where my problem lies, unlike http sites I have quickly learned that accessing a secured https site and scraping the page is different especially when its using javascript to show the table. I have spent hours searching through forums and tried different python packages like selenium, mechanize, requests, urllib, urllib2, and I seem to always hit a dead-end before I can securely establish a connection and parse the webpage and search for my counties zipfile. The code below shows the closest I have gotten and shows the error code I am getting.
(I always test in a separate script and then when it works I bring it over to my main script, so thats why this code snippet below is separated from my original)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup
url = "http://www.floodmaps.fema.gov/NFHL/status.shtml"
def test():
page = urllib2.urlopen(url).read()
urlList = []
soup = BeautifulSoup(page)
soup.prettify()
for link in soup.findAll("iframe", src=True):
r = urllib2.urlopen(link['src'])
iFrame = link['src']
print iFrame
def connect_patched(self):
"Connect to a host on a given (SSL) port."
sock = socket.create_connection((self.host, self.port),
self.timeout, self.source_address)
if self._tunnel_host:
self.sock = sock
self._tunnel()
self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
ssl_version=ssl.PROTOCOL_SSLv2)
httplib.HTTPSConnection.connect = connect_patched
test()
Error I get when running this test
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
I am hoping a more experienced coder can see what I have done and tell me if my current methods are the way to go and if so how to get past this final error and parse the datatable properly.
Working Edits with #crmackey
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
# disable ssl warnings (we are not verifying SSL certificates at this time...future ehnancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
requests.packages.urllib3.disable_warnings(warning)
def download_zips(out_path):
url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
pq = PyQuery(requests.get(url, verify=False).content) #verify param important for SSL
src = pq.find('iframe').attr('src')
pq = PyQuery(requests.get(src, verify=False).content)
table = pq.find('table')
for a in table.find('a'):
href = a.attrib.get('href')
print href
url = '/'.join([download_prefix, href])
print url
r = requests.get(url, stream=True, verify=False)
out_zip = os.path.join(out_path, href.split('=')[-1])
with open(out_zip, 'wb') as f:
for chunk in r.iter_content(1024 *16): #grab 1KB at a time
if chunk:
f.write(chunk)
print 'downloaded zip: "{}"'.format(href.split('=')[-1])
out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
All I added was the httplib and changed the HTTPConnection at the top. That allowed to me connect to the site using your script. Now here is the current problem. I am only getting 1 zip file in my out_path, and the zip file is empty. I checked the printed source in the debug window and its showing its trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table so it looks like its trying but its not downloading anything. After it outputs that one empty zip file the script finishes and brings up no further error messages. I temporarily removed your lines that unzipped the file because they were returning an error since the folder was empty.
I was able to get the zip files downloaded by using the requests module and also opted for using PyQuery instead of Beautiful Soup. I think the issue you were facing has to do with the SSL cert validation, where the requests module will allow you to skip checking the certificate if you set the verify parameter to False.
The function below will download all the zip files and unzip them, from there, you can import the shapefiles into your geodatabase:
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
# disable ssl warnings (we are not verifying SSL certificates at this time...future ehnancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
requests.packages.urllib3.disable_warnings(warning)
def download_zips(out_path):
url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
pq = PyQuery(requests.get(url, verify=False).content) #verify param important for SSL
src = pq.find('iframe').attr('src')
pq = PyQuery(requests.get(src, verify=False).content)
table = pq.find('table')
for a in table.find('a'):
href = a.attrib.get('href')
url = '/'.join([download_prefix, href])
r = requests.get(url, stream=True, verify=False)
out_zip = os.path.join(out_path, href.split('=')[-1])
with open(out_zip, 'wb') as f:
for chunk in r.iter_content(1024 *16): #grab 1KB at a time
if chunk:
f.write(chunk)
print 'downloaded zip: "{}"'.format(href.split('=')[-1])
# do more stuff like unzip?
unzipped = out_zip.split('.zip')[0]
with zipfile.Zipfile(out_zip, 'r') as f:
f.extractall(unzipped)
I am trying to fetch and parse an XML file into a databse. The XML is compressed in GZIP. The GZIP file is ~8MB. When I run the code locally the memory on pythonw.exe builds up to level where the entire system (Windows 7) stops responding, and when I run it online it exceeds the memory limit on Google App Engine. Not sure if the file is too big or if I am doing something wrong. Any help would be very much appreciated!
from google.appengine.ext import webapp
from google.appengine.api.urlfetch import fetch
from xml.dom.minidom import parseString
import gzip
import base64
import StringIO
class ParseCatalog(webapp.RequestHandler):
user = xxx
password = yyy
catalog = fetch('url',
headers={"Authorization":
"Basic %s" % base64.b64encode(user + ':' + password)}, deadline=600)
xmlstring = StringIO.StringIO(catalog.content)
gz = gzip.GzipFile(fileobj=xmlstring)
gzcontent = gz.read()
contentxml = parseString(gzcontent)
items = contentxml.getElementsByTagName("Product")
for item in items:
item = DatabaseEntry()
item.name = str(coupon.getElementsByTagName("Manufacturer")[0].firstChild.data)
item.put()
UPDATE
So I tried to follow BasicWolf's suggestion to switch to LXML but am having problems importing it. I downloaded the LXML 2.3 library and put it in the folder of my app (I know this is not ideal, but it's the only way I know how to include a 3rd party library). Also, I added following to my app.yaml:
libraries:
- name: lxml
version: "2.3"
Then I wrote the following code to test if it parses:
import lxml
class ParseCatalog(webapp.RequestHandler):
user = xxx
password = yyy
catalog = fetch('url',
headers={"Authorization":
"Basic %s" % base64.b64encode(user + ':' + password)}, deadline=600)
items = etree.iterparse(catalog.content)
def get(self):
for elem in items:
self.response.out.write(str(elem.tag))
However this is resulting in the following error:
ImportError: cannot import name etree
I have checked other questions on this error and it seems that the fact that I run on Windows 7 might play a role. I also tried to install the pre-compiled binary packages from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml but that didn't change anything either.
What do you expect? First, you read a string into the memory, then - unzip it into the memory, then - construct a DOM tree, still in the memory.
Here are some improvements:
del every buffer variable the moment you don't need it.
Get rid of DOM XML parser, use event-driven LXML to save memory.
Same thing asked 2.5 years ago in Downloading a web page and all of its resource files in Python but doesn't lead to an answer and the 'please see related topic' isn't really asking the same thing.
I want to download everything on a page to make it possible to view it just from the files.
The command
wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows
does exactly that I need. However we want to be able to tie it in with other stuff that must be portable, so requires it to be in Python.
I've been looking at Beautiful Soup, scrapy, various spiders posted around the place, but these all seem to deal with getting data/links in clever but specific ways. Using these to do what I want seems like it will require a lot of work to deal with finding all of the resources, when I'm sure there must be an easy way.
thanks very much
You should be using an appropriate tool for the job at hand.
If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs that feature, and because wget does its job so well, nobody has bothered to try to write a python library to replace it.
Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?
My colleague wrote up this code, lots pieced together from other sources I believe. Might have some specific quirks for our system but it should help anyone wanting to do the same
"""
Downloads all links from a specified location and saves to machine.
Downloaded links will only be of a lower level then links specified.
To use: python downloader.py link
"""
import sys,re,os,urllib2,urllib,urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')
def main():
link_list = []##create a list of all found links so there are no duplicates
restrict = sys.argv[1]##used to restrict found links to only have lower level
link_list.append(restrict)
parent_folder = restrict.rfind('/', 0, len(restrict)-1)
##a.com/b/c/d/ make /d/ as parent folder
while 1:
try:
crawling = tocrawl.pop()
#print crawling
except KeyError:
break
url = urlparse.urlparse(crawling)##splits url into sections
try:
response = urllib2.urlopen(crawling)##try to open the url
except:
continue
msg = response.read()##save source of url
links = linkregex.findall(msg)##search for all href in source
links = links + linksrc.findall(msg)##search for all src in source
for link in (links.pop(0) for _ in xrange(len(links))):
if link.startswith('/'):
##if /xxx a.com/b/c/ -> a.com/b/c/xxx
link = 'http://' + url[1] + link
elif ~link.find('#'):
continue
elif link.startswith('../'):
if link.find('../../'):##only use links that are max 1 level above reference
##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
parent_pos = url[2].rfind('/')
parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
parent_url = url[2][:parent_pos]
new_link = link.find('/')+1
link = link[new_link:]
link = 'http://' + url[1] + parent_url + link
else:
continue
elif not link.startswith('http'):
if url[2].find('.html'):
##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
a = url[2].rfind('/')+1
parent = url[2][:a]
link = 'http://' + url[1] + parent + link
else:
##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
link = 'http://' + url[1] + url[2] + link
if link not in link_list:
link_list.append(link)##add link to list of already found links
if (~link.find(restrict)):
##only grab links which are below input site
print link ##print downloaded link
tocrawl.add(link)##add link to pending view links
file_name = link[parent_folder+1:]##folder structure for files to be saved
filename = file_name.rfind('/')
folder = file_name[:filename]##creates folder names
folder = os.path.abspath(folder)##creates folder path
if not os.path.exists(folder):
os.makedirs(folder)##make folder if it does not exist
try:
urllib.urlretrieve(link, file_name)##download the link
except:
print "could not download %s"%link
else:
continue
if __name__ == "__main__":
main()
thanks for the replies
I have found a python html parser that builds a dom like structure
for html sources it seems easy to use and very fast. i'm trying to write a scraper for codepad.org that retrieves the last ten posts from
http://codepad.org/recent
The EHP lib is at https://github.com/iogf/ehp
i have this code below that is working.
import requests
from ehp import Html
def catch_refs(data):
html = Html()
dom = html.feed(data)
return [ind.attr['href']
for ind in dom.find('a')
if 'view' in ind.text()]
def retrieve_source(refs, dir):
"""
Get the source code of the posts then save in a dir.
"""
pass
if __name__ == '__main__':
req = requests.get('http://codepad.org/recent')
refs = catch_refs(req.text)
retrieve_source(refs, '/tmp/')
print refs
it outputs:
[u'http://codepad.org/aQGNiQ6t',
u'http://codepad.org/HMrG1q7t',
u'http://codepad.org/zGBMaKoZ', ...]
as expected but i cant figure out how to download the source code of the files.
Actually your retrieve_source(refs, dir) don't do anything.
So you are not getting any result.
Update according to your comment:
import os
def get_code_snippet(page):
dom = Html().feed(page)
# getting all <div class=='highlight'>
elements = [el for el in dom.find('div')
if el.attr['class'] == 'highlight']
return elements[1].text()
def retrieve_source(refs, dir):
for i, ref in enumerate(refs):
with open(os.path.join(dir, str(i) + '.html'), 'w') as r:
r.write(get_code_snippet(requests.get(ref).content))