Set output location for python script - python

I want to save all images from a site. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com since in the image folder it just drops html files, and nothing as an extension.
I found a python script, the usage is like this:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
Finally I got the script running after some BeautifulSoup problems.
However, I can't find the files anywhere. I also tried "/" as the output dir in hope the images got on the root of my HD but no luck. Can someone either help me to simplify the script so it outputs at the cd directory set in terminal. Or give me a command that should work. I have zero python experience and I don't really want to learn python for a 2 year old script that maybe doesn't even work the way I want.
Also, how can I pass an array of website? With a lot of scrapers it gives me the first few results of the page. Tumblr has the load on scroll but that has no effect so i would like to add /page1 etc.
thanks in advance
# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
#from BeautifulSoup import BeautifulSoup # for HTML parsing
import bs4
from bs4 import BeautifulSoup
global urlList
urlList = []
# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
# do not go to other websites
global website
netloc = urlparse.urlsplit(url).netloc.split('.')
if netloc[-2] + netloc[-1] != website:
return
global urlList
if url in urlList: # prevent using the same URL again
return
try:
urlContent = urllib2.urlopen(url).read()
urlList.append(url)
print url
except:
return
soup = BeautifulSoup(''.join(urlContent))
# find and download all images
imgTags = soup.findAll('img')
for imgTag in imgTags:
imgUrl = imgTag['src']
# download only the proper image files
if imgUrl.lower().endswith('.jpeg') or \
imgUrl.lower().endswith('.jpg') or \
imgUrl.lower().endswith('.gif') or \
imgUrl.lower().endswith('.png') or \
imgUrl.lower().endswith('.bmp'):
try:
imgData = urllib2.urlopen(imgUrl).read()
if len(imgData) >= minFileSize:
print " " + imgUrl
fileName = basename(urlsplit(imgUrl)[2])
output = open(fileName,'wb')
output.write(imgData)
output.close()
except:
pass
print
print
# if there are links on the webpage then recursively repeat
if level > 0:
linkTags = soup.findAll('a')
if len(linkTags) > 0:
for linkTag in linkTags:
try:
linkUrl = linkTag['href']
downloadImages(linkUrl, level - 1, minFileSize)
except:
pass
# main
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
global website
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, 1, 50000)

As Frxstream has commented, this program creates the files in the current directory (i.e. where you run it). After running the program, run ls -l (or dir) to find the files it has created.
If it seemingly hasn't created any files, then most probably it really hasn't created any files, most probably because there was an exception which your except: pass has hidden. To see what was going wrong, replace try: ... except: pass with just ..., and rerun the program. (If you can't understand and fix that, ask a separate StackOverflow question.)

it's hard to tell without looking at the errors (+1 to turning off your try/except block so you can see the exceptions) but I do see one typo here:
fileName = basename(urlsplit(imgUrl)[2])
you didn't do "from urlparse import urlsplit" you have "import urlparse" so you need to refer to it as urlparse.urlsplit() as you have in other places, so should be like this
fileName = basename(urlparse.urlsplit(imgUrl)[2])

Related

Python3. How to save downloaded webpages to a specified dir?

I am trying to save all the < a > links within the python homepage into a folder named 'Downloaded pages'. However after 2 iterations through the for loop I receive the following error:
www.python.org#content <_io.BufferedWriter name='Downloaded
Pages/www.python.org#content'> www.python.org#python-network
<_io.BufferedWriter name='Downloaded
Pages/www.python.org#python-network'>
Traceback (most recent call last): File "/Users/Lucas/Python/AP book
exercise/Web Scraping/linkVerification.py", line 26, in
downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb') IsADirectoryError: [Errno 21]
Is a directory: 'Downloaded Pages/'
I am unsure why this happens as it appears the pages are being saved as due to seeing '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>', which says to me its the correct path.
This is my code:
import requests, os, bs4
# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)
# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status() # Check if the download was successful
soupObj = bs4.BeautifulSoup(res.text, 'html.parser') # Collects all text form the webpage
# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)
for i in range(numOfLinks):
linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
print(os.path.basename(linkUrlToOpen))
# save each downloaded page to the 'Downloaded pages' folder
downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
print(downloadedPage)
if linkElem == []:
print('Error, link does not work')
else:
for chunk in res.iter_content(100000):
downloadedPage.write(chunk)
downloadedPage.close()
Appreciate any advice, thanks.
The problem is that when you try to do things like parse the basename of a page with an .html dir it works, but when you try to do it with one that doesn't specify it on the url like "http://python.org/" the basename is actually empty (you can try printing first the url and then the basename bewteen brackets or something to see what i mean). So to work arround that, the easiest solution would be to use absolue paths like #Thyebri said.
And also, remember that the file you write cannot contain characters like '/', '\' or '?'
So, i dont know if the following code it's messy or not, but using the re library i would do the following:
filename = re.sub('[\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
So, first i remove part i remove the "https://" part, and then with the regular expressions library i replace all the usual symbols that are present in url links with a dash '-' and that is the name that will be given to the file.
Hope it works!

Web Scraping FileNotFoundError When Saving to PDF

I copy some Python code in order to download data from a website. Here is my specific website:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1
Here is the code which I copied:
import requests
from bs4 import BeautifulSoup
def _getUrls_(res):
hrefs = []
soup = BeautifulSoup(res.text, 'lxml')
main_content = soup.find('div',{'id' : 'content-core'})
table = main_content.find("table")
for a in table.findAll('a', href=True):
hrefs.append(a['href'])
return(hrefs)
bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)
def _getPdfs_(hrefs, basedir):
for i in range(len(hrefs)):
print(hrefs[i])
respdf = requests.get(hrefs[i])
pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
try:
with open(pdffile, 'wb') as p:
p.write(respdf.content)
p.close()
except FileNotFoundError:
print("No PDF produced")
basedir= "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)
The code runs successfully, but it did not download anything at all, even though there is no Filenotfounderror obviously.
I tried the following two URLs:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125
However both of these URLs return >>> No PDF produced.
The thing is that the code worked and downloaded successfully for other people, but not me.
Your code works I just tested. You need to make sure the basedir exists, you want to add this to your code:
if not os.path.exists(basedir):
os.makedirs(basedir)
I used this exact (indented) code but replaced the basedir with my own dir and it worked only after I made sure that the path actually exists. This code does not create the folder in case it does not exist.
As others have pointed out, you need to create basedir beforehand. The user running the script may not have the directory created. Make sure you insert this code at the beginning of the script, before the main logic.
Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the users %USERPROFILE% enviorment variable:
from os import envioron
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
Which would be the same as C:\Users\blah\Desktop\pdf_dot.
However, the above enivorment variable only works for Windows. If you want it to work Linux, you will have to use os.environ["HOME"] instead.
If you need to transfer between both systems, then you can use os.name:
from os import name
from os import environ
# Windows
if name == 'nt':
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
# Linux
elif name == 'posix':
basedir = join(environ["HOME"], "Desktop", "pdf_dot")
You don't need to specify the directory or create any folder manually. All you need do is run the following script. When the execution is done, you should get a folder named pdf_dot in your desktop containing the pdf files you wish to grab.
import requests
from bs4 import BeautifulSoup
import os
URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
dirf = os.environ['USERPROFILE'] + '\Desktop\pdf_dot'
if not os.path.exists(dirf):os.makedirs(dirf)
os.chdir(dirf)
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a",{"data-linktype":"internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
filename = f'{pdflink.split("/")[-1]}{".pdf"}'
with open(filename, 'wb') as f:
f.write(requests.get(pdflink).content)

how to download pics to a specific folder location on windows?

I have this script which download all images from a given web url address:
from selenium import webdriver
import urllib
class ChromefoxTest:
def __init__(self,url):
self.url=url
self.uri = []
def chromeTest(self):
# file_name = "C:\Users\Administrator\Downloads\images"
self.driver=webdriver.Chrome()
self.driver.get(self.url)
self.r=self.driver.find_elements_by_tag_name('img')
# output=open(file_name,'w')
for i, v in enumerate(self.r):
src = v.get_attribute("src")
self.uri.append(src)
pos = len(src) - src[::-1].index('/')
print src[pos:]
self.g=urllib.urlretrieve(src, src[pos:])
# output.write(src)
# output.close()
if __name__=='__main__':
FT=ChromefoxTest("http://imgur.com/")
FT.chromeTest()
my question is: how do i make this script to save all the pics to a specific folder location on my windows machine?
You need to specify the path where you want to save the file. This is explained in the documentation for urllib.urlretrieve:
The method is: urllib.urlretrieve(url[, filename[, reporthook[, data]]]).
And the documentation says:
The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).
So...
urllib.urlretrieve(src, 'location/on/my/system/foo.png')
Will save the image to the specified folder.
Also, consider reviewing the documentation for os.path. Those functions will help you manipulate file names and paths.
If you use the requests library you can slurp up really big image files (or small ones) efficiently and arrange to store them in a place of your choice in an obvious way.
Use this code and you'll get a nice picture of a beagle dog!
image_url is the link to the remote image.
file_path is where you want to store the image locally. It can include just a file name or a full path, at your option.
chunk_size is the size of the piece of the file to be downloaded with each slurp from the remote site.
length is the actual size of the piece that is written locally. Since I did this interactively I put this in mainly so that I wouldn't have to look at a long vertical stream of 1024s on my screen.
..
>>> import requests
>>> image_url = 'http://maxpixel.freegreatpicture.com/static/photo/1x/Eyes-Dog-Portrait-Animal-Familiar-Domestic-Beagle-2507963.jpg'
>>> file_path = r'c:\scratch\beagle.jpg'
>>> r = requests.get(image_url, stream=True)
>>> with open(file_path, 'wb') as beagle:
... for chunk in r.iter_content(chunk_size=1024):
... length = beagle.write(chunk)

Python: Writing script to scrape images off of HTTPS URL database

I was messing around in python 3.x yesterday, and i wanted to scrape all of the images off of a HTTPS website. This is the code I have so far
import urllib
import urllib.request
idnum = 190154
ur = 'https://skystorage.iscorp.com/pictures/IL/Lincolnway//%d' % idnum
url = ur + '.JPG?rev=0'
filename = str(idnum) + '.JPG'
idnum = idnum + 1
try: urllib.request.urlretrieve(url , filename)
except urllib.error.URLError as e:
print(e.reason)
This, however, is not working at all as planned, as the URL is HTTPS and urllib does not seem to support this. How would I be able to do something similarly to scrape the images?
man there is much work to do,but anyway I would like to help you .
first think you should know is that if you are in a html page,first you must create a list of the urls of the image that you would like to download ,to do this you can find helpful to know what Is a regular expression and know how to use RE library of python.
With the RE you can search in the html code the urls of the image.
Then made a method that save on your computer all of image that are in the list that you have been created before.
I hope I was helpful

Download a file from Web when file name change every time

I have a task to downoad McAfee virus definition file daily from download site locally. The name of the file changes every day. The path to download site is "http://download.nai.com/products/licensed/superdat/english/intel/7612xdat.exe"
This ID 7612 will change every day for something different hence I can't hard code it. I have to find a way to either list file name before providing it as argument or etc.
On stackoverflow site I found someone's script which will work for me if someone could advice how to handle changing file name.
Here is the script that I'm going to use:
def download(url):
"""Copy the contents of a file from a given URL
to a local file.
"""
import urllib
webFile = urllib.urlopen(url)
localFile = open(url.split('/')[-1], 'w')
localFile.write(webFile.read())
webFile.close()
localFile.close()
if __name__ == '__main__':
import sys
if len(sys.argv) == 2:
try:
download(sys.argv[1])
except IOError:
print 'Filename not found.'
else:
import os
print 'usage: %s http://server.com/path/to/filename' % os.path.basename(sys.argv[0])
Could someone advice me?
Thanks in advance
Its a two-step process. First scrape the index page looking for files. Second, grab the latest and download
import urllib
import lxml.html
import os
import shutil
# index page
pattern_files_url = "http://download.nai.com/products/licensed/superdat/english/intel"
# relative url references based here
pattern_files_base = '/'.join(pattern_files_url.split('/')[:-1])
# scrape the index page for latest file list
doc = lxml.html.parse(pattern_files_url)
pattern_files = [ref for ref in doc.xpath("//a/#href") if ref.endswith('xdat.exe')]
if pattern_files:
pattern_files.sort()
newest = pattern_files[-1]
local_name = newest.split('/')[-1]
# grab it if we don't already have it
if not os.path.exists(local_name):
url = pattern_files_base + '/' + newest
print("downloading %s to %s" % (url, local_name))
remote = urllib.urlopen(url)
print dir(remote)
with open(local_name, 'w') as local:
shutil.copyfileobj(remote, local, length=65536)

Categories

Resources