I have a task to downoad McAfee virus definition file daily from download site locally. The name of the file changes every day. The path to download site is "http://download.nai.com/products/licensed/superdat/english/intel/7612xdat.exe"
This ID 7612 will change every day for something different hence I can't hard code it. I have to find a way to either list file name before providing it as argument or etc.
On stackoverflow site I found someone's script which will work for me if someone could advice how to handle changing file name.
Here is the script that I'm going to use:
def download(url):
"""Copy the contents of a file from a given URL
to a local file.
"""
import urllib
webFile = urllib.urlopen(url)
localFile = open(url.split('/')[-1], 'w')
localFile.write(webFile.read())
webFile.close()
localFile.close()
if __name__ == '__main__':
import sys
if len(sys.argv) == 2:
try:
download(sys.argv[1])
except IOError:
print 'Filename not found.'
else:
import os
print 'usage: %s http://server.com/path/to/filename' % os.path.basename(sys.argv[0])
Could someone advice me?
Thanks in advance
Its a two-step process. First scrape the index page looking for files. Second, grab the latest and download
import urllib
import lxml.html
import os
import shutil
# index page
pattern_files_url = "http://download.nai.com/products/licensed/superdat/english/intel"
# relative url references based here
pattern_files_base = '/'.join(pattern_files_url.split('/')[:-1])
# scrape the index page for latest file list
doc = lxml.html.parse(pattern_files_url)
pattern_files = [ref for ref in doc.xpath("//a/#href") if ref.endswith('xdat.exe')]
if pattern_files:
pattern_files.sort()
newest = pattern_files[-1]
local_name = newest.split('/')[-1]
# grab it if we don't already have it
if not os.path.exists(local_name):
url = pattern_files_base + '/' + newest
print("downloading %s to %s" % (url, local_name))
remote = urllib.urlopen(url)
print dir(remote)
with open(local_name, 'w') as local:
shutil.copyfileobj(remote, local, length=65536)
Related
I copy some Python code in order to download data from a website. Here is my specific website:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1
Here is the code which I copied:
import requests
from bs4 import BeautifulSoup
def _getUrls_(res):
hrefs = []
soup = BeautifulSoup(res.text, 'lxml')
main_content = soup.find('div',{'id' : 'content-core'})
table = main_content.find("table")
for a in table.findAll('a', href=True):
hrefs.append(a['href'])
return(hrefs)
bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)
def _getPdfs_(hrefs, basedir):
for i in range(len(hrefs)):
print(hrefs[i])
respdf = requests.get(hrefs[i])
pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
try:
with open(pdffile, 'wb') as p:
p.write(respdf.content)
p.close()
except FileNotFoundError:
print("No PDF produced")
basedir= "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)
The code runs successfully, but it did not download anything at all, even though there is no Filenotfounderror obviously.
I tried the following two URLs:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125
However both of these URLs return >>> No PDF produced.
The thing is that the code worked and downloaded successfully for other people, but not me.
Your code works I just tested. You need to make sure the basedir exists, you want to add this to your code:
if not os.path.exists(basedir):
os.makedirs(basedir)
I used this exact (indented) code but replaced the basedir with my own dir and it worked only after I made sure that the path actually exists. This code does not create the folder in case it does not exist.
As others have pointed out, you need to create basedir beforehand. The user running the script may not have the directory created. Make sure you insert this code at the beginning of the script, before the main logic.
Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the users %USERPROFILE% enviorment variable:
from os import envioron
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
Which would be the same as C:\Users\blah\Desktop\pdf_dot.
However, the above enivorment variable only works for Windows. If you want it to work Linux, you will have to use os.environ["HOME"] instead.
If you need to transfer between both systems, then you can use os.name:
from os import name
from os import environ
# Windows
if name == 'nt':
basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")
# Linux
elif name == 'posix':
basedir = join(environ["HOME"], "Desktop", "pdf_dot")
You don't need to specify the directory or create any folder manually. All you need do is run the following script. When the execution is done, you should get a folder named pdf_dot in your desktop containing the pdf files you wish to grab.
import requests
from bs4 import BeautifulSoup
import os
URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
dirf = os.environ['USERPROFILE'] + '\Desktop\pdf_dot'
if not os.path.exists(dirf):os.makedirs(dirf)
os.chdir(dirf)
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a",{"data-linktype":"internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
filename = f'{pdflink.split("/")[-1]}{".pdf"}'
with open(filename, 'wb') as f:
f.write(requests.get(pdflink).content)
I want to save all images from a site. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com since in the image folder it just drops html files, and nothing as an extension.
I found a python script, the usage is like this:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
Finally I got the script running after some BeautifulSoup problems.
However, I can't find the files anywhere. I also tried "/" as the output dir in hope the images got on the root of my HD but no luck. Can someone either help me to simplify the script so it outputs at the cd directory set in terminal. Or give me a command that should work. I have zero python experience and I don't really want to learn python for a 2 year old script that maybe doesn't even work the way I want.
Also, how can I pass an array of website? With a lot of scrapers it gives me the first few results of the page. Tumblr has the load on scroll but that has no effect so i would like to add /page1 etc.
thanks in advance
# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
#from BeautifulSoup import BeautifulSoup # for HTML parsing
import bs4
from bs4 import BeautifulSoup
global urlList
urlList = []
# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
# do not go to other websites
global website
netloc = urlparse.urlsplit(url).netloc.split('.')
if netloc[-2] + netloc[-1] != website:
return
global urlList
if url in urlList: # prevent using the same URL again
return
try:
urlContent = urllib2.urlopen(url).read()
urlList.append(url)
print url
except:
return
soup = BeautifulSoup(''.join(urlContent))
# find and download all images
imgTags = soup.findAll('img')
for imgTag in imgTags:
imgUrl = imgTag['src']
# download only the proper image files
if imgUrl.lower().endswith('.jpeg') or \
imgUrl.lower().endswith('.jpg') or \
imgUrl.lower().endswith('.gif') or \
imgUrl.lower().endswith('.png') or \
imgUrl.lower().endswith('.bmp'):
try:
imgData = urllib2.urlopen(imgUrl).read()
if len(imgData) >= minFileSize:
print " " + imgUrl
fileName = basename(urlsplit(imgUrl)[2])
output = open(fileName,'wb')
output.write(imgData)
output.close()
except:
pass
print
print
# if there are links on the webpage then recursively repeat
if level > 0:
linkTags = soup.findAll('a')
if len(linkTags) > 0:
for linkTag in linkTags:
try:
linkUrl = linkTag['href']
downloadImages(linkUrl, level - 1, minFileSize)
except:
pass
# main
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
global website
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, 1, 50000)
As Frxstream has commented, this program creates the files in the current directory (i.e. where you run it). After running the program, run ls -l (or dir) to find the files it has created.
If it seemingly hasn't created any files, then most probably it really hasn't created any files, most probably because there was an exception which your except: pass has hidden. To see what was going wrong, replace try: ... except: pass with just ..., and rerun the program. (If you can't understand and fix that, ask a separate StackOverflow question.)
it's hard to tell without looking at the errors (+1 to turning off your try/except block so you can see the exceptions) but I do see one typo here:
fileName = basename(urlsplit(imgUrl)[2])
you didn't do "from urlparse import urlsplit" you have "import urlparse" so you need to refer to it as urlparse.urlsplit() as you have in other places, so should be like this
fileName = basename(urlparse.urlsplit(imgUrl)[2])
I'm developing an application that runs on an Apache server with Django framework. My current script works fine when it runs on the local desktop (without Django). The script downloads all the images from a website to a folder on the desktop. However, when I run the script on the server a file object is just create by Django that apparently has something in it (should be google's logo), however, I can't open up the file. I also create an html file, updated image link locations, but the html file gets created fine, I'm assuming because it's all text, maybe? I believe I may have to use a file wrapper somewhere, but I'm not sure. Any help is appreciated, below is my code, Thanks!
from django.http import HttpResponse
from bs4 import BeautifulSoup as bsoup
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys
import zipfile
from django.core.servers.basehttp import FileWrapper
def getdata(request):
out = 'C:\Users\user\Desktop\images'
if request.GET.get('q'):
#url = str(request.GET['q'])
url = "http://google.com"
soup = bsoup(urlopen(url))
parsedURL = list(urlparse.urlparse(url))
for image in soup.findAll("img"):
print "Old Image Path: %(src)s" % image
#Get file name
filename = image["src"].split("/")[-1]
#Get full path name if url has to be parsed
parsedURL[2] = image["src"]
image["src"] = '%s\%s' % (out,filename)
print 'New Path: %s' % image["src"]
# print image
outpath = os.path.join(out, filename)
#retrieve images
if image["src"].lower().startswith("http"):
urlretrieve(image["src"], outpath)
else:
urlretrieve(urlparse.urlunparse(parsedURL), out) #Constructs URL from tuple (parsedURL)
#Create HTML File and writes to it to check output (stored in same directory).
html = soup.prettify("utf-8")
with open("output.html", "wb") as file:
file.write(html)
else:
url = 'You submitted nothing!'
return HttpResponse(url)
My problem had to do with storing the files on the desktop. I stored the files in the DJango workspace folder, changed the paths, and it worked for me.
I am looking for a faster way to do my task. i have 40000 file downloadable urls. I would like to download them in local desktop is.now the thought is currently what I am doing is placing the link on the browser and then download them via a script.now what I am looking for is to place 10 urls in a chunk to the address bar and get the 10 files to be downloaded at the same time.If it possible hope overall time will be decreased.
Sorry I was late to give the code,here it is :
def _download_file(url, filename):
"""
Given a URL and a filename, this method will save a file locally to theĀ»
destination_directory path.
"""
if not os.path.exists(destination_directory):
print 'Directory [%s] does not exist, Creating directory...' % destination_directory
os.makedirs(destination_directory)
try:
urllib.urlretrieve(url, os.path.join(destination_directory, filename))
print 'Downloading File [%s]' % (filename)
except:
print 'Error Downloading File [%s]' % (filename)
def _download_all(main_url):
"""
Given a URL list, this method will download each file in the destination
directory.
"""
url_list = _create_url_list(main_url)
for url in url_list:
_download_file(url, _get_file_name(url))
Thanks,
Why use a browser? This seems like an XY problem.
To download files, I'd use a library like requests (or make a system call to wget).
Something like this:
import requests
def download_file_from_url(url, file_save_path):
r = requests.get(url)
if r.ok: # checks if the download succeeded
with file(file_save_path, 'w') as f:
f.write(r.content)
return True
else:
return r.status_code
download_file_from_url('http://imgs.xkcd.com/comics/tech_support_cheat_sheet.png', 'new_image.png')
# will download image and save to current directory as 'new_image.png'
You first have to install requests using whatever python package manager you prefer e.g., pip install requests. You can also get fancier; e.g.,
I am trying to download a zip file to a local drive and extract all files to a destination folder.
so i have come up with solution but it is only to "download" a file from a directory to another directory but it doesn't work for downloading files. for the extraction, I am able to get it to work in 2.6 but not for 2.5. so any suggestions for the work around or another approach I am definitely open to.
thanks in advance.
######################################
'''this part works but it is not good for URl links'''
import shutil
sourceFile = r"C:\Users\blueman\master\test2.5.zip"
destDir = r"C:\Users\blueman\user"
shutil.copy(sourceFile, destDir)
print "file copied"
######################################################
'''extract works but not good for version 2.5'''
import zipfile
GLBzipFilePath =r'C:\Users\blueman\user\test2.5.zip'
GLBextractDir =r'C:\Users\blueman\user'
def extract(zipFilePath, extractDir):
zip = zipfile(zipFilePath)
zip.extractall(path=extractDir)
print "it works"
extract(GLBzipFilePath,GLBextractDir)
######################################################
urllib.urlretrieve can get a file (zip or otherwise;-) from a URL to a given path.
extractall is indeed new in 2.6, but in 2.5 you can use an explicit loop (get all names, open each name, etc). Do you need example code?
So here's the general idea (needs more try/except if you want to give a nice error message in each and every case which could go wrong, of which, of course, there are a million variants -- I'm only using a couple of such cases as examples...):
import os
import urllib
import zipfile
def getunzipped(theurl, thedir):
name = os.path.join(thedir, 'temp.zip')
try:
name, hdrs = urllib.urlretrieve(theurl, name)
except IOError, e:
print "Can't retrieve %r to %r: %s" % (theurl, thedir, e)
return
try:
z = zipfile.ZipFile(name)
except zipfile.error, e:
print "Bad zipfile (from %r): %s" % (theurl, e)
return
for n in z.namelist():
dest = os.path.join(thedir, n)
destdir = os.path.dirname(dest)
if not os.path.isdir(destdir):
os.makedirs(destdir)
data = z.read(n)
f = open(dest, 'w')
f.write(data)
f.close()
z.close()
os.unlink(name)
For downloading, look at urllib:
import urllib
webFile = urllib.urlopen(url)
For unzipping, use zipfile. See also this example.
The shortest way i've found so far, is to use +alex answer, but with ZipFile.extractall() instead of the loop:
from zipfile import ZipFile
from urllib import urlretrieve
from tempfile import mktemp
filename = mktemp('.zip')
destDir = mktemp()
theurl = 'http://www.example.com/file.zip'
name, hdrs = urlretrieve(theurl, filename)
thefile=ZipFile(filename)
thefile.extractall(destDir)
thefile.close()