Script downloads data files but I can't stop the script - python

Description of code
My script below works fine. It basically just finds all the data files that I'm interested in on a given website, checks to see if they are already on my computer (and skips them if they are), and lastly downloads them to my computer using cURL.
The problem
The problem I'm having is that sometimes there are 400+ very large files and I can't download them all at the same time. I'll press Ctrl-C, but it seems to cancel the current cURL download rather than the script, so I end up needing to cancel all the downloads one by one. Is there a way around this? Maybe some key command that will let me stop at the end of the current download?
#!/usr/bin/python
import os
import urllib2
import re
import timeit

filenames = []
savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"

#connect to a URL
website = urllib2.urlopen("http://somewebsite")

#read html code
html = website.read()

#use re.findall to get all the data files
filenames = re.findall('SP.*?\.mat', html)

#The following chunk of code checks to see if the files are already downloaded
#and deletes them from the download queue if they are.
count = 0
countpass = 0
for files in os.listdir(savedir):
    if files.endswith(".mat"):
        try:
            filenames.remove(files)
            count += 1
        except ValueError:
            countpass += 1

print "counted number of removes", count
print "counted number of failed removes", countpass
print "number files less removed:", len(filenames)

#saves the file names into an array of html links
links = len(filenames)*[0]
for j in range(len(filenames)):
    links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/' + filenames[j]

for i in range(len(links)):
    os.system("curl -o " + filenames[i] + " " + str(links[i]))

print "links downloaded:", len(links)

You could always check the file size using curl before downloading it:
import subprocess, sys

def get_file_size(url):
    """
    Gets the file size of a URL using curl.

    @param url: The URL to obtain information about.
    @return: The file size, as an integer, in bytes.
    """
    # Get the file size in bytes
    p = subprocess.Popen(('curl', '-sI', url), stdout=subprocess.PIPE)
    for s in p.stdout.readlines():
        if 'Content-Length' in s:
            file_size = int(s.strip().split()[-1])
    return file_size

# Your configuration parameters
url = ...       # URL that you want to download
max_size = ...  # Max file size in bytes

# Now you can do a simple check to see if the file size is too big
if get_file_size(url) > max_size:
    sys.exit()

# Or you could do something more advanced
bytes = get_file_size(url)
if bytes > max_size:
    s = raw_input('File is {0} bytes. Do you wish to download? '
                  '(yes, no) '.format(bytes))
    if s.lower() == 'yes':
        pass  # Add download code here....
    else:
        sys.exit()

Related

Script to convert multiple URLs or files to individual PDFs and save to a specific location

I have written a script with the URLs and their filenames hardcoded. Instead, I want to take the URLs from a saved text file and generate their filenames automatically, in order, saving them to a specific folder.
My code (works):
import requests

#input urls and filenames
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc']
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

#defining the inputs
inputs = zip(urls, fns)

#define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

#loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
Trying to fetch the links from text (error in code):
import requests

# getting all URLs from textfile
file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
#for each_url in enumerate(f):
list_of_urls = [(line.strip()).split() for line in file]
file.close()

#input urls and filenames
urls = list_of_urls
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

#defining the inputs
inputs = zip(urls, fns)

#define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

#loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
testing.txt has those 3 links pasted in it, one per line.
Error :
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1979.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1980.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1981.nc']"
PS:
I am new to python and it would be helpful if someone could advise me on how to loop through the URLs from a text file and save the files individually, in order, as opposed to hardcoding the names (as I have done).
When you do list_of_urls = [(line.strip()).split() for line in file], you produce a list of lists. (For each line of the file, you produce the list of urls in this line, and then you make a list of these lists)
What you want is a list of urls.
You could do
list_of_urls = [url for line in file for url in (line.strip()).split()]
Or :
list_of_urls = []
for line in file:
    list_of_urls.extend((line.strip()).split())
By far the simplest method in this simple case is to use an OS command:
go to the working directory C:\Users\HBI8\Downloads
invoke cmd (you can simply type that in the address bar)
write/paste your list using >notepad testing.txt (if you don't already have it there)
Note: NC (HDF) files are NOT .pdf
https://www.northwestknowledge.net/metdata/data/pr_1979.nc
https://www.northwestknowledge.net/metdata/data/pr_1980.nc
https://www.northwestknowledge.net/metdata/data/pr_1981.nc
then run
for /F %i in (testing.txt) do curl -O %i
92 seconds later
I have inserted ',' as a delimiter and used the split function. To assign the file names automatically I used the index numbers of the stored list.
The data is saved in the txt file in the following format:
FileName | Object ID | Base URL
url_file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
fns = []
list_of_urls = []
for line in url_file:
    # strip the trailing newline before splitting, so the URL parts stay clean
    stripped_line = line.strip().split(',')
    print(stripped_line)
    list_of_urls.append(stripped_line[2] + stripped_line[1])
    fns.append(stripped_line[0])
url_file.close()
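A hedged sketch of how the rebuilt lists could then feed the same requests-based loop from the working version above; the single URL and path here are placeholders standing in for list_of_urls and fns.

import requests

# Placeholders for the list_of_urls / fns built above
list_of_urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc']
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc']

for url, fn in zip(list_of_urls, fns):
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)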

Set output location for python script

I want to save all images from a site. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com, since in the image folder it just drops HTML files with no extension.
I found a python script; the usage is like this:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
Finally I got the script running after some BeautifulSoup problems.
However, I can't find the files anywhere. I also tried "/" as the output dir in the hope the images would land in the root of my HD, but no luck. Can someone either help me simplify the script so it outputs to the directory I cd into in the terminal, or give me a command that should work? I have zero python experience and I don't really want to learn python for a 2 year old script that maybe doesn't even work the way I want.
Also, how can I pass an array of websites? With a lot of scrapers I only get the first few results of the page. Tumblr loads on scroll, but that has no effect here, so I would like to add /page1 etc.
Thanks in advance
# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
#from BeautifulSoup import BeautifulSoup # for HTML parsing
import bs4
from bs4 import BeautifulSoup

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList: # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))

    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        # download only the proper image files
        if imgUrl.lower().endswith('.jpeg') or \
           imgUrl.lower().endswith('.jpg') or \
           imgUrl.lower().endswith('.gif') or \
           imgUrl.lower().endswith('.png') or \
           imgUrl.lower().endswith('.bmp'):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minFileSize:
                    print " " + imgUrl
                    fileName = basename(urlsplit(imgUrl)[2])
                    output = open(fileName, 'wb')
                    output.write(imgData)
                    output.close()
            except:
                pass
    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1, minFileSize)
                except:
                    pass

# main
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')

global website
website = netloc[-2] + netloc[-1]

downloadImages(rootUrl, 1, 50000)
As Frxstream has commented, this program creates the files in the current directory (i.e. where you run it). After running the program, run ls -l (or dir) to find the files it has created.
If it seemingly hasn't created any files, then most probably it really hasn't created any files, most likely because an exception was raised that your except: pass has hidden. To see what is going wrong, replace try: ... except: pass with just the body of the try block and rerun the program. (If you can't understand and fix the resulting error, ask a separate StackOverflow question.)
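If deleting the try/except entirely feels too drastic, a middle ground (sketched below with a placeholder URL) is to keep the handler but print the traceback instead of passing silently:

# Sketch: surface the errors that a bare `except: pass` would hide.
# The URL is a placeholder; in the script above it would be the imgUrl
# inside the download loop.
import traceback
import urllib2

imgUrl = 'http://example.com/some_image.png'  # placeholder
try:
    imgData = urllib2.urlopen(imgUrl).read()
    print 'downloaded %d bytes' % len(imgData)
except Exception:
    traceback.print_exc()  # shows exactly which download failed and why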
It's hard to tell without looking at the errors (+1 to turning off your try/except block so you can see the exceptions), but I do see one typo here:
fileName = basename(urlsplit(imgUrl)[2])
You didn't do "from urlparse import urlsplit"; you have "import urlparse", so you need to refer to it as urlparse.urlsplit(), as you do in other places. It should look like this:
fileName = basename(urlparse.urlsplit(imgUrl)[2])

Downloading Webcomic Saving Blank Files

I have a script for downloading the Questionable Content webcomic. It looks like it runs okay, but the files it downloads are essentially empty, only a few kB in size.
#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
RegExp = re.compile('.*<img src="http://www.questionablecontent.net/comics.*')

#Check the main QC page
site = urllib.urlopen("http://questionablecontent.net/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    contentLine = contentLine[4].split('.')
    EndNum = int(contentLine[0])
else:
    EndNum = 2622

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 2622

#Full path of destination folder needs to pre-exist
destinationFolder = "D:\Downloads\Comics\Questionable Content"

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):
    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i
    source = "http://www.questionablecontent.net/comics/"+str(i)+".png"
    #Save the image to the Destination Folder as a PNG (as most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
Why does it keep downloading empty files? Is there any workaround?
The problem here is that the server doesn't serve you the image if your user agent isn't set. Below is sample code for Python 2.7, which should give you an idea of how to make your script work.
import urllib2
import time

first = 1
last = 2622

for i in range(first, last+1):
    time.sleep(5) # Be nice to the server! And avoid being blocked.
    for ext in ['png', 'gif']:
        # Make sure that the img dir exists! If not, the script will throw an
        # IOError
        with open('img/{}.{}'.format(i, ext), 'wb') as ifile:
            try:
                req = urllib2.Request('http://www.questionablecontent.net/comics/{}.{}'.format(i, ext))
                req.add_header('user-agent', 'Mozilla/5.0')
                ifile.write(urllib2.urlopen(req).read())
                break
            except urllib2.HTTPError:
                continue
    else:
        print 'Could not find image {}'.format(i)
        continue
    print 'Downloaded image {}'.format(i)
You may want to change this loop into something that resembles your original loop (e.g. check whether the image has already been downloaded before requesting it); a sketch of that check follows below. This script will try to download all images from <start>.<ext> to <end>.<ext>, where <ext> is either gif or png.
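For instance, a rough sketch of folding the question's "skip what you already have" check into the top of that loop, using the same img directory and comic range assumed above:

import os

first = 1
last = 2622

for i in range(first, last + 1):
    # Skip comics that are already on disk as either .png or .gif
    if any(os.path.exists('img/{}.{}'.format(i, ext)) for ext in ('png', 'gif')):
        print 'Skipping comic {}...'.format(i)
        continue
    pass  # ...download logic from the snippet above goes here...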

Downloading .Jar File using Python

I am attempting to download a file from the internet using Python along with the sys and urllib2 modules. The general idea behind the program is for the user to input the version of the file they want to download, 1_4 for example. The program then adds the user input and "/whateverfile.jar" to the URL and downloads the file. My problem arises when the program inserts the "/whateverfile.jar": instead of appending it to the same line, the program puts "/whateverfile.jar" on a new line, which causes the .jar download to fail.
Can anyone help me with this? The code and output is below.
Code:
import sys
import urllib2

print('Type version of file you wish to download.')
print('To download 1.4 for instance type "1_4" using underscores in place of the periods.')

W = ('http://assets.file.net/')
X = sys.stdin.readline()
Y = ('/file.jar')
Z = X+Y
V = W+X
U = V+Y
T = U.lstrip()
print(T)

def JarDownload():
    url = "T"
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()
Output:
Type version of file you wish to download.
To download 1.4 for instance type "1_4" using underscores in place of the periods.
1_4
http://assets.file.net/1_4
/file.jar
I am currently not calling the JarDownload() function at all, until the URL displays as a single line when printed to the screen.
When you type the input and hit Return, the sys.stdin.readline() call returns the string including the trailing newline character. To get the desired effect, you should strip the newline from the input before using it. This should work:
X = sys.stdin.readline().rstrip()
As a side note, you should probably give more meaningful names to your variables. Names like X, Y, Z, etc. say nothing about the variables' content and make even simple operations, like your concatenations, unnecessarily hard to understand.

downloading large number of files using python

test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How can these files be downloaded using Python with maximum download speed?
My thinking was as follows:
import urllib.request

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    response = urllib.request.urlopen(line)
What comes after that? How do I select the download directory?
Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the files for writing in binary mode wb, since response.read() returns bytes, not str.
import os
import urllib.request

output_dir = 'path/to/you/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    response = urllib.request.urlopen(line)
    output_file = os.path.join(output_dir, line.split('/')[-1])
    with open(output_file, 'wb') as writer:
        writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5MB), since the default value of 16kb is really small and it will just slow things down.
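Put together, a streamed version of the loop might look like this (a sketch, under the same test.txt and placeholder output_dir assumptions as above):

import os
import shutil
import urllib.request

output_dir = 'path/to/you/output/dir'  # same placeholder as above

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    output_file = os.path.join(output_dir, line.split('/')[-1])
    with urllib.request.urlopen(line) as response, open(output_file, 'wb') as writer:
        # Copy in 5 MB chunks instead of loading the whole file into memory
        shutil.copyfileobj(response, writer, 5 * 1024 * 1024)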
This works fine for me: (note that name must be absolute, for example 'afaf1.tif')
import urllib, os

def download(baseUrl, fileName, layer=0):
    print 'Trying to download file:', fileName
    url = baseUrl + fileName
    name = os.path.join('foldertodwonload', fileName)
    try:
        #Note that folder needs to exist
        urllib.urlretrieve(url, name)
    except:
        # Upon failure to download retries total 5 times
        print 'Download failed'
        print 'Could not download file:', fileName
        if layer > 4:
            return
        else:
            layer += 1
            print 'retrying', str(layer)+'/5'
            download(baseUrl, fileName, layer)
    print fileName + ' downloaded'

for fileName in nameList:
    download(url, fileName)
Moved unnecessary code out from try block
