I have multiple URLs that return zip files. Most of the files I'm able to download using the urllib2 library as follows:
request = urllib2.urlopen(url)
zip_file = request.read()
The problem I'm having is that one of the files is 35 MB (zipped) and I can never finish downloading it with this library, even though I can download it normally with wget and in the browser.
I have tried downloading the file in chunks like this:
request = urllib2.urlopen(url)
buffers = []
while True:
    buffer = request.read(8192)
    if buffer:
        buffers.append(buffer)
    else:
        break
final_file = ''.join(buffers)
But this also does not finish the download. No error is raised, so it's hard to debug what is happening. Unfortunately, I can't post an example of the URL/file here.
Any suggestions or advice?
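Since no error is raised, one thing worth checking (a sketch of my own, not part of the original question) is whether the number of bytes actually received matches the server's Content-Length header; a mismatch would confirm the connection is being cut short:

import urllib2

request = urllib2.urlopen(url)
expected = int(request.info()["Content-Length"])  # assumes the server sends this header
data = request.read()
print len(data), "bytes received, Content-Length was", expected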
This is a copy/paste from my application, which downloads its own update installer. It reads the file in blocks and immediately writes each block to an output file on disk.
def DownloadThreadFunc(self):
    try:
        url = self.lines[1]
        data = None
        req = urllib2.Request(url, data, {})
        handle = urllib2.urlopen(req)
        self.size = int(handle.info()["Content-Length"])
        self.actualSize = 0
        name = path.join(DIR_UPDATES, url.split("/")[-1])
        blocksize = 64*1024
        fo = open(name, "wb")
        while not self.terminate:
            block = handle.read(blocksize)
            self.actualSize += len(block)
            if len(block) == 0:
                break
            fo.write(block)
        fo.close()
    except (urllib2.URLError, socket.timeout), e:
        try:
            fo.close()
        except:
            pass
        error("Download failed.", unicode(e))
I use self.size and self.actualSize to show the download progress in the GUI thread, and self.terminate to cancel the download from a GUI button if needed.
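For illustration, the percentage the GUI thread might compute from those two attributes looks something like this (a sketch, not the original application code):

# Sketch only: guard against a missing or zero Content-Length.
if self.size:
    percent = 100.0 * self.actualSize / self.size
else:
    percent = 0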
My purpose is to download the data from this website:
http://transoutage.spp.org/
When you open this website, at the bottom of the page there is a description of how to auto-download the data. For example:
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
The code I wrote is this:
import requests
ul_begin = 'http://transoutage.spp.org/report.aspx?download=true'
timeset = '3/1/2018' #define the time, m/d/yyyy
fn = ['&actualendgreaterthan='] + [timeset] + ['&includenulls=true']
fn = ''.join(fn)
ul = ul_begin+fn
r = requests.get(ul, verify=False)
If you enter the web address
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
into Chrome, it auto-downloads the data as a .csv file. I do not know how to continue my code.
Please help!
You need to write the response you receive to a file:
r = requests.get(ul, verify=False)
if 200 <= r.status_code < 300:
    # The request succeeded
    file_path = '<path_where_file_has_to_be_downloaded>'
    f = open(file_path, 'wb')  # binary mode, since r.content is bytes
    f.write(r.content)
    f.close()
This will work properly if the CSV file is small, but for large files you need to use the stream parameter to download: http://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
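For reference, a minimal sketch of the streaming variant described at that link, reusing the same ul and file_path names from above:

import requests

# stream=True avoids loading the whole response into memory; the 8192-byte
# chunk size is just an example value.
r = requests.get(ul, stream=True, verify=False)
with open(file_path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        if chunk:  # filter out keep-alive chunks
            f.write(chunk)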
I have a file to download (the download path is extracted from JSON, e.g. http://testsite/abc.zip).
I need all 5 threads to download the abc.zip file to the output directory, and the downloads have to be asynchronous/concurrent.
Currently, with the code below, it does download the file 5 times, but one by one (synchronously).
What I want is for the downloads to happen simultaneously.
def dldr(file=file_url, outputdir=out1):
    local_fn = str(uuid.uuid4())
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)
    s = datetime.now()
    urllib.urlretrieve(file, outputdir + os.sep + local_fn)
    e = datetime.now()
    time_diff = e - s
    logger(out1, local_fn, time_diff)

for i in range(1, 6):
    t = threading.Thread(target=dldr())
    t.start()
I have read the "Requests with multiple connections" post and it's helpful, but it doesn't address the requirement of the question asked.
I use the threading module for the download threads, and requests for the HTTP part, but you can change that to urllib yourself:
import threading
import requests

def download(link, filelocation):
    r = requests.get(link, stream=True)
    with open(filelocation, 'wb') as f:
        for chunk in r.iter_content(1024):
            if chunk:
                f.write(chunk)

def createNewDownloadThread(link, filelocation):
    download_thread = threading.Thread(target=download, args=(link, filelocation))
    download_thread.start()

for i in range(0, 5):
    file = "C:\\test" + str(i) + ".png"
    print file
    createNewDownloadThread("http://stackoverflow.com/users/flair/2374517.png", file)
Description of code
My script below works fine. It basically just finds all the data files that I'm interested in from a given website, checks to see if they are already on my computer (and skips them if they are), and lastly downloads them using cURL onto my computer.
The problem
The problem I'm having is that sometimes there are 400+ very large files and I can't download them all at once. When I press Ctrl-C it seems to cancel only the current cURL download, not the script, so I end up needing to cancel the downloads one by one. Is there a way around this? Maybe some key command that lets me stop at the end of the current download?
#!/usr/bin/python
import os
import urllib2
import re
import timeit

filenames = []
savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"

# connect to a URL
website = urllib2.urlopen("http://somewebsite")

# read html code
html = website.read()

# use re.findall to get all the data files
filenames = re.findall('SP.*?\.mat', html)

# The following chunk of code checks to see if the files are already
# downloaded and deletes them from the download queue if they are.
count = 0
countpass = 0
for files in os.listdir(savedir):
    if files.endswith(".mat"):
        try:
            filenames.remove(files)
            count += 1
        except ValueError:
            countpass += 1

print "counted number of removes", count
print "counted number of failed removes", countpass
print "number files less removed:", len(filenames)

# saves the file names into an array of html links
links = len(filenames)*[0]
for j in range(len(filenames)):
    links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/' + filenames[j]

for i in range(len(links)):
    os.system("curl -o " + filenames[i] + " " + str(links[i]))

print "links downloaded:", len(links)
You could always check the file size using curl before downloading it:
import subprocess, sys

def get_file_size(url):
    """
    Gets the file size of a URL using curl.

    @param url: The URL to obtain information about.
    @return: The file size, as an integer, in bytes.
    """
    # Get the file size in bytes
    p = subprocess.Popen(('curl', '-sI', url), stdout=subprocess.PIPE)
    for s in p.stdout.readlines():
        if 'Content-Length' in s:
            file_size = int(s.strip().split()[-1])
    return file_size
# Your configuration parameters
url = ...       # URL that you want to download
max_size = ...  # Max file size in bytes

# Now you can do a simple check to see if the file size is too big
if get_file_size(url) > max_size:
    sys.exit()

# Or you could do something more advanced
bytes = get_file_size(url)
if bytes > max_size:
    s = raw_input('File is {0} bytes. Do you wish to download? '
                  '(yes, no) '.format(bytes))
    if s.lower() == 'yes':
        pass  # Add download code here....
    else:
        sys.exit()
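Tying this back to the script in the question, here is a rough sketch (assuming the filenames and links lists and the os import from that script, plus a max_size value) of skipping oversized files before calling cURL:

# Sketch only: skip anything larger than max_size before invoking curl.
for name, link in zip(filenames, links):
    if get_file_size(link) > max_size:
        print "skipping", name, "(too large)"
        continue
    os.system("curl -o " + name + " " + str(link))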
I have been trying to download a video file with Python and play it in VLC at the same time.
I have tried a few ways. One of them is to download in a single thread, continuously fetching and appending data. This approach is slow, but the video plays. The code is something like the one below:
# Rough sketch of the single-threaded approach: `page` is a urllib2
# response for the video URL and `bytld` counts the bytes read so far.
self.fp = open(dest, "wb")
while not self.stop_down and _continue:
    try:
        size = 1024 * 8
        data = page.read(size)
        bytld += size
        self.fp.write(data)
    except Exception:
        break
This function takes longer to download, but I am able to play the video while it's loading.
However, I have been trying to download in multiple parts at the same time, with proper threading logic:
req = urllib2.Request(self.url)
req.headers['Range'] = 'bytes=%s-%s' % (self.startPos, self.end)
response = urllib2.urlopen(req)
content = response.read()

if os.path.exists(self.dest):
    out_fd = open(self.dest, "r+b")
else:
    out_fd = open(self.dest, "w+b")

out_fd.seek(self.startPos, 0)
out_fd.write(content)
out_fd.close()
With my threading I am making sure that each part of the file is saved sequentially.
But for some reason I can't play this file at all while it's downloading.
Is there anything I am not doing right? Should the "Range" header be set a different way?
Turns out the byte ranges in the Range header are inclusive on both ends, so each block's range has to start one byte after the previous block's end: if the first block is bytes=0-1023 (1024 bytes), the next one has to start at 1024, not 1023, otherwise the pieces overlap and the file can't be played.
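To make the off-by-one concrete, here is a small illustrative sketch of my own (assuming url and total_size are known, e.g. from a Content-Length header) that computes non-overlapping, inclusive byte ranges:

import urllib2

def ranged_read(url, start, end):
    # The Range header is inclusive on both ends: bytes=0-1023 is 1024 bytes.
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=%d-%d' % (start, end)
    return urllib2.urlopen(req).read()

chunk = 1024 * 1024
start = 0
parts = []
while start < total_size:
    end = min(start + chunk - 1, total_size - 1)
    parts.append(ranged_read(url, start, end))
    start = end + 1  # the next block begins one byte after the previous end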
I am currently setting up a website where the user uploads a file, I do some processing on it, and then I provide a link for the user to download the processed file. For now I want to serve the file from a path on my local system. I am new to web2py and am having trouble doing this.
Could someone please help me do this?
Regards
See this link for a hint: webpy: how to stream files, and maybe add some code like this:
import os
import web

BUF_SIZE = 262144

class download:
    def GET(self, file_name):  # file name captured from the URL pattern
        file_path = os.path.join('/path to your file', file_name)
        f = None
        try:
            f = open(file_path, "rb")
            web.header('Content-Type', 'application/octet-stream')
            web.header('Content-disposition', 'attachment; filename=%s' % file_name)
            while True:
                c = f.read(BUF_SIZE)
                if c:
                    yield c
                else:
                    break
        except Exception, e:
            # throw 403 or 500 or just leave it
            pass
        finally:
            if f:
                f.close()
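For completeness, here is a minimal sketch of how such a handler might be wired into a web.py application; the URL pattern and module layout are my own assumptions, not part of the original answer:

import web

# Hypothetical routing: /download/<file_name> maps to the download class
# above, and the captured group is passed to GET() as file_name.
urls = ('/download/(.*)', 'download')
app = web.application(urls, globals())

if __name__ == '__main__':
    app.run()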