how make python script for renewable downloads? - python

I've been searching (without results) a reanudable (i don't know if this is the correct word, sorry) way to download big files from internet with python, i know how do it directly with urllib2, but if something interrupt the connection, i need some way to reconnect and continue the download where it was if it's possible (like a download manager).

For other people who can help the answer, there's a HTTP protocol called Chunked Transfer Encoding that allow to do this specifying the 'Range' header of the request with the beginning and end bytes (separated by a dash), thus is possible just count how many bytes was downloaded previously and send it like the new beginning byte for continue the download. Example with requests module:
import requests
from os.path import getsize
#get size of previous downloaded chunk file
beg = getsize(PATH_TO_FILE)
#if we want we can get the size before download the file (without actually download it)
end = request.head(URL).headers['content-length']
#continue the download in the next byte from where it stopped
headers = {'Range': "bytes=%d-%s"%(beg+1,end)}
download = requests.get(URL, headers=headers)

Related

Resume a webpage .read() with urlib

Here is my code :
headers={'User-Agent': 'Mozilla/5.0'}
req=Request(url,headers=headers)
file=open(name,'wb')
file.write(urlopen(req).read())
file.close()
but when i get an exception and want to redownload the file i have to download from the begining ; and HTTPResponse don't have seek method . How can i resume my download ?
Thank you !
You can use the byte serving feature of HTTP 1.0. Here you add the range header to your request. For example adding
Range: bytes=9500-
will download the file from the 9500th Byte. So you would do something like this in the end:
See how much bytes you have already downloaded.
Start the download from the first missing byte and append the output to the file.
Notes:
The server need to support this technique (is not always the case).
You can use wget which has already a support for continuing the download (see parameter --continue)

Read specific bytes using urlopen()

I want to read specific bytes from a remote file using a python module. I am using urllib2. Specific bytes in the sense bytes in the form of Offset,Size. I know we can read X number of bytes from a remote file using urlopen(link).read(X). Is there any way so that I can read data which starts from Offset of length Size.?
def readSpecificBytes(link,Offset,size):
# code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2
def get_part_of_url(link, start_byte, end_byte):
req = urllib2.Request(link)
req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
resp = urllib2.urlopen(req)
content = resp.read()
Note that this approach means that the server never has to send and you never download the data you don't need/want, which could save tons of bandwidth if you only want a small amount of data from a large file.
When it doesn't work, just read the first set of bytes before the rest.
See Wikipedia Article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link,Offset,size):
f = urllib2.urlopen(link)
if Offset > 0:
f.read(Offset)
return f.read(size)

How to send a file via HTTP, the good way, using Python?

If a would-be-HTTP-server written in Python2.6 has local access to a file, what would be the most correct way for that server to return the file to a client, on request?
Let's say this is the current situation:
header('Content-Type', file.mimetype)
header('Content-Length', file.size) # file size in bytes
header('Content-MD5', file.hash) # an md5 hash of the entire file
return open(file.path).read()
All the files are .zip or .rar archives no bigger than a couple of megabytes.
With the current situation, browsers handle the incoming download weirdly. No browser knows the file's name, for example, so they use a random or default one. (Firefox even saved the file with a .part extension, even though it was complete and completely usable.)
What would be the best way to fix this and other errors I may not even be aware of, yet?
What headers am I not sending?
Thanks!
This is how I send ZIP file,
req.send_response(200)
req.send_header('Content-Type', 'application/zip')
req.send_header('Content-Disposition', 'attachment;'
'filename=%s' % filename)
Most browsers handle it correctly.
If you don't have to return the response body (that is, if you are given a stream for the response body by your framework) you can avoid holding the file in memory with something like this:
fp = file(path_to_the_file, 'rb')
while True:
bytes = fp.read(8192)
if bytes:
response.write(bytes)
else:
return
What web framework are you using?

Download file using urllib in Python with the wget -c feature

I am programming a software in Python to download HTTP PDF from a database.
Sometimes the download stop with this message :
retrieval incomplete: got only 3617232 out of 10689634 bytes
How can I ask the download to restart where it stops using the 206 Partial Content HTTP feature ?
I can do it using wget -c and it works pretty well, but I would like to implement it directly in my Python software.
Any idea ?
Thank you
You can request a partial download by sending a GET with the Range header:
import urllib2
req = urllib2.Request('http://www.python.org/')
#
# Here we request that bytes 18000--19000 be downloaded.
# The range is inclusive, and starts at 0.
#
req.headers['Range'] = 'bytes=%s-%s' % (18000, 19000)
f = urllib2.urlopen(req)
# This shows you the *actual* bytes that have been downloaded.
range=f.headers.get('Content-Range')
print(range)
# bytes 18000-18030/18031
print(repr(f.read()))
# ' </div>\n</body>\n</html>\n\n\n\n\n\n\n'
Be careful to check the Content-Range to learn what bytes have actually been downloaded, since your range may be out of bounds, and/or not all servers seem to respect the Range header.

Python: Downloading a large file to a local path and setting custom http headers

I am looking to download a file from a http url to a local file. The file is large enough that I want to download it and save it chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import os
import sys
import urllib2
def urlretrieve(urlfile, fpath):
chunk = 4096
f = open(fpath, "w")
while 1:
data = urlfile.read(chunk)
if not data:
print "done."
break
f.write(data)
print "Read %s bytes"%len(data)
and using request object to set headers
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (ie: addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib
class MyURLopener(urllib.URLopener):
pass # your override here, perhaps to __init__
urllib._urlopener = MyURLopener
However, you'll be pleased to hear wrt your comment to the question comments, reading an empty string from read() is indeed the signal to stop. This is how urlretrieve handles when to stop, for example. TCP/IP and sockets abstract the reading process, blocking waiting for additional data unless the connection on the other end is EOF and closed, in which case read()ing from connection returns an empty string. An empty string means there is no data trickling in... you don't have to worry about ordered packet re-assembly as that has all been handled for you. If that's your concern for urllib2, I think you can safely use it.

Categories

Resources