Download file using urllib in Python with the wget -c feature - python

I am programming a software in Python to download HTTP PDF from a database.
Sometimes the download stop with this message :
retrieval incomplete: got only 3617232 out of 10689634 bytes
How can I ask the download to restart where it stops using the 206 Partial Content HTTP feature ?
I can do it using wget -c and it works pretty well, but I would like to implement it directly in my Python software.
Any idea ?
Thank you

You can request a partial download by sending a GET with the Range header:
import urllib2
req = urllib2.Request('http://www.python.org/')
#
# Here we request that bytes 18000--19000 be downloaded.
# The range is inclusive, and starts at 0.
#
req.headers['Range'] = 'bytes=%s-%s' % (18000, 19000)
f = urllib2.urlopen(req)
# This shows you the *actual* bytes that have been downloaded.
range=f.headers.get('Content-Range')
print(range)
# bytes 18000-18030/18031
print(repr(f.read()))
# ' </div>\n</body>\n</html>\n\n\n\n\n\n\n'
Be careful to check the Content-Range to learn what bytes have actually been downloaded, since your range may be out of bounds, and/or not all servers seem to respect the Range header.

Related

How to Download only the first x bytes of data Python

Situation: The file to be downloaded is a large file (>100MB). It takes quite some time, especially with slow internet connection.
Problem: However, I just need the file header (the first 512 bytes), which will decide if the whole file needs to be downloaded or not.
Question: Is there a way to do download only the first 512 bytes of a file?
Additional information: Currently the download is done using urllib.urlretrieve in Python2.7
I think curl and head would work better than a Python solution here:
curl https://my.website.com/file.txt | head -c 512 > header.txt
EDIT: Also, if you absolutely must have it in a Python script, you can use subprocess to perform the curl piped to head command execution
EDIT 2: For a fully Python solution: The urlopen function (urllib2.urlopen in Python 2, and urllib.request.urlopen in Python 3) returns a file-like stream that you can use the read function on, which allows you to specify a number of bytes. For example, urllib2.urlopen(my_url).read(512) will return the first 512 bytes of my_url
If the url you are trying to read responds with Content-Length header, then you can get the file size with urllib2 in Python 2.
def get_file_size(url):
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
response = urllib2.urlopen(request)
length = response.headers.getheader("Content-Length")
return int(length)
The function can be called to get the length and compared with some threshold value to decide whether to download or not.
if get_file_size("http://stackoverflow.com") < 1000000:
# Download
(Note that the Python 3 implimentation differs slightly:)
from urllib import request
def get_file_size(url):
r = request.Request(url)
r.get_method = lambda : 'HEAD'
response = request.urlopen(r)
length = response.getheader("Content-Length")
return int(length)

how make python script for renewable downloads?

I've been searching (without results) a reanudable (i don't know if this is the correct word, sorry) way to download big files from internet with python, i know how do it directly with urllib2, but if something interrupt the connection, i need some way to reconnect and continue the download where it was if it's possible (like a download manager).
For other people who can help the answer, there's a HTTP protocol called Chunked Transfer Encoding that allow to do this specifying the 'Range' header of the request with the beginning and end bytes (separated by a dash), thus is possible just count how many bytes was downloaded previously and send it like the new beginning byte for continue the download. Example with requests module:
import requests
from os.path import getsize
#get size of previous downloaded chunk file
beg = getsize(PATH_TO_FILE)
#if we want we can get the size before download the file (without actually download it)
end = request.head(URL).headers['content-length']
#continue the download in the next byte from where it stopped
headers = {'Range': "bytes=%d-%s"%(beg+1,end)}
download = requests.get(URL, headers=headers)

Resume a webpage .read() with urlib

Here is my code :
headers={'User-Agent': 'Mozilla/5.0'}
req=Request(url,headers=headers)
file=open(name,'wb')
file.write(urlopen(req).read())
file.close()
but when i get an exception and want to redownload the file i have to download from the begining ; and HTTPResponse don't have seek method . How can i resume my download ?
Thank you !
You can use the byte serving feature of HTTP 1.0. Here you add the range header to your request. For example adding
Range: bytes=9500-
will download the file from the 9500th Byte. So you would do something like this in the end:
See how much bytes you have already downloaded.
Start the download from the first missing byte and append the output to the file.
Notes:
The server need to support this technique (is not always the case).
You can use wget which has already a support for continuing the download (see parameter --continue)

Read specific bytes using urlopen()

I want to read specific bytes from a remote file using a python module. I am using urllib2. Specific bytes in the sense bytes in the form of Offset,Size. I know we can read X number of bytes from a remote file using urlopen(link).read(X). Is there any way so that I can read data which starts from Offset of length Size.?
def readSpecificBytes(link,Offset,size):
# code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2
def get_part_of_url(link, start_byte, end_byte):
req = urllib2.Request(link)
req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
resp = urllib2.urlopen(req)
content = resp.read()
Note that this approach means that the server never has to send and you never download the data you don't need/want, which could save tons of bandwidth if you only want a small amount of data from a large file.
When it doesn't work, just read the first set of bytes before the rest.
See Wikipedia Article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link,Offset,size):
f = urllib2.urlopen(link)
if Offset > 0:
f.read(Offset)
return f.read(size)

Python: Downloading a large file to a local path and setting custom http headers

I am looking to download a file from a http url to a local file. The file is large enough that I want to download it and save it chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import os
import sys
import urllib2
def urlretrieve(urlfile, fpath):
chunk = 4096
f = open(fpath, "w")
while 1:
data = urlfile.read(chunk)
if not data:
print "done."
break
f.write(data)
print "Read %s bytes"%len(data)
and using request object to set headers
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (ie: addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib
class MyURLopener(urllib.URLopener):
pass # your override here, perhaps to __init__
urllib._urlopener = MyURLopener
However, you'll be pleased to hear wrt your comment to the question comments, reading an empty string from read() is indeed the signal to stop. This is how urlretrieve handles when to stop, for example. TCP/IP and sockets abstract the reading process, blocking waiting for additional data unless the connection on the other end is EOF and closed, in which case read()ing from connection returns an empty string. An empty string means there is no data trickling in... you don't have to worry about ordered packet re-assembly as that has all been handled for you. If that's your concern for urllib2, I think you can safely use it.

Categories

Resources