urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:
urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
just returns silently even if abc.jpg doesn't exist on the google.com server. The generated abc.jpg is not a valid JPEG file; it's actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded or not, but I can't find any documentation for httplib.HTTPMessage.
Can anybody provide some information about this problem?
Consider using urllib2 if that is possible in your case. It is more advanced and easier to use than urllib.
You can detect any HTTP errors easily:
>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found
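For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same check might look roughly like the sketch below. The helper name `status_of` and its injectable `opener` parameter are assumptions made for illustration (they let the function be tried without a network connection); they are not part of the original answer.

```python
import urllib.error
import urllib.request


def status_of(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including error statuses.

    `opener` is a hypothetical hook so the function can be tried offline;
    by default it is the real urllib.request.urlopen.
    """
    try:
        resp = opener(url)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return err.code
    try:
        return resp.getcode()
    finally:
        resp.close()
```

A caller can then treat anything other than 200 as a failed download instead of saving the error page to disk.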
resp is actually an HTTPResponse object that you can do a lot of useful things with:
>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
I keep it simple:
# Simple downloading with progress indicator, by Cees Timmerman, 16mar12.
import urllib2

remote = r"http://some.big.file"
local = r"c:\downloads\bigfile.dat"

u = urllib2.urlopen(remote)
h = u.info()
totalSize = int(h["Content-Length"])

print "Downloading %s bytes..." % totalSize,
fp = open(local, 'wb')

blockSize = 8192 #100000 # urllib.urlretrieve uses 8192
count = 0
while True:
    chunk = u.read(blockSize)
    if not chunk: break
    fp.write(chunk)
    count += 1
    if totalSize > 0:
        percent = int(count * blockSize * 100 / totalSize)
        if percent > 100: percent = 100
        print "%2d%%" % percent,
        if percent < 100:
            print "\b\b\b\b\b", # Erase "NN% "
        else:
            print "Done."

fp.flush()
fp.close()
if not totalSize:
    print
According to the documentation, it is undocumented.
To get access to the message, it looks like you do something like:
a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')
b is the message instance.
Since learning Python, I have found it always useful to use its introspection abilities. When I type
dir(b)
I see lots of methods and functions to play with. Then I started doing things with b, for example:
b.items()
This lists lots of interesting things. I suspect that playing around with them will let you find the attribute you want to work with.
Sorry this is such a beginner's answer, but I am trying to master the introspection abilities to improve my learning, and your question just popped up.
Well, I tried something interesting related to this. I was wondering if I could automatically get the output from each of the things listed by dir(b) that did not need parameters, so I wrote:
needparam=[]
for each in dir(b):
    x='b.'+each+'()'
    try:
        eval(x)
        print x
    except:
        needparam.append(x)
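The same experiment can be done without building strings for eval, by looking attributes up with getattr. This is only a sketch: the function name `call_zero_arg_methods` is made up here, and calling every method blindly can have side effects on the object, so it is best tried on throwaway objects.

```python
def call_zero_arg_methods(obj):
    """Call every public, callable attribute of obj with no arguments.

    Returns (results, needparam): results maps method names to return
    values; needparam lists names whose call raised an exception.
    """
    results, needparam = {}, []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        attr = getattr(obj, name)
        if not callable(attr):
            continue  # plain data attributes are not called
        try:
            results[name] = attr()
        except Exception:
            needparam.append(name)
    return results, needparam
```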
You can create a new URLopener (inherit from FancyURLopener) and throw exceptions or handle errors any way you want. Unfortunately, FancyURLopener ignores 404 and other errors. See this question:
How to catch 404 error in urllib.urlretrieve
I ended up with my own retrieve implementation built on pycurl, which supports more protocols than urllib/urllib2. I hope it can help other people.
import tempfile
import pycurl
import os

def get_filename_parts_from_url(url):
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]
    return t

def retrieve(url, filename=None):
    if not filename:
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix = '.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    try:
        c.perform()
    except:
        filename = None
    finally:
        c.close()
        f.close()
    return filename
import urllib

class MyURLopener(urllib.FancyURLopener):
    # Restore URLopener's default handler, which raises IOError on HTTP errors.
    http_error_default = urllib.URLopener.http_error_default

url = "http://page404.com"
filename = "download.txt"
def reporthook(blockcount, blocksize, totalsize):
    pass
...
try:
    (f,headers)=MyURLopener().retrieve(url, filename, reporthook)
except Exception, e:
    print e
:) My first post on StackOverflow, have been a lurker for years. :)
Sadly dir(urllib.urlretrieve) is deficient in useful information.
So from this thread thus far I tried writing this:
a,b = urllib.urlretrieve(imgURL, saveTo)
print "A:", a
print "B:", b
which produced this:
A: /home/myuser/targetfile.gif
B: Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Cache-Control: max-age=604800
Content-Type: image/gif
Date: Mon, 07 Mar 2016 23:37:34 GMT
Etag: "4e1a5d9cc0857184df682518b9b0da33"
Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
Server: ECS (hnd/057A)
Timing-Allow-Origin: *
X-Cache: HIT
Content-Length: 27027
Connection: close
I guess one can check:
if int(b.getheader("Content-Length", 0)) > 0:
My next step is to test a scenario where the retrieve fails...
Results against another server/website - what comes back in "B" is a bit random, but one can test for certain values:
A: get_good.jpg
B: Date: Tue, 08 Mar 2016 00:44:19 GMT
Server: Apache
Last-Modified: Sat, 02 Jan 2016 09:17:21 GMT
ETag: "524cf9-18afe-528565aef9ef0"
Accept-Ranges: bytes
Content-Length: 101118
Connection: close
Content-Type: image/jpeg
A: get_bad.jpg
B: Date: Tue, 08 Mar 2016 00:44:20 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html
In the 'bad' case (non-existing image file) "B" retrieved a small chunk of (Googlebot?) HTML code and saved it as the target, hence Content-Length of 1363 bytes.
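Since the 'bad' response above differs from the 'good' one mainly in its Content-Type, one possible check is to look at that header rather than at Content-Length. The sketch below assumes the headers object offers a dict-style `.get()` (the message returned by urlretrieve does, and so does a plain dict); the function name `looks_like_image` is illustrative. It is a heuristic only, since a server can mislabel its responses.

```python
def looks_like_image(headers):
    """Guess from response headers whether an image was actually returned.

    `headers` is any mapping-like object with a .get() method, such as the
    message object returned by urlretrieve or a plain dict.
    """
    content_type = (headers.get("Content-Type") or "").lower()
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip()
    return media_type.startswith("image/")
```

With the two header sets shown above, get_good.jpg (image/jpeg) passes and get_bad.jpg (text/html) does not.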
I have the following code;
import re
import urllib2

def ip_addresses():
    # Get external ipv4
    try:
        response = urllib2.urlopen('http://icanhazip.com', timeout = 2)
        out = response.read()
        public_ipv4 = re.sub('\n', '', out)
    except:
        public_ipv4 = "failed to retrieve public_ipv4"
Under normal circumstances, when a response from http://icanhazip.com is received, the output is something like this:
xxx#xxx:/var/log$ date && tail -1 xxx.log
Tue Jul 25 **07:43**:18 UTC 2017 {"public_ipv4": "208.185.193.131"}, "date": "2017-07-25 **07:43**:01.558242"
So, the current date and the date of the log generation are the same.
However, when there is an exception, this happens:
xxx#xxx:/var/log$ date && tail -1 xxx.log
Tue Jul 25 **07:30**:25 UTC 2017 {"public_ipv4": "failed to retrieve public_ipv4"},"date": "2017-07-25 **07:23**:01.525444"
Why is the "timeout" not working?
Try to get the verbose exception details in this manner, and then investigate what the error is all about and where the difference in time comes from.
Use this format:
import sys
try:
    1 / 0
except:
    print sys.exc_info()
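Applied to the code in the question, the idea is to record what actually went wrong instead of replacing every failure with a fixed string. The following is a minimal Python 3 sketch; `fetch_or_explain` and its `fetch` argument are hypothetical names, with `fetch` standing in for the urlopen(...).read() call.

```python
import traceback


def fetch_or_explain(fetch):
    """Run fetch() and return (result, None), or (None, details) on failure.

    `fetch` is a hypothetical zero-argument callable standing in for the
    urlopen(...).read() call in the question.
    """
    try:
        return fetch(), None
    except Exception as exc:
        # Keep the exception type, its message and the full traceback
        # instead of replacing them all with a fixed string.
        details = "%s: %s\n%s" % (type(exc).__name__, exc, traceback.format_exc())
        return None, details
```

Writing `details` to the log should show whether the failure is a timeout at all, or something else entirely.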
I've been using the script below to download technical videos for later analysis. The script has worked well for me and retrieves the highest resolution version available for the videos that I have needed.
Now I've come across a 4K YouTube video, and my script only saves an mp4 with 1280x720.
I'd like to know if there is a way to adjust my current script to download higher resolution versions of this video. I understand there are Python packages that might address this, but right now I would like to stick to this step-by-step method if possible.
[Screenshot: file info from QuickTime and OS X]
"""
length: 175 seconds
quality: hd720
type: video/mp4; codecs="avc1.64001F, mp4a.40.2"
Last-Modified: Sun, 21 Aug 2016 10:41:48 GMT
Content-Type: video/mp4
Date: Sat, 01 Apr 2017 16:50:16 GMT
Expires: Sat, 01 Apr 2017 16:50:16 GMT
Cache-Control: private, max-age=21294
Accept-Ranges: bytes
Content-Length: 35933033
Connection: close
Alt-Svc: quic=":443"; ma=2592000
X-Content-Type-Options: nosniff
Server: gvs 1.
"""
import urlparse, urllib2

vid = "vzS1Vkpsi5k"
save_title = "YouTube SpaceX - Booster Number 4 - Thaicom 8 06-06-2016"
url_init = "https://www.youtube.com/get_video_info?video_id=" + vid
resp = urllib2.urlopen(url_init, timeout=10)
data = resp.read()
info = urlparse.parse_qs(data)
title = info['title']
print "length: ", info['length_seconds'][0] + " seconds"
stream_map = info['url_encoded_fmt_stream_map'][0]
vid_info = stream_map.split(",")
mp4_filename = save_title + ".mp4"
for video in vid_info:
    item = urlparse.parse_qs(video)
    print 'quality: ', item['quality'][0]
    print 'type: ', item['type'][0]
    url_download = item['url'][0]
    resp = urllib2.urlopen(url_download)
    print resp.headers
    length = int(resp.headers['Content-Length'])
    # Open in binary mode so the video bytes are written unchanged.
    my_file = open(mp4_filename, "wb")
    done, i = 0, 0
    buff = resp.read(1024)
    while buff:
        my_file.write(buff)
        done += 1024
        percent = done * 100.0 / length
        buff = resp.read(1024)
        if not i%1000:
            percent = done * 100.0 / length
            print str(percent) + "%"
        i += 1
    my_file.close()
    # Only the first entry in the stream map is downloaded.
    break
Ok, so I have not taken the time to get to the bottom of this. However, I did find that when you do:
stream_map = info['url_encoded_fmt_stream_map'][0]
Somehow you only get a selection of a single 720p option, one 'medium' and two 'small'.
However, if you change that line into:
stream_map = info['adaptive_fmts'][0]
you will get all the available versions, including the 2160p (4K) one.
PS: You'd have to comment out the print quality and print type commands, since those labels aren't always available in the new output. After commenting them out and adapting your script as explained above, I was able to successfully download the 4K version.
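To choose one entry from that longer list rather than simply taking the first, something along these lines could work. It is a sketch under loud assumptions that the thread does not confirm: that each comma-separated entry is a query string, and that video entries carry a `size` field of the form WIDTHxHEIGHT. The function name `pick_largest` is made up, and the code is written for Python 3 (`urllib.parse` instead of `urlparse`).

```python
from urllib.parse import parse_qs


def pick_largest(stream_map):
    """Return the parsed entry with the largest frame area, or None.

    Assumes (not confirmed by the source) that entries are comma-separated
    query strings and that video entries carry size=WIDTHxHEIGHT.
    """
    best, best_area = None, -1
    for entry in stream_map.split(","):
        item = parse_qs(entry)
        size = item.get("size", [""])[0]
        width, _, height = size.partition("x")
        if not (width.isdigit() and height.isdigit()):
            continue  # e.g. entries without a usable size field
        area = int(width) * int(height)
        if area > best_area:
            best, best_area = item, area
    return best
```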
Indeed, info['adaptive_fmts'][0] returns the information for the whole video, but the URL is not directly usable, and neither is the progress bar.
Thanks for reading.
Background:
I am trying to read a streaming API feed that returns data in JSON format and then store this data in a pymongo collection. The streaming API requires an "Accept-Encoding": "gzip" header.
What's happening:
Code fails on json.loads and outputs - Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597) (Refer Error Log below)
This does NOT happen while parsing every JSON object - it happens at random.
My guess is I am encountering some weird JSON object after every "x" proper JSON objects.
I did consult "how to use pycurl if requested data is sometimes gzipped, sometimes not?" and "Encoding error while deserializing a json object from Google", but so far I have been unsuccessful at resolving this error.
Could someone please help me out here?
Error Log:
Note: The raw dump of the JSON object below is basically using the repr() method that prints the raw representation of the string without resolving CRLF/LF(s).
'{"id":"tag:search.twitter.com,2005:207958320747782146","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:493653150","link":"http://www.twitter.com/Deathnews_7_24","displayName":"Death News 7/24","postedTime":"2012-02-16T01:30:12.000Z","image":"http://a0.twimg.com/profile_images/1834408513/deathnewstwittersquare_normal.jpg","summary":"Crashes, Murders, Suicides, Accidents, Crime and Naturals Death News From All Around World","links":[{"href":"http://www.facebook.com/DeathNews724","rel":"me"}],"friendsCount":56,"followersCount":14,"listedCount":1,"statusesCount":1029,"twitterTimeZone":null,"utcOffset":null,"preferredUsername":"Deathnews_7_24","languages":["tr"]},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"web","link":"http://twitter.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","body":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958320747782146","summary":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nytimes.com/2012/05/30/boo\xe2\x80\xa6","indices":[52,72],"expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html","url":"http://t.co/WBsNlNtA"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: 
nytimes.com","tag":null}],"klout_score":11,"urls":[{"url":"http://t.co/WBsNlNtA","expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html?_r=1"}]}}\r\n{"id":"tag:search.twitter.com,2005:207958321003638785","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:178760897","link":"http://www.twitter.com/Mobanu","displayName":"Donald Ochs","postedTime":"2010-08-15T16:33:56.000Z","image":"http://a0.twimg.com/profile_images/1493224811/small_mobany_Logo_normal.jpg","summary":"","links":[{"href":"http://www.mobanuweightloss.com","rel":"me"}],"friendsCount":10272,"followersCount":9698,"listedCount":30,"statusesCount":725,"twitterTimeZone":"Mountain Time (US & Canada)","utcOffset":"-25200","preferredUsername":"Mobanu","languages":["en"],"location":{"objectType":"place","displayName":"Crested Butte, Colorado"}},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"twitterfeed","link":"http://twitterfeed.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Mobanu/statuses/207958321003638785","body":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958321003638785","summary":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... 
http://t.co/mTsQlNQO","link":"http://twitter.com/Mobanu/statuses/207958321003638785","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nyti.ms/KUmmMa","indices":[116,136],"expanded_url":"http://nyti.ms/KUmmMa","url":"http://t.co/mTsQlNQO"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":12,"urls":[{"url":"http://t.co/mTsQlNQO","expanded_url":"http://well.blogs.nytimes.com/2012/05/30/can-exercise-be-bad-for-you/?utm_medium=twitter&utm_source=twitterfeed"}]}}\r\n'
json exception: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597)
Header Output:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Vary: Accept-Encoding
Date: Wed, 30 May 2012 22:14:48 UTC
Connection: close
Transfer-Encoding: chunked
Content-Encoding: gzip
get_stream.py:
#!/usr/bin/env python
import sys
import pycurl
import json
import pymongo

STREAM_URL = "https://stream.test.com:443/accounts/publishers/twitter/streams/track/Dev.json"
AUTH = "userid:passwd"
DB_HOST = "127.0.0.1"
DB_NAME = "stream_test"

class StreamReader:
    def __init__(self):
        try:
            self.count = 0
            self.buff = ""
            self.mongo = pymongo.Connection(DB_HOST)
            self.db = self.mongo[DB_NAME]
            self.raw_tweets = self.db["raw_tweets_gnip"]
            self.conn = pycurl.Curl()
            self.conn.setopt(pycurl.ENCODING, 'gzip')
            self.conn.setopt(pycurl.URL, STREAM_URL)
            self.conn.setopt(pycurl.USERPWD, AUTH)
            self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
            self.conn.setopt(pycurl.HEADERFUNCTION, self.header_rcvd)
            while True:
                self.conn.perform()
        except Exception as ex:
            print "error ocurred : %s" % str(ex)

    def header_rcvd(self, header_data):
        print header_data

    def on_receive(self, data):
        temp_data = data
        self.buff += data
        if data.endswith("\r\n") and self.buff.strip():
            try:
                tweet = json.loads(self.buff, encoding = 'UTF-8')
                self.buff = ""
                if tweet:
                    try:
                        self.raw_tweets.insert(tweet)
                    except Exception as insert_ex:
                        print "Error inserting tweet: %s" % str(insert_ex)
                    self.count += 1
                    if self.count % 10 == 0:
                        print "inserted "+str(self.count)+" tweets"
            except Exception as json_ex:
                print "json exception: %s" % str(json_ex)
                print repr(temp_data)

stream = StreamReader()
Fixed Code:
def on_receive(self, data):
    self.buff += data
    if data.endswith("\r\n") and self.buff.strip():
        # NEW: Split the buff at \r\n to get a list of JSON objects and iterate over them
        json_obj = self.buff.split("\r\n")
        for obj in json_obj:
            if len(obj.strip()) > 0:
                try:
                    tweet = json.loads(obj, encoding = 'UTF-8')
                except Exception as json_ex:
                    print "JSON Exception occurred: %s" % str(json_ex)
                    continue
Try pasting your dumped string into jsbeautifier.
You'll see that it's actually two JSON objects, not one, which json.loads can't deal with.
They are separated by \r\n, so it should be easy to split them.
The problem is that the data argument passed to on_receive doesn't necessarily end with \r\n whenever it contains a newline. As the dump shows, the \r\n can also be somewhere in the middle of the string, so only looking at the end of the data chunk won't be enough.
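One way to handle a separator that can land anywhere in a chunk is to split the accumulated buffer on every \r\n and carry the unfinished tail over to the next call. The sketch below is written for Python 3 and assumes the chunks have already been decoded to str; the class name `LineBuffer` and its `feed` method are illustrative, not part of pycurl.

```python
import json


class LineBuffer:
    """Accumulate chunks and return complete CRLF-terminated JSON records."""

    def __init__(self):
        self.buff = ""

    def feed(self, data):
        """Add a chunk; return the JSON objects it completed."""
        self.buff += data
        # Everything before the last "\r\n" is complete; the rest is a
        # partial record that must wait for the next chunk.
        *complete, self.buff = self.buff.split("\r\n")
        return [json.loads(line) for line in complete if line.strip()]
```

A write callback could then call `feed(data)` and insert each returned object, whether a chunk holds half a record or several.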
Using urllib (or urllib2) to get what I want seems hopeless.
Any solution?
I'm not sure how the C# implementation works, but, as internet streams are generally not seekable, my guess would be it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggested and write the data to a file or StringIO and seek from there.
However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.
The range header
When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
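As a small illustration of the two forms just mentioned, the header value can be built as follows. The helper name `range_header` is made up for this sketch; the `bytes=first-last` syntax with inclusive byte positions is the one defined by the RFC.

```python
def range_header(start, length=None):
    """Build a Range header value for `length` bytes beginning at `start`.

    With no length, the range runs from `start` to the end of the resource.
    """
    if start < 0 or (length is not None and length <= 0):
        raise ValueError("start must be >= 0 and length must be positive")
    if length is None:
        return "bytes=%d-" % start
    # Byte positions are inclusive, so the last byte is start + length - 1.
    return "bytes=%d-%d" % (start, start + length - 1)
```

For example, `range_header(10000)` gives `bytes=10000-`, and `range_header(1000, 501)` gives `bytes=1000-1500`.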
Server support
There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC2616) along with a response to report if they support ranges or not. This could be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
Checking if a range is returned
If a server returns a range, it must send the Content-Range header (section 14.16 of RFC2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
Implementation with urllib2
urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.
import sys
import urllib2

# Check command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    start, length = map(int, sys.argv[2:])
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("range", "bytes=%s-" % sys.argv[2])

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this.
    range, total = response.headers['content-range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % range
    else:
        print "Bytes %s of a total of %s were retrieved." % (range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)
Using this, I can retrieve the final 2,000 bytes of the Python homepage:
blair#blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes
Or 400 bytes from the middle of the homepage:
blair#blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes
However, the Google homepage does not support ranges:
blair#blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes
In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
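That fallback could be folded into a small helper, sketched here under some assumptions: `headers` supports the `in` test with a lower-case name (as response.headers does in the script above), and a server that sends Content-Range returned exactly the range that was asked for. The function name `extract_range` is illustrative.

```python
def extract_range(data, headers, start, length=None):
    """Return the requested bytes whether or not the server honoured Range.

    `headers` is a mapping of response headers that supports `in` with a
    lower-case name; `data` is the body that was actually received.
    """
    if "content-range" in headers:
        return data  # the server already sent only the requested part
    # The whole page came back, so cut the wanted section out locally.
    end = None if length is None else start + length
    return data[start:end]
```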
It may work best just to write the data to a file (or even to a string, using StringIO), and to seek in that file (or string).
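In Python 3 the in-memory variant of this would use io.BytesIO rather than StringIO. A minimal sketch, where `seekable_copy` is a made-up name and `stream` is anything with a `read(n)` method, such as the object returned by urlopen:

```python
import io


def seekable_copy(stream, chunk_size=8192):
    """Read a forward-only stream fully into memory and return a seekable copy."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so reads start at the beginning
    return buffer
```

The returned object supports seek() and tell() in both directions, at the cost of holding the whole download in memory.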
I did not find any existing implementations of a file-like interface with seek() to HTTP URLs, so I rolled my own simple version: https://github.com/valgur/pyhttpio. It depends on urllib.request but could probably easily be modified to use requests, if necessary.
The full code:
import cgi
import time
import urllib.request
from io import IOBase
from sys import stderr


class SeekableHTTPFile(IOBase):
    def __init__(self, url, name=None, repeat_time=-1, debug=False):
        """Allow a file accessible via HTTP to be used like a local file by utilities
        that use `seek()` to read arbitrary parts of the file, such as `ZipFile`.
        Seeking is done via the 'range: bytes=xx-yy' HTTP header.

        Parameters
        ----------
        url : str
            A HTTP or HTTPS URL
        name : str, optional
            The filename of the file.
            Will be filled from the Content-Disposition header if not provided.
        repeat_time : int, optional
            In case of HTTP errors wait `repeat_time` seconds before trying again.
            Negative value or `None` disables retrying and simply passes on the exception (the default).
        """
        super().__init__()
        self.url = url
        self.name = name
        self.repeat_time = repeat_time
        self.debug = debug
        self._pos = 0
        self._seekable = True
        with self._urlopen() as f:
            if self.debug:
                print(f.getheaders())
            self.content_length = int(f.getheader("Content-Length", -1))
            if self.content_length < 0:
                self._seekable = False
            if f.getheader("Accept-Ranges", "none").lower() != "bytes":
                self._seekable = False
            if name is None:
                header = f.getheader("Content-Disposition")
                if header:
                    value, params = cgi.parse_header(header)
                    self.name = params["filename"]

    def seek(self, offset, whence=0):
        if not self.seekable():
            raise OSError
        if whence == 0:
            self._pos = 0
        elif whence == 1:
            pass
        elif whence == 2:
            self._pos = self.content_length
        self._pos += offset
        return self._pos

    def seekable(self, *args, **kwargs):
        return self._seekable

    def readable(self, *args, **kwargs):
        return not self.closed

    def writable(self, *args, **kwargs):
        return False

    def read(self, amt=-1):
        if self._pos >= self.content_length:
            return b""
        if amt < 0:
            end = self.content_length - 1
        else:
            end = min(self._pos + amt - 1, self.content_length - 1)
        byte_range = (self._pos, end)
        self._pos = end + 1
        with self._urlopen(byte_range) as f:
            return f.read()

    def readall(self):
        return self.read(-1)

    def tell(self):
        return self._pos

    def __getattribute__(self, item):
        attr = object.__getattribute__(self, item)
        if not object.__getattribute__(self, "debug"):
            return attr

        if hasattr(attr, '__call__'):
            def trace(*args, **kwargs):
                a = ", ".join(map(str, args))
                if kwargs:
                    a += ", ".join(["{}={}".format(k, v) for k, v in kwargs.items()])
                print("Calling: {}({})".format(item, a))
                return attr(*args, **kwargs)

            return trace
        else:
            return attr

    def _urlopen(self, byte_range=None):
        header = {}
        if byte_range:
            header = {"range": "bytes={}-{}".format(*byte_range)}
        while True:
            try:
                r = urllib.request.Request(self.url, headers=header)
                return urllib.request.urlopen(r)
            except urllib.error.HTTPError as e:
                if self.repeat_time is None or self.repeat_time < 0:
                    raise
                print("Server responded with " + str(e), file=stderr)
                print("Sleeping for {} seconds before trying again".format(self.repeat_time), file=stderr)
                time.sleep(self.repeat_time)
A potential usage example:
from zipfile import ZipFile

url = "https://www.python.org/ftp/python/3.5.0/python-3.5.0-embed-amd64.zip"
f = SeekableHTTPFile(url, debug=True)
zf = ZipFile(f)
zf.printdir()
zf.extract("python.exe")
Edit: There is actually a mostly identical, if slightly more minimal, implementation in this answer: https://stackoverflow.com/a/7852229/2997179