Resume a webpage .read() with urllib - Python

Here is my code:
from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=headers)
file = open(name, 'wb')
file.write(urlopen(req).read())
file.close()
But when I get an exception and want to redownload the file, I have to download it again from the beginning, and HTTPResponse doesn't have a seek method. How can I resume my download?
Thank you!

You can use HTTP byte serving (range requests, part of HTTP/1.1). You add a Range header to your request; for example, adding
Range: bytes=9500-
will download the file starting at byte offset 9500. So in the end you would do something like this (see the sketch after the notes):
Check how many bytes you have already downloaded.
Start the download from the first missing byte and append the output to the file.
Notes:
The server needs to support this technique (which is not always the case).
Alternatively, you can use wget, which already has support for continuing a download (see the --continue parameter).
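
A minimal sketch of that approach, assuming Python 3's urllib.request, the same url and name variables as in the question, and a server that honours Range:
import os
from urllib.request import Request, urlopen

# How many bytes are already on disk from the interrupted attempt?
downloaded = os.path.getsize(name) if os.path.exists(name) else 0
headers = {'User-Agent': 'Mozilla/5.0'}
if downloaded:
    # Open-ended range: everything from the first missing byte onwards.
    headers['Range'] = 'bytes=%d-' % downloaded

req = Request(url, headers=headers)
with urlopen(req) as resp:
    # 206 means the server honoured the Range header; a plain 200 means it
    # ignored it and is resending the whole file, so start over.
    mode = 'ab' if downloaded and resp.status == 206 else 'wb'
    with open(name, mode) as f:
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            f.write(chunk)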

Related

Python ValueError: unknown url type: space (?)

I am using the urllib2 module in Python 2.7 with Spyder 3.0 to batch-download text files by reading a text file that contains a list of their URLs:
import sys
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')

with open('ocean_not_templated_url.txt', 'r') as text:
    lines = text.readlines()
    for line in lines:
        url = urllib2.urlopen(line.strip('ï \xa0\t\n\r\v'))
        with open(line.strip('\n\r\t ').replace('/', '!').replace(':', '~'), 'wb') as out:
            for d in url:
                out.write(d)
I've already discovered a bunch of weird characters in the URLs, which I've since stripped; however, the script fails when nearly 90% complete with the ValueError: unknown url type error from the title.
I thought the culprit was a non-breaking space (the \xa0 stripped in the code), but it still fails. Any ideas?
That's an odd URL!
Specify the communication protocol. Try prefixing the URL with http:// and the domain name if the file lives somewhere on the web.
Files always reside somewhere, in some server's directory, or locally on your system. So there must be a network path to such files, for example:
http://127.0.0.1/folder1/samuel/file1.txt
The same example, with localhost as the usual alias for 127.0.0.1:
http://localhost/folder1/samuel/file1.txt
That might solve the problem. Just think about where your file exists and how it should be addressed...
Update:
I experimented quite a bit on this. I think I know why that error is raised! :D
I suspect that the file storing your URLs actually has a sneaky empty line near the end. I can say it's near the end because you said the script gets through about 90% of it and then fails. The urllib2 function get_type is then unable to process that empty URL and throws the unknown url type error.
I think that's the problem! Remove that empty line in the file ocean_not_templated_url.txt and try it out!
Just check and let me know! :P
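
If that blank line is indeed the culprit, a small guard in the loop avoids the crash without editing the file by hand. A rough sketch, assuming the same Python 2.7 / urllib2 setup and file name as in the question:
import urllib2

with open('ocean_not_templated_url.txt', 'r') as text:
    for line in text:
        url_string = line.strip('\xa0 \t\n\r\v')
        if not url_string:
            # Skip empty lines instead of passing '' to urlopen.
            continue
        url = urllib2.urlopen(url_string)
        with open(url_string.replace('/', '!').replace(':', '~'), 'wb') as out:
            for d in url:
                out.write(d)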

How to make a Python script for resumable downloads?

I've been searching (without results) for a resumable (not sure if that's the correct word, sorry) way to download big files from the internet with Python. I know how to do it directly with urllib2, but if something interrupts the connection, I need some way to reconnect and continue the download where it left off, if that's possible (like a download manager).
For other people whom this answer may help: HTTP has a feature called range requests (byte serving, not chunked transfer encoding) that allows exactly this, by setting the 'Range' header of the request to the beginning and end bytes (separated by a dash). So you can simply count how many bytes were downloaded previously and send that as the new beginning byte to continue the download. Example with the requests module:
import requests
from os.path import getsize

# get the size of the previously downloaded partial file
beg = getsize(PATH_TO_FILE)
# optionally get the total size up front (without actually downloading the file)
end = int(requests.head(URL).headers['content-length'])
# continue the download from the first byte that is still missing
# (byte offsets are zero-based and the range is inclusive)
headers = {'Range': "bytes=%d-%d" % (beg, end - 1)}
download = requests.get(URL, headers=headers)
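
The snippet above only builds the ranged request; to finish the file you still have to append the newly received bytes to the partial download. A possible last step, assuming the same PATH_TO_FILE and URL names and using requests' streaming mode:
import requests
from os.path import getsize

beg = getsize(PATH_TO_FILE)             # bytes already on disk
headers = {'Range': 'bytes=%d-' % beg}  # open-ended range: the rest of the file

with requests.get(URL, headers=headers, stream=True) as download:
    if download.status_code == 206:     # server honoured the Range header
        with open(PATH_TO_FILE, 'ab') as f:
            for chunk in download.iter_content(chunk_size=64 * 1024):
                f.write(chunk)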

Pycurl redirect option ignored and assertion failed trying to read video from the web?

I am trying to write a program that reads a webpage looking for file links, which it then attempts to download using curl/libcurl/pycurl. I have everything up to the pycurl step working correctly, and when I use a curl command in the terminal, I can get the file to download. The curl command looks like the following:
curl -LO https://archive.org/download/TheThreeStooges/TheThreeStooges-001-WomanHaters1934moeLarryCurleydivxdabaron19m20s.mp4
This results in one redirect (a file that reads as all 0s on the output) and then it correctly downloads the file. When I remove the -L flag (so the command is just -O) it only reaches the first line, where it doesn't find a file, and stops.
But when I try to do the same operation using pycurl in a Python script, I am unable to successfully set [Curl object].FOLLOWLOCATION to 1, which is supposed to be the equivalent of the -L flag. The python code looks like the following:
import pycurl

c = pycurl.Curl()  # get a Curl object
fp = open(file_name, 'wb')
c.setopt(c.URL, full_url)  # set the url
c.setopt(c.FOLLOWLOCATION, 1)  # follow redirects, like curl -L
c.setopt(c.WRITEDATA, fp)
c.perform()
When this runs, it gets to c.perform() and shows the following:
python2.7: src/pycurl.c:272: get_thread_state: Assertion `self->ob_type == p_Curl_Type' failed.
Is it missing the redirect, or am I missing something else earlier because I am relatively new to cURL?
When I enabled verbose output for the c.perform() step, I was able to uncover what I believe was the underlying problem with my program. The first line of that output indicated that an open connection was being reused.
I had originally packaged the code into an object-oriented setup, as opposed to a plain script, so the Curl object was being reused without ever being closed. After the first connection attempt, which failed because I hadn't set the options correctly, it kept reusing the connection to the website/server (which presumably still had the wrong connection settings).
The problem was resolved by having the script close any existing Curl objects and create a new one before each file download.
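
For reference, here is roughly what that arrangement looks like with pycurl: a fresh Curl handle per download, FOLLOWLOCATION set before perform(), and the handle closed afterwards. full_url and file_name are placeholders, as in the question:
import pycurl

def download(full_url, file_name):
    c = pycurl.Curl()                      # fresh handle for every download
    try:
        with open(file_name, 'wb') as fp:
            c.setopt(c.URL, full_url)
            c.setopt(c.FOLLOWLOCATION, 1)  # equivalent of curl's -L flag
            c.setopt(c.WRITEDATA, fp)
            c.perform()
    finally:
        c.close()                          # never reuse a stale handle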

Download file using urllib in Python with the wget -c feature

I am writing a piece of software in Python to download PDFs over HTTP from a database.
Sometimes the download stops with this message:
retrieval incomplete: got only 3617232 out of 10689634 bytes
How can I ask the download to restart where it stopped, using the 206 Partial Content HTTP feature?
I can do it using wget -c and it works pretty well, but I would like to implement it directly in my Python software.
Any idea ?
Thank you
You can request a partial download by sending a GET with the Range header:
import urllib2

req = urllib2.Request('http://www.python.org/')
#
# Here we request that bytes 18000--19000 be downloaded.
# The range is inclusive, and starts at 0.
#
req.headers['Range'] = 'bytes=%s-%s' % (18000, 19000)
f = urllib2.urlopen(req)

# This shows you the *actual* bytes that have been downloaded.
content_range = f.headers.get('Content-Range')
print(content_range)
# bytes 18000-18030/18031
print(repr(f.read()))
# ' </div>\n</body>\n</html>\n\n\n\n\n\n\n'
Be careful to check the Content-Range to learn what bytes have actually been downloaded, since your range may be out of bounds, and/or not all servers seem to respect the Range header.
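
Putting that together with the wget -c behaviour from the question: check how much of the file is already on disk, request an open-ended range, and only append when the server answers 206 Partial Content. A rough sketch under those assumptions (pdf_url and local_path are placeholder names):
import os
import urllib2

def resume_download(pdf_url, local_path):
    already = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib2.Request(pdf_url)
    if already:
        req.headers['Range'] = 'bytes=%d-' % already
    f = urllib2.urlopen(req)
    # 206 means the server sent only the missing tail; anything else
    # (usually 200) means it ignored the Range header.
    mode = 'ab' if already and f.getcode() == 206 else 'wb'
    with open(local_path, mode) as out:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)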

Python: Downloading a large file to a local path and setting custom http headers

I am looking to download a file from an http url to a local file. The file is large enough that I want to download it and save it in chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import urllib2

def urlretrieve(urlfile, fpath):
    chunk = 4096
    f = open(fpath, "wb")  # binary mode so the downloaded bytes are written unmodified
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
and use a Request object to set the headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (ie: addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib

class MyURLopener(urllib.URLopener):
    pass  # your override here, perhaps to __init__

urllib._urlopener = MyURLopener()
However, you'll be pleased to hear, regarding your comment on the question, that reading an empty string from read() is indeed the signal to stop. This is how urlretrieve decides when to stop, for example. TCP/IP and sockets abstract the reading process: read() blocks waiting for additional data until the connection on the other end reaches EOF and is closed, at which point read()ing from the connection returns an empty string. An empty string means there is no more data coming; you don't have to worry about ordered packet re-assembly, as that has all been handled for you. If that's your concern about urllib2, I think you can safely use it.
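
For completeness, here is a hedged sketch of the URLopener route described above: subclass it, call addheader() in __init__, install an instance as urllib._urlopener, and urllib.urlretrieve then goes through it. The header and user-agent values are only examples:
import urllib

class MyURLopener(urllib.URLopener):
    version = 'Chrome XXX'  # reported as the User-Agent header

    def __init__(self, *args, **kwargs):
        urllib.URLopener.__init__(self, *args, **kwargs)
        self.addheader('Accept', 'sound/basic')  # the docstring's own example

urllib._urlopener = MyURLopener()
urllib.urlretrieve("http://www.google.com", "/tmp/del.html")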
