Using urllib2, we can get the HTTP response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most Pythonic way to parse through the files would be.
When you get the whole HTTP response of a generic file-server listing through urllib2's urlopen() method, how can we neatly download each file?
urllib2 might be OK to retrieve the list of files. For downloading large amounts of binary files, PycURL (http://pycurl.sourceforge.net/) is a better choice. This works for my IIS-based file server:
import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
# pull the file names out of the anchor tags in the directory listing
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path

response = urllib2.urlopen(url + path).read()

for filename in re.findall(pattern, response):
    with open(filename, "wb") as fp:
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url + path + filename)
        curl.setopt(pycurl.WRITEDATA, fp)
        curl.perform()
        curl.close()
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
And here is a function that does the same thing (using urllib):
import urllib

def download(url):
    webFile = urllib.urlopen(url)
    # save under the last component of the URL, in binary mode
    localFile = open(url.split('/')[-1], 'wb')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
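A rough sketch of that approach, assuming the listing is plain HTML whose anchor tags point directly at the files (the URL here is just a placeholder):

import urllib2
import urlparse
import lxml.html

listing_url = "http://server.domain/path/"
doc = lxml.html.fromstring(urllib2.urlopen(listing_url).read())

for href in doc.xpath('//a/@href'):
    file_url = urlparse.urljoin(listing_url, href)
    if file_url.endswith('/'):
        continue  # skip parent/subdirectory links
    # save each file under its own name
    with open(file_url.rsplit('/', 1)[-1], 'wb') as out:
        out.write(urllib2.urlopen(file_url).read())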
1. Download the index file.
If it's really huge, it may be worth reading it a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.
2. Extract the list of files to get.
If the list is XML or HTML, use a proper parser; else if there is much string processing to do, use a regex; else use simple string methods.
Again, you can parse it all at once or incrementally. Incrementally is somewhat more efficient and elegant, but unless you are processing multiple tens of thousands of lines it's probably not critical.
3. For each file, download it and save it to a file.
If you want to try to speed things up, you could try running multiple download threads; another (significantly faster) approach might be to delegate the work to a dedicated downloader program like Aria2 (http://aria2.sourceforge.net/). Note that Aria2 can be run as a service and controlled via XML-RPC (see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython and the sketch below).
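For the Aria2 route, here is a rough sketch, assuming an aria2c instance is already running locally with its XML-RPC interface enabled on the default port 6800 (the download URL is a placeholder):

import xmlrpclib

# talk to a locally running aria2c that has XML-RPC enabled
server = xmlrpclib.ServerProxy('http://localhost:6800/rpc')

# queue a download; aria2 returns a GID identifying the transfer
gid = server.aria2.addUri(['http://server.domain/path/file1.zip'])

# poll the transfer's status
print server.aria2.tellStatus(gid, ['status', 'completedLength', 'totalLength'])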
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
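As a rough sketch of the urlretrieve variant (the listing URL is a placeholder, and this assumes the page's anchor tags link directly to the files):

import urllib
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

listing_url = "http://server.domain/path/"
soup = BeautifulSoup(urllib2.urlopen(listing_url).read())

for anchor in soup.findAll('a', href=True):
    file_url = urlparse.urljoin(listing_url, anchor['href'])
    if file_url.endswith('/'):
        continue  # skip links to subdirectories
    # save each file under its own name, wget-style
    urllib.urlretrieve(file_url, file_url.rsplit('/', 1)[-1])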
This is an unconventional way, but it works:

fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)

The conventional way:

urllib.urlretrieve(link, picName)
Here's an untested solution:
import urllib2

# assumes file.txt holds one URL per line
response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')

for file in urls:
    if not file:
        continue
    print 'Downloading ' + file
    response = urllib2.urlopen(file)
    # save under the last component of the URL, in binary mode
    handle = open(file.split('/')[-1], 'wb')
    handle.write(response.read())
    handle.close()
It's untested, and it probably won't work. This assumes you have an actual list of file URLs inside another file. Good luck!
Related
I've been searching (without results) for a resumable way (I'm not sure that's the correct word, sorry) to download big files from the internet with Python. I know how to do it directly with urllib2, but if something interrupts the connection, I need some way to reconnect and continue the download where it left off, if that's possible (like a download manager does).
For anyone else this might help: HTTP supports range requests (partial content), which let you do this by setting the 'Range' header of the request to the beginning and end bytes (separated by a dash). So you can just count how many bytes were downloaded previously and send that as the new beginning byte to continue the download. An example with the requests module:
import requests
from os.path import getsize

# size of the partially downloaded file already on disk
beg = getsize(PATH_TO_FILE)
# total size, taken from a HEAD request (no body is downloaded)
end = requests.head(URL).headers['content-length']

# continue from the first byte we don't have yet
# (byte offsets are zero-based, so that's index beg, not beg + 1)
headers = {'Range': "bytes=%d-%s" % (beg, end)}
download = requests.get(URL, headers=headers)

# append the new data to the existing partial file
with open(PATH_TO_FILE, 'ab') as f:
    f.write(download.content)
I want to read specific bytes from a remote file using a Python module. I am using urllib2. By specific bytes I mean bytes given as an offset and a size. I know we can read X bytes from a remote file using urlopen(link).read(X). Is there any way to read data of length Size starting at Offset?
def readSpecificBytes(link, Offset, size):
    # code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2

def get_part_of_url(link, start_byte, end_byte):
    req = urllib2.Request(link)
    # ask the server for just this byte range
    req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
    resp = urllib2.urlopen(req)
    content = resp.read()
    return content
Note that with this approach the server never has to send, and you never have to download, the data you don't need, which can save a lot of bandwidth if you only want a small amount of data from a large file. When it doesn't work, you'll just have to read (and discard) the bytes that come before the ones you want. See the Wikipedia article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link, Offset, size):
    f = urllib2.urlopen(link)
    if Offset > 0:
        # read and throw away the first Offset bytes
        f.read(Offset)
    return f.read(size)
I want to get the source of a page, but not from the internet; rather from the local file system, for example: url = urllib.request.urlopen('c://1.html'). This is what works for a normal URL:
>>> import urllib.request
>>> url=urllib.request.urlopen ('http://google.com')
>>> page =url.read()
>>> page=page.decode()
>>> page
What am I doing wrong?
from os.path import abspath

with open(abspath('c:/1.html')) as fh:
    print(fh.read())
Since url.read() just gives you the data as-is, and .decode() doesn't really do anything except convert the byte data from the socket into an ordinary string, why not just print the file's contents directly?
urllib is mainly (if not only) a transport for receiving HTML data; it doesn't actually parse the content. All it does is connect to the source, separate the headers, and give you the content. If you've already stored the page locally in a file, then urllib has no more use to you. If you need to parse the HTML, consider looking at an HTML parsing library such as BeautifulSoup.
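That said, if you'd rather keep the urllib.request interface, urlopen also accepts file:// URLs, so something like this should work (path adapted from the question):

import urllib.request

# use a file:// URL instead of a bare Windows path
url = urllib.request.urlopen('file:///C:/1.html')
page = url.read().decode()
print(page)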
If a would-be HTTP server written in Python 2.6 has local access to a file, what would be the most correct way for that server to return the file to a client on request?
Let's say this is the current situation:
header('Content-Type', file.mimetype)
header('Content-Length', file.size) # file size in bytes
header('Content-MD5', file.hash) # an md5 hash of the entire file
return open(file.path).read()
All the files are .zip or .rar archives no bigger than a couple of megabytes.
With the current situation, browsers handle the incoming download weirdly. No browser knows the file's name, for example, so they use a random or default one. (Firefox even saved the file with a .part extension, even though it was complete and completely usable.)
What would be the best way to fix this and other errors I may not even be aware of, yet?
What headers am I not sending?
Thanks!
This is how I send a ZIP file:

req.send_response(200)
req.send_header('Content-Type', 'application/zip')
req.send_header('Content-Disposition',
                'attachment; filename=%s' % filename)
Most browsers handle it correctly.
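Putting the question's headers and this answer together, a do_GET handler for a plain BaseHTTPServer-based server might look roughly like this (the archive path, filename, and port are placeholders; Content-MD5 is sent as a base64-encoded digest per the HTTP spec):

import base64
import hashlib
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class DownloadHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        data = open('/tmp/archive.zip', 'rb').read()  # placeholder path
        self.send_response(200)
        self.send_header('Content-Type', 'application/zip')
        self.send_header('Content-Length', str(len(data)))
        self.send_header('Content-MD5', base64.b64encode(hashlib.md5(data).digest()))
        self.send_header('Content-Disposition', 'attachment; filename=archive.zip')
        self.end_headers()
        self.wfile.write(data)

HTTPServer(('', 8000), DownloadHandler).serve_forever()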
If you don't have to return the response body (that is, if you are given a stream for the response body by your framework) you can avoid holding the file in memory with something like this:
fp = file(path_to_the_file, 'rb')
while True:
    bytes = fp.read(8192)
    if bytes:
        response.write(bytes)
    else:
        return
What web framework are you using?
I am looking to download a file from an HTTP URL to a local file. The file is large enough that I want to download it and save it in chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import os
import sys
import urllib2

def urlretrieve(urlfile, fpath):
    chunk = 4096
    # write in binary mode so non-text downloads aren't mangled
    f = open(fpath, "wb")
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
And here's how to use a Request object to set the headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (e.g. addheader('Accept', 'sound/basic'), which I'm pulling from the docstring of urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib

class MyURLopener(urllib.URLopener):
    pass  # your override here, perhaps to __init__

urllib._urlopener = MyURLopener()
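A rough sketch of the whole pattern, with example header values (the version attribute overrides the default User-Agent; the concrete values and URL here are only illustrations):

import urllib

class MyURLopener(urllib.URLopener):
    version = 'MyDownloader/1.0'  # example User-Agent override

opener = MyURLopener()
opener.addheader('Accept', 'sound/basic')  # the docstring example
urllib._urlopener = opener

# urlretrieve now goes through the customised opener and its headers
urllib.urlretrieve('http://www.example.com/somefile', '/tmp/somefile')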
However, you'll be pleased to hear, with regard to your comment on the question, that reading an empty string from read() is indeed the signal to stop; that's how urlretrieve detects the end of the download, for example. TCP/IP and sockets abstract the reading process for you: read() blocks while waiting for additional data and only returns an empty string once the connection on the other end has reached EOF and been closed. So an empty string really does mean there is no more data coming, and you don't have to worry about ordered packet re-assembly, as that has all been handled for you. If that was your concern with urllib2, I think you can safely use it.