Download a file in python with urllib2 instead of urllib - python

I'm trying to download a tarball file and save it locally with Python. With urllib it's pretty simple:
import urllib
import tarfile

urllib.urlretrieve(url, 'compressed_file.tar.gz')
tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
So my question is really simple: What's the way to achieve this using the urllib2 library?

Quoting docs:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed.
Nothing in the urlopen interface documentation says that the second argument is the name of a file where the response should be written.
You need to explicitly write the data read from the response to a file:
import urllib2

r = urllib2.urlopen(url)
CHUNK_SIZE = 1 << 20  # 1 MiB per chunk

with open('compressed_file.tar.gz', 'wb') as f:
    # The line below downloads the whole file into memory at once and dumps it to disk afterwards:
    # f.write(r.read())
    # The preferable lazy solution below downloads and writes the data in chunks:
    while True:
        chunk = r.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)
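Once the loop above finishes, the rest of the snippet from the question works unchanged; a minimal follow-up, assuming the download completed:
import tarfile

tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()  # same inspection as the urllib version in the question
tar.close()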

Related

Download file using python without knowing its extension. - content-type - stream

Hey, I just did some research and found that I could download images from URLs that end with filename.extension, like 000000.jpeg. I now wonder how I could download a picture whose URL doesn't have any extension.
Here is the URL of the image I want to download: http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api
When I put the URL directly into the browser, it displays an image.
Furthermore, here is what I tried:
from six.moves import urllib

thumbnail = 'http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api'
img = urllib.request.Request(thumbnail)
pic = urllib.request.urlopen(img).read()  # raw image bytes, but no obvious extension to save them under
Any help will be appreciated so much.
This is a way to do it using the HTTP response headers:
import requests
import time

r = requests.get("http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api", stream=True)
ext = r.headers['content-type'].split('/')[-1]  # derive an extension from the response's MIME type (may not work for everything)
with open("%s.%s" % (time.time(), ext), 'wb') as f:  # open the file for binary writing - use 'w' instead of 'wb' for text files
    for chunk in r.iter_content(1024):  # iterate over the stream in 1 KB chunks
        f.write(chunk)  # write the file
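If you'd rather stay with the six.moves urllib approach from the question, the same idea works there too. A rough sketch, using the standard mimetypes module to map the Content-Type header to an extension (the exact extension it guesses, e.g. '.jpe' vs '.jpeg', can vary between Python versions):
import mimetypes
import time
from six.moves import urllib

thumbnail = 'http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api'
pic = urllib.request.urlopen(thumbnail)
content_type = pic.info().get('Content-Type')             # e.g. 'image/jpeg'
ext = mimetypes.guess_extension(content_type) or '.bin'   # fall back if the type is unknown
with open('%s%s' % (time.time(), ext), 'wb') as f:
    f.write(pic.read())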

Python download file from URL

Trying to download a file from this URL. It is an Excel file, and only 1 KB of it gets downloaded, while the file size is actually 7 MB. I don't understand what has to be done here. But if I copy and paste the URL into IE, the entire file is downloaded.
import requests

res = requests.get('http://fescodoc.***.com/fescodoc/component/getcontent?objectId=09016fe98f2b59bb&current=true')
res.raise_for_status()
playFile = open('DC_Estimation Form.xlsm', 'wb')
for chunk in res.iter_content(1024):
    playFile.write(chunk)
You should set stream to True in the get():
res = requests.get('http://fescodoc.***.com/fescodoc/component/getcontent?objectId=09016fe98f2b59bb&current=true', stream=True)
res.raise_for_status()
with open('DC_Estimation Form.xlsm', 'wb') as playFile:
    for chunk in res.iter_content(1024):
        playFile.write(chunk)
See here: http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
It is easier to use the built-in urllib module for such cases: https://docs.python.org/2/library/urllib.html#urllib.urlretrieve
import urllib
urllib.urlretrieve('http://fescodoc/component/.', 'DC_Estimation_Form.xlsm')
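urlretrieve also takes an optional reporthook callback, which makes it easy to check how much data is actually arriving; a small sketch, with a placeholder URL standing in for the redacted fescodoc address:
import urllib

def report(block_num, block_size, total_size):
    # called repeatedly during the download; total_size is -1 when the server sends no Content-Length
    print 'downloaded about %d of %d bytes' % (block_num * block_size, total_size)

# the URL below is a placeholder for the real (redacted) fescodoc address
urllib.urlretrieve('http://example.com/DC_Estimation_Form.xlsm', 'DC_Estimation_Form.xlsm', report)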

Using configparser for text read from URL using urllib2

I have to read a txt INI file from a URL. [this is required]
res = urllib2.urlopen(URL)
inifile = res.read()
Then I basically want to use it the same way I would use any txt file I had read.
config = ConfigParser.SafeConfigParser()
config.read( inifile )
But now it looks like I can't use it that way, since what I have is actually a string.
Can anybody suggest a way around this?
You want ConfigParser's readfp method -- presumably, you might even be able to get away with:
res = urllib2.urlopen(URL)
config = ConfigParser.SafeConfigParser()
config.readfp(res)
assuming that urllib2.urlopen returns an object that is sufficiently file-like (i.e. it has a readline method). For easier debugging, you could do:
config.readfp(res, URL)
If you have to read the data from a string, you could pack the whole thing into an io.StringIO (or StringIO.StringIO) buffer and read from that:
import io
res = urllib2.urlopen(URL)
inifile_text = res.read().decode('utf-8')  # io.StringIO needs unicode on Python 2; adjust the encoding if necessary
inifile = io.StringIO(inifile_text)
inifile.seek(0)
config.readfp(inifile)
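Either way, once readfp has parsed the response you query the parser as usual; a quick sketch with made-up section and option names, since the real INI layout isn't shown in the question:
# 'general' and 'timeout' are hypothetical names - substitute whatever your INI file actually contains
print config.sections()
print config.get('general', 'timeout')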

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 KB when it should be around 700 KB, and it contains a listing of the page's contents (i.e. all the files available for download). I am able to download the file manually without a problem. I have tried urllib and urllib2 with different configurations of each, but both do the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code; I have given an example of each here to demonstrate.
import urllib2
urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use a Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
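The snippet stops at reading the bytes into data; to actually keep the file on disk you would still write them out, e.g. with the file name from the question:
with open("txga1000.14d.Z", "wb") as f:
    f.write(data)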
I was able to download what seems like the correct size for your file with the following python2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: The answer above mine was posted faster and is much better. I thought I'd post this anyway, since I had already tested and written it out.

can python-requests fetch url directly to file handle on disk like curl?

curl has options to save file and header data directly to disk:
curl_setopt($curl_obj, CURLOPT_WRITEHEADER, $header_handle);
curl_setopt($curl_obj, CURLOPT_FILE, $file_handle);
Is the same ability available in python-requests?
As far as I know, requests does not provide a function that saves content to a file.
import requests
with open('local-file', 'wb') as f:
    r = requests.get('url', stream=True)
    f.writelines(r.iter_content(1024))
See the requests.Response.iter_content documentation.
iter_content(chunk_size=1, decode_unicode=False)
Iterates over the response data. When stream=True is set on the
request, this avoids reading the content at once into memory for large
responses. The chunk size is the number of bytes it should read into
memory. This is not necessarily the length of each item returned as
decoding can take place.
If you're saving something that is not a text file, don't use f.writelines(). Instead use one of these:
import requests

try:
    r = requests.get(chosen, stream=True)  # 'chosen' is the URL to download
except Exception as E:
    print(E)
    # handle exceptions here.

# both methods work here...
with open(filepath, 'wb') as handle:
    for block in r.iter_content(1024):
        handle.write(block)

# or...
import shutil
with open(filepath, 'wb') as handle:
    shutil.copyfileobj(r.raw, handle)
shutil is much more flexible for dealing with missing folders or recursive file copying, etc. And it allows you to save the raw data from requests without worrying about blocksize and all that.
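One caveat with the shutil.copyfileobj variant: r.raw is the raw urllib3 stream, before requests undoes any gzip/deflate Content-Encoding, so you usually want to ask it to decode the content first. A short sketch, reusing the chosen and filepath names from the snippet above:
import requests
import shutil

r = requests.get(chosen, stream=True)  # 'chosen' is the URL, as in the snippet above
r.raw.decode_content = True            # let urllib3 undo gzip/deflate encoding while copying
with open(filepath, 'wb') as handle:
    shutil.copyfileobj(r.raw, handle)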
