I have the following view code that attempts to "stream" a zipfile to the client for download:
import os
import zipfile
import tempfile
from pyramid.response import FileIter
def zipper(request):
_temp_path = request.registry.settings['_temp']
tmpfile = tempfile.NamedTemporaryFile('w', dir=_temp_path, delete=True)
tmpfile_path = tmpfile.name
## creating zipfile and adding files
z = zipfile.ZipFile(tmpfile_path, "w")
z.write('somefile1.txt')
z.write('somefile2.txt')
z.close()
## renaming the zipfile
new_zip_path = _temp_path + '/somefilegroup.zip'
os.rename(tmpfile_path, new_zip_path)
## re-opening the zipfile with new name
z = zipfile.ZipFile(new_zip_path, 'r')
response = FileIter(z.fp)
return response
However, this is the Response I get in the browser:
Could not convert return value of the view callable function newsite.static.zipper into a response object. The value returned was .
I suppose I am not using FileIter correctly.
UPDATE:
Since updating with Michael Merickel's suggestions, the FileIter function is working correctly. However, still lingering is a MIME type error that appears on the client (browser):
Resource interpreted as Document but transferred with MIME type application/zip: "http://newsite.local:6543/zipper?data=%7B%22ids%22%3A%5B6%2C7%5D%7D"
To better illustrate the issue, I have included a tiny .py and .pt file on Github: https://github.com/thapar/zipper-fix
FileIter is not a response object, just like your error message says. It is an iterable that can be used for the response body, that's it. Also the ZipFile can accept a file object, which is more useful here than a file path. Let's try writing into the tmpfile, then rewinding that file pointer back to the start, and using it to write out without doing any fancy renaming.
import os
import zipfile
import tempfile
from pyramid.response import FileIter
def zipper(request):
_temp_path = request.registry.settings['_temp']
fp = tempfile.NamedTemporaryFile('w+b', dir=_temp_path, delete=True)
## creating zipfile and adding files
z = zipfile.ZipFile(fp, "w")
z.write('somefile1.txt')
z.write('somefile2.txt')
z.close()
# rewind fp back to start of the file
fp.seek(0)
response = request.response
response.content_type = 'application/zip'
response.app_iter = FileIter(fp)
return response
I changed the mode on NamedTemporaryFile to 'w+b' as per the docs to allow the file to be written to and read from.
current Pyramid version has 2 convenience classes for this use case- FileResponse, FileIter. The snippet below will serve a static file. I ran this code - the downloaded file is named "download" like the view name. To change the file name and more set the Content-Disposition header or have a look at the arguments of pyramid.response.Response.
from pyramid.response import FileResponse
#view_config(name="download")
def zipper(request):
path = 'path_to_file'
return FileResponse(path, request) #passing request is required
docs:
http://docs.pylonsproject.org/projects/pyramid/en/latest/api/response.html#
hint: extract the Zip logic from the view if possible
Related
This is a small widget that I am designing that is designed to 'browse' while circumventing proxy settings. I have been told on Code Review that it would be beneficial here, but am struggling to put it in with my program's current logic. Here is the code:
import urllib.request
import webbrowser
import os
import tempfile
location = os.path.dirname(os.path.abspath(__file__))
proxy_handler = urllib.request.ProxyHandler(proxies=None)
opener = urllib.request.build_opener(proxy_handler)
def navigate(query):
response = opener.open(query)
html = response.read()
return html
def parse(data):
start = str(data)[2:-1]
lines = start.split('\\n')
return lines
while True:
url = input("Path: ")
raw_data = navigate(url)
content = parse(raw_data)
with open('cache.html', 'w') as f:
f.writelines(content)
webbrowser.open_new_tab(os.path.join(location, 'cache.html'))
Hopefully someone who has worked with these modules before can help me. The reason that I want to use tempfile is that my program gets raw html, parses it and stores it in a file. This file is overwritten every time a new input comes in, and would ideally be deleted when the program stops running. Also, the file doesn't have to exist when the program initializes so it seems logical from that view also.
Since you are passing the name of the file to webbrowser.open_new_tab(), you should use a NamedTemporaryFile
cache = tempfile.NamedTemporaryFile()
...
cache.seek(0)
cache.writelines(bytes(line, 'UTF-8') for line in content)
cache.seek(0)
webbrowser.open_new_tab('file://' + cache.name)
Currently, I'm just serving files like this:
# view callable
def export(request):
response = Response(content_type='application/csv')
# use datetime in filename to avoid collisions
f = open('/temp/XML_Export_%s.xml' % datetime.now(), 'r')
# this is where I usually put stuff in the file
response.app_iter = f
response.headers['Content-Disposition'] = ("attachment; filename=Export.xml")
return response
The problem with this is that I can't close or, even better, delete the file after the response has been returned. The file gets orphaned. I can think of some hacky ways around this, but I'm hoping there's a standard way out there somewhere. Any help would be awesome.
You do not want to set a file pointer as the app_iter. This will cause the WSGI server to read the file line by line (same as for line in file), which is typically not the most efficient way to control a file upload (imagine one character per line). Pyramid's supported way of serving files is via pyramid.response.FileResponse. You can create one of these by passing a file object.
response = FileResponse('/some/path/to/a/file.txt')
response.headers['Content-Disposition'] = ...
Another option is to pass a file pointer to app_iter but wrap it in the pyramid.response.FileIter object, which will use a sane block size to avoid just reading the file line by line.
The WSGI specification has strict requirements that response iterators which contain a close method will be invoked at the end of the response. Thus setting response.app_iter = open(...) should not cause any memory leaks. Both FileResponse and FileIter also support a close method and will thus be cleaned up as expected.
As a minor update to this answer I thought I'd explain why FileResponse takes a file path and not a file pointer. The WSGI protocol provides servers an optional ability to provide an optimized mechanism for serving static files via environ['wsgi.file_wrapper']. FileResponse will automatically handle this if your WSGI server has provided that support. With this in mind, you find it to be a win to save your data to a tmpfile on a ramdisk and providing the FileResponse with the full path, instead of trying to pass a file pointer to FileIter.
http://docs.pylonsproject.org/projects/pyramid/en/1.4-branch/api/response.html#pyramid.response.FileResponse
Update:
Please see Michael Merickel's answer for a better solution and explanation.
If you want to have the file deleted once response is returned, you can try the following:
import os
from datetime import datetime
from tempfile import NamedTemporaryFile
# view callable
def export(request):
response = Response(content_type='application/csv')
with NamedTemporaryFile(prefix='XML_Export_%s' % datetime.now(),
suffix='.xml', delete=True) as f:
# this is where I usually put stuff in the file
response = FileResponse(os.path.abspath(f.name))
response.headers['Content-Disposition'] = ("attachment; filename=Export.xml")
return response
You can consider using NamedTemporaryFile:
NamedTemporaryFile(prefix='XML_Export_%s' % datetime.now(), suffix='.xml', delete=True)
Setting delete=True so that the file is deleted as soon as it is closed.
Now, with the help of with you can always have the guarantee that the file will be closed, and hence deleted:
from tempfile import NamedTemporaryFile
from datetime import datetime
# view callable
def export(request):
response = Response(content_type='application/csv')
with NamedTemporaryFile(prefix='XML_Export_%s' % datetime.now(),
suffix='.xml', delete=True) as f:
# this is where I usually put stuff in the file
response.app_iter = f
response.headers['Content-Disposition'] = ("attachment; filename=Export.xml")
return response
The combination of Michael and Kay's response works great under Linux/Mac but won't work under Windows (for auto-deletion). Windows doesn't like the fact that FileResponse tries to open the already open file (see description of NamedTemporaryFile).
I worked around this by creating a FileDecriptorResponse class which is essentially a copy of FileResponse, but takes the file descriptor of the open NamedTemporaryFile. Just replace the open with a seek(0) and all the path based calls (last_modified, content_length) with their fstat equivalents.
class FileDescriptorResponse(Response):
"""
A Response object that can be used to serve a static file from an open
file descriptor. This is essentially identical to Pyramid's FileResponse
but takes a file descriptor instead of a path as a workaround for auto-delete
not working with NamedTemporaryFile under Windows.
``file`` is a file descriptor for an open file.
``content_type``, if passed, is the content_type of the response.
``content_encoding``, if passed is the content_encoding of the response.
It's generally safe to leave this set to ``None`` if you're serving a
binary file. This argument will be ignored if you don't also pass
``content-type``.
"""
def __init__(self, file, content_type=None, content_encoding=None):
super(FileDescriptorResponse, self).__init__(conditional_response=True)
self.last_modified = fstat(file.fileno()).st_mtime
if content_type is None:
content_type, content_encoding = mimetypes.guess_type(path,
strict=False)
if content_type is None:
content_type = 'application/octet-stream'
self.content_type = content_type
self.content_encoding = content_encoding
content_length = fstat(file.fileno()).st_size
file.seek(0)
app_iter = FileIter(file, _BLOCK_SIZE)
self.app_iter = app_iter
# assignment of content_length must come after assignment of app_iter
self.content_length = content_length
Hope that's helpful.
There is also repoze.filesafe which will take care of generating a temporary file for you, and delete it at the end. I use it for saving files uploaded to my server. Perhaps it can be useful to you too.
Because your Object response is holding a file handle for the file '/temp/XML_Export_%s.xml'. Use del statement to delete handle 'response.app_iter'.
del response.app_iter
both Michael Merickel and Kay Zhu are fine.
I found out that I also need to reset file position at the begninnign of the NamedTemporaryFile before passing it to response, as it seems that the response starts from the actual position in the file and not from the beginning (It's fine, you just need to now it).
With NamedTemporaryFile with deletion set, you can not close and reopen it, because it would delete it (and you can't reopen it anyway), so you need to use something like this:
f = tempfile.NamedTemporaryFile()
#fill your file here
f.seek(0, 0)
response = FileResponse(
f,
request=request,
content_type='application/csv'
)
hope it helps ;)
XLRD is installed and tested:
>>> import xlrd
>>> workbook = xlrd.open_workbook('Sample.xls')
When I read the file through html form like below, I'm able to access all the values.
xls_file = request.params['xls_file']
print xls_file.filename, xls_file.type
I'm using Pylons module, request comes from: from pylons import request, tmpl_context as c
My questions:
Is xls_file read through requst.params an object?
How can I read xls_file and make it work with xlrd?
Update:
The xls_file is uploaded on web server, but the xlrd library expects a filename instead of an open file object, How can I make the uploaded file to work with xlrd? (Thanks to Martijn Pieters, I was being unable to formulate the question clearly.)
xlrd does support providing data directly without a filepath, just use the file_contents argument:
xlrd.open_workbook(file_contents=fileobj.read())
From the documentation:
file_contents – A string or an mmap.mmap object or some other behave-alike object. If file_contents is supplied, filename will not be used, except (possibly) in messages.
What I met is not totally the same with the question, but I think maybe it is similar and I can give some hints.
I am using django rest framework's request instead of pylons request.
If I write simple codes like following:
#api_view(['POST'])
#renderer_classes([JSONRenderer])
def upload_files(request):
file_obj = request.FILES['file']
from xlrd import open_workbook
wb = open_workbook(file_contents=file_obj.read())
result = {"code": "0", "message": "success", "data": {}}
return Response(status=200, data=result)
Here We can read using open_workbook(file_contents=file_obj.read()) as mentioned in previous comments.
But if you write code in following way:
from rest_framework.views import APIView
from rest_framework.parsers import MultiPartParser
class FileUploadView(APIView):
parser_classes = (MultiPartParser,)
def put(self, request, filename, format=None):
file_obj = request.FILES.get('file')
from xlrd import open_workbook
wb = open_workbook(file_contents=file_obj.read())
# do some stuff with uploaded file
return Response(status=204)
You must pay attention that using MultiPartParser instead of FileUploadParser, using FileUploadParser will raise some BOF error.
So I am wondering somehow it is also affected by how you write the API.
For me this code works. Python 3
xlrd.open_workbook(file_contents=fileobj.content)
You could try something like...
import xlrd
def newopen(fileobject, modes):
return fileobject
oldopen = __builtins__.open
__builtins__.open = newopen
InputWorkBook = xlrd.open_workbook(fileobject)
__builtins__.open = oldopen
You may have to wrap the fileobject in StringIO if it isn't already a file handle.
I am wondering whether there is a way to upload a zip file to django web server and put the zip's files into django database WITHOUT accessing the actual file system in the process (e.g. extracting the files in the zip into a tmp dir and then load them)
Django provides a function to convert python File to Django File, so if there is a way to convert ZipExtFile to python File, it should be fine.
thanks for help!
Django model:
from django.db import models
class Foo:
file = models.FileField(upload_to='somewhere')
Usage:
from zipfile import ZipFile
from django.core.exceptions import ValidationError
from django.core.files import File
from io import BytesIO
z = ZipFile('zipFile')
istream = z.open('subfile')
ostream = BytesIO(istream.read())
tmp = Foo(file=File(ostream))
try:
tmp.full_clean()
except Validation, e:
print e
Output:
{'file': [u'This field cannot be blank.']}
[SOLUTION] Solution using an ugly hack:
As correctly pointed out by Don Quest, file-like classes such as StringIO or BytesIO should represent the data as a virtual file. However, Django File's constructor only accepts the build-in file type and nothing else, although the file-like classes would have done the job as well. The hack is to set the variables in Django::File manually:
buf = bytesarray(OPENED_ZIP_OBJECT.read(FILE_NAME))
tmp_file = BytesIO(buf)
dummy_file = File(tmp_file) # this line actually fails
dummy_file.name = SOME_RANDOM_NAME
dummy_file.size = len(buf)
dummy_file.file = tmp_file
# dummy file is now valid
Please keep commenting if you have a better solution (except for custom storage)
There's an easier way to do this:
from django.core.files.base import ContentFile
uploaded_zip = zipfile.ZipFile(uploaded_file, 'r') # ZipFile
for filename in uploaded_zip.namelist():
with uploaded_zip.open(filename) as f: # ZipExtFile
my_django_file = ContentFile(f.read())
Using this, you can convert a file that was uploaded to memory directly to a django file. For a more complete example, let's say you wanted to upload a series of image files inside of a zip to the file system:
# some_app/models.py
class Photo(models.Model):
image = models.ImageField(upload_to='some/upload/path')
...
# Upload code
from some_app.models import Photo
for filename in uploaded_zip.namelist():
with uploaded_zip.open(filename) as f: # ZipExtFile
new_photo = Photo()
new_photo.image.save(filename, ContentFile(f.read(), save=True)
Without knowing to much about Django, i can tell you to take a look at the "io" package.
You could do something like:
from zipfile import ZipFile
from io import StringIO
zname,zipextfile = 'zipcontainer.zip', 'file_in_archive'
istream = ZipFile(zname).open(zipextfile)
ostream = StringIO(istream.read())
And then do whatever you would like to do with your "virtual" ostream Stream/File.
I've used the following django file class to avoid the need to read ZipExtFile into a another datastructure (StingIO or BytesIO) while properly impelementing what Django needs in order to save the file directly.
from django.core.files.base import File
class DjangoZipExtFile(File):
def __init__(self, zipextfile, zipinfo):
self.file = zipextfile
self.zipinfo = zipinfo
self.mode = 'r'
self.name = zipinfo.filename
self._size = zipinfo.file_size
def seek(self, position):
if position != 0:
#this will raise an unsupported operation
return self.file.seek(position)
#TODO if we have already done a read, reopen file
zipextfile = archive.open(path, 'r')
zipinfo = archive.getinfo(path)
djangofile = DjangoZipExtFile(zipextfile, zipinfo)
storage = DefaultStorage()
result = storage.save(djangofile.name, djangofile)
I have managed to get my first python script to work which downloads a list of .ZIP files from a URL and then proceeds to extract the ZIP files and writes them to disk.
I am now at a loss to achieve the next step.
My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.
Here is my current script which works but unfortunately has to write the files to disk.
import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle
# check for extraction directories existence
if not os.path.isdir('downloaded'):
os.makedirs('downloaded')
if not os.path.isdir('extracted'):
os.makedirs('extracted')
# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
downloadedLog = pickle.load(open('downloaded.pickle'))
else:
downloadedLog = {'key':'value'}
# remove entries older than 5 days (to maintain speed)
# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"
# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
# only parse urls
for url in parser.urls:
if "PUBLIC_P5MIN" in url:
# download the file
downloadURL = zipFileURL + url
outputFilename = "downloaded/" + url
# check if file already exists on disk
if url in downloadedLog or os.path.isfile(outputFilename):
print "Skipping " + downloadURL
continue
print "Downloading ",downloadURL
response = urllib2.urlopen(downloadURL)
zippedData = response.read()
# save data to disk
print "Saving to ",outputFilename
output = open(outputFilename,'wb')
output.write(zippedData)
output.close()
# extract the data
zfobj = zipfile.ZipFile(outputFilename)
for name in zfobj.namelist():
uncompressed = zfobj.read(name)
# save uncompressed data to disk
outputFilename = "extracted/" + name
print "Saving extracted file to ",outputFilename
output = open(outputFilename,'wb')
output.write(uncompressed)
output.close()
# send data via tcp stream
# file successfully downloaded and extracted store into local log and filesystem log
downloadedLog[url] = time.time();
pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))
Below is a code snippet I used to fetch zipped csv file, please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
print(line.decode('utf-8'))
Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
[ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
[ ... ]
I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
for contained_file in my_zip_file.namelist():
# with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
for line in my_zip_file.open(contained_file).readlines():
print(line)
# output.write(line)
Necessary changes:
There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO]2, because we will be handling a bytestream -- Docs, also this thread.
urlopen:
"The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.
Note:
In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
I use with ... as instead of zipfile = ... according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
I am using an example file, the UN/LOCODE text archive:
What I didn't do:
NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'w') as out_file:
shutil.copyfileobj(response, out_file)
I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
url = requests.get(file_url)
zipfile = ZipFile(BytesIO(url.content))
files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
return files.pop() if len(files) == 1 else files
write to a temporary file which resides in RAM
it turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:
tempfile.SpooledTemporaryFile([max_size=0[,
mode='w+b'[, bufsize=-1[, suffix=''[,
prefix='tmp'[, dir=None]]]]]])
This
function operates exactly as
TemporaryFile() does, except that data
is spooled in memory until the file
size exceeds max_size, or until the
file’s fileno() method is called, at
which point the contents are written
to disk and operation proceeds as with
TemporaryFile().
The resulting file has one additional
method, rollover(), which causes the
file to roll over to an on-disk file
regardless of its size.
The returned object is a file-like
object whose _file attribute is either
a StringIO object or a true file
object, depending on whether
rollover() has been called. This
file-like object can be used in a with
statement, just like a normal file.
New in version 2.6.
or if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming
Adding on to the other answers using requests:
# download from web
import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Use help(f) to get more functions details for e.g. extractall() which extracts the contents in zip file which later can be used with with open.
All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")
Vishal's example, however great, confuses when it comes to the file name, and I do not see the merit of redefing 'zipfile'.
Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas
url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
(Note, I use Python 2.7.13)
This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 version by removing StringIO and adding IO library
Python 3 Version
from io import BytesIO
from zipfile import ZipFile
import pandas
import requests
url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
def unzip_string(zipped_string):
unzipped_string = ''
zipfile = ZipFile(StringIO(zipped_string))
for name in zipfile.namelist():
unzipped_string += zipfile.open(name).read()
return unzipped_string
Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
zip_url = 'http://example.com/my_file.zip'
with urlopen(zip_url) as f:
with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read())
If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:
from zipfile import ZipFile
zip_filename = 'my_file.zip'
with open(zip_filename, 'rb') as f:
with ZipFile(f) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read().decode('utf-8'))
Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.
It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.