gz to zip conversion with S3 and Python [duplicate]

Is there a Python library that allows manipulation of zip archives in memory, without having to use actual disk files?
The ZipFile library does not allow you to update the archive. The only way seems to be to extract it to a directory, make your changes, and create a new zip from that directory. I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.
Something similar to Java's ZipInputStream/ZipOutputStream would do the trick, although any interface at all that avoids disk access would be fine.

According to the Python docs:
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or a file-like object.
So, to open the file in memory, just create a file-like object (perhaps using BytesIO).
file_like_object = io.BytesIO(my_zip_data)
zipfile_ob = zipfile.ZipFile(file_like_object)
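From there the archive behaves like any other ZipFile. For example (a sketch, assuming my_zip_data holds the bytes of a valid zip archive), you can list and read entries without touching disk:

import io
import zipfile

file_like_object = io.BytesIO(my_zip_data)
with zipfile.ZipFile(file_like_object) as zf:
    print(zf.namelist())              # names of the archive members
    print(zf.read(zf.namelist()[0]))  # bytes of the first member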

PYTHON 3
import io
import zipfile
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a",
                     zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')),
                            ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with open('C:/1.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())
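To verify the result without touching disk (a small sketch continuing the example above), rewind the buffer and read it back:

zip_buffer.seek(0)
with zipfile.ZipFile(zip_buffer) as zf:
    assert zf.namelist() == ['1.txt', '2.txt']
    assert zf.read('1.txt') == b'111'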

From the article In-Memory Zip in Python:
Below is a post of mine from May of 2008 on zipping in memory with Python, re-posted since Posterous is shutting down.
I recently noticed that there is a for-pay component available to zip files in-memory with Python. Considering this is something that should be free, I threw together the following code. It has only gone through very basic testing, so if anyone finds any errors, let me know and I’ll update this.
import zipfile
import StringIO


class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()


if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")

The example Ethier provided has several problems, some of them major:
It doesn't work for real data on Windows: a ZIP file is binary, so its data should always be written to a file opened in 'wb' mode.
The ZIP file is reopened in append mode for each added file, which is inefficient; it could be opened once and kept as an InMemoryZip attribute.
The documentation states that ZIP files should be closed explicitly, which the append function never does (it probably works for the example only because zf goes out of scope, which closes the ZIP file).
The create_system flag is set for all files in the zipfile every time a file is appended, instead of just once per file.
On Python < 3, cStringIO is much more efficient than StringIO.
It doesn't work on Python 3 (the original article predates the 3.0 release, but by the time the code was posted, 3.1 had been out for a long time).
An updated version is available if you install ruamel.std.zipfile (of which I am the author). After
pip install ruamel.std.zipfile
or including the code for the class from here, you can do:
import ruamel.std.zipfile as zipfile

# Run a test
imz = zipfile.InMemoryZipFile()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
imz.writetofile("test.zip")
You can alternatively write the contents using imz.data to any place you need.
You can also use the with statement, and if you provide a filename, the contents of the ZIP will be written on leaving that context:
with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
Because of the delayed write to disk, you can still read from an old test.zip within that context.
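For instance (a sketch, assuming test.zip already exists from an earlier run), the old archive stays readable until the context exits:

import zipfile as std_zipfile
from ruamel.std.zipfile import InMemoryZipFile

with InMemoryZipFile('test.zip') as imz:
    imz.append('new.txt', 'replacement data')
    # the new contents are only written on leaving the context,
    # so the previous test.zip is still intact on disk here
    with std_zipfile.ZipFile('test.zip') as old:
        print(old.namelist())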

I am using Flask to create an in-memory zipfile and return it as a download. This builds on the example above from Vladimir. The seek(0) took a while to figure out.
import io
import zipfile

from flask import send_file  # the snippet runs inside a Flask view function


def download_zip():  # e.g. the view for your download route
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
            zip_file.writestr(file_name, data.getvalue())
    zip_buffer.seek(0)
    # Note: Flask >= 2.0 renamed attachment_filename to download_name
    return send_file(zip_buffer, attachment_filename='filename.zip', as_attachment=True)

I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.
This is possible using the two libraries https://github.com/uktrade/stream-unzip and https://github.com/uktrade/stream-zip (full disclosure: written by me). And depending on the changes, you might not even have to store the entire zip in memory at once.
Say you just want to download, unzip, zip, and re-upload. Slightly pointless, but you could slot in some changes to the unzipped content:
from datetime import datetime

import httpx
from stream_unzip import stream_unzip
from stream_zip import stream_zip, ZIP_64


def get_source_bytes_iter(url):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()


def get_target_files(files):
    # stream-unzip doesn't expose perms or modified_at, but stream-zip requires them
    modified_at = datetime.now()
    perms = 0o600

    for name, _, chunks in files:
        # Could change name, manipulate chunks, skip a file, or yield a new file
        yield name.decode(), modified_at, perms, ZIP_64, chunks


source_url = 'https://source.test/file.zip'
target_url = 'https://target.test/file.zip'

source_bytes_iter = get_source_bytes_iter(source_url)
source_files = stream_unzip(source_bytes_iter)
target_files = get_target_files(source_files)
target_bytes_iter = stream_zip(target_files)

httpx.put(target_url, data=target_bytes_iter)

Helper to create an in-memory zip file with multiple files, based on data like {'1.txt': 'string', '2.txt': b'bytes'}:
import io, zipfile


def prepare_zip_file_content(file_name_content: dict) -> bytes:
    """Returns zip bytes ready to be saved, e.g. with
        with open('C:/1.zip', 'wb') as f: f.write(bytes)
    file_name_content: dict like {'1.txt': 'string', '2.txt': b'bytes'}
    """
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, file_data in file_name_content.items():
            zip_file.writestr(file_name, file_data)

    zip_buffer.seek(0)
    return zip_buffer.getvalue()
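Usage might look like this (a sketch mirroring the docstring):

content = prepare_zip_file_content({'1.txt': 'string', '2.txt': b'bytes'})
with open('C:/1.zip', 'wb') as f:
    f.write(content)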

You can use the library libarchive in Python through ctypes - it offers ways of manipulating ZIP data in memory, with a focus on streaming (at least historically).
Say we want to uncompress ZIP files on the fly while downloading from an HTTP server. The code below
from contextlib import contextmanager
from ctypes import CFUNCTYPE, POINTER, create_string_buffer, cdll, byref, c_ssize_t, c_char_p, c_int, c_void_p, c_char
from ctypes.util import find_library

import httpx


def get_zipped_chunks(url, chunk_size=65536):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()


def stream_unzip(zipped_chunks, chunk_size=65536):
    # Library
    libarchive = cdll.LoadLibrary(find_library('archive'))

    # Callback types
    open_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)
    read_callback_type = CFUNCTYPE(c_ssize_t, c_void_p, c_void_p, POINTER(POINTER(c_char)))
    close_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)

    # Function types
    libarchive.archive_read_new.restype = c_void_p
    libarchive.archive_read_open.argtypes = [c_void_p, c_void_p, open_callback_type, read_callback_type, close_callback_type]
    libarchive.archive_read_finish.argtypes = [c_void_p]
    libarchive.archive_entry_new.restype = c_void_p
    libarchive.archive_read_next_header.argtypes = [c_void_p, c_void_p]
    libarchive.archive_read_support_compression_all.argtypes = [c_void_p]
    libarchive.archive_read_support_format_all.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.restype = c_char_p
    libarchive.archive_read_data.argtypes = [c_void_p, POINTER(c_char), c_ssize_t]
    libarchive.archive_read_data.restype = c_ssize_t
    libarchive.archive_error_string.argtypes = [c_void_p]
    libarchive.archive_error_string.restype = c_char_p

    ARCHIVE_EOF = 1
    ARCHIVE_OK = 0

    it = iter(zipped_chunks)
    compressed_bytes = None  # Make sure not garbage collected

    @contextmanager
    def get_archive():
        archive = libarchive.archive_read_new()
        if not archive:
            raise Exception('Unable to allocate archive')
        try:
            yield archive
        finally:
            libarchive.archive_read_finish(archive)

    def read_callback(archive, client_data, buffer):
        nonlocal compressed_bytes
        try:
            compressed_bytes = create_string_buffer(next(it))
        except StopIteration:
            return 0
        else:
            buffer[0] = compressed_bytes
            return len(compressed_bytes) - 1

    def uncompressed_chunks(archive):
        uncompressed_bytes = create_string_buffer(chunk_size)
        while (num := libarchive.archive_read_data(archive, uncompressed_bytes, len(uncompressed_bytes))) > 0:
            yield uncompressed_bytes.value[:num]
        if num < 0:
            raise Exception(libarchive.archive_error_string(archive))

    with get_archive() as archive:
        libarchive.archive_read_support_compression_all(archive)
        libarchive.archive_read_support_format_all(archive)
        libarchive.archive_read_open(
            archive, 0,
            open_callback_type(0), read_callback_type(read_callback), close_callback_type(0),
        )
        entry = c_void_p(libarchive.archive_entry_new())
        if not entry:
            raise Exception('Unable to allocate entry')

        while (status := libarchive.archive_read_next_header(archive, byref(entry))) == ARCHIVE_OK:
            yield (libarchive.archive_entry_pathname(entry), uncompressed_chunks(archive))

        if status != ARCHIVE_EOF:
            raise Exception(libarchive.archive_error_string(archive))
can be used as follows to do that
zipped_chunks = get_zipped_chunks('https://domain.test/file.zip')
files = stream_unzip(zipped_chunks)

for name, uncompressed_chunks in files:
    print(name)
    for uncompressed_chunk in uncompressed_chunks:
        print(uncompressed_chunk)
In fact since libarchive supports multiple archive formats, and nothing above is particularly ZIP-specific, it may well work with other formats.

Related

Shared file access between Python and Matlab

I have a Matlab application that writes in to a .csv file and a Python script that reads from it. These operations happen concurrently and at their own respective periods (not necessarily the same). All of this runs on Windows 7.
I wish to know :
Would the OS inherently provide some sort of locking mechanism so that only one of the two applications - Matlab or Python - have access to the shared file?
In the Python application, how do I check whether the file is already opened by the Matlab application? What loop structure would block the Python application until it can read the file?
I am not sure about Windows' API for locking files.
Here's a possible solution:
While Matlab has the file open, create an empty file called "data.lock" or something to that effect.
When Python tries to read the file, it checks for the lock file; if it is there, it sleeps for a given interval and tries again.
When Matlab is done with the file, it can delete the "data.lock" file.
It's a programmatic solution, but it is simpler than digging through the Windows API and finding the right calls in Matlab and Python.
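A minimal sketch of the Python side of this scheme (data.lock and data.csv are hypothetical names; note that the check-then-open is not fully race-free):

import os
import time

LOCK_PATH = 'data.lock'  # hypothetical lock file created by Matlab
CSV_PATH = 'data.csv'    # hypothetical shared file

def read_when_unlocked(poll_interval=0.5):
    # Block until Matlab has removed the lock file, then read the data.
    while os.path.exists(LOCK_PATH):
        time.sleep(poll_interval)
    with open(CSV_PATH) as f:
        return f.read()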
If Python is only reading the file, I believe you have to lock it in MATLAB, because a read-only open call from Python may not fail. I am not sure how to accomplish that; you may want to read this question: atomically creating a file lock in MATLAB (file mutex).
However, if you are simply consuming the data with python, did you consider using a socket instead of a file?
In Windows on the Python side, CreateFile can be called (directly or indirectly via the CRT) with a specific sharing mode. For example, if the desired sharing mode is FILE_SHARE_READ, then the open will fail if the file is already open for writing. If the latter call instead succeeds, then a future attempt to open the file for writing will fail (e.g. in Matlab).
The Windows CRT function _wsopen_s allows setting the sharing mode. You can call it with ctypes in a Python 3 opener:
import sys
import os
import ctypes
import ctypes.util

__all__ = ['shdeny', 'shdeny_write', 'shdeny_read']

_SH_DENYRW = 0x10   # deny read/write mode
_SH_DENYWR = 0x20   # deny write mode
_SH_DENYRD = 0x30   # deny read
_S_IWRITE = 0x0080  # for O_CREAT, a new file is not readonly

if sys.version_info[:2] < (3, 5):
    _wsopen_s = ctypes.CDLL(ctypes.util.find_library('c'))._wsopen_s
else:
    # find_library('c') may be deprecated on Windows in 3.5, if the
    # universal CRT removes named exports. The following probably
    # isn't future proof; I don't know how the '-l1-1-0' suffix
    # should be handled.
    _wsopen_s = ctypes.CDLL('api-ms-win-crt-stdio-l1-1-0')._wsopen_s

_wsopen_s.argtypes = (ctypes.POINTER(ctypes.c_int),  # pfh
                      ctypes.c_wchar_p,              # filename
                      ctypes.c_int,                  # oflag
                      ctypes.c_int,                  # shflag
                      ctypes.c_int)                  # pmode


def shdeny(file, flags):
    fh = ctypes.c_int()
    err = _wsopen_s(ctypes.byref(fh),
                    file, flags, _SH_DENYRW, _S_IWRITE)
    if err:
        raise IOError(err, os.strerror(err), file)
    return fh.value


def shdeny_write(file, flags):
    fh = ctypes.c_int()
    err = _wsopen_s(ctypes.byref(fh),
                    file, flags, _SH_DENYWR, _S_IWRITE)
    if err:
        raise IOError(err, os.strerror(err), file)
    return fh.value


def shdeny_read(file, flags):
    fh = ctypes.c_int()
    err = _wsopen_s(ctypes.byref(fh),
                    file, flags, _SH_DENYRD, _S_IWRITE)
    if err:
        raise IOError(err, os.strerror(err), file)
    return fh.value
For example:
if __name__ == '__main__':
    import tempfile

    filename = tempfile.mktemp()

    fw = open(filename, 'w')
    fw.write('spam')
    fw.flush()

    fr = open(filename)
    assert fr.read() == 'spam'

    try:
        f = open(filename, opener=shdeny_write)
    except PermissionError:
        fw.close()
        with open(filename, opener=shdeny_write) as f:
            assert f.read() == 'spam'

    try:
        f = open(filename, opener=shdeny_read)
    except PermissionError:
        fr.close()
        with open(filename, opener=shdeny_read) as f:
            assert f.read() == 'spam'

    with open(filename, opener=shdeny) as f:
        assert f.read() == 'spam'

    os.remove(filename)
In Python 2 you'll have to combine the above openers with os.fdopen, e.g.:
f = os.fdopen(shdeny_write(filename, os.O_RDONLY|os.O_TEXT), 'r')
Or define an sopen wrapper that lets you pass the share mode explicitly and calls os.fdopen to return a Python 2 file. This will require a bit more work to get the file mode from the passed in flags, or vice versa.
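A minimal sketch of such a wrapper (only read modes handled; a fuller version would map 'w', 'a', '+' and so on to the corresponding O_* flags):

def sopen(filename, mode='r', shflag=_SH_DENYWR):
    # Translate the simple mode string into os.open-style flags
    # (os.O_BINARY/os.O_TEXT exist only on Windows, which is fine here).
    flags = os.O_RDONLY | (os.O_BINARY if 'b' in mode else os.O_TEXT)
    fh = ctypes.c_int()
    err = _wsopen_s(ctypes.byref(fh), filename, flags, shflag, _S_IWRITE)
    if err:
        raise IOError(err, os.strerror(err), filename)
    # os.fdopen wraps the CRT file descriptor in a Python 2 file object
    return os.fdopen(fh.value, mode)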

Python zipfile, bizarre limit to number of files: "folder is invalid"

The computer is toying with me, I know it!
I am creating a zip folder in Python. The individual files are generated in memory and then the whole thing is zipped and saved to a file. I am allowed to add 9 files to the zip. I am allowed to add 11 files to the zip. But 10, no, not 10 files. The zip file IS saved to my computer, but I'm not allowed to open it; Windows says that the compressed zipped folder is invalid.
I use the code below, which I got from another stackoverflow question. It appends 10 files and saves the zipped folder. When I click on the folder, I cannot extract it. BUT, remove one of the appends() and it's fine. Or, add another append and it works!
What am I missing here? How can I make this work every time?
imz = InMemoryZip()
imz.append("1a.txt", "a").append("2a.txt", "a").append("3a.txt", "a").append("4a.txt", "a").append("5a.txt", "a").append("6a.txt", "a").append("7a.txt", "a").append("8a.txt", "a").append("9a.txt", "a").append("10a.txt", "a")
imz.writetofile("C:/path/test.zip")
import zipfile
import StringIO


class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()
You should use the 'wb' mode when creating the file you are saving to the file system. This will ensure that the file is written in binary.
Otherwise, any time a newline (\n) character happens to be encountered in the zip data, Python will replace it with the Windows line ending (\r\n). The reason 10 files in particular is a problem is that 10 happens to be the character code for \n.
So your write function should look like this:
def writetofile(self, filename):
    '''Writes the in-memory zip to a file.'''
    f = file(filename, 'wb')
    f.write(self.read())
    f.close()
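To see why byte 10 in particular corrupts the archive, here's a quick check (a sketch; the translation only happens on Windows):

with open('newline_test.bin', 'w') as f:
    f.write(chr(10))   # a single LF written in text mode
with open('newline_test.bin', 'rb') as f:
    print(f.read())    # b'\r\n' on Windows, b'\n' elsewhere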
The 'wb' mode should fix your problem and work for the files in your example. That said, in your case you might find it easier to write the zip file directly to the file system, like this code, which includes some of the comments from above:
import StringIO
import zipfile


class ZipCreator:
    buffer = None

    def __init__(self, fileName=None):
        if fileName:
            self.zipFile = zipfile.ZipFile(fileName, 'w', zipfile.ZIP_DEFLATED, False)
            return
        self.buffer = StringIO.StringIO()
        self.zipFile = zipfile.ZipFile(self.buffer, 'w', zipfile.ZIP_DEFLATED, False)

    def addToZipFromFileSystem(self, filePath, filenameInZip):
        self.zipFile.write(filePath, filenameInZip)

    def addToZipFromMemory(self, filenameInZip, fileContents):
        self.zipFile.writestr(filenameInZip, fileContents)
        for zipFile in self.zipFile.filelist:
            zipFile.create_system = 0

    def write(self, fileName):
        if not self.buffer:  # If the buffer was not initialized the file is written by the ZipFile
            self.zipFile.close()
            return
        f = file(fileName, 'wb')
        f.write(self.buffer.getvalue())
        f.close()


# Use File Handle
zipCreator = ZipCreator('C:/path/test.zip')

# Use Memory Buffer
# zipCreator = ZipCreator()

for i in range(1, 10):
    zipCreator.addToZipFromMemory('test/%sa.txt' % i, 'a')

zipCreator.write('C:/path/test.zip')
Ideally, you would probably use separate classes for an in-memory zip and a zip that is tied to the file system from the beginning. I have also seen some issues with the in-memory zip when folders are added, which are difficult to recreate and which I am still trying to track down.

Downloading and zipping files from amazon

I'm currently storing all my photos on Amazon S3 and using Django for my website. I want to have a button that allows users to click it and have all their photos zipped and returned to them.
I'm currently using boto to interface with Amazon and found that I can go through the entire bucket list / use get_key to look for specific files and download them.
After this I would need to temporarily store them, then zip and return.
What is the best way to go about doing this?
Thanks
You can take a look at this question or at this snippet to download the file:
# This is not a full working example, just a starting point
# for downloading images in different formats.
# path_to_image and filename are assumed to be defined elsewhere.
import subprocess

import Image  # with modern Pillow this would be: from PIL import Image
from django.http import HttpResponse


def image_as_png_pdf(request):
    output_format = request.GET.get('format')
    im = Image.open(path_to_image)  # any Image object should work
    if output_format == 'png':
        response = HttpResponse(mimetype='image/png')
        response['Content-Disposition'] = 'attachment; filename=%s.png' % filename
        im.save(response, 'png')  # will call response.write()
    else:
        # Temporary disk space, server process needs write access
        tmp_path = '/tmp/'
        # Full path to ImageMagick convert binary
        convert_bin = '/usr/bin/convert'
        im.save(tmp_path + filename + '.png', 'png')
        response = HttpResponse(mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename
        ret = subprocess.Popen([convert_bin,
                                "%s%s.png" % (tmp_path, filename), "pdf:-"],
                               stdout=subprocess.PIPE)
        response.write(ret.stdout.read())
    return response
To create a zip, follow the link that I gave you. You can also use zipimport, as shown here; examples are at the bottom of the page (follow the documentation for newer versions).
You might also be interested in this, although it was made for Django 1.2 and might not work on 1.3.
Using python-zipstream as patched with this pull request you can do something like this:
import boto
import io
import zipstream
import sys


def iterable_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module). For efficiency, the stream is buffered.

    From: https://stackoverflow.com/a/20260030/729491
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


def iterate_key():
    b = boto.connect_s3().get_bucket('lastage')
    key = b.get_key('README.markdown')
    for b in key:
        yield b


with open('/tmp/foo.zip', 'wb') as f:
    z = zipstream.ZipFile(mode='w')
    z.write(iterable_to_stream(iterate_key()), arcname='foo1')
    z.write(iterable_to_stream(iterate_key()), arcname='foo2')
    z.write(iterable_to_stream(iterate_key()), arcname='foo3')
    for chunk in z:
        print "CHUNK", len(chunk)
        f.write(chunk)
Basically we iterate over the key contents using boto, convert this iterator to a stream using the iterable_to_stream method from this answer and then have python-zipstream create a zip file on-the-fly.
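For reference, here's a minimal sketch of the buffered (non-streaming) approach with boto3, boto's successor, and the stdlib. The bucket and key names are hypothetical, and the whole zip is held in memory:

import io
import zipfile

import boto3

s3 = boto3.client('s3')
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
    for key in ('photos/1.jpg', 'photos/2.jpg'):  # hypothetical keys
        body = s3.get_object(Bucket='my-bucket', Key=key)['Body'].read()
        zf.writestr(key, body)
zip_buffer.seek(0)
# zip_buffer can now be returned as an HTTP response or uploaded back to S3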

ZipExtFile to Django File

I am wondering whether there is a way to upload a zip file to a Django web server and put the zip's files into the Django database WITHOUT accessing the actual file system in the process (e.g. extracting the files in the zip into a tmp dir and then loading them).
Django provides a function to convert a Python File to a Django File, so if there is a way to convert a ZipExtFile to a Python File, it should be fine.
Thanks for the help!
Django model:
from django.db import models


class Foo(models.Model):
    file = models.FileField(upload_to='somewhere')

Usage:

from zipfile import ZipFile
from django.core.exceptions import ValidationError
from django.core.files import File
from io import BytesIO

z = ZipFile('zipFile')
istream = z.open('subfile')
ostream = BytesIO(istream.read())
tmp = Foo(file=File(ostream))
try:
    tmp.full_clean()
except ValidationError, e:
    print e
Output:
{'file': [u'This field cannot be blank.']}
[SOLUTION] Solution using an ugly hack:
As correctly pointed out by Don Quest, file-like classes such as StringIO or BytesIO should represent the data as a virtual file. However, Django's File constructor only accepts the built-in file type and nothing else, although the file-like classes would have done the job as well. The hack is to set the variables in Django's File manually:
buf = bytearray(OPENED_ZIP_OBJECT.read(FILE_NAME))
tmp_file = BytesIO(buf)
dummy_file = File(tmp_file)  # this line actually fails
dummy_file.name = SOME_RANDOM_NAME
dummy_file.size = len(buf)
dummy_file.file = tmp_file
# dummy file is now valid
Please keep commenting if you have a better solution (except for custom storage)
There's an easier way to do this:
from django.core.files.base import ContentFile

uploaded_zip = zipfile.ZipFile(uploaded_file, 'r')  # ZipFile

for filename in uploaded_zip.namelist():
    with uploaded_zip.open(filename) as f:  # ZipExtFile
        my_django_file = ContentFile(f.read())
Using this, you can convert a file that was uploaded to memory directly to a django file. For a more complete example, let's say you wanted to upload a series of image files inside of a zip to the file system:
# some_app/models.py
class Photo(models.Model):
    image = models.ImageField(upload_to='some/upload/path')
    ...

# Upload code
from some_app.models import Photo

for filename in uploaded_zip.namelist():
    with uploaded_zip.open(filename) as f:  # ZipExtFile
        new_photo = Photo()
        new_photo.image.save(filename, ContentFile(f.read()), save=True)
Without knowing too much about Django, I can tell you to take a look at the io package.
You could do something like:
from zipfile import ZipFile
from io import StringIO
zname,zipextfile = 'zipcontainer.zip', 'file_in_archive'
istream = ZipFile(zname).open(zipextfile)
ostream = StringIO(istream.read())
And then do whatever you would like to do with your "virtual" ostream Stream/File.
I've used the following Django file class to avoid the need to read the ZipExtFile into another data structure (StringIO or BytesIO) while properly implementing what Django needs in order to save the file directly.
from django.core.files.base import File


class DjangoZipExtFile(File):
    def __init__(self, zipextfile, zipinfo):
        self.file = zipextfile
        self.zipinfo = zipinfo
        self.mode = 'r'
        self.name = zipinfo.filename
        self._size = zipinfo.file_size

    def seek(self, position):
        if position != 0:
            # this will raise an unsupported operation
            return self.file.seek(position)
        # TODO if we have already done a read, reopen file


zipextfile = archive.open(path, 'r')
zipinfo = archive.getinfo(path)
djangofile = DjangoZipExtFile(zipextfile, zipinfo)
storage = DefaultStorage()
result = storage.save(djangofile.name, djangofile)

Downloading and unzipping a .zip file without writing to disk

I have managed to get my first Python script to work: it downloads a list of .ZIP files from a URL, then extracts the ZIP files and writes them to disk.
I am now at a loss to achieve the next step.
My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.
Here is my current script, which works but unfortunately has to write the files to disk.
import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle

# check for extraction directories existence
if not os.path.isdir('downloaded'):
    os.makedirs('downloaded')

if not os.path.isdir('extracted'):
    os.makedirs('extracted')

# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
    downloadedLog = pickle.load(open('downloaded.pickle'))
else:
    downloadedLog = {'key': 'value'}

# remove entries older than 5 days (to maintain speed)

# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()

# only parse urls
for url in parser.urls:
    if "PUBLIC_P5MIN" in url:
        # download the file
        downloadURL = zipFileURL + url
        outputFilename = "downloaded/" + url

        # check if file already exists on disk
        if url in downloadedLog or os.path.isfile(outputFilename):
            print "Skipping " + downloadURL
            continue

        print "Downloading ", downloadURL
        response = urllib2.urlopen(downloadURL)
        zippedData = response.read()

        # save data to disk
        print "Saving to ", outputFilename
        output = open(outputFilename, 'wb')
        output.write(zippedData)
        output.close()

        # extract the data
        zfobj = zipfile.ZipFile(outputFilename)
        for name in zfobj.namelist():
            uncompressed = zfobj.read(name)

            # save uncompressed data to disk
            outputFilename = "extracted/" + name
            print "Saving extracted file to ", outputFilename
            output = open(outputFilename, 'wb')
            output.write(uncompressed)
            output.close()

        # send data via tcp stream

        # file successfully downloaded and extracted store into local log and filesystem log
        downloadedLog[url] = time.time()
        pickle.dump(downloadedLog, open('downloaded.pickle', "wb"))
Below is a code snippet I used to fetch a zipped CSV file; please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
    print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
    print(line.decode('utf-8'))
Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
        for line in my_zip_file.open(contained_file).readlines():
            print(line)
            # output.write(line)
Necessary changes:
There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO, because we will be handling a bytestream -- Docs, also this thread.
urlopen:
"The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.
Note:
In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
I use with ... as instead of zipfile = ... according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
I am using an example file, the UN/LOCODE text archive (see the URL in the code above).
What I didn't do:
NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
    url = requests.get(file_url)
    zipfile = ZipFile(BytesIO(url.content))
    files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
    return files.pop() if len(files) == 1 else files
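Usage might look like this (a sketch with a placeholder URL); for a single-file archive you get one ZipExtFile back, otherwise a list of them:

f = get_zip('http://example.com/single_file.zip')
print(f.read())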
Write to a temporary file which resides in RAM.
It turns out the tempfile module (http://docs.python.org/library/tempfile.html) has just the thing:
tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

New in version 2.6.
Or, if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming.
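A sketch of how SpooledTemporaryFile fits the zip use case (assuming zipped_data holds the downloaded bytes, and a reasonably recent Python 3, where SpooledTemporaryFile is seekable enough for zipfile):

import tempfile
import zipfile

with tempfile.SpooledTemporaryFile(max_size=10 * 1024 * 1024) as tmp:
    tmp.write(zipped_data)  # spooled in memory until it exceeds max_size
    tmp.seek(0)
    with zipfile.ZipFile(tmp) as zf:
        for name in zf.namelist():
            csv_bytes = zf.read(name)
            # send csv_bytes over the TCP stream here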
Adding on to the other answers using requests:
# download from web
import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Use help(f) to get more details on its functions, e.g. extractall(), which extracts the contents of the zip file so they can later be used with open.
All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")
Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.
Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas

url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
    print("File in zip: " + item)

# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
(Note, I use Python 2.7.13)
This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 by removing StringIO and using the io library instead.
Python 3 Version
from io import BytesIO
from zipfile import ZipFile
import pandas
import requests

url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))
for item in zf.namelist():
    print("File in zip: " + item)

# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
def unzip_string(zipped_string):
    unzipped_string = ''
    zipfile = ZipFile(StringIO(zipped_string))
    for name in zipfile.namelist():
        unzipped_string += zipfile.open(name).read()
    return unzipped_string
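Usage might look like this (a sketch with a placeholder URL): fetch a zip over HTTP and get the concatenated member contents back as one string, all in memory.

print unzip_string(urlopen('http://www.test.com/file.zip').read())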
Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
zip_url = 'http://example.com/my_file.zip'
with urlopen(zip_url) as f:
    with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read())
If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:
from zipfile import ZipFile
zip_filename = 'my_file.zip'
with open(zip_filename, 'rb') as f:
    with ZipFile(f) as myzipfile:
        foofile = myzipfile.open('foo.txt')
        print(foofile.read().decode('utf-8'))
Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.
It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.
