Is there a Python library that allows manipulation of zip archives in memory, without having to use actual disk files?
The ZipFile library does not allow you to update the archive. The only way seems to be to extract it to a directory, make your changes, and create a new zip from that directory. I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.
Something similar to Java's ZipInputStream/ZipOutputStream would do the trick, although any interface at all that avoids disk access would be fine.
According to the Python docs:
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or a file-like object.
So, to open the file in memory, just create a file-like object (perhaps using BytesIO).
import io
import zipfile

file_like_object = io.BytesIO(my_zip_data)
zipfile_ob = zipfile.ZipFile(file_like_object)
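For example, a minimal round trip entirely in memory (the literal data here stands in for a downloaded archive):

import io
import zipfile

# Build some ZIP bytes in memory (these stand in for downloaded data)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello world")
my_zip_data = buf.getvalue()

# Open the archive purely from memory and read it back
with zipfile.ZipFile(io.BytesIO(my_zip_data)) as zf:
    print(zf.namelist())         # ['hello.txt']
    print(zf.read("hello.txt"))  # b'hello world'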
Python 3

import io
import zipfile

zip_buffer = io.BytesIO()

with zipfile.ZipFile(zip_buffer, "a",
                     zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')),
                            ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with open('C:/1.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())
From the article In-Memory Zip in Python:
Below is a post of mine from May of 2008 on zipping in memory with Python, re-posted since Posterous is shutting down.
I recently noticed that there is a for-pay component available to zip files in-memory with Python. Considering this is something that should be free, I threw together the following code. It has only gone through very basic testing, so if anyone finds any errors, let me know and I’ll update this.
import zipfile
import StringIO

class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()

if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")
The example Ethier provided has several problems, some of them major:

- It doesn't work for real data on Windows. A ZIP file is binary, and its data should always be written to a file opened with 'wb'.
- The ZIP file is reopened in append mode for each added file, which is inefficient. It can be opened once and kept as an InMemoryZip attribute.
- The documentation states that ZIP files should be closed explicitly, but this is not done in the append function (it probably works for the example because zf goes out of scope, which closes the ZIP file).
- The create_system flag is set for all the files in the zipfile every time a file is appended, instead of just once per file.
- On Python < 3, cStringIO is much more efficient than StringIO.
- It doesn't work on Python 3 (the original article is from before the 3.0 release, but by the time the code was posted, 3.1 had been out for a long time).
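For illustration, a minimal Python 3 sketch addressing those points (binary output, a single ZipFile kept open, an explicit close, create_system set once per file) could look like the following; this is my own rewrite under those assumptions, not the code from the article:

import io
import zipfile

class InMemoryZip:
    def __init__(self):
        self._buffer = io.BytesIO()
        # Keep a single ZipFile open in append mode instead of reopening it per file
        self._zf = zipfile.ZipFile(self._buffer, "a", zipfile.ZIP_DEFLATED, False)

    def append(self, filename_in_zip, file_contents):
        self._zf.writestr(filename_in_zip, file_contents)
        # Set create_system only on the entry that was just added
        self._zf.filelist[-1].create_system = 0
        return self

    def read(self):
        # Close explicitly so the central directory is written;
        # no further appends are possible after this
        self._zf.close()
        self._buffer.seek(0)
        return self._buffer.read()

    def writetofile(self, filename):
        # ZIP data is binary, so always write with 'wb'
        with open(filename, "wb") as f:
            f.write(self.read())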
An updated version is available if you install ruamel.std.zipfile (of which I am the author). After
pip install ruamel.std.zipfile
or including the code for the class from here, you can do:
import ruamel.std.zipfile as zipfile

# Run a test
imz = zipfile.InMemoryZipFile()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
imz.writetofile("test.zip")
You can alternatively write the contents using imz.data to any place you need.
You can also use the with statement, and if you provide a filename, the contents of the ZIP will be written on leaving that context:

with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")

Because of the delayed writing to disk, you can actually read from an old test.zip within that context.
I am using Flask to create an in-memory zipfile and return it as a download. This builds on Vladimir's example above; the seek(0) took a while to figure out.

import io
import zipfile

from flask import send_file

zip_buffer = io.BytesIO()

with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')), ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

zip_buffer.seek(0)

# inside a Flask view function:
return send_file(zip_buffer, attachment_filename='filename.zip', as_attachment=True)
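(Note: in Flask 2.0 and later, send_file's attachment_filename parameter was renamed to download_name.)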
I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them
This is possible using the two libraries https://github.com/uktrade/stream-unzip and https://github.com/uktrade/stream-zip (full disclosure: written by me). And depending on the changes, you might not even have to store the entire zip in memory at once.
Say you just want to download, unzip, zip, and re-upload. Slightly pointless, but you could slot in some changes to the unzipped content:
from datetime import datetime

import httpx
from stream_unzip import stream_unzip
from stream_zip import stream_zip, ZIP_64

def get_source_bytes_iter(url):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def get_target_files(files):
    # stream-unzip doesn't expose perms or modified_at, but stream-zip requires them
    modified_at = datetime.now()
    perms = 0o600
    for name, _, chunks in files:
        # Could change name, manipulate chunks, skip a file, or yield a new file
        yield name.decode(), modified_at, perms, ZIP_64, chunks

source_url = 'https://source.test/file.zip'
target_url = 'https://target.test/file.zip'

source_bytes_iter = get_source_bytes_iter(source_url)
source_files = stream_unzip(source_bytes_iter)
target_files = get_target_files(source_files)
target_bytes_iter = stream_zip(target_files)

httpx.put(target_url, data=target_bytes_iter)
Helper to create an in-memory zip file with multiple files, based on data like {'1.txt': 'string', '2.txt': b'bytes'}:

import io, zipfile

def prepare_zip_file_content(file_name_content: dict) -> bytes:
    """Returns ZIP bytes ready to be saved with
    open('C:/1.zip', 'wb') as f: f.write(bytes)

    file_name_content: dict like {'1.txt': 'string', '2.txt': b'bytes'}
    """
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, file_data in file_name_content.items():
            zip_file.writestr(file_name, file_data)
    zip_buffer.seek(0)
    return zip_buffer.getvalue()
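For example (hypothetical file names and contents):

zip_bytes = prepare_zip_file_content({'1.txt': 'string', '2.txt': b'bytes'})
with open('1.zip', 'wb') as f:
    f.write(zip_bytes)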
You can use the library libarchive in Python through ctypes - it offers ways of manipulating ZIP data in memory, with a focus on streaming (at least historically).
Say we want to uncompress ZIP files on the fly while downloading from an HTTP server. The code below
from contextlib import contextmanager
from ctypes import CFUNCTYPE, POINTER, create_string_buffer, cdll, byref, c_ssize_t, c_char_p, c_int, c_void_p, c_char
from ctypes.util import find_library

import httpx

def get_zipped_chunks(url, chunk_size=6553):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def stream_unzip(zipped_chunks, chunk_size=65536):
    # Library
    libarchive = cdll.LoadLibrary(find_library('archive'))

    # Callback types
    open_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)
    read_callback_type = CFUNCTYPE(c_ssize_t, c_void_p, c_void_p, POINTER(POINTER(c_char)))
    close_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)

    # Function types
    libarchive.archive_read_new.restype = c_void_p
    libarchive.archive_read_open.argtypes = [c_void_p, c_void_p, open_callback_type, read_callback_type, close_callback_type]
    libarchive.archive_read_finish.argtypes = [c_void_p]
    libarchive.archive_entry_new.restype = c_void_p
    libarchive.archive_read_next_header.argtypes = [c_void_p, c_void_p]
    libarchive.archive_read_support_compression_all.argtypes = [c_void_p]
    libarchive.archive_read_support_format_all.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.restype = c_char_p
    libarchive.archive_read_data.argtypes = [c_void_p, POINTER(c_char), c_ssize_t]
    libarchive.archive_read_data.restype = c_ssize_t
    libarchive.archive_error_string.argtypes = [c_void_p]
    libarchive.archive_error_string.restype = c_char_p

    ARCHIVE_EOF = 1
    ARCHIVE_OK = 0

    it = iter(zipped_chunks)
    compressed_bytes = None  # Make sure not garbage collected

    @contextmanager
    def get_archive():
        archive = libarchive.archive_read_new()
        if not archive:
            raise Exception('Unable to allocate archive')
        try:
            yield archive
        finally:
            libarchive.archive_read_finish(archive)

    def read_callback(archive, client_data, buffer):
        nonlocal compressed_bytes
        try:
            compressed_bytes = create_string_buffer(next(it))
        except StopIteration:
            return 0
        else:
            buffer[0] = compressed_bytes
            return len(compressed_bytes) - 1

    def uncompressed_chunks(archive):
        uncompressed_bytes = create_string_buffer(chunk_size)
        while (num := libarchive.archive_read_data(archive, uncompressed_bytes, len(uncompressed_bytes))) > 0:
            yield uncompressed_bytes.raw[:num]  # .raw preserves NUL bytes, unlike .value
        if num < 0:
            raise Exception(libarchive.archive_error_string(archive))

    with get_archive() as archive:
        libarchive.archive_read_support_compression_all(archive)
        libarchive.archive_read_support_format_all(archive)
        libarchive.archive_read_open(
            archive, 0,
            open_callback_type(0), read_callback_type(read_callback), close_callback_type(0),
        )
        entry = c_void_p(libarchive.archive_entry_new())
        if not entry:
            raise Exception('Unable to allocate entry')

        while (status := libarchive.archive_read_next_header(archive, byref(entry))) == ARCHIVE_OK:
            yield (libarchive.archive_entry_pathname(entry), uncompressed_chunks(archive))

        if status != ARCHIVE_EOF:
            raise Exception(libarchive.archive_error_string(archive))
can be used as follows to do that:

zipped_chunks = get_zipped_chunks('https://domain.test/file.zip')

for name, uncompressed_chunks in stream_unzip(zipped_chunks):
    print(name)
    for uncompressed_chunk in uncompressed_chunks:
        print(uncompressed_chunk)
In fact since libarchive supports multiple archive formats, and nothing above is particularly ZIP-specific, it may well work with other formats.
I have a question quite similar to this one, where I need the following conditions to be upheld:
If a file is opened for reading, that file may only be opened for reading by any other process/program
If a file is opened for writing, that file may only be opened for reading by any other process/program
The solution posted in the linked question uses a third-party library that adds an arbitrary .LOCK file in the same directory as the file in question. It only works with respect to the program in which that library is being used and doesn't prevent any other process/program from using the file, as they may not be implemented to check for a .LOCK association.
In essence, I wish to replicate this result using only Python's standard library.
BLUF: Need a standard library implementation specific to Windows for exclusive file locking
To give an example of the problem set, assume there is:
1 file on a shared network/drive
2 users on separate processes/programs
Suppose that User 1 is running Program A on the file and at some point the following is executed:
with open(fp, 'rb') as f:
    while True:
        chunk = f.read(10)
        if chunk:
            pass  # do something with chunk
        else:
            break
Thus they are iterating through the file 10 bytes at a time.
Now User 2 runs Program B on the same file a moment later:
with open(fp, 'wb') as f:
    for b in data:  # some byte array
        f.write(b)
On Windows, the file in question is immediately truncated and Program A stops iterating (even if it wasn't done) and Program B begins to write to the file. Therefore I need a way to ensure that the file may not be opened in a different mode that would alter its content if previously opened.
I was looking at the msvcrt library, namely the msvcrt.locking() interface. What I have been successful at doing is ensuring that a file opened for reading can be locked for reading, but nobody else can read the file (as I lock the entire file):
>>> f1 = open(fp, 'rb')
>>> f2 = open(fp, 'rb')
>>> msvcrt.locking(f1.fileno(), msvcrt.LK_LOCK, os.stat(fp).st_size)
>>> next(f1)
b"\x00\x05'\n"
>>> next(f2)
PermissionError: [Errno 13] Permission denied
This is an acceptable result, just not the most desired.
In the same scenario, User 1 runs Program A which includes:
with open(fp, 'rb') as f:
    msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, os.stat(fp).st_size)
    # repeat while block
    msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, os.stat(fp).st_size)
When User 2 then runs Program B a moment later, the same result occurs and the file is truncated.
At this point, I would've liked a way to throw an error to User 2 stating the file is opened for reading somewhere else and cannot be written at this time. But if User 3 came along and opened the file for reading, then there would be no problem.
Update:
A potential solution is to change the permissions of a file (with exception catching if the file is already in use):
>>> os.chmod(fp, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
>>> with open(fp, 'wb') as f:
...     pass  # do something
PermissionError: [Errno 13] Permission denied <fp>
This doesn't feel like the best solution (particularly if the users didn't have the permission to even change permissions). Still looking for a proper locking solution but msvcrt doesn't prevent truncating and writing if the file is locked for reading. There still doesn't appear to be a way to generate an exclusive lock with Python's standard library.
For those who are interested in a Windows specific solution:
import os
import ctypes
import msvcrt
import pathlib

# Windows constants for file operations
NULL = 0x00000000
CREATE_ALWAYS = 0x00000002
OPEN_EXISTING = 0x00000003
FILE_SHARE_READ = 0x00000001
FILE_ATTRIBUTE_READONLY = 0x00000001    # strictly for file reading
FILE_ATTRIBUTE_NORMAL = 0x00000080      # strictly for file writing
FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000

_ACCESS_MASK = os.O_RDONLY | os.O_WRONLY
_ACCESS_MAP = {os.O_RDONLY: GENERIC_READ,
               os.O_WRONLY: GENERIC_WRITE}

_CREATE_MASK = os.O_CREAT | os.O_TRUNC
_CREATE_MAP = {NULL: OPEN_EXISTING,
               os.O_CREAT | os.O_TRUNC: CREATE_ALWAYS}

win32 = ctypes.WinDLL('kernel32.dll', use_last_error=True)
win32.CreateFileW.restype = ctypes.c_void_p

INVALID_FILE_HANDLE = ctypes.c_void_p(-1).value

def _opener(path: pathlib.Path, flags: int) -> int:
    access_flags = _ACCESS_MAP[flags & _ACCESS_MASK]
    create_flags = _CREATE_MAP[flags & _CREATE_MASK]
    if flags & os.O_WRONLY:
        share_flags = NULL
        attr_flags = FILE_ATTRIBUTE_NORMAL
    else:
        share_flags = FILE_SHARE_READ
        attr_flags = FILE_ATTRIBUTE_READONLY
    attr_flags |= FILE_FLAG_SEQUENTIAL_SCAN
    h = win32.CreateFileW(path, access_flags, share_flags, NULL,
                          create_flags, attr_flags, NULL)
    if h == INVALID_FILE_HANDLE:
        raise ctypes.WinError(ctypes.get_last_error())
    return msvcrt.open_osfhandle(h, flags)

class _FileControlAccessor(pathlib._NormalAccessor):
    open = staticmethod(_opener)

_control_accessor = _FileControlAccessor()

class Path(pathlib.WindowsPath):
    def _init(self) -> None:
        self._closed = False
        self._accessor = _control_accessor

    def _opener(self, name, flags) -> int:
        return self._accessor.open(name, flags)
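Usage would then be along these lines (a sketch; note that this relies on pathlib internals such as _NormalAccessor, which changed in later Python versions, so the code is tied to the version it was written against):

p = Path('shared_file.bin')  # hypothetical file on the shared drive
with p.open('rb') as f:      # opened via CreateFileW with FILE_SHARE_READ only
    chunk = f.read(10)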
I have a Python program (say reader.py) which uses the file settings.py to read from:

while True:
    ...
    execfile('settings.py')
    ...
But there is another Python program (say writer.py) that uses this file to write to:

...
try:
    settings = open('settings.py', 'w')
    settings.truncate()
    settings.write('some text')
except IOError:
    print('Cannot write to file')
finally:
    settings.close()
...
Note 1: reader.py and writer.py do not "know" about each other.
Note 2: reader.py reads settings.py cyclically, whereas writer.py writes to the file whenever the user wants to (not necessarily right after clicking "write"; there is no rule about when it writes).
Question: What is the best way to make the two programs cooperate in order to avoid any conflict? I know this might depend on the platform. I am using Linux; the distributions are Ubuntu and Scientific Linux.
EDIT 1: If I choose to use a FIFO, I encounter the following problem: once the writer has written to the settings file, it will probably never write again, but the reader should still have access to the settings in that case. In other words, the reader should then be able to read from the file without waiting for the writer; otherwise the reader has to wait for the writer.

Ordinary use of a FIFO does not allow the reader to read from the file if the writer does not write (until it has written). How do I deal with this problem?
You may be interested in using a named pipe for your interprocess communication. Available on Linux, it is a special type of file designed for client (writer.py) and server (reader.py) tasks. After writing to the pipe, the client will wait until the server has received the data. This allows you to sync the two processes somewhat.
Linux Manual for FiFo
Python doc: os.mkfifo(path[, mode])
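A minimal sketch of the idea (the path is hypothetical; os.fork stands in for the two separate programs, since each open() blocks until the other end of the pipe is opened):

import os

fifo_path = '/tmp/settings_fifo'  # hypothetical path
if not os.path.exists(fifo_path):
    os.mkfifo(fifo_path)

pid = os.fork()
if pid == 0:
    # Child stands in for writer.py: open() blocks until a reader appears
    with open(fifo_path, 'w') as fifo:
        fifo.write('some text\n')
    os._exit(0)
else:
    # Parent stands in for reader.py: open() blocks until a writer appears
    with open(fifo_path) as fifo:
        print(fifo.read())
    os.waitpid(pid, 0)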
I found the following solution, which seems to work. I use flock to create locks.
Reader:
import errno
import fcntl
from time import *

path = "testLock.py"
f = open(path, "r")

while True:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        break
    except IOError as e:
        if e.errno != errno.EAGAIN:
            raise
        else:
            sleep(1)
            print 'Waiting...'

# reader's action
execfile(path)
# drop lock
fcntl.flock(f, fcntl.LOCK_UN)
Writer:
import errno
import fcntl
from time import *

path = "testLock.py"
f = open(path, "w")

while True:
    try:
        fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
        break
    except IOError as e:
        if e.errno != errno.EAGAIN:
            raise
        else:
            sleep(1)
            print 'Waiting...'

# writer's action
for i in (1, 10, 2):
    f.write('print "%d" % i')
    sleep(1)

# drop lock
fcntl.flock(f, fcntl.LOCK_UN)
I have some questions here:

Question 1: Is this the correct usage of LOCK_EX and LOCK_SH? I.e., are they in the right places?
Question 2: Is the reader's action, i.e. execfile, correct here? If the file is already open, does execfile try to open it anyway?
Is there an option I can pass to open() that will cause an IOError when trying to write to a nonexistent file? I am using Python to read and write block devices via symlinks, and if the link is missing I want to raise an error rather than create a regular file. I know I could add a check to see whether the file exists and manually raise the error, but would prefer to use something built-in if it exists.
Current code looks like this:
device = open(device_path, 'wb', 0)
device.write(data)
device.close()
Yes.
open(path, 'r+b')
Specifying the "r" option means the file must exist and you can read.
Specifying "+" means you can write and that you will be positioned at the end.
https://docs.python.org/3/library/functions.html?#open
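Applied to the question's snippet, a sketch might look like this (device_path and data are hypothetical stand-ins):

device_path = '/dev/disk/by-label/mydisk'  # hypothetical symlink to a block device
data = b'\x00' * 512

try:
    device = open(device_path, 'r+b', 0)  # raises if the path does not exist
except FileNotFoundError as e:            # subclass of OSError (aka IOError)
    print('missing device:', e)
else:
    device.write(data)
    device.close()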
Use os.path.islink() or os.path.isfile() to check if the file exists.
Doing the check each time is a nuisance, but you can always wrap open():
import os

def open_if_exists(*args, **kwargs):
    if not os.path.exists(args[0]):
        raise IOError('{:s} does not exist.'.format(args[0]))
    f = open(*args, **kwargs)
    return f

f = open_if_exists(r'file_does_not_exist.txt', 'w+')
This is just quick and dirty, so it doesn't allow for usage as: with open_if_exists(...).
Update
The lack of a context manager was bothering me, so here goes:
import os
from contextlib import contextmanager

@contextmanager
def open_if_exists(*args, **kwargs):
    if not os.path.exists(args[0]):
        raise IOError('{:s} does not exist.'.format(args[0]))
    f = open(*args, **kwargs)
    try:
        yield f
    finally:
        f.close()

with open_if_exists(r'file_does_not_exist.txt', 'w+') as f:
    print('foo', file=f)
I am afraid you can't perform the check for file existence and raise an error using the open() function alone.

Below is the signature of open() in Python, where name is the file name, mode is the access mode, and buffering indicates whether buffering should be performed while accessing the file:

open(name[, mode[, buffering]])
Instead, you can check whether the file exists:

>>> import os
>>> os.path.isfile(file_name)

This will return True or False depending on whether the file exists; use it to test specifically for a file.
To test the existence of both files and directories, you can use:
>>> os.path.exists(file_path)
I am trying to create a test controller and want the execution of tests to be collected in a file. I know about using tee and redirecting the test script execution to a certain file, but I am interested in doing it with Python on Linux.

So, in this case, whenever a test is executed, the log file should get created, and all the execution logs, including stdin, stdout and stderr, should get collected in this file.

Could somebody suggest how to implement this kind of idea?
There are several good logging modules, starting with the built-in logging; here is the official cookbook. Among the more interesting third-party libraries is Logbook; here is a pretty bare example just scratching the surface of its very cool features:
import logbook

def f(i, j):
    return i + j

logger = logbook.Logger('my application logger')
log = logbook.FileHandler('so.log')
log.push_application()

try:
    f(1, '2')
    logger.info('called ' + f.__name__)
except:
    logger.warn('failed on ')

try:
    f(1, 2)
    logger.info('called ' + f.__name__)
except:
    logger.warn('choked on, ')
so.log then looks like this:
[2011-05-19 07:40] WARNING: my application logger: failed on
[2011-05-19 07:40] INFO: my application logger: called f
Try this:
import sys
# Save the current stream
save_out = sys.stdout
# Define the log file
f = "a_log_file.log"
# Append to existing log file.
# Change 'a' to 'w' to recreate the log file each time.
fsock = open(f, 'a')
# Set stream to file
sys.stdout = fsock
###
# do something here
# any print function calls will send the stream to file f
###
# Reset back the stream to what it was
# any print function calls will send the stream to the previous stream
sys.stdout = save_out
fsock.close()
Open and write to a file:
mylogfile = 'bla.log'
f = open(mylogfile, 'a')
f.write('i am logging! logging logging!....loggin? timber!....')
f.close()
Look in the root directory of the script for 'bla.log', read it, and enjoy.
You can write a function like this:
def writeInLog(msg):
    with open("log", "a") as f:
        f.write(msg + "\n")
It will open the file "log", and append ("a") the message followed by a newline, then close the file.
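For example:

writeInLog("Test run started")  # appends one line to the file "log" in the working directory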