Memory leak when using py7zlib to open .7z archives - python

I am trying to use py7zlib to open and read files stored in .7z archives. I am able to do this, but it appears to be causing a memory leak. After scanning through a few hundred .7z files using py7zlib, Python crashes with a MemoryError. I don't have this problem when doing the equivalent operations on .zip files using the built-in zipfile library. My process with the .7z files is essentially as follows (look for a subfile in the archive with a given name and return its contents):
with open(filename, 'rb') as f:
    z = py7zlib.Archive7z(f)
    names = z.getnames()
    if subName in names:
        subFile = z.getmember(subName)
        contents = subFile.read()
    else:
        contents = None
return contents
Does anyone know why this would be causing a memory leak once the Archive7z object passes out of scope if I am closing the .7z file object? Is there any kind of cleanup or file-closing procedure I need to follow (like with the zipfile library's ZipFile.close())?

Related

Python ZipFile: remove embedded archive from a containing file

There is a technique of storing a ZIP archive concatenated with some other file (e.g. with an EXE to store additional resources, or with a JPEG for steganography). Python's ZipFile supports such files (e.g. if you open a ZipFile in "a" mode on a non-ZIP file, it will append the ZIP headers to the end). I would like to update such an archive (possibly adding, updating and deleting files in the ZIP archive).
Python's ZipFile doesn't support deleting or overwriting files inside the archive, only appending, so the only way for me is to completely recreate the ZIP file with new contents. But I need to preserve the main file in which the ZIP was embedded. If I just open it in "w" mode, the whole file is completely overwritten.
I need a way to remove a ZIP file from the end of an ordinary file. I'd prefer to use only functions that are available in the Python 3 standard library.
I found a solution:
from zipfile import ZipFile

# Find where the embedded archive starts: the smallest local header offset
min_header_offset = None
with ZipFile(output_filename, "r") as zip_file:
    for info in zip_file.infolist():
        if min_header_offset is None or info.header_offset < min_header_offset:
            min_header_offset = info.header_offset
    # Existing members can also be saved here if they are needed for the update
if min_header_offset is not None:
    with open(output_filename, "r+b") as f:
        f.truncate(min_header_offset)
# Somehow populate the new archive contents
with ZipFile(output_filename, "a") as zip_file:
    for input_filename in input_filenames:
        zip_file.write(input_filename)
It clears the archive, but doesn't touch anything that comes before the archive.
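The truncation step can be isolated into a small helper. This is a minimal sketch, assuming the archive was appended to the end of the host file and that `header_offset`, as reported by `zipfile`, is the absolute position of each member in that file. The function names `embedded_zip_start` and `strip_embedded_zip` are made up for illustration.

```python
import zipfile


def embedded_zip_start(path):
    """Return the smallest local-header offset, or None if the archive is empty."""
    with zipfile.ZipFile(path, "r") as zf:
        offsets = [info.header_offset for info in zf.infolist()]
    return min(offsets) if offsets else None


def strip_embedded_zip(path):
    """Truncate the file at the start of the embedded archive; return that offset."""
    start = embedded_zip_start(path)
    if start is not None:
        with open(path, "r+b") as fh:
            # Everything before the first local header is the host file's own data.
            fh.truncate(start)
    return start
```

After `strip_embedded_zip(path)` returns, opening the same path with `ZipFile(path, "a")` should append a fresh archive behind the untouched host data, as described in the question.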

Archive files directly from memory in Python

I'm writing a program where I get a number of files and zip each of them with encryption using pyzipper. I'm also using io.BytesIO() as the write target so that they stay in memory. Now, after some other additions, I want to take all of these in-memory files and zip them together into a single encrypted zip file using the same pyzipper.
The code looks something like this:
# Create the in-memory file object
in_memory = BytesIO()
# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)
# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess the "files" list is an attribute from a class and the whole thing above is a function that does this a number of times and then you get the full files list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
  File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
    st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose, but I don't think it can help me, and if it can, I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory to store all the files, then get them back to zip everything together and save the result to disk. Again, it failed. I got tons of different errors in my tests and, TBH, there's very little information out there on this fs method; even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know if pyzipper (almost 1:1 zipfile with the addition of encryption) supports nested zip files at all. This could be the problem I'm facing, but if it doesn't, I'm open to any suggestions for a new approach. Also, I don't want to rely on third-party software, even if it is open source! (I'm talking about using 7zip to do all the archiving and encryption, even though it shouldn't even be possible to use it without saving the files to disk in the first place, which is the main thing I'm trying to avoid.)
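The traceback points at `os.stat(filename)`, which suggests that `write()` treats its argument as a filesystem path, while `files` holds raw zip bytes. `writestr()`, already used in the first snippet, takes a member name plus the data instead. Below is a minimal sketch of that idea using the standard-library `zipfile`. Since the question describes pyzipper as almost 1:1 with zipfile, the same calls are assumed (not verified here) to work on `pyzipper.AESZipFile`. Encryption is left out, and the `inner_N.zip` member names are made up for illustration.

```python
import io
import zipfile


def zip_bytes(name, data):
    """Build a one-member zip archive entirely in memory and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    # getvalue() returns the full buffer regardless of the current position.
    return buffer.getvalue()


def bundle(archives):
    """Pack already-zipped byte strings into one outer archive, also in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as outer:
        for index, blob in enumerate(archives):
            # writestr takes a member name plus raw bytes,
            # so no filesystem path is ever looked up.
            outer.writestr("inner_{}.zip".format(index), blob)
    return buffer.getvalue()
```

The bytes returned by `bundle()` can then be written to disk in a single `open(..., "wb")` call, so the inner archives never touch the filesystem on their own.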

How to write file to memory filepath and read from memory filepath in Python?

An existing Python package requires a filepath as input parameter for a method to be able to parse the file from the filepath. I want to use this very specific Python package in a cloud environment, where I can't write files to the harddrive. I don't have direct control over the code in the existing Python package, and it's not easy to switch to another environment, where I would be able to write files to the harddrive. So I'm looking for a solution that is able to write a file to a memory filepath, and let the parser read directly from this memory filepath. Is this possible in Python? Or are there any other solutions?
Example Python code that works by using harddrive, which should be changed so that no harddrive is used:
temp_filepath = "./temp.txt"
with open(temp_filepath, "wb") as file:
    # The file is opened in binary mode, so the payload must be bytes
    file.write(b"some binary data")
model = Model()
model.parse(temp_filepath)
Example Python code that uses memory filesystem to store file, but which does not let parser read file from memory filesystem:
from fs import open_fs
temp_filepath = "./temp.txt"
with open_fs('osfs://~/') as home_fs:
    home_fs.writetext(temp_filepath, "some binary data")
model = Model()
model.parse(temp_filepath)
You're probably looking for StringIO or BytesIO from io
import io
with io.BytesIO() as tmp:
    tmp.write(content)
    # to continue working, rewind file pointer
    tmp.seek(0)
    # work with tmp
pathlib may also be an advantage
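An in-memory buffer only helps if the library accepts file objects. If it insists on a path, one fallback is the standard `tempfile` module. This is a sketch under the assumption that the environment offers a writable temporary directory; whether that directory is memory-backed depends on the platform and is not guaranteed. `read_back` is a hypothetical stand-in for `Model().parse(path)`.

```python
import os
import tempfile


def with_temp_path(data, consumer, suffix=".bin"):
    """Write data to a short-lived file and hand its path to consumer."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "payload" + suffix)
        with open(path, "wb") as fh:
            fh.write(data)
        # The directory and the file are removed once this block exits.
        return consumer(path)


def read_back(path):
    """Hypothetical stand-in for a parser that only accepts a file path."""
    with open(path, "rb") as fh:
        return fh.read()
```

For example, `with_temp_path(b"some binary data", read_back)` passes a real path to the consumer and cleans up afterwards.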

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive chunk by chunk and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile already handles big files properly, by unzipping the files one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it chunk by chunk.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
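The member-by-member streaming described above can also be written out by hand. This is a rough sketch, not how `extractall()` is literally implemented: it opens each member with `ZipFile.open()` and copies it to disk with `shutil.copyfileobj()`. The function name `extract_streaming` and the 64 KiB chunk size are arbitrary choices for illustration.

```python
import os
import shutil
import zipfile


def extract_streaming(zip_path, target_dir, chunk_size=64 * 1024):
    """Extract members one at a time, holding at most chunk_size bytes per copy step."""
    root = os.path.realpath(target_dir)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            dest = os.path.realpath(os.path.join(root, info.filename))
            # Refuse member names that would escape the target directory.
            if os.path.commonpath([root, dest]) != root:
                raise ValueError("unsafe member name: " + info.filename)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with zf.open(info) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
            written.append(dest)
    return written
```

Only the archive's table of contents and one chunk of the current member are held in memory at a time, which is why the size of the whole archive does not matter here.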
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # Extract only the members with index in [ix_begin, ix_end) into directory
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        print(infos)
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)
    # No explicit zf.close() is needed: the with block closes the archive

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

MetaData of downloaded zipped file

url='http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path='D:')
I wrote the code above to download a zipped file from a URL and extract all of its files to a specified drive, and it is working fine.
Is there a way I can get metadata for all the files extracted from z, for example
filenames, file sizes, file extensions, etc.?
Zipfile objects actually have built in tools for this that you can use without even extracting anything. infolist returns a list of ZipInfo objects that you can read certain information out of, including full file name and uncompressed size.
import os
url='http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
info = z.infolist()
data = []
for obj in info:
    name = os.path.splitext(obj.filename)
    # list.append takes a single argument, so store one tuple per file
    data.append((name[0], name[1], obj.file_size))
I also used os.path.splitext just to separate out the file's name from its extension as you did ask for file type separately from the name.
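The same idea can be tried without any download. This is a small sketch that reads `ZipInfo` fields from an archive held in memory; the helper name `list_metadata` is made up for illustration, while `filename`, `file_size`, `compress_size` and `date_time` are standard `ZipInfo` attributes.

```python
import io
import os
import zipfile


def list_metadata(zip_bytes):
    """Return (stem, extension, size, compressed size, mtime) for each member."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            stem, ext = os.path.splitext(info.filename)
            rows.append((stem, ext, info.file_size, info.compress_size, info.date_time))
    return rows
```

Nothing is extracted here: all of these values come from the archive's central directory.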
I don't know of a built-in way to do that using the zipfile module, however it is easily done using os.path:
import os
EXTRACT_PATH = "D:"
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path=EXTRACT_PATH)
extracted_files = [os.path.join(EXTRACT_PATH, filename) for filename in z.namelist()]
for extracted_file in extracted_files:
    # All metadata operations here, such as:
    print(os.path.getsize(extracted_file))
