Unzip folder by chunks in Python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files correctly, unzipping the files one by one without loading the full archive.
My problem here is that my zip file sits on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.

You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory-efficient. Creating a zipfile.ZipFile object does not read the whole file into memory; rather, it just reads in the table of contents for the ZIP file. ZipFile.extractall() then extracts the members one at a time using shutil.copyfileobj(), copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
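For the S3 case described in the edit, one approach (a sketch, not the asker's code; the function name and chunk size are made up) is to first stream the archive to local disk in chunks, then extract members one at a time with ZipFile.open() and shutil.copyfileobj(), so only one buffer's worth of data is held in memory at any time:

```python
import os
import shutil
import zipfile

def extract_members_streaming(zip_path, target_dir, chunk_size=1 << 20):
    """Extract each member with copyfileobj so at most `chunk_size`
    bytes of file data are held in memory at a time."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            dest = os.path.join(target_dir, info.filename)
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            # zf.open() gives a streaming file object; no member is
            # decompressed into memory as a whole.
            with zf.open(info) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
```

The download step itself can use any chunked transfer (e.g. ranged GETs); once the file is on disk, extraction stays memory-bounded regardless of archive size.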

You can use zipfile (or possibly tarfile) as follows:

import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # Extract only the members whose index falls in [ix_begin, ix_end).
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

Related

Python ZipFile: remove embedded archive from a containing file

There is a technique of storing a ZIP archive concatenated with some other file (e.g. with an EXE to store additional resources, or with a JPEG for steganography). Python's ZipFile supports such files (e.g. if you open a ZipFile in "a" mode on a non-ZIP file, it will append ZIP headers to the end). I would like to update such an archive (possibly add, update, and delete files inside the ZIP).
Python's ZipFile doesn't support deleting or overwriting files inside the archive, only appending, so the only way for me is to completely recreate the ZIP file with new contents. But I need to preserve the main file in which the ZIP was embedded. If I just open it in "w" mode, the whole file gets overwritten.
I need a way to remove a ZIP archive from the end of an ordinary file. I'd prefer to use only functions available in the Python 3 standard library.
I found a solution:

from zipfile import ZipFile

# Find the smallest header offset: everything before it is the host file.
min_header_offset = None
with ZipFile(output_filename, "r") as zip_file:
    for info in zip_file.infolist():
        if min_header_offset is None or info.header_offset < min_header_offset:
            min_header_offset = info.header_offset
        # Here it is also possible to save existing files if they are needed for the update

if min_header_offset is not None:
    with open(output_filename, "r+b") as f:
        f.truncate(min_header_offset)

# Somehow populate new archive contents
with ZipFile(output_filename, "a") as zip_file:
    for input_filename in input_filenames:
        zip_file.write(input_filename)

It clears the archive but doesn't touch anything that comes before it.
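A runnable demonstration of the truncation idea (filenames are made up for the demo): append a ZIP to an arbitrary host file, then cut the archive off at the smallest header_offset, leaving the host bytes intact:

```python
from zipfile import ZipFile

def strip_trailing_zip(path):
    """Truncate `path` at the first local file header, removing an
    appended ZIP archive while leaving the original leading bytes intact."""
    with ZipFile(path, "r") as zf:
        # header_offset values are absolute file positions, so the
        # smallest one marks where the appended archive begins.
        offsets = [info.header_offset for info in zf.infolist()]
    if offsets:
        with open(path, "r+b") as f:
            f.truncate(min(offsets))
```

Opening the truncated file in "a" mode afterwards appends a fresh archive, as in the answer above.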

Is it possible to read a specific file directly from a zip file that is stored on S3?

I have a file called story.txt in a zip file called big.zip that is stored in an S3 bucket called zips-bucket.
I want my Python code to read the content of just story.txt without downloading or even scanning the entire big zip file. Is it possible? How?
No, in your particular case it is not possible. However, S3 offers a feature called S3 Select that can selectively read a portion of an object if some requirements are met. You can check out the documentation.
Yes, this is possible. You will need the smart_open and zipfile modules. Say your compressed file is at s3://zips-bucket/big.zip. Do the following:

import smart_open as so
import zipfile

with so.open('s3://zips-bucket/big.zip', 'rb') as file_data:
    with zipfile.ZipFile(file_data) as z:
        with z.open('story.txt') as zip_file_data:
            story_lines = zip_file_data.readlines()

And that should do it!

Memory leak when using py7zlib to open .7z archives

I am trying to use py7zlib to open and read files stored in .7z archives. I am able to do this, but it appears to be causing a memory leak. After scanning through a few hundred .7z files using py7zlib, Python crashes with a MemoryError. I don't have this problem when doing the equivalent operations on .zip files using the built-in zipfile library. My process with the .7z files is essentially as follows (look for a subfile in the archive with a given name and return its contents):
import py7zlib

def get_subfile_contents(filename, subName):
    with open(filename, 'rb') as f:
        z = py7zlib.Archive7z(f)
        names = z.getnames()
        if subName in names:
            subFile = z.getmember(subName)
            contents = subFile.read()
        else:
            contents = None
    return contents
Does anyone know why this would be causing a memory leak once the Archive7z object passes out of scope if I am closing the .7z file object? Is there any kind of cleanup or file-closing procedure I need to follow (like with the zipfile library's ZipFile.close())?

python zipfile.ZipFile() method generates 20G zip file from 6M original

I am running a python program (v2.7) which zips output so that it can be emailed.
Usually this works as expected, but occasionally the zipped file is so huge that the machine runs out of disk space. Yet when I zip the file manually using the Finder, it works fine.
In this case, the 6MB file gets zipped down to 1.6MB by the Finder, but the Python zip method generated a 20GB file. Here is the code where the zipping happens:
zip = zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED)
for f in os.listdir("."):
    if fnmatch.fnmatch(f, "*final*"):
        zip.write(f)
zip.close()
Is there a way to fix this or at least avoid generating a gigantic file?
Do you maybe create that zip file in the same directory, so the program then tries to add the zip file itself to the zip file?
Is it Linux? I think you may be including hidden files and folders.
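If the first guess is right (the archive being written sits in the directory being scanned, and "*final*" matches its name), the zip keeps reading the growing file it is writing and balloons. A defensive sketch (function name and parameters are made up) is to skip the output file explicitly:

```python
import fnmatch
import os
import zipfile

def zip_matching(zipfilename, pattern="*final*", directory="."):
    """Zip regular files matching `pattern`, skipping the output
    archive itself so the zip never tries to include itself."""
    with zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in os.listdir(directory):
            path = os.path.join(directory, f)
            if not os.path.isfile(path):
                continue
            if os.path.abspath(path) == os.path.abspath(zipfilename):
                continue  # never add the archive to itself
            if fnmatch.fnmatch(f, pattern):
                zf.write(path, arcname=f)
```

Writing the archive to a different directory, or naming it so the pattern cannot match it, avoids the problem as well.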

How to walk a tar.gz file that contains zip files without extraction

I have a large tar.gz file to analyze using a Python script. The tar.gz file contains a number of zip files, which might embed other .gz files. Before extracting anything, I would like to walk through the directory structure within the compressed files to see if certain files or directories are present. Looking at the tarfile and zipfile modules, I don't see any existing function that allows me to get a table of contents of a zip file within a tar.gz file.
Appreciate your help,
You can't get at it without extracting the file. However, you don't need to extract it to disk if you don't want to. You can use the tarfile.TarFile.extractfile method to get a file-like object that you can then pass to tarfile.open as the fileobj argument. For example, given these nested tarfiles:
$ cat bar/baz.txt
This is bar/baz.txt.
$ tar cvfz bar.tgz bar
bar/
bar/baz.txt
$ tar cvfz baz.tgz bar.tgz
bar.tgz
You can access files from the inner one like so:
>>> import tarfile
>>> baz = tarfile.open('baz.tgz')
>>> bar = tarfile.open(fileobj=baz.extractfile('bar.tgz'))
>>> bar.extractfile('bar/baz.txt').read()
'This is bar/baz.txt.\n'
and they're only ever extracted to memory.
I suspect that this is not possible and that you'll have to program it manually.
.tar.gz files are first tar'd and then gzipped by what are essentially two different applications run in succession. To access the tar file, you're probably going to have to un-gzip it first.
Also, once you do have access to the tar file after un-gzipping it, it does not support random access well. There is no central index in the tar file that lists the contents.
