python zipfile.ZipFile() method generates 20GB zip file from 6MB original

I am running a Python (2.7) program which zips its output so that it can be emailed.
Usually this works as expected, but occasionally the zipped file is so huge that the machine runs out of disk space. Yet when I zip the file manually using the Finder, it works fine.
In this case, a 6MB file gets zipped down to 1.6MB by the Finder, but the Python zip method generated a 20GB file. Here is the code where the zipping happens:
import os
import fnmatch
import zipfile

zip = zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED)
for f in os.listdir("."):
    if fnmatch.fnmatch(f, "*final*"):
        zip.write(f)
zip.close()
Is there a way to fix this or at least avoid generating a gigantic file?

Do you maybe create that zip file in the same directory, so the program ends up trying to add the zip file to itself?
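If that is the cause, a minimal guard (a sketch; the file names here are made up) is to skip the output archive explicitly inside the loop:

```python
import fnmatch
import os
import zipfile

# Hypothetical setup: a data file whose name matches the pattern.
with open("report_final.txt", "w") as fh:
    fh.write("results")

zipfilename = "final_output.zip"  # note: this name also matches "*final*"
with zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED) as zf:
    for f in os.listdir("."):
        # Skip the archive being written; otherwise the loop can pick it
        # up and the zip tries to swallow itself as it grows.
        if f == zipfilename:
            continue
        if fnmatch.fnmatch(f, "*final*"):
            zf.write(f)
```

The with block also guarantees the archive is closed (and its central directory flushed) even if an exception interrupts the loop.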

Is this on Linux? I think you may be including hidden files and folders.

Related

Archive files directly from memory in Python

I'm writing a program where I take a number of files and zip them with encryption using pyzipper; I write them to io.BytesIO() objects so they stay in memory. Now, after some other additions, I want to take all of these in-memory files and zip them together into a single encrypted zip file, again using pyzipper.
The code looks something like this:
# Create the in-memory file object
in_memory = BytesIO()
# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)
# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess, the files list is an attribute of a class, and the function above runs a number of times to build the full list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this, but I don't think it can help here, and if it can, I have no idea how to make it work.
I also tried fs.memoryfs.MemoryFS to create a temporary virtual filesystem in memory, store all the files there, read them back, and zip everything together before saving to disk. Again, that failed: I got lots of different errors in my tests, and honestly there is very little information out there on this fs approach; even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know whether pyzipper (almost 1:1 with zipfile, with encryption added) supports nested zip files at all. That could be the problem I'm facing, but if it doesn't, I'm open to suggestions for a new approach. Also, I don't want to rely on third-party software, even if it is open source! (I'm talking about using 7zip to do all the archiving and encryption, which anyway shouldn't be possible without first saving the files to disk, the main thing I'm trying to avoid.)
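The traceback points at the actual issue: ZipFile.write() expects a filesystem path (it calls os.stat on its argument), while the loop passes raw zip bytes. A sketch of the nested-archive approach using the standard zipfile module (pyzipper mirrors this API; the file names and payloads here are invented) uses writestr(), which takes a name plus bytes and never touches the filesystem:

```python
import io
import zipfile

# Two hypothetical inner archives, each built entirely in memory.
inner_archives = []
for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    inner_archives.append(buf.getvalue())

# Nest the inner archives: writestr() accepts a member name and bytes,
# so no path lookup (and no os.stat call) is involved.
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i, data in enumerate(inner_archives):
        zf.writestr("archive_%d.zip" % i, data)
```

Nesting zips this way is legal in the ZIP format itself: each inner archive is just an opaque byte member of the outer one.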

Missing contents in .txt file after shutil.make_archive

Sysinfo = open('SystemInformation.txt', 'w')
Sysinfo.write("something useful",)
Sysinfo.close
#a handful more processes occur here
os.chdir(dstFolder)
shutil.make_archive('filename', 'zip', srcFolder)
I have the above code and everything zips up just fine except for the SystemInformation.txt file I created. When I open it up after extracting the .zip file it is completely blank. The odd part to me is that the same file in the source folder before it gets zipped is completely fine.
Make sure you call functions properly. You are missing the following:
Sysinfo.close()
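Calling close (with parentheses) flushes the buffered write to disk before the archive is built; without it, the text may still sit in Python's write buffer when make_archive runs. A with block, sketched below, avoids the problem entirely by closing the file automatically:

```python
# The with statement closes (and flushes) the file when the block exits,
# so the content is on disk before any archiving step runs.
with open("SystemInformation.txt", "w") as sysinfo:
    sysinfo.write("something useful")
```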

Run python zip file from memory at runtime?

I am trying to run a python zip file which is retrieved using requests.get. The zip file has several directories of python files in addition to the __main__.py, so in the interest of easily sending it as a single file, I am zipping it.
I know the file is being sent correctly, as I can save it to a file and then run it, however, I want to execute it without writing it to storage.
The working part is more or less as follows:
import requests
response = requests.get("http://myurl.com/get_zip")
I can write the zip to file using
f = open("myapp.zip","wb")
f.write(response.content)
f.close()
and manually run it from command line. However, I want something more like
exec(response.content)
This doesn't work since it's still compressed, but you get the idea.
I am also open to ideas that replace the zip with some other format of sending the code over internet, if it's easier to execute it from memory.
A possible solution is this:
import io
import requests
from zipfile import ZipFile

response = requests.get("http://myurl.com/get_zip")
# Read the contents of the zip into a bytes object.
binary_zip = io.BytesIO(response.content)
# Convert the bytes object into a ZipFile.
zip_file = ZipFile(binary_zip, "r")
# Iterate over all files in the zip (folders should also be ok).
for script_file in zip_file.namelist():
    exec(zip_file.read(script_file))
But it is a bit convoluted and probably can be improved.
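Since the archive contains a __main__.py, one refinement (a sketch, assuming the in-memory zip holds a __main__.py at its root) is to execute only that entry point instead of every file, loosely mirroring what `python app.zip` does:

```python
import io
import zipfile

# Build a stand-in for response.content: an in-memory zip with a __main__.py.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("__main__.py", "result = 6 * 7\n")
payload = buf.getvalue()  # in practice this would be response.content

zip_file = zipfile.ZipFile(io.BytesIO(payload))
namespace = {}
# Run only the entry point; its globals land in `namespace`.
exec(zip_file.read("__main__.py"), namespace)
```

Note that imports between modules inside the zip still won't resolve this way; for that, the archive would have to be written to disk so Python's zip import machinery can put it on sys.path.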

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried the Python zipfile module, but I didn't find a way to load the archive chunk by chunk and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile already handles big files well, unzipping the files one by one without loading the full archive.
My problem here is that my zip file sits on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks too.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
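To make the memory behaviour concrete, here is a sketch of the same one-member-at-a-time pattern done by hand, streaming each member to disk with shutil.copyfileobj() so only a small buffer is held in RAM at once (the archive and member names are made up for the example):

```python
import os
import shutil
import zipfile

# Build a small stand-in archive; in practice this is your big file on disk.
archive = "example_big.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("part1.txt", "x" * 10000)
    zf.writestr("part2.txt", "y" * 10000)

os.makedirs("target-dir", exist_ok=True)
with zipfile.ZipFile(archive) as zf:
    for info in zf.infolist():
        # zf.open() returns a streaming file object; copyfileobj()
        # moves it to disk in fixed-size chunks, never all at once.
        with zf.open(info) as src, open(os.path.join("target-dir", info.filename), "wb") as dst:
            shutil.copyfileobj(src, dst)
```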
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

Python ZipFile giving different namelist than unzipping utility

I have a bunch of timestamped .jpgs in a zip file, and when I open that zip file using Python's ZipFile package, I see three files:
>>> cameraZip = zipfile.ZipFile(zipPath, 'r')
>>> cameraZip.namelist()
['20131108_200152.jpg', '20131108_203158.jpg', '20131108_205521.jpg']
When I unpack the file using Mac OS X's default .zip unexpander, I get 371 files, from '20131101_000159.jpg' up to '20131108_193152.jpg'.
Unzipping the file on the command line gives the same result as the .zip unexpander:
$ unzip 2013.11.zip
extracting: 20131101_000159.jpg
extracting: 20131101_003156.jpg
...
extracting: 20131108_190155.jpg
extracting: 20131108_193152.jpg
Anybody have any idea what's going on?
Most likely the problem is in the zip's central directory record, which wasn't correctly flushed when the zip file was created. Python (I would guess) reads only the central directory, while other implementations walk the local file headers and therefore find all of the files.
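One way to check this hypothesis (a sketch on a small stand-in archive; in the ZIP format the byte signature PK\x03\x04 marks each local file header) is to count local headers in the raw bytes and compare against namelist(), which comes from the central directory:

```python
import io
import zipfile

# Build a small well-formed archive as a stand-in for 2013.11.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("20131101_000159.jpg", b"\xff\xd8fake")
    zf.writestr("20131101_003156.jpg", b"\xff\xd8fake")
raw = buf.getvalue()

# namelist() is read from the central directory at the end of the file...
central_count = len(zipfile.ZipFile(io.BytesIO(raw)).namelist())
# ...while counting b"PK\x03\x04" signatures walks the local headers
# that tools like unzip can fall back to.
local_count = raw.count(b"PK\x03\x04")
```

On a healthy archive the two counts agree; on the broken one described above, the local count would be 371 while the central directory only lists 3 (the signature scan can also over-count if the bytes happen to appear inside compressed data, so treat it as a diagnostic, not an exact repair).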
