Python zipfile module doesn't compress files

I have a problem with compression in Python.
I know I should pass the ZIP_DEFLATED constant when writing to make the zip file compressed, but it does not work for me.
I have 3 PDF documents in the C:\zip directory.
When I run the following code it works just fine:
import os, sys
from zipfile import ZipFile, ZIP_DEFLATED
list = os.listdir(r'C:\zip')
file = ZipFile('test.zip', 'w')
for item in list:
    file.write(item)
file.close()
It makes the test.zip file without the compression.
When I change the fourth row to this:
file = ZipFile('test.zip','w', compression = ZIP_DEFLATED)
It also makes the test.zip file without the compression.
I also tried changing the write call to pass the compress_type argument:
file.write(item, compress_type = ZIP_DEFLATED)
But that doesn't work either.
I use Python version 2.7.4 with Win7.
I tried the code on another computer (same circumstances, Win7 and Python 2.7.4), and it made the test.zip file compressed just like it should.
I know the zlib module should be available, because when I run this:
import zlib
it doesn't raise an error. Also, if something were wrong with the zlib module, the code at the top should have raised an error too, so I suspect that zlib isn't the problem.

By default the zipfile module only stores data. To compress it, pass a compression mode when creating the archive:
import zipfile
try:
    import zlib  # ZIP_DEFLATED needs zlib
    mode = zipfile.ZIP_DEFLATED
except ImportError:
    mode = zipfile.ZIP_STORED
zf = zipfile.ZipFile('zipfilename', 'w', mode)
zf.write(item)  # item is the path of the file to add
zf.close()
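To check whether the compression mode is actually taking effect, it can help to compare archive sizes for the same data. The following is a minimal sketch that works entirely in memory; the helper name `archive_size` and the sample data are made up for illustration.

```python
import io
import zipfile

def archive_size(data, method):
    """Return the size in bytes of an in-memory zip holding `data`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("sample.txt", data)
    return len(buf.getvalue())

# Made-up, highly repetitive sample data so that deflate has something to shrink
data = b"highly repetitive text\n" * 1000

stored_size = archive_size(data, zipfile.ZIP_STORED)
deflated_size = archive_size(data, zipfile.ZIP_DEFLATED)
print(stored_size, deflated_size)
```

If both numbers come out the same for data like this, the compression argument is not being applied.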

In case you get here as I did, I'll add something.
If you use ZipInfo objects, they always override the compression method specified while creating the ZipFile, which is then useless.
So either you set their compression method (no parameter on the constructor, you must set the attribute) or specify the compression method when calling write (or writestr).
import io
import zlib
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED
def write_things():
    zip_buffer = io.BytesIO()
    with ZipFile(file = zip_buffer, mode = "w", compression = ZIP_DEFLATED) as zipper:
        # Get some data to write
        fname, content, zip_ts = get_file_data()
        file_object = ZipInfo(fname, zip_ts)
        zipper.writestr(file_object, content) # Surprise, no compression
        # This is required to get compression
        # zipper.writestr(file_object, content, compress_type = ZIP_DEFLATED)
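The effect described above can be observed by reading the archive back and inspecting each entry. This is a small sketch, not the original poster's code: the function name `zip_entry_info`, the file name, and the timestamp are placeholders chosen for the example.

```python
import io
import zipfile

def zip_entry_info(set_compress_type):
    """Write one ZipInfo entry and report how it was actually stored."""
    data = b"a" * 10000  # made-up, highly compressible payload
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        info = zipfile.ZipInfo("data.txt", (2020, 1, 1, 0, 0, 0))
        if set_compress_type:
            # Set the attribute on the ZipInfo itself
            info.compress_type = zipfile.ZIP_DEFLATED
        zf.writestr(info, data)
    with zipfile.ZipFile(buf) as zf:
        entry = zf.getinfo("data.txt")
        return entry.compress_type, entry.compress_size

print(zip_entry_info(False))  # the ZipInfo default (stored) wins
print(zip_entry_info(True))   # deflated once the attribute is set
```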

Related

Python error: "That compression method is not supported"

I am trying to decompress some .zip or .rar archives, and I am getting the error "That compression method is not supported". All the files in this directory are .zip files.
import rarfile
import sys
import os, zipfile
from tkinter import *
from tkinter import filedialog
from tkinter import messagebox
ZipExtension='.zip'
RarExtension='.rar'
#filesZIP="..\directory"
try:
    os.chdir(filesZIP) # change directory from working dir to dir with files
except:
    messagebox.showerror("Error","The folder with the archives was not selected! Please run the app again and select the folder.")
    sys.exit()
for item in os.listdir(filesZIP):# loop through items in dir
    if item.endswith(ZipExtension): # check for ".zip" extension
        file_name = os.path.abspath(item) # get full path of files
        zip_ref = zipfile.ZipFile(file_name) # create zipfile object
        zip_ref.extractall(filesZIP) # extract file to dir
        zip_ref.close() # close file
for item in os.listdir(filesZIP):
    if item.endswith(RarExtension):
        file_name = os.path.abspath(item)
        rar_ref = rarfile.RarFile(file_name)
        rar_ref.extractall()
        rar_ref.close()
messagebox.showinfo("Information",'Successful!')
The problem is that sometimes it works, and in some cases, like the one above, it gives me that error, even though they are all .zip files with no password.
Background
By design, zip archives support a lot of different compression methods. The support for these compression methods in Python varies depending on the version of the zipfile library you are running.
With Python 2.x, I see zipfile supports only deflate and store:
zipfile.ZIP_STORED
The numeric constant for an uncompressed archive member.
zipfile.ZIP_DEFLATED
The numeric constant for the usual ZIP compression method. This requires the zlib module. No other compression methods are currently supported.
while with Python 3, zipfile supports a few more
zipfile.ZIP_STORED
The numeric constant for an uncompressed archive member.
zipfile.ZIP_DEFLATED
The numeric constant for the usual ZIP compression method. This requires the zlib module.
zipfile.ZIP_BZIP2
The numeric constant for the BZIP2 compression method. This requires the bz2 module.
New in version 3.3.
zipfile.ZIP_LZMA
The numeric constant for the LZMA compression method. This requires the lzma module.
New in version 3.3.
What Compression Methods are being used?
To see if this is your issue, you first need to see what compression method is actually being used in your zip files.
Let me work through an example to see how that works.
First create a zip file using bzip2 compression:
zip -Z bzip2 /tmp/try.zip /tmp/in.txt
Let's check what unzip can tell us about the compression method it actually used.
$ unzip -lv try.zip
Archive:  try.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
  387776  BZip2     30986  92% 2022-09-20 14:11 f3d1fbaf  in.txt
--------          -------  ---                            -------
  387776            30986  92%                            1 file
In unzip the Method column says it is using Bzip2 compression. I'm sure that WinZip has an equivalent report.
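The same check can be sketched in Python itself, since reading the archive directory does not require support for the members' compression methods. The mapping `METHOD_NAMES` and the helper `list_methods` below are my own names, and the mapping covers only the four methods mentioned in this answer; the demo archive is built in memory purely for illustration.

```python
import io
import zipfile

# Numeric method codes for the four methods discussed above
METHOD_NAMES = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflated",
    12: "bzip2",
    14: "lzma",
}

def list_methods(zip_source):
    """Return (member name, compression method) pairs for an archive."""
    with zipfile.ZipFile(zip_source) as zf:
        return [
            (info.filename,
             METHOD_NAMES.get(info.compress_type,
                              "unknown (%d)" % info.compress_type))
            for info in zf.infolist()
        ]

# Demo on a small in-memory archive with two differently stored members
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("plain.txt", b"abc", compress_type=zipfile.ZIP_STORED)
    zf.writestr("packed.txt", b"abc" * 100, compress_type=zipfile.ZIP_DEFLATED)
demo.seek(0)

for name, method in list_methods(demo):
    print(name, method)
```

Any member reported with a method your Python version does not support is a likely cause of the error.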
Unzip with Python 2.7
Next try uncompressing this zip file with Python 2.7 - I'll use the code below with Python 2 & Python 3
import zipfile

zip_ref = zipfile.ZipFile('/tmp/try.zip')
if zip_ref.testzip() is None:
    print("zip file is ok")
zip_ref.close()
First Python 2.7 -- that matches what you are seeing. So that confirms that zipfile with Python 2.7 doesn't support bzip2 compression.
$ python2.7 /tmp/z.py
Traceback (most recent call last):
  File "/tmp/z.py", line 4, in <module>
    if zip_ref.testzip() is None:
  File "/usr/lib/python2.7/zipfile.py", line 921, in testzip
    with self.open(zinfo.filename, "r") as f:
  File "/usr/lib/python2.7/zipfile.py", line 1033, in open
    close_fileobj=should_close)
  File "/usr/lib/python2.7/zipfile.py", line 553, in __init__
    raise NotImplementedError("compression type %d (%s)" % (self._compress_type, descr))
NotImplementedError: compression type 12 (bzip2)
Unzip with Python 3.10
Next with Python 3.10.
$ python3.10 /tmp/z.py
zip file is ok
As expected, all is fine in this instance -- zipfile with Python 3 does support bzip2 compression.

How to write file to memory filepath and read from memory filepath in Python?

An existing Python package has a method that requires a filepath as its input parameter and parses the file at that path. I want to use this specific package in a cloud environment where I can't write files to the hard drive. I don't have direct control over the code in the package, and it's not easy to switch to another environment where I would be able to write files to the hard drive. So I'm looking for a way to write a file to an in-memory filepath and let the parser read directly from it. Is this possible in Python? Or are there any other solutions?
Example Python code that works by using the hard drive, and which should be changed so that no hard drive is used:
temp_filepath = "./temp.txt"
with open(temp_filepath, "wb") as file:
    file.write(b"some binary data")  # a file opened in "wb" mode needs bytes
model = Model()
model.parse(temp_filepath)
Example Python code that uses memory filesystem to store file, but which does not let parser read file from memory filesystem:
from fs import open_fs
temp_filepath = "./temp.txt"
with open_fs('osfs://~/') as home_fs:
    home_fs.writetext(temp_filepath, "some binary data")
model = Model()
model.parse(temp_filepath)
You're probably looking for StringIO or BytesIO from io:
import io
with io.BytesIO() as tmp:
    tmp.write(content)
    # to continue working, rewind file pointer
    tmp.seek(0)
    # work with tmp
pathlib may also be useful here.
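As a runnable illustration of the write-then-rewind pattern above, here is a small sketch. Note the assumption: it only helps when the consumer accepts a file-like object rather than a path string. `count_lines` is a hypothetical stand-in for such a consumer, not part of the package from the question.

```python
import io

def count_lines(stream):
    """Hypothetical consumer that reads from any binary file-like object."""
    return sum(1 for _ in stream)

content = b"first line\nsecond line\n"  # made-up sample data

with io.BytesIO() as tmp:
    tmp.write(content)
    # After writing, the position is at the end; rewind before reading
    tmp.seek(0)
    n_lines = count_lines(tmp)

print(n_lines)
```

Without the `seek(0)` call the consumer would start reading at the end of the buffer and see no data.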

Read compressed binary file (.grib2.bz2)

I have downloaded one of the files from this list https://opendata.dwd.de/weather/nwp/icon-eu/grib/03/t_2m/ (the actual filenames change every day) which are bz2 compressed.
I can read in the decompressed file using e.g.
import xarray as xr
# cfgrib + dependencies are also required
grib1 = xr.open_dataset("icon-eu_europe_regular-lat-lon_single-level_2020101212_001_ASHFL_S.grib2", engine='cfgrib')
However, I would like to read in the compressed file.
I tried things like
import bz2
with bz2.open("icon-eu_europe_regular-lat-lon_single-level_2020101818_002_ASWDIFD_S.grib2.bz2", "rb") as f:
    xr.open_dataset(f, engine='cfgrib')
but this does not work.
I am looking for any way to programmatically read in the compressed file.
I had the same issue while processing numerical weather prediction data.
What I do is download the file and hold it in memory as a binary object (e.g. with urlopen or requests), then pass this object into the following function:
import bz2, shutil
from io import BytesIO
from pathlib import Path
def bunzip_store(file: BytesIO, local_intermediate_file: Path):
    with bz2.BZ2File(file) as fr, local_intermediate_file.open(mode="wb") as fw:
        shutil.copyfileobj(fr, fw)
The decompressed file will be stored under local_intermediate_file. Now you should be able to open this file.
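A possible way to exercise this function without downloading anything is sketched below. The payload is made-up placeholder bytes rather than real GRIB data, and the target file name is arbitrary; with a real download you would wrap the response body in `BytesIO` instead.

```python
import bz2
import shutil
import tempfile
from io import BytesIO
from pathlib import Path

def bunzip_store(file: BytesIO, local_intermediate_file: Path):
    # Stream-decompress the in-memory bz2 data into a local file
    with bz2.BZ2File(file) as fr, local_intermediate_file.open(mode="wb") as fw:
        shutil.copyfileobj(fr, fw)

# Placeholder payload standing in for a downloaded .grib2.bz2 body
payload = b"GRIB placeholder bytes " * 100
compressed = BytesIO(bz2.compress(payload))

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.grib2"
    bunzip_store(compressed, target)
    restored = target.read_bytes()

print(len(payload), len(restored))
```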

Python: Assign a compression level to tarfile

My question is a follow up to this one.
I would like to know how I can modify the following code so that I can assign a compression level:
import os
import tarfile
home = '//global//scratch//chamar//parsed_data//batch0'
backup_dir = '//global//scratch//chamar//parsed_data//'
home_dirs = [ name for name in os.listdir(home) if os.path.isdir(os.path.join(home, name)) ]
for directory in home_dirs:
    full_dir = os.path.join(home, directory)
    tar = tarfile.open(os.path.join(backup_dir, directory+'.tar.gz'), 'w:gz')
    tar.add(full_dir, arcname=directory)
    tar.close()
Basically, the code loops through each directory in batch0 (each contains 6000+ files) and creates a tar.gz compressed file for each one in //global//scratch//chamar//parsed_data//.
I think the default compression level is 9, but it takes a lot of time to compress. I don't need a lot of compression. Level 5 would be enough. How can I modify the above code to include a compression level?
There is a compresslevel keyword argument you can pass to open() (no need to use gzopen() directly):
tar = tarfile.open(filename, "w:gz", compresslevel=5)
According to the gzip documentation, compresslevel is an integer from 0 to 9 (9 is the default): 0 means no compression, 1 is the fastest and compresses least, and 9 is the slowest and compresses most.
[See also: tarfile documentation]
There is a compresslevel option in the gzopen method. The line below should replace the one with the tarfile.open call in your example:
tar = tarfile.TarFile.gzopen(os.path.join(backup_dir, directory+'.tar.gz'), mode='w', compresslevel=5)
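To try the compresslevel argument discussed above without touching real directories, the following sketch builds a gzip-compressed tar archive in memory and reads it back. The helper names `make_targz` and `read_member`, the member name, and the sample data are all invented for this example.

```python
import io
import tarfile

def make_targz(data, level):
    """Build an in-memory .tar.gz with one member at the given gzip level."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=level) as tar:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def read_member(archive_bytes, name):
    """Extract one member's content from an in-memory .tar.gz."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return tar.extractfile(name).read()

# Made-up sample text
data = b"".join(b"line %d of sample text\n" % i for i in range(2000))

archive = make_targz(data, level=5)
print(len(data), len(archive))
```

Timing `make_targz` at a few levels on your own data is a reasonable way to pick a trade-off between speed and size.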

How to use pkgutil.get_data with csv.reader in Python?

I have a python module that has a variety of data files, (a set of csv files representing curves) that need to be loaded at runtime. The csv module works very well
# curvefile = "ntc.10k.csv"
raw = csv.reader(open(curvefile, 'rb'), delimiter=',')
But if I import this module into another script, I need to find the full path to the data file.
/project
    /shared
        curve.py
        ntc.10k.csv
        ntc.2k5.csv
    /apps
        script.py
I want the script.py to just refer to the curves by basic filename, not with full paths. In the module code, I can use:
pkgutil.get_data("curve", "ntc.10k.csv")
which works very well at finding the file, but it returns the contents of the csv file already read in, whereas csv.reader requires the file handle itself. Is there any way to make these two modules play well together? They're both standard library modules, so I wasn't really expecting problems. I know I can start splitting the data returned by pkgutil myself, but then I might as well not be using the csv library.
I know I can just use this in the module code and forget about pkgutil, but it seems like pkgutil is really exactly what this is for.
this_dir, this_filename = os.path.split(__file__)
DATA_PATH = os.path.join(this_dir, curvefile)
raw = csv.reader(open(DATA_PATH, "rb"))
I opened up the source code to get_data, and it is trivial to have it return the path to the file instead of the loaded file. This module should do the trick. Use the keyword as_string=True to return the file read into memory, or as_string=False, to return the path.
import os, sys
from pkgutil import get_loader
def get_data_smart(package, resource, as_string=True):
    """Rewrite of pkgutil.get_data() that actually lets the user determine if data should
    be returned read into memory (aka as_string=True) or just return the file path.
    """
    loader = get_loader(package)
    if loader is None or not hasattr(loader, 'get_data'):
        return None
    mod = sys.modules.get(package) or loader.load_module(package)
    if mod is None or not hasattr(mod, '__file__'):
        return None
    # Modify the resource name to be compatible with the loader.get_data
    # signature - an os.path format "filename" starting with the dirname of
    # the package's __file__
    parts = resource.split('/')
    parts.insert(0, os.path.dirname(mod.__file__))
    resource_name = os.path.join(*parts)
    if as_string:
        return loader.get_data(resource_name)
    else:
        return resource_name
It's not ideal, especially for very large files, but you can use StringIO to turn a string into something with a read() method, which csv.reader should be able to handle.
import csv, pkgutil
from io import StringIO
csvdata = pkgutil.get_data("curve", "ntc.10k.csv").decode("utf-8")  # get_data returns bytes on Python 3
csvio = StringIO(csvdata)
raw = csv.reader(csvio)
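A variant of the same idea, sketched here for Python 3, wraps the bytes in `io.BytesIO` and lets `io.TextIOWrapper` decode them on the fly, which avoids building a second fully decoded copy. The bytes literal below is made-up sample data standing in for what `pkgutil.get_data` would return, and UTF-8 is an assumed encoding.

```python
import csv
import io

# Stand-in for the bytes returned by pkgutil.get_data (made-up values)
raw_bytes = b"temp_c,resistance_ohm\n25,10000\n50,3588\n"

# Decode lazily while csv.reader iterates; newline="" is what the csv
# module recommends for file objects it reads from
text_stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))

for row in rows:
    print(row)
```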
Over 10 years after the question was asked, I came here via Google and went down the rabbit hole of the other answers. Nowadays this seems to be more straightforward. Below is my implementation using the stdlib's importlib.resources, which returns the filesystem path to the package's resource as a string. It should work with Python 3.7+, the version in which importlib.resources was added.
import importlib.resources
import os
def get_data_file_path(package: str, resource: str) -> str:
    """
    Returns the filesystem path of a resource marked as package
    data of a Python package installed.
    :param package: string of the Python package the resource is
        located in, e.g. "mypackage.module"
    :param resource: string of the filename of the resource (do not
        include directory names), e.g. "myfile.png"
    :return: string of the full (absolute) filesystem path to the
        resource if it exists.
    :raises ModuleNotFoundError: In case the package `package` is not found.
    :raises FileNotFoundError: In case the file in `resource` is not
        found in the package.
    """
    # Guard against non-existing files, or else importlib.resources.path
    # may raise a confusing TypeError.
    if not importlib.resources.is_resource(package, resource):
        raise FileNotFoundError(f"Python package '{package}' resource '{resource}' not found.")
    with importlib.resources.path(package, resource) as resource_path:
        return os.fspath(resource_path)
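On Python 3.9 and later, `importlib.resources.files()` offers another route that hands `csv.reader` an open text handle directly. The sketch below is self-contained only because it creates a throwaway package on the fly: the package name `demo_curves` and the csv contents are invented for the demonstration, and with a real installed package only the `files(...)` lines would be needed.

```python
import csv
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a throwaway package containing one data file (demo only)
    pkg_dir = Path(tmp_dir) / "demo_curves"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "ntc.10k.csv").write_text("25,10000\n50,3588\n")

    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        # Locate the resource relative to the package, then open it as text
        resource = importlib.resources.files("demo_curves") / "ntc.10k.csv"
        with resource.open("r", newline="") as handle:
            rows = list(csv.reader(handle))
    finally:
        # Undo the demo's changes to the import state
        sys.path.remove(tmp_dir)
        sys.modules.pop("demo_curves", None)

print(rows)
```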
Another way is to use json.loads() together with bytes.decode(). Since get_data() returns the data as bytes, you need to convert it to a string before processing it as JSON:
import json
import pkgutil
data_file = pkgutil.get_data('test.testmodel', 'data/test_data.json')
length_data_file = len(json.loads(data_file.decode()))