Transform zarr directory storage to zip storage

Transform zarr directory storage to zip storage - python

codes:
store = zarr.ZipStore("/mnt/test.zip", "r")
Problem description:
Hi, sry for bothering, I found this statement inside Zarr official documentation about ZipStorage:
Alternatively, use a DirectoryStore when writing the data, then manually Zip the directory and use the Zip file for subsequent reads.
I am trying to transform a DirectoryStorage format Zarr dataset to a ZipStorage. I use zip operation provided in Linux.
zip -r test.zip test.zarr here test.zarr is a directory storage dataset including three groups. However, when I try to use the codes above to open it, get the error as below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/eddie/miniconda3/envs/train/lib/python3.8/site-packages/zarr/storage.py", line 1445, in __init__
self.zf = zipfile.ZipFile(path, mode=mode, compression=compression,
File "/home/eddie/miniconda3/envs/train/lib/python3.8/zipfile.py", line 1190, in __init__
_check_compression(compression)
File "/home/eddie/miniconda3/envs/train/lib/python3.8/zipfile.py", line 686, in _check_compression
raise NotImplementedError("That compression method is not supported")
NotImplementedError: That compression method is not supported
I wonder if my compression method is wrong, and if there some workarounds to transform directory storage to zip storage or some other DB format, cause when the groups rise, the previous storage has so many nodes and not so convenient to transport. Thanks in advance.
Version and installation information
Value of zarr.__version__: 2.8.1
Value of numcodecs.__version__: 0.7.3
Version of Python interpreter: 3.8.0
Operating system (Linux/Windows/Mac): linux ubuntu 18.04
How Zarr was installed: pip

because zarr already uses compression, there is no need to use compression when creating the zip archive. I.e., you can use zip -r -0 to store files in the zip archive only, without compression.
Also, you might need to be careful about the paths that get stored within the zip archive. E.g., if I have a zarr hierarchy in some directory "/path/to/foo" and I want to store this into a zip file at "/path/to/bar.zip" I would do:
cd /path/to/foo
zip -r0 /path/to/bar.zip
This ensures that the paths that get stored within the zip archive are relative to the original root directory.

After zip with -r0 option, you can try with store = zarr.ZipStore("/mnt/test.zip"), this way, you won't get the error any more.

Related

Python error: "That compression method is not supported"

I am trying to decompress some .zip or .rar archives, and i am getting the error "That Compression methond is not supported". All the files from this directory are .zip files.
import rarfile
import sys
import os, zipfile
from tkinter import *
from tkinter import filedialog
from tkinter import messagebox
ZipExtension='.zip'
RarExtension='.rar'
#filesZIP="..\directory"
try:
os.chdir(filesZIP) # change directory from working dir to dir with files
except:
messagebox.showerror("Error","The folder with the archives was not selected! Please run the app again and select the folder.")
sys.exit()
for item in os.listdir(filesZIP):# loop through items in dir
if item.endswith(ZipExtension): # check for ".zip" extension
file_name = os.path.abspath(item) # get full path of files
zip_ref = zipfile.ZipFile(file_name) # create zipfile object
zip_ref.extractall(filesZIP) # extract file to dir
zip_ref.close() # close file
for item in os.listdir(filesZIP):
if item.endswith(RarExtension):
file_name = os.path.abspath(item)
rar_ref = rarfile.RarFile(file_name)
rar_ref.extractall()
rar_ref.close()
messagebox.showinfo("Information",'Successful!')
The problem is that sometimes it works, and in some cases, like the one above, it gives me that error, even though there are all .zip files, with no password

Background
By design zip archives support at lot of different compression methods. The support for these different compression methods in python varies depending on the version of the zipfile library you are running.
With Python 2.x, I see zipfile supports only deflate and store
zipfile.ZIP_STORED
The numeric constant for an uncompressed archive member.
zipfile.ZIP_DEFLATED
The numeric constant for the usual ZIP compression method. This requires the zlib module. No other compression methods are currently supported.
while with Python 3, zipfile supports a few more
zipfile.ZIP_STORED
The numeric constant for an uncompressed archive member.
zipfile.ZIP_DEFLATED
The numeric constant for the usual ZIP compression method. This requires the zlib module.
zipfile.ZIP_BZIP2
The numeric constant for the BZIP2 compression method. This requires the bz2 module.
New in version 3.3.
zipfile.ZIP_LZMA
The numeric constant for the LZMA compression method. This requires the lzma module
What Compression Methods are being used?
To see if this is your issue, you first need to see what compression method is actually being used in your zip files.
Let me work though an example to see how that works.
First create a zip file using bzip2 compression
zip -Z bzip2 /tmp/try.zip /tmp/in.txt
Let's check what unzip can tell us about the compression method it actually used.
$ unzip -lv try.zip
Archive: try.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
387776 BZip2 30986 92% 2022-09-20 14:11 f3d1fbaf in.txt
-------- ------- --- -------
387776 30986 92% 1 file
In unzip the Method column says it is using Bzip2 compression. I'm sure that WinZip has an equivalent report.
Unzip with Python 2.7
Next try uncompressing this zip file with Python 2.7 - I'll use the code below with Python 2 & Python 3
import zipfile
zip_ref = zipfile.ZipFile('/tmp/try.zip')
if zip_ref.testzip() is None:
print("zip file is ok")
zip_ref.close()
First Python 2.7 -- that matches what you are seeing. So that confirms that zipfile with Python 2.7 doesn't support bziip2 compression.
$ python2.7 /tmp/z.py
Traceback (most recent call last):
File "/tmp/z.py", line 4, in <module>
if zip_ref.testzip() is None:
File "/usr/lib/python2.7/zipfile.py", line 921, in testzip
with self.open(zinfo.filename, "r") as f:
File "/usr/lib/python2.7/zipfile.py", line 1033, in open
close_fileobj=should_close)
File "/usr/lib/python2.7/zipfile.py", line 553, in __init__
raise NotImplementedError("compression type %d (%s)" % (self._compress_type, descr))
NotImplementedError: compression type 12 (bzip2)
Unzip with Python 3.10
Next with Python 3.10.
$ python3.10 /tmp/z.py
zip file is ok
As expected, all is fine in this instance -- zipdetails with Python 3 does support bzip2 compression.

Unzip folder by chunks in python

I have a big zip file containing many files that i'd like to unzip by chunks to avoid consuming too much memory.
I tried to use python module zipfile but I didn't find a way to load the archive by chunk and to extract it on disk.
Is there simple way to do that in python ?
EDIT
#steven-rumbalski correctly pointed that zipfile correctly handle big files by unzipping the files one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and that my EC2 instance cannot load such a big file in RAM so I download it by chunks and I would like to unzip it by chunk.

You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/

You can use zipfile (or possibly tarfile) as follows:
import zipfile
def extract_chunk(fn, directory, ix_begin, ix_end):
with zipfile.ZipFile("{}/file.zip".format(directory), 'r') as zf:
infos = zf.infolist()
print(infos)
for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
zf.extract(infos[ix], directory)
zf.close()
directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

Transparently mount a tar.gz archive with Python

How can I mount a tar.gz archive transparently with Python?
I have a tar.gz archive whose contents have to be read by an external program. The contents will only be needed temporarily. I could just unpack it to a temporary folder and point my external program there to read it. Afterwards, I could just delete the temp folder again. However, the archives may be large (>1 GB when extracted) so that unpacking them will take up a lot of space on the disk. My server is rather weak regarding HD performance and I cannot waste space ad lib but it does have a lot of RAM and CPU power.
That's why I want to try to mount the archive transparently without unpacking it entirely. I came across archivemount which seems to do exactly what I want. Is there a way to do what archivemount does in pure Python? No subprocess.call "solutions", please. It should run on 64-bit Linux.
I believe there should be a smart way to use tarfile to access archive's contents and then fusepy to create a user-space file system which exposes the contents of the archive. Has anyone already put these pieces together? Any ideas?
If you think that this is not a good idea, please post relevant comments. If you know what is better, please comment.

As of version 0.3.1 of my ratarmount module, you can use it or take a look at its source to mount a .tar.gz in Python. The gzip seeking support is from the dependency indexed_gzip. Ratarmount itself is based on tarindexer, which implements the idea to use tarfile to get offsets and then seek to it. But, ratarmount adds a FUSE layer among other usability and performance features.
You can install ratarmount from PyPI:
pip3 install --user ratarmount
and then call its command line interface directly from python like so:
import ratarmount
ratarmount.cli( [ '--help' ] )
ratarmount.cli( [ pathToTar, pathToMountPoint ] )
The heart of the module is as you already surmised tarfile, which is used to iterate over all TarInfo objects and create a list of filepath,offset,size, which then can be used to seek directly to the offset in the raw tar file and the simply read the next size bytes. This works because TAR is that simple of a format.
Here is the unoptimized and very bare core idea:
import sys
import tarfile
from indexed_gzip import IndexedGzipFile
targzfile = sys.argv[1]
filetoprint = sys.argv[2]
index = {} # path : ( offset, size )
file = IndexedGzipFile( targzfile )
for tarinfo in tarfile.open( fileobj = file, mode = 'r|' ):
index[tarinfo.name] = ( tarinfo.offset_data, tarinfo.size )
# at this point you could save or load the index for faster consecutive file seeks
file.seek( index[filetoprint][0] )
sys.stdout.buffer.write( file.read( index[filetoprint][1] ) )
The above example was tested to work with:
wget -O- 'https://ftp.mozilla.org/pub/firefox/releases/70.0/linux-x86_64/en-US/firefox-70.0.tar.bz2' | bzip2 -d -c | gzip > firefox.tgz
python3 minimal-example.py firefox.tgz firefox/updater.ini

Create a SFX archive using python

I am looking for some help with python script to create a Self Extracting Archive (SFX) an exe file which can be created by WinRar basically.
I would want to archive a folder with password protection and also split volume by 3900 MB so that it can be easily burned to a disk.
I know WinRar has command line parameters to create a archive, but i am not sure how to call it via python anyhelp on this would be of great help.
Here are main things I want:
Archive Format - RAR
Compression Method Normal
Split Volume size, 3900 MB
Password protection
I looked up everywhere but don't seem to find anything around this functionality.

You could have a look at rarfile
Alternatively use something like:
from subprocess import call
cmdlineargs = "command -switch1 -switchN archive files.. path_to_extract"
call(["WinRAR"] + cmdlineargs.split())
Note in the second line you will need to use the correct command line arguments, the ones above are just as an example.

Python ZipFile giving different namelist than unzipping utility

I have a bunch of timestamped .jpgs in a zip file, and when I open that zip file using Python's ZipFile package, I see three files:
>>> cameraZip = zipfile.ZipFile(zipPath, 'r')
>>> cameraZip.namelist()
['20131108_200152.jpg', '20131108_203158.jpg', '20131108_205521.jpg']
When I unpack the file using Mac OSX's default .zip unexpander, I get 371 files, from '20131101_000159.jpg' up to '20131108_193152.jpg'.
Unzipping this file gives the same result as the .zip unexpander:
$ unzip 2013.11.zip
extracting: 20131101_000159.jpg
extracting: 20131101_003156.jpg
...
extracting: 20131108_190155.jpg
extracting: 20131108_193152.jpg
Anybody have any idea what's going on?

Most likely the problem is in zip central directory record, which wasn't correctly flushed when zip file was created. While Python looks for central directory (I guess), other implementations process local file headers and found all of them.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.