Python - faster library to compress data than ZipFile

I am developing a web application in Python 3.9 using Django 3.0.7 as the framework. I am writing a function that converts a dictionary to JSON and then compresses it into a zip file using the zipfile module. Currently this is the code in use:
import io
import json
import zipfile

def zip_dict(data: dict) -> bytes:
    with io.BytesIO() as archive:
        unzipped = bytes(json.dumps(data), "utf-8")
        with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_DEFLATED) as zipFile:
            zipFile.writestr(zinfo_or_arcname="data", data=unzipped)
        # read the buffer only after ZipFile closes and the central directory is written
        return archive.getvalue()
Then I save the zip in Azure Blob Storage. It works, but the problem is that this function is a bit slow for my purposes. I tried the zlib library, but the performance didn't change and, moreover, the created zip file seems to be corrupt (I can't even open it manually with WinRAR). Is there another library that increases compression speed (without hurting the compression ratio)?

Try adding a compresslevel=3 or compresslevel=1 argument when creating the zipfile.ZipFile() object, and see if that provides a more satisfactory speed and sufficient compression.
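For example, a minimal sketch of the question's function with a lower compression level; zipfile.ZipFile accepts compresslevel since Python 3.7, and zip_dict_fast is just an illustrative name:
import io
import json
import zipfile

def zip_dict_fast(data: dict) -> bytes:
    with io.BytesIO() as archive:
        payload = json.dumps(data).encode("utf-8")
        # compresslevel=1 favors speed over compression ratio (deflate accepts 0-9)
        with zipfile.ZipFile(archive, mode="w",
                             compression=zipfile.ZIP_DEFLATED,
                             compresslevel=1) as zf:
            zf.writestr("data", payload)
        return archive.getvalue()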

Related

Compress an existing large CSV file with Python using GZIP in low memory machine

I'm trying to take an existing CSV file, about 17 GB, on a small Windows 10 VM and compress it using gzip.
It's too large to read into memory.
I'm looking for a way to do this efficiently that doesn't involve partitioning the file.
This is what I'm trying right now, and it's quite slow:
import gzip
import shutil

with open('file_to_be_compressed.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
EDIT:
This is part of a pipeline where I'm downloading the data from one source and uploading it to another. The upload will be much faster if I can compress the files first (unfortunately they weren't already compressed on the source side).
So I have code that pulls a list of files, and for each file I'd like to compress it and then I'll upload it.
If I understand correctly, you are downloading a large .csv, you would ideally like to compress it on the fly, and then upload the compressed version.
If that's right, then you do not need to write either the uncompressed or compressed data to files. You can instead download a piece at a time, compress as you go along, and upload the compressed data.
To do that, use zlib. Create a zlib.compressobj() object, with a wbits value of 31, to select the gzip format. Let's say you call that object gz. Then use gz.compress(data) for each chunk of the .csv you download. Upload the result returned by each gz.compress(data). When the download is complete, upload the result of gz.flush() to complete the compressed data.
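For illustration, a minimal sketch of that streaming approach; download_chunks() and upload_part() are hypothetical placeholders for your own transfer code:
import zlib

gz = zlib.compressobj(level=1, wbits=31)  # wbits=31 selects the gzip container

for chunk in download_chunks():  # hypothetical: yields bytes as they arrive
    out = gz.compress(chunk)
    if out:  # compress() may buffer internally and return b''
        upload_part(out)  # hypothetical: uploads one compressed piece

upload_part(gz.flush())  # finish the stream: final deflate block plus gzip trailer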

read numpy compressed file using matlab [duplicate]

I was wondering if there is way to read .npy files in Matlab? I know I can convert those to Matlab-style .mat files using scipy.io.savemat in Python; however I'm more interested in a native or plugin support for .npy files in Matlab.
This did the job for me; I used it to read npy files:
https://github.com/kwikteam/npy-matlab
If you only want to read .npy files, all you need from the npy-matlab project are two files: readNPY.m and readNPYheader.m.
Usage is as simple as:
>> im = readNPY('/path/to/file.npy');
There is a C++ library available: https://github.com/rogersce/cnpy
You could write a MEX function to read the data. I would prefer to store everything in HDF5.
A quick way would be to read it in Python, as below (assuming import numpy as np):
data = np.load('/tmp/123.npz')
Then save it as .csv, again with Python:
numpy.savetxt('FileName.csv', arrayToSave)
(more documentation here)
Finally, you can read it in MATLAB using the following command:
csvread()
Quick Update:
As the user "Ender" mentioned in the comments, csvread() is now deprecated and readmatrix() replaces it. (documentation)
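Put together, a minimal Python-side sketch (assuming the archive stores a single array under NumPy's default key 'arr_0'; the result can then be read in MATLAB with readmatrix('FileName.csv')):
import numpy as np

data = np.load('/tmp/123.npz')
np.savetxt('FileName.csv', data['arr_0'], delimiter=',')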

Load .npz from a HTTP link

I use a web service to train some of my deep learning models through a Jupyter Notebook on AWS. For cost reasons, I would like to store my data as .npz files on my own server and load them straight into the memory of my remote machine.
The np.load() function doesn't work with HTTP links, and I wasn't able to make it work using urlretrieve either. I only got it working by downloading the data with wget and then loading the file from a local path. However, this doesn't fully solve my problem.
Any recommendations?
The thing is that if the first argument of np.load is a file-like object, it has to be seekable. From the documentation:
file : file-like object, string, or pathlib.Path
The file to read. File-like objects must support the seek() and read() methods. Pickled files require that the file-like object support the readline() method as well.
If you are going to serve those files over http and your server supports the Range headers, you could employ the implementation (Python 2) presented in this answer for example as:
F = HttpFile('http://localhost:8000/meta.data.npz')
with np.load(F) as data:
    a = data['arr_0']
    print(a)
Alternatively, you could fetch the entire file, create an in-memory file-like object and pass it to np.load:
from io import BytesIO
import numpy as np
import requests

r = requests.get('http://localhost:8000/meta.data.npz', stream=True)
data = np.load(BytesIO(r.raw.read()))
print(data['arr_0'])

zlib does not compress in standard zip format

I am using Python's zlib and I am doing the following:
Compress large strings in memory (zlib.compress)
Upload to S3
Download, read, and decompress the data as a string from S3 (zlib.decompress)
Everything is working fine, but when I directly download files from S3 and try to open them with a standard zip program, I get an error. I noticed that instead of PK, the beginning of the file is:
xµ}ko$7’םחע¯¸?ְ)$“שo³¶w¯1k{`
I am flexible and don't mind switching from zlib to another package, but it has to be pythonic (Heroku compatible).
Thanks!
zlib compresses a file; it does not create a ZIP archive. For that, see zipfile.
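A quick sketch that makes the difference visible (the two-byte signatures are standard: zlib streams typically start with 0x78, ZIP archives with PK):
import io
import zlib
import zipfile

payload = b"some long string" * 100

raw = zlib.compress(payload)
print(raw[:2])  # b'x\x9c' at the default level: a zlib header, not a ZIP archive

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", payload)
print(buf.getvalue()[:2])  # b'PK': the ZIP local file header signature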
If this is about compressing just strings, then zlib is the way to go. A zip file is for storing a file or even a whole directory tree of files; it keeps file metadata. It can (somehow) be used for storing just strings, but it is not appropriate for that.
If your application is just about storing and retrieving compressed strings, there is no point in "directly downloading files from S3 and try to open them with a standard zip program". Why would you do this?
Edit:
S3 generally is for storing files, not strings. You say you want to store strings. Are you sure that S3 is the right service for you? Did you look at SimpleDB?
Consider you want to stick to S3 and would like to upload zipped strings. Your S3 client library most likely expects to receive a file-like object to read from. To solve this efficiently, store the zipped string in a Python StringIO object (in an in-memory file) and provide this in-memory file to your S3 client library for uploading it to S3.
For downloading do the same. Use Python. Also for debugging purposes. There is no point in trying to force a string into a zipfile. There will be more overhead (due to file metadata) than by using plain zlibbed strings.
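A sketch of that in-memory round trip, assuming a boto3 client (on Python 3 the in-memory file is io.BytesIO, since compressed data is bytes; bucket and key names are placeholders):
import io
import zlib
import boto3

s3 = boto3.client("s3")
compressed = zlib.compress(b"the long string to store")

# Upload: wrap the compressed bytes in an in-memory file-like object
s3.upload_fileobj(io.BytesIO(compressed), "my-bucket", "my-key.zz")

# Download: read back into memory and decompress
buf = io.BytesIO()
s3.download_fileobj("my-bucket", "my-key.zz", buf)
original = zlib.decompress(buf.getvalue())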
An alternative to writing zip files just for debugging purposes, which is entirely the wrong format for your application, is to have a utility that can decompress zlib streams, which is entirely the right format for your application. That utility is pigz with the -z option.

Read EXE, MSI, and ZIP file metadata in Python in Linux

I am writing a Python script to index a large set of Windows installers into a DB.
I would like to know how to read the metadata information (Company, Product Name, Version, etc.) from EXE, MSI and ZIP files using Python running on Linux.
Software
I am using Python 2.6.5 on Ubuntu 10.04 64-bit with Django 1.2.1.
Found so far:
Windows command line utilities that can extract EXE metadata (like filever from SysUtils), or other individual CL utils that only work in Windows. I've tried running these through Wine but they have problems and it hasn't been worth the work to go and find the libs and frameworks that those CL utils depend on and try installing them in Wine/Crossover.
Win32 modules for Python that can do some things but won't run in Linux (right?)
Secondary question:
Obviously, changing the file's metadata would change the file's MD5 hash. Is there a general method of hashing a file independently of its metadata, besides locating the metadata and reading around it (e.g., skipping the first 1024 bytes)?
Take a look at this library (http://bitbucket.org/haypo/hachoir/wiki/Home) and at this example program that uses it (http://pypi.python.org/pypi/hachoir-metadata/1.3.3). The latter, hachoir-metadata, uses the Hachoir binary file manipulation library to parse the metadata.
The library can handle these formats:
Archives: bzip2, gzip, zip, tar
Audio: MPEG audio ("MP3"), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
Image: BMP, CUR, EMF, ICO, GIF, JPEG, PCX, PNG, TGA, TIFF, WMF, XCF
Misc: Torrent
Program: EXE
Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)
Additionally, Hachoir can do some file manipulation operations which I would assume includes some primitive metadata manipulation.
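For instance, a minimal sketch using the current hachoir API (the package layout has changed over the years; these imports match the modern hachoir3 documentation, and 'setup.exe' is a placeholder filename):
from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

parser = createParser("setup.exe")
metadata = extractMetadata(parser)  # may be None if the file cannot be parsed
for line in metadata.exportPlaintext():
    print(line)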
hachoir-metadata gets the "Product Version", but compilers change the "File Version", so the version returned is not the one we need.
I found a small, well-working solution:
http://pev.sourceforge.net/
I've tested it successfully. It's simple, fast and stable.
To answer one of your questions, you can use the zipfile module, specifically the ZipInfo object, to get the metadata for zip files.
As for hashing only the data of the file, you can only do that if you know which parts are data and which are metadata. There can be no general method, as many file formats store their metadata differently.
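On the zipfile point, a minimal sketch listing per-member metadata from an archive ('installer.zip' is a placeholder name):
import zipfile

with zipfile.ZipFile("installer.zip") as zf:
    for info in zf.infolist():  # each entry is a ZipInfo object
        print(info.filename, info.file_size, info.compress_size, info.date_time)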
To answer your second question: no, there is no way to hash a PE file or ZIP file, ignoring the metadata, without locating and reading the metadata. This is because the metadata you're interested in is stored at variable locations in the file.
In the case of PE files (EXE, DLL, etc), it's stored in a resource block, typically towards the end of the file, and a series of pointers and tables at the start of the file gives the location.
In the case of ZIP files, it's scattered throughout the archive -- each included file is preceded by its own metadata, and then there's a table at the end giving the locations of each metadata block. But it sounds like you might actually be wanting to read the files within the ZIP archive and look for EXEs in there if you're after program metadata; the ZIP archive itself does not store company names or version numbers.
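If that is the goal, a minimal sketch of pulling EXE members out of an archive for further parsing ('bundle.zip' is a placeholder):
import zipfile

with zipfile.ZipFile("bundle.zip") as zf:
    for name in zf.namelist():
        if name.lower().endswith(".exe"):
            exe_bytes = zf.read(name)  # raw bytes of the embedded EXE
            # hand exe_bytes to a PE metadata parser here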
