zlib does not compress in standard zip format - python

I am using Python's zlib module and I am doing the following:
1. Compress large strings in memory (zlib.compress)
2. Upload to S3
3. Download, read, and decompress the data as a string from S3 (zlib.decompress)
Everything is working fine, but when I download the files directly from S3 and try to open them with a standard zip program, I get an error. I noticed that instead of PK, the beginning of the file is:
xµ}ko$7’םחע¯¸?ְ)$“שo³¶w¯1k{`
I am flexible and don't mind switching from zlib to another package, but it has to be pure Python (Heroku compatible).
Thanks!

zlib compresses a file; it does not create a ZIP archive. For that, see zipfile.
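The difference is visible in the first bytes of the output. The sketch below is only an illustration (the payload and the member name "data.txt" are made up): it compresses the same data once as a raw zlib stream and once as a ZIP archive with zipfile, then prints the leading bytes of each.

```python
import io
import zipfile
import zlib

payload = b"hello world " * 100  # made-up sample data

# Raw zlib stream: a 2-byte header, deflate data, and a checksum. No archive structure.
zlib_bytes = zlib.compress(payload)

# ZIP archive holding the same data as one member (the name "data.txt" is arbitrary).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data.txt", payload)
zip_bytes = buffer.getvalue()

print(zlib_bytes[:1])  # b'x' (0x78), the start of a zlib header
print(zip_bytes[:2])   # b'PK', the signature a zip program looks for
```

This is why the downloaded file starts with "x" rather than "PK": it is a valid zlib stream, just not a ZIP archive.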

If this is about compressing just strings, then zlib is the way to go. A zip file is for storing a file or even a whole directory tree of files, and it keeps file metadata. It can be made to hold plain strings, but it is not the appropriate format for that.
If your application is just about storing and retrieving compressed strings, there is no point in "directly downloading files from S3 and try to open them with a standard zip program". Why would you do this?
Edit:
S3 generally is for storing files, not strings. You say you want to store strings. Are you sure that S3 is the right service for you? Did you look at SimpleDB?
Suppose you want to stick with S3 and would like to upload zipped strings. Your S3 client library most likely expects to receive a file-like object to read from. To solve this efficiently, store the zipped string in a Python StringIO object (an in-memory file; in Python 3 this is io.BytesIO, since compressed data is bytes) and hand this in-memory file to your S3 client library for uploading to S3.
For downloading, do the same. Use Python for debugging purposes as well. There is no point in trying to force a string into a zipfile. There will be more overhead (due to file metadata) than with plain zlibbed strings.
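A minimal sketch of that idea, assuming Python 3. The helper names compress_to_fileobj and decompress_from_fileobj are hypothetical, and the actual S3 upload and download calls are left out because they depend on the client library in use.

```python
import io
import zlib


def compress_to_fileobj(text: str) -> io.BytesIO:
    """Compress text with zlib and wrap the result in an in-memory file."""
    return io.BytesIO(zlib.compress(text.encode("utf-8")))


def decompress_from_fileobj(fileobj) -> str:
    """Read a zlib stream from a file-like object and return the original text."""
    return zlib.decompress(fileobj.read()).decode("utf-8")


# The in-memory file can be passed to whatever upload call the S3 client offers.
upload_body = compress_to_fileobj("some large string " * 1000)
restored = decompress_from_fileobj(upload_body)
```

Nothing touches the filesystem here; the same file-like object works for both directions.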

An alternative to writing zip files just for debugging purposes, which is entirely the wrong format for your application, is to have a utility that can decompress zlib streams, which is entirely the right format for your application. That utility is pigz with the -z option.
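If pigz is not at hand, a few lines of Python can serve the same debugging purpose. This is only a sketch using the standard zlib module; the function name and the chunk size are arbitrary choices, not part of any existing tool.

```python
import zlib


def decompress_zlib_file(src_path: str, dst_path: str, chunk_size: int = 64 * 1024) -> None:
    """Decompress a file holding a zlib stream, reading it in chunks."""
    decompressor = zlib.decompressobj()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(decompressor.decompress(chunk))
        # Write out anything still buffered inside the decompressor.
        dst.write(decompressor.flush())
```

Calling decompress_zlib_file on a file downloaded from S3 would write the original string to a second file that any text editor can open.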

Related

Python - faster library to compress data than ZipFile

I am developing a web application in Python 3.9 using django 3.0.7 as a framework. In Python I am creating a function that can convert a dictionary to a json and then to a zip file using the ZipFile library. Currently this is the code in use:
import io
import json
import zipfile

def zip_dict(data: dict) -> bytes:
    with io.BytesIO() as archive:
        unzipped = bytes(json.dumps(data), "utf-8")
        with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_DEFLATED) as zipFile:
            zipFile.writestr(zinfo_or_arcname="data", data=unzipped)
        return archive.getvalue()
Then I save the zip in the Azure Blob Storage. It works but the problem is that this function is a bit slow for my purposes. I tried to use the zlib library but the performance doesn't change and, moreover, the created zip file seems to be corrupt (I can't even open it manually with WinRAR). Is there any other library to increase the compression speed (without touching the compression ratio)?
Try adding a compresslevel=3 or compresslevel=1 parameter to the zipfile.ZipFile() object creation, and see if that provides a more satisfactory speed and sufficient compression.
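As a sketch of that suggestion, here is the question's function with the level exposed as a parameter. The compresslevel argument of zipfile.ZipFile requires Python 3.7 or later; the default of 1 below is only a starting point to experiment with, not a measured optimum.

```python
import io
import json
import zipfile


def zip_dict(data: dict, compresslevel: int = 1) -> bytes:
    """Serialize a dict to JSON and store it in an in-memory ZIP archive."""
    with io.BytesIO() as archive:
        unzipped = json.dumps(data).encode("utf-8")
        with zipfile.ZipFile(
            archive,
            mode="w",
            compression=zipfile.ZIP_DEFLATED,
            compresslevel=compresslevel,  # 1 = fastest, 9 = smallest output
        ) as zip_file:
            zip_file.writestr("data", unzipped)
        # Read the buffer only after the ZipFile is closed, so the
        # central directory has been written.
        return archive.getvalue()
```

Lower levels trade some compression ratio for speed, so it is worth timing a few levels against representative data.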

download a .gz file from a subdirectory in a s3 bucket using boto

I have a file named combine.gz which I need to download from a subfolder on S3. I am able to get to the combine.gz files (specifically one per directory) but I am unable to find a method in boto to read the .gz files to my local machine.
All I can find are the boto.utils.fetch_file, key.get_contents_to_filename, and key.get_contents_to_file methods, all of which, as I understand it, directly stream the contents of the file.
Is there a way for me to first read the compressed file in .gz format onto my local machine from S3 using boto and then uncompress it?
Any help would be much appreciated.
You can read the full contents as a string and then manage it as a string object. This is very dangerous and could lead to memory or buffer issues so be careful.
Check into using cStringIO.StringIO, gzip.GzipFile, and boto:
import cStringIO  # Python 2 only
import gzip
# key is the boto Key object for combine.gz
datastring = key.get_contents_as_string()
data = cStringIO.StringIO(datastring)
rawdata = gzip.GzipFile(fileobj=data).read()
Again, be careful, as this has lots of memory and potential security issues in the event the gzip file is malformed. You'll want to wrap it in try/except and code defensively if you don't control both sides.
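One way the defensive version might look in Python 3, where io.BytesIO takes the place of cStringIO. This is a sketch under assumptions: the function name safe_gunzip and the 100 MB cap are hypothetical, and the bytes are assumed to have already been fetched from S3.

```python
import gzip
import io
import zlib

MAX_BYTES = 100 * 1024 * 1024  # hypothetical cap on the decompressed size


def safe_gunzip(blob: bytes, max_bytes: int = MAX_BYTES) -> bytes:
    """Decompress gzip data, rejecting malformed or oversized input."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(blob)) as gz:
            # Read one byte past the cap so an oversized stream can be detected
            # without decompressing all of it.
            data = gz.read(max_bytes + 1)
    except (OSError, EOFError, zlib.error) as exc:
        raise ValueError("not a valid gzip stream") from exc
    if len(data) > max_bytes:
        raise ValueError("decompressed data exceeds the size limit")
    return data
```

The size cap guards against a small download that expands into a very large result, which is the main memory risk when the other side is not trusted.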

Python GCS how to rename file within inner zip file?

Suppose I have a file hosted on GCS in a Python AppEngine project. Unfortunately, the file structure is something like:
outer.zip/
  - inner.zip/
    - vid_file
    - png_file
The problem is that the two files inside inner.zip do not have their extensions (.mp4 and .png), and it's causing all sorts of trouble. How do I rename the files so that the structure looks like this:
outer.zip/
  - inner.zip/
    - vid_file.mp4
    - png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided at their birth (uninterpreted by GCS itself) and thus they will stay.
The best you can do is create a new object which is almost like the original except it fixes little errors and oopses you did when creating the original. Then you can overwrite (i.e completely replace) the original with the new, improved version.
Hopefully it's a one-off terrible mistake you made just once and now want to fix so it's not worth writing a program for that. Just download that GCS object, use normal tools to unzip it and unzip any further zipfiles it may contain, do the fixes on the filesystem with your favorite local filesystem tools, zip things up again, upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., no editing in-place. The only way to accomplish what you're talking about would be to download the current file, unzip it, update the new files, re-zip the files into the same-named file, and upload to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network efficient but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
Use "import zipfile" and you can unzip the file once it's downloaded into GCS storage.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
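Since the files are small, the unzip, rename, and re-zip steps can be done entirely in memory with zipfile. The sketch below assumes Python 3 and covers only the archive rewriting: downloading outer.zip from GCS and uploading the result are left to whichever GCS client is in use. The function names and the RENAMES mapping are hypothetical, built from the names in the question.

```python
import io
import zipfile

# Hypothetical mapping from the names in the question to the desired names.
RENAMES = {"vid_file": "vid_file.mp4", "png_file": "png_file.png"}


def rename_members(zip_bytes: bytes, renames: dict) -> bytes:
    """Return a new ZIP in which members are renamed according to `renames`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            new_name = renames.get(info.filename, info.filename)
            dst.writestr(new_name, src.read(info))
    return out.getvalue()


def fix_outer_zip(outer_bytes: bytes, inner_name: str = "inner.zip") -> bytes:
    """Rebuild the outer ZIP, rewriting only the inner archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if info.filename == inner_name:
                data = rename_members(data, RENAMES)
            dst.writestr(info.filename, data)
    return out.getvalue()
```

The bytes returned by fix_outer_zip would then be uploaded under the original object name, replacing the old object in one step.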

Upload image with an in-memory stream to input using Pillow + WebDriver?

I'm getting an image from a URL with Pillow and creating a stream (BytesIO/StringIO).
r = requests.get("http://i.imgur.com/SH9lKxu.jpg")
stream = Image.open(BytesIO(r.content))
I want to upload this image using an <input type="file" /> with selenium WebDriver. I can do something like this to upload a file:
self.driver.find_element_by_xpath("//input[@type='file']").send_keys("PATH_TO_IMAGE")
I would like to know if it's possible to upload that image from a stream without having to mess with files / file paths. I'm trying to avoid filesystem reads and writes and do it in memory, or at most with temporary files. I'm also wondering if that stream could be encoded to Base64 and then uploaded by passing the string to the send_keys function you can see above :$
PS: Hope you like the image :P
You seem to be asking multiple questions here.
First, how do you convert a JPEG without downloading it to a file? You're already doing that, so I don't know what you're asking here.
Next, "And do it in-memory or as much with temporary files." I don't know what this means, but you can do it with temporary files with the tempfile library in the stdlib, and you can do it in-memory too; both are easy.
Next, you want to know how to do a streaming upload with requests. The easy way to do that, as explained in Streaming Uploads, is to "simply provide a file-like object for your body". This can be a tempfile, but it can just as easily be a BytesIO. Since you're already using one in your question, I assume you know how to do this.
(As a side note, I'm not sure why you're using BytesIO(r.content) when requests already gives you a way to use a response object as a file-like object, and even to do it by streaming on demand instead of by waiting until the full content is available, but that isn't relevant here.)
If you want to upload it with selenium instead of requests… well then you do need a temporary file. The whole point of selenium is that it's scripting a web browser. You can't just type a bunch of bytes at your web browser in an upload form, you have to select a file on your filesystem. So selenium needs to fake you selecting a file on your filesystem. This is a perfect job for tempfile.NamedTemporaryFile.
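A sketch of the temporary-file route, assuming the image bytes are already in memory (for example r.content from the question). The helper name write_temp_image is hypothetical, and the selenium call appears only as a comment because it needs a running browser.

```python
import os
import tempfile


def write_temp_image(image_bytes: bytes, suffix: str = ".jpg") -> str:
    """Write bytes to a named temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so the browser can still open it by path.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(image_bytes)
    return tmp.name


# Usage outline (not run here):
#   path = write_temp_image(r.content)
#   driver.find_element_by_xpath("//input[@type='file']").send_keys(path)
#   os.remove(path)  # clean up once the upload has finished
```

Closing the file before handing the path to the browser also avoids problems on platforms where an open temporary file cannot be opened a second time.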
Finally, "I'm also Wondering If that stream could be encoded to Base64".
Sure it can. Since you're just converting the image in-memory, you can just encode it with, e.g., base64.b64encode. Or, if you prefer, you can wrap your BytesIO in a codecs wrapper to base-64 it on the fly. But I'm not sure why you want to do that here.
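For illustration, encoding an in-memory stream with base64.b64encode takes one call. The bytes below are a stand-in for the downloaded image content, not real image data.

```python
import base64
from io import BytesIO

# Stand-in for BytesIO(r.content) from the question.
stream = BytesIO(b"\xff\xd8\xff\xe0 stand-in bytes for the image")

# b64encode returns bytes; decode to get an ordinary ASCII string.
encoded = base64.b64encode(stream.getvalue()).decode("ascii")
```

The resulting string is still not something a file input will accept through send_keys, which expects a path on the filesystem.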

Read EXE, MSI, and ZIP file metadata in Python in Linux

I am writing a Python script to index a large set of Windows installers into a DB.
I would like to know how to read the metadata information (Company, Product Name, Version, etc.) from EXE, MSI and ZIP files using Python running on Linux.
Software
I am using Python 2.6.5 on Ubuntu 10.04 64-bit with Django 1.2.1.
Found so far:
Windows command line utilities that can extract EXE metadata (like filever from SysUtils), or other individual CL utils that only work in Windows. I've tried running these through Wine but they have problems and it hasn't been worth the work to go and find the libs and frameworks that those CL utils depend on and try installing them in Wine/Crossover.
Win32 modules for Python that can do some things but won't run in Linux (right?)
Secondary question:
Obviously changing the file's metadata would change the MD5 hashsum of the file. Is there a general method of hashing a file independent of the metadata, besides locating it and reading it in (for example, skipping the first 1024 bytes)?
Take a look at this library: http://bitbucket.org/haypo/hachoir/wiki/Home and this example program that uses it: http://pypi.python.org/pypi/hachoir-metadata/1.3.3. The example program uses the Hachoir binary file manipulation library (first link) to parse the metadata.
The library can handle these formats:
Archives: bzip2, gzip, zip, tar
Audio: MPEG audio ("MP3"), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
Image: BMP, CUR, EMF, ICO, GIF, JPEG, PCX, PNG, TGA, TIFF, WMF, XCF
Misc: Torrent
Program: EXE
Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)
Additionally, Hachoir can do some file manipulation operations which I would assume includes some primitive metadata manipulation.
hachoir-metadata returns the "Product Version", but compilers change the "File Version".
So the version it returns is not the one we need.
I found a small, well-working solution:
http://pev.sourceforge.net/
I've tested it successfully. It's simple, fast and stable.
To answer one of your questions, you can use the zipfile module, specifically the ZipInfo object to get the metadata for zip files.
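A short sketch of reading that metadata, written for Python 3 (the question uses Python 2.6, where the same ZipInfo attributes exist). The function name zip_metadata and the dictionary keys are arbitrary choices for the example.

```python
import io
import zipfile


def zip_metadata(fileobj):
    """Return a list of per-member metadata dictionaries for a ZIP archive."""
    with zipfile.ZipFile(fileobj) as archive:
        return [
            {
                "name": info.filename,
                "size": info.file_size,               # uncompressed size in bytes
                "compressed_size": info.compress_size,
                "modified": info.date_time,           # (year, month, day, hour, minute, second)
                "crc32": info.CRC,
            }
            for info in archive.infolist()
        ]
```

zipfile.ZipFile accepts either a path or a file-like object, so the same function works for files on disk and for archives held in memory.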
As for hashing only the data of the file, you can only do that if you know which parts are data and which are metadata. There can be no general method, as many file formats store their metadata differently.
To answer your second question: no, there is no way to hash a PE file or ZIP file, ignoring the metadata, without locating and reading the metadata. This is because the metadata you're interested in is stored at variable locations in the file.
In the case of PE files (EXE, DLL, etc), it's stored in a resource block, typically towards the end of the file, and a series of pointers and tables at the start of the file gives the location.
In the case of ZIP files, it's scattered throughout the archive -- each included file is preceded by its own metadata, and then there's a table at the end giving the locations of each metadata block. But it sounds like you might actually be wanting to read the files within the ZIP archive and look for EXEs in there if you're after program metadata; the ZIP archive itself does not store company names or version numbers.
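For ZIP archives specifically, a format-aware hash is straightforward because zipfile already locates the metadata. The sketch below hashes only member names and decompressed contents, so timestamps and compression settings do not affect the result. Treating names as part of the data is a choice made for this example, as are the function name and the use of SHA-256.

```python
import hashlib
import io
import zipfile


def zip_content_digest(fileobj) -> str:
    """Hash member names and contents of a ZIP, ignoring other metadata."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(fileobj) as archive:
        # Sort the names so the order of members in the archive does not matter.
        for name in sorted(archive.namelist()):
            digest.update(name.encode("utf-8"))
            digest.update(archive.read(name))
    return digest.hexdigest()
```

Two archives that contain the same files but were created at different times then produce the same digest, even though an MD5 of the raw archive bytes would differ. Nothing comparable exists for PE files in the standard library.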
