I am writing a Python script to index a large set of Windows installers into a DB.
I would like to know how to read the metadata information (Company, Product Name, Version, etc.) from EXE, MSI and ZIP files using Python running on Linux.
Software
I am using Python 2.6.5 on Ubuntu 10.04 64-bit with Django 1.2.1.
Found so far:
Windows command line utilities that can extract EXE metadata (like filever from SysUtils), or other individual CL utils that only work on Windows. I've tried running these through Wine, but they have problems, and it hasn't been worth the work to track down the libraries and frameworks those utilities depend on and install them under Wine/Crossover.
Win32 modules for Python that can do some things but won't run in Linux (right?)
Secondary question:
Obviously changing the file's metadata would change its MD5 hash. Is there a general method of hashing a file independent of the metadata, short of locating the metadata and reading around it (e.g. skipping the first 1024 bytes)?
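For what it's worth, the "skip the first N bytes" idea from the question is easy to sketch; the 1024-byte skip below is an arbitrary example, not a real PE or ZIP layout (the answers explain why no fixed offset works in general):

```python
import hashlib

def hash_skipping_prefix(path, skip=1024, chunk_size=65536):
    """Hash a file's contents, ignoring the first `skip` bytes.

    Note: 1024 is an arbitrary example offset -- PE and ZIP metadata
    are NOT at a fixed location, so this is only a sketch of the idea.
    """
    h = hashlib.md5()
    with open(path, 'rb') as f:
        f.seek(skip)
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```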
Take a look at the Hachoir binary file manipulation library: http://bitbucket.org/haypo/hachoir/wiki/Home and at hachoir-metadata, an example program that uses it to parse metadata: http://pypi.python.org/pypi/hachoir-metadata/1.3.3.
The library can handle these formats:
Archives: bzip2, gzip, zip, tar
Audio: MPEG audio ("MP3"), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
Image: BMP, CUR, EMF, ICO, GIF, JPEG, PCX, PNG, TGA, TIFF, WMF, XCF
Misc: Torrent
Program: EXE
Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)
Additionally, Hachoir can do some file manipulation operations which I would assume includes some primitive metadata manipulation.
hachoir-metadata reads the "Product Version", but compilers change the "File Version", so the version it returns is not the one we need.
I found a small, well-working solution:
http://pev.sourceforge.net/
I've tested it successfully. It's simple, fast and stable.
To answer one of your questions, you can use the zipfile module, specifically the ZipInfo object to get the metadata for zip files.
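A minimal sketch of that (on a modern Python): `infolist()` returns `ZipInfo` objects whose attributes carry the per-member metadata.

```python
import zipfile

def zip_member_metadata(path):
    """Return (name, size, modification time) for each member of a ZIP,
    taken from the archive's own ZipInfo records."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size, info.date_time)
                for info in zf.infolist()]
```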
As for hashing only the data of the file, you can only do that if you know which parts are data and which are metadata. There can be no general method, as different file formats store their metadata differently.
To answer your second question: no, there is no way to hash a PE file or ZIP file, ignoring the metadata, without locating and reading the metadata. This is because the metadata you're interested in is stored at variable locations in the file.
In the case of PE files (EXE, DLL, etc), it's stored in a resource block, typically towards the end of the file, and a series of pointers and tables at the start of the file gives the location.
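To illustrate that pointer chain, here is a sketch that follows the DOS header's `e_lfanew` field (at offset 0x3C) to the `PE\0\0` signature; real version-resource parsing would have to continue from there through the section table, which is why there is no fixed offset to skip:

```python
import struct

def pe_header_offset(data):
    """Follow the start of the PE pointer chain: the DOS header stores
    the offset of the 'PE\\0\\0' signature in its 4-byte e_lfanew field
    at byte 0x3C."""
    if data[:2] != b'MZ':
        raise ValueError('not a PE file')
    (e_lfanew,) = struct.unpack_from('<I', data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b'PE\x00\x00':
        raise ValueError('PE signature not found')
    return e_lfanew
```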
In the case of ZIP files, it's scattered throughout the archive -- each included file is preceded by its own metadata, and then there's a table at the end giving the locations of each metadata block. But it sounds like you might actually be wanting to read the files within the ZIP archive and look for EXEs in there if you're after program metadata; the ZIP archive itself does not store company names or version numbers.
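A sketch of that last point, listing the EXE members of an archive with the standard `zipfile` module (each returned name would then need to be extracted and parsed as a PE file to get company/version data):

```python
import zipfile

def exe_members(path):
    """List the .exe files inside a ZIP archive; their version metadata
    would then have to be read from each extracted PE file."""
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith('.exe')]
```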
Related
The .dss (or .ds2) file format is used for dictation devices e.g. by Philips or Olympus and stores metadata on the dictation file in addition to the audio information.
Is there a way to somehow read out this metadata using a simple Python routine?
One idea was to read the file in binary format, but I could not manage it by myself.
Help anyone :-) ?
Sample file (short dictation with metadata) available here: https://www.dropbox.com/s/g5uk22prkqht372/TH10094.DSS?dl=0
In the sample file there are the metadata entries "Sofort", "Heartbeat" and "WGB", which need to be looked for. I can't find them, though.
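Since the DSS format is undocumented, one hedged starting point is simply scanning the file's leading bytes for the expected strings; if they don't turn up, they are probably stored in another encoding (e.g. UTF-16) or compressed. The helper below and its 4096-byte window are my own choices for illustration, not part of any DSS specification:

```python
def find_strings(path, needles, window=4096):
    """Search the first `window` bytes of a binary file for known
    metadata strings. Returns each needle's byte offset, or -1 if
    absent. A plain ASCII search misses text stored in other
    encodings, so a miss here is not conclusive."""
    with open(path, 'rb') as f:
        header = f.read(window)
    return {needle: header.find(needle.encode('ascii'))
            for needle in needles}
```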
.doc files, .pdf files, and some image formats all contain metadata about the file, such as the author.
Is a .py file just a plain text file whose contents are all visible once opened with a code editor like Sublime, or does it also contain metadata? If so, how does one access this metadata?
On Linux and most Unixes, .py's are just text (sometimes unicode text).
On Windows and Mac, there are cubbyholes where you can stash data, but I doubt Python uses them.
.pyc's, on the other hand, have at least a little metadata in them - or so I've heard. Specifically: there's supposed to be a timestamp in them, so that if you copy a filesystem hierarchy, Python won't automatically recreate all the .pyc's on import. There may or may not be more.
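For what it's worth, that timestamp can be read with a few lines of `struct`. The header layout assumed below is the CPython 3.7+ one (4-byte magic, 4-byte flags, 4-byte source mtime, 4-byte source size); older versions use a different layout, and hash-based .pyc's repurpose the mtime field, so treat this as version-specific:

```python
import struct

def pyc_timestamp(path):
    """Read the source-mtime field from a CPython 3.7+ .pyc header.

    Header layout (3.7+): 4-byte magic, 4-byte flags, 4-byte source
    mtime, 4-byte source size. If the flags mark a hash-based .pyc,
    the third field is a source hash instead of a timestamp.
    """
    with open(path, 'rb') as f:
        header = f.read(16)
    magic, flags, mtime, size = struct.unpack('<4sIII', header)
    return mtime
```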
I want to open a .fif file of around 800 MB. I googled and found that this kind of file can be opened with Photoshop. Is there a way to extract the images and store them in some other standard format using Python or C++?
This is probably an EEG or MEG data file. The full specification is here, and it can be read in with the MNE package in Python.
import mne
raw = mne.io.read_raw_fif('filename.fif')
FIF stands for Fractal Image Format and seems to be output of the Genuine Fractals Plugin for Adobe's Photoshop. Unfortunately, there is no format specification available and the plugin claims to use patented algorithms so you won't be able to read these files from within your own software.
There however are other tools which can do fractal compression. Here's some information about one example. While this won't allow you to open FIF files from the Genuine Fractals Plugin, it would allow you to compress the original file, if still available.
XnView seems to handle FIF files, but it's Windows-only. There is an MP (Multi Platform) version, but it seems less complete and didn't work when I tried to view a FIF file.
Update: XnView MP, which does work on Linux and OSX claims to support FIF, but I couldn't get it to work.
Update 2: There's also an open source project, Fiasco, that can work with fractal images, but I'm not sure it's compatible with the proprietary FIF format.
I'd like to be able to read in the first couple of kilobytes of unknown file types and see if they match any known file type (e.g. MP3, JPEG, etc.). I was thinking of trying to load metadata from files with libraries like PIL, sndhdr, py264, etc. and seeing if they pick up any valid formats, but this must be a problem someone has solved before.
Is there one library or a gist showing the usage of multiple libraries which would do this?
Use python-magic to do the fingerprinting.
The library can determine file type from bytes data only:
import magic
magic.from_buffer(start_data_from_something)
The library provides access to the libmagic file type identification library, which also drives the UNIX file command.
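If installing libmagic is a problem, a bare-bones stdlib fallback is to compare the leading bytes against a handful of well-known signatures yourself. The table below is illustrative, not exhaustive; covering the long tail of formats is exactly the gap python-magic fills:

```python
# A few well-known magic numbers; illustrative only, not exhaustive.
SIGNATURES = [
    (b'\xff\xd8\xff', 'jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'png'),
    (b'PK\x03\x04', 'zip'),
    (b'MZ', 'exe'),
    (b'ID3', 'mp3'),
]

def sniff(data):
    """Guess a file type from its leading bytes, or return None."""
    for prefix, name in SIGNATURES:
        if data.startswith(prefix):
            return name
    return None
```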
I am using python zlib and I am doing the following:
Compress large strings in memory (zlib.compress)
Upload to S3
Download and read and decompress the data as string from S3 (zlib.decompress)
Everything is working fine, but when I download files directly from S3 and try to open them with a standard zip program I get an error. I noticed that instead of PK, the beginning of the file is:
xµ}ko$7’םחע¯¸?ְ)$“שo³¶w¯1k{`
I am flexible and don't mind switching from zlib to another package, but it has to be pythonic (Heroku compatible).
Thanks!
zlib compresses a file; it does not create a ZIP archive. For that, see zipfile.
If this is about compressing just strings, then zlib is the way to go. A zip file is for storing a file or even a whole directory tree of files, and it keeps file metadata. It can be made to store just strings, but it is not the appropriate tool for that.
If your application is just about storing and retrieving compressed strings, there is no point in "directly downloading files from S3 and try to open them with a standard zip program". Why would you do this?
Edit:
S3 generally is for storing files, not strings. You say you want to store strings. Are you sure that S3 is the right service for you? Did you look at SimpleDB?
Suppose you want to stick with S3 and would like to upload zipped strings. Your S3 client library most likely expects a file-like object to read from. To solve this efficiently, store the zipped string in a Python StringIO object (an in-memory file) and hand this in-memory file to your S3 client library for uploading to S3.
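A sketch of that round trip on a modern Python, where `io.BytesIO` plays the role the answer's StringIO plays on Python 2 (the `upload_fileobj` call is a hypothetical client call, shown only as a comment):

```python
import io
import zlib

def compressed_buffer(text):
    """Wrap a zlib-compressed string in an in-memory file object,
    ready to hand to an S3 client's upload call."""
    buf = io.BytesIO(zlib.compress(text.encode('utf-8')))
    # s3.upload_fileobj(buf, 'my-bucket', 'my-key')  # hypothetical client call
    return buf

def read_back(buf):
    """Mirror of the download path: read the file object, decompress."""
    return zlib.decompress(buf.read()).decode('utf-8')
```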
Do the same for downloading, and use Python for debugging purposes too. There is no point in trying to force a string into a zip file: it adds overhead (file metadata) compared with plain zlib-compressed strings.
An alternative to writing zip files just for debugging purposes, which is entirely the wrong format for your application, is to have a utility that can decompress zlib streams, which is entirely the right format for your application. That utility is pigz with the -z option.