python zipfile module doesn't seem to be compressing my files

python zipfile module doesn't seem to be compressing my files - python

I made a little helper function:
import zipfile
def main(archive_list=[],zfilename='default.zip'):
print zfilename
zout = zipfile.ZipFile(zfilename, "w")
for fname in archive_list:
print "writing: ", fname
zout.write(fname)
zout.close()
if __name__ == '__main__':
main()
The problem is that all my files are NOT being COMPRESSED! The files are the same size and, effectively, just the extension is being change to ".zip" (from ".xls" in this case).
I'm running python 2.5 on winXP sp2.

This is because ZipFile requires you to specify the compression method. If you don't specify it, it assumes the compression method to be zipfile.ZIP_STORED, which only stores the files without compressing them. You need to specify the method to be zipfile.ZIP_DEFLATED. You will need to have the zlib module installed for this (it is usually installed by default).
import zipfile
def main(archive_list=[],zfilename='default.zip'):
print zfilename
zout = zipfile.ZipFile(zfilename, "w", zipfile.ZIP_DEFLATED) # <--- this is the change you need to make
for fname in archive_list:
print "writing: ", fname
zout.write(fname)
zout.close()
if __name__ == '__main__':
main()
Update: As per the documentation (python 3.7), value for 'compression' argument should be specified to override the default, which is ZIP_STORED. The available options are ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA and the corresponding libraries zlib, bz2 or lzma should be available.

There is a really easy way to compress zip format,
Use in shutil.make_archive library.
For example:
import shutil
shutil.make_archive(file_name, 'zip', file location after compression)
Can see more extensive documentation at: Here

Hope this is going to be useful to someone.
I tested all zip modes and benchmarked them on two data sets. First one small (~30 MB) and other large (~ 1,5 GB). They consisted of various types of files so it would be as close to real life scenario as possible. I did two methods of tests on each dataset: the “proportional” one and the “complete” one. Both tests where repeated 3 times one after another to get an average. Those result may differ depending on your machines, but I think it’s still a good place to start.
I did the test in two methods because I’m trying to make my own specialized backup solution.
The proportional method creates more zip files but it allows me to transfer smaller packages of data if necessary eg. replacing only things that changed. It's more complicated than that, but it is not important right now.
The complete method is just straight up compressing whole folder.
Compression ratio calculation:
size_difference = source_size - compressed_size
compression_ratio = (size_difference * 100.0) / source_size
Basically the higher that number the better.
Each zip archive was initialized like this:
# Mode tests
with zipfile.ZipFile(target_zip, 'w', compression_method) as ziph:
# Level tests
with zipfile.ZipFile(target_zip, 'w', compression_method, compresslevel=level) as ziph:
Here are the results:
It seems that no matter the method, the most optimal compression mode is ZIP_DEFLATED.
The only smaller archive size gave me ZIP_LZMA mode, but it was only fraction of % and it took about 8x longer for large data sets.
Furthermore I tried different levels of compression with the same data set and methods. Except this time there was only one run per level.
It looks like ZIP_DEFLATED and ZIP_BIP2 have similar compression capabilities, but the second one is much slower. For large data sets the compression level of 1 or 2 should suffice. Increasing it more gives no significant effect on final file size. If the workload demands a lot of “small” zip files it is better to use level 9. It gives high compression ratio but takes about the same amount of time as at level 1.

Related

Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
Multiprocessing wouldn't work because in order to transfer data from a child subprocess to the main process data need to be serialized (pickled) and deserialized.
Using threading wouldn't help either because of GIL.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
UPDATE
Answering your questions:
Files are partial products of data processing for the purpose of ML
There are pandas.Series objects but the dtype is not known upfront
I want to have many files because we want to pick any subset easily
I want to have many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we have serialized string and deserialized objects)
The size of the files can vary a lot
I use python 3.7 so I believe it's cPickle in fact
Using pickle is very flexible because I don't have to worry about underlying types - I can save anything

I agree with what has been noted in the comments, namely that due to the constraint of python itself (chiefly, the GIL lock, as you noted) and there may simply be no faster loading the information beyond what you are doing now. Or, if there is a way, it may be both highly technical and, in the end, only gives you a modest increase in speed.
That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.

I think that the solution would be some library written in C that
takes a list of files to read and then runs multiple threads (without
GIL). Is there something like this around?
In short: no. pickle is apparently good enough for enough people that there are no major alternate implementations fully compatible with the pickle protocol. As of sometime in python 3, cPickle was merged with pickle, and neither release the GIL anyway which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).
If your data can be re-structured into a simpler data format like csv, or a binary format like numpy's npy, there will be less cpu overhead when reading your data. Pickle is built for flexibility first rather than speed or compactness first. One possible exception to the rule of more complex less speed is the HDF5 format using h5py, which can be fairly complex, and I have used to max out the bandwidth of a sata ssd.
Finally you mention you have many many pickle files, and that itself is probably causing no small amount of overhead. Each time you open a new file, there's some overhead involved from the operating system. Conveniently you can combine pickle files by simply appending them together. Then you can call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files together using shutil
import pickle, shutil, os
#some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}
#create two pickles
with open('test1.pickle', 'wb') as f:
pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
pickle.Pickler(f).dump(d2)
#combine list of pickle files
with open('test3.pickle', 'wb') as dst:
for pickle_file in ['test1.pickle', 'test2.pickle']:
with open(pickle_file, 'rb') as src:
shutil.copyfileobj(src, dst)
#unpack the data
with open('test3.pickle', 'rb') as f:
p = pickle.Unpickler(f)
while True:
try:
print(p.load())
except EOFError:
break
#cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')

I think you should try and use mmap(memory mapped files) that is similar to open() but way faster.
Note: If your each file is big in size then use mmap otherwise If the files are small in size use regular methods.
I have written a sample that you can try.
import mmap
from time import perf_counter as pf
def load_files(filelist):
start = pf() # for rough time calculations
for filename in filelist:
with open(filename, mode="r", encoding="utf8") as file_obj:
with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
data = pickle.load(mmap_file_obj)
print(data)
print(f'Operation took {pf()-start} sec(s)')
Here mmap.ACCESS_READ is the mode to open the file in binary. The file_obj returned by open is just used to get the file descriptor which is used to open the stream to the file via mmap as a memory mapped file.
As you can see below in the documentation of python open returns the file descriptor or fd for short. So we don't have to do anything with the file_obj operation wise. We just need its fileno() method to get its file descriptor. Also we are not closing the file_obj before the mmap_file_obj. Please take a proper look. We are closing the the mmap block first.
As you said in your comment.
open (file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.
Give it a try and see how much impact does it do on your operation
You can read more about mmap here. And about file descriptor here

You can try multiprocessing:
import os,pickle
pickle_list=os.listdir("pickles")
output_dict=dict.fromkeys(pickle_list, '')
def pickle_process_func(picklename):
with open("pickles/"+picklename, 'rb') as file:
dapickle=pickle.load(file)
#if you need previus files output wait for it
while(!output_dict[pickle_list[pickle_list.index(picklename)-1]]):
continue
#thandosomesh
print("loaded")
output_dict[picklename]=custom_func_i_dunno(dapickle)
from multiprocessing import Pool
with Pool(processes=10) as pool:
pool.map(pickle_process_func, pickle_list)

Consider using HDF5 via h5py instead of pickle. The performance is generally much better than pickle with numerical data in Pandas and numpy data structures and it supports most common data types and compression.

use binary data after main program in Python [duplicate]

I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems, they must have embedded quite a lot of binary data, such as icons, sound, graphics, etc, into the script. I am wondering how that can be done, since this can be potentially useful in simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ..., terribly inefficient, takes half a MB for a 100KB wav file.
base64, better than option 1, enlarge the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?

You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> bz2.compress(b'\x00' * 100 + b'\x01' * 200).encode('base64').replace('\n', '')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'w') as fdesc:
fdesc.write(bz2.decompress(data.decode('base64')))

Here's a quick and dirty way. Create the following script called MyInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file, 10Gb, and I want to load it into, i.e., a list or whatever more efficient, please suggest.
Here is the way I load the file content into a list
def load(source, size):
data = [[] for _ in range(size)]
ln = 0
with open(source, 'r') as input:
for line in input:
ln += 1
data[ln%size].sanitize(line)
return data
Note:
source: is file name
size: is the number of concurrent process, I divide data into [size] of sublist.
for parallel computing using MPI in python.
Please advise how to load data more efficient and faster. I'm searching for days but I couldn't get any results matches my purpose and if there exists, please comment with a link here.
Regards

If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in continues blocks in the H.D.D then I don't know a way to read it faster than reading the file starting form the first bytes to the end.
But if the file is fragmented, create multiple threads each reading a part of the file. The must slow down the process of reading but modern HDDs implement a technique named NCQ (Native Command Queueing). It works by giving high priority to the read operation on sectors with addresses near the current position of the HDD head. Hence improving the overall speed of read operation using multiple threads.
To mention an efficient data structure in Python for your program, you need to mention what operations will you perform to the data? (delete, add, insert, search, append and so on) and how often?
By the way, if you use commodity hardware, 10GBs of RAM is expensive. Try reducing the need for this amount of RAM by loading the necessary data for computation then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.

(original) Solution using pickling
The strategy for your task can go this way:
split the large file to smaller ones, make sure they are divided on line boundaries
have Python code, which can convert smaller files into resulting list of records and save them as
pickled file
run the python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking pickled files one by one, loading the list from it and appending it
to final result.
To gain anything, you have to be careful as overhead can overcome all possible gains from parallel
runs:
as Python uses Global Interpreter Lock (GIL), do not use threads for parallel processing, use
processes. As processes cannot simply pass data around, you have to pickle them and let the other
(final integrating) part to read the result from it.
try to minimize number of loops. For this reason it is better to:
do not split the large file to too many smaller parts. To use power of your cores, best fit
the number of parts to number of cores (or possibly twice as much, but getting higher will
spend too much time on swithing between processes).
pickling allows saving particular items, but better create list of items (records) and pickle
the list as one item. Pickling one list of 1000 items will be faster than 1000 times pickling
small items one by one.
some tasks (spliting the file, calling the conversion task in parallel) can be often done faster
by existing tools in the system. If you have this option, use that.
In my small test, I have created a file with 100 thousands lines with content "98-BBBBBBBBBBBBBB",
"99-BBBBBBBBBBB" etc. and tested converting it to list of numbers [...., 98, 99, ...].
For spliting I used Linux command split, asking to create 4 parts preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used following script, converting the content into file with
extension .pickle and containing pickled list.
# chunk2pickle.py
import pickle
import sys
def process_line(line):
return int(line.split("-", 1)[0])
def main(fname, pick_fname):
with open(pick_fname, "wb") as fo:
with open(fname) as f:
pickle.dump([process_line(line) for line in f], fo)
if __name__ == "__main__":
fname = sys.argv[1]
pick_fname = fname + ".pickled"
main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used parallel tool (which has to be installed into
system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel, adjust it to your system or leave it out and it will
default to number of cores you have.
parallel can also get list of parameters (input file names in our case) by other means like ls
command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle
def main(file_names):
res = []
for fname in file_names:
with open(fname, "rb") as f:
res.extend(pickle.load(f))
return res
if __name__ == "__main__":
file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
# here you have the list of records you asked for
records = main(file_names)
print records
In my answer I have used couple of external tools (split and parallel). You may do similar task
with Python too. My answer is focusing only on providing you an option to keep Python code for
converting lines to required data structures. Complete pure Python answer is not covered here (it
would get much longer and probably slower.
Solution using process Pool (no explicit pickling needed)
Following solution uses multiprocessing from Python. In this case there is no need to pickle results
explicitly (I am not sure, if it is done by the library automatically, or it is not necessary and
data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool
def process_line(line):
return int(line.split("-", 1)[0])
def process_chunkfile(fname):
with open(fname) as f:
return [process_line(line) for line in f]
def main(file_names, cores=4):
p = Pool(cores)
return p.map(process_chunkfile, file_names)
if __name__ == "__main__":
file_names = ["xaa", "xab", "xac", "xad"]
# here you have the list of records you asked for
# warning: records are in groups.
record_groups = main(file_names)
for rec_group in record_groups:
print(rec_group)
This updated solution still assumes, the large file is available in form of four smaller files.

Detecting duplicate files with Python

I'm trying to write a script in Python for sorting through files (photos, videos), checking metadata of each, finding and moving all duplicates to a separate directory. Got stuck with the metadata checking part. Tried os.stat - doesn't return True for duplicate files. Ideally, I should be able to do something like :
if os.stat("original.jpg")== os.stat("duplicate.jpg"):
shutil.copy("duplicate.jpg","C:\\Duplicate Folder")
Pointers anyone?

There's a few things you can do. You can compare the contents or hash of each file or you can check a few select properties from the os.stat result, ex
def is_duplicate(file1, file2):
stat1, stat2 = os.stat(file1), os.stat(file2)
return stat1.st_size==stat2.st_size and stat1.st_mtime==stat2.st_mtime

A basic loop using a set to keep track of already encountered files:
import glob
import hashlib
uniq = set()
for fname in glob.glob('*.txt'):
with open(fname,"rb") as f:
sig = hashlib.sha256(f.read()).digest()
if sig not in uniq:
uniq.add(sig)
print fname
else:
print fname, " (duplicate)"
Please note as with any hash function there is a slight chance of collision. That is two different files having the same digest. Depending your needs, this is acceptable of not.
According to Thomas Pornin in an other answer :
"For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability [of collision] is about 4.3*10-60."
Given your need, if you have to check for additional properties in order to identify "true" duplicates, change the sig = ....line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uidas returned by os.stat()), write:
sig = ( hashlib.sha256(f.read()).digest(),
os.stat(fname).st_uid )

If two files have the same md5 they are exact duplicates.
from hashlib import md5
with open(file1, "r") as original:
original_md5 = md5(original.read()).hexdigest()
with open(file2, "r") as duplicate:
duplicate_md5 = md5(duplicate.read()).hexdigest()
if original_md5 == duplicate_md5:
do_stuff()
In your example you're using jpg file in that case you want to call the method open with its second argument equals to rb. For that see the documentation for open

os.stat offers information about some file's metadata and features, including the creation time. That is not a good approach in order to find out if two files are the same.
For instance: Two files can be the same and have different time creation. Hence, comparing stats will fail here. Sylvain Leroux approach is the best one when combining performance and accuracy, since it is very rare two different files has the same hash.
So, unless you have an incredibly large amount of data and a repeated file will cause a system fatality, this is the way to go.
If that your case (it not seems to be), well ... the only way you can be 100% sure two file are the same is iterating and perform a comparison byte per byte.

Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code but its terribly slow compared to the F# version I created before.
import os, struct
def read_values(filename, indices):
# indices are sorted and unique
values = []
with open(filename, 'rb') as f:
for index in indices:
f.seek(index*4L, os.SEEK_SET)
b = f.read(4)
v = struct.unpack("#i", b)[0]
values.append(v)
return values
For comparison here is the F# version:
open System
open System.IO
let readValue (reader:BinaryReader) cellIndex =
// set stream to correct location
reader.BaseStream.Position <- cellIndex*4L
match reader.ReadInt32() with
| Int32.MinValue -> None
| v -> Some(v)
let readValues fileName indices =
use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
// Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
let values = List.map (readValue reader) (List.ofSeq indices)
values
Any tips on how to improve the performance of the python version, e.g. by usage of numpy ?
Update
Hdf5 works very good (from 5 seconds to 0.8 seconds on my test file):
import tables
def read_values_hdf5(filename, indices):
values = []
with tables.open_file(filename) as f:
dset = f.root.raster
return dset[indices]
Update 2
I went with the np.memmap because the performance is similar to hdf5 and I already have numpy in production.

Heavily depending on your index file size you might want to read it completely into a numpy array. If the file is not large, complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that python operates on buffered input. If the program was written in some lower level language, the use on unbuffered IO would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
return index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
with open("my_index", "rb") as f:
memory_map = mmap.mmap(mmap.mmap(f.fileno(), 0)
for i in indices:
# the index at position i:
idx_value = struct.unpack('I', memory_map[4*i:4*i+4])
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np
index_arr = np.memmap(filename, dtype='uint32', mode='rb')
return index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSD is another story). In average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data handling delay. The overhead introduced by using python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.

Is the indices list sorted? i think you could get better performance if the list would be sorted, as you would make a lot less disk seeks

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.