Python MD5 Hash Faster Calculation - python

I will try my best to explain my problem and my line of thought on how I think I can solve it.
I use this code:
import hashlib
import os

for root, dirs, files in os.walk(downloaddir):
    for infile in files:
        f = open(os.path.join(root, infile), 'rb')
        filehash = hashlib.md5()
        while True:
            data = f.read(10240)
            if len(data) == 0:
                break
            filehash.update(data)
        f.close()
        print "FILENAME: ", infile
        print "FILE HASH: ", filehash.hexdigest()
Using start = time.time() and elapsed = time.time() - start I measure how long it takes to calculate a hash. Pointing my code at a 653 MB file, this is the result:
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.624
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.373
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.540
OK, so roughly 12 seconds for a 653 MB file. My problem is that I intend to use this code in a program that will run through multiple files, some of which might be 4, 5 or 6 GB, and that will take way longer to calculate. What I am wondering is whether there is a faster way for me to calculate the hash of a file. Maybe by doing some multithreading? I used another script to check CPU usage second by second and I see that my code is only using 1 out of my 2 CPUs, and only at 25% max; is there any way I can change this?
Thank you all in advance for the given help.

Hash calculation in your case will almost certainly be I/O bound (unless you'll be running it on a machine with a really slow processor), so multithreading or processing several files at once probably won't yield the expected results.
Arranging the files over multiple drives or putting them on a faster (SSD) drive would probably help, even though that is probably not the solution you are looking for.
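One quick way to check whether you really are I/O bound (a sketch, not from the original answer; the file name is just an example): time a plain read of the file against a read-plus-hash pass over the same data. If the two numbers are close, the disk is the limiting factor.
import hashlib
import time

def time_read(path, chunk=1024 * 1024):
    # time how long it takes just to read the file
    start = time.time()
    with open(path, 'rb') as f:
        while f.read(chunk):
            pass
    return time.time() - start

def time_read_and_hash(path, chunk=1024 * 1024):
    # time reading plus MD5-hashing the same file
    start = time.time()
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            md5.update(data)
    return time.time() - start

# Beware of the OS page cache: the second pass may be served from RAM,
# so use different files (or drop caches) for a fair comparison.
print "read only:     %.3f s" % time_read('freebsd.iso')
print "read and hash: %.3f s" % time_read_and_hash('freebsd.iso')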

Aren't disk operations a bottleneck here?
Assuming 80MB/sec read speed (this is how my hard disk performs), it takes about 8 seconds to read the file.
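(As a rough check: 653 MB / 80 MB/s ≈ 8.2 seconds, so the disk alone accounts for most of the ~12 seconds measured above.)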

For what it's worth, doing this:
c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso
takes 9.671 seconds on my laptop (2GHz core2 duo with an 80 GB SATA laptop hard drive).
As others have mentioned, MD5s are disk-bound, but your 12 second benchmark is probably pretty close to the fastest you could get.
Also, python's md5sum.py uses 8096 for the buffer size (even though I'm sure they meant either 4096 or 8192).

It helped me to increase my buffer size, up to a point. I started with 1024 and multiplied it by 2^N, increasing N each time starting from 1. With this method, I found that on my system a buffer size of 65536 seemed to be about as good as it would get. However, it only gave me about a 7% improvement in running time.
Profiling indicated that about 80% of the time is spent in the MD5 update method and the other 20% is reading in the file. Since MD5 is a serial algorithm and the Python algorithm is already implemented in C, I don't think that there is much that you can do to speed up the MD5 part. You can try calculating the MD5s of two different files in parallel, but as everyone has said, you're ultimately going to be limited by the disk access speed.
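A sketch of the kind of buffer-size sweep described above (not the original poster's script; the file name is just an example):
import hashlib
import time

def md5_with_buffer(path, bufsize):
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(bufsize)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

# Try buffer sizes 1024 * 2**N for N = 1..10 and report the timings.
# Note that after the first pass the file sits in the OS page cache,
# which flatters the later runs; drop caches between runs for a fair test.
for n in range(1, 11):
    bufsize = 1024 * (2 ** n)
    start = time.time()
    md5_with_buffer('freebsd.iso', bufsize)
    print "bufsize %7d: %.3f s" % (bufsize, time.time() - start)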

Related

Randomized-offset binary raw disk writes with no caching whatsoever

For my application, I am attempting to determine whether a data backup system missed any writes. I am doing this by writing an incrementing integer counter to a 1 GB virtual disk, and to make sure no writes were missed I can look at the reverted snapshot and see if there were any gaps (i.e. if I see 1, 2, 3, 0, 0, 6, 7 I know that the backup didn't get writes 4 and 5 correctly). This is all on a CentOS 7 VM, with mostly Python 2.7 scripts for writes/reads (speed isn't a huge concern).
A big part of my issue has been caching: since I'm simulating random I/O, writes are often flushed from caches and written to disk out of order. This makes every test appear as a false positive, since it looks like some data is missing at the time of the snapshot. Again, I don't really care about efficiency at all, so I don't mind really slow writes. Reads can use caching; that's not a problem, but it also doesn't matter much one way or the other.
Here are the things I have done to try to disable caching:
- disabled the disk write cache with sudo hdparm -W 0 /dev/sdb, where /dev/sdb is the raw disk I write to
- written to a raw disk with no filesystem, so no filesystem caching
- set buffering to 0 in the Python open() call (no Python write cache)
Is it basically an impossible task to make sure that my writes get put on the disk in sequential order? All I need is write #(n) to happen before write #(n+1), and #(n+1) before #(n+2), etc.
This is the Python script I'm using to write to disk (SIZE and PRIME change based on the size of the disk and a random seed):
from struct import pack, unpack
import sys

SIZE, PRIME = [x], [x]

# random I/O traversal iterator
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield (ctr % b)
        ctr += a

with open('/dev/sdb', 'rb+', buffering=0) as f:
    index_gen = rand_index_generator(PRIME, SIZE)
    # random traversal using iterator above, write counter to file
    for counter in xrange(1, SIZE - 16):
        f.seek(index_gen.next() * 4)
        f.write(pack('>I', counter))
Then to validate, I traverse in the same order and watch for gaps of unwritten data. This is after reverting the VM back to the snapshot. I know all the traversal and writing logic works, since validation runs smoothly with no missed writes before reverting, but I think some "written" data dies in RAM and never makes it to disk.
I will take any suggestions to guarantee the write order I need for this application.
Found out the answer to this question. I misunderstood the effect of writing to a raw disk; it did not eliminate OS caching, since I was still calling the OS to write to my raw disk. Oops.
To bypass OS caches you should use os.open and pass the os.O_DIRECT and os.O_SYNC flags to make sure writes happen in the correct sequence (more info on those flags) and are not stuck in volatile memory. I used mmap and os file descriptors, but you could also use normal file handles like this.
Page size is specific to your operating system. For Linux it is 4096.
The top section of the code stayed the same but here is the write loop:
# additionally requires: import os, import mmap
PAGESIZE = 4096
filedesc = os.open('/dev/sdb', os.O_DIRECT | os.O_SYNC | os.O_RDWR)
for counter in xrange(1, SIZE - 16):
    write_loc = index_gen.next() * 4
    page_dist = write_loc % PAGESIZE
    offset = write_loc - page_dist
    bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
    bytemap[page_dist:(page_dist + 4)] = pack('>I', counter)
    bytemap.flush()
    bytemap.close()
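For completeness, the read-back validation can follow the same pattern. This is only a sketch (not part of the original answer); it assumes the rand_index_generator, SIZE, PRIME and PAGESIZE defined above, and lets reads come from the cache, which the question says is fine.
from struct import unpack
import mmap
import os

filedesc = os.open('/dev/sdb', os.O_RDWR)
index_gen = rand_index_generator(PRIME, SIZE)
for counter in xrange(1, SIZE - 16):
    read_loc = index_gen.next() * 4
    page_dist = read_loc % PAGESIZE
    offset = read_loc - page_dist
    bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
    value = unpack('>I', bytemap[page_dist:page_dist + 4])[0]
    bytemap.close()
    if value != counter:
        print "missed write #%d (found %d on disk)" % (counter, value)
os.close(filedesc)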

Efficiently reading few lines from a very large binary file

Here is a simple example to illustrate my problem:
I have a large binary file with 10 million values.
I want to get 5K values from certain points in this file.
I have a list of indexes giving me the exact place in the file where each value is.
To solve this I tried two methods:
1. Going through the values and simply using seek() (from the start of the file) to get each value, something like this:
binaryFile_new = open(binary_folder_path, "r+b")
for index in index_list:
    binaryFile_new.seek(size * index, 0)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
But as I understand it, this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
2. Sorting the indexes so I could go through the file "once" while seeking from the current position, with something like this:
binaryFile_new = open(binary_folder_path, "r+b")
sorted_index_list = sorted(index_list)
for i, index in enumerate(sorted_index_list):
    if i == 0:
        # first value: absolute seek from the start of the file
        binaryFile_new.seek(size * index, 0)
    else:
        # later values: seek relative to the current position
        binaryFile_new.seek((index - sorted_index_list[i - 1]) * size - size, 1)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
I expected the second solution to be much faster because in theory it would go through the whole file once O(N).
But for some reason both solutions take about the same time to run.
I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
I'd go with #1:
for index in index_list:
    binary_file.seek(size * index)
    # ...
(I cleaned up your code a bit to comply with Python naming conventions and to avoid using a magic 0 constant, as SEEK_SET is default anyway.)
as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
No, a seek() does not "read through from the beginning", that would defeat the point of seeking. Seeking to the beginning of file and to the end of file have roughly the same cost.
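A quick way to convince yourself of this (a sketch; the file name is just an example): time a seek-plus-read near the start of a large file against one near the end. The two should cost roughly the same.
import os
import time

path = 'large_file.bin'   # any file big enough not to fit in cache
size = os.path.getsize(path)

with open(path, 'rb') as f:
    for position in (4096, size - 4096):
        start = time.time()
        f.seek(position)
        f.read(16)
        print "seek to %d: %.6f s" % (position, time.time() - start)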
Sorting the indexes so I could go through the file "once" while seeking from the current position
I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.
There might be a small improvement just from seeking to the positions you need in order instead of randomly, as there's an increased chance your random reads will be serviced from cache, in case many of the points you need to read happen to be close to each other (and so your read patterns trigger read-ahead in the file system).
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
The internet is full of discussions of relative merits of read vs mmap, but I recommend you don't bother with trying to optimize by using mmap and use this time to learn about the virtual memory and the page cache.
[edit] reading in chunks larger than the size of your values might save you a bit of CPU time in case many of the values you need to read are in the same chunk (which is not a given) - but unless your program is CPU bound in production, I wouldn't bother with that either.
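If you do want to try it anyway, random access through mmap looks roughly like this (a sketch; size, index_list and binary_folder_path are the names used in the question):
import mmap

with open(binary_folder_path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    wanted_line_list = []
    for index in index_list:
        # slicing touches only the pages backing these bytes; the OS
        # pages them in on demand rather than scanning the whole file
        wanted_line_list.append(mm[size * index:size * index + size])
    mm.close()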

Performance issue with loop on datasets with h5py

I want to apply a simple function to the datasets contained in an hdf5 file.
I am using a code similar to this
import h5py
data_sums = []
with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above, but this line
                  # is enough to replicate the problem
It goes very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically.
If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect.
The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration.
Iterating over smaller chunks does not solve the issue.
I suppose I am filling some memory space and that the process slows down when it is full but I do not understand why.
How to circumvent this performance issue?
I run Python 2.6.5 on Ubuntu 10.04.
Edit:
The following code does not slow down if the second line of the loop is un-commented. It does slow down without it.
f = h5py.File(path_to_file, "r")
list_name = f["data"].keys()
f.close()
import numpy as np
for name in list_name:
    f = h5py.File(d.storage_path, "r")
    # name = list_name[0]  # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)
    data[:, 1].sum()
    print "."
    f.close()
Edit: I found out that accessing the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
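For reference, the two access patterns being contrasted look roughly like this (a sketch with placeholder file and dataset names, assuming a 2-D dataset):
import h5py

with h5py.File("data.h5", "r") as f:      # placeholder file name
    dset = f["data"]["some_dataset"]      # placeholder dataset of shape (rows, cols)
    first_dim = dset[0, :]    # slicing along the first dimension: reported to be fine
    higher_dim = dset[:, 1]   # slicing along a later dimension: where the slowdown shows up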
platform?
on windows 64 bit, python 2.6.6, i have seen some weird issues when crossing a 2GB barrier (i think) if you have allocated it in small chunks.
you can see it with a script like this:
ix = []
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)
you can see that it will run pretty fast for a while, and then suddenly slow down.
but if you run it in larger blocks:
ix = []
for i in xrange(20000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000000)
it doesn't seem to have the problem (though it will run out of memory, depending on how much you have - 8GB here).
weirder yet, if you eat the memory using large blocks, and then clear the memory (ix=[] again, so back to almost no memory in use), and then re-run the small block test, it isn't slow anymore.
i think there was some dependence on the pyreadline version - 2.0-dev1 helped a lot with these sorts of issues. but i don't remember too much. when i tried it now, i don't see this issue anymore really - both slow down significantly around 4.8GB, which with everything else i have running is about where it hits the limits of physical memory and starts swapping.

Advice on writing to a log file with python

I have some code that will need to write about 20 bytes of data every 10 seconds.
I'm on Windows 7 using python 2.7
You guys recommend any 'least strain to the os/hard drive' way to do this?
I was thinking about opening and closing the same file every 10 seconds:
f = open('log_file.txt', 'w')
f.write(information)
f.close()
Or should I keep it open and just flush() the data and not close it as often?
What about SQLite? Will it improve performance and be less intensive than the open and close file operations?
(Isn't it just a flat-file database, so == to a text file anyway...?)
What about MySQL (this uses a local server/process... not sure of the specifics on when/how it saves data to the hdd)?
I'm just worried about not frying my hard drive and improving the performance of this logging procedure. I will be receiving new log information about every 10 seconds, and this will be going on 24/7.
Your advice?
ie: Think about programs like utorrent that require saving large amounts of data on a constant basis for long periods of time, (my log file is significantly less data that those being written in such "downloader type programs" like utorrent)
import random
import time

def get_data():
    letters = 'isn\'t the code obvious'
    data = ''
    for i in xrange(20):
        data += random.choice(letters)
    return data

while True:
    f = open('log_file.txt', 'w')
    f.write(get_data())
    f.close()
    time.sleep(10)
My CPU starts whining after about 15 seconds... (or is that my hdd? )
As expected, Python comes with a great tool for this; have a look at the logging module.
Use the logging framework. This is exactly what it is designed to do.
Edit: Balls, beaten to it :).
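A minimal setup along those lines (a sketch; the file name and format are just examples, and get_data() is the function from the question):
import logging
import time

logging.basicConfig(
    filename='log_file.txt',
    level=logging.INFO,
    format='%(asctime)s %(message)s',
)

while True:
    logging.info(get_data())   # get_data() as defined in the question
    time.sleep(10)
The FileHandler that basicConfig creates keeps the file open and flushes after each record, so you avoid reopening the file every 10 seconds.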
Don't worry about "frying" your hard drive - 20 bytes every 10 seconds is a small fraction of the data written to the disk in the normal operation of the OS.
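(At 20 bytes every 10 seconds you are writing about 2 bytes per second, roughly 170 KB per day or about 60 MB per year.)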

Fast Reading of 10000 Binary Files?

I have 10,000 binary files, named like this:
file0.bin
file1.bin
............
............
file10000.bin
Each of the above files contains exactly 391 float values (1564 bytes per file).
My goal is to read all of the files into a Python array in the fastest way possible. If I open and close each file using a script, it takes a lot of time (about 8 minutes!).
Are there any other creative ways to read these files FAST?
I am using Ubuntu Linux and would prefer a solution that can work with Python. Thanks.
If you want even faster make ramdisk:
# mkfs -q /dev/ram1 $(( 2 * 10000)) ## roughly the size you need
# mkdir -p /ramcache
# mount /dev/ram1 /ramcache
# df -H | grep ramcache
now concat
# cat file{1..10000}.bin >> /ramcache/concat.bin ## thanks SiegeX
Then let your script loose on that file.
Since I haven't tested I prefixed everything with '#' so that you wouldn't have any accidents. Just remove them if you want it to work.
This is an option, but I would urge you to consider looking at the comments people have posted directly under your question.
You could probably get better results by examining what you are doing wrong, as I could not reproduce your speed problem of 8 minutes.
Iterate over them and use the optimise flag. You might also want to process them using PyPy; it compiles Python via a JIT compiler, allowing for a somewhat marked increase in speed.
You have 10001 files (0 to 10000 inclusive) and it takes 8 minutes to run the following?
try: xrange  # Python 2 / 3 compatibility
except NameError: xrange = range
import array
final = array.array('f')
for file_seq in xrange(10001):
    with open("file%d.bin" % file_seq, "rb") as fp:
        final.fromfile(fp, 391)
What's the underlying filesystem? How much RAM do you have? What's your processor and its speed?
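If the per-file overhead really is the problem, numpy can read each file with a single call (a sketch, not from the original answer; it assumes the 391 values per file are native-endian float32, since 391 * 4 = 1564 bytes):
import numpy as np

values = np.empty((10001, 391), dtype=np.float32)
for file_seq in xrange(10001):
    # one fromfile call per file; the dtype is an assumption about the data
    values[file_seq] = np.fromfile("file%d.bin" % file_seq, dtype=np.float32, count=391)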
