Getting MD5 hash of the files is terribly slow

Getting MD5 hash of the files is terribly slow - python

I'm using the following code to get a MD5 hash for several files with an approx. total size of 1GB:
md5 = hashlib.md5()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(128*md5.block_size), b''):
md5.update(chunk)
fileHash = md5.hexdigest()
For me, it's getting it pretty fast as it takes about 3 seconds to complete. But unfortunately for my users (having an old PC's), this method is very slow and from my observations it may take about 4 minutes for some user to get all of the file hashes. This is a very annoying process for them, but at the same I think this is the simplest & fastest way possible - am I right?
Would it be possible to speed-up the hash collecting process somehow?

I have a fairly weak laptop as well, and I just tried it - I can md5 one GB in four seconds as well. To go to several minutes, I suspect it's not the calculation but reading the file from hard disk. Try reading 1 MB blocks, i.e., f.read(2**20). That should need far fewer reads and increase the overall reading speed.

Related

python multiprocess read data from disk

it confused me long time.
my program has two process, both read data from disk, disk max read speed 10M/s
1. if two process both read 10M data, is two process spend time same with one process read twice?
2. if two process both read 5M data, two process read data spend 1s, one process read twice spend 1s, i know multi process can save time from IO, but the spend same time in IO, multi process how to save time?

It's not possible to increase disk read speed by adding more threads. With 2 threads reading you will get at best 1/2 the speed per thread (in practice even less), with 3 threads - 1/3 the speed, etc.
With disk I/O it is the difference between sequential and random access speed that is really important. For example, sequential read speed can be 10 MB/s, and random read just 10 KB/s. This is the case even with the latest SSD drives (although the ratio may be less pronounced).
For that reason you should prefer to read from disk sequentially from only one thread at a time. Reading the file in 2 threads in parallel will not only reduce the speed of each read by half, but will further reduce because of non-sequential (interleaved) disk access.
Note however, that 10 MB is really not much; modern OSes will prefetch the entire file into the cache, and any subsequent reads will appear instantaneous.

Speed up reading/hashing millions of files/images

I have directories containing 100K - 1 million images. I'm going to create a hash for each image so that I can, in the future, find an exact match based on these hashes. My current approach is:
def hash_test(images): # images is a list of image paths
hashes = []
for image in images:
with open(folder + image, 'rb', buffering=0) as f:
hashes.append(hashlib.sha256(f.read()).hexdigest())
# hashes.append(CityHash128(f.read()))
return hashes
31%|███ | 102193/334887 [00:04<42:15, 112.02it/s]
Of what I can tell from my experiments, the file.read() operation is my bottleneck, which means that I am I/O bound. This is also confirmed by checking iotop . I am reading from a HDD. I have read about memory-mapped reading, but couldn't get my head around whether it was applicable in this situation or not.
My question is: is there a way to optimize this reading operation?

You can try to parallelise your hash computation code like below. However, the performance depends upon how much parallel IO requests the disk can handle and also on how many cores does your CPU have. But, you can try.
from multiprocessing import Pool
# This function will return hashes as list
# Will wait for all parallel hash computation to complete
def parallel_hash(images):
with Pool(5) as pool:
return pool.map(hash_test, images)
def hash_test(image): # images is a list of image paths
with open(folder + image, 'rb', buffering=0) as f:
return hashlib.sha256(f.read()).hexdigest()
# hashes.append(CityHash128(f.read()))
parallel_hash(images)

It's also possible that the problem has to do with the number of files in a directory. Some file systems experience severely degraded performance when you get many thousands of files in a single directory. If you have 100K or more files in a single directory, it takes significant time for the file system just to find the file before opening and reading it.
That said, let's think about this a bit. If I'm reading your output correctly, your program completed approximately 102K out of 335K files in four hours and 42 minutes. In round numbers, that's about 6 files per second. So it'll take about 15.5 hours to do all 335K files.
If this is a one-time task, then just set it up to run overnight, and it'll be done when you get back to work in the morning. If you have to index a million files, start the process on Friday night and it'll be done when you get into the office on Monday.
If it's not a one-time task, then you have other problems . . .

Efficiently read data in python (only one line)

For a upcoming programming competition I solved a few of the tasks of former competitions.
Each task looks like this: We get a bunch of in-files (each containing 1 line of numbers and strings, f.e. "2 15 test 23 ..."), and we have to build a program and return some computed values.
These in-files can be quite large: for instance 10 MB.
My code is the following:
with open(filename) as f:
input_data = f.read().split()
This is quite slow. I quess mostly because of the split method. Is there a faster way?

What you have already looks like the best way for plain text IO on a one-line file.
10 MB of plain text is fairly large, if you need some more speedup you could consider pickling the data in a binary format instead of a plain text format. Or if it is very repetitive data, you could store it compressed.

If one of your input files contains independent tasks (that is, you can work on a couple of tokens of the line at a time, without knowing tokens further ahead), you can do reading and processing in lockstep, by simpy not reading the whole file at once.
def read_groups(f):
chunksize= 4096 #how many bytes to read from the file at once
buf= f.read(chunksize)
while buf:
if entire_group_inside(buf): #checks if you have enough data to process on buf
i= next_group_index(buf) #returns the index on the next group of tokens
group, buf= buf[:i], buf[i:]
yield group
else:
buf+= f.read(chunksize)
with open(filename) as f:
for data in read_groups(f):
#do something
This has some advantages:
You don't need to read the whole file into memory (which, for 10 MB on a desktop, probably doesn't matter much)
if you do a lot of processing on each group of tokens, it may lead to better performance, as you'll have alternating I/O and CPU bound tasks. Modern OSs use sequential file prefetching to optimize file linear access, so, in practice, if you lockstep I/O and CPU, your I/O will end up being executed in parallel by the OS. Even if your OS has no such functionality, if you have a modern disk, it'll probably cache sequential access to blocks.
If you don't have much processing, though, your task is fundamentally I/O bound, and there isn't much you can do to speed it up as it stands, as wim said - other than rethinking your input data format

Working with multiple Large Files in Python

I have around 60 files each contains around 900000 lines which each line is 17 tab separated float numbers. Per each line i need to do some calculation using all corresponding lines from all 60 files, but because of their huge sizes (each file size is 400 MB) and limited computation resources, it takes so long time. I would like to know is there any solution to do this fast?

It depends on how you process them. If you have enough memory you can read all the files first and change them to python data structures. Then you can do calculations.
If your files don't fit into memory probably the easiest way is to use some distributed computing mechanism (hadoop or other lighter alternatives).
Another smaller improvements could be to use fadvice linux function call to say how you will be using the file (sequential reading or random access), it tells the operating system how to optimize file access.
If the calculations fit into some common libraries like numpy numexpr which has a lot of optimizations you can use them (this can help if your computations use not-optimized algorithms to process them).

If "corresponding lines" means "first lines of all files, then second lines of all files etc", you can use `itertools.izip:
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>

A few options:
1. Just use the memory
If you have 17x900000 = 15.3 M floats/file. Storing this as doubles (as numpy usually does) will take roughly 120 MB of memory per file. You can reduce this by storing the floats as float32, so that each file will take roughly 60 MB. If you have 60 files and 60 MB/file, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
2. Do it row-by-row
If you can do it row-by-row, just read each file one row at a time. It is quite easy to have 60 open files, that'll not cause any problems. This is probably the most efficient method, if you process the files sequentially. The memory usage is next to nothing, and the operating system will take the trouble of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not in CSV but in a binary format. That way each row will take exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.mmap to map the files into arrays of [N, 17] shape. You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution, if your data access is not sequential. Then mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so that it uses the optimal amount of memory.
Behind the scenes this is a close relative to solution #1 with the exception that nothing is loaded into memory until required. The same limitations about 32-bit python apply; it is not able to do this as it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.

Is it possible to speed-up python IO?

Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
lc = lc + 1
print lc, sys.argv[1]
Running it on my 6GB text file, it completes in ~ 2minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so, I suspect the anwer to my quesion is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question

Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).

The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If It's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.

You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader ( available in python2.6+ ).
Do the newline counting in another thread, in parallel.

plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.

If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.

as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the it among multiple machines, each of them count their newlines, and add the results.

2 minutes sounds about right to read an entire 6gb file. Theres not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what your are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.

PyPy provides optimised input/output faster up to 7 times.

This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best best case than using native I/O bound tasks. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
compressed = zlib.compress(f.read())
open("big_comp.csv", "wb").write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
big = zlib.decompress(f.read())
for line in big.split("\n"):
line = reversed(line)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.