Speed up reading/hashing millions of files/images - python

I have directories containing 100K - 1 million images. I'm going to create a hash for each image so that I can, in the future, find an exact match based on these hashes. My current approach is:
import hashlib

def hash_test(images):  # images is a list of image paths
    hashes = []
    for image in images:
        # folder is the directory prefix, defined elsewhere
        with open(folder + image, 'rb', buffering=0) as f:
            hashes.append(hashlib.sha256(f.read()).hexdigest())
            # hashes.append(CityHash128(f.read()))
    return hashes
31%|███ | 102193/334887 [00:04<42:15, 112.02it/s]
From what I can tell from my experiments, the file.read() operation is my bottleneck, which means that I am I/O bound. This is also confirmed by checking iotop. I am reading from an HDD. I have read about memory-mapped reading, but couldn't work out whether it is applicable in this situation.
My question is: is there a way to optimize this reading operation?

You can try to parallelise your hash computation code as shown below. However, the performance depends on how many parallel I/O requests the disk can handle and on how many cores your CPU has. Still, it is worth a try.
import hashlib
from multiprocessing import Pool

def hash_test(image):  # image is a single image path
    # folder is the directory prefix, defined elsewhere
    with open(folder + image, 'rb', buffering=0) as f:
        return hashlib.sha256(f.read()).hexdigest()
        # return CityHash128(f.read())

# This function returns the hashes as a list.
# It waits for all parallel hash computations to complete.
def parallel_hash(images):
    with Pool(5) as pool:
        return pool.map(hash_test, images)

if __name__ == '__main__':  # guard needed where worker processes are spawned (e.g. Windows)
    hashes = parallel_hash(images)

It's also possible that the problem has to do with the number of files in a directory. Some file systems experience severely degraded performance when you get many thousands of files in a single directory. If you have 100K or more files in a single directory, it takes significant time for the file system just to find the file before opening and reading it.
That said, let's think about this a bit. If I'm reading your output correctly, your program completed approximately 102K out of 335K files in four hours and 42 minutes. In round numbers, that's about 6 files per second. So it'll take about 15.5 hours to do all 335K files.
If this is a one-time task, then just set it up to run overnight, and it'll be done when you get back to work in the morning. If you have to index a million files, start the process on Friday night and it'll be done when you get into the office on Monday.
If it's not a one-time task, then you have other problems . . .

Related

How to increase read from disk speed in Python

I use Python for Image analysis. The first step in my code is to load the images from disk to a big 20GB uint8 array. This step is taking a very long time, loading about 10MB/s, and the cpu is idling during the task.
This seems extremely slow. Am I making an obvious mistake? How can I improve performance? Is it a problem with the numpy array type?
import os
import numpy as np
from PIL import Image

# find all image files in working folder
FileNames = []  # FileNames is a list of image names
workingFolder = 'C:/folder'
for (dirpath, dirnames, filenames) in os.walk(workingFolder):
    FileNames.extend(filenames)
FileNames.sort()  # Sorted by image number
imNumber = len(FileNames)  # Number of images
# AllImages initialize
img = Image.open(workingFolder+'/'+FileNames[0])
# PIL reports size as (width, height); NumPy arrays are indexed (row, column)
AllImages = np.zeros((img.size[1], img.size[0], imNumber), dtype=np.uint8)
for ii in range(imNumber):
    img = Image.open(workingFolder+'/'+FileNames[ii])
    AllImages[:,:,ii] = img
Thanks a lot for your help.
Since the CPU is idling, it sounds like the disk is the bottleneck. 10 MB/s is somewhat slow, but not so slow that it reminds me of stone-age hard disks. If it were numpy, I'd expect the CPU to be busy running numpy code rather than being idle.
Note that there may be two ways the CPU will be waiting for the disk. First, of course, you need to read the data from disk. Second, since the data is 20GB, it may be big enough to be swapped to disk. The normal solution to this type of situation is to memory map the file (which avoids moving data from disk to swap).
Try to check if you can read the files faster by other means. For example on linux you could use dd if=/path/to/image of=/tmp/output bs=8k count=10k; rm -f /tmp/output to check the speed of read to ram. See this question for more information on checking disk performance.
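If dd is not available (for example on Windows), a rough equivalent of that check can be sketched in Python. This is only a minimal sketch: the function name read_throughput and its parameters are made up for illustration, and the defaults simply mirror the bs=8k count=10k values of the dd command. Keep in mind that a file read recently may be served from the OS cache, so a second run on the same file can look much faster than the disk really is.

```python
import time

def read_throughput(path, block_size=8 * 1024, max_blocks=10 * 1024):
    """Read up to max_blocks blocks from path; return (bytes_read, MB_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffering, so each read goes to the OS
    with open(path, "rb", buffering=0) as f:
        for _ in range(max_blocks):
            block = f.read(block_size)
            if not block:  # end of file
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, rate
```

Comparing the reported rate with the roughly 10 MB/s seen in the image-loading loop gives a hint whether the disk or the decoding is the limiting factor.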

Getting MD5 hash of the files is terribly slow

I'm using the following code to get a MD5 hash for several files with an approx. total size of 1GB:
md5 = hashlib.md5()
with open(filename,'rb') as f:
    for chunk in iter(lambda: f.read(128*md5.block_size), b''):
        md5.update(chunk)
fileHash = md5.hexdigest()
For me, it is pretty fast: it takes about 3 seconds to complete. Unfortunately, for my users (who have old PCs) this method is very slow, and from my observations it may take about 4 minutes for some users to get all of the file hashes. This is a very annoying process for them, but at the same time I think this is the simplest and fastest way possible. Am I right?
Would it be possible to speed-up the hash collecting process somehow?
I have a fairly weak laptop as well, and I just tried it - I can md5 one GB in four seconds as well. To go to several minutes, I suspect it's not the calculation but reading the file from hard disk. Try reading 1 MB blocks, i.e., f.read(2**20). That should need far fewer reads and increase the overall reading speed.
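As a minimal sketch of that suggestion, the same hashing loop with a 1 MB block size could look like this. The function name hash_file and its parameters are illustrative, not part of the original code; whether the larger block actually helps depends on the disk, so it is worth timing both versions.

```python
import hashlib

def hash_file(path, chunk_size=2**20, algorithm="md5"):
    """Return the hex digest of a file, reading it in chunk_size-byte blocks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling hash_file(filename) replaces the original loop; the digest is identical regardless of the chunk size.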

Efficiently read data in python (only one line)

For an upcoming programming competition I solved a few of the tasks from former competitions.
Each task looks like this: we get a bunch of in-files (each containing one line of numbers and strings, e.g. "2 15 test 23 ..."), and we have to build a program and return some computed values.
These in-files can be quite large: for instance 10 MB.
My code is the following:
with open(filename) as f:
input_data = f.read().split()
This is quite slow, I guess mostly because of the split method. Is there a faster way?
What you have already looks like the best way for plain text IO on a one-line file.
10 MB of plain text is fairly large, if you need some more speedup you could consider pickling the data in a binary format instead of a plain text format. Or if it is very repetitive data, you could store it compressed.
If one of your input files contains independent tasks (that is, you can work on a couple of tokens of the line at a time, without knowing tokens further ahead), you can do reading and processing in lockstep, by simply not reading the whole file at once.
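# entire_group_inside() and next_group_index() are placeholders for task-specific logic
def read_groups(f):
    chunksize = 4096  # how many characters to read from the file at once
    buf = f.read(chunksize)
    while buf:
        if entire_group_inside(buf):  # checks if buf holds enough data to process
            i = next_group_index(buf)  # returns the index of the next group of tokens
            group, buf = buf[:i], buf[i:]
            yield group
        else:
            more = f.read(chunksize)
            if not more:  # end of file: yield what is left instead of looping forever
                yield buf
                break
            buf += more

with open(filename) as f:
    for data in read_groups(f):
        pass  # do something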
This has some advantages:
You don't need to read the whole file into memory (which, for 10 MB on a desktop, probably doesn't matter much).
If you do a lot of processing on each group of tokens, it may lead to better performance, as you'll have alternating I/O-bound and CPU-bound tasks. Modern OSs use sequential file prefetching to optimize linear file access, so in practice, if you lockstep I/O and CPU, your I/O will end up being executed in parallel by the OS. Even if your OS has no such functionality, if you have a modern disk, it'll probably cache sequential access to blocks.
If you don't have much processing, though, your task is fundamentally I/O bound, and there isn't much you can do to speed it up as it stands, as wim said, other than rethinking your input data format.
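To make the lockstep idea concrete, here is a minimal sketch for the simplest case, where each "group" is a single whitespace-separated token. The name read_tokens is made up for this example; it assumes the file is opened in text mode and that tokens never contain whitespace.

```python
def read_tokens(f, chunk_size=4096):
    """Yield whitespace-separated tokens from f without reading it all at once."""
    tail = ""  # possibly incomplete token carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        parts = buf.split()
        # If the buffer does not end in whitespace, the last token may be cut off,
        # so hold it back until the next chunk arrives
        if parts and not buf[-1].isspace():
            tail = parts.pop()
        else:
            tail = ""
        yield from parts
    if tail:
        yield tail
```

Used as `for token in read_tokens(f): ...`, this yields the same tokens as f.read().split(), just incrementally.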

Is it possible to speed-up python IO?

Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1
print(lc, sys.argv[1])
Running it on my 6GB text file, it completes in about 2 minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so I suspect the answer to my question is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question
Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
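The same arithmetic as a tiny helper, in case you want to plug in your own numbers (the function name is just for illustration):

```python
def transfer_seconds(size_mb, rate_mb_per_s):
    """Time to transfer size_mb megabytes at a sustained rate of rate_mb_per_s."""
    return size_mb / rate_mb_per_s
```

For the two cases above, transfer_seconds(6000, 60) gives 100 seconds and transfer_seconds(6000, 125) gives 48 seconds.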
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If It's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6GB file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.
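A rough sketch of the fan-out pattern, under some simplifying assumptions: the children here only count the rows they receive and report the count on STDOUT rather than writing a result file, and the names fan_out_count and CHILD_CODE are invented for this example. A real child would do the actual per-row work.

```python
import subprocess
import sys

# Hypothetical child program: counts the rows it receives on STDIN
CHILD_CODE = "import sys; print(sum(1 for _ in sys.stdin.buffer))"

def fan_out_count(path, workers=4):
    """Deal the lines of path out to child processes and sum their counts."""
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        for _ in range(workers)
    ]
    with open(path, "rb") as f:
        for i, line in enumerate(f):
            # Round-robin dealing keeps the parent's per-row work minimal
            children[i % workers].stdin.write(line)
    total = 0
    for child in children:
        child.stdin.close()  # EOF tells the child to finish up
        total += int(child.stdout.read().decode())  # consolidate the results
        child.wait()
    return total
```

Because each child only writes its result after its STDIN is closed, the parent can finish dealing rows before it starts reading results. Children that produce output while still reading would need a different arrangement, such as the per-child result files described above.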
You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader (available in Python 2.6+).
Do the newline counting in another thread, in parallel.
plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
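For reference, the binary-chunk variant might look like the sketch below (the function name and the 1 MB chunk size are arbitrary choices). As said, don't expect it to beat the disk.

```python
def count_newlines(path, chunk_size=2**20):
    """Count b'\\n' bytes in a file by reading it in binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            count += chunk.count(b"\n")
    return count
```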
If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.
as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the file among multiple machines, have each of them count its newlines, and add the results.
2 minutes sounds about right to read an entire 6GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what you are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.
PyPy provides optimised input/output that can be up to 7 times faster.
This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best case than using native I/O-bound tasks. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())
with open("big_comp.csv", "wb") as out:
    out.write(compressed)
iterate.py
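import zlib
with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())
for line in big.split(b"\n"):  # decompress() returns bytes, so split on a bytes newline
    line = reversed(line)  # placeholder for the real per-line work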
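The example above decompresses everything in one go. For the "stream object" route mentioned earlier, a minimal sketch using zlib.decompressobj() could look like this; the function name and chunk size are illustrative, and lines are yielded as bytes without the trailing newline.

```python
import zlib

def iter_decompressed_lines(path, chunk_size=2**16):
    """Yield lines (bytes, newline stripped) from a zlib-compressed file."""
    decomp = zlib.decompressobj()
    pending = b""  # decompressed data that does not yet end in a newline
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            pending += decomp.decompress(chunk)
            lines = pending.split(b"\n")
            pending = lines.pop()  # last piece may be an incomplete line
            yield from lines
    tail = (pending + decomp.flush()).split(b"\n")
    if tail[-1] == b"":  # the data ended with a newline, nothing left over
        tail.pop()
    yield from tail
```

This keeps only one chunk of compressed data and one partial line in memory at a time, which matters once the decompressed file no longer fits in RAM.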

Estimating zip size/creation time

I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate its creation time / size?
Extract a bunch of small parts from the big file. Maybe 64 chunks of 64k each. Randomly selected.
Concatenate the data, compress it, measure the time and the compression ratio. Since you've randomly selected parts of the file chances are that you have compressed a representative subset of the data.
Now all you have to do is to estimate the time for the whole file based on the time of your test-data.
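A minimal sketch of this sampling approach is below. It uses zlib.compress as a stand-in for the ZIP deflate step, assumes the file is much larger than the combined samples, and ignores archive overhead and disk read time; the function name and defaults are illustrative only.

```python
import os
import random
import time
import zlib

def estimate_compression(path, samples=64, sample_size=64 * 1024, seed=None):
    """Estimate (compressed_size_bytes, seconds) for compressing the whole file."""
    total = os.path.getsize(path)
    if total == 0:
        return 0, 0.0
    rng = random.Random(seed)
    parts = []
    with open(path, "rb") as f:
        for _ in range(samples):
            # Pick a random offset that leaves room for a full sample where possible
            f.seek(rng.randrange(max(1, total - sample_size + 1)))
            parts.append(f.read(sample_size))
    data = b"".join(parts)
    start = time.perf_counter()
    compressed = zlib.compress(data)
    elapsed = time.perf_counter() - start
    scale = total / len(data)  # extrapolate from the sample to the whole file
    return int(len(compressed) * scale), elapsed * scale
```

The time estimate only covers the compression itself, so on a slow disk the real archive will take longer than predicted.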
I suggest you measure the average time it takes to produce a zip of a certain size. Then you calculate the estimate from that measure. However I think the estimate will be very rough in any case if you don't know how well the data compresses. If the data you want to compress had a very similar "profile" each time you could probably make better predictions.
If it's possible to get progress callbacks from the Python module, I would suggest finding out how many bytes are processed per second (by simply storing where in the file you were at the start of the second and where you are at the end). Once you have data on how fast your computer is, you can of course save it and use it as a basis for your next zip file. (I normally collect about 5 samples before showing a time prognosis.)
Using this method can give you Microsoft minutes, so as you get more samples you would need to average it out. This would especially be the case if you're making a zip file that contains a lot of files, as ZIP tends to slow down when compressing many small files compared to one large file.
If you're using the ZipFile.write() method to write your files into the archive, you could do the following:
Get a list of the files you want to zip and their relative sizes
Write one file to the archive and time how long it took
Calculate ETA based on the number of files written, their size, and how much is left.
This won't work if you're only zipping one really big file though. I've never used the zip module myself, so I'm not sure if it would work, but for small numbers of large files, maybe you could use the ZipFile.writestr() function and read in / zip up your files in chunks?
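The file-by-file steps above can be sketched roughly as follows. The function name zip_with_eta and the report callback are made up for this example, and the estimate assumes that the remaining files compress about as fast per byte as the ones already written.

```python
import os
import time
import zipfile

def zip_with_eta(paths, archive_path, report=print):
    """Write paths into archive_path, reporting an ETA in seconds after each file."""
    sizes = [os.path.getsize(p) for p in paths]
    total = sum(sizes)
    done = 0
    start = time.perf_counter()
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, size in zip(paths, sizes):
            zf.write(path, arcname=os.path.basename(path))
            done += size
            elapsed = time.perf_counter() - start
            # Bytes per second so far, extrapolated to the bytes still to be written
            rate = done / elapsed if elapsed > 0 else 0.0
            eta = (total - done) / rate if rate > 0 else float("inf")
            report(eta)
```

The first estimate only appears after the first file is finished, which is why this does not help with a single very large file.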
