I use Python for image analysis. The first step in my code is to load the images from disk into a big 20 GB uint8 array. This step is taking a very long time, loading at about 10 MB/s, and the CPU is idle during the task.
This seems extremely slow. Am I making an obvious mistake? How can I improve performance? Is it a problem with the NumPy array type?
import os
import numpy as np
from PIL import Image

# find all image files in working folder
FileNames = []  # FileNames is a list of image names
workingFolder = 'C:/folder'
for (dirpath, dirnames, filenames) in os.walk(workingFolder):
    FileNames.extend(filenames)
FileNames.sort()  # sorted by image number

imNumber = len(FileNames)  # number of images

# initialize AllImages from the first image
# (note: PIL's img.size is (width, height), while np.asarray(img) is (height, width))
img = Image.open(workingFolder + '/' + FileNames[0])
AllImages = np.zeros((img.size[1], img.size[0], imNumber), dtype=np.uint8)

for ii in range(imNumber):
    img = Image.open(workingFolder + '/' + FileNames[ii])
    AllImages[:, :, ii] = img
Thanks a lot for your help.
Since the CPU is idling, it sounds like the disk is the bottleneck. 10 MB/s is slow, though not so slow that it suggests a stone-age hard disk. If NumPy were the problem I'd expect the CPU to be busy running NumPy code rather than sitting idle.
Note that there may be two ways the CPU ends up waiting for the disk. First, of course, you need to read the data from disk in the first place; but second, since the data is 20 GB, it may be big enough that parts of it get swapped back out to disk. The usual solution to this kind of situation is to memory-map the file, which avoids copying data from disk into swap.
Try to check whether you can read the files faster by other means. For example, on Linux you could use dd if=/path/to/image of=/tmp/output bs=8k count=10k; rm -f /tmp/output to check the raw read speed into RAM. See this question for more information on measuring disk performance.
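For illustration, here is a minimal np.memmap sketch of the memory-mapping idea, assuming the images have already been concatenated into one raw uint8 file; the filename and dimensions below are made-up placeholders:

import numpy as np

# Example dimensions only; use the real image size and count.
height, width, imNumber = 384, 285, 1000

# Map the raw file instead of reading it all up front; pages are pulled
# in from disk lazily, only when a slice is actually accessed.
AllImages = np.memmap('allimages.dat', dtype=np.uint8, mode='r',
                      shape=(height, width, imNumber))

first_image = np.array(AllImages[:, :, 0])  # touching a slice triggers the actual read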
Related
I have directories containing 100K - 1 million images. I'm going to create a hash for each image so that I can, in the future, find an exact match based on these hashes. My current approach is:
import hashlib

def hash_test(images):  # images is a list of image paths
    hashes = []
    for image in images:
        with open(folder + image, 'rb', buffering=0) as f:
            hashes.append(hashlib.sha256(f.read()).hexdigest())
            # hashes.append(CityHash128(f.read()))
    return hashes
31%|███ | 102193/334887 [00:04<42:15, 112.02it/s]
From what I can tell from my experiments, the file.read() operation is my bottleneck, which means I am I/O bound. This is also confirmed by checking iotop. I am reading from an HDD. I have read about memory-mapped reading, but couldn't get my head around whether it is applicable in this situation or not.
My question is: is there a way to optimize this reading operation?
You can try to parallelise your hash computation as shown below. However, the performance depends on how many parallel I/O requests the disk can handle and on how many cores your CPU has. But it's worth a try.
import hashlib
from multiprocessing import Pool

# Returns the hashes as a list.
# Blocks until all parallel hash computations have completed.
def parallel_hash(images):
    with Pool(5) as pool:
        return pool.map(hash_test, images)

def hash_test(image):  # image is a single image path
    with open(folder + image, 'rb', buffering=0) as f:
        return hashlib.sha256(f.read()).hexdigest()
        # return CityHash128(f.read())

if __name__ == '__main__':
    hashes = parallel_hash(images)
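Since the bottleneck here is I/O rather than CPU, a thread pool may work just as well and avoids the overhead of spawning processes and pickling results. A minimal variant of the same idea, offered only as a suggestion:

from multiprocessing.pool import ThreadPool

def parallel_hash_threaded(images):
    # Threads are fine here because f.read() releases the GIL while waiting on the disk.
    with ThreadPool(5) as pool:
        return pool.map(hash_test, images)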
It's also possible that the problem has to do with the number of files in a directory. Some file systems experience severely degraded performance when you get many thousands of files in a single directory. If you have 100K or more files in a single directory, it takes significant time for the file system just to find the file before opening and reading it.
That said, let's think about this a bit. If I'm reading your output correctly, your program completed approximately 102K out of 335K files in four hours and 42 minutes. In round numbers, that's about 6 files per second. So it'll take about 15.5 hours to do all 335K files.
If this is a one-time task, then just set it up to run overnight, and it'll be done when you get back to work in the morning. If you have to index a million files, start the process on Friday night and it'll be done when you get into the office on Monday.
If it's not a one-time task, then you have other problems...
I am using NumPy memmap to load small amounts of data from various locations throughout a large binary file (each file is memmap'd, reshaped, flipped around, and then around 2000x1000 points are loaded from the roughly 2 GB file). There are five 2 GB files, each with its own memory-map object.
The memory maps are all created very quickly, and the slices of data from the first several files are pulled out very quickly. But then it suddenly stalls on the fourth and fifth files. Memory usage remains low, so it does not appear to be reading the whole file into memory, but I/O access from the process is high. It can easily take ten or fifteen minutes for this to clear, and then everything proceeds as expected. Subsequent access through all of the memory maps is extremely rapid, including loading data that was not previously touched. Memory usage remains low throughout. If I close Python and re-run, the problem does not reoccur until I reboot (caching, maybe?).
I'm on Windows 10 with Python 2.7. Any thoughts for troubleshooting?
EDIT: There was a request in the comments for the file format type and example code. Unfortunately, I cannot provide exact details; however, I can say this much. The files contain just int16 binary values for a 3D array that can be reshaped as [n1, n2, n3], where each n* is the length of the corresponding dimension. However, the files are split at 2 GB, so they are loaded like this:
memmaps = []
for filename in filelist:
    # map the raw int16 file, reshape it into (n1_chunk, n2, n3),
    # then transpose so the split dimension ends up last
    memmaps.append(np.memmap(filename, dtype=np.int16, mode='r'))
    memmaps[-1] = memmaps[-1].reshape([len(memmaps[-1]) // n2 // n3, n2, n3])
    memmaps[-1] = np.transpose(memmaps[-1], [2, 1, 0])
This certainly isn't the cleanest code in the world, but it generally works, except for this seemingly random slow down. The user has a slider which allows them to plot a slice from this array as
image = np.zeros([n2, n1], dtype=np.int16)
#####
c = 0
for d in memmaps:
    # paste the selected 2D slice from each file side by side
    image[:, c:(c + d.shape[2])] = d[slice, :, :]
    c = c + d.shape[2]
I'm leaving out a lot of detail, but I think this captures the most relevant information.
EDIT 2: Also, I am open to alternative approaches to handling this problem. My end goal is real time interactive plotting of an arbitrary and relatively small chunk of 2D data as an image from a large 3D dataset that may be split across multiple binary files. I'm presently using pyqtgraph with fairly reasonable results, except for this random problem.
I have some data stored in a tree in memory, and I regularly save the tree to disk using pickle.
Recently I noticed that the program was using a lot of memory, so I checked the saved pickle file; it is around 600 MB. Then I wrote another small test program that loads the tree back into memory, and I found that it takes nearly 10 times as much memory (5 GB) as the size on disk. Is that normal? And what's the best way to avoid that?
No it's not normal. I suspect your tree is bigger than you think. Write some code to walk it and add up all the space used (and count the nodes).
See memory size of Python data structure
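A minimal sketch of such a walk, assuming each node keeps its payload in a value attribute and its children in a children list; the attribute names and root are placeholders for your actual tree:

import sys

def tree_stats(node):
    # Returns (node_count, approximate_bytes); sys.getsizeof is shallow,
    # so this undercounts anything nested inside the payloads.
    count = 1
    size = sys.getsizeof(node) + sys.getsizeof(node.value)
    for child in node.children:
        c, s = tree_stats(child)
        count += c
        size += s
    return count, size

nodes, approx_bytes = tree_stats(root)  # 'root' is your tree's root node
print(nodes, 'nodes, roughly', approx_bytes, 'bytes')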
Also, what exactly are you asking? Are you surprised that a 600 MB data structure on disk takes 5 GB in memory? That's not particularly surprising. Pickle stores the data in a compact serialized form, without the per-object overhead Python needs in memory, so you expect it to be smaller on disk. Smaller by a factor of roughly 10 is pretty good.
If you're surprised by the size of your own data that's another thing.
I am currently in a lab that uses IPython Notebook with Python 2.7 for data processing. We work on pictures taken by a 285*384 pixel camera, with different parameters changing according to what we want to observe. Therefore, we need to deal with big matrices, and as the data processing progresses, the accumulation of matrix allocations fills up the RAM / swap and we cannot go any further.
The typical initial data matrix has size 100*285*384*16. Then we have to allocate numerous other matrices: to calculate the temporal average corresponding to this matrix (of size 285*384*16, 100 being the temporal dimension), then to fit the data linearly, which gives two 100*285*384*16 matrices (2 estimated parameters needed for the linear fit), then to calculate the average and the standard deviation of those fits... and so on. So we allocate a lot of big matrices, which fills up the RAM / swap. We also display some pictures associated with some of these matrices.
Of course we could deallocate matrices as we go further in the data processing, but we need to be able to change the code and look at the results of old calculations without having to rerun everything (calculations are sometimes pretty long). All results depend on the previous ones, so we need to keep the data in memory.
I would like to know whether there is some way to extend the swap memory (onto the "physical" memory of a disk, for example) or to bypass our RAM limitations with a smarter way of coding. Otherwise I could use a server at my laboratory institute that has 32 GB of RAM, but it would be a loss of time and convenience for us not to be able to do it on our own computers. The crash occurs on both Macintosh and Windows, and given the RAM limitations for Python on Windows I will probably try it with Linux, but the 4 GB of RAM of our computers will still be overfilled at some point.
I would really appreciate any help with this problem; I haven't found any answers on the net so far. Thank you in advance for your help.
You can drastically reduce your RAM requirements by storing the images to disk in HDF5 format, using compression, with PyTables. Depending on your specific data, this can give significant performance gains compared to an all-in-RAM approach.
The trick is to use the blazing-fast blosc compression included in PyTables.
As an example, this code creates a file containing multiple NumPy arrays using blosc compression:
import tables
import numpy as np

img1 = np.arange(200*300*100)
img2 = np.arange(200*300*100)*10

h5file = tables.open_file("image_store.h5", mode="w", title="Example images",
                          filters=tables.Filters(complevel=5, complib='blosc'))
h5file.create_carray('/', 'image1', obj=img1, title='The image number 1')
h5file.create_carray('/', 'image2', obj=img2, title='The image number 2')
h5file.flush()  # this makes sure everything is flushed to disk
h5file.close()  # closes the file; the previous flush is redundant here
and the following code snippet loads the two arrays back in RAM:
h5file = tables.open_file("image_store.h5")   # by default the file is opened read-only
img1 = h5file.root.image1[:]      # load image1 into RAM by "slicing"
img2 = h5file.root.image2.read()  # load image2 into RAM with read()
Finally, if a single array is too big to fit in RAM, you can save and read it chunk-by-chunk using the conventional slicing notation. You create a (chunked) PyTables array on disk with a preset size and type and then fill it in chunks like this:
h5file.create_carray('/', 'image_big', title='Big image',
                     atom=tables.Atom.from_dtype(np.dtype('uint16')),
                     shape=(200, 300, 400))
h5file.root.image_big[:100] = 1
h5file.root.image_big[100:200] = 2
h5file.flush()
Note that this time you don't provide a NumPy array to PyTables (the obj keyword); instead you create an empty array, and therefore you need to specify the shape and the type (atom).
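Reading works the same way: only the slices you ask for are loaded into RAM (the selections below are just examples):

chunk = h5file.root.image_big[:100]       # reads only the first 100 planes into RAM
line = h5file.root.image_big[150, 10, :]  # or any smaller selection you need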
For more info you can check out the official pytables documentation:
PyTables Documentation
I am grabbing PIL images from the screen with screengrab, putting them in a queue, and writing them out as a JPEG image sequence.
I use a producer thread to capture and a worker thread to write the images to disk.
However, I noticed that this queue grows really large really fast, even though the written output is not that large once compressed as JPEG. This leads to the grabs being pushed into swap on disk, making the write process even slower. Since my data comes in bursts, I can afford some time to write to disk, but once memory starts being swapped out it just gets too slow.
Is there a way to compress the images before adding them to the queue?
cheers,
Here's an idea: merge the images as they come in.
After a set period of time, or once a set number have been merged, compress the merged image. Later, divide it back into the separate frames.
/profit
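A rough sketch of this merge-then-compress idea with Pillow, compressing in memory before anything is queued; the batch handling, JPEG quality, and helper names are assumptions for illustration, not part of the original setup:

from io import BytesIO
from PIL import Image

def merge_and_compress(frames, quality=80):
    # Stack a batch of same-size grabs vertically and JPEG-compress the
    # merged sheet in memory; the returned bytes are what goes in the queue.
    w, h = frames[0].size
    sheet = Image.new('RGB', (w, h * len(frames)))
    for i, frame in enumerate(frames):
        sheet.paste(frame, (0, i * h))
    buf = BytesIO()
    sheet.save(buf, format='JPEG', quality=quality)
    return buf.getvalue(), (w, h, len(frames))

def split(blob, geometry):
    # Decode the merged JPEG and cut it back into the individual frames.
    w, h, n = geometry
    sheet = Image.open(BytesIO(blob))
    return [sheet.crop((0, i * h, w, (i + 1) * h)) for i in range(n)]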