I am working on a Python program that reads a lot of images in batches (let's say 500 images) and stores them in a numpy array.
Right now it is single-threaded, and the IO is very fast; the part that takes a lot of time is creating the numpy array and doing something with it.
By using the multiprocessing module, I am able to read the images and create the array in another process. But I am having trouble letting the main process access that data.
I have tried:
1: Using multiprocessing.Queue: very slow. I believe it's the pickling and unpickling that wastes the time; pickling and unpickling a large numpy array takes quite a while.
2: Using Manager.list(): faster than queues, but accessing it from the main process is still very slow. Even just iterating over the list and doing nothing takes 2 seconds per item.
I don't understand why it takes so much time.
Any suggestions? Thanks.
Looks like I have to answer my own question.
The problem I was facing can be solved by using shared memory with numpy.
More details can be found at:
Use numpy array in shared memory for multiprocessing
The idea is basically to create the shared memory in the main process and map a numpy array onto it. Later, in the other processes, you can either read from it or write to it.
This approach works pretty well for me; it speeds up my program by a factor of 10, because I am processing a large amount of data and pickling is not an option for me.
The most critical code is:
shared_arr = mp.Array(ctypes.c_double, N)
arr = tonumpyarray(shared_arr)
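Here, tonumpyarray is roughly the helper from the linked answer (a sketch, not the exact code): the shared buffer is wrapped in an ndarray without any copying.
import numpy as np

def tonumpyarray(mp_arr):
    # View the shared ctypes buffer as an ndarray; the default float64 dtype
    # matches ctypes.c_double, and no data is copied.
    return np.frombuffer(mp_arr.get_obj())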
I would like to know how Python actually manages memory allocation for ndarrays.
I loaded a file that contains 32K floating-point values using numpy's loadtxt, so the ndarray should take 256KB of data.
Actually, ndarray.nbytes gives the right size.
However, the memory occupation after loading the data increases by 2MB: I don't understand this difference.
I'm not sure exactly how you measure memory occupation, but when looking at the memory footprint of your entire app there is a lot more going on that can cause this kind of increase.
In this case, I suspect that the loadtxt function uses some buffering or otherwise copies the data, and that those copies haven't been cleared yet by the GC.
But other things could be happening as well. Maybe the numpy back-end loads some extra stuff the first time it initialises an ndarray. Either way, you can only truly figure this out by reading the numpy source code, which is freely available on GitHub. The implementation of loadtxt can be found here: https://github.com/numpy/numpy/blob/5b22ee427e17706e3b765cf6c65e924d89f3bfce/numpy/lib/npyio.py#L797
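If you want to see how much of the increase is the array itself versus everything else, one rough way (a sketch, assuming Linux, where ru_maxrss is reported in kilobytes; the file name is a placeholder) is to compare ndarray.nbytes with the process's peak RSS before and after the load:
import resource
import numpy as np

def peak_rss_kb():
    # Peak resident set size of this process so far, in KB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
data = np.loadtxt("values.txt")     # placeholder for your 32K-float file
after = peak_rss_kb()

print("array itself:", data.nbytes / 1024.0, "KB")
print("RSS increase:", after - before, "KB")   # also counts loadtxt's temporary buffers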
I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using it.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent.futures; multiprocessing is fine too. It's a similar problem: it's not obvious how to use it inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing, and it supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for your case.
Maybe joblib's documentation is not explicit enough on this point (it only shows examples with for loops), but since you want to use a generator I should point out that it does indeed work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    ...

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. That way you won't have the problem of transferring large numpy arrays between processes, and you still benefit from true parallelism.
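For illustration, a rough sketch of the threaded variant with a bounded number of in-flight tasks, so the infinite generator is consumed lazily. It assumes the heavy work releases the GIL (numpy usually does); do_work, parallel_generator and max_pending are placeholder names, not part of any library:
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def do_work(x):
    # heavy numpy processing that releases the GIL
    return x.sum()

def parallel_generator(source, workers=4, max_pending=8):
    # Keep at most max_pending tasks in flight instead of draining the source.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(do_work, next(source)) for _ in range(max_pending)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()          # note: results may come out of order
                pending.add(pool.submit(do_work, next(source)))

for result in parallel_generator(generator()):
    print(result)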
I get a memory error when processing a very large (>50GB) file (the problem: RAM gets full).
My idea is to read only 500 kilobytes of data at a time, process it, delete it from memory, and move on to the next 500 KB. Is there a better solution? Or, if this approach seems reasonable, how do I do it with a numpy array?
This is just a quarter of the code (just to give an idea):
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value
dec_array = data.flatten()
........
I get the memory error at this point itself, as it tries to put all the data into memory.
Quick answer
numpy.memmap allows presenting a large file on disk as a numpy array. I don't know if it allows mapping files larger than RAM+swap, though. Worth a shot.
[Presentation about out-of-memory work with Python](http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
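A minimal sketch of the memmap approach on a raw binary file (the file name, dtype and shape below are placeholders, not taken from your data; for HDF5 see the chunked-reading sketch further down):
import numpy as np

# Placeholder file/dtype/shape; memmap works on raw binary files on disk.
n_values = 100 * 1000 * 1000
data = np.memmap("data.bin", dtype=np.float64, mode="r", shape=(n_values,))

# Only the slices you actually index get paged into physical memory.
chunk = np.asarray(data[:500 * 1000])   # copy one chunk into RAM and process it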
Longer answer
A key question is how much RAM you have (<10GB or >10GB) and what kind of processing you're doing (whether you need to look at each element of the dataset only once, or need to look at the whole dataset at once).
If it's <10GB and you need to look at the data only once, then your approach seems like the most decent one. It's a standard way to deal with datasets that are larger than main memory. What I'd do is increase the size of a chunk from 500 KB to something closer to the amount of memory you have, perhaps half of physical RAM; in any case something in the GB range, but not large enough to cause swapping to disk and interfere with your algorithm. A nice optimisation would be to hold two chunks in memory at a time: one is being processed while the other is being loaded in parallel from disk. This works because loading from disk is relatively expensive but doesn't require much CPU work; the CPU is basically waiting for the data to load. It's harder to do in Python because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here.
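Since the data is already in HDF5, you can read it straight from the dataset chunk by chunk instead of materialising everything with dataset.value. A minimal sketch (the chunk length and the processing step are placeholders, and a 1-D dataset is assumed):
import h5py

chunk_len = 1000 * 1000   # elements per chunk; tune this towards a GB-sized slice

with h5py.File("test.h5", "r") as f:
    dataset = f["IMG_Data_2"]
    n = dataset.shape[0]
    for start in range(0, n, chunk_len):
        chunk = dataset[start:start + chunk_len]   # only this slice is read into RAM
        # ... process chunk, then let it go out of scope ...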
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or even doing random accesses in the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or you'll need to implement some logic to make the algorithms in numpy etc work with data on disk directly rather than have it in RAM.
If you have >10GB of RAM, you might let the operating system do the hard work for you and increase swap size enough to capture all the dataset. This way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase it is OS specific though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
I'm using multiprocessing.Queue to pass numpy arrays of float64 between python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
According to the multiprocessing documentation, objects placed on the Queue will be pickled. Calling pickle on a numpy array results in a text representation of the data, so null bytes get replaced by the string "\\x00".
>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.
Is there a way to pass the data through the queue in a raw unaltered form?
I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.
Thanks!
The issue isn't with numpy, but with the default settings for how pickle represents data (as strings, so the output is human-readable). You can change the default settings for pickle to produce binary data instead.
import numpy
import cPickle as pickle
N = 1000
a0 = pickle.dumps(numpy.zeros(N))
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)
print "a0", len(a0) # 32155
print "a1", len(a1) # 8133
Also note that if you want to decrease processor work and time, you should probably use cPickle instead of pickle (but the space savings from the binary protocol apply regardless of the pickle version).
On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code. Basically, for every line of code that uses that data, you will need to worry about whether some other line of code in another process is simultaneously using that data. How hard this will be will depend on what you're doing. The advantages are that you save time sending the data back and forth. The question that Eelco cites is for a 60GB array, and for this there's really no choice, it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
That should cover it all. Pickling of incompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.
Despite the warnings and confused feelings I got from the ton of questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation, updated a global NumPy array via the callback parameter, and immediately got a x5 speedup on an 8-core machine.
Now, I probably didn't get x8 because of the lock required by each callback call, but what I got is encouraging.
I'm trying to find out if this can be improved upon, or if this is a good result. Questions:
I suppose the returned NumPy arrays got pickled?
Were the underlying NumPy buffers copied or just passed by reference?
How can I find out what the bottleneck is? Any particularly useful technique?
Can I improve on that or is such an improvement pretty common in such cases?
I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when passing NumPy arrays around. All you have to do is, instead of:
import numpy as np
foo = np.empty(1000000)
do this:
import sharedmem
foo = sharedmem.empty(1000000)
and off you go passing foo from one process to another, like:
q = multiprocessing.Queue()
...
q.put(foo)
Note, however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.
Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock.)
Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.
How I did it
It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, so I'm doing this:
First, on Unix systems, any object you instantiate before spawning a process will be present (and copied) in the context of the process if needed. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.
Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.
Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).
Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
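A stripped-down sketch of that whole workflow, using Pool.map instead of an explicit Queue for brevity (load_image, split_into_chunks, all_image_paths and the sum-based reduction are placeholders, not my real functions):
import multiprocessing as mp
import numpy as np

def load_and_reduce(image_paths):
    # Each worker loads its own images from disk and reduces them locally,
    # so no image data ever has to be sent to the workers; only the small
    # list of paths is inherited/pickled.
    acc = None
    for path in image_paths:
        img = load_image(path)                  # placeholder loader
        acc = img if acc is None else acc + img
    return acc

if __name__ == "__main__":
    chunks = split_into_chunks(all_image_paths, n_chunks=6)   # placeholder helper
    with mp.Pool(processes=6) as pool:
        partials = pool.map(load_and_reduce, chunks)   # one intermediate result per chunk
    final = np.sum(partials, axis=0)                   # final reduction in the main process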
I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.
(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)
Profiling
Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:
Compared to the built-in (at least in my shell) time command, the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU, and so on. I run it like this to get everything I can:
$ /usr/bin/time -v python test.py
Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.
I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on/off; you never know when you might need it.
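For example, a sketch of per-worker profiling behind such a flag (DEBUG_PROFILE and do_the_work are placeholder names; each worker dumps its own .prof file, which you can inspect later with pstats):
import cProfile
import os

DEBUG_PROFILE = True   # hypothetical toggle

def worker(task):
    if DEBUG_PROFILE:
        profiler = cProfile.Profile()
        profiler.enable()
    try:
        return do_the_work(task)   # placeholder for the real worker body
    finally:
        if DEBUG_PROFILE:
            profiler.disable()
            profiler.dump_stats("worker_%d.prof" % os.getpid())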
Last but not least, you can get some more or less useful information from a syscall profile (mostly to see if the OS isn't taking ages transferring heaps of data between the processes) by attaching to one of your running Python processes, like:
$ sudo strace -c -p <python-process-id>