I am using scipy.signal.correlate to perform large 2D convolutions. I have a large number of arrays that I want to operate on, and so naturally thought that multiprocessing.Pool could help. However, using the following simple setup (on a 4-core CPU) provides no benefit.
import multiprocessing as mp
import numpy as np
from scipy import signal
arrays = [np.ones([500, 500])] * 100
kernel = np.ones([30, 30])
def conv(array, kernel):
    return (array, kernel, signal.correlate(array, kernel, mode="valid", method="fft"))
pool = mp.Pool(processes=4)
results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]
Changing the process count among {1, 2, 3, 4} gives approximately the same runtime (2.6 s ± 0.2).
What could be going on?
I think the problem is in this line:
results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]
Pool.apply is a blocking operation: it runs conv on one element and waits for it to finish before moving to the next, so even though it looks like you are multiprocessing, nothing is actually being distributed. You need to use pool.map instead to get the behavior you are looking for:
from functools import partial
conv_partial = partial(conv, kernel=kernel)
results = pool.map(conv_partial, arrays)
What could be going on?
At first I thought that the reason was that the underlying FFT implementation is already parallelized: while the Python interpreter is single threaded, the C code called from Python may be able to fully utilize your CPU.
However, the underlying FFT implementation of scipy.signal.correlate seems to be fftpack in NumPy (translated from Fortran written in 1985), which appears to be single threaded judging from the fftpack page.
Indeed, I ran your script and got considerably higher CPU usage. With four cores on my computer and four processes I got a speedup of ~2 (for 2000x2000 arrays and using the code changes from Raw Dawg).
The overhead from creating the processes and communicating with the processes eats up parts of the benefits of more efficient CPU usage.
You can still try to optimize things: reduce the array sizes, compute only the part of the correlation you actually need, reuse the FFT of the kernel if the kernel stays the same (this requires implementing part of scipy.signal.correlate yourself), switch to single precision instead of double, or run the correlation on the GPU with CUDA.
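If the kernel really is fixed, here is a minimal sketch of the kernel-FFT reuse idea (the helper name valid_correlate_fft and the use of numpy.fft are my own choices, not scipy API):

import numpy as np

arrays = [np.ones([500, 500])] * 100
kernel = np.ones([30, 30])

# correlation is convolution with the flipped kernel, so flip once and FFT once
fshape = [a + k - 1 for a, k in zip(arrays[0].shape, kernel.shape)]
kernel_fft = np.fft.rfftn(kernel[::-1, ::-1], s=fshape)

def valid_correlate_fft(array):
    full = np.fft.irfftn(np.fft.rfftn(array, s=fshape) * kernel_fft, s=fshape)
    # keep only the "valid" region, where the kernel fully overlaps the array
    crop = tuple(slice(k - 1, a) for a, k in zip(array.shape, kernel.shape))
    return full[crop]

results = [valid_correlate_fft(arr) for arr in arrays]

The output should match signal.correlate(array, kernel, mode="valid", method="fft") up to floating-point error, while the kernel FFT is computed only once.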
I am trying to perform a Monte Carlo computation in parallel using Python. The problem is embarrassingly parallel: I need to compute a function N times and add the outputs together; each computation is independent, and the addition is a simple element-wise addition of arrays.
So far I have tried two approaches:
Using multiprocessing's map (imap_unordered in the code below) and then Python's reduce. The problem is that I run out of memory, because all the intermediate results are stored even though I do not need them all at once.
The code looks like this:
from multiprocessing import Pool
from functools import reduce
import tqdm
import numpy as np
n_cpu = 8
pool = Pool(n_cpu)
out1 = list(tqdm.tqdm(pool.imap_unordered(f, args, chunksize=1000)))
out = reduce(np.add, out1)
This way I obtain poor scaling with n_cpu, and the code crashes with a memory error if the input size len(args) is too large.
I then tried to solve this with pyspark, using the following code:
import pyspark, findspark
import numpy as np
findspark.init()
number_cores = 8
memory_gb = 8
conf = (
    pyspark.SparkConf()
    .setMaster('local[{}]'.format(number_cores))
    .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext(conf=conf)
out = sc.parallelize(range(N_samples)).repartition(number_cores).map(function).reduce(lambda a, b: np.add(a, b))
The repartitioning is done explicitly for clarity and is set equal to the number of cores; I assumed this was the best choice, since the function to compute is computationally heavy.
The problem is that I obtain similar performance to the multiprocessing method.
My question is:
Is there a way to make the code scale better with the number of cores? Is there a way to use multiprocessing's imap_unordered() and reduce the results as they arrive, before the whole computation is done?
Correct me if I'm wrong, but to generalize your problem:
You have one machine with multiple cores, and you are trying to run some algorithm on it with the best possible performance.
If the statement above is true, then Spark is most probably not what you need. Spark is primarily for distributed computing: if you have multiple machines, you can distribute your work across them and perform computations faster.
If you have only one machine, you can hardly get better performance than with a multiprocessing or multithreaded approach, since Spark does a lot of work that is not required for single-machine computation.
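As for the second part of the question: in the multiprocessing version, the intermediate list is not strictly necessary. A minimal sketch (assuming f and args are as in your first snippet) that folds the results into a single accumulator as imap_unordered yields them, so only one worker result is held at a time:

from multiprocessing import Pool
import numpy as np

def streaming_sum(f, args, n_cpu=8, chunksize=1000):
    with Pool(n_cpu) as pool:
        it = pool.imap_unordered(f, args, chunksize=chunksize)
        total = next(it)              # the first result initializes the accumulator
        for partial in it:
            total = np.add(total, partial)
    return total

This addresses the memory error, not the scaling; the scaling part is covered above.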
I'm trying to compute the matrix product Y = XX^T for a matrix X of size 10,000 x 800,000. The matrix X is stored on disk in an h5py file. The resulting Y should be a 10,000 x 10,000 matrix stored in the same h5py file. Here is a reproducible code sample.
import dask.array as da
from blaze import into
into("h5py:///tmp/dummy::/X", da.ones((10**4, 8*10**5), chunks=(10**4,10**4)))
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**4,10**4)))
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)
I expected this computation to go smoothly as each (10,000*10,000) chunk should be individually transposed, followed by a dot product and then summed up to the final result. However, running this computation fills both my RAM and swap memory until the process eventually gets killed.
Here is a sample of the computation graph plotted with dot_graph: Computation graph sample
According to the scheduling doc at http://dask.pydata.org/en/latest/scheduling-policy.html,
I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed. This would free the memory of these tensordot intermediary results, so that we would not face memory errors.
Playing around with a smaller toy example:
from dask.diagnostics import Profiler, CacheProfiler, ResourceProfiler
# Experiment on a (1,000 x 5,000) slice of X split into 500 chunks of size (1,000 x 10)
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3, 10))[:10**3, :5*10**3]
y = x.T.dot(x)
with Profiler() as prof, CacheProfiler() as cprof, ResourceProfiler() as rprof:
    into("h5py:///tmp/dummy::/Y", y)
rprof.visualize()
I get the following display:
Resource profiler
The green bar represents the sum operation, while the yellow and purple bars represent the get_array and tensordot operations, respectively. This seems to indicate that the sum operation waits for all intermediary tensordot operations to be performed before summing them. This would also explain my process running out of memory and getting killed.
So my questions are:
Is that the normal behavior of the sum operation?
Is there a way to force it to compute intermediary sums before all the intermediary tensordot products are computed and kept in memory?
If not, is there a work around that does not involve spilling to disk?
Any help much much appreciated!
Generally speaking, performing a dense matrix-matrix multiply in small space is hard. This is because every intermediate chunk will be used by several of the output chunks.
According to the scheduling doc at http://dask.pydata.org/en/latest/scheduling-policy.html I would expect the upper tensordot intermediary results to be summed up one by one into the last sum result as soon as they have been individually computed.
The graph that you have shown has many inputs to a sum function. Dask will wait until all of those inputs are complete before running the sum function. The task scheduler has no idea that sum is associative and can be run piece by piece. This lack of semantic information is the price you pay for using a general task scheduling system like Dask rather than a dedicated linear algebra library. If your goal is to perform dense linear algebra as efficiently as possible then you might want to look elsewhere; this is a well covered field.
So, as written, your memory requirements are at least 8e5 * 1e4 * dtype.itemsize bytes (with float64 that is 8e9 elements times 8 bytes, roughly 64 GB), assuming that Dask proceeds in exactly the right order (which it should mostly do).
You might try the following:
Reduce the chunksize along the non-contracting dimension (see the sketch after this list)
Use a version of Dask later than 0.14.1 (0.14.2 should be released by May 5th, 2017), where we break down those large sum calls into many smaller ones explicitly in the graph.
Use the distributed scheduler, which handles writing data to disk more efficiently.
from dask.distributed import Client
client = Client(processes=False) # create a local cluster in this process
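For the first suggestion, a rough sketch of what smaller chunks along the non-contracted (first) dimension of X could look like (the exact chunk sizes here are illustrative):

x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3, 10**4))
y = x.dot(x.T)   # each block of Y is now 1,000 x 1,000 instead of 10,000 x 10,000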
I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to make a generator concurrent using it.
The difference from a regular concurrent scenario using map is that there is nothing to map over here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent.futures; multiprocessing is fine too. It's a similar problem: it's not obvious how to use it inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for you.
Maybe joblib's documentation is not explicit enough on this point, as it only shows examples with for loops, but since you want to use a generator I should point out that it does work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed
def my_long_running_job(x):
    # do something with x
    ...

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, while still benefiting from true parallelism.
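For example, joblib's threading backend keeps the same pattern; a minimal sketch, assuming my_long_running_job spends its time in GIL-releasing code such as heavy numpy operations:

# workers are threads sharing memory, so the arrays are not copied between processes
Parallel(n_jobs=4, backend="threading")(delayed(my_long_running_job)(x) for x in generator())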
I would like some help understanding exactly what I have done and why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
from joblib import Parallel, delayed
def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches
def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1) / 2
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values=0)
    A = Parallel(n_jobs=1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with n_jobs, expecting that increasing it would speed up my function. However, as I increase n_jobs things slow down quite significantly. Running this code without Parallel is slower than with n_jobs=1, but increasing the number of jobs beyond 1 makes it slower again.
Why is this the case? I understood that the more jobs I run, the faster the script should be. Am I using this wrong?
Thanks!
Maybe your problem is caused by image_pad being a large array. In your code, you are using the default multiprocessing backend of joblib. This backend creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and broadcast to each worker in the pool, which can lead to serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, you should switch to the threading backend, if you are able to release the global interpreter lock when calling frame, or switch to the shared memory approach of joblib.
The docs say that joblib provides an automated memmap conversion that could be useful.
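A minimal sketch of the manual variant of that memmapping approach (the file path here is illustrative): dump image_pad to disk once and hand the workers a read-only memory-mapped view, so the array is not re-serialized for every call:

import joblib
from joblib import Parallel, delayed

joblib.dump(image_pad, '/tmp/image_pad.joblib')
image_pad_mm = joblib.load('/tmp/image_pad.joblib', mmap_mode='r')

# workers receive only a lightweight memmap reference instead of a full copy
A = Parallel(n_jobs=4)(delayed(frame)(i, image_pad_mm, m) for i in indeces)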
It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimisation and parallelising Python code that iterating through large loops is an inherently slow operation for a Python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However - there are things that can be done.
The Cython and Numba compilers are both designed to optimise code that is written in a C/C++-like style (i.e. your case). In particular, Numba's @vectorize decorator allows scalar functions to be applied, in parallel, over large arrays (target='parallel').
I don't understand your code well enough to give an example implementation, but try this! These compilers, used in the right way, have brought me speed increases of 3,000,000% for parallel processes in the past!
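To make that concrete, a small self-contained sketch of the Numba decorator mentioned above; the function itself is just an illustration, not tied to your code:

import numpy as np
from numba import vectorize

# compiled element-wise function, executed across multiple threads
@vectorize(['float64(float64, float64)'], target='parallel')
def scaled_sum(a, b):
    return a + 0.5 * b

x = np.random.rand(10**7)
y = np.random.rand(10**7)
z = scaled_sum(x, y)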
I have a strong background in numeric computation using FORTRAN and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelization for numeric tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel, while keeping only a single copy of the data in main memory (shared data). I started to use the Python module multiprocessing for this and came up with this generic example:
#test cases
#python parallel_python_example.py 1000 1000
#python parallel_python_example.py 10000 50
import sys
import numpy as np
import time
import multiprocessing
import operator
n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])
#class which contains a large dataset and a computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim)   #define large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim)   #define many random vectors, organized in a matrix
    def dot(self, a, b):   #dont use numpy, to run on a single core only!!
        return sum(p*q for p, q in zip(a, b))
    def __call__(self, ii):   # use __call__ as the computation so that it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :])   #compute product of one of the vectors and the matrix
        return self.dot(vector, vector)   # return "length" of the result vector
#initialize data
comp = compute(n_dim,n_vec)
#single core
tt=time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time()-tt
print "Time:",time_single
#multi core
for prc in [1,2,4,10]:   # the 10-process case is there to check that large_matrix is only once in main memory
tt=time.time()
pool = multiprocessing.Pool(processes=prc)
result = pool.map(comp,range(n_vec))
pool.terminate()
time_multi = time.time()-tt
print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc,time_multi,time_single/time_multi)
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre@lot:python> python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre@lot:python> python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is: am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e. don't parallelize within Python itself but rely entirely on numpy's optimizations)?
While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key to great number-crunching performance in Python.
In principle, however, Python (plus third-party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance while writing (much!) less code than you needed before, when doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!
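As a quick illustration of that point (a sketch; the matrix sizes are arbitrary), a plain dot product on an MKL- or OpenBLAS-backed numpy already keeps several cores busy, with no multiprocessing involved:

import numpy as np

# the thread count is controlled by the BLAS layer, e.g. via OMP_NUM_THREADS or MKL_NUM_THREADS
a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = np.dot(a, b)   # multi-threaded inside the BLAS library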
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be explicitly programmed and happens mainly through pipes. Processes that talk a lot to each other spend most of their time talking and not number crunching. Hence, multiprocessing is best used in cases where the transmission time for input and output data is small compared to the computing time. There are also tricks: you can, for instance, make use of Linux's fork() behavior and share large amounts of memory (read-only!) among multiple multiprocessing processes without having to pass this data around through pipes; see the sketch below. You might also want to have a look at https://stackoverflow.com/a/17786444/145400.
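A minimal sketch of that fork() trick (the names are mine; it relies on the fork start method, i.e. Linux): module-level data created before the pool is inherited copy-on-write by the workers, so the large read-only array is never pickled or pushed through a pipe:

import multiprocessing as mp
import numpy as np

big_matrix = np.random.rand(5000, 5000)   # created before the Pool is forked

def row_norm(i):
    # workers read big_matrix directly from the inherited address space
    return np.dot(big_matrix[i], big_matrix[i])

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        norms = pool.map(row_norm, range(big_matrix.shape[0]))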
Cython has already been mentioned; you can use it in special situations to replace performance-critical parts of your Python program with compiled code.
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP 8 when writing Python code :-)) and (b) especially regarding number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on communication overhead.
Generally speaking, you should always try to find a way, from within Python, to let compiled code do the heavy work for you. Numpy and SciPy provide great interfaces for that.
Number crunching with Python... You should probably learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as the backend.
From the test results you supplied, it appears that you ran your tests on a two-core machine. I have one of those and ran your test code, getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.
On my two-core machine, approximately 20% of the CPU is absorbed simply by keeping my environment going, so when I see a 1.8x improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work, the more cores the better, as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.