Will multiprocessing be a good solution for this operation? - python

while True:
Number = len(SomeList)
OtherList = array([None]*Number)
for i in xrange(Number):
OtherList[i] = (Numpy Array Calculation only using i_th element of arrays, Array_1, Array_2, and Array_3.)
'Number' number of elements in OtherList and other arrays can be calculated seperately.
However, as the program is time-dependent, we cannot proceed further job until every 'Number' number of elements are processed.
Will multiprocessing be a good solution for this operation?
I should to speed up this process maximally.
If it is better, please suggest the code please.

It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing what are specific calculations or sizes of the arrays here're generic guidelines in no particular order:
measure performance of your code. Find hot-spots. Your code might load input data longer than all calculations. Set your goal, define what trade-offs are acceptable
check with automated tests that you get expected results
check whether you could use optimized libraries to solve your problem
make sure algorithm has adequate time complexity. O(n) algorithm in pure Python can be faster than O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution.
rewrite places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup worth it to keep C extensions.
minimize allocation and data copying. Make it cache friendly.
explore whether multiple threads might be useful in your case e.g., cython.parallel.prange(). Release GIL.
Compare with multiprocessing approach. The link above contains an example how to compute different slices of an array in parallel.
Iterate

Since you have a while True clause there I will assume you will run a lot if iterations so the potential gains will eventually outweigh the slowdown from the spawning of the multiprocessing pool. I will also assume you have more than one logical core on your machine for obvious reasons. Then the question becomes if the cost of serializing the inputs and de-serializing the result is offset by the gains.
Best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass on any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it on as the args when calling Process(). This way you reduce the amount of data that needs to be picked and passed on via IPC (which is what multiprocessing does)
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.

Related

How to run generator code in parallel?

I have code like this:
def generator():
while True:
# do slow calculation
yield x
I would like to move the slow calculation to separate process(es).
I'm working in python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The differences from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once, we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent, multiprocessing is fine also. It's a similar problem, it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking for you.
Maybe joblib's documentation is not explicit enough on this point, showing only examples with for loops, since you want to use a generator I should point out that it indeed works with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed
def my_long_running_job(x):
# do something with x
# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and still beneficiate from true parallelism.

Python slow on for-loops and hundreds of attribute lookups. Use Numba?

i am working on a simple showcase SPH (smoothed particle hydrodynamics, not relevant here though) implementation in python. The code works, but the execution is kinda sluggish. I often have to compare individual particles with a certain amount of neighbours. In an earlier implementation i kept all particle positions and all distances-to-each-existing-particle in large numpy arrays -> to a certain point this was pretty fast. But visually not pleasing and n**2. Now i want it clean and simple with classes + kdTree to speed up the neighbour search.
this all happens in my global Simulation-Class. Additionally there's a class called "particle" that contains all individual informations. i create hundreds of instances before and loop through them.
def calculate_density(self):
#Using scipys advanced nearest neighbour seach magic
tree = scipy.spatial.KDTree(self.particle_positions)
#here we go... loop through all existing particles. set attributes..
for particle in self.my_particles:
#get the indexes for the nearest neighbours
particle.index_neighbours = tree.query_ball_point(particle.position,self.h,p=2)
#now loop through the list of neighbours and perform some additional math
particle.density = 0
for neighbour in particle.index_neighbours:
r = np.linalg.norm(particle.position - self.my_particles[neighbour].position)
particle.density += particle.mass * (315/(64*math.pi*self.h**9)) *(self.h**2-r**2)**3
i timed 0.2717630863189697s for only 216 particles.
Now i wonder: what to do to speed it up?
Most tools online like "Numba" show how they speed up math-heavy individual functions. I dont know which to choose. On a sidenode, i cannot even get Numba to work in this case. I get a looong error message. And i hoped it is as simple as slapping "#jit" in front of it.
I know its the loops with the attribute calls that crush my performance anyway - not the math or the neighbour search. Sadly iam a novice to programming and i liked the clean approach i got to work here :( any thoughts?
These kind of loop-intensive calculations are slow in Python. In these cases, the first thing you want to do is to see if you can vectorize these operations and get rid of the loops. Then actual calculations will be done in C or Fortran libraries and you will get a lot of speed up. If you can do it usually this is the way to go, since it is much easier to maintain your code.
Some operations, however, are just inherently loop-intensive. In these cases using Cython will help you a lot - you can usually expect 60X+ speed up when you cythonize your loop. I also had similar experiences with numba - when my function becomes complicated, it failed to make it faster, so usually I just use Cython.
Coding in Cython is not too bad - much easier than actually code in C because you can access numpy arrays easily via memoryviews. Another advantage is that it is pretty easy to parallelize the loop with openMP, which can gives you additional 4X+ speedups (of course, depending on the number of cores you have in your machine), so your code can be hundreds times faster.
One issue is that to get the optimal speed, you have to remove all the python calls inside your loop, which means you cannot call numpy/scipy functions. So you have to convert tree.query_ball_point and np.linalg.norm part to Cython for optimal speed.

Running parallel iterations

I am trying to run a sort of simulations where there are fixed parameters i need to iterate on and find out the combinations which has the least cost.I am using python multiprocessing for this purpose but the time consumed is too high.Is there something wrong with my implementation?Or is there better solution.Thanks in advance
import multiprocessing
class Iters(object):
#parameters for iterations
iters['cwm']={'min':100,'max':130,'step':5}
iters['fx']={'min':1.45,'max':1.45,'step':0.01}
iters['lvt']={'min':106,'max':110,'step':1}
iters['lvw']={'min':9.2,'max':10,'step':0.1}
iters['lvk']={'min':3.3,'max':4.3,'step':0.1}
iters['hvw']={'min':1,'max':2,'step':0.1}
iters['lvh']={'min':6,'max':7,'step':1}
def run_mp(self):
mps=[]
m=multiprocessing.Manager()
q=m.list()
cmain=self.iters['cwm']['min']
while(cmain<=self.iters['cwm']['max']):
t2=multiprocessing.Process(target=mp_main,args=(cmain,iters,q))
mps.append(t2)
t2.start()
cmain=cmain+self.iters['cwm']['step']
for mp in mps:
mp.join()
r1=sorted(q,key=lambda x:x['costing'])
returning=[r1[0],r1[1],r1[2],r1[3],r1[4],r1[5],r1[6],r1[7],r1[8],r1[9],r1[10],r1[11],r1[12],r1[13],r1[14],r1[15],r1[16],r1[17],r1[18],r1[19]]
self.counter=len(q)
return returning
def mp_main(cmain,iters,q):
fmain=iters['fx']['min']
while(fmain<=iters['fx']['max']):
lvtmain=iters['lvt']['min']
while (lvtmain<=iters['lvt']['max']):
lvwmain=iters['lvw']['min']
while (lvwmain<=iters['lvw']['max']):
lvkmain=iters['lvk']['min']
while (lvkmain<=iters['lvk']['max']):
hvwmain=iters['hvw']['min']
while (hvwmain<=iters['hvw']['max']):
lvhmain=iters['lvh']['min']
while (lvhmain<=iters['lvh']['max']):
test={'cmain':cmain,'fmain':fmain,'lvtmain':lvtmain,'lvwmain':lvwmain,'lvkmain':lvkmain,'hvwmain':hvwmain,'lvhmain':lvhmain}
y=calculations(test,q)
lvhmain=lvhmain+iters['lvh']['step']
hvwmain=hvwmain+iters['hvw']['step']
lvkmain=lvkmain+iters['lvk']['step']
lvwmain=lvwmain+iters['lvw']['step']
lvtmain=lvtmain+iters['lvt']['step']
fmain=fmain+iters['fx']['step']
def calculations(test,que):
#perform huge number of calculations here
output={}
output['data']=test
output['costing']='foo'
que.append(output)
x=Iters()
x.run_thread()
From a theoretical standpoint:
You're iterating every possible combination of 6 different variables. Unless your search space is very small, or you wanted just a very rough solution, there's no way you'll get any meaningful results within reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straighforward "shape" (it's injective), you can use a greedy algorithm such as hill climbing, or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are a lot more complex algorithms, but these are the basic, and will probably help you a lot in this situation.
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200. If this is what you want, it seems very feasible. In fact, if I comment the y=calculations(test,q), this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advise is to not use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible, compared to the gains you get by the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange

Python 3: Parallel diagonalization of multiple matrices

I am trying to improve the performance of some code of mine, that first constructs a 4x4 matrix depending on two indices, diagonalizes this matrix and then stores the eigenvectors of each diagonalization of each matrix in an 4-dimensional array. At the moment I am just going through all the indices serially and then store the eigenvectors in its place in the 4-dimensional array. Now, I am wondering if it is possible to parallelize this a little bit by using threading or something similar such that each thread would diagonalize one matrix and then store it in its place. The problem I have is, what are my limitations in doing this? Would I run into problems when different threads want to write into the resulting 4-dim. array at the same time and do I have to use a lock in order to prevent this? I am sorry if this question is trivial, but by searching I was not able to find anything related and my knowledge about threading is very limited. A minimal example would be
from numpy.linalg import eigh as eigh2
from scipy import *
spectrum = zeros([L//2,L//2,4,4],complex)
for i in range(0,L//2):
for j in range(0,L//2):
k = [-(2 * i*2*pi/L),-(2 * j*2*pi/L)]
H = ones([4,4],complex)
energies, states = eigh2(H)
spectrum[i,j,:,:] = states
Note that I have exchanged the function that constructs the matrix in dependence of k for some constant matrix for sake of brevity.
I would really appreciate any help or pointers to resources how I could implement some parallelizations. Is threading a realistic way of improving the performance?
The short answer is that yes, you probably need locks—but if you can reorganize your problem, that may be a lot better than locking.
The long answer is a bit involved, especially since I don't know how much you already know.
In general, threading doesn't do much good in CPython for CPU-bound code, because of the Global Interpreter Lock, which prevents any threads from interpreting a line (actually, bytecode) of Python if another thread is in the middle of doing so. However, NumPy has code that specifically releases the GIL in certain places to allow threading to work better, so if you're CPU-bound within low-level NumPy algorithms, threading actually can work. The docs are not always clear about which functions do this and which don't, so you may have to test it yourself just to find out if parallelizing will help here. (A quick&dirty way to do this is to hack up a version of your code that just does the computations without storing them anywhere, run it across N threads, and see how many cores are busy while you do it.)
Now, in general, in CPython, locks aren't necessary around certain kinds of operations, including __setitem__ on simple types—but that's because of that same GIL, so it isn't going to help you here. If you have multiple operations all trying to write to the same array, they will need a lock around that array.
But there may be a better way around this. If you can find a way to divide the array into smaller arrays, only one of which is being modified at any given time, you don't need any locks. Or, if you can have the threads return smaller arrays that can be assembled by a single master thread into the final answer, instead of working in-place in the first place, that also works.
But before you go doing that… in some cases, NumPy (or, rather, one of the libraries it's using) is already auto-parallelizing things for you, or could be if you built it differently. Or it could be SIMD-vectorizing things in a way that actually gives more speedup than threading, which you could end up breaking. And so on.
So, make sure you have a properly-optimized NumPy with all the optional prereqs installed before you try anything. Then make sure it's only using one core as-is. Then build a test scaffolding so you can compare different implementations. And then you can try out each lock-based, non-sharing, and non-mutating algorithm you can come up with to see if the parallelism helps more than the extra stuff hurts.

storing values between iterations (cache-like mechanism) in pyCUDA

Good morning all,
I am kind of newbie with cuda/pyCuda, so probably this will have an easy solution employing some mechanism that I don't know....
I am employing pycuda to operate over pairs of values: I subtract the smallest from the biggest and then perform some time-consuming operations. As it must be repeated many times, it is well suited for GPUs.
However, most of the times the result of the substraction is the same. Then, performing the time-consuming operations make no sense. what I do in the non-GPU version of my code is something like:
myFunction(A,B):
index=A-B
try:
value = myDictionary[index]
except:
value = expensiveOperation(index)
myDictionary[index] = value
return value
As accessing the dictionary is much faster than expensiveOperation, and the value is found most of the times, I obtain a significant time gain.
When porting this to GPUs, I can call to myFunction(A,B) with a high degree of parallelism, which is great. However, I don't know how could I employ this dictionary mechanism -or a similar one- to avoid redundant operations.
any thoughts on this?
Thanks for your help
edit: I would like to know, is it possible to store the dictionary on the GPU, or should I copy it every time? If it's on the GPU, can it be accessed/edited by several cores at the same time? How should I implement it?
You could try this:
myFunction(A,B):
index=A-B
if index in myDictionary.keys():
value = myDictionary[index]
else:
value = expensiveOperation(index)
myDictionary[index] = value
return value
It seems your question is about implementing some sort of memoise facility inside GPU code. I don't think this is worth pursuing. In the GPU arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary/hash table look-up in GPU memory to retrieve an arithmetic result from a cache is almost guaranteed to be slower that the cost of just calculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.
In an interpreted language like Python, which is relatively slow, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of a complete kernel function call also could yield useful performance benefits for expensive kernels. But memoisation inside CUDA doesn't seem all that useful.

Categories

Resources