Multiprocessing Pool implementation - python

I'm doing some intense computation and I'd like to speed up the process using all the computational power available (8 cores on my PC).
I'm doing some calculations on an electrical field and propagation using some functions. I'm calculating an entire XY slice (assuming propagation goes along Z) and retrieving the X and Y band (but I need to calculate the entire slice because of the fft2 implementation). And adding those bands to an array to create complete XZ and YZ slices.
So I have the function propagate that I want to execute in parallel and I thought about using a callback function that takes care of adding my band to the XZ and YZ arrays.
Since I don't know how bad it could be to pass an instance reference to the function that executes in parallel I preferred giving it all the parameters as simple as possible so I used pool.apply_async and iterated for all the values of z that I needed to propagate it to.
So my class looks like this:
class Calculator:
    def __init__(self, s:Simulation):
        # self.q = Queue()
        # self.manager = Manager()
        self.sim = s
    def start(self, U:Field, z_min, z_max, nbr_pnt):
        l = np.linspace(z_min, z_max, nbr_pnt)
        self.tot = nbr_pnt
        self.state = 0
        self.XZ_slice = np.zeros((self.sim.Res(), nbr_pnt))
        self.YZ_slice = np.zeros((self.sim.Res(), nbr_pnt))
        field = U.field()
        xx, yy = self.sim.xx_yy()
        k = self.sim.k()
        wavelength = self.sim.Wavelength()
        rho_sqr = xx*xx+yy*yy
        with Pool() as pool:
            for i, z in enumerate(l):
                pool.apply_async(self.propa_A, (i, z, field, rho_sqr, k, wavelength), callback=self.add_slice)
    def propa_A(self, i, z, F, rho_sqr, k, wavelength):
        prop_term = np.exp(1j*k/(2*z)*(rho_sqr))
        return (i,np.abs(1/(wavelength**2*z**2)*fft.fft2(F*prop_term))**2)
    def add_slice(self, result):
        i, F = result
        shape = F.shape
        self.XZ_slice[:, i] = F[:, shape//2]
        self.YZ_slice[:, i] = F[shape//2, :]
        self.state += 1
    def error(self, e):
        print(e)
    def get_XZslice(self):
        return self.XZ_slice
    def get_YZslice(self):
        return self.YZ_slice
And in my main function, I made a kind of progress bar in the terminal to track progress:
plt.figure()
print("[ ]0%", end='\r')
c = Calculator(w)
z_min = 0.2
z_max = 1
nbr_pnt = 10000
c.start(U_0, z_min, z_max, nbr_pnt)
counter = 0
while(counter < nbr_pnt):
    counter = c.state
    progression = counter/nbr_pnt
    s = "[" + "="*int(progression*20) +">"+" "*int((1-progression)*20) + "]" + str(int(progression*100)) + "%"
    print(s, end="\r")
    sleep(1)
print("")
plt.subplot(211)
plt.imshow(c.get_XZslice())
plt.subplot(212)
plt.imshow(c.get_YZslice())
plt.show()
Assume U_0 represents an N*N (Res * Res) complex array
When I execute this script, nothing happens: no progress and no load on the CPU. Maybe apply_async is not the right choice. Do you have any idea how to implement this better? This is my first time using multiprocessing, so please excuse my mistakes.
Thanks in advance.
EDIT
As suggested, I added an error_callback, but it is never called.
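One likely contributor, offered as a sketch rather than a confirmed diagnosis: leaving a `with Pool() as pool:` block calls `terminate()` on the pool, so tasks submitted with `apply_async` can be killed before they run unless the code waits inside the block. The example below uses `multiprocessing.pool.ThreadPool`, which has the same `apply_async`/`close`/`join` interface as `multiprocessing.Pool`, only so that it runs anywhere without pickling concerns. The `propagate` function and the `results` dictionary are hypothetical stand-ins for the real computation and the slice arrays.

```python
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool

def propagate(i, z):
    # Hypothetical stand-in for the real per-slice computation.
    return i, z * z

results = {}

def add_slice(result):
    # Callback: called in the submitting process as each task completes.
    i, value = result
    results[i] = value

z_values = [0.2, 0.4, 0.6, 0.8, 1.0]

with ThreadPool(4) as pool:
    for i, z in enumerate(z_values):
        pool.apply_async(propagate, (i, z), callback=add_slice, error_callback=print)
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for every task and callback before the block exits
```

With a real `multiprocessing.Pool`, the worker function should also be picklable (a module-level function is the simplest choice), and blocking in `join()` means the progress display would have to be driven from the callback instead of a polling loop.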

Related

Simulate a coupled ordinary differential equation

I want to write a program which turns a 2nd-order differential equation into two first-order ordinary differential equations, but I don't know how to do that in Python.
I am getting lots of errors, please help in writing the code correctly.
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt
import numpy as np
N = 30 # Number of coupled oscillators.
alpha=0.25
A = 1.0
# Initial positions.
y[0] = 0 # Fix the left-hand side at zero.
y[N+1] = 0 # Fix the right-hand side at zero.
# The range(1,N+1) command only prints out [1,2,3, ... N].
for p in range(1, N+1): # p is particle number.
    y[p] = A * np.sin(3 * p * np.pi /(N+1.0))
####################################################
# Initial velocities.
####################################################
v[0] = 0 # The left and right boundaries are
v[N+1] = 0 # clamped and don't move.
# This version sets them all the particle velocities to zero.
for p in range(1, N+1):
    v[p] = 0
w0 = [v[p], y[p]]
def accel(t,w):
    v[p], y[p] = w
    global a
    a[0] = 0.0
    a[N+1] = 0.0
    # This version loops explicitly over all the particles.
    for p in range(1,N+1):
        a[p] = [v[p], y(p+1)+y(p-1)-2*y(p)+ alpha * ((y[p+1] - y[p])**2 - (y[p] - y[p-1])**2)]
    return a
duration = 50
t = np.linspace(0, duration, 800)
abserr = 1.0e-8
relerr = 1.0e-6
solution = solve_ivp(accel, [0, duration], w0, method='RK45', t_eval=t,
                     vectorized=False, dense_output=True, args=(), atol=abserr, rtol=relerr)
Most general-purpose solvers do not handle structured state objects. They just work with a flat array as the representation of a point in state space. From the construction of the initial point, you seem to favor the state space ordering
[ v[0], v[1], ... v[N+1], y[0], y[1], ..., y[N+1] ]
This lets you simply split the state into its two halves and assemble the derivatives vector from the acceleration and velocity arrays.
Let's keep things simple and separate the functionality into small functions:
a = np.zeros(N+2)
def accel(y):
    global a ## initialized to the correct length with zero, avoids repeated allocation
    a[1:-1] = y[2:]+y[:-2]-2*y[1:-1] + alpha*((y[2:]-y[1:-1])**2-(y[1:-1]-y[:-2])**2)
    return a
def derivs(t,w):
    v,y = w[:N+2], w[N+2:]
    return np.concatenate([accel(y), v])
or keeping the theme of avoiding allocations
dwdt = np.zeros(2*N+4)
def derivs(t,w):
    global dwdt
    v,y = w[:N+2], w[N+2:]
    dwdt[:N+2] = accel(y)
    dwdt[N+2:] = v
    return dwdt
Now you only need to set
w0=np.concatenate([v,y])
to rapidly get to a more interesting class of errors.
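For reference, here is one way the pieces above might fit together end to end. It is a minimal sketch, not the asker's final program: the parameter values are taken from the question, the names `y0` and `v0` are introduced here for illustration, and `accel` allocates a fresh array on each call instead of reusing a global buffer, purely to keep the example short.

```python
import numpy as np
from scipy.integrate import solve_ivp

N = 30        # number of moving particles; indices 0 and N+1 are the clamped ends
alpha = 0.25
A = 1.0

# Initial state: sine-shaped displacement, zero velocity, ends fixed at zero.
p = np.arange(N + 2)
y0 = A * np.sin(3 * p * np.pi / (N + 1.0))
y0[0] = y0[-1] = 0.0
v0 = np.zeros(N + 2)
w0 = np.concatenate([v0, y0])   # flat state with ordering [v, y]

def accel(y):
    a = np.zeros_like(y)        # the end points keep zero acceleration
    a[1:-1] = (y[2:] + y[:-2] - 2 * y[1:-1]
               + alpha * ((y[2:] - y[1:-1])**2 - (y[1:-1] - y[:-2])**2))
    return a

def derivs(t, w):
    v, y = w[:N + 2], w[N + 2:]
    return np.concatenate([accel(y), v])   # d/dt [v, y] = [a, v]

duration = 50
t = np.linspace(0, duration, 800)
solution = solve_ivp(derivs, [0, duration], w0, method='RK45', t_eval=t,
                     atol=1.0e-8, rtol=1.0e-6)
y_of_t = solution.y[N + 2:]     # rows: particles, columns: time samples
```

Each row of `y_of_t` is the displacement history of one particle, so `plt.plot(solution.t, y_of_t[5])` would show particle 5 over time.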

How to implement parallel, delayed in such a way that the parallelized for loop stops when output goes below a threshold?

Suppose I have the following code:
from scipy import *
import multiprocessing as mp
num_cores = mp.cpu_count()
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
def func(x,y):
    return y/x
def main(y, xmin,xmax, dx):
    x = arange(xmin,xmax,dx)
    output = Parallel(n_jobs=num_cores)(delayed(func)(i, y) for i in x)
    return x, asarray(output)
def demo():
    x,z = main(2.,1.,30.,.1)
    plt.plot(x,z, label='All values')
    plt.plot(x[z>.1],z[z>.1], label='desired range') ## This is better to do in main()
    plt.show()
demo()
I want to calculate output only while output > a given number (it can be assumed that the elements of output decrease monotonically as x increases) and then stop (NOT calculate for all values of x and then filter, which is inefficient for my purpose). Is there any way to do that using Parallel, delayed or any other multiprocessing tool?
No threshold value was specified, so I made one up. After testing, I had to reverse the condition to output < a given number for it to work properly.
I would use a pool, launch the processes with a callback function that checks the stop condition, and then terminate the pool when ready. However, that causes a race condition: results from running processes that were not allowed to finish can be omitted. I think this method needs minimal modification to your code and is very easy to read. The order of the list is NOT guaranteed.
Pros: very little overhead
Cons: could have missing results.
Method 1)
from scipy import *
import multiprocessing
import matplotlib.pyplot as plt
def stop_condition_callback(ret):
    output.append(ret)
    if ret < stop_condition:
        worker_pool.terminate()
def func(x, y, ):
    return y / x
def main(y, xmin, xmax, dx):
    x = arange(xmin, xmax, dx)
    print("Number of calculations: %d" % (len(x)))
    # add calculations to the pool
    for i in x:
        worker_pool.apply_async(func, (i, y,), callback=stop_condition_callback)
    # wait for the pool to finish/terminate
    worker_pool.close()
    worker_pool.join()
    print("Number of results: %d" % (len(output)))
    return x, asarray(output)
def demo():
    x, z_list = main(2., 1., 30., .1)
    plt.plot(z_list, label='desired range')
    plt.show()
output = []
stop_condition = 0.1
worker_pool = multiprocessing.Pool()
demo()
This method has more overhead but will allow processes which have started to finish.
Method 2)
from scipy import *
import multiprocessing
import matplotlib.pyplot as plt
def stop_condition_callback(ret):
    if ret is not None:
        if ret < stop_condition:
            worker_stop.value = 1
        else:
            output.append(ret)
def func(x, y, ):
    if worker_stop.value != 0:
        return None
    return y / x
def main(y, xmin, xmax, dx):
    x = arange(xmin, xmax, dx)
    print("Number of calculations: %d" % (len(x)))
    # add calculations to the pool
    for i in x:
        worker_pool.apply_async(func, (i, y,), callback=stop_condition_callback)
    # wait for the pool to finish/terminate
    worker_pool.close()
    worker_pool.join()
    print("Number of results: %d" % (len(output)))
    return x, asarray(output)
def demo():
    x, z_list = main(2., 1., 30., .1)
    plt.plot(z_list, label='desired range')
    plt.show()
output = []
worker_stop = multiprocessing.Value('i', 0)
stop_condition = 0.1
worker_pool = multiprocessing.Pool()
demo()
Method 3)
Pros: No results will be left out
Cons: This steps way outside what you would normally do.
Take Method 1 and add:
def stopPoolButLetRunningTaskFinish(pool):
    # Prevent new tasks from starting by emptying the queue that all worker processes draw from
    while pool._task_handler.is_alive() and pool._inqueue._reader.poll():
        pool._inqueue._reader.recv()
    # Send sentinels to all worker processes
    for a in range(len(pool._pool)):
        pool._inqueue.put(None)
Then change stop_condition_callback
def stop_condition_callback(ret):
    # func from Method 1 returns a scalar, so compare ret directly
    if ret < stop_condition:
        #worker_pool.terminate()
        stopPoolButLetRunningTaskFinish(worker_pool)
    else:
        output.append(ret)
I would use Dask to execute in parallel, and specifically the futures interface, for real-time feedback on the results as they are completed. When done, you could cancel the remaining futures in flight, leave the unneeded ones to finish asynchronously, or close down the cluster.
from dask.distributed import Client, as_completed
from numpy import arange
client = Client() # defaults to ncores workers, one thread each
y, xmin, xmax, dx = 2.,1.,30.,.1
def func(x, y):
    return x, y/x
x = arange(xmin,xmax,dx)
outx = []
output = []
futs = [client.submit(func, val, y) for val in x]
for future in as_completed(futs):
    outs = future.result()
    outx.append(outs[0])
    output.append(outs[1])
    if outs[1] < 0.1:
        break
Notes:
- I assume you meant "less than", because otherwise the first value already passes (y / xmin > 0.1)
- the outputs are not guaranteed to be in the order you input them if you want to fetch results as they become ready, but with such a fast calculation, perhaps they always are (this is why I had the func return the input value too)
- if you stop computing, the output will be shorter than the full set of inputs, so I'm not quite sure what you want to print.
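If keeping the results in input order matters, another option is to exploit the monotonicity stated in the question and work in small batches, stopping after the first batch that crosses the threshold. The sketch below is only an illustration under that assumption. It uses the standard-library `concurrent.futures.ThreadPoolExecutor` so that it runs anywhere; for CPU-bound work `ProcessPoolExecutor` offers the same `map` interface. The function name `compute_until_below` and the `batch_size` parameter are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def func(x, y):
    return y / x

def compute_until_below(y, xs, threshold, batch_size=8, max_workers=4):
    """Evaluate func batch by batch and stop at the first value below threshold."""
    kept_x, kept_out = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(xs), batch_size):
            batch = xs[start:start + batch_size]
            # map() yields results in input order, so the output stays sorted by x.
            for x, out in zip(batch, pool.map(func, batch, [y] * len(batch))):
                if out < threshold:
                    return kept_x, kept_out
                kept_x.append(x)
                kept_out.append(out)
    return kept_x, kept_out

xs = [1.0 + 0.1 * k for k in range(290)]
kept_x, kept_out = compute_until_below(2.0, xs, threshold=0.1)
```

At most one batch of extra work is wasted, so `batch_size` trades parallel throughput against unnecessary evaluations.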

How to pass values to a publisher in ROS?

I am currently trying to write a Python ROS program that can be run as a ROS node (using rosrun) and that uses the functions defined in a separate Python file, arm.py (available at: https://github.com/nortega1/dvrk-ros/blob/44c8604b6c120e91f5357e7fd3649a8f7936c504/dvrk_python/src/dvrk/arm.py). The program first reads the current Cartesian position of the arm. Then, given a series of points that the arm must pass through, it computes an interpolating polynomial and evaluates it over a range of x values to find the corresponding y values.
Within the arm.py file there is a publisher set_position_cartesian_pub that sets the Cartesian position of the arm as follows:
self.__set_position_cartesian_pub = rospy.Publisher(self.__full_ros_namespace + '/set_position_cartesian', Pose, latch = True, queue_size = 1)
However, I am unsure how to pass the x and y values (I'll calculate the z values at a later date) to the publisher in the python program that I am creating. This is what I have written thus far:
#!/usr/bin/env python
import rospy
from tf import transformations
from tf_conversions import posemath
from std_msgs.msg import String, Bool, Float32, Empty, Float64MultiArray
from geometry_msgs.msg import Pose, PoseStamped, Vector3, Quaternion, Wrench, WrenchStamped, TwistStamped
def callback(data):
    rospy.loginfo(data.pose)
def currentPositionListener():
    rospy.init_node('currentPositionListener', anonymous=True)
    rospy.Subscriber('/dvrk/PSM1/position_cartesian_current', PoseStamped, callback)
    rospy.spin()
def lagrange(f, x):
    total = 0
    n = len(f)
    for i in range(n):
        xi, yi = f[i]
        def g(i, n):
            g_tot = 1
            for j in range(n):
                if i == j:
                    continue
                xj, yj = f[j]
                g_tot *= (x - xj) / float(xi - xj)
            return g_tot
        total += yi * g(i, n)
    return total
def newPositionListener():
    rospy.Subscriber('/dvrk/PSM1/set_position_cartesian', PoseStamped, trajectoryMover)
    rospy.spin()
def trajectoryMover(data):
    points =[(0,0),(45,30),(23,10), (48,0)]
    xlist = [i for i in range(100)]
    ylist = [lagrange(points, xlist[i]) for i in range(100)]
    for x, y in zip(xlist, ylist):
        data.pose.x = x
        data.pose.y = y
        data.pose.z = 0
if __name__ == '__main__':
    currentPositionListener()
    newPositionListener()
Any help would be greatly appreciated!
The problem is in trajectoryMover():
If you want to publish a Pose message, you should create a Pose message and fill in its position fields, as in the snippet below:
from geometry_msgs.msg import Pose
def trajectoryMover(data):
    pose = Pose()
    # Topic name taken from the question; arm.py builds it from the arm's namespace.
    set_position_cartesian_pub = rospy.Publisher(
        '/dvrk/PSM1/set_position_cartesian',
        Pose, latch = True, queue_size = 1
    )
    points =[(0,0),(45,30),(23,10), (48,0)]
    xlist = [i for i in range(100)]
    ylist = [lagrange(points, xlist[i]) for i in range(100)]
    for x, y in zip(xlist, ylist):
        # Pose keeps its coordinates in the nested position field.
        pose.position.x = x
        pose.position.y = y
        pose.position.z = 0
        set_position_cartesian_pub.publish(pose)
[NOTE]:
A single rospy.spin() call is enough for your code.
I couldn't work out what you need the data argument for.

Graphing Scipy optimize.minimize convergence results each iteration?

I would like to perform some tests on my optimisation routine using scipy.optimize.minimize, in particular graphing the convergence (or rather the objective function) at each iteration, over multiple tests.
Suppose I have the following linearly constrained quadratic optimisation problem:
minimise: x_i Q_ij x_j + a|x_i|
subject to: sum(x_i) = 1
I can code this as:
def _fun(x, Q, a):
    c = np.einsum('i,ij,j->', x, Q, x)
    p = np.sum(a * np.abs(x))
    return c + p
def _constr(x):
    return np.sum(x) - 1
And I will implement the optimisation in scipy as:
x_0 = # some initial vector
x_soln = scipy.optimize.minimize(_fun, x_0, args=(Q, a), method='SLSQP',
                                 constraints={'type': 'eq', 'fun': _constr})
I see that there is a callback argument, but it only accepts a single argument: the parameter values at each iteration. How can I use it in a more esoteric case where I need to supply other arguments to my callback function?
The way I solved this was to use a generic callback cache object, referenced each time from my callback function. Let's say you want to do 20 tests and plot the objective function after each iteration in the same chart. You will need an outer loop to run 20 tests, but we'll create that later.
First, let's create a class that stores all the per-iteration objective function values for us, plus a couple of extra bits and pieces:
class OpObj(object):
    def __init__(self, Q, a):
        self.Q, self.a = Q, a
        rv = np.random.rand()
        self.x_0 = np.array([rv, (1-rv)/2, (1-rv)/2])
        self.f = np.full(shape=(500,), fill_value=np.nan)
        self.count = 0
    def _fun(self, x):
        return _fun(x, self.Q, self.a)
Also, let's add a callback function that manipulates that class object. Don't worry that it has more than one argument for now, since we'll fix this later. Just make sure the first parameter is the solution variables.
def cb(xk, obj=None):
    obj.f[obj.count] = obj._fun(xk)
    obj.count += 1
All this does is use the object's functions and values to update itself, counting the number of iterations each time. This function will be called after each iteration.
Putting this all together, all we need is two more things: 1) some matplotlib-ing to do the plot, and 2) fixing the callback so that it takes only one argument. We can do the latter with functools.partial, which pre-binds some arguments and returns a function that takes fewer arguments than the original. So the final code looks like this:
import matplotlib.pyplot as plt
import scipy.optimize as op
import numpy as np
from functools import partial
Q = np.array([[1.0, 0.75, 0.45], [0.75, 1.0, 0.60], [0.45, 0.60, 1.0]])
a = 1.0
def _fun(x, Q, a):
    c = np.einsum('i,ij,j->', x, Q, x)
    p = np.sum(a * np.abs(x))
    return c + p
def _constr(x):
    return np.sum(x) - 1
class OpObj(object):
    def __init__(self, Q, a):
        self.Q, self.a = Q, a
        rv = np.random.rand()
        self.x_0 = np.array([rv, (1-rv)/2, (1-rv)/2])
        self.f = np.full(shape=(500,), fill_value=np.nan)
        self.count = 0
    def _fun(self, x):
        return _fun(x, self.Q, self.a)
def cb(xk, obj=None):
    obj.f[obj.count] = obj._fun(xk)
    obj.count += 1
fig, ax = plt.subplots(1,1)
x = np.linspace(1,500,500)
for test in range(20):
    op_obj = OpObj(Q, a)
    x_soln = op.minimize(_fun, op_obj.x_0, args=(Q, a), method='SLSQP',
                         constraints={'type': 'eq', 'fun': _constr},
                         callback=partial(cb, obj=op_obj))
    ax.plot(x, op_obj.f)
ax.set_ylim((1.71,1.76))
plt.show()
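The argument-binding step can also be seen in isolation, without SciPy. This is a minimal sketch: the `log` parameter and the loop that stands in for the optimiser are hypothetical and only serve to show that the bound callback is callable with a single argument.

```python
from functools import partial

def cb(xk, log=None):
    # Two-argument callback: the second argument carries the extra state.
    log.append(sum(xk))

history = []
single_arg_cb = partial(cb, log=history)   # now callable as single_arg_cb(xk)

# Hypothetical stand-in for an optimiser that calls callback(xk) each iteration.
for xk in ([1.0, 2.0], [0.5, 0.5]):
    single_arg_cb(xk)
```

After the loop, `history` holds one entry per simulated iteration, which is the same pattern the `OpObj` cache relies on.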

How to get current working grid index in cuda python

Hi, I'm trying to understand each step of a CUDA kernel. It would be nice to get all the grid indexes that are occupied by data. My code adds two vectors and is written in Python with Numba.
n = 10
x = np.arange(n).astype(np.float32)
y = x + 1
Set up the number of threads and blocks in the grid:
threads_per_block = 8
blocks_per_grid = 2
Kernel
def kernel_manual_add(x, y, out):
    threads_number = cuda.blockDim.x
    block_number = cuda.gridDim.x
    thread_index = cuda.threadIdx.x
    block_index = cuda.blockIdx.x
    grid_index = thread_index + block_index * threads_number
    threads_range = threads_number * block_number
    for i in range(grid_index, x.shape[0], threads_range):
        out[i] = x[i] + y[i]
Initialize kernel:
kernel_manual_add[blocks_per_grid, threads_per_block](x, y, out)
When I try to print grid_index, I get the indexes of all 2*8 = 16 launched threads.
How can I get only the grid indexes (10 of them) that are actually used to compute data?
The canonical way to write your kernel would be something like this
@cuda.jit
def kernel_manual_add(x, y, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = x[i] + y[i]
You must run at least as many threads as there are elements in the input arrays. There is no magic here, you need to calculate grid and block dimensions manually before calling the kernel. See here and here for suggestions.
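The launch-size arithmetic, and which of the launched threads actually pass the bounds check, can be worked out in plain Python with no GPU. This is only an illustration with the numbers from the question; the list names `launched` and `active` are introduced here.

```python
import math

n = 10
threads_per_block = 8
blocks_per_grid = math.ceil(n / threads_per_block)   # 2 blocks for 10 elements

# Global index of every launched thread, as cuda.grid(1) would report it.
launched = [block * threads_per_block + thread
            for block in range(blocks_per_grid)
            for thread in range(threads_per_block)]

# Only threads whose index passes the bounds check touch data.
active = [i for i in launched if i < n]
```

Here 16 threads are launched, and the ones with indexes 0 to 9 do the work, while the remaining six fail the `i < x.shape[0]` test and return immediately.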
