I am using Python's multiprocessing module to launch some Monte Carlo simulations in order to speed up the computation. The code I have looks like this:
def main():
    # (various parameters are being set up...)
    start = 0
    end = 10
    count = int(1e4)
    time = np.linspace(start, end, num=count)

    num_procs = 12
    mc_iterations_per_proc = int(1e5)
    mc_iterations = num_procs * mc_iterations_per_proc

    mean_estimate, mean_estimate_variance = np.zeros(count), np.zeros(count)
    pool = multiprocessing.Pool(num_procs)
    for index, (estimate, estimate_variance) in enumerate(pool.imap_unordered(
            mc_linear_estimate,
            ((disorder_mean, intensity, wiener_std, time) for index in xrange(mc_iterations)),
            chunksize=mc_iterations_per_proc)):
        delta = estimate - mean_estimate
        mean_estimate = mean_estimate + delta / float(index + 1)
        mean_estimate_variance = mean_estimate_variance + delta * (estimate - mean_estimate)

    mean_estimate_variance = mean_estimate_variance / float(index)
OK, now mc_linear_estimate is a function that takes *args and creates additional variables inside it. It looks like this:
def mc_linear_estimate(*args):
    disorder_mean, intensity, wiener_std, time = args[0]

    theta_process = source_process(time, intensity, disorder_mean)
    xi_process = observed_process(time, theta_process, wiener_std)
    gamma = error_estimate(time, intensity, wiener_std, disorder_mean)
    estimate = signal_estimate(time, intensity, wiener_std, disorder_mean, gamma, xi_process)

    estimate_variance = (estimate - theta_process) ** 2
    return estimate, estimate_variance
As you can see, the number of iterations is pretty large (1.2M) and each array holds 10K doubles, so I use Welford's algorithm to compute the mean and variance, because it does not require storing every element of the sequence in memory. However, this does not help.
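(For reference, Welford's update for a single scalar stream looks roughly like the sketch below; the loop above applies the same update elementwise to whole NumPy arrays.)

def welford(samples):
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations from the mean
    n = 0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1)       # sample variance; use m2 / n for the population variance
    return mean, variance

print(welford([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.666...)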
The problem: I run out of memory. When I launch the application, 12 processes appear (as seen with the top program on my Linux machine). They instantly start consuming a lot of memory, but since the machine has 49G of RAM, things are OK for a while. Then, as each process grows to around 4G of RAM, one of them fails and shows as <defunct> in top. Then another process drops off, and this continues until only one process is left, which ultimately fails with an "Out of memory" exception.
The questions:
What am I possibly doing wrong?
How could I improve the code so that it wouldn't consume all the memory?
Related
I got my code for simulating a multivariate regression model to work using differential evolution, and even got the multiprocessing option to help reduce runtime. However, with 7 independent variables of 10 values each, and matrix operations on 21 matrices of 100+ elements, it takes quite a while to run even on 24 cores.
I don't have much experience with multiprocessing or with PyOpenCL, so I wanted to ask if it's worth getting into and trying to integrate the two to work on the GPU. I've attached a code snippet with 3 variables and 3 values for reference:
import scipy.optimize as op
import numpy as np

def func(vars, *args):
    res = []
    x = []
    for i in args[1:]:
        if len(res) + 1 > len(args)//2:
            x.append(i)
            continue
        res.append(np.array(i).T)
    f1 = 0
    for i in range(len(x[0])):
        for j in range(len(x[1])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[2]*x[1][j]*x[1][j] + vars[3]*x[1][j] + vars[4])*(vars[5]*50*50 + vars[6]*50 + vars[7])
            f1 = f1 + abs(res[0][i][j] - diff)  # ID-Pitch
    f2 = 0
    for i in range(len(x[0])):
        for j in range(len(x[2])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[2]*10*10 + vars[3]*10 + vars[4])
            f2 = f2 + abs(res[1][i][j] - diff)  # ID-Depth
    f3 = 0
    for i in range(len(x[1])):
        for j in range(len(x[2])):
            diff = (vars[2]*x[1][i]*x[1][i] + vars[3]*x[1][i] + vars[4])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[0]*3.860424005 + vars[1])
            f3 = f3 + abs(res[2][i][j] - diff)  # Pitch-Depth
    return f1 + f2 + f3

def main():
    res1 = [[134.3213274,104.8030828,75.28483813],[151.3351445,118.07797,84.82079556],[135.8343927,105.9836392,76.1328857]]
    res2 = [[131.0645086,109.1574174,91.1952225],[54.74920444,30.31300092,17.36537062],[51.8931954,26.45139822,17.28693162]]
    res3 = [[131.0645086,141.2210331,133.3192429],[54.74920444,61.75898314,56.52756593],[51.8931954,52.8191817,52.66531712]]
    x1 = np.array([3.860424005,7.72084801,11.58127201])
    x2 = np.array([10,20,30])
    x3 = np.array([50,300,500])
    interval = (-20,20)
    bds = [interval,interval,interval,interval,interval,interval,interval,interval]
    res = op.differential_evolution(func, bounds=bds, workers=-1, maxiter=100000, tol=0.01, popsize=15, args=([1,2,2], res1, res2, res3, x1, x2, x3))
    print(res)

if __name__ == '__main__':
    main()
Firstly, yes, it's possible: func can be a function that sends the data to the GPU, waits for the computation to finish, transfers the result back to RAM, and returns it to scipy.
Moving computations from the CPU to the GPU is not always beneficial, because of the time required to transfer data back and forth. With a moderate laptop GPU, for example, you won't get any speedup at all, and your code might even be slower. Reducing data transfer between the GPU and RAM can make the GPU 2-4 times faster than an average CPU, but your code requires data transfer, so that won't be possible.
For powerful GPUs with high bandwidth (an RTX 2070 or RTX 3070, or an APU), you can expect the GPU computation to be a few times faster than the CPU even with the data transfer, but it depends on the implementation of both the CPU and the GPU code.
Lastly, your code can be sped up without a GPU, which is likely the first thing you should try before going for GPU computations, mainly by using code compilers like Cython and Numba. They can speed up your code by almost 100 times with little effort and without major modifications, but you should convert your code to use only fixed-size, preallocated numpy arrays instead of lists; that alone makes the code much faster, you can even release the GIL and make your code multithreaded, and both offer good multithreaded looping constructs.
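(Not part of the original answer, just an illustrative sketch of the Numba route: a nested loop over preallocated NumPy arrays can be compiled and spread over threads with @njit(parallel=True) and prange. The arrays and coefficients below are made-up placeholders, not the question's data.)

import numpy as np
from numba import njit, prange

@njit(parallel=True)   # compile to machine code; prange distributes the outer loop over threads
def abs_residual_sum(xs, ys, res, c0, c1):
    total = 0.0
    for i in prange(xs.shape[0]):
        for j in range(ys.shape[0]):
            pred = (c0 * xs[i] + c1) * ys[j]
            total += abs(res[i, j] - pred)
    return total

xs = np.linspace(1.0, 3.0, 10)
ys = np.linspace(10.0, 30.0, 10)
res = np.random.rand(10, 10)
print(abs_residual_sum(xs, ys, res, 0.5, 1.0))   # first call includes compilation time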
I am trying to build a lumped rainfall-runoff balance model with a lot of parameters (from 37 to 1099) in Python. As input it receives daily rainfall and temperature data, and it produces daily flows as output.
I am stuck on the optimisation method for the model's calibration. I chose the differential evolution algorithm because it is easy to use and can be applied to this kind of problem. The algorithm I wrote works well and seems to minimise the objective function (the Nash-Sutcliffe model efficiency, NSE). The problem starts with a higher number of parameters, which significantly slows down the whole algorithm.
The DE algorithm I wrote:
import numpy as np
import flow  # a python file from which I get observed daily flows as an np.array

def differential_evolution(func, bounds, popsize=10, mutate=0.8, CR=0.85, maxiter=50):
    #--- INITIALIZE THE FIRST POPULATION WITHIN THE BOUNDS -------------------+
    bounds = [(0, 250)] * 1 + [(0, 5)] * 366 + [(0, 2)] * 366 + [(0, 100)] * 366
    dim = len(bounds)
    pop_norm = np.random.rand(popsize, dim)
    min_bound, max_bound = np.asarray(bounds).T
    difference = np.fabs(min_bound - max_bound)
    population = min_bound + pop_norm * difference

    # Computed value of objective function for initial population
    fitness = np.asarray([func(x, flow.l_flow) for x in population])
    best_idx = np.argmin(fitness)
    best = population[best_idx]

    #--- MUTATION -------------------------------------------------------------+
    # This is the part which takes too much time to complete
    for i in range(maxiter):
        print('Generation: ', i)
        for j in range(popsize):
            # Random selection of three individuals to make a noise vector
            idxs = list(range(0, popsize))
            idxs.remove(j)
            x_1, x_2, x_3 = pop_norm[np.random.choice(idxs, 3, replace=True)]
            noise_vector = np.clip(x_1 + mutate * (x_2 - x_3), 0, 1)

            #--- RECOMBINATION --------------------------------------------------+
            cross_points = np.random.rand(dim) < CR
            if not np.any(cross_points):
                cross_points[np.random.randint(0, dim)] = True
            trial_vector_norm = np.where(cross_points, noise_vector, pop_norm[j])
            trial_vector = min_bound + trial_vector_norm * difference
            crit = func(trial_vector, flow.l_flow)

            # Check for better fitness of objective function
            if crit < fitness[j]:
                fitness[j] = crit
                pop_norm[j] = trial_vector_norm
                if crit < fitness[best_idx]:
                    best_idx = j
                    best = trial_vector
    return best, fitness[best_idx]
The rainfall-runoff model itself is a function that works basically on lists; via a for loop it iterates over each row to compute the daily flows with a simple equation.
The objective function NSE is vectorised by numpy arrays:
import numpy as np
import model  # a python file where the rainfall-runoff model function is defined

def nse_min(parameters, observations):
    # Modeled flows from the model function
    Q_modeled = np.array(model.model(parameters))
    # Computation of the NSE fraction
    numerator = np.subtract(observations, Q_modeled) ** 2
    denominator = np.subtract(observations, np.sum(observations)/len(observations)) ** 2
    return np.sum(numerator) / np.sum(denominator)
Is there any chance to speed it up? I found out about the numba library, which "compiles python code to machine code" and then lets you compute more efficiently on the CPU, or on the GPU using CUDA cores. But I do not study anything related to IT and have no idea how the CPU/GPU works, therefore I do not know how to use numba properly. Can anyone help me with it? Or can anyone suggest a different optimisation method?
What I use:
Python 3.7.0 64-bit,
Windows 10 Home x64,
Intel Core(TM) i7-7700HQ CPU @ 2.80 GHz,
NVIDIA GeForce GTX 1050 Ti 4GB GDDR5,
16 GB RAM DDR4.
I am a Python beginner who studies water management and sometimes uses Python just for scripts that make my life easier in data processing. Thank you in advance for your help.
You can use the Python standard-library module multiprocessing. It simply spawns more processes to run your function.
You can use it like this:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
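(As a rough sketch of how this might apply to the DE loop above: the trial vectors of one generation could be evaluated in parallel with a multiprocessing.Pool. The helper build_trial_vectors_for_generation is hypothetical and stands in for the mutation/recombination step; nse_min and flow are the objective function and data from the question.)

from multiprocessing import Pool
import flow                      # as in the question: provides the observed daily flows
# nse_min is the objective function defined in the question

def evaluate(trial_vector):
    # one objective-function evaluation per worker process
    return nse_min(trial_vector, flow.l_flow)

if __name__ == '__main__':
    trial_vectors = build_trial_vectors_for_generation()   # hypothetical helper for one generation
    with Pool() as pool:
        fitness_values = pool.map(evaluate, trial_vectors)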
I want to repeat a task several times during an interval of time, and I want to do it uniformly: if I want to run the task 4 times in 1 second, it should be executed at t = 0, 0.25, 0.5 and 0.75.
So now I am doing:
import time
import socket

s = socket.socket(...)  # not important

time_step = 1./num_times_executed
for _ in range(num_times_executed):
    now = time.time()
    s.sendto(...)  # action I do
    time.sleep(max(0, time_step - (time.time() - now)))
However, there is a lot of overhead: the bigger the loop is, the more drift I get. For example, with num_times_executed = 800 it takes 1.1 seconds, so it is ~10% off...
Is there a way to do this with good precision?
time_step = 1./num_times_executed
start = time.time()
for i in range(num_times_executed):
    s.sendto(...)  # action I do
    next_send_time = start + (i+1) * time_step
    time.sleep(max(0, next_send_time - time.time()))
Now you won't get any drift, because each send time is a fixed offset from a single start time. Previously, the small amount of work happening before now = time.time() was set caused a tiny drift on every iteration; now, as long as time_step is long enough to execute the s.sendto(...) call, you shouldn't accumulate any drift.
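(For completeness, a small self-contained variant of the same idea, assuming the task is just a print; time.monotonic() is used so that system clock adjustments during the run don't matter.)

import time

def run_uniformly(task, num_times, interval=1.0):
    """Run task() num_times, spaced evenly across interval seconds."""
    time_step = interval / num_times
    start = time.monotonic()
    for i in range(num_times):
        task()
        next_time = start + (i + 1) * time_step
        time.sleep(max(0.0, next_time - time.monotonic()))

run_uniformly(lambda: print("tick"), 4, interval=1.0)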
I want to speed up an embarrassingly parallel problem related to Bayesian inference. The aim is to infer the coefficients U for a set of images X, given a matrix A, such that X = A*U.
X has dimensions m x n, A is m x p and U is p x n. For each column of X, one has to infer the optimal corresponding column of the coefficients U. In the end, this information is used to update A. I use m = 3000, p = 1500 and n = 100.
So, as it is a linear model, the inference of the coefficient matrix U consists of n independent calculations. Thus, I tried to work with the multiprocessing module of Python, but there is no speedup.
Here is what I did:
The main structure, without parallelization, is:
import numpy as np
from convex import Crwlasso_cd

S = np.empty((m, batch_size))

for t in xrange(start_iter, niter):

    ## Begin Warm Start ##
    # Take 5 gradient steps w/ this batch using last coef. to warm start inf.
    for ws in range(5):
        # Initialize the coefficients
        if ws:
            theta = U
        else:
            theta = np.dot(A.T, X)

        # Infer the Coefficients for the given data batch X of size mxn (n=batch_size)
        # Crwlasso_cd is the function that does the inference per data sample
        # It's basically a C-inline code
        for k in range(batch_size):
            U[:,k] = Crwlasso_cd(X[:, k].copy(), A, theta=theta[:,k].copy())

        # Given the inferred coefficients, update and renormalize
        # the basis functions A
        dA1 = np.dot(X - np.dot(A, U), U.T)  # Gaussian data likelihood
        A += (eta / batch_size) * dA1
        A = np.dot(A, np.diag(1/np.sqrt((A**2).sum(axis=0))))
Implementation of multiprocessing:
I tried to implement multiprocessing. I have an 8-core machine that I can use.
There are 3 for-loops. The only one that seems to be "parallelizable" is the third one, where the coefficients are inferred:
Generate a Queue and stack the iteration-numbers from 0 to batch_size-1 into the Queue
Generate 8 processes, and let them work through the Queue
Share the data U using multiprocessing.Array
So, I replaced this third loop with the following:
from multiprocessing import Process, Queue
import multiprocessing as mp
from Queue import Empty

num_cpu = mp.cpu_count()
work_queue = Queue()

# Generate the empty ndarray U and a multiprocessing.Array-Wrapper U_mp around U
# The class Wrap_mp is attached. Basically, U_mp.asarray() gives the corresponding
# ndarray
U = np.empty((p, batch_size))
U_mp = Wrap_mp(U)

...

# Within the for-loops:
for p in xrange(batch_size):
    work_queue.put(p)

processes = [Process(target=infer_coefficients_mp, args=(work_queue, U_mp, A, X)) for p in range(num_cpu)]
for p in processes:
    p.start()
    print p.pid
for p in processes:
    p.join()
Here is the class Wrap_mp:
class Wrap_mp(object):
    """ Wrapper around multiprocessing.Array to share an array across
        processes. Store the array as a multiprocessing.Array, but compute with it
        as a numpy.ndarray
    """
    def __init__(self, arr):
        """ Initialize a shared array from a numpy array.
            The data is copied.
        """
        self.data = ndarray_to_shmem(arr)
        self.dtype = arr.dtype
        self.shape = arr.shape

    def __array__(self):
        """ Implement the array protocol.
        """
        arr = shmem_as_ndarray(self.data, dtype=self.dtype)
        arr.shape = self.shape
        return arr

    def asarray(self):
        return self.__array__()
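(The helpers ndarray_to_shmem and shmem_as_ndarray are not shown in the question; minimal stand-ins based on multiprocessing.RawArray and np.frombuffer could look like this sketch, which assumes float64 data.)

import multiprocessing as mp
import numpy as np

def ndarray_to_shmem(arr):
    """Copy a numpy array into a raw shared-memory buffer (no locking)."""
    shared = mp.RawArray('d', arr.size)       # 'd' = C double, i.e. float64
    np.frombuffer(shared)[:] = arr.ravel()
    return shared

def shmem_as_ndarray(shared, dtype=np.float64):
    """View the shared buffer as a flat numpy array without copying."""
    return np.frombuffer(shared, dtype=dtype)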
And here is the function infer_coefficients_mp:
def infer_coefficients_mp(work_queue, U_mp, A, X):
    while True:
        try:
            index = work_queue.get(block=False)
            x = X[:, index]
            U = U_mp.asarray()
            theta = np.dot(phit, x)
            # Infer the coefficients of the column index
            U[:, index] = Crwlasso_cd(x.copy(), A, theta=theta.copy())
        except Empty:
            break
The problems now are the following:
The multiprocessing version is not faster than the single thread version for the given dimensions of the data.
The process IDs increase with every iteration. Does this mean a new process is constantly being created? Doesn't this generate huge overhead? How can I avoid that? Is there a way to create 8 different processes once, before the whole for-loop, and just feed them the data?
Does the way I share the coefficients U amongst the processes slow the calculation down? Is there another, better way of doing this?
Would a Pool of processes be better?
I am really thankful for any sort of help! I started working with Python a month ago and am pretty lost now.
Engin
Every time you create a Process you are creating a new process. If you're doing that within your for loop, then yes, you are starting new processes every time through the loop. It sounds like what you want to do is initialize your Queue and Processes outside of the loop, then fill the Queue inside the loop.
I've used multiprocessing.Pool before, and it's useful, but it doesn't offer much over what you've already implemented with a Queue.
Eventually, this all boils down to one question: Is it possible to start processes outside of the main for-loop, and for every iteration, feed the updated variables in them, have them processing the data, and collecting the newly calculated data from all of the processes, without having to start new processes every iteration?
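(A minimal sketch of that pattern, along the lines of the answer above: start the workers once, refill a task queue on every outer iteration, and shut the workers down with sentinel values at the very end. The squaring below is only a placeholder for the per-column inference.)

from multiprocessing import Process, Queue

SENTINEL = None

def worker(task_queue, result_queue):
    # keep running until a sentinel arrives
    for item in iter(task_queue.get, SENTINEL):
        index, value = item
        result_queue.put((index, value ** 2))   # placeholder computation

if __name__ == '__main__':
    task_queue, result_queue = Queue(), Queue()
    workers = [Process(target=worker, args=(task_queue, result_queue)) for _ in range(8)]
    for w in workers:
        w.start()

    for iteration in range(5):                   # outer loop: workers stay alive across iterations
        n_tasks = 100
        for index in range(n_tasks):
            task_queue.put((index, float(index)))
        results = [result_queue.get() for _ in range(n_tasks)]   # collect this iteration's output

    for _ in workers:
        task_queue.put(SENTINEL)                 # one sentinel per worker shuts them down
    for w in workers:
        w.join()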
I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time

def reshape_vector(v):
    b = np.empty((3,1))
    for i in range(3):
        b[i][0] = v[i]
    return b

def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B

N = 20000                         # number of dipoles
r_i = np.random.random((3,N))     # positions of dipoles
mom_i = np.random.random((3,N))   # moments of dipoles
a = np.random.random((3,3))       # three basis vectors for this crystal
n = [10,10,10]                    # points at which to evaluate sum
gamma_mu = 135.5                  # a constant

t_start = time.clock()
for i in range(n[0]):
    r_frac_x = np.float(i)/np.float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = np.float(j)/np.float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = np.float(k)/np.float(n[2])
            r_test = r_test_x + r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B,B))
            # write r_test, B and omega to a file
    frac_done = np.float(i+1)/(n[0]+1)
    t_elapsed = (time.clock()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining'
One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
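(A tiny illustration of that distinction, not taken from the original answer: the first loop below has no loop-carried dependency and collapses to a single whole-array expression, while the second needs the previous iteration's result.)

import numpy as np

x = np.random.random(1000)

# Independent iterations: vectorizes to one whole-array expression
y = np.empty_like(x)
for i in range(len(x)):
    y[i] = 3.0 * x[i] ** 2          # same as y = 3.0 * x**2

# Loop-carried dependency: z[i] needs z[i-1], so a naive elementwise rewrite fails
z = np.empty_like(x)
z[0] = x[0]
for i in range(1, len(x)):
    z[i] = 0.5 * z[i-1] + x[i]      # needs a dedicated routine (e.g. scipy.signal.lfilter)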
EDIT: One solution
I haven't verified that this is correct, but it should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        num = A*(3*np.sum(mom_i * r_unit, 0)*r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega

# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T

t_start = time.clock()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print 'Total time for vectorized: %f s' % (time.clock() - t_start)
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.
If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.
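(As an alternative to hand-written Cython, and purely as a sketch rather than the answerer's code: numba can compile an explicit-loop version of calculate_dipole, keeping the (3, N) array layout from the question. The first call includes compilation time.)

import numpy as np
from numba import njit

@njit
def calculate_dipole_numba(mu, r_i, mom_i):
    # mu: (3,) test position; r_i and mom_i: (3, N); returns the summed field B as (3,)
    N = r_i.shape[1]
    A = 1e-7
    B = np.zeros(3)
    for k in range(N):
        rx = mu[0] - r_i[0, k]
        ry = mu[1] - r_i[1, k]
        rz = mu[2] - r_i[2, k]
        r2 = rx*rx + ry*ry + rz*rz
        r = np.sqrt(r2)
        ux, uy, uz = rx/r, ry/r, rz/r
        dot = mom_i[0, k]*ux + mom_i[1, k]*uy + mom_i[2, k]*uz
        den = r2 * r    # |relative|**3
        B[0] += A * (3.0*dot*ux - mom_i[0, k]) / den
        B[1] += A * (3.0*dot*uy - mom_i[1, k]) / den
        B[2] += A * (3.0*dot*uz - mom_i[2, k]) / den
    return B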