Python Multiprocessing: run time increases with number of processes

I am trying to run a multiprocessing task on all 4 cores of my processor. When I run just one of the four tasks I actually want to run, the code takes about 3 seconds per 1000 iterations. However, if I set it up to run all 4 tasks, the run time roughly quadruples to 13 seconds per 1000 iterations. I'm attaching part of my code below and I'm not sure why this is happening. I've done my own investigation and it doesn't seem to be a memory or CPU issue: monitoring a run with one task, one of the processors is active at 100% and only 0.8% of the memory is in use; with 4 tasks, all 4 processors are active at 100%, each with 0.8% of memory used.
I've used the multiprocessing module many times before and have never noticed a run-time increase. Anyway, here is my messy code:
import numpy as np
import multiprocessing
from astropy.io import fits
import os
import time

def SmoothAllSpectra(ROOT, range_lo, range_hi):
    file_names = os.listdir(ROOT)
    for i in range(len(file_names)):
        file_names[i] = ROOT + file_names[i]
    pool = multiprocessing.Pool(4)
    for filename in pool.map(work, file_names[range_lo:range_hi]):
        print filename + ' complete'
    return ROOT

def work(filename):
    data = ImportSpectrum(filename)   ## data is a 2d numpy array
    smooth_data = SmoothSpectrum(data, 6900, h0=1)
    return filename

def SmoothSpectrum(data, max_wl, h0=1):
    wl_data = data[0]
    count_data = data[1]
    hi_idx = np.argmin(np.abs(wl_data - max_wl))
    smoothed = np.empty((2, len(wl_data[:hi_idx])))
    smoothed[0] = wl_data[:hi_idx]
    temp = np.exp(-.5 * np.power(smoothed[0, int(len(smoothed[0])/2.)] - smoothed[0], 2) / h0**2)
    idx = np.where(temp > 1e-10)[0]
    window = np.ceil(len(idx)/2.)
    numer = np.zeros(len(smoothed[0]))
    denom = np.zeros(len(smoothed[0]))
    for i in range(len(numer)):
        K1 = np.zeros(len(numer))
        if (i-window >= 0) and (i+window <= len(smoothed[0])):
            K1[i-window:i+window] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,i-window:i+window], 2) / h0**2)
        elif i-window < 0:
            K1[:i+window] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,:i+window], 2) / h0**2)
        else:
            K1[i-window:] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,i-window:], 2) / h0**2)
        numer += count_data[i] * K1
        denom += K1
    smoothed[1] = numer / denom
    return smoothed
There's nothing fancy going on in the code, just smoothing some data. I'm sure there are loads of ways to optimize it, but I'm interested in why I'm seeing an increase in computation time when going from 1 task to 4 tasks.
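As a rough way to narrow this down, one can time each smoothing call inside the worker itself, so that pool overhead is excluded. The timed_work wrapper below is only a sketch and is not part of the original script; ImportSpectrum and SmoothSpectrum are assumed to be the functions referenced above:

import time

def timed_work(filename):
    # Same as work(), but reports how long one smoothing call takes.
    t0 = time.time()
    data = ImportSpectrum(filename)                # 2d numpy array, as above
    smooth_data = SmoothSpectrum(data, 6900, h0=1)
    print('%s took %.3f s' % (filename, time.time() - t0))
    return filename

# then map the pool over timed_work instead of work:
# pool.map(timed_work, file_names[range_lo:range_hi])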
Cheers!

Related

emcee MPI issue while sharing an executable file

I want to run a Python script that uses the emcee library with MPI. I ran into some issues running it on the server, so first, to verify that MPI works well with the emcee library and the server's nodes, I created this MWE:
import sys
import time

import emcee
import numpy as np
from schwimmbad import MPIPool

def log_prob(theta):
    t = time.time() + np.random.uniform(0.005, 0.008)
    while True:
        if time.time() >= t:
            break
    return -0.5 * np.sum(theta**2)

with MPIPool() as pool:
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    np.random.seed(42)
    initial = np.random.randn(32, 5)
    nwalkers, ndim = initial.shape
    nsteps = 100

    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    start = time.time()
    sampler.run_mcmc(initial, nsteps)
    end = time.time()
    print(end - start)
When running with
srun --mpi=pmi2 -n 40 python3 MWE.py
and
srun --mpi=pmi2 -n 2 python3 MWE.py
I got 2.2 and 20 seconds of execution respectively, which suggests that it is working correctly.
Now my actual code is a bit too long to post, but here is its structure:
import os
os.environ["OMP_NUM_THREADS"] = "1"  # I have also tried without it
import sys
import emcee as emcee
import numpy as np
from schwimmbad import MPIPool

# a function that calls a C++ executable via os.system
def AG_plotter(angls, n, efre, efrb, gg, ee, ec):
    os.system('LD_LIBRARY_PATH=/home/..cpp/lib; export LD_LIBRARY_PATH; ./a.out DATA_FILE_TO_PRINT'+str(nu)+str(angls)+gg+ee+n+efre+efrb+ec+'.dat')
    data = np.loadtxt('DATA_FILE_TO_PRINT'+str(nu[i])+str(angls[j])+gg+ee+n+efre+efrb+ec+'.dat')
    return data

def func(theta, gg, ee):
    ang, n, efre, efrb, ec = theta
    data_to_check = AG_plotter(ang, n, efre, efrb, gg, ee, ec)
    .
    # things with data_to_check #
    .
    return -res  # assume prior(theta) = 0 everywhere

# I pass some paths, so I want to run MPIPool for each path
for i in range(len(pathss)):
    # everything here is as in the previous example
    with MPIPool() as pool:
        if not pool.is_master():
            pool.wait()
            sys.exit(0)

        gg = 'g' + pathss[i] + '.txt'
        ee = 'e' + pathss[i] + '.txt'
        ndim, nwalkers = len(initial), 100
        sampler = emcee.EnsembleSampler(nwalkers, ndim, func, args=[gg, ee])
        print("Running first burn-in...")
        p0 = initial + 1e-5 * np.random.randn(nwalkers, ndim)
        p0, lp, _ = sampler.run_mcmc(p0, 15)
This code should write a DATA_FILE_TO_PRINT...dat file after each call. But what I saw is that, no matter the number of cores, it only ever writes one file and takes ages to move on.
I expected each 'cycle' of the function to produce n files, where n is the number of cores I used.
Is there anything obvious I'm doing wrong?

numpy: Bottleneck of multiprocessing when memory is not shared and no file IO

The following Python code measures the speedup obtained by increasing the number of processes. The task run in each process is just multiplying a random matrix; the matrix size is also varied and the corresponding elapsed time is measured.
Note that the processes do not share any objects and are completely independent, so I expected the performance curve over the number of processes to be almost the same for every matrix size. However, when plotting the results (see below), I found that this expectation is false. Specifically, when the matrix size becomes large (80, 160), performance hardly improves as the number of processes increases. Note: the figure legend indicates the matrix sizes.
Could you explain why performance does not improve when the matrix size is large?
For your information, here is the spec of my CPU:
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
Product Family: AMD Ryzen™ Processors
Product Line: AMD Ryzen™ 9 Desktop Processors
# of CPU Cores: 12
# of Threads: 24
Max. Boost Clock: Up to 4.6GHz
Base Clock: 3.8GHz
L1 Cache: 768KB
L2 Cache: 6MB
L3 Cache: 64MB
main script
import numpy as np
import pickle
from dataclasses import dataclass
import time
import multiprocessing
import os
import subprocess

def split_number(n_total, n_split):
    return [n_total // n_split + (1 if x < n_total % n_split else 0) for x in range(n_split)]

def task(args):
    n_iter, idx, matrix_size = args
    #cores = "{},{}".format(2 * idx, 2 * idx + 1)
    #os.system("taskset -p -c {} {}".format(cores, os.getpid()))
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)

def measure_time(n_process: int, matrix_size: int) -> float:
    n_total = 100
    assigne_list = split_number(n_total, n_process)
    pool = multiprocessing.Pool(n_process)
    ts = time.time()
    pool.map(task, zip(assigne_list, range(n_process), [matrix_size] * n_process))
    elapsed = time.time() - ts
    return elapsed

if __name__ == "__main__":
    n_experiment_sample = 5
    n_logical = os.cpu_count()
    n_physical = int(0.5 * n_logical)
    result = {}
    for mat_size in [5, 10, 20, 40, 80, 160]:
        subresult = {}
        result[mat_size] = subresult
        for n_process in range(1, n_physical + 1):
            elapsed = np.mean([measure_time(n_process, mat_size) for _ in range(n_experiment_sample)])
            subresult[n_process] = elapsed
            print("{}, {}, {}".format(mat_size, n_process, elapsed))
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
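As an aside, the commented-out taskset call in task() hints at pinning each worker to fixed cores. Below is a minimal sketch of the same idea using only the standard library (Linux only; the 2*idx core numbering is an assumption about the CPU topology, not something taken from the script above):

import os
import numpy as np

def pinned_task(args):
    n_iter, idx, matrix_size = args
    # Pin this worker to one physical core's two hardware threads (Linux only).
    # The 2*idx, 2*idx+1 numbering is an assumption; adjust it to your topology.
    os.sched_setaffinity(0, {2 * idx, 2 * idx + 1})
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)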
plot script
import numpy as np
import matplotlib.pyplot as plt
import pickle

with open("result.pkl", "rb") as f:
    result = pickle.load(f)

fig, ax = plt.subplots()
for matrix_size in result.keys():
    subresult = result[matrix_size]
    n_process_list = list(subresult.keys())
    elapsed_time_list = np.array(list(subresult.values()))
    speedups = elapsed_time_list[0] / elapsed_time_list
    ax.plot(n_process_list, speedups, label=matrix_size)
ax.set_xlabel("number of process")
ax.set_ylabel("speed up compared to single process")
ax.legend(loc="upper left", borderaxespad=0, fontsize=10, framealpha=1.0)
plt.show()

slow quadratic constraint creation in Pyomo

I am trying to construct a large-scale quadratic constraint in Pyomo as follows:
import pyomo as pyo
from pyomo.environ import *
scale = 5000
pyo.n = Set(initialize=range(scale))
pyo.x = Var(pyo.n, bounds=(-1.0,1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q_values = dict(zip(list(itertools.product(range(0,scale), range(0,scale))), Q.flatten()))
pyo.Q = Param(pyo.n, pyo.n, initialize=Q_values)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*pyo.Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
It turns out the last line is unbearably slow at this problem scale. I tried several things mentioned in "PyPSA, performance of creating Pyomo constraints" and "Pyomo seems very slow to write models", but no luck.
Any suggestions? (Once the model was constructed, solving with Ipopt was also slow, but I guess that's independent of Pyomo.)
PS: constructing the quadratic constraint directly, as follows, didn't help either (also unbearably slow):
pyo.xQx = Constraint( expr=sum( pyo.x[i]*Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
You can get a small speed-up by using quicksum in place of sum. To measure the performance of the last line, I modified your code a little bit, as shown:
import itertools
from pyomo.environ import *
import time
import numpy as np

scale = 5000
m = ConcreteModel()
m.n = Set(initialize=range(scale))
m.x = Var(m.n, bounds=(-1.0, 1.0))

# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q = np.ones([scale, scale])
Q_values = dict(
    zip(list(itertools.product(range(scale), range(scale))), Q.flatten()))
m.Q = Param(m.n, m.n, initialize=Q_values)

t = time.time()
m.xQx = Constraint(expr=sum(m.x[i]*m.Q[i, j]*m.x[j]
                            for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon = {}".format(time.time() - t))
The time I measured with sum was around 174.4 s. With quicksum I got 163.3 seconds.
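For reference, a minimal sketch of the quicksum variant is below; only the summation call changes (quicksum comes with the pyomo.environ import above, m is the model already built, and the constraint name xQx_quicksum is just for illustration):

t = time.time()
# quicksum builds the large expression in a single pass instead of
# creating many intermediate sum objects.
m.xQx_quicksum = Constraint(expr=quicksum(m.x[i]*m.Q[i, j]*m.x[j]
                                          for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon with quicksum = {}".format(time.time() - t))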
Not satisfied with such a modest gain, I tried to re-formulate the problem as an SOCP. If you can factorize Q as Q = (F^T)F, then you can easily express your constraint as a quadratic cone, as shown below:
import itertools
import time

import pyomo.kernel as pmo
from pyomo.environ import *
import numpy as np

scale = 5000
m = pmo.block()
m.n = np.arange(scale)
m.x = pmo.variable_list()
for j in m.n:
    m.x.append(pmo.variable(lb=-1.0, ub=1.0))

# Q = (F^T)F factors (eg.: Cholesky factor)
_F = np.ones([scale, scale])

t = time.time()
F = pmo.parameter_list()
for f in _F:
    _row = pmo.parameter_list(pmo.parameter(_e) for _e in f)
    F.append(_row)
print("Time taken to make parameter F = {}".format(time.time() - t))

t1 = time.time()
x_expr = pmo.expression_tuple(pmo.expression(
    expr=sum_product(f, m.x, index=m.n)) for f in F)
print("Time for evaluating Fx = {}".format(time.time() - t1))

t2 = time.time()
m.xQx = pmo.conic.quadratic.as_domain(1, x_expr)
print("Time for quad constr = {}".format(time.time() - t2))
Running on the same machine, I observed a time of around 112 seconds in the preparation of the expression that gets passed to the cone. Actually preparing the cone takes very little time (0.031 s).
Naturally, the only solver that can handle Conic constraints in pyomo is MOSEK. A recent update to the Pyomo-MOSEK interface has also shown promising speed-ups.
You can try MOSEK for free by getting yourself a MOSEK trial license. If you want to read more about Conic reformulations, a quick and thorough guide can be found in the MOSEK modelling cookbook. Lastly, if you are affiliated with an academic institution, then we can offer you a personal/institutional academic license. Hope you find this useful.

Using the multiprocessing module to run parallel processes where one is fed by (dependent on) the other, for the Viterbi algorithm

I have recently played around with Python's multiprocessing module to speed up the forward-backward algorithm for hidden Markov models, since forward filtering and backward filtering can run independently. Seeing the run time halve was awe-inspiring stuff.
I am now attempting to include some multiprocessing in my iterative Viterbi algorithm. In this algorithm, the two processes I am trying to run are not independent. The val_max part can run independently, but arg_max[t] depends on val_max[t-1]. So I played with the idea that one could run val_max as a separate process and arg_max as another separate process fed by val_max.
I admit to being a bit out of my depth here; I do not know much about multiprocessing beyond watching some basic videos and browsing blogs. I provide my attempt below, but it does not work.
import numpy as np
from time import time, sleep
import multiprocessing as mp

class Viterbi:
    def __init__(self, A, B, pi):
        self.M = A.shape[0]   # number of hidden states
        self.A = A            # transition matrix
        self.B = B            # observation matrix
        self.pi = pi          # initial distribution
        self.T = None         # time horizon
        self.val_max = None
        self.arg_max = None
        self.obs = None
        self.sleep_time = 1e-6
        self.output = mp.Queue()

    def get_path(self, x):
        # returns the most likely state sequence given observed sequence x
        # using the Viterbi algorithm
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        self.val_max[0] = self.pi * self.B[:, x[0]]
        for t in range(1, self.T):
            # independent process
            self.val_max[t] = np.max(self.A * np.outer(self.val_max[t-1], self.B[:, x[t]]), axis=0)
            # dependent process
            self.arg_max[t] = np.argmax(self.val_max[t-1] * self.A.T, axis=1)
        # BACKTRACK
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(self.val_max[self.T-1])
        for t in range(self.T-2, -1, -1):
            states[t] = self.arg_max[t+1, states[t+1]]
        return states

    def get_val(self):
        '''Independent process'''
        for t in range(1, self.T):
            self.val_max[t] = np.max(self.A * np.outer(self.val_max[t-1], self.B[:, self.obs[t]]), axis=0)
        self.output.put(self.val_max)

    def get_arg(self):
        '''Dependent process'''
        for t in range(1, self.T):
            while 1:
                # process info if available
                if self.val_max[t-1].any() != 0:
                    self.arg_max[t] = np.argmax(self.val_max[t-1] * self.A.T, axis=1)
                    break
                # else sleep and wait for info to arrive
                sleep(self.sleep_time)
        self.output.put(self.arg_max)

    def get_path_parallel(self, x):
        self.obs = x
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        val_process = mp.Process(target=self.get_val)
        arg_process = mp.Process(target=self.get_arg)
        # get first initial value for val_max which can feed arg_process
        self.val_max[0] = self.pi * self.B[:, x[0]]
        arg_process.start()
        val_process.start()
        arg_process.join()
        val_process.join()
Note: get_path_parallel does not have backtracking yet.
It would seem that val_process and arg_process never really run. I'm really not sure why this happens. You can run the code on the Wikipedia example for the Viterbi algorithm:
obs = np.array([0,1,2])  # normal, then cold, finally dizzy
pi = np.array([0.6,0.4])
A = np.array([[0.7,0.3],
              [0.4,0.6]])
B = np.array([[0.5,0.4,0.1],
              [0.1,0.3,0.6]])

viterbi = Viterbi(A,B,pi)
path = viterbi.get_path(obs)
I also tried using Ray; however, I had no clue what I was really doing there. Can you please recommend what I should do to get the parallel version to run? I must be doing something very wrong, but I do not know what.
Your help would be much appreciated.
I have managed to get my code working thanks to @SıddıkAçıl. The producer-consumer pattern is what does the trick. I also realised that the processes can run successfully, but unless the final results are stored in a "result queue" of sorts, they vanish. By this I mean that the child process did fill in values in my numpy arrays val_max and arg_max after start(), but when I accessed them afterwards they were still np.zeros arrays. I verified that they do fill up correctly by printing them just as the process is about to terminate (at the last self.T in the iteration). So instead of printing them, I just added them to a multiprocessing Queue object on the final iteration to capture the entire filled-up arrays.
I provide my updated working code below. NOTE: it works, but takes twice as long to complete as the serial version. My thoughts on why this might be so are as follows:
I can get it to run as two processes but don't actually know how to do it properly. Experienced programmers might know how to fix it with the chunksize parameter.
The two processes I am separating are numpy matrix operations. These processes execute so fast already that the overhead of concurrency (multiprocessing) is not worth the theoretical improvement. Had the two processes been the two original for loops (as used in Wikipedia and most implementations) then multiprocessing might have given gains (perhaps I should investigate this). Furthermore, because we have a producer-consumer pattern and not two independent processes (producer-producer pattern) we can only expect the producer-consumer pattern to run as long as the longest of the two processes (in this case the producer takes twice as long as the consumer). We can not expect run time to halve as in the producer-producer scenario (this happened with my parallel forward-backward HMM filtering algorithm).
My computer has 4 cores and numpy already does built-in CPU multiprocessing optimization on its operations. By me attempting to use cores to make the code faster, I am depriving numpy of cores that it could use in a more effective manner. To figure this out, I am going to time the numpy operations and see if they are slower in my concurrent version as compared to that of my serial version.
I will update if I learn anything new. If you perhaps know the real reason for why my concurrent code is so much slower, please do let me know. Here is the code:
import numpy as np
from time import time
import multiprocessing as mp

class Viterbi:
    def __init__(self, A, B, pi):
        self.M = A.shape[0]   # number of hidden states
        self.A = A            # transition matrix
        self.B = B            # observation matrix
        self.pi = pi          # initial distribution
        self.T = None         # time horizon
        self.val_max = None
        self.arg_max = None
        self.obs = None
        self.intermediate = mp.Queue()
        self.result = mp.Queue()

    def get_path(self, x):
        '''Sequential/serial Viterbi algorithm with backtracking'''
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        self.val_max[0] = self.pi * self.B[:, x[0]]
        for t in range(1, self.T):
            # independent process
            self.val_max[t] = np.max(self.A * np.outer(self.val_max[t-1], self.B[:, x[t]]), axis=0)
            # dependent process
            self.arg_max[t] = np.argmax(self.val_max[t-1] * self.A.T, axis=1)
        # BACKTRACK
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(self.val_max[self.T-1])
        for t in range(self.T-2, -1, -1):
            states[t] = self.arg_max[t+1, states[t+1]]
        return states

    def get_val(self, initial_val_max):
        '''Independent producer process'''
        val_max = initial_val_max
        for t in range(1, self.T):
            val_max = np.max(self.A * np.outer(val_max, self.B[:, self.obs[t]]), axis=0)
            #print('Transfer: ', val_max)
            self.intermediate.put(val_max)
            if t == self.T-1:
                self.result.put(val_max)  # we only need the last val_max value for backtracking

    def get_arg(self):
        '''Dependent consumer process'''
        t = 1
        while t < self.T:
            val_max = self.intermediate.get()
            #print('Receive: ', val_max)
            self.arg_max[t] = np.argmax(val_max * self.A.T, axis=1)
            if t == self.T-1:
                self.result.put(self.arg_max)
                #print('Processed: ', self.arg_max[t])
            t += 1

    def get_path_parallel(self, x):
        '''Multiprocessing producer-consumer implementation of the Viterbi algorithm.'''
        self.obs = x
        self.T = len(x)
        self.arg_max = np.zeros((self.T, self.M))  # we don't tabulate val_max anymore
        initial_val_max = self.pi * self.B[:, x[0]]
        producer_process = mp.Process(target=self.get_val, args=(initial_val_max,), daemon=True)
        consumer_process = mp.Process(target=self.get_arg, daemon=True)
        self.intermediate.put(initial_val_max)  # initial production put into the pipeline for consumption
        consumer_process.start()  # we can already consume initial_val_max
        producer_process.start()
        #val_process.join()
        #arg_process.join()
        #self.output.join()
        return self.backtrack(self.result.get(), self.result.get())  # backtrack takes the last row of val_max and the entire arg_max

    def backtrack(self, val_max_last_row, arg_max):
        '''Backtracking the dynamic-programming solution (actually a trellis diagram)
        produced by the multiprocessing Viterbi algorithm.'''
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(val_max_last_row)
        for t in range(self.T-2, -1, -1):
            states[t] = arg_max[t+1, states[t+1]]
        return states

if __name__ == '__main__':
    obs = np.array([0,1,2])  # normal, then cold, finally dizzy
    T = 100000
    obs = np.random.binomial(2, 0.3, T)
    pi = np.array([0.6,0.4])
    A = np.array([[0.7,0.3],
                  [0.4,0.6]])
    B = np.array([[0.5,0.4,0.1],
                  [0.1,0.3,0.6]])

    t1 = time()
    viterbi = Viterbi(A,B,pi)
    path = viterbi.get_path(obs)
    t2 = time()
    print('Iterative Viterbi')
    print('Path: ', path)
    print('Run-time: ', round(t2-t1, 6))

    t1 = time()
    viterbi = Viterbi(A,B,pi)
    path = viterbi.get_path_parallel(obs)
    t2 = time()
    print('\nParallel Viterbi')
    print('Path: ', path)
    print('Run-time: ', round(t2-t1, 6))
Welcome to SO. Consider taking a look at the producer-consumer pattern, which is heavily used in multiprocessing.
Beware that on Windows, multiprocessing in Python re-instantiates your code for every process you create, so your Viterbi objects, and therefore their Queue fields, are not the same.
Observe this behaviour through:
import os

def get_arg(self):
    '''Dependent Process'''
    print("Dependent ", self)
    print("Dependent ", self.output)
    print("Dependent ", os.getpid())

def get_val(self):
    '''Independent Process'''
    print("Independent ", self)
    print("Independent ", self.output)
    print("Independent ", os.getpid())
if __name__ == "__main__":
    print("Hello from main process", os.getpid())
    obs = np.array([0,1,2])  # normal, then cold, finally dizzy
    pi = np.array([0.6,0.4])
    A = np.array([[0.7,0.3],
                  [0.4,0.6]])
    B = np.array([[0.5,0.4,0.1],
                  [0.1,0.3,0.6]])
    viterbi = Viterbi(A,B,pi)
    print("Main viterbi object", viterbi)
    print("Main viterbi object queue", viterbi.output)
    path = viterbi.get_path_parallel(obs)
There are three different Viterbi objects as there are three different processes. So, what you need in terms of parallelism is not processes. You should explore the threading library that Python offers.
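For illustration, here is a minimal producer-consumer sketch using threads instead of processes; because threads share memory, both workers see the same queue and the same arrays, with no re-instantiation. This is a generic sketch built on queue.Queue and threading, not code from the original post:

import queue
import threading

import numpy as np

def producer(q, n_items):
    # Produce a few arrays and hand them to the consumer through the shared queue.
    for t in range(n_items):
        q.put(np.random.rand(3))
    q.put(None)  # sentinel: tell the consumer we are done

def consumer(q, results):
    # Consume until the sentinel arrives; both threads share 'results'.
    while True:
        item = q.get()
        if item is None:
            break
        results.append(np.argmax(item))

if __name__ == "__main__":
    q = queue.Queue()
    results = []
    c = threading.Thread(target=consumer, args=(q, results))
    p = threading.Thread(target=producer, args=(q, 5))
    c.start()
    p.start()
    p.join()
    c.join()
    print(results)  # argmax of each produced array, in order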

Parameter allocation for MPI collectives

This is my first python MPI program, and I would really appreciate some help optimizing the code. Specifically, I have two questions regarding scattering and gathering, if anyone can help. This program is much slower than a traditional approach without MPI.
I am trying to scatter one array, do some work on the nodes which fills another set of arrays, and gather those. My questions are primarily in the setup and gather sections of code.
Is it necessary to allocate memory for every array (A, my_A, xset, yset, my_xset, my_yset) on all nodes? Some of these can get large.
Is an array the best structure to gather for the data I am using? When I scatter A, it is relatively small. However, xset and yset can get very large (over a million elements at least).
Here is the code:
#!/usr/bin/env python

#Libraries
import numpy as py
import matplotlib.pyplot as plt
from mpi4py import MPI

comm = MPI.COMM_WORLD
print "%d nodes running." % (comm.size)

#Variables
cmin = 0.0
cmax = 4.0
cstep = 0.005
run = 300
disc = 100

#Setup
if comm.rank == 0:
    A = py.arange(cmin, cmax + cstep, cstep)
else:
    A = py.arange((cmax - cmin) / cstep, dtype=py.float64)
my_A = py.empty(len(A) / comm.size, dtype=py.float64)
xset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
yset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
my_xset = py.empty(0, dtype=py.float64)
my_yset = py.empty(0, dtype=py.float64)

#Scatter
comm.Scatter( [A, MPI.DOUBLE], [my_A, MPI.DOUBLE] )

#Work
for i in my_A:
    x = 0.5
    for j in range(0, run):
        x = i * x * (1 - x)
        if j >= disc:
            my_xset = py.append(my_xset, i)
            my_yset = py.append(my_yset, x)

#Gather
comm.Allgather( [my_xset, MPI.DOUBLE], [xset, MPI.DOUBLE])
comm.Allgather( [my_yset, MPI.DOUBLE], [yset, MPI.DOUBLE])

#Export Results
if comm.rank == 0:
    f = open('points.3d', 'w+')
    for k in range(0, len(xset)-1):
        f.write('(' + str(round(xset[k],2)) + ',' + str(round(yset[k],2)) + ',0)\n')
    f.close()
You do not need to allocate A on the non-root processes. If you were not using Allgather, but a simple Gather, you could also omit xset and yset. Basically you only need to allocate data that is actually used by the collectives - the other parameters are only significant on root.
Yes, a numpy array is an appropriate data structure for such large arrays. For small data where it is not performance-critical, it can be more convenient and pythonic to use the all-lowercase methods and communicate with Python objects (lists etc).
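For illustration, a minimal sketch of that Gather pattern, with the receive buffer allocated only on root (the chunk size and file name here are assumptions for the example, not taken from the question's script):

# Run with e.g.: mpiexec -n 4 python gather_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
chunk = 10  # per-rank element count (assumed for this sketch)

# Every rank fills only its own small send buffer.
my_x = np.full(chunk, float(comm.rank), dtype=np.float64)

# The receive buffer is needed only on root; the other ranks pass None.
xset = np.empty(chunk * comm.size, dtype=np.float64) if comm.rank == 0 else None

comm.Gather([my_x, MPI.DOUBLE], [xset, MPI.DOUBLE] if comm.rank == 0 else None, root=0)

if comm.rank == 0:
    print(xset)  # gathered data, ordered by rank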
