This is my first python MPI program, and I would really appreciate some help optimizing the code. Specifically, I have two questions regarding scattering and gathering, if anyone can help. This program is much slower than a traditional approach without MPI.
I am trying to scatter one array, do some work on the nodes which fills another set of arrays, and gather those. My questions are primarily in the setup and gather sections of code.
Is it necessary to allocate memory for every array (A, my_A, xset, yset, my_xset, my_yset) on all nodes? Some of these can get large.
Is an array the best structure to gather for the data I am using? When I scatter A, it is relatively small. However, xset and yset can get very large (over a million elements at least).
Here is the code:
#!/usr/bin/env python
import numpy as py
import matplotlib.pyplot as plt
from mpi4py import MPI

comm = MPI.COMM_WORLD
print "%d nodes running." % (comm.size)

cmin = 0.0
cmax = 4.0
cstep = 0.005
run = 300
disc = 100

if comm.rank == 0:
    A = py.arange(cmin, cmax + cstep, cstep)
else:
    A = py.arange((cmax - cmin) / cstep, dtype=py.float64)
my_A = py.empty(len(A) / comm.size, dtype=py.float64)
xset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
yset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
my_xset = py.empty(0, dtype=py.float64)
my_yset = py.empty(0, dtype=py.float64)

comm.Scatter( [A, MPI.DOUBLE], [my_A, MPI.DOUBLE] )

for i in my_A:
    x = 0.5
    for j in range(0, run):
        x = i * x * (1 - x)
        if j >= disc:
            my_xset = py.append(my_xset, i)
            my_yset = py.append(my_yset, x)

comm.Allgather( [my_xset, MPI.DOUBLE], [xset, MPI.DOUBLE])
comm.Allgather( [my_yset, MPI.DOUBLE], [yset, MPI.DOUBLE])

#Export Results
if comm.rank == 0:
    f = open('points.3d', 'w+')
    for k in range(0, len(xset)-1):
        f.write('(' + str(round(xset[k],2)) + ',' + str(round(yset[k],2)) + ',0)\n')
You do not need to allocate A on the non-root processes. If you were using a simple Gather instead of Allgather, you could also omit xset and yset there. Basically, you only need to allocate the buffers that a collective actually uses on a given rank; the other parameters are only significant on root.
Yes, a numpy array is an appropriate data structure for such large arrays. For small data where it is not performance-critical, it can be more convenient and pythonic to use the all-lowercase methods and communicate with Python objects (lists etc).
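As a rough sketch of the allocation pattern described above (assuming mpi4py and NumPy are installed; the helper name `local_points`, the chunk size, and the use of `linspace` are illustrative choices, not the asker's code), only root holds the full `A` and the gathered result, while every rank holds just its own slice. Each rank preallocates its output so that `Gather` sees equal-sized buffers:

```python
import numpy as np


def local_points(my_A, run=300, disc=100):
    """Iterate the logistic map for each value in my_A.

    Results go into preallocated arrays, so every rank produces exactly
    len(my_A) * (run - disc) points, which is the fixed size Gather expects.
    """
    keep = run - disc
    xs = np.empty(len(my_A) * keep, dtype=np.float64)
    ys = np.empty(len(my_A) * keep, dtype=np.float64)
    for n, a in enumerate(my_A):
        x = 0.5
        for j in range(run):
            x = a * x * (1 - x)
            if j >= disc:
                k = n * keep + (j - disc)
                xs[k] = a
                ys[k] = x
    return xs, ys


def main(chunk=200, run=300, disc=100):
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; skipping the MPI part")
        return
    comm = MPI.COMM_WORLD

    # Only root allocates the full input; the other ranks pass None.
    A = None
    if comm.rank == 0:
        A = np.linspace(0.0, 4.0, chunk * comm.size)

    # Every rank needs a receive buffer for its own slice.
    my_A = np.empty(chunk, dtype=np.float64)
    comm.Scatter(A, my_A, root=0)

    my_x, my_y = local_points(my_A, run, disc)

    # With Gather (not Allgather), only root allocates the full result.
    xset = yset = None
    if comm.rank == 0:
        xset = np.empty(my_x.size * comm.size, dtype=np.float64)
        yset = np.empty(my_y.size * comm.size, dtype=np.float64)
    comm.Gather(my_x, xset, root=0)
    comm.Gather(my_y, yset, root=0)

    if comm.rank == 0:
        print("gathered %d points" % xset.size)


if __name__ == "__main__":
    main()
```

This would be launched with something like `mpirun -n 4 python script.py`. For small, non-critical data, the lowercase `comm.scatter` and `comm.gather` accept ordinary Python objects instead of buffers, at the cost of pickling.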
I'm trying to learn more about using shared memory to improve the performance of some CUDA kernels in Numba. To that end, I looked at the matrix multiplication example in the Numba documentation and tried to implement it to see the gain.
This is my test implementation. I'm aware that the example in the documentation has some issues, which I read about here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)

a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)

s = timer()
matmul[bpg,tpb](a_in, b_in, c_out1);
gpu_time = timer() - s
c_host1 = c_out1.copy_to_host()

s = timer()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
gpu_time = timer() - s
c_host2 = c_out2.copy_to_host()
The execution times of the two kernels above are essentially the same; in fact, matmul was faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.
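The effect of swapping the indices can be modelled on the CPU. The sketch below is plain Python, not GPU code; `flat_offset` and `c_offsets` are hypothetical helper names, and it assumes a row-major (C-order) array, which is NumPy's default. It lists the flat element offsets of C[i, j] touched by four threads with consecutive threadIdx.x:

```python
def flat_offset(row, col, ncols):
    """Element offset of [row, col] in a row-major array with ncols columns."""
    return row * ncols + col


def c_offsets(swap, ncols, n_threads=4):
    """Offsets of C[i, j] written by threads with threadIdx.x = 0..n_threads-1.

    All threads are taken from block (0, 0) with threadIdx.y = 0, so
    cuda.grid(2) would return (tx, 0) for each of them.
    """
    offsets = []
    for tx in range(n_threads):
        gx, gy = tx, 0
        if swap:
            j, i = gx, gy   # models: j, i = cuda.grid(2)
        else:
            i, j = gx, gy   # models: i, j = cuda.grid(2)
        offsets.append(flat_offset(i, j, ncols))
    return offsets


print(c_offsets(swap=False, ncols=4096))  # [0, 4096, 8192, 12288]
print(c_offsets(swap=True, ncols=4096))   # [0, 1, 2, 3]
```

With the original assignment, neighbouring threads touch addresses a full row apart; with the swapped assignment they touch consecutive elements, which is the pattern the GPU can coalesce into fewer memory transactions.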
I am trying to construct a large-scale quadratic constraint in Pyomo as follows:
import pyomo as pyo
from pyomo.environ import *
scale = 5000
pyo.n = Set(initialize=range(scale))
pyo.x = Var(pyo.n, bounds=(-1.0,1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q_values = dict(zip(list(itertools.product(range(0,scale), range(0,scale))), Q.flatten()))
pyo.Q = Param(pyo.n, pyo.n, initialize=Q_values)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*pyo.Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
It turns out the last line is unbearably slow at this problem scale. I tried several things mentioned in PyPSA, "Performance of creating Pyomo constraints", and "pyomo seems very slow to write models", but had no luck.
Any suggestions? (Once the model was constructed, Ipopt solving was also slow, but I guess that is independent of Pyomo.)
PS: Constructing the quadratic constraint directly as follows didn't help either (it is also unbearably slow):
pyo.xQx = Constraint( expr=sum( pyo.x[i]*Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
You can get a small speed-up by using quicksum in place of sum. To measure the performance of the last line, I modified your code a little bit, as shown:
import itertools
from pyomo.environ import *
import time
import numpy as np
scale = 5000
m = ConcreteModel()
m.n = Set(initialize=range(scale))
m.x = Var(m.n, bounds=(-1.0, 1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q = np.ones([scale, scale])
Q_values = dict(
    zip(list(itertools.product(range(scale), range(scale))), Q.flatten()))
m.Q = Param(m.n, m.n, initialize=Q_values)
t = time.time()
m.xQx = Constraint(expr=sum(m.x[i]*m.Q[i, j]*m.x[j]
                            for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon = {}".format(time.time() - t))
The time I measured with sum was around 174.4 s. With quicksum I got 163.3 seconds.
Not satisfied with such a modest gain, I tried to re-formulate as a SOCP. If you can factorize Q like so: Q= (F^T F), then you could easily express your constraint as a quadratic cone, as shown below:
import itertools
import time
import pyomo.kernel as pmo
from pyomo.environ import *
import numpy as np

scale = 5000
m = pmo.block()
m.n = np.arange(scale)
m.x = pmo.variable_list()
for j in m.n:
    m.x.append(pmo.variable(lb=-1.0, ub=1.0))

# Q = (F^T)F factors (eg.: Cholesky factor)
_F = np.ones([scale, scale])
t = time.time()
F = pmo.parameter_list()
for f in _F:
    _row = pmo.parameter_list(pmo.parameter(_e) for _e in f)
    F.append(_row)
print("Time taken to make parameter F = {}".format(time.time() - t))

t1 = time.time()
x_expr = pmo.expression_tuple(pmo.expression(
    expr=sum_product(f, m.x, index=m.n)) for f in F)
print("Time for evaluating Fx = {}".format(time.time() - t1))

t2 = time.time()
m.xQx = pmo.conic.quadratic.as_domain(1, x_expr)
print("Time for quad constr = {}".format(time.time() - t2))
Running on the same machine, I observed a time of around 112 seconds in the preparation of the expression that gets passed to the cone. Actually preparing the cone takes very little time (0.031 s).
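The factorization step can be checked numerically. This is a small NumPy sketch, assuming Q is symmetric positive definite (the random matrix here is hypothetical test data, not the asker's Q). It uses a Cholesky factor as F and confirms that x^T Q x equals ||Fx||^2, which is why x^T Q x <= 1 can be written as the quadratic cone (1, Fx):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Build a symmetric positive definite Q so that a Cholesky factor exists.
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)

# np.linalg.cholesky returns lower-triangular L with Q = L @ L.T,
# so F = L.T satisfies Q = F.T @ F.
L = np.linalg.cholesky(Q)
F = L.T

x = rng.uniform(-1.0, 1.0, size=n)
quad = x @ Q @ x                       # x^T Q x
norm_sq = np.linalg.norm(F @ x) ** 2   # ||F x||_2^2

print(np.isclose(quad, norm_sq))
```

If Q is only positive semidefinite, Cholesky may fail and another factorization (for example one based on an eigendecomposition) would be needed to obtain F.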
Naturally, the only solver that can handle Conic constraints in pyomo is MOSEK. A recent update to the Pyomo-MOSEK interface has also shown promising speed-ups.
You can try MOSEK for free by getting yourself a MOSEK trial license. If you want to read more about Conic reformulations, a quick and thorough guide can be found in the MOSEK modelling cookbook. Lastly, if you are affiliated with an academic institution, then we can offer you a personal/institutional academic license. Hope you find this useful.
I have a numpy script where I do the following operation with big matrices (can go over 10000x10000 with float values):
F = (I - Q)^-1 * R
I first used pytorch tensors on CPU (i7-8750H) and it runs 2 times faster:
tensorQ = torch.from_numpy(Q)
tensorR = torch.from_numpy(R)
sub= torch.eye(a * d, dtype=float) - tensorQ
inv= torch.inverse(sub)
tensorF = torch.mm(inv, tensorR)
F = tensorF.numpy()
Now I'm trying to execute it on the GPU (1050Ti Max-Q) to see if I can get another speedup, but the code runs slower than the numpy version (I've already installed CUDA and cuDNN). Maybe pytorch is not even the best library for this kind of thing, but I'm learning it now and I thought it could help me:
dev = torch.device('cuda')
tensorQ = torch.from_numpy(Q).to(dev)
tensorR = torch.from_numpy(R).to(dev)
sub= torch.eye(a * d, dtype=float).to(dev) - tensorQ
inv= torch.inverse(sub).to(dev)
tensorF = torch.mm(inv, tensorR).cpu()
F = tensorF.numpy()
Am I missing something?
I also tried using CuPy but it's still slow:
Q = cp.array(matrixQ)
R = cp.array(matrixR)
sub = cp.identity(attacker * defender) - Q
inv = cp.linalg.inv(sub)
F = cp.matmul(inv, R)
F = cp.asnumpy(F)
Probably the overhead of memory allocation is just too big compared to the computation of few operations.
I am trying to run a multiprocessing task on all 4 cores of a processor. When I run just one of the four tasks that I actually want to run, the code takes about 3 seconds per 1000 iterations. However, if I set it up to run all 4 processes, the time roughly quadruples to 13 seconds per 1000 iterations. I'm going to attach part of my code below. I'm not sure why this is happening. I've tried to do my own investigation and it doesn't seem to be a memory or CPU issue. If I monitor as it runs with one task, one of the processors is active at 100% and only .8% of the memory is in use. And when I run 4 tasks, all 4 processors are active at 100%, each with .8% of memory used.
Not sure why this is happening. I've used the Multiprocessing task many times before and have never noticed a run time increase. Anyways, here is my messy code:
import numpy as np
import multiprocessing
from astropy.io import fits
import os
import time

def SmoothAllSpectra(ROOT, range_lo, range_hi):
    file_names = os.listdir(ROOT)
    for i in range(len(file_names)):
        file_names[i] = ROOT + file_names[i]
    pool = multiprocessing.Pool(4)
    for filename in pool.imap(work, file_names[range_lo:range_hi]):
        print filename+' complete'
    return ROOT

def work(filename):
    data = ImportSpectrum(filename) ##data is a 2d numpy array
    smooth_data = SmoothSpectrum(data, 6900, h0=1)
    return filename

def SmoothSpectrum(data, max_wl, h0=1):
    wl_data = data[0]
    count_data = data[1]
    hi_idx = np.argmin(np.abs(wl_data - max_wl))
    smoothed = np.empty((2, len(wl_data[:hi_idx])))
    smoothed[0] = wl_data[:hi_idx]
    temp = np.exp(-.5 * np.power(smoothed[0,int(len(smoothed[0])/2.)] - smoothed[0], 2) / h0**2)
    idx = np.where(temp>1e-10)[0]
    window = int(np.ceil(len(idx)/2.))  # integer so it can be used in slices
    numer = np.zeros(len(smoothed[0]))
    denom = np.zeros(len(smoothed[0]))
    for i in range(len(numer)):
        K1 = np.zeros(len(numer))
        if (i-window >= 0) and (i+window <= len(smoothed[0])):
            K1[i-window:i+window] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,i-window:i+window], 2) / h0**2)
        elif i-window < 0:
            K1[:i+window] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,:i+window], 2) / h0**2)
        else:
            K1[i-window:] = np.exp(-.5 * np.power(smoothed[0,i] - smoothed[0,i-window:], 2) / h0**2)
        numer += count_data[i] * K1
        denom += K1
    smoothed[1] = numer / denom
    return smoothed
There's nothing fancy going on in the code, just smoothing some data. I'm sure there are loads of ways to optimize this code, but I'm interested in why I'm seeing a computation time increase going from 1 task to 4 tasks.
I have a piece of code that uses Numbapro to write a simple kernel that squares the contents of two arrays of size 41724, adds them together and stores the result in another array. All the arrays have the same size and are float32. The code is below:
import numpy as np
from numba import *
from numbapro import cuda

@cuda.jit(void(float32[:],float32[:],float32[:]))
def square_add(a,b,c):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    #Since the length of a is 41724 and the total
    #threads is 41*1024 = 41984, this check is necessary
    if (i>len(a)):
        return
    else:
        c[i] = a[i]*a[i] + b[i]*b[i]

a = np.array(range(0,41724),dtype = np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c,copy=False)

#Launch the kernel; Gridsize = (1,41),Blocksize=(1,1024)
square_add[(1,41),(1,1024)](d_a,d_b,d_c)
c = d_c.copy_to_host()
print c
print len(c[0])
The values I am getting when I print the result of the operation (the array c) is completely different compared to that when I do the exact same thing in a python terminal.
I do not know what I am doing wrong here.
There are two problems here.
The first is that you are specifying a block and grid dimension for your CUDA kernel launch which is incompatible with the indexing scheme you have chosen to use in the kernel.
square_add[(1,41),(1,1024)](d_a,d_b,d_c)
launches a two dimensional grid where all the threads have the same block and thread dimensions in x, and vary only in y. This implies that
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
i = tx + bx * bw
will yield i=0 for every thread. If you change the kernel launch to this:
square_add[(41,1),(1024,1)](d_a,d_b,d_c)
you will find that the indexing will work correctly.
The second is that c has been declared as a two dimensional array, but the kernel function signature has been declared as a one dimensional array. Under some circumstances, the numbapro runtime should detect this and raise an error.
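The indexing problem can be reproduced without a GPU. The following is a plain Python model of the launch, not CUDA code; `global_indices` is a hypothetical helper that enumerates `i = tx + bx * bw` for every thread of a two dimensional launch, where the `(grid, block)` tuples are given as (x, y) extents:

```python
def global_indices(grid, block):
    """Collect i = tx + bx * bw over every thread of a 2D launch."""
    grid_x, grid_y = grid
    block_x, block_y = block
    indices = set()
    for bx in range(grid_x):
        for by in range(grid_y):
            for tx in range(block_x):
                for ty in range(block_y):
                    # Only the x components enter the index, as in the kernel.
                    indices.add(tx + bx * block_x)
    return indices


bad = global_indices((1, 41), (1, 1024))    # original launch
good = global_indices((41, 1), (1024, 1))   # corrected launch

print(sorted(bad))            # [0]
print(len(good), max(good))   # 41984 41983
```

With the original configuration every thread computes index 0, while the corrected configuration covers 0 through 41983, which is why the bounds check against the array length of 41724 is still needed in the kernel.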
I was able to get your example to work correctly like this:
import numpy as np
from numba import *
from numbapro import cuda

@cuda.jit(void(float32[:],float32[:],float32[:,:]))
def square_add(a,b,c):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    if (i<len(a)):
        c[0,i] = a[i]*a[i] + b[i]*b[i]

a = np.array(range(0,41724),dtype=np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c, copy=False)

square_add[(41,1),(1024,1)](d_a,d_b,d_c)
c = d_c.copy_to_host()
print(c)
print(c.shape)
[Note I am using Python 3, so this uses new style print statements]
$ ipython
numbapro:1: ImportWarning: The numbapro package is deprecated in favour of the accelerate package. Please update your code to use equivalent functions from accelerate.
[[ 1.74089216e+09 1.74097562e+09 1.74105907e+09 ..., 8.70371021e+09
8.70396006e+09 8.70421094e+09]]
(1, 41724)