Basic matrix operation in PyTorch/CuPy using GPU - python

I have a numpy script where I do the following operation with big matrices (can go over 10000x10000 with float values):
F = (I - Q)^-1 * R
I first used PyTorch tensors on the CPU (i7-8750H) and it runs about 2 times faster than the numpy version:
tensorQ = torch.from_numpy(Q)
tensorR = torch.from_numpy(R)
sub = torch.eye(a * d, dtype=float) - tensorQ
inv = torch.inverse(sub)
tensorF = torch.mm(inv, tensorR)
F = tensorF.numpy()
Now I'm trying to run it on the GPU (1050 Ti Max-Q) to see if I can get another speedup, but the code runs slower than the numpy version (I've already installed CUDA and cuDNN). Maybe PyTorch isn't even the best library for this kind of thing, but I'm learning it now and I thought it could help:
dev = torch.device('cuda')
tensorQ = torch.from_numpy(Q).to(dev)
tensorR = torch.from_numpy(R).to(dev)
sub = torch.eye(a * d, dtype=float).to(dev) - tensorQ
inv = torch.inverse(sub).to(dev)
tensorF = torch.mm(inv, tensorR).cpu()
F = tensorF.numpy()
Am I missing something?
Edit:
I also tried using CuPy, but it's still slow:
Q = cp.array(matrixQ)
R = cp.array(matrixR)
sub = cp.identity(attacker * defender) - Q
inv = cp.linalg.inv(sub)
F = cp.matmul(inv, R)
F = cp.asnumpy(F)
Probably the overhead of memory allocation is just too big compared to the cost of these few operations.
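For what it's worth, here is a minimal sketch of how the same computation could be set up on the GPU (my illustration, not code from the question). It assumes a reasonably recent PyTorch where torch.linalg is available, uses float32 (consumer GPUs like the 1050 Ti have very low float64 throughput), solves the linear system (I - Q) F = R instead of forming an explicit inverse, and synchronizes before reading the result. The size n and the random Q, R are placeholders:
import numpy as np
import torch

dev = torch.device('cuda')

n = 4000                                   # placeholder size; Q and R come from the real problem
Q = (np.random.rand(n, n) * 1e-4).astype(np.float32)
R = np.random.rand(n, n).astype(np.float32)

tensorQ = torch.from_numpy(Q).to(dev)
tensorR = torch.from_numpy(R).to(dev)

# Build I - Q directly on the GPU and solve (I - Q) F = R instead of inverting.
sub = torch.eye(n, dtype=torch.float32, device=dev) - tensorQ
tensorF = torch.linalg.solve(sub, tensorR)

torch.cuda.synchronize()                   # GPU kernels run asynchronously; wait before timing or reading
F = tensorF.cpu().numpy()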

Related

Understanding shared memory use for improvement in Numba

I'm trying to learn more about using shared memory to improve performance of some CUDA kernels in Numba. For this I was looking at the matrix multiplication example in the Numba documentation and tried to implement it to see the gain.
This is my test implementation. I'm aware that the example in the documentation has some issues that I followed from Here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

size = 1024 * 4
tpbx, tpby = 16, 16
tpb = (tpbx, tpby)
bpgx, bpgy = int(size / tpbx), int(size / tpby)
bpg = (bpgx, bpgy)

a_in = cuda.to_device(np.arange(size * size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size * size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)

s = timer()
cuda.synchronize()
matmul[bpg, tpb](a_in, b_in, c_out1)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)

c_host1 = c_out1.copy_to_host()
print(c_host1)

s = timer()
cuda.synchronize()
fast_matmul[bpg, tpb](a_in, b_in, c_out2)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)

c_host2 = c_out2.copy_to_host()
print(c_host2)
The execution times of the above kernels are essentially the same; in fact, matmul was faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.
Thanks,
Bruno.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.
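Putting the two changes together, the non-shared kernel would look roughly like this (my restatement of the fixes described above, not verbatim code from that other answer):
from numba import cuda, float32

@cuda.jit
def matmul(A, B, C):
    """Square matrix multiplication C = A * B with both fixes applied."""
    j, i = cuda.grid(2)           # indices reversed, as described above, for this scenario only
    if i < C.shape[0] and j < C.shape[1]:
        tmp = float32(0.)         # 32-bit accumulator, so nothing gets promoted to float64
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp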

slow quadratic constraint creation in Pyomo

I am trying to construct a large-scale quadratic constraint in Pyomo as follows:
import itertools
import pyomo as pyo
from pyomo.environ import *
scale = 5000
pyo.n = Set(initialize=range(scale))
pyo.x = Var(pyo.n, bounds=(-1.0,1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q_values = dict(zip(list(itertools.product(range(0,scale), range(0,scale))), Q.flatten()))
pyo.Q = Param(pyo.n, pyo.n, initialize=Q_values)
pyo.xQx = Constraint( expr=sum( pyo.x[i]*pyo.Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
It turns out the last line is unbearably slow at this problem scale. I tried several things mentioned in PyPSA, Performance of creating Pyomo constraints and pyomo seems very slow to write models, but no luck.
Any suggestions? (Once the model was constructed, solving with Ipopt was also slow, but I guess that's independent of Pyomo.)
PS: constructing the quadratic constraint directly as follows didn't help either (also unbearably slow):
pyo.xQx = Constraint( expr=sum( pyo.x[i]*Q[i,j]*pyo.x[j] for i in pyo.n for j in pyo.n ) <= 1.0 )
You can get a small speed-up by using quicksum in place of sum. To measure the performance of the last line, I modified your code a little bit, as shown:
import itertools
from pyomo.environ import *
import time
import numpy as np
scale = 5000
m = ConcreteModel()
m.n = Set(initialize=range(scale))
m.x = Var(m.n, bounds=(-1.0, 1.0))
# Q is a n-by-n matrix in numpy array format, where n equals <scale>
Q = np.ones([scale, scale])
Q_values = dict(
    zip(list(itertools.product(range(scale), range(scale))), Q.flatten()))
m.Q = Param(m.n, m.n, initialize=Q_values)
t = time.time()
m.xQx = Constraint(expr=sum(m.x[i]*m.Q[i, j]*m.x[j]
                            for i in m.n for j in m.n) <= 1.0)
print("Time to make QuadCon = {}".format(time.time() - t))
The time I measured with sum was around 174.4 seconds; with quicksum I got 163.3 seconds.
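For reference, the quicksum variant is the same expression with sum swapped out; quicksum is already available through from pyomo.environ import *:
m.xQx = Constraint(expr=quicksum(m.x[i] * m.Q[i, j] * m.x[j]
                                 for i in m.n for j in m.n) <= 1.0)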
Not satisfied with such a modest gain, I tried to re-formulate the problem as an SOCP. If you can factorize Q as Q = (F^T)F (e.g. a Cholesky factor), then x^T Q x = ||Fx||^2, so the constraint can be expressed as a quadratic cone, as shown below:
import itertools
import time

import pyomo.kernel as pmo
from pyomo.environ import *
import numpy as np

scale = 5000
m = pmo.block()
m.n = np.arange(scale)
m.x = pmo.variable_list()
for j in m.n:
    m.x.append(pmo.variable(lb=-1.0, ub=1.0))

# Q = (F^T)F factors (e.g. Cholesky factor)
_F = np.ones([scale, scale])

t = time.time()
F = pmo.parameter_list()
for f in _F:
    _row = pmo.parameter_list(pmo.parameter(_e) for _e in f)
    F.append(_row)
print("Time taken to make parameter F = {}".format(time.time() - t))

t1 = time.time()
x_expr = pmo.expression_tuple(pmo.expression(
    expr=sum_product(f, m.x, index=m.n)) for f in F)
print("Time for evaluating Fx = {}".format(time.time() - t1))

t2 = time.time()
m.xQx = pmo.conic.quadratic.as_domain(1, x_expr)
print("Time for quad constr = {}".format(time.time() - t2))
Running on the same machine, I observed a time of around 112 seconds in the preparation of the expression that gets passed to the cone. Actually preparing the cone takes very little time (0.031 s).
Naturally, the only solver that can handle Conic constraints in pyomo is MOSEK. A recent update to the Pyomo-MOSEK interface has also shown promising speed-ups.
You can try MOSEK for free by getting yourself a MOSEK trial license. If you want to read more about Conic reformulations, a quick and thorough guide can be found in the MOSEK modelling cookbook. Lastly, if you are affiliated with an academic institution, then we can offer you a personal/institutional academic license. Hope you find this useful.

How can I accelerate a sparse matrix by dense vector product, currently implemented via scipy.sparse.csc_matrix.dot, using CUDA?

My ultimate goal is to accelerate the computation of a matrix-vector product in Python, potentially by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~ 0.05), and the vector x is 15k elements and dense, and I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.
My current non-GPU “optimization” is to represent A as a scipy.sparse.csc_matrix object, and then simply computing A.dot(x), but I was hoping to speed this up on a VM with a couple NVIDIA GPUs attached, and using only Python if possible (i.e. not writing out the detailed kernel functions by hand). I’ve succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, or scikit-cuda, or anaconda’s accelerate package, but there’s not a ton of information so it’s hard to know where to begin.
I don’t need greatly detailed instructions, but if anyone has solved this before and could provide a “big picture” roadmap for the simplest way of doing this, or has an idea of the sort of speed up a sparse GPU-based matrix-vector product would have over scipy’s sparse algorithms, that would be very helpful.
Another alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is nice) and, for me at least, it turned out to be much easier to install than pycuda.
The new code would look something like this:
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu

A = some_sparse_matrix  # (scipy.sparse.csr_matrix)
x = some_dense_vector   # (numpy.ndarray)

A_gpu = csr_gpu(A)   # moving A to the gpu
x_gpu = cp.array(x)  # moving x to the gpu

for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu)  # back to numpy object for fast indexing
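One thing to keep in mind when benchmarking this kind of loop (my note, not part of the original answer): CuPy launches kernels asynchronously, so a fair measurement needs a warm-up call and an explicit device synchronization. A self-contained sketch with synthetic data:
import time
import numpy as np
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu
from scipy.sparse import random as sparse_random

A_gpu = csr_gpu(sparse_random(15000, 15000, density=0.05, format='csr'))
x_gpu = cp.array(np.random.rand(15000))
niter = 100

y_gpu = A_gpu.dot(x_gpu)          # warm-up: may include kernel compilation / memory-pool growth
cp.cuda.Device().synchronize()

tic = time.time()
for _ in range(niter):
    y_gpu = A_gpu.dot(x_gpu)
cp.cuda.Device().synchronize()    # wait for the GPU to finish before stopping the clock
toc = time.time()
print('spmv took {} s per iteration'.format((toc - tic) / niter))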
UPDATE: CuPy now supports AMD GPUs as well through their ROCm stack, so that's an added bonus.
As pointed out in the comments, NVIDIA ships the cuSPARSE library, which includes functions for sparse matrix products with dense vectors.
Numba now has Python bindings for the cuSparse library via the pyculib package.
Thanks for the suggestions.
I managed to get pyculib’s csrmm (matrix multiplication for compressed sparse row formatted matrices) operation to work using the following (using 2 NVIDIA K80 GPUs on Google Cloud Platform), but unfortunately wasn’t able to achieve a speedup.
I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn't figure out any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:
import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time

def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = b.shape[0]

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic

# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(np.float32)

sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse
c, t = spmv_cuda(a_sparse, b, sp, COUNT)
print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))
which yields the output:
Constructing objects...
scipy sparse matrix multiplication took 0.05158638954162598 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Testing pyculib sparse matrix multiplication...
pyculib sparse matrix multiplication took 0.12598299980163574 seconds
c = [ 122.29483032 127.83659363 128.75003052 130.6912384 124.98326111]
As you can see, pyculib is more than twice as slow, even though the matrix multiplication runs on the GPU. Again, this is probably because of the overhead involved in transferring data to/from the GPU at each iteration.
An alternative solution I found, however, was to use Andreas Kloeckner’s pycuda library, which yielded a 50x speed up!
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time

def spmv_cuda(a_sparse, b, count):
    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')
    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)
    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')
    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()

    return d_c.get(), toc - tic

# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()
print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))
which yields this output:
numpy dense matrix multiplication took 0.2290663719177246 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
scipy sparse matrix multiplication took 0.24468040466308594 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
moving objects to GPU...
executing spmv operation...
pycuda sparse matrix multiplication took 0.004545450210571289 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Note 1: pycuda requires the following dependencies:
pymetis: install using: pip install pymetis
nvcc: install using: sudo apt install nvidia-cuda-toolkit
Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you’ll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.
Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out-of-the-box.
After installing CUDA and tensorflow-gpu (a couple of involved but straightforward tutorials are here and here), you can use tensorflow's SparseTensor class and sparse_tensor_dense_matmul function as follows:
import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time
Make sure GPU is detected:
gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n {}'.format(gpus))
Output:
GPU DEVICES:
['/device:GPU:0']
Benchmarks:
from scipy.sparse import csr_matrix
ITERS = 30
N = 20000
P = 0.1 # matrix density
Using scipy:
np.random.seed(0)
a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N)
tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()
elapsed = toc - tic
print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
Scipy spmv product took 0.06693172454833984 seconds per iteration.
Using GPU-enabled tensorflow:
with tf.device('/device:GPU:0'):
    np.random.seed(0)
    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0
    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape
    a_sparse = tf.SparseTensor(indices, values, dense_shape)
    b = tf.constant(np.random.rand(N, 1))
    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()

elapsed = toc - tic
print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
GPU spmv product took 0.0011811971664428711 seconds per iteration.
Quite a nice speed-up, it turns out.
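In newer TensorFlow releases (2.x, eager execution) the same idea looks roughly like the sketch below; tf.sparse.sparse_dense_matmul is the current name for the function used above. This is my adaptation under that assumption, not benchmarked:
import numpy as np
import tensorflow as tf

N, P = 20000, 0.1                     # same sizes as the benchmark above

np.random.seed(0)
a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0

indices = np.transpose(a_dense.nonzero())
values = a_dense[indices[:, 0], indices[:, 1]]

a_sparse = tf.SparseTensor(indices, values, a_dense.shape)
b = tf.constant(np.random.rand(N, 1))

# With eager execution this call runs immediately (on the GPU if one is visible).
c = tf.sparse.sparse_dense_matmul(a_sparse, b)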

Can my numba code be faster than numpy

I am new to Numba and am trying to speed up some calculations that have proved too unwieldy for numpy. The example below compares a vectorized/numpy version of a function containing a subset of my calculations with a Numba version; the latter was also tested as pure Python by commenting out the @autojit decorator.
I find that the numba and numpy versions give similar speed-ups relative to pure python, both of which are about a factor of 10 improvement.
The numpy version was actually slightly faster than my numba function, but because of the 4D nature of this calculation I quickly run out of memory when the arrays in the numpy version are sized much larger than in this toy example.
This speed up is nice but I have often seen speed ups of >100x on the web when moving from pure python to numba.
I would like to know if there is a general expected speed increase when moving to numba in nopython mode. I would also like to know if there are any components of my numba-ized function that would be limiting further speed increases.
import numpy as np
from timeit import default_timer as timer
from numba import autojit
import math

def vecRadCalcs(slope, skyz, solz, skya, sola):
    nloc = len(slope)
    ntime = len(solz)
    [lenz, lena] = skyz.shape
    asolz = np.tile(np.reshape(solz, [ntime, 1, 1, 1]), [1, nloc, lenz, lena])
    asola = np.tile(np.reshape(sola, [ntime, 1, 1, 1]), [1, nloc, lenz, lena])
    askyz = np.tile(np.reshape(skyz, [1, 1, lenz, lena]), [ntime, nloc, 1, 1])
    askya = np.tile(np.reshape(skya, [1, 1, lenz, lena]), [ntime, nloc, 1, 1])
    phi1 = np.cos(asolz) * np.cos(askyz)
    phi2 = np.sin(asolz) * np.sin(askyz) * np.cos(askya - asola)
    phi12 = phi1 + phi2
    phi12[phi12 > 1.0] = 1.0
    phi = np.arccos(phi12)
    return phi

@autojit
def RadCalcs(slope, skyz, solz, skya, sola, phi):
    nloc = len(slope)
    ntime = len(solz)
    pop = 0.0
    [lenz, lena] = skyz.shape
    for iiT in range(ntime):
        asolz = solz[iiT]
        asola = sola[iiT]
        for iL in range(nloc):
            for iz in range(lenz):
                for ia in range(lena):
                    askyz = skyz[iz, ia]
                    askya = skya[iz, ia]
                    phi1 = math.cos(asolz) * math.cos(askyz)
                    phi2 = math.sin(asolz) * math.sin(askyz) * math.cos(askya - asola)
                    phi12 = phi1 + phi2
                    if phi12 > 1.0:
                        phi12 = 1.0
                    phi[iz, ia] = math.acos(phi12)
                    pop = pop + 1
    return pop

zenith_cells = 90
azim_cells = 360
nloc = 10   # nominally ~ 700
ntim = 10   # nominally ~ 200000
slope = np.random.rand(nloc) * 10.0
solz = np.random.rand(ntim) * np.pi / 2.0
sola = np.random.rand(ntim) * 1.0 * np.pi
base = np.ones([zenith_cells, azim_cells])
skya = np.deg2rad(np.cumsum(base, axis=1))
skyz = np.deg2rad(np.cumsum(base, axis=0) * 90 / zenith_cells)
phi = np.zeros(skyz.shape)

start = timer()
outcalc = RadCalcs(slope, skyz, solz, skya, sola, phi)
stop = timer()
outcalc2 = vecRadCalcs(slope, skyz, solz, skya, sola)
stopvec = timer()
print(outcalc)
print(stop - start)
print(stopvec - stop)
On my machine running numba 0.31.0, the Numba version is 2x faster than the vectorized solution. When timing numba functions, you need to run the function more than one time because the first time you're seeing the time of jitting the code + the run time. Subsequent runs will not include the overhead of jitting the functions time since Numba caches the jitted code in memory.
Also, please note that your functions are not calculating the same thing -- you want to be careful that you're comparing the same things using something like np.allclose on the results.
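Concretely, the timing section of the question's script could be adjusted along these lines (my sketch of the suggestion above, reusing the question's variables):
# First call: includes the JIT compilation time, so discard it.
RadCalcs(slope, skyz, solz, skya, sola, phi)

# Subsequent calls reuse the cached machine code, so this measures only execution.
start = timer()
outcalc = RadCalcs(slope, skyz, solz, skya, sola, phi)
stop = timer()
print(stop - start)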

Parameter allocation for MPI collectives

This is my first python MPI program, and I would really appreciate some help optimizing the code. Specifically, I have two questions regarding scattering and gathering, if anyone can help. This program is much slower than a traditional approach without MPI.
I am trying to scatter one array, do some work on the nodes which fills another set of arrays, and gather those. My questions are primarily in the setup and gather sections of code.
Is it necessary to allocate memory for all of these arrays (A, my_A, xset, yset, my_xset, my_yset) on every node? Some of them can get large.
Is an array the best structure to gather for the data I am using? When I scatter A, it is relatively small. However, xset and yset can get very large (over a million elements at least).
Here is the code:
#!usr/bin/env python

#Libraries
import numpy as py
import matplotlib.pyplot as plt
from mpi4py import MPI

comm = MPI.COMM_WORLD
print "%d nodes running." % (comm.size)

#Variables
cmin = 0.0
cmax = 4.0
cstep = 0.005
run = 300
disc = 100

#Setup
if comm.rank == 0:
    A = py.arange(cmin, cmax + cstep, cstep)
else:
    A = py.arange((cmax - cmin) / cstep, dtype=py.float64)

my_A = py.empty(len(A) / comm.size, dtype=py.float64)
xset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
yset = py.empty(len(A) * (run - disc) * comm.size, dtype=py.float64)
my_xset = py.empty(0, dtype=py.float64)
my_yset = py.empty(0, dtype=py.float64)

#Scatter
comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])

#Work
for i in my_A:
    x = 0.5
    for j in range(0, run):
        x = i * x * (1 - x)
        if j >= disc:
            my_xset = py.append(my_xset, i)
            my_yset = py.append(my_yset, x)

#Gather
comm.Allgather([my_xset, MPI.DOUBLE], [xset, MPI.DOUBLE])
comm.Allgather([my_yset, MPI.DOUBLE], [yset, MPI.DOUBLE])

#Export Results
if comm.rank == 0:
    f = open('points.3d', 'w+')
    for k in range(0, len(xset) - 1):
        f.write('(' + str(round(xset[k], 2)) + ',' + str(round(yset[k], 2)) + ',0)\n')
    f.close()
You do not need to allocate A on the non-root processes. If you were using a simple Gather instead of Allgather, you could also omit xset and yset on the non-root processes. Basically, you only need to allocate the buffers that are actually used by the collectives; the other parameters are only significant on root.
Yes, a numpy array is an appropriate data structure for such large arrays. For small data where it is not performance-critical, it can be more convenient and pythonic to use the all-lowercase methods and communicate with Python objects (lists etc).
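As a rough illustration of the first point, here is a minimal sketch (not the poster's program) of the Gather pattern where only root allocates the receive buffer; local_n stands in for the per-rank result size and is assumed equal on every rank:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_n = 1000                                   # hypothetical per-rank result size
my_xset = np.empty(local_n, dtype=np.float64)
# ... fill my_xset with this rank's results ...

if comm.rank == 0:
    xset = np.empty(local_n * comm.size, dtype=np.float64)   # receive buffer only on root
    recvbuf = [xset, MPI.DOUBLE]
else:
    recvbuf = None                                           # non-root ranks pass no buffer

comm.Gather([my_xset, MPI.DOUBLE], recvbuf, root=0)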
