Parallelizing a simple matrix assignment in Python with MPI - python

I have a rather simple parallelization question that I can't seem to work out. I had parallelized a simple matrix assignment using joblib in Python which worked nicely on my workstation, but now I need to run the code on a HPC and the as-is code is not playing nicely with MPI. A skeleton of the code is below (I have stripped out a lot of not-relevant computation). Basically I have a large matrix that I want to fill in and at each point the value is a sum over many energies and eigenvalues, so this is the 'slow step' of the calculation. When I run this on my workstation I just parallelize that fill in using Parallel and delayed from joblib but of course when I run this on the cluster using mpirun --bind-to none -n 16 python KZ_spectral_function.py | tee spectral.out for example the code runs basically in serial (although with some odd behavior).
So, what I think I need to do is to convert that joblib line over to mpi4py, include an if rank == 0: statement encompassing everything in the main function, and just modify the the contents of gen_spec_func() and divvy up the calls to spec_func() to the different cores. This is the part where I am stuck, as all the examples I have read that were simple enough for me to understand use some variation of COMMS.scatter() and then append the results to a list, as far as I can tell in a random order, and I don't know quite enough to adapt them to something where I want the results to go in a specific place in the matrix. Any help or advice would be greatly appreciated, as neither parallelization nor python are strengths of mine ...
Code Snippet (simplified):
import numpy as np
import numpy.linalg as lina
import time
from functools import partial
from joblib import Parallel, delayed
# Helper Functions
def get_eigenvals(k_cart,cellMap,Hwannier,G):
## [....] some linear algebra, not important
return Ek
def gen_spec_func(eigenvals,Nkpts,Energies,Sigma):
## This is really the only part that I care to parallelize
## This is the joblib version
num_cores=16
tempfunc = partial(spec_func,Energies,Sigma,eigenvals)
spectral = np.reshape(Parallel(n_jobs=num_cores)(delayed(tempfunc)(i,j) for j in range(0,Nkpts) for i in range(0,len(Energies))),(Nkpts,len(Energies))).T
return np.matrix(spectral)
def spec_func(Energies,Sigma,eigenvals,i,j) :
return sum([(1.0/( (Energies[i]-val)**2 + (Sigma)**2 )) for val in eigenvals[j,:]])
#--- Start of main script
Tstart = time.time()
# [...] Declare Constants & Parameters
# [...] Read data from disk
# [...] Some calculations on that data we want done in serial
Energies = [Emin + (Emax-Emin)*i/(Nenergies-1) for i in range(0,Nenergies)]
kzs = [kzi+(kzf-kzi)*l/(nkzs-1) for l in range(0,nkzs)]
DomainAvg = np.matrix([[0.0 for j in range(0,Nkpts)] for i in range(0,Nenergies)])
for kz in kzs:
## An outer loop over Kz
print("Starting loop for kz = ",kz)
# Generate the base k-grid (symmetric) in 1/A for convenience
# [...] Generate the appropriate kpoint grid
for angle in range(0,Nangles):
## Inner loop over rotation angles
#--- For each angle generate the kpoint grid for that domain
# [...] Calculate some eigenvalues, small matrices not a big deal, serial fine
#--- Now we Generate the spectral function for that grid
### Ok, this is the slow part that we want to parallelize
DomainAvg += gen_spec_func(eigenvals,Nkpts,Energies,Sigma)
if(angle%20 == 0):
Tend = time.time()
print("Completed iteration ", angle, "of ", Nangles, " at T = ", Tend-Tstart)
# Output the results (one file for each Kz)
DomainAvg = DomainAvg/Nangles
outfile = "Spectral"+str(kz)+".txt"
np.savetxt(outfile, DomainAvg)
# And we are done
Tend = time.time()
print("Total execution time was :", Tend-Tstart)
EDIT: A very very hack solution I came up with was to encode the matrix indices in the matrix itself as floats, then use scatter() and gather() to distribute the matrix, replace the value with the calculation output, and reassemble the matrix. This is of course not a good idea since it requires int<->float conversion but it was the only way I could come up with that didn't require rebuilding the entire matrix from the gathered data index by index (instead just using hstack() and reshape() to put it together). I feel like there must be some tool I am missing that assists in distributed calculation for arrays/matrices where the index matters so I would still be interested if someone has a tip/pointer in this regard.
Minimum Working Example:
import numpy.linalg as lina
import time
import math
from functools import partial
from mpi4py import MPI
#-- Standard Comms
COMM = MPI.COMM_WORLD
size = COMM.Get_size()
rank = COMM.Get_rank()
Nkpts = 3
Energies = [1.00031415926*i for i in range(0,11)]
#--- Now we Generate the spectral function for that grid
# This will be done in parallel using scatter/gather in MPI
if rank == 0:
# List that we will scatter to the different nodes
# Encode the matrix index from whcn each element came as a float
datalist = [float(j+i*Nkpts) for j in range(0,Nkpts) for i in range(0,len(Energies))]
data = np.array_split(datalist, COMM.Get_size())
else:
data = None
# Distribute to the different nodes
data = COMM.scatter(data, root=0)
print("I am processor ",rank," and my data is",data)
for index in range(0,len(data)) :
# Decode the indicies
j = data[index]%Nkpts
i = math.floor(data[index]/Nkpts)
data[index] = 100.100*j+Energies[i]
COMM.Barrier()
dataMPI = COMM.gather(data,root=0)
if(rank==0) :
spectral = np.reshape(np.hstack(dataMPI),(Nkpts,len(Energies))).T
spectral_func = np.matrix(spectral)
print(spectral_func)```

Related

Can Dask automatically create a tree to parallelize a computation and reduce the copies between workers?

I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(matrices):
return reduce(lambda a, b: a + b, matrices)
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# Here's the Big Sum
matrices = calc_sum(matrices)
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only matrix ganeration is)
Note that none of these issues are Dask's fault, it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(a, b):
return a + b
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# This tells us the depth of the calculation portion
# of the tree we are constructing in the next step
depth = int(math.log(num_matrices, 2))
# This is the code I don't want to have to manually write
for _ in range(depth):
matrices = [
calc_sum(matrices[i], matrices[i+1])
for i in range(0, len(matrices), 2)
]
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that will generate tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.
Post my attempts, I would want to conclude that:
Using cufft library from skcuda is a bit tricky and to get to the correct FFT output might take a long time, in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda)
Using CuPy and arranging your data in a format so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. For my case, the order was a little better than 10!
Using CuPy for FFTs is a great option if one wants to stick to Py-based development only. Also the to and fro from C to Python when writing C GPU kernels is an added overhead which is very conveniently resolved with CuPy. Though CuPy itself calls laying out the plan and calling the FFT exec engine internally.

I receive a MemError when I interpolate data with generator expressions and iterators from a regular grid on a mesh >2000 times

this is my first question here at stackoverflow, because I started scripting with Python3.
Application
I made a Python3 script that writes the load definition of a moveable heat source for a finite element simulation in LS-Dyna. As source I have a discretized 3D heat generation rate density (W/cm^3) field, the coordinates defining the finite element mesh and the position of the heat field center over time.
As output I get a time dependent heating power sorted after the element number for each finite element. This works already for reasonable dimensions (200000 finite Elements, 3000 locations of the heat field, 400000 data points in the heat field).
Problem
For larger finite element meshes (4 000 000 Elements), I run out of memory (60GB RAM, python3 64Bit). To illustrate the problem further I prepared a minimal example which runs on its own. It generates some artificial test data, a finite element mesh how I use it (in reality it is not a regular grid) and an iterator for new locations for the heat application.
import numpy as np
import math
from scipy.interpolate import RegularGridInterpolator
def main():
dataCoordinateAxes,dataArray = makeTestData()
meshInformationArray = makeSampleMesh()
coordinates = makeSampleCoordinates()
interpolateOnMesh(dataCoordinateAxes,dataArray,meshInformationArray,coordinates)
def makeTestData():
x = np.linspace(-0.02,0.02,300)
y = np.linspace(-0.02,0.02,300)
z = np.linspace(-0.005,0.005,4)
data = f(*np.meshgrid(x,y,z,indexing='ij',sparse=True))
return (x,y,z),data
def f(x,y,z):
scaling = 1E18
sigmaXY = 0.01
muXY = 0
sigmaZ = 0.5
muZ = 0.005
return weight(x,1E-4,muXY,sigmaXY)*weight(y,1E-4,muXY,sigmaXY)*weight(z,0.1,muZ,sigmaZ)*scaling
def weight(x,dx,mu,sigma):
result = np.multiply(np.divide(np.exp(np.divide(np.square(np.subtract(x,mu)),(-2*sigma**2))),math.sqrt(2*math.pi*sigma**2.)),dx)
return result
def makeSampleMesh():
meshInformation = []
for x in np.linspace(-0.3,0.3,450):
for y in np.linspace(-0.3,0.3,450):
for z in np.linspace(-0.005,0.005,5):
meshInformation.append([x,y,z])
return np.array(meshInformation)
def makeSampleCoordinates():
x = np.linspace(-0.2,0.2,500)
y = np.sqrt(np.subtract(0.2**2,np.square(x)))
return (np.array([element[0],element[1],0])for element in zip(x,y))
The interpolation is then done in this function. I removed everything in the for loop to isolate the problem. In reality I export the load curve to a file in a specific format.
def interpolateOnMesh(dataCoordinateAxes,dataArray,meshInformationArray,coordinates):
interpolationFunction = RegularGridInterpolator(dataCoordinateAxes, dataArray, bounds_error=False, fill_value=None)
for finiteElementNumber, heatGenerationCurve in enumerate(iterateOverFiniteElements(meshInformationArray, coordinates, interpolationFunction)):
pass
return
def iterateOverFiniteElements(meshInformationArray, coordinates, interpolationFunction):
meshDataIterator = (np.nditer(interpolationFunction(np.subtract(meshInformationArray,coordinateSystem))) for coordinateSystem in coordinates)
for heatGenerationCurve in zip(*meshDataIterator):
yield heatGenerationCurve
if __name__ == '__main__':
main()
To identify the problem, I tracked the memory consumption over time.
Memory Consumption over Time
It seems the iteration over the result arrays consumes a considerable amount of memory.
Question
Is there a less memory consuming way to iterate over the datapoints without loosing too much performance? If not, I guess I will slice the mesh array in chunks and interpolate on those one by one.
So far the only solution I found was to cut the meshInformationArray.
Here the modified main() function:
def main():
dataCoordinateAxes,dataArray = makeTestData()
meshInformationArray = makeSampleMesh()
coordinates = makeSampleCoordinates()
sections = int(meshInformationArray.shape[0] / 100000)
if sections == 0: sections = 1
for array in iter(np.array_split(meshInformationArray, sections, axis=0)):
interpolateOnMesh(dataCoordinateAxes,dataArray,array,coordinates)

Numba slower than pure Python in frequency counting

Given a data matrix with discrete entries represented as a 2D numpy array, I'm trying to compute the observed frequencies of some features (the columns) only looking at some instances (the rows of the matrix).
I can do that quite easily with numpy using bincount applied to each slice after having done some fancy slicing. Doing that in pure Python, using an external data structure as a count accumulator, is a double loop in C-style.
import numpy
import numba
try:
from time import perf_counter
except:
from time import time
perf_counter = time
def estimate_counts_numpy(data,
instance_ids,
feature_ids):
"""
WRITEME
"""
#
# slicing the data array (probably memory consuming)
curr_data_slice = data[instance_ids, :][:, feature_ids]
estimated_counts = []
for feature_slice in curr_data_slice.T:
counts = numpy.bincount(feature_slice)
#
# checking just for the all 0 case:
# this is not stable for not binary datasets TODO: fix it
if counts.shape[0] < 2:
counts = numpy.append(counts, [0], 0)
estimated_counts.append(counts)
return estimated_counts
#numba.jit(numba.types.int32[:, :](numba.types.int8[:, :],
numba.types.int32[:],
numba.types.int32[:],
numba.types.int32[:],
numba.types.int32[:, :]))
def estimate_counts_numba(data,
instance_ids,
feature_ids,
feature_vals,
estimated_counts):
"""
WRITEME
"""
#
# actual counting
for i, feature_id in enumerate(feature_ids):
for instance_id in instance_ids:
estimated_counts[i][data[instance_id, feature_id]] += 1
return estimated_counts
if __name__ == '__main__':
#
# creating a large synthetic matrix, testing for performance
rand_gen = numpy.random.RandomState(1337)
n_instances = 2000
n_features = 2000
large_matrix = rand_gen.binomial(1, 0.5, (n_instances, n_features))
#
# random indexes too
n_sample = 1000
rand_instance_ids = rand_gen.choice(n_instances, n_sample, replace=False)
rand_feature_ids = rand_gen.choice(n_features, n_sample, replace=False)
binary_feature_vals = [2 for i in range(n_features)]
#
# testing
numpy_start_t = perf_counter()
e_counts_numpy = estimate_counts_numpy(large_matrix,
rand_instance_ids,
rand_feature_ids)
numpy_end_t = perf_counter()
print('numpy done in {0} secs'.format(numpy_end_t - numpy_start_t))
binary_feature_vals = numpy.array(binary_feature_vals)
#
#
curr_feature_vals = binary_feature_vals[rand_feature_ids]
#
# creating a data structure to hold the slices
# (with numba I cannot use list comprehension?)
# e_counts_numba = [[0 for val in range(feature_val)]
# for feature_val in
# curr_feature_vals]
e_counts_numba = numpy.zeros((n_sample, 2), dtype='int32')
numba_start_t = perf_counter()
estimate_counts_numba(large_matrix,
rand_instance_ids,
rand_feature_ids,
binary_feature_vals,
e_counts_numba)
numba_end_t = perf_counter()
print('numba done in {0} secs'.format(numba_end_t - numba_start_t))
These are the times I get while running the above code:
numpy done in 0.2863295429997379 secs
numba done in 11.55551904299864 secs
My point here is that my implementation is even slower when I try to apply a jit with numba, so I highly suspect I am messing things up.
The reason your function is slow is because Numba has fallen back to object mode to compile the loop.
There are two problems:
Numba doesn't yet support chained indexing of multidimensional arrays, so you need to rewrite this:
estimated_counts[i][data[instance_id, feature_id]]
into this:
estimated_counts[i, data[instance_id, feature_id]]
Your explicit type signature is incorrect. All of your input arrays are actually int64, rather than int8/int32. Rather than fix your signature, you can rely on Numba's automatic JIT to detect the argument types and compile the right version. All you have to do is change the decorator to just #numba.jit. Just make sure you call the function once before you benchmark if you don't want to include compilation time.
With these changes, I benchmark Numba to be about 15% faster than NumPy for this function.

Why is there no speed-up when using pythons multiprocessing for embarassingly parallel problem within a for-loop, with shared numpy data?

I want to speed up an embarassingly parallel problem related to Bayesian Inference. The aim is to infer coefficents u for a set of images x, given a matrix A, such that X = A*U.
X has dimensions mxn, A mxp and U pxn. For each column of X, one has to infer the optimal corresponding column of the coefficients U. In the end, this information is used to update A. I use m = 3000, p = 1500 and n = 100.
So, as it is a linear model, the inference of the coefficient-matrix u consists of n independent calculations. Thus, I tried to work with the multiprocessing module of Python, but there is no speed up.
Here is what I did:
The main structure, without parallelization, is:
import numpy as np
from convex import Crwlasso_cd
S = np.empty((m, batch_size))
for t in xrange(start_iter, niter):
## Begin Warm Start ##
# Take 5 gradient steps w/ this batch using last coef. to warm start inf.
for ws in range(5):
# Initialize the coefficients
if ws:
theta = U
else:
theta = np.dot(A.T, X)
# Infer the Coefficients for the given data batch X of size mxn (n=batch_size)
# Crwlasso_cd is the function that does the inference per data sample
# It's basically a C-inline code
for k in range(batch_size):
U[:,k] = Crwlasso_cd(X[:, k].copy(), A, theta=theta[:,k].copy())
# Given the inferred coefficients, update and renormalize
# the basis functions A
dA1 = np.dot(X - np.dot(A, U), U.T) # Gaussian data likelihood
A += (eta / batch_size) * dA1
A = np.dot(A, np.diag(1/np.sqrt((A**2).sum(axis=0))))
Implementation of multiprocessing:
I tried to implement multiprocessing. I have an 8-core machine that I can use.
There are 3 for-loops. The only one that seems to be "parallelizable" is the third one, where the coefficients are inferred:
Generate a Queue and stack the iteration-numbers from 0 to batch_size-1 into the Queue
Generate 8 processes, and let them work through the Queue
Share the data U using multiprocessing.Array
So, I replaced this third loop with the following:
from multiprocessing import Process, Queue
import multiprocessing as mp
from Queue import Empty
num_cpu = mp.cpu_count()
work_queue = Queue()
# Generate the empty ndarray U and a multiprocessing.Array-Wrapper U_mp around U
# The class Wrap_mp is attached. Basically, U_mp.asarray() gives the corresponding
# ndarray
U = np.empty((p, batch_size))
U_mp = Wrap_mp(U)
...
# Within the for-loops:
for p in xrange(batch_size):
work_queue.put(p)
processes = [Process(target=infer_coefficients_mp, args=(work_queue,U_mp,A,X)) for p in range(num_cpu)]
for p in processes:
p.start()
print p.pid
for p in processes:
p.join()
Here is the class Wrap_mp:
class Wrap_mp(object):
""" Wrapper around multiprocessing.Array to share an array across
processes. Store the array as a multiprocessing.Array, but compute with it
as a numpy.ndarray
"""
def __init__(self, arr):
""" Initialize a shared array from a numpy array.
The data is copied.
"""
self.data = ndarray_to_shmem(arr)
self.dtype = arr.dtype
self.shape = arr.shape
def __array__(self):
""" Implement the array protocole.
"""
arr = shmem_as_ndarray(self.data, dtype=self.dtype)
arr.shape = self.shape
return arr
def asarray(self):
return self.__array__()
And here is the function infer_coefficients_mp:
def infer_feature_coefficients_mp(work_queue,U_mp,A,X):
while True:
try:
index = work_queue.get(block=False)
x = X[:,index]
U = U_mp.asarray()
theta = np.dot(phit,x)
# Infer the coefficients of the column index
U[:,index] = Crwlasso_cd(x.copy(), A, theta=theta.copy())
except Empty:
break
The problem now are the following:
The multiprocessing version is not faster than the single thread version for the given dimensions of the data.
The process ID's increase with every iteration. Does this mean that there is constantly a new process generated? Doesn't this generate a huge overhead? How can I avoid that? Is there a possibility of creating within the whole for-loop 8 different processes and just update them with the data?
Does the way I share the coefficients U amongst the processes slow the calculation down? Is there another, better way of doing this?
Would a Pool of processes be better?
I am really thankful for any sort of help! I have started working with Python a month ago, and am pretty lost now.
Engin
Every time you create a Process you are creating a new process. If you're doing that within your for loop, then yes, you are starting new processes every time through the loop. It sounds like what you want to do is initialize your Queue and Processes outside of the loop, then fill the Queue inside the loop.
I've used multiprocessing.Pool before, and it's useful, but it doesn't offer much over what you've already implemented with a Queue.
Eventually, this all boils down to one question: Is it possible to start processes outside of the main for-loop, and for every iteration, feed the updated variables in them, have them processing the data, and collecting the newly calculated data from all of the processes, without having to start new processes every iteration?

Categories

Resources