I'm learning parallel computing through mpi4py. Since I deal with a large dataset, I need to preallocate the memory at the master process to avoid memory issues. That's why I use the Scatterv and Gatherv methods. The code below only allocates the memory, without doing any specific operation.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

if rank == 0:
    sendbuf = np.random.rand(4, 3)
    r, c = sendbuf.shape
    ave, res = divmod(c, nprocs)
    count = [ave + 1 if p < res else ave for p in range(nprocs)]
    count = np.array(count)
    print("count is ", count)
    # displacement: the starting index of each sub-task
    displ = [sum(count[:p]) for p in range(nprocs)]
    displ = np.array(displ)
else:
    sendbuf = None
    # initialize count on worker processes
    count = np.zeros(nprocs, dtype=np.int64)
    displ = None

# broadcast count
comm.Bcast(count, root=0)

# initialize recvbuf on all processes
recvbuf = np.zeros((4, count[rank]))
comm.Scatterv([sendbuf, count, displ, MPI.DOUBLE], recvbuf, root=0)

a, b = recvbuf.shape
sendbuf2 = np.random.rand(a, b)
recvbuf2 = np.zeros((4, sum(count)))
comm.Gatherv(sendbuf2, [recvbuf2, count, displ, MPI.DOUBLE], root=0)
In the master process, I first define a random 2D array (sendbuf) of dimensions (4,3). What I want to do is to scatter this matrix to the different processes by dividing it into columns (so preserving the number of rows). Then I initialize the recvbuf variable in order to receive the chunks of sendbuf, and I use the Scatterv method to pass the information. I noticed that only the data in the first row are passed correctly. This is not really important, since in the real application the recvbuf variable is used only to pre-allocate the memory. At this point I redefine the recvbuf variable, and then I try to send the information back to the master node, but the code gives an error. I don't really understand what I'm doing wrong in the Gatherv part.
I tried to keep the example as simple as possible, so the code doesn't do anything specific. What I want to learn is how to correctly scatter and gather a 2D numpy array.
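For reference, a minimal sketch (not from the original post) of one way to split a 2D array by columns with Scatterv/Gatherv: since NumPy stores arrays row-major, it works on the contiguous transpose and keeps counts and displacements in units of elements rather than columns.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

rows, cols = 4, 3
if rank == 0:
    sendbuf = np.random.rand(rows, cols)
    # work on the contiguous transpose, so each rank's columns form one
    # contiguous block of memory
    sendbuf_T = np.ascontiguousarray(sendbuf.T)
    ave, res = divmod(cols, nprocs)
    cols_per_proc = np.array([ave + 1 if p < res else ave for p in range(nprocs)])
    count = cols_per_proc * rows                       # counts in elements, not columns
    displ = np.insert(np.cumsum(count), 0, 0)[:-1]     # element offsets
else:
    sendbuf_T = None
    cols_per_proc = np.zeros(nprocs, dtype=np.int64)
    count = displ = None

comm.Bcast(cols_per_proc, root=0)

recvbuf_T = np.empty((cols_per_proc[rank], rows))      # my columns, transposed
comm.Scatterv([sendbuf_T, count, displ, MPI.DOUBLE] if rank == 0 else None,
              recvbuf_T, root=0)
local = recvbuf_T.T                                    # shape (rows, my_cols)

# send the local block back and reassemble the (rows, cols) array on the root
gatherbuf_T = None
if rank == 0:
    gatherbuf_T = np.empty((cols, rows))
comm.Gatherv(np.ascontiguousarray(local.T),
             [gatherbuf_T, count, displ, MPI.DOUBLE] if rank == 0 else None,
             root=0)
if rank == 0:
    gathered = gatherbuf_T.T                           # back to shape (rows, cols)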
What 'error' do you have exactly?
Here's a working version of Gatherv (shown with Allgatherv) for 2D arrays. Keep in mind that the counts and displacements are scaled by the row length, since NumPy stores arrays row-major in memory.
from mpi4py import MPI
import numpy as np

comm_world = MPI.COMM_WORLD
my_rank = comm_world.Get_rank()
num_proc = comm_world.Get_size()

# Parameters for this script
rowlength = 2
sizes = 2*np.ones((num_proc), dtype=np.int32)
sizes[-1] = 1

# Construct some data
data = [np.array((), dtype=np.double) for _ in range(num_proc)]
data[my_rank] = np.array(my_rank + np.random.rand(sizes[my_rank], rowlength), np.double)

# Compute sizes and offsets for Allgatherv (in units of array elements)
sizes_memory = rowlength*sizes
offsets = np.zeros(num_proc, dtype=np.int64)
offsets[1:] = np.cumsum(sizes_memory)[:-1]

if my_rank == 0:
    print(f"Total size {np.sum(sizes)}")
    print(f"Sizes: {sizes}")
    print(f"Sizes in memory: {sizes_memory}")
    print(f"Offsets: {offsets}")

# Prepare buffer for Allgatherv
data_out = np.empty((np.sum(sizes), rowlength), dtype=np.double)
comm_world.Allgatherv(
    data[my_rank],
    recvbuf=[data_out, sizes_memory.tolist(), offsets.tolist(), MPI.DOUBLE])

if (my_rank == 0):
    print(f"Data_out has shape {data_out.shape}")
    print(data_out[:, 0])
Linux OS, MPI: 'mpirun (Open MPI) 4.1.2', MPI4PY: 'mpi4py 3.1.4'
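If only the root rank needs the assembled array, the same counts and offsets should also work with Gatherv; a minimal sketch, not from the original answer, reusing comm_world, data, sizes, sizes_memory, offsets and rowlength from the code above:
# only the root allocates the receive buffer; the other ranks pass None
recvbuf = None
if my_rank == 0:
    recvbuf = np.empty((np.sum(sizes), rowlength), dtype=np.double)

comm_world.Gatherv(
    data[my_rank],
    [recvbuf, sizes_memory.tolist(), offsets.tolist(), MPI.DOUBLE] if my_rank == 0 else None,
    root=0)

if my_rank == 0:
    print(f"Gathered array has shape {recvbuf.shape}")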
I have a rather simple parallelization question that I can't seem to work out. I had parallelized a simple matrix assignment using joblib in Python, which worked nicely on my workstation, but now I need to run the code on an HPC cluster and the as-is code is not playing nicely with MPI. A skeleton of the code is below (I have stripped out a lot of non-relevant computation). Basically I have a large matrix that I want to fill in, and at each point the value is a sum over many energies and eigenvalues, so this is the 'slow step' of the calculation. When I run this on my workstation I just parallelize that fill-in using Parallel and delayed from joblib, but when I run it on the cluster using, for example, mpirun --bind-to none -n 16 python KZ_spectral_function.py | tee spectral.out, the code runs basically in serial (although with some odd behavior).
So, what I think I need to do is to convert that joblib line over to mpi4py, include an if rank == 0: statement encompassing everything in the main function, and just modify the contents of gen_spec_func() and divvy up the calls to spec_func() to the different cores. This is the part where I am stuck, as all the examples I have read that were simple enough for me to understand use some variation of COMM.scatter() and then append the results to a list, as far as I can tell in a random order, and I don't know quite enough to adapt them to something where I want the results to go in a specific place in the matrix. Any help or advice would be greatly appreciated, as neither parallelization nor Python are strengths of mine ...
Code Snippet (simplified):
import numpy as np
import numpy.linalg as lina
import time
from functools import partial
from joblib import Parallel, delayed
# Helper Functions
def get_eigenvals(k_cart, cellMap, Hwannier, G):
    ## [....] some linear algebra, not important
    return Ek

def gen_spec_func(eigenvals, Nkpts, Energies, Sigma):
    ## This is really the only part that I care to parallelize
    ## This is the joblib version
    num_cores = 16
    tempfunc = partial(spec_func, Energies, Sigma, eigenvals)
    spectral = np.reshape(Parallel(n_jobs=num_cores)(delayed(tempfunc)(i, j) for j in range(0, Nkpts) for i in range(0, len(Energies))), (Nkpts, len(Energies))).T
    return np.matrix(spectral)

def spec_func(Energies, Sigma, eigenvals, i, j):
    return sum([(1.0/((Energies[i]-val)**2 + (Sigma)**2)) for val in eigenvals[j, :]])

#--- Start of main script
Tstart = time.time()
# [...] Declare Constants & Parameters
# [...] Read data from disk
# [...] Some calculations on that data we want done in serial
Energies = [Emin + (Emax-Emin)*i/(Nenergies-1) for i in range(0, Nenergies)]
kzs = [kzi+(kzf-kzi)*l/(nkzs-1) for l in range(0, nkzs)]
DomainAvg = np.matrix([[0.0 for j in range(0, Nkpts)] for i in range(0, Nenergies)])

for kz in kzs:
    ## An outer loop over Kz
    print("Starting loop for kz = ", kz)
    # Generate the base k-grid (symmetric) in 1/A for convenience
    # [...] Generate the appropriate kpoint grid
    for angle in range(0, Nangles):
        ## Inner loop over rotation angles
        #--- For each angle generate the kpoint grid for that domain
        # [...] Calculate some eigenvalues, small matrices not a big deal, serial fine
        #--- Now we Generate the spectral function for that grid
        ### Ok, this is the slow part that we want to parallelize
        DomainAvg += gen_spec_func(eigenvals, Nkpts, Energies, Sigma)
        if (angle % 20 == 0):
            Tend = time.time()
            print("Completed iteration ", angle, "of ", Nangles, " at T = ", Tend-Tstart)
    # Output the results (one file for each Kz)
    DomainAvg = DomainAvg/Nangles
    outfile = "Spectral"+str(kz)+".txt"
    np.savetxt(outfile, DomainAvg)

# And we are done
Tend = time.time()
print("Total execution time was :", Tend-Tstart)
EDIT: A very, very hacky solution I came up with was to encode the matrix indices in the matrix itself as floats, then use scatter() and gather() to distribute the matrix, replace each value with the calculation output, and reassemble the matrix. This is of course not a good idea, since it requires int<->float conversion, but it was the only way I could come up with that didn't require rebuilding the entire matrix from the gathered data index by index (instead just using hstack() and reshape() to put it together). I feel like there must be some tool I am missing that assists in distributed calculation for arrays/matrices where the index matters, so I would still be interested if someone has a tip/pointer in this regard.
Minimum Working Example:
import numpy as np
import numpy.linalg as lina
import time
import math
from functools import partial
from mpi4py import MPI

#-- Standard Comms
COMM = MPI.COMM_WORLD
size = COMM.Get_size()
rank = COMM.Get_rank()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0, 11)]

#--- Now we Generate the spectral function for that grid
# This will be done in parallel using scatter/gather in MPI
if rank == 0:
    # List that we will scatter to the different nodes
    # Encode the matrix index from which each element came as a float
    datalist = [float(j+i*Nkpts) for j in range(0, Nkpts) for i in range(0, len(Energies))]
    data = np.array_split(datalist, COMM.Get_size())
else:
    data = None

# Distribute to the different nodes
data = COMM.scatter(data, root=0)
print("I am processor ", rank, " and my data is", data)

for index in range(0, len(data)):
    # Decode the indices
    j = data[index] % Nkpts
    i = math.floor(data[index]/Nkpts)
    data[index] = 100.100*j + Energies[i]

COMM.Barrier()
dataMPI = COMM.gather(data, root=0)

if (rank == 0):
    spectral = np.reshape(np.hstack(dataMPI), (Nkpts, len(Energies))).T
    spectral_func = np.matrix(spectral)
    print(spectral_func)
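Not from the original thread, but a hedged sketch of the same example without the float encoding: split the flat index range into contiguous, rank-ordered blocks (np.array_split preserves order), let each rank fill its own block, and gather; since COMM.gather() returns the chunks in rank order, one reshape restores the matrix. The 100.100*j + Energies[i] line stands in for the real call to spec_func(Energies, Sigma, eigenvals, i, j).
import numpy as np
from mpi4py import MPI

COMM = MPI.COMM_WORLD
rank = COMM.Get_rank()
size = COMM.Get_size()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0, 11)]

# global flat indices 0 .. Nkpts*len(Energies)-1, split into contiguous,
# rank-ordered blocks
all_idx = np.arange(Nkpts*len(Energies))
my_idx = np.array_split(all_idx, size)[rank]

# each rank fills only its own block; the flat index encodes (j, i) explicitly
my_vals = np.empty(len(my_idx))
for n, flat in enumerate(my_idx):
    j, i = divmod(flat, len(Energies))   # row j (kpoint), column i (energy)
    my_vals[n] = 100.100*j + Energies[i]  # placeholder for spec_func(..., i, j)

# gather() returns the blocks ordered by rank, so the concatenation is
# already in global index order and one reshape rebuilds the matrix
blocks = COMM.gather(my_vals, root=0)
if rank == 0:
    spectral = np.concatenate(blocks).reshape(Nkpts, len(Energies)).T
    print(spectral)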
I have one big numpy array A of shape (2_000_000, 2000) of dtype float64, which takes 32 GB.
(or alternatively the same data split into 10 arrays of shape (200_000, 2000), it may be easier for serialization?).
How can we serialize it to disk such that we can have fast random read access to any part of the data?
More precisely, I need to be able to read ten thousand windows of shape (16, 2000) from A at random starting indexes i:
L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]  # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.concatenate(L)  # shape (10_000, 16, 2000) of float64, i.e. ~ 2.4 GB
Let's say I only have 8 GB of RAM available for this task; it's totally impossible to load the whole 32 GB of A in RAM.
How can we read such windows in a serialized-on-disk numpy array? (.h5 format or any other)
Note: The fact the reading is done at randomized starting indexes is important.
This example shows how you can use an HDF5 file for the process you describe.
First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype=float64 values. I used variables for the dimensions so you can tinker with it.
import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5', 'w') as h5f:
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')
    incr = 1_000
    a0 = h5_a0//incr
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0, h5_a1)
        dset[i*a0:i*a0+a0, :] = arr
    print(dset[-1, 0:10])  # quick dataset check of values in last row
Next, open the file in read mode, read 10_000 random array slices of shape (16,2_000) and append to the list L. At the end, convert the list to the array WINS. Note, by default the array will have 2 axes -- you need to use .reshape() if you want 3 axes per your comment (reshape also shown).
with h5py.File('SO_68206763.h5', 'r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at a random index ir
        L.append(window)
    WINS = np.concatenate(L)  # shape (160_000, 2_000) of float64
    print(WINS.shape, WINS.dtype)
    WINS = np.concatenate(L).reshape(10_000, 16, ds1)  # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)
The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: in both list L and array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read each random slice of data directly into a preallocated array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2-axis array, and WINS3 is a 3-axis array).
with h5py.File('SO_68206763.h5', 'r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16, ds1))
    WINS3 = np.empty((10_000, 16, ds1))
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        WINS2[i*16:(i+1)*16, :] = dset[ir:ir+16, :]
        WINS3[i, :, :] = dset[ir:ir+16, :]
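One possible tweak (my assumption, not benchmarked on this data): creating the dataset with chunked storage sized to the read pattern may speed up the random 16-row reads, since each window then maps to only one or two chunks.
# hypothetical variant of the create_dataset call above: chunk along the
# row axis so a (16, 2000) window touches few chunks
dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64',
                          chunks=(16, h5_a1))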
An alternative solution to h5py datasets that I tried and that works is using np.memmap, as suggested in @RyanPepper's comment.
Write the data as binary
import numpy as np
with open('a.bin', 'wb') as A:
    for f in range(1000):
        x = np.random.randn(10*2000).astype('float32').reshape(10, 2000)
        A.write(x.tobytes())
    A.flush()
Open later as memmap
A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
print(A.shape) # (10000, 2000)
print(A[1234:1234+16, :]) # window
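For completeness, a sketch of the window-reading step on top of that memmap, preallocating the output the same way as the WINS3 example above (the 10_000 windows of 16 rows come from the question):
import numpy as np
import random

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
WINS = np.empty((10_000, 16, A.shape[1]), dtype=A.dtype)
for n in range(10_000):
    i = random.randint(0, A.shape[0] - 16)
    WINS[n] = A[i:i+16, :]   # each slice is read from disk on demand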
I have a numpy array of coordinates of size n_slice x 2048 x 3, where n_slice is in the tens of thousands. I want to apply the following operation on each 2048 x 3 slice separately
import numpy as np
from scipy.spatial.distance import pdist

# load coor from a binary xyz file, dcd format
n_slice, n_coor, _ = coor.shape
r = np.arange(n_coor)
dist = np.zeros([n_slice, n_coor, n_coor])

# this loop is what I want to parallelize, each slice is completely independent
for i in range(n_slice):
    dist[i, r[:, None] < r] = pdist(coor[i])
I tried using Dask by making coor a dask.array,
import dask.array as da
dcoor = da.from_array(coor, chunks=(1, 2048, 3))
but simply replacing coor by dcoor will not expose the parallelism. I could see setting up parallel threads to run for each slice but how do I leverage Dask to handle the parallelism?
Here is the parallel implementation using concurrent.futures
import concurrent.futures
import multiprocessing
import numpy as np
from scipy.spatial.distance import pdist

n_cpu = multiprocessing.cpu_count()

def get_dist(coor, dist, r):
    dist[r[:, None] < r] = pdist(coor)

# load coor from a binary xyz file, dcd format
n_slice, n_coor, _ = coor.shape
r = np.arange(n_coor)
dist = np.zeros([n_slice, n_coor, n_coor])

with concurrent.futures.ThreadPoolExecutor(max_workers=n_cpu) as executor:
    for i in range(n_slice):
        executor.submit(get_dist, coor[i], dist[i], r)
It is possible this problem is not well suited to Dask since there are no inter-chunk computations.
map_blocks
The map_blocks method may be helpful:
dcoor.map_blocks(pdist)
Uneven arrays
It looks like you're doing a bit of fancy slicing to insert particular values into particular locations of an output array. This will probably be awkward to do with dask.arrays. Instead, I recommend making a function that produces a numpy array
def myfunc(chunk):
    values = pdist(chunk[0, :, :])
    output = np.zeros((2048, 2048))
    r = np.arange(2048)
    output[r[:, None] < r] = values
    return output

dcoor.map_blocks(myfunc)
delayed
Worst case scenario you can always use dask.delayed
from dask import delayed, compute
coor2 = delayed(coor)
slices = [coor2[i] for i in range(coor.shape[0])]
slices2 = [delayed(pdist)(slice) for slice in slices]
results = compute(*slices2)
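A hedged follow-up, not from the original answer: delayed(pdist) returns the condensed distance vectors in the same order as the slices, so they can be written back into the dist array exactly as in the question's serial loop (n_slice, n_coor and r are the question's variables, results is the output of compute(*slices2) above).
import numpy as np

dist = np.zeros([n_slice, n_coor, n_coor])
for i, condensed in enumerate(results):
    dist[i, r[:, None] < r] = condensed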
I'm trying to do a k-NN search on big data with limited memory.
I'm using HDF5 and Python.
I tried brute-force linear search (using PyTables) and kd-tree search (using sklearn).
It's surprising, but the kd-tree method takes more time (maybe the kd-tree would work better if we increased the batch size? but I don't know the optimal size; it is also limited by memory).
Now I'm looking for ways to speed up the calculations; I think the HDF5 file can be tuned for an individual PC, and the norm calculation could maybe be sped up using numexpr or some Python tricks.
import numpy as np
import time
import tables
import cProfile
from sklearn.neighbors import NearestNeighbors

rows = 10000
cols = 1000
batches = 100
k = 10

#USING HDF5
vec = np.random.rand(1, cols)
data = np.random.rand(rows, cols)
fileName = r'C:\carray1.h5'
shape = (rows*batches, cols)  # predefined size
atom = tables.UInt8Atom()  #?
filters = tables.Filters(complevel=5, complib='zlib')  #?

#create
# h5f = tables.open_file(fileName, 'w')
# ca = h5f.create_carray(h5f.root, 'carray', atom, shape, filters=filters)
# for i in range(batches):
#     ca[i*rows:(i+1)*rows] = data[:]+i  # +i to modify data
# h5f.close()

#can be parallel?
def test_bruteforce_knn():
    h5f = tables.open_file(fileName)
    t0 = time.time()
    d = np.empty((rows*batches,))
    for i in range(batches):
        d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows]-vec)**2).sum(axis=1)
    print(time.time()-t0)
    ndx = d.argsort()
    print(ndx[:k])
    h5f.close()

def test_tree_knn():
    h5f = tables.open_file(fileName)

    # it will not work
    # t0 = time.time()
    # nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray)
    # distances, indices = nbrs.kneighbors(vec)
    # print(time.time()-t0)

    #need to concatenate distances, indices somehow
    t0 = time.time()
    d = np.empty((rows*batches,))
    for i in range(batches):
        nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray[i*rows:(i+1)*rows])
        distances, indices = nbrs.kneighbors(vec)  # put in dict?
        #d[i*rows:(i+1)*rows] =
    print(time.time()-t0)
    #ndx = d.argsort()
    #print(ndx[:k])
    h5f.close()

cProfile.run('test_bruteforce_knn()')
cProfile.run('test_tree_knn()')
If I understand correctly, your data has 1000 dimensions? If this is the case, then it's expected that the kd-tree won't fare well, as it suffers from the curse of dimensionality.
You might want to have a look at approximate nearest neighbor search methods instead. For instance, have a look at FLANN.
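Separately from the dimensionality point, if the batched brute-force route is kept, the per-batch candidates can be merged without holding all distances at once. A rough sketch of that bookkeeping (mine, not from the thread), reusing rows, batches, k and vec from the question, with h5f open on the carray file as in test_bruteforce_knn:
import numpy as np

best_d = np.full(k, np.inf)               # best squared distances so far
best_i = np.full(k, -1, dtype=np.int64)   # their global row indices
for i in range(batches):
    block = h5f.root.carray[i*rows:(i+1)*rows]
    d = ((block - vec)**2).sum(axis=1)
    cand = np.argpartition(d, k)[:k]      # k smallest in this batch
    all_d = np.concatenate([best_d, d[cand]])
    all_i = np.concatenate([best_i, cand + i*rows])
    keep = np.argpartition(all_d, k)[:k]  # keep the running best k
    best_d, best_i = all_d[keep], all_i[keep]

order = np.argsort(best_d)                # final k nearest, sorted
print(best_i[order], best_d[order])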
I want to speed up an embarrassingly parallel problem related to Bayesian inference. The aim is to infer coefficients U for a set of images X, given a matrix A, such that X = A*U.
X has dimensions mxn, A mxp and U pxn. For each column of X, one has to infer the optimal corresponding column of the coefficients U. In the end, this information is used to update A. I use m = 3000, p = 1500 and n = 100.
So, as it is a linear model, the inference of the coefficient matrix U consists of n independent calculations. Thus, I tried to work with the multiprocessing module of Python, but there is no speed-up.
Here is what I did:
The main structure, without parallelization, is:
import numpy as np
from convex import Crwlasso_cd

S = np.empty((m, batch_size))

for t in range(start_iter, niter):
    ## Begin Warm Start ##
    # Take 5 gradient steps w/ this batch using last coef. to warm start inf.
    for ws in range(5):
        # Initialize the coefficients
        if ws:
            theta = U
        else:
            theta = np.dot(A.T, X)
        # Infer the Coefficients for the given data batch X of size mxn (n=batch_size)
        # Crwlasso_cd is the function that does the inference per data sample
        # It's basically a C-inline code
        for k in range(batch_size):
            U[:, k] = Crwlasso_cd(X[:, k].copy(), A, theta=theta[:, k].copy())
        # Given the inferred coefficients, update and renormalize
        # the basis functions A
        dA1 = np.dot(X - np.dot(A, U), U.T)  # Gaussian data likelihood
        A += (eta / batch_size) * dA1
        A = np.dot(A, np.diag(1/np.sqrt((A**2).sum(axis=0))))
Implementation of multiprocessing:
I tried to implement multiprocessing. I have an 8-core machine that I can use.
There are 3 for-loops. The only one that seems to be "parallelizable" is the third one, where the coefficients are inferred:
Generate a Queue and stack the iteration-numbers from 0 to batch_size-1 into the Queue
Generate 8 processes, and let them work through the Queue
Share the data U using multiprocessing.Array
So, I replaced this third loop with the following:
from multiprocessing import Process, Queue
import multiprocessing as mp
from queue import Empty

num_cpu = mp.cpu_count()
work_queue = Queue()

# Generate the empty ndarray U and a multiprocessing.Array-Wrapper U_mp around U
# The class Wrap_mp is attached. Basically, U_mp.asarray() gives the corresponding
# ndarray
U = np.empty((p, batch_size))
U_mp = Wrap_mp(U)

...

# Within the for-loops:
for p in range(batch_size):
    work_queue.put(p)

processes = [Process(target=infer_coefficients_mp, args=(work_queue, U_mp, A, X)) for p in range(num_cpu)]

for p in processes:
    p.start()
    print(p.pid)
for p in processes:
    p.join()
Here is the class Wrap_mp:
class Wrap_mp(object):
    """ Wrapper around multiprocessing.Array to share an array across
    processes. Store the array as a multiprocessing.Array, but compute with it
    as a numpy.ndarray
    """
    def __init__(self, arr):
        """ Initialize a shared array from a numpy array.
        The data is copied.
        """
        self.data = ndarray_to_shmem(arr)
        self.dtype = arr.dtype
        self.shape = arr.shape

    def __array__(self):
        """ Implement the array protocol.
        """
        arr = shmem_as_ndarray(self.data, dtype=self.dtype)
        arr.shape = self.shape
        return arr

    def asarray(self):
        return self.__array__()
And here is the function infer_coefficients_mp:
def infer_coefficients_mp(work_queue, U_mp, A, X):
    while True:
        try:
            index = work_queue.get(block=False)
            x = X[:, index]
            U = U_mp.asarray()
            theta = np.dot(A.T, x)
            # Infer the coefficients of the column index
            U[:, index] = Crwlasso_cd(x.copy(), A, theta=theta.copy())
        except Empty:
            break
The problems now are the following:
The multiprocessing version is not faster than the single-threaded version for the given dimensions of the data.
The process IDs increase with every iteration. Does this mean that a new process is constantly being generated? Doesn't this create huge overhead? How can I avoid that? Is there a possibility of creating 8 different processes once, outside the whole for-loop, and just updating them with the data?
Does the way I share the coefficients U amongst the processes slow the calculation down? Is there another, better way of doing this?
Would a Pool of processes be better?
I am really thankful for any sort of help! I have started working with Python a month ago, and am pretty lost now.
Engin
Every time you create a Process you are creating a new process. If you're doing that within your for loop, then yes, you are starting new processes every time through the loop. It sounds like what you want to do is initialize your Queue and Processes outside of the loop, then fill the Queue inside the loop.
I've used multiprocessing.Pool before, and it's useful, but it doesn't offer much over what you've already implemented with a Queue.
Eventually, this all boils down to one question: Is it possible to start processes outside of the main for-loop, feed the updated variables into them at every iteration, have them process the data, and collect the newly calculated data from all of the processes, without having to start new processes every iteration?
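For what it's worth, a hedged sketch of that pattern with multiprocessing.Pool, reusing the names from the question (X, A, U, theta, Crwlasso_cd, batch_size, eta, start_iter, niter); the pool is created once, its worker processes are reused every iteration, and pool.map() returns results in input order so column k stays column k.
import numpy as np
from multiprocessing import Pool

def infer_column(args):
    x_col, A_snapshot, theta_col = args
    # same per-column inference call as in the serial code
    return Crwlasso_cd(x_col, A_snapshot, theta=theta_col)

if __name__ == '__main__':
    pool = Pool(processes=8)            # start 8 workers once
    for t in range(start_iter, niter):
        for ws in range(5):
            theta = U if ws else np.dot(A.T, X)
            jobs = [(X[:, k].copy(), A, theta[:, k].copy())
                    for k in range(batch_size)]
            # map() preserves input order, so the columns come back in place
            U = np.column_stack(pool.map(infer_column, jobs))
            # update and renormalize the basis functions A, as before
            dA1 = np.dot(X - np.dot(A, U), U.T)
            A += (eta / batch_size) * dA1
            A = np.dot(A, np.diag(1/np.sqrt((A**2).sum(axis=0))))
    pool.close()
    pool.join()
The obvious cost is that A and the data columns are pickled to the workers on every call; whether this beats the serial version depends on how expensive each Crwlasso_cd call is.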