Given a data matrix with discrete entries represented as a 2D numpy array, I'm trying to compute the observed frequencies of some features (the columns) only looking at some instances (the rows of the matrix).
I can do that quite easily with numpy by applying bincount to each slice after some fancy indexing. Doing the same in pure Python, using an external data structure as a count accumulator, amounts to a C-style double loop.
import numpy
import numba

try:
    from time import perf_counter
except:
    from time import time
    perf_counter = time


def estimate_counts_numpy(data,
                          instance_ids,
                          feature_ids):
    """
    WRITEME
    """
    #
    # slicing the data array (probably memory consuming)
    curr_data_slice = data[instance_ids, :][:, feature_ids]

    estimated_counts = []
    for feature_slice in curr_data_slice.T:
        counts = numpy.bincount(feature_slice)
        #
        # checking just for the all 0 case:
        # this is not stable for not binary datasets TODO: fix it
        if counts.shape[0] < 2:
            counts = numpy.append(counts, [0], 0)
        estimated_counts.append(counts)

    return estimated_counts
@numba.jit(numba.types.int32[:, :](numba.types.int8[:, :],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:, :]))
def estimate_counts_numba(data,
                          instance_ids,
                          feature_ids,
                          feature_vals,
                          estimated_counts):
    """
    WRITEME
    """
    #
    # actual counting
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            estimated_counts[i][data[instance_id, feature_id]] += 1

    return estimated_counts
if __name__ == '__main__':
    #
    # creating a large synthetic matrix, testing for performance
    rand_gen = numpy.random.RandomState(1337)
    n_instances = 2000
    n_features = 2000
    large_matrix = rand_gen.binomial(1, 0.5, (n_instances, n_features))

    #
    # random indexes too
    n_sample = 1000
    rand_instance_ids = rand_gen.choice(n_instances, n_sample, replace=False)
    rand_feature_ids = rand_gen.choice(n_features, n_sample, replace=False)

    binary_feature_vals = [2 for i in range(n_features)]

    #
    # testing
    numpy_start_t = perf_counter()
    e_counts_numpy = estimate_counts_numpy(large_matrix,
                                           rand_instance_ids,
                                           rand_feature_ids)
    numpy_end_t = perf_counter()
    print('numpy done in {0} secs'.format(numpy_end_t - numpy_start_t))

    binary_feature_vals = numpy.array(binary_feature_vals)
    #
    #
    curr_feature_vals = binary_feature_vals[rand_feature_ids]
    #
    # creating a data structure to hold the slices
    # (with numba I cannot use list comprehension?)
    # e_counts_numba = [[0 for val in range(feature_val)]
    #                   for feature_val in
    #                   curr_feature_vals]
    e_counts_numba = numpy.zeros((n_sample, 2), dtype='int32')

    numba_start_t = perf_counter()
    estimate_counts_numba(large_matrix,
                          rand_instance_ids,
                          rand_feature_ids,
                          binary_feature_vals,
                          e_counts_numba)
    numba_end_t = perf_counter()
    print('numba done in {0} secs'.format(numba_end_t - numba_start_t))
These are the times I get while running the above code:
numpy done in 0.2863295429997379 secs
numba done in 11.55551904299864 secs
My point is that my implementation gets even slower when I try to JIT it with Numba, so I strongly suspect I am messing something up.
The reason your function is slow is because Numba has fallen back to object mode to compile the loop.
There are two problems:
Numba doesn't yet support chained indexing of multidimensional arrays, so you need to rewrite this:
estimated_counts[i][data[instance_id, feature_id]]
into this:
estimated_counts[i, data[instance_id, feature_id]]
Your explicit type signature is incorrect. All of your input arrays are actually int64, rather than int8/int32. Rather than fix your signature, you can rely on Numba's automatic JIT to detect the argument types and compile the right version. All you have to do is change the decorator to just @numba.jit. Just make sure you call the function once before you benchmark if you don't want to include compilation time.
With these changes, I benchmark Numba to be about 15% faster than NumPy for this function.
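For reference, a minimal sketch of what the fixed function could look like, letting Numba infer the types and reusing the arrays from the question's __main__ block (large_matrix, rand_instance_ids, etc.):

@numba.jit
def estimate_counts_numba(data, instance_ids, feature_ids,
                          feature_vals, estimated_counts):
    # single tuple index instead of the chained [i][...] indexing
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            estimated_counts[i, data[instance_id, feature_id]] += 1
    return estimated_counts

# warm-up call so compilation time is not included in the benchmark
warmup_counts = numpy.zeros((n_sample, 2), dtype='int32')
estimate_counts_numba(large_matrix, rand_instance_ids, rand_feature_ids,
                      binary_feature_vals, warmup_counts)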
Related
I have a rather simple parallelization question that I can't seem to work out. I had parallelized a simple matrix assignment using joblib in Python, which worked nicely on my workstation, but now I need to run the code on an HPC and the as-is code is not playing nicely with MPI. A skeleton of the code is below (I have stripped out a lot of not-relevant computation). Basically I have a large matrix that I want to fill in, and at each point the value is a sum over many energies and eigenvalues, so this is the 'slow step' of the calculation. When I run this on my workstation I just parallelize that fill-in using Parallel and delayed from joblib, but of course when I run this on the cluster using, for example, mpirun --bind-to none -n 16 python KZ_spectral_function.py | tee spectral.out, the code runs basically in serial (although with some odd behavior).
So, what I think I need to do is to convert that joblib line over to mpi4py, include an if rank == 0: statement encompassing everything in the main function, and just modify the contents of gen_spec_func() and divvy up the calls to spec_func() to the different cores. This is the part where I am stuck: all the examples I have read that were simple enough for me to understand use some variation of COMM.scatter() and then append the results to a list, as far as I can tell in a random order, and I don't know quite enough to adapt them to something where I want the results to go in a specific place in the matrix. Any help or advice would be greatly appreciated, as neither parallelization nor Python are strengths of mine...
Code Snippet (simplified):
import numpy as np
import numpy.linalg as lina
import time
from functools import partial
from joblib import Parallel, delayed

# Helper Functions
def get_eigenvals(k_cart,cellMap,Hwannier,G):
    ## [....] some linear algebra, not important
    return Ek

def gen_spec_func(eigenvals,Nkpts,Energies,Sigma):
    ## This is really the only part that I care to parallelize
    ## This is the joblib version
    num_cores=16
    tempfunc = partial(spec_func,Energies,Sigma,eigenvals)
    spectral = np.reshape(Parallel(n_jobs=num_cores)(delayed(tempfunc)(i,j) for j in range(0,Nkpts) for i in range(0,len(Energies))),(Nkpts,len(Energies))).T
    return np.matrix(spectral)

def spec_func(Energies,Sigma,eigenvals,i,j) :
    return sum([(1.0/( (Energies[i]-val)**2 + (Sigma)**2 )) for val in eigenvals[j,:]])

#--- Start of main script
Tstart = time.time()

# [...] Declare Constants & Parameters
# [...] Read data from disk
# [...] Some calculations on that data we want done in serial

Energies = [Emin + (Emax-Emin)*i/(Nenergies-1) for i in range(0,Nenergies)]
kzs = [kzi+(kzf-kzi)*l/(nkzs-1) for l in range(0,nkzs)]
DomainAvg = np.matrix([[0.0 for j in range(0,Nkpts)] for i in range(0,Nenergies)])

for kz in kzs:
    ## An outer loop over Kz
    print("Starting loop for kz = ",kz)

    # Generate the base k-grid (symmetric) in 1/A for convenience
    # [...] Generate the appropriate kpoint grid

    for angle in range(0,Nangles):
        ## Inner loop over rotation angles
        #--- For each angle generate the kpoint grid for that domain
        # [...] Calculate some eigenvalues, small matrices not a big deal, serial fine

        #--- Now we Generate the spectral function for that grid
        ### Ok, this is the slow part that we want to parallelize
        DomainAvg += gen_spec_func(eigenvals,Nkpts,Energies,Sigma)

        if(angle%20 == 0):
            Tend = time.time()
            print("Completed iteration ", angle, "of ", Nangles, " at T = ", Tend-Tstart)

    # Output the results (one file for each Kz)
    DomainAvg = DomainAvg/Nangles
    outfile = "Spectral"+str(kz)+".txt"
    np.savetxt(outfile, DomainAvg)

# And we are done
Tend = time.time()
print("Total execution time was :", Tend-Tstart)
EDIT: A very, very hacky solution I came up with was to encode the matrix indices in the matrix itself as floats, then use scatter() and gather() to distribute the matrix, replace the values with the calculation output, and reassemble the matrix. This is of course not a good idea since it requires int<->float conversion, but it was the only way I could come up with that didn't require rebuilding the entire matrix from the gathered data index by index (instead just using hstack() and reshape() to put it together). I feel like there must be some tool I am missing that assists in distributed calculation for arrays/matrices where the index matters, so I would still be interested if someone has a tip/pointer in this regard.
Minimum Working Example:
import numpy as np
import numpy.linalg as lina
import time
import math
from functools import partial
from mpi4py import MPI

#-- Standard Comms
COMM = MPI.COMM_WORLD
size = COMM.Get_size()
rank = COMM.Get_rank()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0,11)]

#--- Now we Generate the spectral function for that grid
# This will be done in parallel using scatter/gather in MPI
if rank == 0:
    # List that we will scatter to the different nodes
    # Encode the matrix index from which each element came as a float
    datalist = [float(j+i*Nkpts) for j in range(0,Nkpts) for i in range(0,len(Energies))]
    data = np.array_split(datalist, COMM.Get_size())
else:
    data = None

# Distribute to the different nodes
data = COMM.scatter(data, root=0)
print("I am processor ",rank," and my data is",data)

for index in range(0,len(data)) :
    # Decode the indices
    j = data[index]%Nkpts
    i = math.floor(data[index]/Nkpts)
    data[index] = 100.100*j+Energies[i]

COMM.Barrier()
dataMPI = COMM.gather(data,root=0)

if(rank==0) :
    spectral = np.reshape(np.hstack(dataMPI),(Nkpts,len(Energies))).T
    spectral_func = np.matrix(spectral)
    print(spectral_func)
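One way to avoid encoding the indices as floats, sketched below under the same setup as the MWE: COMM.gather returns the per-rank chunks in rank order, so if each rank receives a contiguous chunk of integer flat indices and returns the computed values in the same order, hstack + reshape puts every value back at its original (i, j) position automatically.

import numpy as np
from mpi4py import MPI

COMM = MPI.COMM_WORLD
rank = COMM.Get_rank()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0, 11)]

if rank == 0:
    flat_idx = np.arange(Nkpts*len(Energies))            # one integer index per matrix entry
    chunks = np.array_split(flat_idx, COMM.Get_size())   # contiguous, ordered chunks
else:
    chunks = None

my_idx = COMM.scatter(chunks, root=0)

# each rank computes the entries for its own flat indices
my_vals = np.empty(len(my_idx))
for n, flat in enumerate(my_idx):
    j, i = divmod(int(flat), len(Energies))              # recover (j, i) from the flat index
    my_vals[n] = 100.100*j + Energies[i]

gathered = COMM.gather(my_vals, root=0)                  # arrives in rank order
if rank == 0:
    spectral = np.hstack(gathered).reshape(Nkpts, len(Energies)).T
    print(np.matrix(spectral))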
I'm facing a problem with vectorizing a function so that it applies efficiently on a numpy array.
My program inputs:
A pos_part 2D array with Nb_particles rows and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundreds of thousands.
A prop_part 1D array with Nb_particles values. This part I have covered; it is created with some nice numpy functions. I just put a basic distribution here that resembles real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly with only numpy operations on arrays. What I want to do is:
For each distance z_i in z_distances, sum all values from prop_part whose corresponding particle coordinate satisfies z_particle < z_i. This returns a 1D array the same length as z_distances.
My ideas so far:
Version 0: a for loop with enumerate and np.where to retrieve the indices of the values I need to sum. Obviously quite slow.
Version 1: using a mask on a new array (a combination of z coordinates and particle properties), and summing the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, but I understand it's not efficient since vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask in a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There may be something to do with a sort and a cumulative sum, but I don't know how to do this, so any help would be greatly appreciated. Please find the code below to make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt

# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0

#usefull property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)

z_distances = np.arange(0,10,0.1)

#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2]<val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :",time.time()-t0)

#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' usefull property for particles with z-pos<d")
plt.xlabel("d")

#version 1 ??
t1=time.time()
combi = np.column_stack((pos_part[:,2],prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0]<val_dist)
    result2[index_dist]=sum(combi[:,1][mask])
print("v1 :",time.time()-t1)
plt.plot(z_distances,result2, c="blue")

#version 2
t2=time.time()
def themask(a):
    mask = (combi[:,0]<a)
    return sum(combi[:,1][mask])

thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :",time.time()-t2)
plt.plot(z_distances,result3, c="green")

### This does not work so far
# version 3
# =============================
# t3=time.time()
# def thesum(a):
#     mask = combi[combi[:,0]<a]
#     return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :",time.time()-t3)
# =============================
You can get a lot more performance by writing your first version completely in numpy. Replace Python's sum with np.sum. Instead of the for i in positions generator expression, simply pass the positions mask you are creating anyway.
Indeed, the np.where is not necessary and my best version looks like:
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :",time.time()-t0)
# out: v0 : 0.06322097778320312
You can get a bit faster still, if z_distances is very long, by using numba.
Running calc for the first time usually incurs some compilation overhead, which we can get rid of by first running the function on some small subset of z_distances.
The below code achieves roughly a factor of two speedup over pure numpy on my laptop.
import numba as nb

@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result

result4 = np.zeros_like(result)
# _t = time.time()
# calc(result4, z_distances[:10])
# print(time.time()-_t)
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
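The sort plus cumulative-sum idea mentioned in the question also works, and it removes the per-distance scan over all particles entirely. A minimal sketch, assuming the same pos_part, prop_part and z_distances arrays as above:

# sort particles by z and build a running sum of their property
order = np.argsort(pos_part[:, 2])
z_sorted = pos_part[order, 2]
prop_cumsum = np.cumsum(prop_part[order])

# for every distance at once: how many particles have z < val_dist
idx = np.searchsorted(z_sorted, z_distances, side='left')
result5 = np.where(idx > 0, prop_cumsum[np.maximum(idx - 1, 0)], 0.0)

np.searchsorted with side='left' returns, for each distance, the number of sorted z values strictly below it, so prop_cumsum[idx - 1] is exactly the sum of the corresponding properties.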
I have a dataset carrying N arrays; each array may have a fixed or variable length. I am going to extract a window from the dataset and count the occurrences of a given number. To keep it simple, I assume each array contains 90 numbers. The model picks a random position in each array and extracts M consecutive numbers starting from that position as a window strip, so N strips are obtained in total. I assume each strip has the same size, so the N strips together compose a view (window). I have C++ code that performs the job very well: for a dataset of 5x90, repeating the window sampling 1,000,000 times, it takes roughly 0.002 seconds to finish every 10,000 iterations. I am trying to port the C++ to Python so that I can use the powerful data frame (Pandas) and other tools for related analysis. After some research, it seems that a Numpy array is a good representation of a data array, but since the size of each array varies, I need to wrap all arrays into a list. The code I have is shown below (most of the unrelated code has been removed).
import numpy as np
import random
import time

class dataSource:
    _data = None
    _NCOLS = 0
    _NROWS = 0

    # NCOLS, NROWS is the dimension of a window (view) of the data
    # _data is a list of NCOLS Numpy arrays, each array carrying 90 numbers
    def __init__(self, NCOLS, NROWS):
        # assuming each array has 90 numbers for this example, it could vary in the actual case
        src = [[1]*90]*NCOLS  # just assume all numbers in the array to be 1 for this example
        self._data = [np.array(cc) for cc in src]
        self._NCOLS = NCOLS
        self._NROWS = NROWS

    def extractView(self, view):
        pos = [random.randint(0, len(cc)-1-self._NROWS) for cc in self._data]
        for col in range(len(self._data)):
            view[:, col] = self._data[col][pos[col]:pos[col]+self._NROWS]

class dataView:
    _NCOLS = 0
    _NROWS = 0
    _view = None

    def __init__(self, NCOLS, NROWS):
        self._NCOLS = NCOLS
        self._NROWS = NROWS
        self._view = np.zeros([self._NROWS, self._NCOLS], order='F')

    def __setitem__(self, key, value):
        self._view[key] = value

    def count(self, elem):
        return np.count_nonzero(self._view==elem)

if __name__ == '__main__':
    ds = dataSource(5, 3)
    dv = dataView(5, 3)

    batchCount, totalTime, startTime = 1, 0, time.process_time()
    for n in range(10000000):
        if ((n+1)%10000==0):
            dt = time.process_time() - startTime
            totalTime += dt
            print(totalTime / batchCount, ' ', dt)
            batchCount += 1
            startTime = time.process_time()
        ds.extractView(dv)
        cnt = dv.count(1)  # this could then be used for other analysis, dv may be used by other functions
Running this code in Python 3.8.8 shows that it takes 0.23 seconds to finish every 10,000 iterations. The Python code is about 100 times slower than the C++. In an actual case, the data source could have up to 50 arrays and each may have 50 to 500 numbers, so the run times in Python could be even longer. Applying the C++ code to the real data takes about 5 hours to finish all iterations; based on the above estimation, it would take 500 hours to do the same job in Python. I understand that there may be limits in Python that make it hard to boost the speed, but anything I could do to optimize my code and reduce the time (even by 10%) would help a lot.
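If it is acceptable to assume, as in the example above, that all source arrays have the same length, one way to cut the per-iteration Python overhead is to stack them into a single 2D array and extract many windows in one fancy-indexing call instead of looping column by column. A rough sketch (the function and variable names are just for illustration, not part of the original code):

import numpy as np

def extract_views_vectorized(data2d, n_rows, n_iters, rng):
    # data2d: shape (n_cols, length), one source array per row
    n_cols, length = data2d.shape
    # one random start position per column and per iteration
    starts = rng.integers(0, length - n_rows, size=(n_iters, n_cols))
    # views[it, r, c] = data2d[c, starts[it, c] + r]
    rows = starts[:, None, :] + np.arange(n_rows)[None, :, None]
    cols = np.arange(n_cols)[None, None, :]
    return data2d[cols, rows]            # shape (n_iters, n_rows, n_cols)

rng = np.random.default_rng(0)
data2d = np.ones((5, 90), dtype=np.int64)
views = extract_views_vectorized(data2d, n_rows=3, n_iters=10_000, rng=rng)
counts = np.count_nonzero(views == 1, axis=(1, 2))   # one count per iteration

Whether this helps in the real case depends on how much the array lengths actually differ; if they vary a lot, a compromise is to pad the arrays or group those of similar length.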
I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA compute capability 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda cufft package to run a batch of one 1d FFT, and the results match NumPy's FFT. The problem comes when I go to a real batch size: there, I'm not able to match NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the attached code, the parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT, and also in speeding up execution further (if possible), would be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray

# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp

# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])

idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0

if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)

# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)

data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)

# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)

dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code compares the result of NumPy's FFT with cuFFT's. With singleFFT=1 the result is True, while for singleFFT=0 (i.e. a batch of many 1d FFTs) the result is False.
After my attempts, I would conclude that:
Using the cufft library from skcuda is a bit tricky, and getting the correct FFT output might take a long time in development. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda).
Using CuPy and arranging your data so that the FFT dimension is laid out in contiguous memory gives an order-of-magnitude improvement in the FFT compute time. For my case, the speedup was a little better than 10x!
Using CuPy for FFTs is a great option if one wants to stick to Python-based development only. The back and forth between C and Python when writing C GPU kernels is an added overhead that is very conveniently avoided with CuPy, though CuPy itself handles laying out the plan and calling the FFT execution engine internally.
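For reference, a minimal sketch of the CuPy route described above (assuming CuPy is installed and the array fits in GPU memory; the shapes are the ones from the question):

import numpy as np
import cupy as cp

nSamp, nChirp, nTxRx = 512, 256, 16*16
data = (np.random.randn(nSamp, nChirp, nTxRx)
        + 1j*np.random.randn(nSamp, nChirp, nTxRx)).astype(np.complex64)

# move the FFT axis last so each 1d transform reads contiguous memory,
# then run all nSamp*nTxRx FFTs in a single batched call
d_gpu = cp.asarray(np.ascontiguousarray(np.moveaxis(data, 1, -1)))
fft_gpu = cp.fft.fft(d_gpu, axis=-1)
result = np.moveaxis(cp.asnumpy(fft_gpu), -1, 1)   # back to (nSamp, nChirp, nTxRx)

# sanity check against NumPy
print(np.allclose(result, np.fft.fft(data, axis=1), rtol=1e-4, atol=1e-3))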
I have a complex numpy array signal with dimensions [10,1000,50000]
I need to modify this array in slices. This is done in a for loop:
for k in range(signal.shape[2]):
    signal[:,:,k] = myfunction(signal[:,:,k], constant1, constant2, constant5=constant5, constant6=constant6)
I optimized myfunction as much as possible. When I run the script it takes quite some time but only uses 1 of 24 CPUs.
The code cannot be rewritten to perform myfunction on the entire array with numpy.
Therefore I want to speed up my code with parallel computing.
There seem to be many different approaches to parallel computing in Python.
Which one seems to be the best for my problem? And how can I implement it?
Joblib provides easy execution for such 'embarrassingly-parallel' tasks:
import numpy as np

# Initialize array and define function
np_array = np.random.rand(100,100,100)
my_function = lambda x: x / np.sum(x)

# Option 1: Loop over array and apply function
serial_result = np_array.copy()
for i in range(np_array.shape[2]):
    serial_result[:,:,i] = my_function(np_array[:,:,i])
Now using parallel execution with joblib:
# Option 2: Parallel execution
# ... Apply function in Parallel
from joblib import delayed, Parallel

sub_arrays = Parallel(n_jobs=6)(          # Use 6 cores
    delayed(my_function)(np_array[:,:,i]) # Apply my_function
    for i in range(np_array.shape[2]))    # For each 3rd dimension

# ... Concatenate the list of returned arrays
parallel_results = np.stack(sub_arrays, axis=2)

# Compare results
np.equal(serial_result, parallel_results).all() # True
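If my_function spends most of its time in NumPy calls that release the GIL, it may also be worth trying joblib's threading backend, so the array is shared between workers instead of being pickled to separate processes. A small variation on the snippet above (the n_jobs value is arbitrary):

# Option 3: thread-based parallelism (useful when the work releases the GIL)
sub_arrays_threads = Parallel(n_jobs=6, prefer="threads")(
    delayed(my_function)(np_array[:,:,i])
    for i in range(np_array.shape[2]))
parallel_results_threads = np.stack(sub_arrays_threads, axis=2)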