Numpy array vs python list with time measurement

Numpy array vs python list with time measurement - python

I have read about numpy array runs faster than ordinary code performed in python. I have a large data to send a function with return value. I test it with numpy index and boolean and with python, I should expect to run the task faster with numpy function but in my test is not the case. Why does it run slower?
from time import time
import numpy as np
def stress(STEEL_E_MODULE, strain_in_rebars, STEEL_STRENGTH):
STRESS = STEEL_E_MODULE * strain_in_rebars
if STRESS <= - STEEL_STRENGTH:
stress = - STEEL_STRENGTH
elif STRESS >= - STEEL_STRENGTH and STRESS < 0:
stress = STRESS
else:
stress = min(STRESS, STEEL_STRENGTH)
return stress
def stressnumpy(STEEL_E_MODULE, strain_in_rebars, STEEL_STRENGTH):
STRESS = np.array([ STEEL_E_MODULE * strain_in_rebars, STEEL_E_MODULE * strain_in_rebars])
Result = np.array([ - STEEL_STRENGTH, STRESS[1], min(STRESS[1], STEEL_STRENGTH)])
cond1 = STRESS[0] <= - STEEL_STRENGTH
cond2 = (STRESS[1] >= - STEEL_STRENGTH) & (STRESS[1] < 0)
arr = [cond1, cond2, True if cond1 == False and cond2== False else False]
return Result[arr][0]
tg1 = time()
for n in np.arange(-1000,1000,0.01):
y = stressnumpy( 2, n, 20)
tg2 = time()
print("time with numpyfunc:", tg2-tg1)
tg3 = time()
for n in np.arange(-1000,1000,0.01):
x = stress( 2, n, 20)
tg4 = time()
print("time with func:", tg4-tg3)
Time with both functions:
time with numpyfunc: 1.2981691360473633
time with func: 0.1738882064819336
I expect to get opposite running time?

It's misguiding to say that a code with arrays runs faster. In general, when people speak about code going faster with Numpy arrays, they point out the benefit of vectorizing the code which can then run faster with Numpy performant implementation of functions for array operations and manipulation. I wrote a code that compares the second code you provided (I slightly modified it to save values on a list in order to compare the results) and I compared it to a vectorized version. You can see that this leads to a significant increase in time and the results given by the two methods are equal:
from time import time
import numpy as np
# We use a list and apply the operation for all elemnt in it => return a list
def stresslist(STEEL_E_MODULE, strain_in_rebars_list, STEEL_STRENGTH):
stress_list = []
for strain_in_rebars in strain_in_rebars_list:
STRESS = STEEL_E_MODULE * strain_in_rebars
if STRESS <= - STEEL_STRENGTH:
stress = - STEEL_STRENGTH
elif STRESS >= - STEEL_STRENGTH and STRESS < 0:
stress = STRESS
else:
stress = min(STRESS, STEEL_STRENGTH)
stress_list.append(stress)
return stress_list
# We use a numpy array and apply the operation for all elemnt in it => return a array
def stressnumpy(STEEL_E_MODULE, strain_in_rebars_array, STEEL_STRENGTH):
STRESS = STEEL_E_MODULE * strain_in_rebars_array
stress_array = np.where(
STRESS <= -STEEL_STRENGTH, -STEEL_STRENGTH,
np.where(
np.logical_and(STRESS >= -STEEL_STRENGTH, STRESS < 0), STRESS,
np.minimum(STRESS, STEEL_STRENGTH)
)
)
return stress_array
t_start = time()
x = stresslist( 2, list(np.arange(-1000,1000,0.01)), 20)
print(f'time with list: {time()-t_start}s')
t_start = time()
y = stressnumpy( 2, np.arange(-1000,1000,0.01), 20)
print(f'time with numpy: {time()-t_start}s')
print('Results are equal!' if np.allclose(y,np.array(x)) else 'Results differ!')
Output:
% python3 script.py
time with list: 0.24164390563964844s
time with numpy: 0.003011941909790039s
Results are equal!
Please do not hesitate if you have questions about the code. You can also refer to the official documentation for numpy.where, numpy.logical_and and numpy.minimum.

Related

EWMA Covariance Matrix in Pandas - Optimization

I would like to calculate the EWMA Covariance Matrix from a DataFrame of stock price returns using Pandas and have followed the methodology in PyPortfolioOpt.
I like the flexibility of using Pandas objects and functions but when the set of assets grows the function is becomes very slow:
import pandas as pd
import numpy as np
def ewma_cov_pairwise_pd(x, y, alpha=0.06):
x = x.mask(y.isnull(), np.nan)
y = y.mask(x.isnull(), np.nan)
covariation = ((x - x.mean()) * (y - y.mean()).dropna()
return covariation.ewm(alpha=0.06).mean().iloc[-1]
def ewma_cov_pd(rets, alpha=0.06):
assets = rets.columns
n = len(assets)
cov = np.zeros((n, n))
for i in range(n):
for j in range(i, n):
cov[i, j] = cov[j, i] = ewma_cov_pairwise_pd(
rets.iloc[:, i], rets.iloc[:, j], alpha=alpha)
return pd.DataFrame(cov, columns=assets, index=assets)
I would like to improve the speed of the code ideally while still using Pandas but the bottleneck is within the DataFrame.ewm() function which uses 90% of the calculation time.
If using this function was a binding constraint, what is the most efficient way of improving the speed at which the code runs? I was considering taking a brute force approach and using concurrent.futures.ProcessPoolExecutor but perhaps there is a better solutions.
n = 100 # n is typically 2000
rets = pd.DataFrame(np.random.normal(0, 1., size=(n, n)))
cov_pd = ewma_cov_pd(rets)
The true time-series data can contain leading nulls and potentially missing values after that although the latter less likely.
Update I
A potential solution which leverages off the answer provided by Quang Hoang and produces the expected results in a far more reasonable time would be something similar to:
def ewma_cov_frame_qh(rets, alpha=0.06):
weights = (1-alpha) ** np.arange(len(df))[::-1]
normalized = (rets-rets.mean()).to_numpy()
out = (weights * normalized.T) # normalized / weights.sum()
return pd.DataFrame(out, index=rets.columns, columns=rets.columns)
def ewma_cov_qh(rets, alpha=0.06):
syms = rets.columns
covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
dates = delta.loc[delta != 0].index.tolist()
for date in dates:
frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
cov = ewma_cov_frame_qh(frame).reindex(index=syms, columns=syms)
covar = covar.fillna(cov)
return covar
cov_qh = ewma_cov_qh(rets)
This violates the requirement that the underlying covariance is calculated using the native Pandas/Numpy functions and calculation time will depend on the number leading na's in the data set.
Update II
A potential improvement on the above which uses (a naive implementation of) multiprocessing and improves the calculation time by a further 42.5% on my machine is listed below:
from concurrent.futures import ProcessPoolExecutor, as_completed
from functools import partial
def ewma_cov_mp_worker(date, rets, alpha=0.06):
syms = rets.columns
frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
return ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
def ewma_cov_mp(rets, alpha=0.06):
covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
dates = delta.loc[delta != 0].index.tolist()
func = partial(ewma_cov_mp_worker, rets=rets, alpha=alpha)
covs = {}
with ProcessPoolExecutor(max_workers=6) as exec:
future_to_date = {exec.submit(func, date): date for date in dates}
covs = {future_to_date[future]: future.result() for future in as_completed(future_to_date)}
for date in dates:
covar.fillna(covs[date], inplace=True)
return covar
[I have not added as answer as not addressed the original question and I am optimistic there is a better solution.]

since you don't really care for ewm, i.e, you only take the last value. We can try matrix multiplication:
def ewma(df, alpha=0.94):
weights = (1-alpha) ** np.arange(len(df))[::-1]
# fillna with 0 here
normalized = (df-df.mean()).fillna(0).to_numpy()
out = ((weights * normalized.T) # normalized / weights.sum()
return out
# verify
out = ewma(df)
print(out[0,1] == ewma_cov_pairwise(df[0],df[1]) )
# True
And this took about 150 ms on my system with df.shape==(2000,2000) while your code refuses to run within minutes :-).

Efficiently calculating grid-based point density in 3d point cloud

I have a 3d point cloud matrix, and I am trying to calculate the largest point density within a smaller volume inside the matrix. I am currently using a 3D grid-histogram system where I loop through every point in the matrix and increase the value of the corresponding grid square. Then, I can simply find the max value of the grid matrix.
I have already written code that works, but it is horribly slow for what I am trying to do
import numpy as np
def densityPointCloud(points, gridCount, gridSize):
hist = np.zeros((gridCount, gridCount, gridCount), np.uint16)
rndPoints = np.rint(points/gridSize) + int(gridCount/2)
rndPoints = rndPoints.astype(int)
for point in rndPoints:
if np.amax(point) < gridCount and np.amin(point) >= 0:
hist[point[0]][point[1]][point[2]] += 1
return hist
cloud = (np.random.rand(100000, 3)*10)-5
histogram = densityPointCloud(cloud , 50, 0.2)
print(np.amax(histogram))
Are there any shortcuts I can take to do this more efficiently?

Here's a start:
import numpy as np
import time
from collections import Counter
# if you need the whole histogram object
def dpc2(points, gridCount, gridSize):
hist = np.zeros((gridCount, gridCount, gridCount), np.uint16)
rndPoints = np.rint(points/gridSize) + int(gridCount/2)
rndPoints = rndPoints.astype(int)
inbounds = np.logical_and(np.amax(rndPoints,axis = 1) < gridCount, np.amin(rndPoints,axis = 1) >= 0)
for point in rndPoints[inbounds,:]:
hist[point[0]][point[1]][point[2]] += 1
return hist
# just care about a max point
def dpc3(points, gridCount, gridSize):
rndPoints = np.rint(points/gridSize) + int(gridCount/2)
rndPoints = rndPoints.astype(int)
inbounds = np.logical_and(np.amax(rndPoints,axis = 1) < gridCount,
np.amin(rndPoints,axis = 1) >= 0)
# cheap hashing
phashes = gridCount*gridCount*rndPoints[inbounds,0] + gridCount*rndPoints[inbounds,1] + rndPoints[inbounds,2]
max_h, max_v = Counter(phashes).most_common(1)[0]
max_coord = [(max_h // (gridCount*gridCount)) % gridCount,(max_h // gridCount) % gridCount,max_h % gridCount]
return (max_coord, max_v)
# TESTING
cloud = (np.random.rand(200000, 3)*10)-5
t1 = time.perf_counter()
hist1 = densityPointCloud(cloud , 50, 0.2)
t2 = time.perf_counter()
hist2 = dpc2(cloud,50,0.2)
t3 = time.perf_counter()
hist3 = dpc3(cloud,50,0.2)
t4 = time.perf_counter()
print(f"task 1: {round(1000*(t2-t1))}ms\ntask 2: {round(1000*(t3-t2))}ms\ntask 3: {round(1000*(t4-t3))}ms")
print(f"max value is {hist3[1]}, achieved at {hist3[0]}")
np.all(np.equal(hist1,hist2)) # check that results are identical
# check for equal max - histogram may be multi-modal so the point won't
# necessarily match
np.unravel_index(np.argmax(hist2, axis=None), hist2.shape)
The idea is to do all the if/and comparisons once: let numpy do them (effectively in C) rather then doing them 'manually' inside a Python loop. This also lets us only iterate over the points that will lead to hist being incremented.
You can also consider using a sparse data structure for hist if you think your cloud will have lots of empty space - memory allocation can become a bottleneck for very large data.
Did not scientifically benchmark this but appears to run ~2-3x faster (v2) and 6-8x faster (v3)! If you'd like all the points which are tied for the max. density, it would be easy to extract those from the Counter object.

Converting `for` loop that can't be vectorized to sparse matrix

There are 2 boxes and a small gap that allows 1 particle per second from one box to enter the other box. Whether a particle will go from A to B, or B to A depends on the ratio Pa/Ptot (Pa: number of particles in box A, Ptot: total particles in both boxes).
To make it faster, I need to get rid of the for loops, however I can't find a way to either vectorize them or turn them into a sparse matrix that represents my for loop:
What about for loops you can't vectorize? The ones where the result at iteration n depends on what you calculated in iteration n-1, n-2, etc. You can define a sparse matrix that represents your for loop and then do a sparse matrix solve.
But I can't figure out how to define a sparse matrix out of this. The simulation boils down to calculating:
where
is the piece that gives me trouble when trying to express my problem as described here. (Note: the contents in the parenthesis are a bool operation)
Questions:
Can I vectorize the for loop?
If not, how can I define a sparse matrix?
(bonus question) Why is the execution time x27 faster in Python (0.027s) than Octave (0.75s)?
Note: I implemented the simulation in both Python and Octave and will soon do it on Matlab, therefor the tags are correct.
Octave code
1; % starting with `function` causes errors
function arr = Px_simulation (Pa_init, Ptot, t_arr)
t_size = size(t_arr);
arr = zeros(t_size); % fixed size array is better than arr = []
rand_arr = rand(t_size); % create all rand values at once
_Pa = Pa_init;
for _j=t_arr()
if (rand_arr(_j) * Ptot > _Pa)
_Pa += 1;
else
_Pa -= 1;
endif
arr(_j) = _Pa;
endfor
endfunction
t = 1:10^5;
for _i=1:3
Ptot = 100*10^_i;
tic()
Pa_simulation = Px_simulation(Ptot, Ptot, t);
toc()
subplot(2,2,_i);
plot(t, Pa_simulation, "-2;simulation;")
title(strcat("{P}_{a0}=", num2str(Ptot), ',P=', num2str(Ptot)))
endfor
Python
import numpy
import matplotlib.pyplot as plt
import timeit
import cpuinfo
from random import random
print('\nCPU: {}'.format(cpuinfo.get_cpu_info()['brand']))
PARTICLES_COUNT_LST = [1000, 10000, 100000]
DURATION = 10**5
t_vals = numpy.linspace(0, DURATION, DURATION)
def simulation(na_initial, ntotal, tvals):
shape = numpy.shape(tvals)
arr = numpy.zeros(shape)
na_current = na_initial
for i in range(len(tvals)):
if random() > (na_current/ntotal):
na_current += 1
else:
na_current -= 1
arr[i] = na_current
return arr
plot_lst = []
for i in PARTICLES_COUNT_LST:
start_t = timeit.default_timer()
n_a_simulation = simulation(na_initial=i, ntotal=i, tvals=t_vals)
execution_time = (timeit.default_timer() - start_t)
print('Execution time: {:.6}'.format(execution_time))
plot_lst.append(n_a_simulation)
for i in range(len(PARTICLES_COUNT_LST)):
plt.subplot('22{}'.format(i))
plt.plot(t_vals, plot_lst[i], 'r')
plt.grid(linestyle='dotted')
plt.xlabel("time [s]")
plt.ylabel("Particles in box A")
plt.show()

IIUC you can use cumsum() in both Octave and Numpy:
Octave:
>> p = rand(1, 5);
>> r = rand(1, 5);
>> p
p =
0.43804 0.37906 0.18445 0.88555 0.58913
>> r
r =
0.70735 0.41619 0.37457 0.72841 0.27605
>> cumsum (2*(p<(r+0.03)) - 1)
ans =
1 2 3 2 1
>> (2*(p<(r+0.03)) - 1)
ans =
1 1 1 -1 -1
Also note that the following function will return values ([-1, 1]):

Python: rewrite a looping numpy math function to run on GPU

Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I used a few good days now trying to get my head around it but to no result. I wonder maybe somebody can help me rewrite this function in whatever way you may seem fit as log as I gives the same result at the end. I tried to use #jit from numba but for some reason it is actually much slower than running the code as usual. With a huge sample size, the goal is to decrease the execution time considerably so naturally I believe the GPU is the fastest way to do it.
I'll explain a little what is actually happening. The real data, which looks almost identical as the sample data created in the code below is divided into sample sizes of approx 5.000.000 rows each sample or around 150MB per file. In total there are around 600.000.000 rows or 20GB of data. I must loop through this data, sample by sample and then row by row in each sample, take the last 2000 (or another) rows as of each line and run the doTheMath function which returns a result. That result is then saved back to the hardrive where I can do some other things with it with another program. As you can see below, I do not need all of the results of all the rows, only those bigger than a specific amount. If I run my function as it is right now in python I get about 62seconds per 1.000.000 rows. This is a very long time considering all the data and how fast it should be done with.
I must mention that I upload the real data file by file to the RAM with the help of data = joblib.load(file) so uploading the data is not the problem as it takes only about 0.29 seconds per file. Once uploaded I run the entire code below. What takes the longest time is the doTheMath function. I am willing to give all of my 500 reputation points I have on stackoverflow as a reward for somebody willing to help me rewrite this simple code to run on the GPU. My interest is specifically in the GPU, I really want to see how it is done on this problem at hand.
EDIT/UPDATE 1:
Here is a link to a small sample of the real data: data_csv.zip About 102000 rows of real data1 and 2000 rows for real data2a and data2b. Use minimumLimit = 400 on the real sample data
EDIT/UPDATE 2:
For those following this post here is a short summary of the answers below. Up until now we have 4 answers to the original solution. The one offered by #Divakar are just tweaks to the original code. Of the two tweaks only the first one is actually applicable to this problem, the second one is a good tweak but does not apply here. Out of the other three answers, two of them are CPU based solutions and one tensorflow-GPU try. The Tensorflow-GPU by Paul Panzer seems to be promising but when i actually run it on the GPU it is slower than the original, so the code still needs improvement.
The other two CPU based solutions are submitted by #PaulPanzer (a pure numpy solution) and #MSeifert (a numba solution). Both solutions give very good results and both process data extremely fast compared to the original code. Of the two the one submitted by Paul Panzer is faster. It processes about 1.000.000 rows in about 3 seconds. The only problem is with smaller batchSizes, this can be overcome by either switching to the numba solution offered by MSeifert, or even the original code after all the tweaks that have been discussed below.
I am very happy and thankful to #PaulPanzer and #MSeifert for the work they did on their answers. Still, since this is a question about a GPU based solution, i am waiting to see if anybody is willing to give it a try on a GPU version and see how much faster the data can be processed on the GPU when compared to the current CPU solutions. If there will be no other answers outperforming #PaulPanzer's pure numpy solution then i'll accept his answer as the right one and gets the bounty :)
EDIT/UPDATE 3:
#Divakar has posted a new answer with a solution for the GPU. After my testings on real data, the speed is not even comparable to the CPU counterpart solutions. The GPU processes about 5.000.000 in about 1,5 seconds. This is incredible :) I am very excited about the GPU solution and i thank #Divakar for posting it. As well as i thank #PaulPanzer and #MSeifert for their CPU solutions :) Now my research continues with an incredible speed due to the GPU :)
import pandas as pd
import numpy as np
import time
def doTheMath(tmpData1, data2a, data2b):
A = tmpData1[:, 0]
B = tmpData1[:,1]
C = tmpData1[:,2]
D = tmpData1[:,3]
Bmax = B.max()
Cmin = C.min()
dif = (Bmax - Cmin)
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
#Declare variables
batchSize = 2000
sampleSize = 5000000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data
#Create Random Sample Data
data1 = np.matrix(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #upper limit
data2b = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #lower limit
#approx. half of data2a will be smaller than data2b, but that is only in the sample data because it is randomly generated, NOT the real data. The real data2a is always higher than data2b.
#Loop through the data
t0 = time.time()
for rowNr in range(data1.shape[0]):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
if(tmp_df.shape[0] == batchSize):
result = doTheMath(tmp_df, data2a, data2b)
if (result >= minimumLimit):
resultArray.append([rowNr , result])
print('Runtime:', time.time() - t0)
#Save data results
resultArray = np.array(resultArray)
print(resultArray[:,1].sum())
resultArray = pd.DataFrame({'index':resultArray[:,0], 'result':resultArray[:,1]})
resultArray.to_csv("Result Array.csv", sep=';')
The PC specs I am working on:
GTX970(4gb) video card;
i7-4790K CPU 4.00Ghz;
16GB RAM;
a SSD drive
running Windows 7;
As a side question, would a second video card in SLI help on this problem?

Introduction and solution code
Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule functionality that lets us write and compile CUDA kernels staying in Python environment.
Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.
Listing out the PyCUDA-NumPy solution code -
1] PyCUDA part -
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK
__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
int tid = threadIdx.x;
int inv = TBP;
__shared__ float dS[TBP][2];
dS[tid][0] = d1[tid+offset];
dS[tid][1] = d2[tid+offset];
__syncthreads();
if(tid<L-TBP)
{
dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
}
__syncthreads();
inv = inv/2;
while(inv!=0)
{
if(tid<inv)
{
dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
}
__syncthreads();
inv = inv/2;
}
__syncthreads();
if(tid==0)
{
out[0] = dS[0][0];
out[1] = dS[0][1];
}
__syncthreads();
}
__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
int L = BLOCKLEN[0];
int tid = threadIdx.x;
int iterID = blockIdx.x;
float Bmax_Cmin[2];
int inv;
float Cmin, dif;
__shared__ float dS[TBP*2];
get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);
Cmin = Bmax_Cmin[1];
dif = (Bmax_Cmin[0] - Cmin);
inv = TBP;
dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
__syncthreads();
if(tid<L-TBP)
dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);
dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
__syncthreads();
if(tid<L-TBP)
dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
__syncthreads();
inv = inv/2;
while(inv!=0)
{
if(tid<inv)
dS[tid] += dS[tid+inv];
__syncthreads();
inv = inv/2;
}
if(tid==0)
out[iterID] = dS[0];
__syncthreads();
}
""")
Please note that THREADS_PER_BLOCK, TBP is to be set based on the batchSize. The rule of thumb here is to assign power of 2 value to TBP that is just lesser than batchSize. Thus, for batchSize = 2000, we needed TBP as 1024.
2] NumPy part -
def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
func1 = mod.get_function("main1")
outlen = len(A)-batchSize+1
# Set block and grid sizes
BSZ = (1024,1,1)
GSZ = (outlen,1)
dest = np.zeros(outlen).astype(np.float32)
N = np.int32(batchSize)
func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
drv.In(data2b), drv.In(data2a),\
drv.In(N), block=BSZ, grid=GSZ)
idx = np.flatnonzero(dest >= minimumLimit)
return idx, dest[idx]
Benchmarking
I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.
Original approach -
def org_app(data1, batchSize, minimumLimit):
resultArray = []
for rowNr in range(data1.shape[0]-batchSize+1):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
result = doTheMath(tmp_df, data2a, data2b)
if (result >= minimumLimit):
resultArray.append([rowNr , result])
return resultArray
Timings and verification -
In [2]: #Declare variables
...: batchSize = 2000
...: sampleSize = 50000
...: resultArray = []
...: minimumLimit = 490 #use 400 on the real sample data
...:
...: #Create Random Sample Data
...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...:
...: # Make column copies
...: A = data1[:,0].copy()
...: B = data1[:,1].copy()
...: C = data1[:,2].copy()
...: D = data1[:,3].copy()
...:
...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
...: print(np.allclose(gpu_out1, cpu_out1))
...: print(np.allclose(gpu_out2, cpu_out2))
...:
True
False
So, there's some differences between CPU and GPU countings. Let's investigate them -
In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))
In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])
In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1., 1., 1.])
There are four instances of non-matching counts. These are off at max by 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.
Finally, puting the issue aside, let's get to the most important bit, the performance -
In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop
In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop
In [12]: 2180.0/82.5
Out[12]: 26.424242424242426
Let's try with bigger datasets. With sampleSize = 500000, we get -
In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop
In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop
In [16]: 23200.0/821
Out[16]: 28.25822168087698
So, the speedup stays constant at around 27.
Limitations :
1) We are using float32 numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.
Further improvement :
1) We could use faster constant memory to feed in data2a and data2b, rather than use global memory.

Tweak #1
Its usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible to optimize on the last step of summing.
We could replace the step that makes an array of 1s and 0s and does summing :
np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
with np.count_nonzero that works efficiently to count True values in a boolean array, instead of converting to 1s and 0s -
np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
Runtime test -
In [45]: abcd = np.random.randint(11,99,(10000))
In [46]: data2a = np.random.randint(11,99,(10000))
In [47]: data2b = np.random.randint(11,99,(10000))
In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
10000 loops, best of 3: 81.8 µs per loop
In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
10000 loops, best of 3: 28.8 µs per loop
Tweak #2
Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. Some more info here. Thus, store reciprocal of dif and use that instead at the step :
((((A - Cmin) / dif) + ((B - Cmin) / dif) + ...
Sample test -
In [52]: A = np.random.rand(10000)
In [53]: dif = 0.5
In [54]: %timeit A/dif
10000 loops, best of 3: 25.8 µs per loop
In [55]: %timeit A*(1.0/dif)
100000 loops, best of 3: 7.94 µs per loop
You have four places using division by dif. So, hopefully this would bring out noticeable boost there too!

Before you start tweaking the target (GPU) or using anything else (i.e. parallel executions ), you might want to consider how to improve the already existing code. You used the numba-tag so I'll use it to improve the code: First we operate on arrays not on matrices:
data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
Each time you call doTheMath you expect an integer back, however you use a lot of arrays and create a lot of intermediate arrays:
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
This creates an intermediate array each step:
tmp1 = A-Cmin,
tmp2 = tmp1 / dif,
tmp3 = B - Cmin,
tmp4 = tmp3 / dif
... you get the gist.
However this is a reduce function (array -> integer) so having a lot of intermediate arrays is unnecessary weight, just calculate the value of the "fly".
import numba as nb
#nb.njit
def doTheMathNumba(tmpData, data2a, data2b):
Bmax = np.max(tmpData[:, 1])
Cmin = np.min(tmpData[:, 2])
diff = (Bmax - Cmin)
idiff = 1 / diff
sum_ = 0
for i in range(tmpData.shape[0]):
val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
if val <= data2a[i] and val >= data2b[i]:
sum_ += 1
return sum_
I did something else here to avoid multiple operations:
(((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
= ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
= (A + B + C + D - 4 * Cmin) / (4 * dif)
= (A + B + C + D) / (4 * dif) - (Cmin / dif)
This actually cuts down the execution time by almost a factor of 10 on my computer:
%timeit doTheMath(tmp_df, data2a, data2b) # 1000 loops, best of 3: 446 µs per loop
%timeit doTheMathNumba(tmp_df, data2a, data2b) # 10000 loops, best of 3: 59 µs per loop
There are certainly also other improvements, for example using a rolling min/max to calculate Bmax and Cmin, that would make at least part of the calculation run in O(sampleSize) instead of O(samplesize * batchsize). This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated to do because the comparisons differ. But definitely possible! See here:
import time
import numpy as np
import numba as nb
#nb.njit
def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
diff = (Bmax - Cmin)
idiff = 1 / diff
quarter_idiff = 0.25 * idiff
sum_ = 0
for i in range(abcd.shape[0]):
val = abcd[i] * quarter_idiff - Cmin * idiff
if val <= data2a[i] and val >= data2b[i]:
sum_ += 1
return sum_
#nb.njit
def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
found = 0
for rowNr in range(data1.shape[0]):
if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
result = doTheMathNumba(abcd[rowNr:rowNr + batchSize],
data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
if (result >= minimumLimit):
resultArray[found, 0] = rowNr
resultArray[found, 1] = result
found += 1
return resultArray[:found]
#Declare variables
batchSize = 2000
sampleSize = 50000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data
data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
from scipy import ndimage
t0 = time.time()
abcd = np.sum(data1, axis=1)
Bmaxs = ndimage.maximum_filter1d(data1[:, 1],
size=batchSize,
origin=-((batchSize-1)//2-1)) # correction for even shapes
Cmins = ndimage.minimum_filter1d(data1[:, 2],
size=batchSize,
origin=-((batchSize-1)//2-1))
result = np.zeros((sampleSize, 2), dtype=np.int64)
doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
print('Runtime:', time.time() - t0)
This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original took had Runtime: 24.68975639343262. Now we're 30 times faster!
With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?
And even if I haven't done this myself, numba says that it's possible to write "Numba for CUDA GPUs" and it doesn't seem to complicated.

Here is some code to demonstrate what is possible by just tweaking the algorithm. It's pure numpy but on the sample data you posted gives a roughly 35x speedup over the original version (~1,000,000 samples in ~2.5sec on my rather modest machine):
>>> result_dict = master('run')
[('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
TOTAL 2.44821691513
Tweaks used:
A+B+C+D, see my other answer
running min/max, including avoiding to compute (A+B+C+D - 4Cmin)/(4dif) multiple times with the same Cmin/dif.
These are more or less routine. That leaves the comparison with data2a/b which is expensive O(NK) where N is the number of samples and K is the size of the window.
Here one can take advantage of the relatively well-behaved data. Using the running min/max one can create variants of data2a/b that can be used to test a range of window offsets at a time, if the test fails all these offsets can be ruled out immediately, otherwise the range is bisected.
import numpy as np
import time
# global variables; they will hold the precomputed pre-screening filters
preA, preB = {}, {}
CHUNK_SIZES = None
def sliding_argmax(data, K=2000):
"""compute the argmax of data over a sliding window of width K
returns:
indices -- indices into data
switches -- window offsets at which the maximum changes
(strictly speaking: where the index of the maximum changes)
excludes 0 but includes maximum offset (len(data)-K+1)
see last line of compute_pre_screening_filter for a recipe to convert
this representation to the vector of maxima
"""
N = len(data)
last = np.argmax(data[:K])
indices = [last]
while indices[-1] <= N - 1:
ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
if len(ge) == 0:
if last + K >= N:
break
last += 1 + np.argmax(data[last + 1 : last + K + 1])
indices.append(last)
else:
last += 1 + ge[0]
indices.append(last)
indices = np.array(indices)
switches = np.where(data[indices[1:]] > data[indices[:-1]],
indices[1:] + (1-K), indices[:-1] + 1)
return indices, np.r_[switches, [len(data)-K+1]]
def compute_pre_screening_filter(bound, n_offs):
"""compute pre-screening filter for point-wise upper bound
given a K-vector of upper bounds B and K+n_offs-1-vector data
compute K+n_offs-1-vector filter such that for each index j
if for any offset 0 <= o < n_offs and index 0 <= i < K such that
o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j
therefore the number of data points below filter is an upper bound for
the maximum number of points below bound in any K-window in data
"""
pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
indices, switches = sliding_argmax(padded, n_offs)
return padded[indices].repeat(np.diff(np.r_[[0], switches]))
def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
"""compute upper and lower pre-screening filters for data blocks of
sizes K+n_offs-1 where
n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk
the result is stored in global variables preA and preB
"""
global CHUNK_SIZES
CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
preA[1] = upper
preB[1] = lower
for n in CHUNK_SIZES:
preA[n] = compute_pre_screening_filter(upper, n)
preB[n] = -compute_pre_screening_filter(-lower, n)
def test_bounds(block, counts, threshold=400):
"""test whether the windows fitting in the data block 'block' fall
within the bounds using pre-screening for efficient bulk rejection
array 'counts' will be overwritten with the counts of compliant samples
note that accurate counts will only be returned for above threshold
windows, because the analysis of bulk rejected windows is short-circuited
also note that bulk rejection only works for 'well behaved' data and
for example not on random numbers
"""
N = len(counts)
K = len(preA[1])
r = N % CHUNK_SIZES[0]
# chop up N into as large as possible chunks with matching pre computed
# filters
# start with small and work upwards
counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
(block[l:l+K] >= preB[1]))
for l in range(r)]
def bisect(block, counts):
M = len(counts)
cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
if cnts < threshold:
counts[:] = cnts
return
elif M == CHUNK_SIZES[0]:
counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
(block[l:l+K] >= preB[1]))
for l in range(M)]
else:
M //= 2
bisect(block[:-M], counts[:M])
bisect(block[M:], counts[M:])
N = N // CHUNK_SIZES[0]
for M in CHUNK_SIZES:
if N % 2:
bisect(block[r:r+M+K-1], counts[r:r+M])
r += M
elif N == 0:
return
N //= 2
else:
for j in range(2*N):
bisect(block[r:r+M+K-1], counts[r:r+M])
r += M
def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
threshold=400):
samples, upper, lower = data
N, K = samples.shape[0], upper.shape[0]
times = [time.time()]
if use_pre_screening:
compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
times.append(time.time())
# compute switching points of max and min for running normalisation
upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
times.append(time.time())
# sum columns
ABCD = samples.sum(axis=-1)
times.append(time.time())
counts = np.empty((N-K+1,), dtype=int)
# main loop
# loop variables:
offs = 0
u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
while True:
# check which is switching next, min(C) or max(B)
if u_swp > l_swp:
# greedily take the largest block possible such that dif and Cmin
# do not change
block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
* (0.25 / (u_scale-l_scale))
if use_pre_screening:
test_bounds(block, counts[offs:l_swp], threshold=threshold)
else:
counts[offs:l_swp] = [
np.count_nonzero((block[l:l+K] <= upper) &
(block[l:l+K] >= lower))
for l in range(l_swp - offs)]
# book keeping
l_ind += 1
offs = l_swp
l_swp = lower_swp[l_ind]
l_scale = samples[lower_inds[l_ind], 2]
else:
block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
* (0.25 / (u_scale-l_scale))
if use_pre_screening:
test_bounds(block, counts[offs:u_swp], threshold=threshold)
else:
counts[offs:u_swp] = [
np.count_nonzero((block[l:l+K] <= upper) &
(block[l:l+K] >= lower))
for l in range(u_swp - offs)]
u_ind += 1
if u_ind == len(upper_inds):
assert u_swp == N-K+1
break
offs = u_swp
u_swp = upper_swp[u_ind]
u_scale = samples[upper_inds[u_ind], 1]
times.append(time.time())
return {'counts': counts, 'valid': np.where(counts >= 400)[0],
'timings': np.diff(times)}
def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
min_chnk=None, dyads=None):
t = time.time()
if data in ('fake', 'load'):
data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
usecols=[1,2,3,4])
data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
usecols=[1])
data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
usecols=[1])
if data == 'fake':
data1 = np.tile(data1, (10, 1))
threshold = 400
elif data == 'random':
data1 = np.random.random((102000, 4))
data2b = np.random.random(2000)
data2a = np.random.random(2000)
threshold = 490
if use_pre_screening or mode == 'calibrate':
print('WARNING: pre-screening not efficient on artificial data')
else:
raise ValueError("data mode {} not recognised".format(data))
data = data1, data2a, data2b
t_load = time.time() - t
if mode == 'calibrate':
min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
timings = np.zeros((len(min_chnk), len(dyads)))
print('max bisect ' + ' '.join([
' n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
end='')
for i, mc in enumerate(min_chnk):
print('\nmin chunk {}'.format(mc), end=' ')
for j, dy in enumerate(dyads):
for k in range(nrep):
if dy == 0: # no pre-screening
timings[i, j] += analyse(
data, False, mc, dy, threshold)['timings'][3]
else:
timings[i, j] += analyse(
data, True, mc, dy, threshold)['timings'][3]
timings[i, j] /= nrep
print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
timings.shape)
print('\nbest', min_chnk[best_mc], dyads[best_dy])
return timings, min_chnk[best_mc], dyads[best_dy]
if mode == 'run':
min_chnk = 2 if min_chnk is None else min_chnk
dyads = 5 if dyads is None else dyads
res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
times = np.r_[[t_load], res['timings']]
print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
print('TOTAL', times.sum())
return res

This is technically off-topic (not GPU) but I'm sure you'll be interested.
There is one obvious and rather large saving:
Precompute A + B + C + D (not in the loop, on the whole data: data1.sum(axis=-1)), because abcd = ((A+B+C+D) - 4Cmin) / (4dif). This will save quite a few ops.
Surprised nobody spotted that one before ;-)
Edit:
There is another thing, though I suspect that's only in your example, not in your real data:
As it stands roughly half of data2a will be smaller than data2b. In these places your conditions on abcd cannot be both True, so you needn't even compute abcd there.
Edit:
One more tweak I used below but forgot to mention: If you compute the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (happens roughly once in windowlength times, and even if it happens, you immediately know the new max) or the old max falls off the window
on the left (also happens roughly once in windowlength times). Only in this last case you have to search the entire window for the new max.
Edit:
Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you yourself have to test it for speed. Put "gpu" for "cpu" on the marked line.
On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1 If you do a pip3 install tensorflow-gpu today it should might work.
import numpy as np
import time
import tensorflow as tf
# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations
# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32
def worker(data1, data2a, data2b):
###################################
# CHANGE cpu to gpu in next line! #
###################################
with tf.device('/cpu:0'):
g = tf.Graph ()
with g.as_default():
ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
B = tf.constant(data1[:, 1], dtype=ddf)
C = tf.constant(data1[:, 2], dtype=ddf)
window = tf.constant(len(data2a))
N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
data2a = tf.constant(data2a, dtype=ddf)
data2b = tf.constant(data2b, dtype=ddf)
def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
# most of the time we can keep the old max/min
Bmaxind = tf.cond(Bmaxind<i,
lambda: i + tf.to_int32(
tf.argmax(B[i:i+window], axis=0)),
lambda: tf.cond(Bmax>B[i+window-1],
lambda: Bmaxind,
lambda: i+window-1))
Cminind = tf.cond(Cminind<i,
lambda: i + tf.to_int32(
tf.argmin(C[i:i+window], axis=0)),
lambda: tf.cond(Cmin<C[i+window-1],
lambda: Cminind,
lambda: i+window-1))
Bmax = B[Bmaxind]
Cmin = C[Cminind]
abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
out = out.write(i, tf.to_int32(
tf.count_nonzero(tf.logical_and(abcd <= data2a,
abcd >= data2b))))
return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
with tf.Session(graph=g) as sess:
i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
(tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
tf.Variable(-1, dtype=ddi),
tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
tf.TensorArray(ddi, size=N)),
shape_invariants=None,
parallel_iterations=parallel_iterations,
back_prop=False)
out = out.pack()
sess.run(tf.initialize_all_variables())
out, = sess.run((out,))
return out
#Declare variables
batchSize = 2000
sampleSize = 50000#00
resultArray = []
#Create Sample Data
data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
data2a = np.random.uniform(0, 1, (batchSize,))
data2b = np.random.uniform(0, 1, (batchSize,))
t0 = time.time()
out = worker(data1, data2a, data2b)
print('Runtime (tensorflow):', time.time() - t0)
good_indices, = np.where(out >= 490)
res_tf = np.c_[good_indices, out[good_indices]]
def doTheMath(tmpData1, data2a, data2b):
A = tmpData1[:, 0]
B = tmpData1[:,1]
C = tmpData1[:,2]
D = tmpData1[:,3]
Bmax = B.max()
Cmin = C.min()
dif = (Bmax - Cmin)
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
#Loop through the data
t0 = time.time()
for rowNr in range(sampleSize+1):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
result = doTheMath(tmp_df, data2a, data2b)
if (result >= 490):
resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))

Python/Numpy - Speeding up Monte Carlo method for radioactive decay

I am trying to optimize the generation of decay times for a radioactive isotope Monte Carlo.
That is given nsims atoms of an isotope with a halflife of t12, when does each isotope decay?
I tried to optimize this by generating random numbers for all un-decayed atoms at once using a single numpy.random.random call (I call this method parallel), but I hope that there is still more performance to be gained. I also show a method that does this calculation for each isotope individually (serial).
import numpy as np
import time
import matplotlib.pyplot as plt
import scipy.optimize
t12 = 3.1*60.
dt = 0.01
ln2 = np.log(2)
decay_exp = lambda t,A,tau: A * np.exp(-t/tau)
def serial(nsims):
sim_start_time = time.clock()
decay_time = np.zeros(nsims)
for i in range(nsims):
t = dt
while decay_time[i] == 0:
if np.random.random() > np.exp(-ln2*dt/t12):
decay_time[i] = t
t += dt
sim_end_time = time.clock()
return (sim_end_time - sim_start_time,decay_time)
def parallel(nsims):
sim_start_time = time.clock()
decay_time = np.zeros(nsims)
t = dt
while 0 in decay_time:
inot_decayed = np.where(decay_time == 0)[0]
idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
decay_time[inot_decayed[np.where(idecay_check==True)[0]]] = t
t += dt
sim_end_time = time.clock()
return (sim_end_time - sim_start_time,decay_time)
I'm interested in any suggestions that performs better than the parallel function that is pure python, i.e. not cython.
This method already improves greatly upon the serial method of calculating this for large nsims.

There are still some speed gains to be had from your original "parallel" (vectorized is the correct word) execution.
Remark, this is micro-management, but it does still give a small performance increase.
import numpy as np
t12 = 3.1*60.
dt = 0.01
ln2 = np.log(2)
s = 98765
def parallel(nsims): # your code, unaltered, except removed inaccurate timing method
decay_time = np.zeros(nsims)
t = dt
np.random.seed(s) # also had to add a seed to get comparable results
while 0 in decay_time:
inot_decayed = np.where(decay_time == 0)[0]
idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
decay_time[inot_decayed[np.where(idecay_check==True)[0]]] = t
t += dt
return decay_time
def parallel_micro(nsims): # micromanaged code
decay_time = np.zeros(nsims)
t = dt
half_time = np.exp(-ln2*dt/t12) # there was no need to calculate this again in every loop iteration
np.random.seed(s) # fixed seed to get comparable results
while 0 in decay_time:
inot_decayed = np.where(decay_time == 0)[0] # only here you need the call to np.where
# to my own surprise, len(some_array) is quicker than some_array.size (function lookup vs attribute lookup)
idecay_check = np.random.random(len(inot_decayed)) > half_time
decay_time[inot_decayed[idecay_check]] = t # no need for another np.where and certainly not for another boolean comparison
t += dt
return decay_time
You can run timing measurements with the timeit module. Profiling will tell you that the bottleneck here is the call to np.where.
Knowing that the bottleneck is np.where, you could get rid of it like this:
def parallel_micro2(nsims):
decay_time = np.zeros(nsims)
t = dt
half_time = np.exp(-ln2*dt/t12)
np.random.seed(s)
indices = np.where(decay_time==0)[0]
u = len(indices)
while u:
decayed = np.random.random(u) > half_time
decay_time[indices[decayed]] = t
indices = indices[np.logical_not(decayed)]
u = len(indices)
t += dt
return decay_time
And that does give a rather large speed increase:
In [2]: %timeit -n1 -r1 parallel_micro2(1e4)
1 loops, best of 1: 7.81 s per loop
In [3]: %timeit -n1 -r1 parallel_micro(1e4)
1 loops, best of 1: 29 s per loop
In [4]: %timeit -n1 -r1 parallel(1e4)
1 loops, best of 1: 33.5 s per loop
Don't forget to get rid of the call to np.random.seed when you're done optimizing.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Numpy array vs python list with time measurement - python

Related

EWMA Covariance Matrix in Pandas - Optimization

Efficiently calculating grid-based point density in 3d point cloud

Converting `for` loop that can't be vectorized to sparse matrix

Python: rewrite a looping numpy math function to run on GPU

Python/Numpy - Speeding up Monte Carlo method for radioactive decay

Categories

Resources