How to vectorise this for loop in Python?

My data has 4 integer columns A-D. I am adding a new column E whose first value equals the first value in column D. Each subsequent value in E should be the corresponding value in column D if the previous value in E is negative; otherwise it takes the corresponding value in column C.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

data = pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1 = data.copy()
data1['E'] = np.nan
data1.at[0, 'E'] = data1['D'][0]
l = len(data1)
for i in range(l - 1):
    if data1['E'][i] < 0:
        data1.at[i+1, 'E'] = data1['D'][i+1]
    else:
        data1.at[i+1, 'E'] = data1['C'][i+1]

TL;DR: go to the benchmark code and use Method 1.
Short Answer
No. Vectorization is not possible.
Long Answer
Theorem: For this particular task, the output of a given row cannot be determined using any finite-length backward rolling window smaller than the partial length up to this row.
Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of how vectorization is performed in CPUs.) The output can only be computed from the beginning of the dataframe.
Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window with size n < partial length, so a previous value of df["E"] exists before the window. We denote this previous value by state.
Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.
Case 1 (state < 0): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
Case 2 (state >= 0): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)
Therefore, it is possible for the output df["E"] of the target row to be dependent on a state variable outside the window. QED.
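To make this concrete, here is a minimal sketch (with hypothetical values, not the question's data) showing the same C/D window yielding different outputs depending on the incoming state:
def compute_e(state, C, D):
    # Sequentially apply the rule: E[i] = D[i] if previous E < 0 else C[i]
    out = []
    for c, d in zip(C, D):
        state = d if state < 0 else c
        out.append(state)
    return out

C = [-1, -1, -1]
D = [1, 1, 1]
print(compute_e(-1, C, D))  # incoming state < 0  -> [1, -1, 1]
print(compute_e(+1, C, D))  # incoming state >= 0 -> [-1, 1, -1]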
Useful Answer
Although vectorization is impossible, it does not mean that significant acceleration cannot be achieved. A simple yet very efficient approach is using a numba-compiled generator to perform the sequential generation. It only requires rewriting your logic as a generator function and adding two additional lines:
import numba

@numba.njit
def my_generator_func():
    ...
Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
Benchmark
The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) rows are generated for benchmarking purposes.
Result:
1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s
A speedup of more than 100x can be achieved over the original code.
Code
from datetime import datetime
import pandas as pd
import numpy as np

n = 10000000  # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
# print(df.head())

# ========== Method 1. generator + numba njit ==========
ti = datetime.now()

import numba

@numba.njit
def gen(plus: np.ndarray, minus: np.ndarray):
    l = len(plus)
    assert len(minus) == l
    # first row
    state = minus[0]
    yield state
    # second row to last
    for i in range(l - 1):
        state = minus[i + 1] if state < 0 else plus[i + 1]
        yield state

df["E"] = [i for i in gen(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s")

# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

def gen_plain(plus: np.ndarray, minus: np.ndarray):
    l = len(plus)
    assert len(minus) == l
    # first row
    state = minus[0]
    yield state
    # second row to last
    for i in range(l - 1):
        state = minus[i + 1] if state < 0 else plus[i + 1]
        yield state

df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s")

# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()
# code provided by the OP
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
l = len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i+1, 'E'] = df['D'][i+1]
    else:
        df.at[i+1, 'E'] = df['C'][i+1]
tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s")

I don't think you can vectorize this operation, as the rows depend on the results of previous calculations. That said, there is still quite some room for optimization in your code. Let's first check your original implementation with some random data.
import numpy as np
import pandas as pd
import time

size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size // 4, 4])

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
for i in range(len(df) - 1):
    if df['E'][i] < 0:
        df.at[i+1, 'E'] = df['D'][i+1]
    else:
        df.at[i+1, 'E'] = df['C'][i+1]
print(f"Operation on pd df took {time.time() - time_start} seconds.")
Output:
Operation on pd df took 84.00791883468628 seconds.
As operations on the DataFrame usually are quite slow, we can operate on the underlying numpy arrays instead.
time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values
e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df) - 1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)
df['E'] = e_vals
print(f"Operation on np array took {time.time() - time_start} seconds.")
Output:
Operation on np array took 2.2387869358062744 seconds.
Now, for loops are slow in Python, so we can use a JIT compiler that can deal with numpy arrays, for instance numba.
import numba

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals) - 1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)
    return e_vals

df["E"] = numba_calc(c_vals, d_vals)
print(f"Operation on np array with numba took {time.time() - time_start} seconds.")
Output:
Operation on np array with numba took 1.2623450756072998 seconds.
So especially for larger DataFrames, using numba will pay off, while operating on the raw numpy arrays alone already gives a nice runtime improvement.
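One caveat, as a hedged side note: the first call to a numba-jitted function includes the compilation time, so a fair benchmark should warm the function up first. A minimal sketch of the idea (this timing harness is illustrative, not the code used for the numbers above):
import time
import numpy as np
import numba

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals) - 1):
        last_e = d_vals[i + 1] if last_e < 0 else c_vals[i + 1]
        e_vals.append(last_e)
    return e_vals

c_vals = np.random.randn(1000)
d_vals = np.random.randn(1000)
numba_calc(c_vals, d_vals)   # warm-up call pays the JIT compilation cost
start = time.time()
numba_calc(c_vals, d_vals)   # this call measures only the compiled code
print(f"compiled call took {time.time() - start:.6f} seconds")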

Related

Accelerate parallel processing in Python

I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit

inputs = [range(50), range(90), range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list, names={"a", "b", "c"})
df = pd.DataFrame(index=Index)
df['Output'] = 0

start_p = timeit.timeit()
def Addition(A, B, C):
    df.loc[A, B, C]['Output'] = A + B + C
    return df.loc[A, B, C]['Output']

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition, inputs_list)  # specify the function and arguments to map
pool.close()
pool.join()
end_p = timeit.timeit()
print(end_p - start_p)

start_l = timeit.timeit()
for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A, B, C]['Output'] = A + B + C
end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict and then build the DataFrame from it; adding rows to a DataFrame one by one is slow.
And as DarkKnight mentioned in a comment, timeit does not make sense here. I use time.time() instead.
import time

start_l = time.time()
dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A, B, C] = A + B + C
df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])
end_l = time.time()
print(end_l - start_l)
0.26 sec on my machine
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
import time

start_l = time.time()
a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c
end_l = time.time()
print(end_l - start_l)
0.00044
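As a side note, the same vectorized result can be written with plain broadcasting instead of chained np.add.outer calls; a sketch under the same assumption of a well-ordered index:
import numpy as np

a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
# Broadcast to a (50, 90, 30) grid of sums, then flatten in C order,
# which matches the order of itertools.product(range(50), range(90), range(30)).
output = (a[:, None, None] + b[None, :, None] + c[None, None, :]).ravel()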

Optimize code for step function using only NumPy

I'm trying to optimize the function 'pw' in the following code using only NumPy functions (or perhaps list comprehensions).
from time import time
import numpy as np

def pw(x, udata):
    """
    Creates the step function

                  | 1,  if d0 <= x < d1
                  | 2,  if d1 <= x < d2
    pw(x, data) = | ...
                  | N,  if d(N-1) <= x < dN
                  | 0,  otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    vals = np.arange(1, udata.shape[0] + 1).reshape(udata.shape[0], 1)
    pw_func = np.sum(np.where(np.greater_equal(x, udata)*np.less(x, np.roll(udata, -1)), vals, 0), axis=0)
    return pw_func

N = 50000
x = np.linspace(0, 10, N)
data = [1, 3, 4, 5, 5, 7]
udata = np.unique(data)

ti = time()
pw(x, udata)
tf = time()
print(tf - ti)

import cProfile
cProfile.run('pw(x, udata)')
The cProfile.run is telling me that most of the overhead is coming from np.where (about 1 ms), but I'd like to create faster code if possible. It seems that performing the operations row-wise versus column-wise makes some difference, unless I'm mistaken, but I think I've accounted for it. I know that sometimes list comprehensions can be faster, but I couldn't figure out a faster way than what I'm doing now.
Searchsorted seems to yield better performance but that 1 ms still remains on my computer:
(modified)
def pw(xx, uu):
    """
    Creates the step function

                  | 1,  if d0 <= x < d1
                  | 2,  if d1 <= x < d2
    pw(x, data) = | ...
                  | N,  if d(N-1) <= x < dN
                  | 0,  otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1, uu.shape[0] + 1)
    pw_func = vals[inds[inds != uu.shape[0]]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0] - pw_func.shape[0] - num_mins)))
    return pw_func
This answer using piecewise seems pretty close, but that's on a scalar x0 and x1. How would I do it on arrays? And would it be more efficient?
Understandably, x may be pretty big but I'm trying to put it through a stress test.
I am still learning though so some hints or tricks that can help me out would be great.
EDIT
There seems to be a mistake in the second function, since its resulting array doesn't match that of the first one (which I'm confident works):
N1 = pw1(x,udata.reshape(udata.shape[0],1)).shape[0]
N2 = np.sum(pw1(x,udata.reshape(udata.shape[0],1)) == pw2(x,udata))
print(N1 - N2)
yields
15000
data points that are not the same. So it seems that I don't know how to use 'searchsorted'.
EDIT 2
Actually I fixed it:
pw_func = vals[inds[inds != uu.shape[0]]]
was changed to
pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
so at least the resulting arrays match. But the question remains whether there's a more efficient way of doing this.
EDIT 3
Thanks Tin Lai for pointing out the mistake. This one should work
pw_func = vals[inds[(inds != uu.shape[0])*(inds != 0)]-1]
Maybe a more readable way of presenting it would be
non_endpts = (inds != uu.shape[0])*(inds != 0) # only consider the points in between the min/max data values
shift_inds = inds[non_endpts]-1 # searchsorted side='right' includes the left end point and not right end point so a shift is needed
pw_func = vals[shift_inds]
I think I got lost in all those brackets! I guess that's the importance of readability.
A very abstract yet interesting problem! Thanks for entertaining me, I had fun :)
p.s. I'm not sure about your pw2; I wasn't able to get it to output the same as pw1.
For reference the original pws:
def pw1(x, udata):
    vals = np.arange(1, udata.shape[0] + 1).reshape(udata.shape[0], 1)
    pw_func = np.sum(np.where(np.greater_equal(x, udata)*np.less(x, np.roll(udata, -1)), vals, 0), axis=0)
    return pw_func

def pw2(xx, uu):
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1, uu.shape[0] + 1)
    pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)] - 1]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0] - pw_func.shape[0] - num_mins)))
    return pw_func
My first attempt utilised a lot of broadcasting operations from numpy:
def pw3(x, udata):
    # the None slice is to create a new axis
    step_bool = x >= udata[None, :].T
    # we exploit the fact that bools have integer value 1,
    # skipping the last value in "data"
    step_vals = np.sum(step_bool[:-1], axis=0)
    # for the step_bool that we skipped from the previous step (last index)
    # we set it to zero so that we can negate the step_vals once we reach
    # the last value in "data"
    step_vals[step_bool[-1]] = 0
    return step_vals
After looking at the searchsorted from your pw2, I had a new approach that utilises it with much higher performance:
def pw4(x, udata):
    inds = np.searchsorted(udata, x, side='right')
    # fix up the last bin if x is already out of range of data[-1]
    if x[-1] > udata[-1]:
        inds[inds == inds[-1]] = 0
    return inds
Plots with:
plt.plot(pw1(x,udata.reshape(udata.shape[0],1)), label='pw1')
plt.plot(pw2(x,udata), label='pw2')
plt.plot(pw3(x,udata), label='pw3')
plt.plot(pw4(x,udata), label='pw4')
With data = [1,3,4,5,5,7] and with data = [1,3,4,5,5,7,11] (plots omitted), pw1, pw3, and pw4 are all identical:
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw3(x,udata)))
>>> True
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw4(x,udata)))
>>> True
Performance (Timer.repeat runs 3 times by default; each value is the total time for number=1000 executions):
print(timeit.Timer('pw1(x,udata.reshape(udata.shape[0],1))', "from __main__ import pw1, x, udata").repeat(number=1000))
>>> [3.1938983199979702, 1.6096494779994828, 1.962694135003403]
print(timeit.Timer('pw2(x,udata)', "from __main__ import pw2, x, udata").repeat(number=1000))
>>> [0.6884554479984217, 0.6075002400029916, 0.7799002879983163]
print(timeit.Timer('pw3(x,udata)', "from __main__ import pw3, x, udata").repeat(number=1000))
>>> [0.7369808239964186, 0.7557657590004965, 0.8088172269999632]
print(timeit.Timer('pw4(x,udata)', "from __main__ import pw4, x, udata").repeat(number=1000))
>>> [0.20514375300263055, 0.20203858999957447, 0.19906871100101853]

Optimizing Matrix Traversal/General Code Optimization

I have two matrices. One is of size (CxK) and another is of size (SxK) (where S, C, and K all have the potential to be very large). I want to combine these into an output matrix using the cosine similarity function (the output would be of size [CxS]). When I run my code, it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note: the two input matrices are often very sparse.]
I was previously traversing each matrix using two for index,row loops, but I have since switched to the while loops, which improved my run time significantly.
A  # this is one of my input matrices (pandas dataframe)
B  # this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns=['col_1', 'col_2', 'col_3'])
i = 0
k = 0
while i <= 5:
    col_1 = A.iloc[i].get('label_A')
    while k < 5:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
Right now the loops run on only 5 items from each matrix, producing a 5x5 matrix, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this, so please let me know if any facet of the code can be improved (data types used to hold the matrices, how to traverse them, updating the output matrix, etc.).
Thank you in advance.
This can be done much more easily and much faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time

c = 50
s = 50
k = 100
A = pd.DataFrame(np.random.rand(c, k))
B = pd.DataFrame(np.random.rand(s, k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()

# your program
start = time.time()
i = 0
k = 0
while i < c:
    col_1 = A.iloc[i].get('label_A')
    while k < s:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
print(f'elementwise: {time.time() - start:7.3f} s')

# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1', 'col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')

# verification
assert (C[['col_1', 'col_2']].to_numpy() == C1[['col_1', 'col_2']].to_numpy()).all() \
    and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())
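A follow-up on the question's remark that the inputs are often very sparse: sklearn's cosine_similarity also accepts scipy.sparse matrices directly, so you can avoid densifying the data. A hedged sketch with random sparse inputs (not the OP's data):
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

c, s, k = 50, 50, 100
A_sparse = sparse.random(c, k, density=0.05, format='csr', random_state=0)
B_sparse = sparse.random(s, k, density=0.05, format='csr', random_state=1)

# cosine_similarity works on CSR input without converting to dense first;
# the result is a dense (c, s) ndarray of pairwise similarities.
sim = cosine_similarity(A_sparse, B_sparse)
print(sim.shape)  # (50, 50)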

Numba @jit(nopython=True) function offers no speed improvement on heavy Numpy function

I am currently running test_matrix_speed() to see how fast my search_and_book_availability function is. Using the PyCharm profiler I can see that each search_and_book_availability call averages 0.001 ms. Having the Numba @jit(nopython=True) decorator makes no difference to the performance of this function. Is this because there are no improvements to be had and Numpy is operating as fast as possible here? (I don't care about the speed of the generate_searches function.)
Here's the code I'm running
import random
import numpy as np
from numba import jit

def generate_searches(number, sim_start, sim_end):
    searches = []
    for i in range(number):
        start_slot = random.randint(sim_start, sim_end - 1)
        end_slot = random.randint(start_slot + 1, sim_end)
        searches.append((start_slot, end_slot))
    return searches

@jit(nopython=True)
def search_and_book_availability(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    output = np.where(np.sum(search_slice, axis=1) == 0)[0]
    number_of_bookable_vecs = output.size
    if number_of_bookable_vecs > 0:
        if number_of_bookable_vecs == 1:
            id_to_book = output[0]
        else:
            id_to_book = np.random.choice(output)
        matrix[id_to_book, search_start:search_end] = 1
        return True
    else:
        return False

def test_matrix_speed():
    shape = (10, 1440)
    matrix = np.zeros(shape)
    sim_start = 0
    sim_end = 1440
    searches = generate_searches(1000000, sim_start, sim_end)
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)
Using your function and the following code to profile the speed:
import time

shape = (10, 1440)
matrix = np.zeros(shape)
sim_start = 0
sim_end = 1440
searches = generate_searches(1000000, sim_start, sim_end)

def reset():
    matrix[:] = 0

def test_matrix_speed():
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)

def timeit(func):
    # warmup
    reset()
    func()
    reset()
    start = time.time()
    func()
    end = time.time()
    return end - start

print(timeit(test_matrix_speed))
I find on the order of 11.5 s for the jitted version and 7.5 s without jit. I am no expert on numba, but what it is made for is optimizing numerical code written in a non-vectorized way, in particular explicit for loops. In your code there are none; you only use vectorized operations. Therefore I expected jit not to outperform the baseline solution, though I must admit I am surprised to see it that much worse. If you're looking to optimize your solution, you can cut the execution time (at least on my PC) with the following code:
def search_and_book_availability_opt(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    # We don't need to sum in order to check if all elements are 0.
    # ndarray.any() can use short-circuiting and is therefore faster.
    # Also, we don't need the selected values from np.where, only the
    # indexes, so np.nonzero is faster.
    bookable, = np.nonzero(~search_slice.any(axis=1))
    # short circuit
    if bookable.size == 0:
        return False
    # we can perform random choice even if size is 1
    id_to_book = np.random.choice(bookable)
    matrix[id_to_book, search_start:search_end] = 1
    return True
and by initializing matrix as np.zeros(shape, dtype=np.bool) instead of the default float64. I am able to get execution times of around 3.8 s, a ~50% improvement over your unjitted solution and a ~70% improvement over the jitted version. Hope that helps.
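To illustrate the point above about explicit loops being what numba is made for, here is a hedged sketch of a loop-based variant that numba can actually accelerate. Note it books the first free row rather than a random one, so it is not a drop-in replacement for the original semantics:
import numpy as np
from numba import jit

@jit(nopython=True)
def search_and_book_loop(matrix, search_start, search_end):
    # Explicit loops let numba stop scanning a row at the first booked
    # slot, which the vectorized np.sum/np.where version cannot do.
    for row in range(matrix.shape[0]):
        free = True
        for col in range(search_start, search_end):
            if matrix[row, col] != 0:
                free = False
                break
        if free:
            matrix[row, search_start:search_end] = 1
            return True
    return False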

Improving runtime of python numpy code

I have code which reassigns bins to a large numpy array. Basically, the elements of the large array have been sampled at different frequencies and the final goal is to rebin the entire array at fixed bins freq_bins. The code is kind of slow for the array I have. Is there any good way to improve its runtime? A factor of a few would do for now. Maybe some numba magic would do.
import numpy as np
import time

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

time1 = time.time()
for i in xrange(division):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    sky_by_cap = np.einsum('ij, jk->ijk', boost_factor[i], es)
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time() - time1)
Keep the code simple, then optimize
If you have an idea of what algorithm you want to code, write a simple reference implementation. From there you can go two ways using Python: you can try to vectorize the code, or you can compile the code to get good performance.
Even if np.einsum or np.add.at were implemented in Numba, it would be very hard for any compiler to make efficient binary code from your example.
The only thing I have rewritten is a more efficient approach to digitize for scalar values.
Edit
In the Numba source code there is a more efficient implementation of digitize for scalar values.
Code
#From Numba source
#Copyright (c) 2012, Anaconda, Inc.
#All rights reserved.
#nb.njit(fastmath=True)
def digitize(x, bins, right=False):
# bins are monotonically-increasing
n = len(bins)
lo = 0
hi = n
if right:
if np.isnan(x):
# Find the first nan (i.e. the last from the end of bins,
# since there shouldn't be many of them in practice)
for i in range(n, 0, -1):
if not np.isnan(bins[i - 1]):
return i
return 0
while hi > lo:
mid = (lo + hi) >> 1
if bins[mid] < x:
# mid is too low => narrow to upper bins
lo = mid + 1
else:
# mid is too high, or is a NaN => narrow to lower bins
hi = mid
else:
if np.isnan(x):
# NaNs end up in the last bin
return n
while hi > lo:
mid = (lo + hi) >> 1
if bins[mid] <= x:
# mid is too low => narrow to upper bins
lo = mid + 1
else:
# mid is too high, or is a NaN => narrow to lower bins
hi = mid
return lo
#nb.njit(fastmath=True)
def digitize(value, bins):
if value<bins[0]:
return 0
if value>=bins[bins.shape[0]-1]:
return bins.shape[0]
for l in range(1,bins.shape[0]):
if value>=bins[l-1] and value<bins[l]:
return l
#nb.njit(fastmath=True,parallel=True)
def inner_loop(boost_factor,freq_bins,es):
res=np.zeros((boost_factor.shape[0],freq_bins.shape[0]),dtype=np.float64)
for i in nb.prange(boost_factor.shape[0]):
for j in range(boost_factor.shape[1]):
for k in range(freq_bins.shape[0]):
ind=nb.int64(digitize(boost_factor[i,j]*freq_bins[k],freq_bins))
res[i,ind]+=boost_factor[i,j]*es[j,k]*freq_bins[ind]
return res
#nb.njit(fastmath=True)
def calc_nb(division,freq_division,cd,boost_factor,freq_bins,es):
final_emit = np.empty((division, division, freq_division),np.float64)
for i in range(division):
final_emit[i,:,:]=inner_loop(boost_factor[i],freq_bins,es)
return final_emit
Performance
(Quadcore i7)
original_code: 118.5s
calc_nb: 4.14s
#with digitize implementation from Numba source
calc_nb: 2.66s
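A quick sanity check one might run (assuming the arrays and the reference final_emit from the question's code are still in scope) to confirm the compiled version matches the original loop:
final_emit_nb = calc_nb(division, freq_division, cd, boost_factor, freq_bins, es)
print(np.allclose(final_emit, final_emit_nb))  # should print True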
This seems to be trivially parallelizable:
You've got an outer loop that you run 90 times.
Each time, you're not mutating any shared arrays except final_emit
… and that, only to store into a unique row.
It looks like most of the work inside the loop is numpy array-wide operations, which will release the GIL.
So (using the futures backport of concurrent.futures, since you seem to be on 2.7):
import numpy as np
import time
import futures

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    return np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    for i, row in enumerate(x.map(dostuff, xrange(division))):
        final_emit[i] = row
If this works, there are two tweaks to try, either of which might be more efficient. We don't really care which order the results come back in, but map queues them up in order. This can waste a bit of space and time. I don't think it will make much difference (presumably, the vast majority of your time is spent doing the calculations, not writing out the results), but without profiling your code, it's hard to be sure. So, there are two easy ways around this problem.
Using as_completed lets us use the results in whatever order they finish, rather than in the order we queued them. Something like this:
def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    return i, np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    fs = [x.submit(dostuff, i) for i in xrange(division)]
    # as_completed yields futures, so unpack the (i, row) result from each
    for f in futures.as_completed(fs):
        i, row = f.result()
        final_emit[i] = row
Alternatively, we can make the function insert the rows directly, instead of returning them. This means we're now mutating a shared object from multiple threads. So I think we need a lock here, although I'm not positive (numpy's rules are a bit complicated, and I haven't read your code that thoroughly…). But that probably won't hurt performance significantly, and it's easy. So:
import numpy as np
import threading
# etc.

final_emit = np.zeros((division, division, freq_division))
final_emit_lock = threading.Lock()

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    with final_emit_lock:
        final_emit[i] = np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    x.map(dostuff, xrange(division))
That max_workers=8 in all of my examples should be tuned for your machine. Too many threads is bad, because they start fighting each other instead of parallelizing; too few threads is even worse, because some of your cores just sit there idle.
If you want this to run on a variety of machines, rather than tuning it for each one, the best guess (for 2.7) is usually:
import multiprocessing
# ...
with futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as x:
But if you want to squeeze the max performance out of a specific machine, you should test different values. In particular, for a typical quad-core laptop with hyperthreading, the ideal value can be anywhere from 4 to 8, depending on the exact work you're doing, and it's easier to just try all the values than to try to predict.
I think you get a small boost in the performance by replacing einsum with actual multiplication.
import numpy as np
import time

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

time1 = time.time()
for i in xrange(division):
    fre_boost = boost_factor[i][:, :, None]*freq_bins[None, None, :]
    sky_by_cap = boost_factor[i][:, :, None]*es[None, :, :]
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time() - time1)
Your code is rather slow at np.add.at, which I believe can be much faster with np.bincount, although I couldn't quite get it to work for the multidimensional arrays you have. Maybe someone here can add to that.
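For what it's worth, one way to make np.bincount work in the 2-D case is to offset each row's bin indices into disjoint ranges and call a single flat bincount; a hedged sketch of the idea (the helper name is mine, not from the question):
import numpy as np

def binned_row_sums(indices, weights, n_bins):
    # Sum `weights` into `n_bins` bins per row with one flat bincount.
    # indices, weights: 2-D arrays of the same shape (n_rows, n_cols).
    n_rows = indices.shape[0]
    # Offset each row's bin indices so the rows land in disjoint ranges.
    flat = (np.arange(n_rows)[:, None] * n_bins + indices).ravel()
    out = np.bincount(flat, weights=weights.ravel(), minlength=n_rows * n_bins)
    return out.reshape(n_rows, n_bins)

# Tiny check against the np.add.at pattern used above
idx = np.random.randint(0, 4, size=(3, 5))
w = np.random.rand(3, 5)
ref = np.zeros((3, 4))
np.add.at(ref, (np.arange(3)[:, None], idx), w)
print(np.allclose(ref, binned_row_sums(idx, w, 4)))  # True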
