I have code that reassigns bins in a large NumPy array. Basically, the elements of the large array have been sampled at different frequencies, and the final goal is to rebin the entire array at the fixed bins freq_bins. The code is kind of slow for the array I have. Is there any good way to improve its runtime? A factor of a few would do for now. Maybe some numba magic would help.
import numpy as np
import time

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

time1 = time.time()
for i in xrange(division):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    sky_by_cap = np.einsum('ij, jk->ijk', boost_factor[i], es)
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time()-time1)
Keep the code simple, then optimize.
If you have an idea of what algorithm you want to code, write a simple reference implementation. From there you can go two ways in Python: you can try to vectorize the code, or you can compile it to get good performance.
Even if np.einsum and np.add.at were implemented in Numba, it would be very hard for any compiler to make efficient binary code from your example.
The only thing I have rewritten is a more efficient version of digitize for scalar values.
Edit
In the Numba source code there is a more efficient implementation of digitize for scalar values.
Code
import numpy as np
import numba as nb

# From Numba source
# Copyright (c) 2012, Anaconda, Inc.
# All rights reserved.
@nb.njit(fastmath=True)
def digitize(x, bins, right=False):
    # bins are monotonically-increasing
    n = len(bins)
    lo = 0
    hi = n

    if right:
        if np.isnan(x):
            # Find the first nan (i.e. the last from the end of bins,
            # since there shouldn't be many of them in practice)
            for i in range(n, 0, -1):
                if not np.isnan(bins[i - 1]):
                    return i
            return 0
        while hi > lo:
            mid = (lo + hi) >> 1
            if bins[mid] < x:
                # mid is too low => narrow to upper bins
                lo = mid + 1
            else:
                # mid is too high, or is a NaN => narrow to lower bins
                hi = mid
    else:
        if np.isnan(x):
            # NaNs end up in the last bin
            return n
        while hi > lo:
            mid = (lo + hi) >> 1
            if bins[mid] <= x:
                # mid is too low => narrow to upper bins
                lo = mid + 1
            else:
                # mid is too high, or is a NaN => narrow to lower bins
                hi = mid

    return lo
# Simpler scalar digitize (the version originally used; the Numba-source
# version above is measurably faster)
@nb.njit(fastmath=True)
def digitize(value, bins):
    if value < bins[0]:
        return 0
    if value >= bins[bins.shape[0]-1]:
        return bins.shape[0]
    for l in range(1, bins.shape[0]):
        if value >= bins[l-1] and value < bins[l]:
            return l
@nb.njit(fastmath=True, parallel=True)
def inner_loop(boost_factor, freq_bins, es):
    res = np.zeros((boost_factor.shape[0], freq_bins.shape[0]), dtype=np.float64)
    for i in nb.prange(boost_factor.shape[0]):
        for j in range(boost_factor.shape[1]):
            for k in range(freq_bins.shape[0]):
                ind = nb.int64(digitize(boost_factor[i, j]*freq_bins[k], freq_bins))
                res[i, ind] += boost_factor[i, j]*es[j, k]*freq_bins[ind]
    return res

@nb.njit(fastmath=True)
def calc_nb(division, freq_division, cd, boost_factor, freq_bins, es):
    final_emit = np.empty((division, division, freq_division), np.float64)
    for i in range(division):
        final_emit[i, :, :] = inner_loop(boost_factor[i], freq_bins, es)
    return final_emit
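A usage sketch with the arrays from the question (the name final_emit_nb is mine):

final_emit_nb = calc_nb(division, freq_division, cd, boost_factor, freq_bins, es)
# optional sanity check against the original loop's output:
# np.allclose(final_emit, final_emit_nb)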
Performance
(Quad-core i7)
original code: 118.5s
calc_nb: 4.14s
calc_nb (with the digitize implementation from the Numba source): 2.66s
This seems to be trivially parallelizable:
You've got an outer loop that you run 90 times.
Each time, you're not mutating any shared arrays except final_emit
… and that, only to store into a unique row.
It looks like most of the work inside the loop is numpy array-wide operations, which will release the GIL.
So (using the futures backport of concurrent.futures, since you seem to be on 2.7):
import numpy as np
import time
import futures

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    return np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    for i, row in enumerate(x.map(dostuff, xrange(division))):
        final_emit[i] = row
If this works, there are two tweaks to try, either of which might be more efficient. We don't really care which order the results come back in, but map queues them up in order. This can waste a bit of space and time. I don't think it will make much difference (presumably the vast majority of your time is spent doing the calculations, not writing out the results), but without profiling your code it's hard to be sure. So, there are two easy ways around this problem.
Using as_completed lets us use the results in whatever order they finish, rather than in the order we queued them. Something like this:
def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    return i, np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    fs = [x.submit(dostuff, i) for i in xrange(division)]
    for f in futures.as_completed(fs):
        i, row = f.result()
        final_emit[i] = row
Alternatively, we can make the function insert the rows directly, instead of returning them. This means we're now mutating a shared object from multiple threads. So I think we need a lock here, although I'm not positive (numpy's rules are a bit complicated, and I haven't read your code that thoroughly…). But that probably won't hurt performance significantly, and it's easy. So:
import numpy as np
import threading
# etc.

final_emit = np.zeros((division, division, freq_division))
final_emit_lock = threading.Lock()

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    with final_emit_lock:
        final_emit[i] = np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    x.map(dostuff, xrange(division))
That max_workers=8 in all of my examples should be tuned for your machine. Too many threads is bad, because they start fighting each other instead of parallelizing; too few threads is even worse, because some of your cores just sit there idle.
If you want this to run on a variety of machines, rather than tuning it for each one, the best guess (for 2.7) is usually:
import multiprocessing
# ...
with futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as x:
But if you want to squeeze the max performance out of a specific machine, you should test different values. In particular, for a typical quad-core laptop with hyperthreading, the ideal value can be anywhere from 4 to 8, depending on the exact work you're doing, and it's easier to just try all the values than to try to predict.
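A minimal sketch of that trial loop, reusing dostuff and division from the examples above (2.7-style, hence xrange):

import time
for workers in (2, 4, 6, 8):
    t0 = time.time()
    with futures.ThreadPoolExecutor(max_workers=workers) as x:
        list(x.map(dostuff, xrange(division)))  # drain the iterator so all work finishes
    print(workers, time.time() - t0)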
I think you get a small performance boost by replacing einsum with explicit broadcasting multiplication.
import numpy as np
import time

division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

time1 = time.time()
for i in xrange(division):
    fre_boost = boost_factor[i][:, :, None]*freq_bins[None, None, :]
    sky_by_cap = boost_factor[i][:, :, None]*es[None, :, :]
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time()-time1)
Your code is rather slow at np.add.at, which I believe can be much faster with np.bincount, although I couldn't quite get it to work for the multidimensional arrays you have. Maybe someone here can add to that.
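For what it's worth, here is a hedged sketch of that bincount idea: flatten the (row, bin) index pairs into a single 1-D index so that one np.bincount call replaces np.add.at. Variable names reuse the loop body above; it assumes no digitized index equals len(freq_bins), which holds here because fre_boost stays strictly below freq_bins[-1]:

n_rows = freq_index_reshaped.shape[0]   # division * cd
n_bins = freq_bins.shape[0]
flat_idx = (row_index * n_bins + freq_index_reshaped).ravel()
to_bin_emit = np.bincount(flat_idx,
                          weights=sky_by_cap_reshaped.ravel(),
                          minlength=n_rows * n_bins).reshape(n_rows, n_bins)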
Related
My data has 4 columns A-D containing integers. I am adding a new column E, whose first value is the same as the first value in column D. Each subsequent value in E is the corresponding value in column D if the previous value in E is negative; otherwise it is the corresponding value in column C.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

data = pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1 = data.copy()
data1['E'] = np.nan
data1.at[0, 'E'] = data1['D'][0]
l = len(data1)
for i in range(l-1):
    if data1['E'][i] < 0:
        data1.at[i+1, 'E'] = data1['D'][i+1]
    else:
        data1.at[i+1, 'E'] = data1['C'][i+1]
TL;DR: go to the benchmark code and use Method 1.
Short Answer
No. Vectorization is not possible.
Long Answer
Theorem: For this particular task, the output of a given row cannot be determined using a backward rolling window of any finite length smaller than the partial length up to that row.
Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of how vectorization is performed in CPUs.) The output can only be computed from the beginning of the dataframe.
Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window with size n < partial length, so a previous value of df["E"] exists before the window. We denote this previous value by state.
Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.
Case 1 (state < 0): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
Case 2 (state >= 0): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)
Therefore, it is possible for the output df["E"] of the target row to be dependent on a state variable outside the window. QED.
Useful Answer
Although vectorization is impossible, that does not mean significant acceleration cannot be achieved. A simple yet very efficient approach is using a numba-compiled generator to perform the sequential generation. It only requires rewriting your logic as a generator function and adding two additional lines:
import numba

@numba.njit
def my_generator_func():
    ....
Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
Benchmark
The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) records are generated for benchmarking purposes.
Result:
1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s
> 100x efficiency gain can be achieved against the original code.
Code
from datetime import datetime
import pandas as pd
import numpy as np

n = 10000000  # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
# print(df.head())

# ========== Method 1. generator + numba njit ==========
ti = datetime.now()
import numba

@numba.njit
def gen(plus: np.array, minus: np.array):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s")

# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

def gen_plain(plus: np.array, minus: np.array):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s")

# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

# code provided by the OP
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
l = len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i+1, 'E'] = df['D'][i+1]
    else:
        df.at[i+1, 'E'] = df['C'][i+1]

tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s")
I don't think you can vectorize this operation, as each row depends on the results of previous calculations. That said, there is still quite some room for optimization in your functionality. Let's first check your original implementation with some random data.
import numpy as np
import pandas as pd
import time

size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size//4, 4])

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
for i in range(len(df)-1):
    if df['E'][i] < 0:
        df.at[i+1, 'E'] = df['D'][i+1]
    else:
        df.at[i+1, 'E'] = df['C'][i+1]
print(f"Operation on pd df took {time.time() - time_start} seconds.")
Output:
Operation on pd df took 84.00791883468628 seconds.
As operations on the DataFrame usually are quite slow, we can operate on the underlying numpy arrays instead.
time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values
e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df)-1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)
df['E'] = e_vals
print(f"Operation on np array took {time.time() - time_start} seconds.")
Output:
Operation on np array took 2.2387869358062744 seconds.
Now we can argue that for loops are slow in Python and use a JIT compiler that can deal with numpy arrays, for instance numba.
import numba

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals)-1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)
    return e_vals

df["E"] = numba_calc(c_vals, d_vals)
print(f"Operation on np array with numba took {time.time() - time_start} seconds.")
Output:
Operation on np array with numba took 1.2623450756072998 seconds.
So especially for larger DataFrames using numba will pay off, while operating on the raw numpy arrays already gives a nice runtime improvement.
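One further tweak worth trying (an assumption on my part, not benchmarked here): preallocating a NumPy array instead of growing a Python list usually plays nicer with numba's nopython mode. A sketch, reusing the imports above:

@numba.jit(nopython=True)
def numba_calc_prealloc(c_vals, d_vals):
    # hypothetical variant: fill a preallocated array instead of appending to a list
    e_vals = np.empty(len(c_vals), dtype=c_vals.dtype)
    e_vals[0] = d_vals[0]
    for i in range(len(c_vals) - 1):
        if e_vals[i] < 0:
            e_vals[i + 1] = d_vals[i + 1]
        else:
            e_vals[i + 1] = c_vals[i + 1]
    return e_vals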
We have a 2D list; we can convert it into anything if necessary. Each row contains some positive integers (deltas of the original increasing numbers): 2 billion numbers in total, with more than half equal to 1. Based on the distribution, Elias-Gamma coding should encode the 2D list row by row (we'll be accessing arbitrary rows by index later) at around 3 bits per number. However, our program has been running for 12 hours and it still hasn't finished the encoding.
Here's what we are doing:
from math import floor, log
from typing import List
from bitstring import BitArray

def _compress_2d_list(input: List[List[int]]) -> List[BitArray]:
    res = []
    for row in input:
        res.append(sum(_elias_gamma_compress_number(num) for num in row))
    return res

def _elias_gamma_compress_number(x: int) -> BitArray:
    n = _log_floor(x)
    return BitArray(bin="0" * n) + BitArray(uint=x, length=_log_floor(x) + 1)

def _log_floor(num: int) -> int:
    return floor(log(num, 2))
Called by:
input_2d_list: List[List[int]] # containing 1.5M lists, total 2B numbers
compressed_list = _compress_2d_list(input_2d_list)
How can I optimize my code to make it run faster? I mean, MUCH FASTER... I am OK with using any reliable, popular library or data structure.
Also, how do we decompress faster with BitStream? Currently I read the prefix 0's one by one, then read the binary of the compressed number in a while loop. It's not very fast either...
If you are OK with numpy "bitfields" you can get the compression done in a matter of minutes. Decoding is slower by a factor of three, but still a matter of minutes.
Sample run:
# create example (1'000'000 numbers)
a = make_example()
a
# array([2, 1, 1, ..., 3, 4, 3])
b,n = encode(a) # takes ~100 ms on my machine
c = decode(b,n) # ~300 ms
# check round trip
(a==c).all()
# True
Code:
import numpy as np

def make_example():
    # one million strictly increasing integers -> positive deltas
    a = np.random.choice(2000000, replace=False, size=1000001)
    a.sort()
    return np.diff(a)

def encode(a):
    a = a.view(f'u{a.itemsize}')
    l = np.log2(a).astype('u1')      # floor(log2) of each value
    L = ((l<<1)+1).cumsum()          # cumulative code lengths = end offset of each code word
    out = np.zeros(L[-1], 'u1')      # one entry per output bit
    for i in range(l.max()+1):       # write bit i of every value at once
        out[L-i-1] += (a>>i)&1
    return np.packbits(out), out.size

def decode(b, n):
    b = np.unpackbits(b, count=n).view(bool)
    s = b.nonzero()[0]
    s = (s<<1).repeat(np.diff(s, prepend=-1))
    s -= np.arange(-1, len(s)-1)
    s = s.tolist()                   # list has faster __getitem__
    ns = len(s)
    def gen():
        # walk from one code word boundary to the next
        idx = 0
        yield idx
        while idx < ns:
            idx = s[idx]
            yield idx
    offs = np.fromiter(gen(), int)
    sz = np.diff(offs)>>1            # number of leading zeros = floor(log2) of each value
    mx = sz.max()+1
    out = np.zeros(offs.size-1, int)
    for i in range(mx):              # reassemble values bit by bit
        out[b[offs[1:]-i-1] & (sz>=i)] += 1<<i
    return out
Some simple optimizations result in a factor of three improvement:
def _compress_2d_list(input):
    res = []
    for row in input:
        # Elias-Gamma of x is just x written out in 2*bit_length-1 bits:
        # the padding zeros are exactly the gamma prefix
        res.append(BitArray('').join(BitArray(uint=x, length=2*x.bit_length()-1) for x in row))
    return res
However, I think you'll need something better than that. On my machine, this would finish in about 12 hours on 1.5 million lists with 1400 deltas each.
In C it takes about a minute to encode. About 15 seconds to decode.
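One more shortcut in Python worth mentioning (a sketch, assuming the bitstring package; not benchmarked): the Elias-Gamma code for x is exactly the order-0 exponential-Golomb code for x - 1, and bitstring supports exp-Golomb natively through its ue format, so both encoding and decoding can avoid the manual zero-counting loop:

from bitstring import BitArray, ConstBitStream

def elias_gamma_encode(x):
    # Elias-Gamma(x) == order-0 exp-Golomb code of (x - 1)
    return BitArray(ue=x - 1)

def elias_gamma_decode_row(data):
    # read back every number in one compressed row
    stream = ConstBitStream(data)
    out = []
    while stream.pos < stream.length:
        out.append(stream.read('ue') + 1)
    return out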
I have a piece of code that computes partitions of a set of (potentially duplicated) integers. But I am interested in the set of possible partitions and their multiplicities. You can, for example, run the following code:
import numpy as np
from collections import Counter
import pandas as pd

def _B(i):
    # for a given multi-index i, _B(i) is the multiset containing i_j copies of the number j
    if len(i) != 1:
        B = []
        for j in range(len(i)):
            B.extend(i[j]*[j])
    else:
        B = i*[0]
    return B

def _partition(collection):
    # from here: https://stackoverflow.com/a/62532969/8425270
    if len(collection) == 1:
        yield (collection,)
        return
    first = collection[0]
    for smaller in _partition(collection[1:]):
        # insert `first` in each of the subpartition's subsets
        for n, subset in enumerate(smaller):
            yield smaller[:n] + ((first,) + subset,) + smaller[n + 1:]
        # put `first` in its own subset
        yield ((first,),) + smaller

def to_list(tpl):
    # convert the final tuple hierarchy into nested lists
    return list(list(i) if isinstance(i, tuple) else i for i in tpl)

def _Pi(inst_B):
    # inst_B must be a tuple
    if type(inst_B) != tuple:
        inst_B = tuple(inst_B)
    pp = [tuple(sorted(p)) for p in _partition(inst_B)]
    c = Counter(pp)
    Pi = c.keys()
    N = list()
    for pi in Pi:
        N.append(c[pi])
    Pi = [to_list(pi) for pi in Pi]
    return Pi, N

if __name__ == "__main__":
    import cProfile
    pr = cProfile.Profile()
    pr.enable()
    sh = (3, 3, 3)
    rez = list()
    rez_sorted = list()
    rez_ref = list()
    for idx in np.ndindex(sh):
        if sum(idx) > 0:
            print(idx)
            Pi, N = _Pi(_B(idx))
            print(pd.DataFrame({'Pi': Pi, 'N': N * np.array([np.math.factorial(len(pi) - 1) for pi in Pi])}))
    pr.disable()
    # after your program ends
    pr.print_stats(sort="tottime")
This code computes, for several examples of tuples of integers (generated by np.ndindex), the partitions and counts I need. Everything happens in the _partition and _Pi functions; this is where you should look.
If you look closely at how these two functions work, you'll see that they compute every potential partition and THEN count how many times each appeared. For small problems this is fine, but as the problem size increases, this starts to take a lot of time. Try setting sh = (5, 5, 5) and you'll see what I mean.
So the problem is the following:
Is there a way to compute the partitions and their numbers of occurrences directly instead?
Edit: I cross-posted on MathOverflow there, and they propose a solution in this article, in Corollary 2.10 (page 10 of the PDF). The problem could be solved by implementing the sets p(v, r) from this corollary.
I was hoping, as in the univariate case, that those sets would have a nice recursive expression, but I could not find one yet.
More Edit: This problem is equivalent to finding all (multiset)-partitions of a multiset. If the solution for finding (set)-partitions of a set is given by partial Bell polynomials, here we need a multivariate version of these polynomials.
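For the enumeration half, one avenue I found (a sketch, assuming sympy is available; not a full solution, since the multiplicities N would still need a separate combinatorial count): sympy.utilities.iterables.multiset_partitions yields each distinct partition of a multiset exactly once, so the generate-everything-then-Counter step disappears.

from sympy.utilities.iterables import multiset_partitions

# the four distinct partitions of the multiset {0, 0, 1},
# each produced exactly once:
for p in multiset_partitions([0, 0, 1]):
    print(p)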
I'm trying to optimize the function 'pw' in the following code using only NumPy functions (or perhaps list comprehensions).
from time import time
import numpy as np

def pw(x, udata):
    """
    Creates the step function

                   | 1, if d0 <= x < d1
                   | 2, if d1 <= x < d2
    pw(x,data) =     ...
                   | N, if d(N-1) <= x < dN
                   | 0, otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    vals = np.arange(1, udata.shape[0]+1).reshape(udata.shape[0], 1)
    pw_func = np.sum(np.where(np.greater_equal(x, udata)*np.less(x, np.roll(udata, -1)), vals, 0), axis=0)
    return pw_func

N = 50000
x = np.linspace(0, 10, N)
data = [1, 3, 4, 5, 5, 7]
udata = np.unique(data)

ti = time()
pw(x, udata)
tf = time()
print(tf - ti)

import cProfile
cProfile.run('pw(x,udata)')
The cProfile.run is telling me that most of the overhead is coming from np.where (about 1 ms) but I'd like to create faster code if possible. It seems that performing the operations row-wise versus column-wise makes some difference, unless I'm mistaken, but I think I've accounted for it. I know that sometimes list comprehensions can be faster but I couldn't figure out a faster way than what I'm doing using it.
Searchsorted seems to yield better performance but that 1 ms still remains on my computer:
(modified)
def pw(xx, uu):
    """
    Creates the step function

                   | 1, if d0 <= x < d1
                   | 2, if d1 <= x < d2
    pw(x,data) =     ...
                   | N, if d(N-1) <= x < dN
                   | 0, otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1, uu.shape[0]+1)
    pw_func = vals[inds[inds != uu.shape[0]]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
This answer using piecewise seems pretty close, but that's for scalar x0 and x1. How would I do it on arrays? And would it be more efficient?
Understandably, x may be pretty big but I'm trying to put it through a stress test.
I am still learning though so some hints or tricks that can help me out would be great.
EDIT
There seems to be a mistake in the second function, since its resulting array doesn't match that of the first one (which I'm confident works):
N1 = pw1(x,udata.reshape(udata.shape[0],1)).shape[0]
N2 = np.sum(pw1(x,udata.reshape(udata.shape[0],1)) == pw2(x,udata))
print(N1 - N2)
yields
15000
data points that are not the same. So it seems that I don't know how to use 'searchsorted'.
EDIT 2
Actually I fixed it:
pw_func = vals[inds[inds != uu.shape[0]]]
was changed to
pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
so at least the resulting arrays match. But the question still remains whether there's a more efficient way of doing this.
EDIT 3
Thanks Tin Lai for pointing out the mistake. This one should work
pw_func = vals[inds[(inds != uu.shape[0])*(inds != 0)]-1]
Maybe a more readable way of presenting it would be
non_endpts = (inds != uu.shape[0])*(inds != 0) # only consider the points in between the min/max data values
shift_inds = inds[non_endpts]-1 # searchsorted side='right' includes the left end point and not right end point so a shift is needed
pw_func = vals[shift_inds]
I think I got lost in all those brackets! I guess that's the importance of readability.
A very abstract yet interesting problem! Thanks for entertaining me, I had fun :)
p.s. I'm not sure about your pw2; I wasn't able to get it to output the same as pw1.
For reference the original pws:
def pw1(x, udata):
    vals = np.arange(1, udata.shape[0]+1).reshape(udata.shape[0], 1)
    pw_func = np.sum(np.where(np.greater_equal(x, udata)*np.less(x, np.roll(udata, -1)), vals, 0), axis=0)
    return pw_func

def pw2(xx, uu):
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1, uu.shape[0]+1)
    pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
My first attempt utilised a lot of broadcasting operations from numpy:
def pw3(x, udata):
    # the None slice is to create a new axis
    step_bool = x >= udata[None, :].T
    # we exploit the fact that bools have integer value 1,
    # skipping the last value in "data"
    step_vals = np.sum(step_bool[:-1], axis=0)
    # for the step_bool that we skipped in the previous step (last index)
    # we set the values to zero so that we negate the step_vals once we reach
    # the last value in "data"
    step_vals[step_bool[-1]] = 0
    return step_vals
After looking at the searchsorted in your pw2 I came up with a new approach that utilises it with much better performance:
def pw4(x, udata):
    inds = np.searchsorted(udata, x, side='right')
    # fix up the last values if x is already out of range of data[-1]
    if x[-1] > udata[-1]:
        inds[inds == inds[-1]] = 0
    return inds
Plots with:
plt.plot(pw1(x,udata.reshape(udata.shape[0],1)), label='pw1')
plt.plot(pw2(x,udata), label='pw2')
plt.plot(pw3(x,udata), label='pw3')
plt.plot(pw4(x,udata), label='pw4')
with data = [1,3,4,5,5,7]:
with data = [1,3,4,5,5,7,11]
pw1, pw3 and pw4 are all identical:
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw3(x,udata)))
>>> True
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw4(x,udata)))
>>> True
Performance (Timer.repeat runs the statement 3 times, each executing it number=1000 times; values are total seconds):
print(timeit.Timer('pw1(x,udata.reshape(udata.shape[0],1))', "from __main__ import pw1, x, udata").repeat(number=1000))
>>> [3.1938983199979702, 1.6096494779994828, 1.962694135003403]
print(timeit.Timer('pw2(x,udata)', "from __main__ import pw2, x, udata").repeat(number=1000))
>>> [0.6884554479984217, 0.6075002400029916, 0.7799002879983163]
print(timeit.Timer('pw3(x,udata)', "from __main__ import pw3, x, udata").repeat(number=1000))
>>> [0.7369808239964186, 0.7557657590004965, 0.8088172269999632]
print(timeit.Timer('pw4(x,udata)', "from __main__ import pw4, x, udata").repeat(number=1000))
>>> [0.20514375300263055, 0.20203858999957447, 0.19906871100101853]
I am currently running test_matrix_speed() to see how fast my search_and_book_availability function is. Using the PyCharm profiler I can see that each search_and_book_availability call averages 0.001ms. Having the Numba @jit(nopython=True) decorator makes no difference to the performance of this function. Is this because there are no improvements to be had and NumPy is operating as fast as possible here? (I don't care about the speed of the generate_searches function.)
Here's the code I'm running
import random
import numpy as np
from numba import jit

def generate_searches(number, sim_start, sim_end):
    searches = []
    for i in range(number):
        start_slot = random.randint(sim_start, sim_end - 1)
        end_slot = random.randint(start_slot + 1, sim_end)
        searches.append((start_slot, end_slot))
    return searches

@jit(nopython=True)
def search_and_book_availability(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    output = np.where(np.sum(search_slice, axis=1) == 0)[0]
    number_of_bookable_vecs = output.size
    if number_of_bookable_vecs > 0:
        if number_of_bookable_vecs == 1:
            id_to_book = output[0]
        else:
            id_to_book = np.random.choice(output)
        matrix[id_to_book, search_start:search_end] = 1
        return True
    else:
        return False

def test_matrix_speed():
    shape = (10, 1440)
    matrix = np.zeros(shape)
    sim_start = 0
    sim_end = 1440
    searches = generate_searches(1000000, sim_start, sim_end)
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)
Using your function and the following code to profile the speed
import time

shape = (10, 1440)
matrix = np.zeros(shape)
sim_start = 0
sim_end = 1440
searches = generate_searches(1000000, sim_start, sim_end)

def reset():
    matrix[:] = 0

def test_matrix_speed():
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)

def timeit(func):
    # warmup
    reset()
    func()
    reset()
    start = time.time()
    func()
    end = time.time()
    return end - start

print(timeit(test_matrix_speed))
I find on the order of 11.5s for the jitted version and 7.5s without jit. I am no expert on numba, but what it is made for is optimizing numerical code written in a non-vectorized way, in particular explicit for loops. In your code there are none; you only use vectorized operations. Therefore I expected jit not to outperform the baseline solution, though I must admit I am surprised to see it do that much worse. If you're looking to optimize your solution, you can cut the execution time (at least on my PC) with the following code:
def search_and_book_availability_opt(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    # We don't need to sum in order to check whether all elements are 0:
    # ndarray.any() can use short-circuiting and is therefore faster.
    # Also, we don't need the selected values from np.where, only the
    # indexes, so np.nonzero is faster.
    bookable, = np.nonzero(~search_slice.any(axis=1))
    # short circuit
    if bookable.size == 0:
        return False
    # we can perform random choice even if size is 1
    id_to_book = np.random.choice(bookable)
    matrix[id_to_book, search_start:search_end] = 1
    return True
and by initializing matrix as np.zeros(shape, dtype=np.bool) instead of the default float64. I am able to get execution times of around 3.8s, a ~50% improvement over your unjitted solution and a ~70% improvement over the jitted version. Hope that helps.
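As for the earlier point that numba shines on explicit loops: for completeness, a hedged sketch of what a numba-friendly rewrite of the availability check might look like (explicit loops with an early break; the function name is mine and I haven't benchmarked it):

import numpy as np
from numba import njit

@njit
def rows_free(matrix, search_start, search_end):
    # explicit-loop variant: for each row, stop scanning at the first booked cell
    n_rows = matrix.shape[0]
    free = np.empty(n_rows, np.bool_)
    for r in range(n_rows):
        free[r] = True
        for c in range(search_start, search_end):
            if matrix[r, c] != 0:
                free[r] = False
                break
    return free

This is the shape of code where @jit(nopython=True) typically pays off, since the early break avoids touching the whole slice.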