For a large set of randomly distributed points in a 2D lattice, I want to efficiently extract the subarray containing only the points whose coordinates, rounded down to indices, land on non-zero entries of a separate 2D binary matrix. Currently, my script is the following:
lat_len = 100 # lattice length
input = np.random.random(size=(1000,2)) * lat_len
binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
def landed(input):
    output = []
    input_as_indices = np.floor(input).astype(int)
    for i in range(len(input)):
        if binary_matrix[input_as_indices[i,0], input_as_indices[i,1]] == 1:
            output.append(input[i])
    output = np.asarray(output)
    return output
However, I suspect there must be a better way of doing this. The above script can take quite long to run for 10000 iterations.
You are correct. The calculation above can be done more efficiently, without a Python for loop, using advanced NumPy indexing:
def landed2(input):
    idx = np.floor(input).astype(np.int)
    mask = binary_matrix[idx[:,0], idx[:,1]] == 1
    return input[mask]
res1 = landed(input)
res2 = landed2(input)
np.testing.assert_allclose(res1, res2)
This results in a ~150x speed-up.
It seems you can squeeze out a noticeable performance boost if you work with linearly indexed arrays. Here's a vectorized implementation for our case, similar to @rth's answer, but using linear indexing -
# Get floor-ed indices
idx = np.floor(input).astype(np.int)
# Calculate linear indices
lin_idx = idx[:,0]*lat_len + idx[:,1]
# Index the raveled/flattened version of binary_matrix with lin_idx
# to extract and form the desired output
out = input[binary_matrix.ravel()[lin_idx] == 1]
Thus, in short we have:
out = input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] == 1]
Runtime tests -
This section compares the proposed approach in this solution against the other solution that uses row-column indexing.
Case #1 (original data sizes):
In [62]: lat_len = 100 # lattice length
...: input = np.random.random(size=(1000,2)) * lat_len
...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
...:
In [63]: idx = np.floor(input).astype(np.int)
In [64]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
10000 loops, best of 3: 121 µs per loop
In [65]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] ==1]
10000 loops, best of 3: 103 µs per loop
Case #2 (larger data sizes):
In [75]: lat_len = 1000 # lattice length
...: input = np.random.random(size=(100000,2)) * lat_len
...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
...:
In [76]: idx = np.floor(input).astype(np.int)
In [77]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
100 loops, best of 3: 18.5 ms per loop
In [78]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] ==1]
100 loops, best of 3: 13.1 ms per loop
Thus, the performance boost with this linear indexing seems to be about 20% - 30%.
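For reference, the same linear indices can also be obtained with np.ravel_multi_index, which additionally bounds-checks the indices; a minimal sketch reusing the variables above (not part of the original timings) -
# Equivalent linear indexing via np.ravel_multi_index (bounds-checked)
lin_idx2 = np.ravel_multi_index((idx[:,0], idx[:,1]), binary_matrix.shape)
out2 = input[binary_matrix.ravel()[lin_idx2] == 1]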
Related
I have a big csr_matrix (1M x 1K) and I want to add up groups of rows to obtain a new csr_matrix with the same number of columns but a reduced number of rows. My problem is essentially the same as in Sum over rows in scipy.sparse.csr_matrix; the only thing is that I find the accepted solution there too slow for my purpose. Let me state what I have:
map_fn = np.random.randint(0, 10000, 1000000)
map_fn here tells me how my input rows (1M) are mapped onto my output rows (10K): input row i gets added into output row map_fn[i]. I tried the two approaches mentioned in the above question, namely forming a sparse matrix and using a sparse sum. Although the sparse-matrix approach looks much better than the sparse-sum approach, I still find it too slow for my purpose. Here is the code comparing the two approaches:
import scipy.sparse
import numpy as np
import time
print "Setting up input"
s=10000
n=1000000
d=1000
density=1.0/500
X=scipy.sparse.rand(n,d,density=density,format="csr")
map_fn=np.random.randint(0, s, n)
# Approach 1
start_time=time.time()
col = scipy.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
print "Approach 1 Creation time : ",time.time()-start_time
SX = S.dot(X)
print "Approach 1 Total time : ",time.time()-start_time
#Approach 2
start_time=time.time()
SX = np.zeros((s,X.shape[1]))
for i in range(SX.shape[0]):
    SX[i,:] = X[np.where(map_fn==i)[0],:].sum(axis=0)
print "Approach 2 Total time : ",time.time()-start_time
which gives following numbers:
Approach 1 Creation time : 0.187678098679
Approach 1 Total time : 0.286989927292
Approach 2 Total time : 10.208632946
So my question is: is there a better way of doing this? I find forming the sparse matrix to be overkill, as it takes more than half of the total time. Are there any better alternatives? Any suggestions are greatly appreciated. Thanks.
Starting approach
Adapting the sparse-matrix solution from this post -
def sparse_matrix_mult_sparseX_mod1(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    nelem = nrows * ncols

    a, b = X.nonzero()                                    # row, col indices of the nonzeros
    ids = rows[a] + b*nrows                               # linear id = output_row + col*nrows
    sums = np.bincount(ids, X[a,b].A1, minlength=nelem)   # accumulate values per (output_row, col)
    out = sums.reshape(ncols,-1).T                        # back to (nrows, ncols)
    return out
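To see what the linearized bincount is doing, here is a toy illustration with made-up sizes (not part of the benchmark below):
import numpy as np
import scipy.sparse

# 3 input rows collapsed into 2 output rows:
# output row 0 = input row 0, output row 1 = input rows 1 + 2
Xs = scipy.sparse.csr_matrix(np.array([[1., 0.],
                                       [0., 2.],
                                       [3., 0.]]))
rows = np.array([0, 1, 1])
print(sparse_matrix_mult_sparseX_mod1(Xs, rows))
# [[ 1.  0.]
#  [ 3.  2.]]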
Benchmarking
Original approach #1 -
def app1(X, map_fn):
    col = scipy.arange(n)
    val = np.ones(n)
    S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
    SX = S.dot(X)
    return SX
Timings and verification -
In [209]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/500
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [210]: out1 = app1(X, map_fn)
...: out2 = sparse_matrix_mult_sparseX_mod1(X, map_fn)
...: print np.allclose(out1.toarray(), out2)
...:
True
In [211]: %timeit app1(X, map_fn)
1 loop, best of 3: 517 ms per loop
In [212]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
10 loops, best of 3: 147 ms per loop
To be fair, we should time the final dense array version from app1 -
In [214]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 584 ms per loop
Porting to Numba
We could translate the binned counting step to numba, which might be beneficial for denser input matrices. One of the ways to do so would be -
from numba import njit

@njit
def bincount_mod2(out, rows, r, C, V):
    N = len(V)
    for i in range(N):
        out[rows[r[i]], C[i]] += V[i]
    return out

def sparse_matrix_mult_sparseX_mod2(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    r, C = X.nonzero()
    V = X[r,C].A1
    out = np.zeros((nrows, ncols))
    return bincount_mod2(out, rows, r, C, V)
Timings -
In [373]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/100 # Denser now!
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [374]: %timeit app1(X, map_fn)
1 loop, best of 3: 787 ms per loop
In [375]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
1 loop, best of 3: 906 ms per loop
In [376]: %timeit sparse_matrix_mult_sparseX_mod2(X, map_fn)
1 loop, best of 3: 705 ms per loop
With the dense output from app1 -
In [379]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 910 ms per loop
What is the fastest way to calculate a function like
# here x is just a number
def f(x):
    if x >= 0:
        return np.log(x+1)
    else:
        return -np.log(-x+1)
One possible way is:
# here x is an array
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)
But it seems numpy still goes through the array element by element.
Is there any way to use something conceptually similar to np.exp(x) to achieve better performance?
def f(x):
    return (x/abs(x)) * np.log(1+abs(x))
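A side note, not from the original answer: x/abs(x) is undefined at x = 0, so if zeros can occur, a safer variant along the same lines uses np.sign and np.log1p -
# Handles x == 0 cleanly; np.log1p(|x|) == log(1 + |x|)
def f_signed(x):
    return np.sign(x) * np.log1p(np.abs(x))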
In cases like these, masking helps -
def mask_vectorized_app(x):
    out = np.empty_like(x)
    mask = x >= 0
    mask_rev = ~mask
    out[mask] = np.log(x[mask]+1)
    out[mask_rev] = -np.log(-x[mask_rev]+1)
    return out
Introducing the numexpr module helps us further.
import numexpr as ne

def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x >= 0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out
Inspired by @user2685079's post and using the logarithmic property log(A**B) = B*log(A), we can push the sign into the log computation, which lets numexpr's evaluate do more of the work, like so -
s = np.sign(x)
out = ne.evaluate('log( (abs(x)+1)**s)')
Computing sign using comparison gives us s in another way -
s = (-2*(x<0))+1
Finally, we can push this into the numexpr evaluate expression -
def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')
Runtime test
Loopy approach for comparison -
def loopy_app(x):
    out = np.empty_like(x)
    for i in range(len(out)):
        out[i] = f(x[i])
    return out
Timings and verification -
In [141]: x = np.random.randn(100000)
...: print np.allclose(loopy_app(x), mask_vectorized_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app2(x))
...:
True
True
True
In [142]: %timeit loopy_app(x)
...: %timeit mask_vectorized_numexpr_app(x)
...: %timeit mask_vectorized_numexpr_app2(x)
...:
10 loops, best of 3: 108 ms per loop
100 loops, best of 3: 3.6 ms per loop
1000 loops, best of 3: 942 µs per loop
Using @user2685079's np.sign-based solution for the first part, then without and with numexpr evaluation -
In [143]: %timeit np.sign(x) * np.log(1+abs(x))
100 loops, best of 3: 3.26 ms per loop
In [144]: %timeit np.sign(x) * ne.evaluate('log(1+abs(x))')
1000 loops, best of 3: 1.66 ms per loop
Using numba
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
The Numba project is supported by Continuum Analytics and The Gordon and Betty Moore Foundation (Grant GBMF5423).
from numba import njit
import numpy as np

@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a
Accuracy
np.isclose(pir(x), f(x)).all()
True
Timing
x = np.random.randn(100000)
# My proposal
%timeit pir(x)
1000 loops, best of 3: 881 µs per loop
# OP test
%timeit f(x)
1000 loops, best of 3: 1.26 ms per loop
# Divakar-1
%timeit mask_vectorized_numexpr_app(x)
100 loops, best of 3: 2.97 ms per loop
# Divakar-2
%timeit mask_vectorized_numexpr_app2(x)
1000 loops, best of 3: 621 µs per loop
Function definitions
from numba import njit
import numpy as np

@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a

import numexpr as ne

def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x >= 0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out

def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')

def f(x):
    return (x/abs(x)) * np.log(1+abs(x))
You can slightly improve the speed of your second solution by using np.where instead of np.select:
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)

def logb(x):
    return np.where(x>=0, np.log(x+1), -np.log(-x+1))
In [16]: %timeit loga(arange(-1000,1000))
10000 loops, best of 3: 169 µs per loop
In [17]: %timeit logb(arange(-1000,1000))
10000 loops, best of 3: 98.3 µs per loop
In [18]: np.all(loga(arange(-1000,1000)) == logb(arange(-1000,1000)))
Out[18]: True
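One caveat worth adding (my note, not part of the answer above): np.where, like np.select, still evaluates both branch expressions over the whole array, so np.log(x+1) is also computed where x < -1 and can raise invalid-value warnings even though those results are discarded. A minimal sketch that suppresses them, if that matters for your data -
def logb_quiet(x):
    # Both branches are still computed; we only silence the warning for the discarded one
    with np.errstate(invalid='ignore'):
        return np.where(x >= 0, np.log(x+1), -np.log(-x+1))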
Let's say I have a 4-D numpy array (e.g. np.random.rand(x, y, z, t)) of data with dimensions corresponding to X, Y, Z, and time.
For each X and Y point, and at each time step, I want to find the largest index in Z for which the data is larger than some threshold n.
So my end result should be an X-by-Y-by-t array. Instances where there are no values in the Z-column greater than the threshold should be represented by a 0.
I can loop through element-by-element and construct a new array as I go, however I am operating on a very large array and it takes too long.
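Roughly, the element-by-element loop I currently use looks like this (a sketch, with illustrative names):
import numpy as np

def looped(data, n):
    # For every (X, Y, t), record the largest Z index where data > n, else 0
    X, Y, Z, T = data.shape
    out = np.zeros((X, Y, T), dtype=int)
    for i in range(X):
        for j in range(Y):
            for t in range(T):
                above = np.nonzero(data[i, j, :, t] > n)[0]
                out[i, j, t] = above[-1] if above.size else 0
    return out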
Unfortunately, following the example of Python builtins, numpy doesn't make it easy to get the last index, although the first is trivial. Still, something like
def slow(arr, axis, threshold):
    return (arr > threshold).cumsum(axis=axis).argmax(axis=axis)

def fast(arr, axis, threshold):
    compare = (arr > threshold)
    reordered = compare.swapaxes(axis, -1)
    flipped = reordered[..., ::-1]
    first_above = flipped.argmax(axis=-1)
    last_above = flipped.shape[-1] - first_above - 1
    are_any_above = compare.any(axis=axis)
    # patch the no-matching-element found values
    patched = np.where(are_any_above, last_above, 0)
    return patched
gives me
In [14]: arr = np.random.random((100,100,30,50))
In [15]: %timeit a = slow(arr, axis=2, threshold=0.75)
1 loop, best of 3: 248 ms per loop
In [16]: %timeit b = fast(arr, axis=2, threshold=0.75)
10 loops, best of 3: 50.9 ms per loop
In [17]: (slow(arr, axis=2, threshold=0.75) == fast(arr, axis=2, threshold=0.75)).all()
Out[17]: True
(There's probably a slicker way to do the flipping but it's the end of day here and my brain is shutting down. :-)
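(For what it's worth, one such slicker option, sketched here under the assumption of NumPy >= 1.12, is np.flip, which reverses along an arbitrary axis and avoids the swapaxes shuffle -)
def fast_flip(arr, axis, threshold):
    compare = arr > threshold
    flipped = np.flip(compare, axis=axis)        # reverse along the reduction axis
    first_above = flipped.argmax(axis=axis)      # first True in reversed order == last True originally
    last_above = compare.shape[axis] - first_above - 1
    # patch positions where nothing exceeds the threshold
    return np.where(compare.any(axis=axis), last_above, 0)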
Here's a faster approach -
def faster(a, n, invalid_specifier):
    mask = a > n
    idx = a.shape[2] - (mask[:,:,::-1]).argmax(2) - 1
    idx[~mask[:,:,-1] & (idx == a.shape[2]-1)] = invalid_specifier
    return idx
Runtime test -
# Using @DSM's benchmarking setup
In [553]: a = np.random.random((100,100,30,50))
...: n = 0.75
...:
In [554]: out1 = faster(a,n,invalid_specifier=0)
...: out2 = fast(a, axis=2, threshold=n) # @DSM's soln
...:
In [555]: np.allclose(out1,out2)
Out[555]: True
In [556]: %timeit fast(a, axis=2, threshold=n) # @DSM's soln
10 loops, best of 3: 64.6 ms per loop
In [557]: %timeit faster(a,n,invalid_specifier=0)
10 loops, best of 3: 43.7 ms per loop
Suppose you have an array and want to create another array whose values are the standard deviations of successive 10-element windows of the first array. With the help of a for loop it can be written easily, as in the code below. What I want to do is avoid the for loop for a faster execution time. Any suggestions?
Code
a = np.arange(20)
b = np.empty(11)
for i in range(11):
    b[i] = np.std(a[i:i+10])
You could create a 2D array of sliding windows with np.lib.stride_tricks.as_strided; these would be views into the given 1D array and as such won't occupy any more memory. Then, simply use np.std along the second axis (axis=1) for the final result in a vectorized way, like so -
W = 10 # Window size
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
out = np.std(a2D, axis=1)
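One caution worth adding (my note, not from the original answer): the as_strided windows are overlapping views into a, so treat a2D as read-only; writing into it silently modifies a. A quick illustration with the toy input -
a = np.arange(20)
W = 10
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a, shape=(a.size - W + 1, W), strides=(n, n))
print(a2D[0])      # first window: [0 1 ... 9]
print(a2D[1])      # next window overlaps it: [1 2 ... 10]
a2D[0, 1] = 99     # careful: this also changes a[1] and a2D[1, 0]
print(a[1])        # 99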
Runtime test
Function definitions -
def original_app(a, W):
    b = np.empty(a.size-W+1)
    for i in range(b.size):
        b[i] = np.std(a[i:i+W])
    return b

def vectorized_app(a, W):
    nrows = a.size - W + 1
    n = a.strides[0]
    a2D = np.lib.stride_tricks.as_strided(a, shape=(nrows,W), strides=(n,n))
    return np.std(a2D, 1)
Timings and verification -
In [460]: # Inputs
...: a = np.arange(10000)
...: W = 10
...:
In [461]: np.allclose(original_app(a, W), vectorized_app(a, W))
Out[461]: True
In [462]: %timeit original_app(a, W)
1 loops, best of 3: 522 ms per loop
In [463]: %timeit vectorized_app(a, W)
1000 loops, best of 3: 1.33 ms per loop
So, around 400x speedup there!
For completeness, here's the equivalent pandas version -
import pandas as pd

def pdroll(a, W): # a is 1D ndarray and W is window-size
    return pd.Series(a).rolling(W).std(ddof=0).values[W-1:]
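On newer NumPy (1.20+), np.lib.stride_tricks.sliding_window_view builds the same window view without manual stride arithmetic; a minimal sketch, assuming that version is available -
from numpy.lib.stride_tricks import sliding_window_view

def windowed_std(a, W):
    # (a.size - W + 1, W) view of sliding windows, then std along each window
    return sliding_window_view(a, W).std(axis=1)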
Not so fancy, but the code with no loops would be something like this:
a = np.arange(20)
b = [a[i:i+10].std() for i in range(len(a) - 10 + 1)]
Note this is not a question about multiple regression, it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).
I have two m x n arrays x and y. The rows correspond to each other, and each pair is the set of (x,y) points for a measurement. That is, plt.plot(x.T, y.T, '.') would plot each of m datasets/measurements.
I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress(). (Assume I don't want solutions based on doing linear algebra with the matrices but instead want to work with this function, or an equivalent black-box function.) I could try np.vectorize, but the docs indicate it also loops.
With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, %%timeit returns, using a small dataset (commented out):
(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop
To try magnifying this, I made a much bigger random dataset (dimension trials x trials):
(loop, trials = 1000) 1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000) 1 loops, best of 3: 256 ms per loop
(map, trials = 10000) 1 loops, best of 3: 2.37 s per loop
That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = np.shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)

for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)

#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + inter)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
Single-variable linear regression is simple enough to vectorize manually:
def multiple_linregress(x, y):
    # per-row means
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    # slope = sum((x - xm)*(y - ym)) / sum((x - xm)**2), row-wise via einsum
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))
With some made up data:
m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)
it outperforms your looping options by a fair margin:
%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
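(As a quick sanity check, not part of the original answer, one row of the vectorized result can be compared against scipy.stats.linregress -)
import scipy.stats as stats

# Spot-check: slope/intercept of row 0 should match linregress on that row
res = multiple_linregress(x, y)
s0, i0 = stats.linregress(x[0], y[0])[:2]
print(np.allclose(res[0], (s0, i0)))   # expected: True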