Optimizing nested for loops in Python for numpy arrays - python

I am trying to optimize a nested for loop in Python. Here is the code (note: the data setup does not need to be optimized):
import numpy
Y=numpy.zeros(44100)
for i in range(len(Y)):
    Y[i]=numpy.sin(i/len(Y))
### /Data^^
Z=numpy.zeros(len(Y))
for i in range(len(Y)):
    for j in range(len(Y)):
        Z[i]+=Y[j]*numpy.sinc(i-j)
What is the best way to optimize code written for numpy arrays when nested for loops are involved?
EDIT: For clarity.

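I guess this only makes sense if you multiply the argument of sinc by some factor f: numpy.sinc is 0 at every nonzero integer and 1 at 0, so without a factor Z is just Y (up to rounding). With a factor, you can use numpy.convolve:
import numpy
def orig(Y, f):
    Z=numpy.zeros(len(Y))
    for i in range(len(Y)):
        for j in range(len(Y)):
            Z[i]+=Y[j]*numpy.sinc((i-j)*f)
    return Z
def new(Y, f):
    # Kernel covers every possible offset i-j, from 1-len(Y) to len(Y)-1
    sinc = numpy.sinc(numpy.arange(1-len(Y), len(Y)) * f)
    return numpy.convolve(Y, sinc, 'valid')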
In [111]: Y=numpy.zeros(441)
...: for i in range(len(Y)):
...:     Y[i]=numpy.sin(i/len(Y))
In [112]: %time Z = orig(Y, 0.9)
Wall time: 2.81 s
In [113]: %timeit Z = new(Y, 0.9)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 109 µs per loop
For really good speed, have a look at scipy.signal.fftconvolve.
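As a rough sketch of that suggestion, the same kernel can be passed to scipy.signal.fftconvolve instead of numpy.convolve. The function names sinc_sum_direct and sinc_sum_fft are made up for this illustration, and the input size of 441 just mirrors the timing run above; how much faster the FFT route is will depend on the array length.

```python
import numpy as np
from scipy.signal import fftconvolve

def sinc_sum_direct(Y, f):
    """Z[i] = sum_j Y[j] * sinc((i - j) * f), via np.convolve."""
    kernel = np.sinc(np.arange(1 - len(Y), len(Y)) * f)
    return np.convolve(Y, kernel, 'valid')

def sinc_sum_fft(Y, f):
    """Same sum, computed with an FFT-based convolution."""
    kernel = np.sinc(np.arange(1 - len(Y), len(Y)) * f)
    # 'valid' keeps only the len(Y) outputs where Y fully overlaps the kernel.
    return fftconvolve(kernel, Y, mode='valid')

Y = np.sin(np.arange(441) / 441.0)
Z_direct = sinc_sum_direct(Y, 0.9)
Z_fft = sinc_sum_fft(Y, 0.9)
print(np.allclose(Z_direct, Z_fft))
```

The FFT result differs from the direct one only by floating-point rounding, so compare with np.allclose rather than exact equality.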

Related

Fastest way to sum over rows of sparse matrix

I have a big csr_matrix (1M x 1K) and I want to sum groups of rows to obtain a new csr_matrix with the same number of columns but fewer rows. My problem is exactly the same as the question "Sum over rows in scipy.sparse.csr_matrix"; the only difference is that I find the accepted solution too slow for my purpose. Here is what I have:
map_fn = np.random.randint(0, 10000, 1000000)
map_fn here tells me how my input rows (1M) are mapped to my output rows (10K): the ith input row is added into output row map_fn[i]. I tried the two approaches mentioned in the above question, namely forming a sparse matrix and using a sparse sum. Although the sparse matrix approach is far faster than the sparse sum approach, I still find it slow for my purpose. Here is the code comparing the two approaches:
import scipy.sparse
import numpy as np
import time
print "Setting up input"
s=10000
n=1000000
d=1000
density=1.0/500
X=scipy.sparse.rand(n,d,density=density,format="csr")
map_fn=np.random.randint(0, s, n)
# Approach 1
start_time=time.time()
col = np.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
print "Approach 1 Creation time : ",time.time()-start_time
SX = S.dot(X)
print "Approach 1 Total time : ",time.time()-start_time
#Approach 2
start_time=time.time()
SX = np.zeros((s,X.shape[1]))
for i in range(SX.shape[0]):
    SX[i,:] = X[np.where(map_fn==i)[0],:].sum(axis=0)
print "Approach 2 Total time : ",time.time()-start_time
which gives following numbers:
Approach 1 Creation time : 0.187678098679
Approach 1 Total time : 0.286989927292
Approach 2 Total time : 10.208632946
So my question is: is there a better way of doing this? Forming the sparse matrix seems like overkill, as it takes more than half of the total time. Are there any better alternatives? Any suggestions are greatly appreciated. Thanks.
Starting approach
Adapting sparse solution from this post -
def sparse_matrix_mult_sparseX_mod1(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    nelem = nrows * ncols
    a,b = X.nonzero()
    ids = rows[a] + b*nrows
    sums = np.bincount(ids, X[a,b].A1, minlength=nelem)
    out = sums.reshape(ncols,-1).T
    return out
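To see why the flat index rows[a] + b*nrows works, here is a small dense sketch of the same binning idea. It is only an illustration: grouped_row_sum is a hypothetical name, the 4x3 array is made-up data, and a dense array stands in for the sparse matrix so that nothing beyond NumPy is needed.

```python
import numpy as np

def grouped_row_sum(X, rows):
    """Add row i of dense X into output row rows[i], using np.bincount."""
    nrows = rows.max() + 1
    ncols = X.shape[1]
    a, b = np.nonzero(X)                 # positions of the nonzero entries
    ids = rows[a] + b * nrows            # flat index, column-major order
    sums = np.bincount(ids, weights=X[a, b], minlength=nrows * ncols)
    return sums.reshape(ncols, -1).T     # back to shape (nrows, ncols)

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0],
              [0.0, 6.0, 0.0]])
rows = np.array([0, 1, 0, 1])            # rows 0, 2 -> output 0; rows 1, 3 -> output 1

out = grouped_row_sum(X, rows)

# Reference: unbuffered in-place accumulation with np.add.at.
ref = np.zeros_like(out)
np.add.at(ref, rows, X)
print(np.array_equal(out, ref))
```

Each nonzero entry is assigned one bin per (output row, column) pair, so a single np.bincount call performs all the grouped sums at once.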
Benchmarking
Original approach #1 -
def app1(X, map_fn):
    # Relies on the globals n and s from the inputs setup
    col = np.arange(n)
    val = np.ones(n)
    S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
    SX = S.dot(X)
    return SX
Timings and verification -
In [209]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/500
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [210]: out1 = app1(X, map_fn)
...: out2 = sparse_matrix_mult_sparseX_mod1(X, map_fn)
...: print np.allclose(out1.toarray(), out2)
...:
True
In [211]: %timeit app1(X, map_fn)
1 loop, best of 3: 517 ms per loop
In [212]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
10 loops, best of 3: 147 ms per loop
To be fair, we should time the final dense array version from app1 -
In [214]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 584 ms per loop
Porting to Numba
We could translate the binned counting step to numba, which might be beneficial for denser input matrices. One of the ways to do so would be -
from numba import njit
@njit
def bincount_mod2(out, rows, r, C, V):
    N = len(V)
    for i in range(N):
        out[rows[r[i]], C[i]] += V[i]
    return out
def sparse_matrix_mult_sparseX_mod2(X, rows):
    nrows = rows.max()+1
    ncols = X.shape[1]
    r,C = X.nonzero()
    V = X[r,C].A1
    out = np.zeros((nrows, ncols))
    return bincount_mod2(out, rows, r, C, V)
Timings -
In [373]: # Inputs setup
...: s=10000
...: n=1000000
...: d=1000
...: density=1.0/100 # Denser now!
...:
...: X=scipy.sparse.rand(n,d,density=density,format="csr")
...: map_fn=np.random.randint(0, s, n)
...:
In [374]: %timeit app1(X, map_fn)
1 loop, best of 3: 787 ms per loop
In [375]: %timeit sparse_matrix_mult_sparseX_mod1(X, map_fn)
1 loop, best of 3: 906 ms per loop
In [376]: %timeit sparse_matrix_mult_sparseX_mod2(X, map_fn)
1 loop, best of 3: 705 ms per loop
With the dense output from app1 -
In [379]: %timeit app1(X, map_fn).toarray()
1 loop, best of 3: 910 ms per loop

Fast way to calculate conditional function

What is the fastest way to calculate a function like
# here x is just a number
def f(x):
    if x >= 0:
        return np.log(x+1)
    else:
        return -np.log(-x+1)
One possible way is:
# here x is an array
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)
But it seems numpy goes through the array element by element.
Is there any way to use something conceptually similar to np.exp(x) to achieve better performance?
def f(x):
    return (x/abs(x)) * np.log(1+abs(x))
In cases like these, masking helps -
def mask_vectorized_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    out[mask] = np.log(x[mask]+1)
    out[mask_rev] = -np.log(-x[mask_rev]+1)
    return out
Introducing the numexpr module helps us further.
import numexpr as ne
def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out
Inspired by @user2685079's post, and using the logarithm property log(A**B) = B*log(A), we can push the sign into the log computation, which lets numexpr's evaluate do more of the work, like so -
s = (-2*(x<0))+1 # np.sign(x)
out = ne.evaluate('log( (abs(x)+1)**s)')
Computing sign using comparison gives us s in another way -
s = (-2*(x<0))+1
Finally, we can push this into the numexpr evaluate expression -
def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')
Runtime test
Loopy approach for comparison -
def loopy_app(x):
    out = np.empty_like(x)
    for i in range(len(out)):
        out[i] = f(x[i])
    return out
Timings and verification -
In [141]: x = np.random.randn(100000)
...: print np.allclose(loopy_app(x), mask_vectorized_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app(x))
...: print np.allclose(loopy_app(x), mask_vectorized_numexpr_app2(x))
...:
True
True
True
In [142]: %timeit loopy_app(x)
...: %timeit mask_vectorized_numexpr_app(x)
...: %timeit mask_vectorized_numexpr_app2(x)
...:
10 loops, best of 3: 108 ms per loop
100 loops, best of 3: 3.6 ms per loop
1000 loops, best of 3: 942 µs per loop
Using @user2685079's solution with np.sign replacing the first part, with and without numexpr evaluation -
In [143]: %timeit np.sign(x) * np.log(1+abs(x))
100 loops, best of 3: 3.26 ms per loop
In [144]: %timeit np.sign(x) * ne.evaluate('log(1+abs(x))')
1000 loops, best of 3: 1.66 ms per loop
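A related variant, sketched here as an assumption rather than something from the timings above, writes the np.sign form with np.log1p. Unlike x/abs(x), which evaluates to nan at x = 0, np.sign(0) is 0, so zero inputs map cleanly to 0. The names signed_log and f_scalar are hypothetical, and no speed claim is made for this version.

```python
import numpy as np

def signed_log(x):
    """sign(x) * log(1 + |x|), computed over the whole array at once."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def f_scalar(x):
    """Scalar definition from the question, used as a reference."""
    if x >= 0:
        return np.log(x + 1)
    return -np.log(-x + 1)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
expected = np.array([f_scalar(v) for v in x])
print(np.allclose(signed_log(x), expected))
```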
Using numba
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
The Numba project is supported by Continuum Analytics and The Gordon and Betty Moore Foundation (Grant GBMF5423).
from numba import njit
import numpy as np
@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a
Accuracy
np.isclose(pir(x), f(x)).all()
True
Timing
x = np.random.randn(100000)
# My proposal
%timeit pir(x)
1000 loops, best of 3: 881 µs per loop
# OP test
%timeit f(x)
1000 loops, best of 3: 1.26 ms per loop
# Divakar-1
%timeit mask_vectorized_numexpr_app(x)
100 loops, best of 3: 2.97 ms per loop
# Divakar-2
%timeit mask_vectorized_numexpr_app2(x)
1000 loops, best of 3: 621 µs per loop
Function definitions
from numba import njit
import numpy as np
@njit
def pir(x):
    a = np.empty_like(x)
    for i in range(a.size):
        x_ = x[i]
        _x = abs(x_)
        a[i] = np.sign(x_) * np.log(1 + _x)
    return a
import numexpr as ne
def mask_vectorized_numexpr_app(x):
    out = np.empty_like(x)
    mask = x>=0
    mask_rev = ~mask
    x_masked = x[mask]
    x_rev_masked = x[mask_rev]
    out[mask] = ne.evaluate('log(x_masked+1)')
    out[mask_rev] = ne.evaluate('-log(-x_rev_masked+1)')
    return out
def mask_vectorized_numexpr_app2(x):
    return ne.evaluate('log( (abs(x)+1)**((-2*(x<0))+1))')
def f(x):
    return (x/abs(x)) * np.log(1+abs(x))
You can slightly improve the speed of your second solution by using np.where instead of np.select:
def loga(x):
    cond = [x >= 0, x < 0]
    choice = [np.log(x+1), -np.log(-x+1)]
    return np.select(cond, choice)
def logb(x):
    return np.where(x>=0, np.log(x+1), -np.log(-x+1))
In [16]: %timeit loga(arange(-1000,1000))
10000 loops, best of 3: 169 µs per loop
In [17]: %timeit logb(arange(-1000,1000))
10000 loops, best of 3: 98.3 µs per loop
In [18]: np.all(loga(arange(-1000,1000)) == logb(arange(-1000,1000)))
Out[18]: True
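One caveat with the np.where version: both branches are evaluated for every element before the selection happens, so np.log also sees non-positive arguments in the branch that is later discarded, and NumPy may emit RuntimeWarnings. The selected values are unaffected. A minimal sketch, assuming those warnings are just noise for your use case, wraps the call in np.errstate (logb_quiet is a hypothetical name):

```python
import numpy as np

def logb_quiet(x):
    """np.where version of the function, with log warnings suppressed."""
    x = np.asarray(x, dtype=float)
    # Both np.log calls run on the full array; the unused branch can hit
    # log(0) or log(negative), so silence those warnings locally.
    with np.errstate(invalid='ignore', divide='ignore'):
        return np.where(x >= 0, np.log(x + 1), -np.log(-x + 1))

x = np.arange(-1000, 1000)
res = logb_quiet(x)
print(np.all(np.isfinite(res)))
```

The errstate context only affects floating-point warnings raised inside the with block, so warnings elsewhere in the program stay visible.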

Fast array manipulation based on element inclusion in binary matrix

For a large set of randomly distributed points in a 2D lattice, I want to efficiently extract a subarray, which contains only the elements that, approximated as indices, are assigned to non-zero values in a separate 2D binary matrix. Currently, my script is the following:
import numpy as np
lat_len = 100 # lattice length
input = np.random.random(size=(1000,2)) * lat_len
binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
def landed(input):
    output = []
    input_as_indices = np.floor(input).astype(int)  # indices must be integers
    for i in range(len(input)):
        if binary_matrix[input_as_indices[i,0], input_as_indices[i,1]] == 1:
            output.append(input[i])
    output = np.asarray(output)
    return output
However, I suspect there must be a better way of doing this. The above script can take quite long to run for 10000 iterations.
You are correct. The calculation above can be done more efficiently, without a for loop in Python, using advanced numpy indexing:
def landed2(input):
    idx = np.floor(input).astype(int)
    mask = binary_matrix[idx[:,0], idx[:,1]] == 1
    return input[mask]
res1 = landed(input)
res2 = landed2(input)
np.testing.assert_allclose(res1, res2)
This results in a ~150x speed-up.
It seems you can squeeze out a noticeable performance boost if you work with linearly indexed arrays. Here's a vectorized implementation to solve our case, similar to @rth's answer, but using linear indexing -
# Get floor-ed indices
idx = np.floor(input).astype(int)
# Calculate linear indices
lin_idx = idx[:,0]*lat_len + idx[:,1]
# Index raveled/flattened version of binary_matrix with lin_idx
# to extract and form the desired output
out = input[binary_matrix.ravel()[lin_idx] ==1]
Thus, in short we have:
out = input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] ==1]
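The flat index idx[:,0]*lat_len + idx[:,1] assumes a C-ordered square matrix. As a hedged alternative sketch, np.ravel_multi_index can compute the same index from the matrix shape. The variable name points replaces input here only to avoid shadowing the builtin, and the fixed seed is an assumption made so the run is repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)           # fixed seed, for repeatability only
lat_len = 100
points = rng.random_sample((1000, 2)) * lat_len
binary_matrix = rng.choice(2, lat_len * lat_len).reshape(lat_len, -1)

idx = np.floor(points).astype(int)

# Row/column indexing.
mask_2d = binary_matrix[idx[:, 0], idx[:, 1]] == 1

# Linear indexing, with the flat index computed by NumPy from the shape.
lin_idx = np.ravel_multi_index((idx[:, 0], idx[:, 1]), binary_matrix.shape)
mask_lin = binary_matrix.ravel()[lin_idx] == 1

out = points[mask_lin]
print(np.array_equal(mask_2d, mask_lin))
```

This form also covers non-square matrices, since the row stride is taken from binary_matrix.shape rather than hard-coded.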
Runtime tests -
This section compares the proposed approach in this solution against the other solution that uses row-column indexing.
Case #1(Original datasizes):
In [62]: lat_len = 100 # lattice length
...: input = np.random.random(size=(1000,2)) * lat_len
...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
...:
In [63]: idx = np.floor(input).astype(int)
In [64]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
10000 loops, best of 3: 121 µs per loop
In [65]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] ==1]
10000 loops, best of 3: 103 µs per loop
Case #2(Larger datasizes):
In [75]: lat_len = 1000 # lattice length
...: input = np.random.random(size=(100000,2)) * lat_len
...: binary_matrix = np.random.choice(2, lat_len * lat_len).reshape(lat_len, -1)
...:
In [76]: idx = np.floor(input).astype(int)
In [77]: %timeit input[binary_matrix[idx[:,0], idx[:,1]] == 1]
100 loops, best of 3: 18.5 ms per loop
In [78]: %timeit input[binary_matrix.ravel()[idx[:,0]*lat_len + idx[:,1]] ==1]
100 loops, best of 3: 13.1 ms per loop
Thus, linear indexing reduces the runtime by about 15% in the first case and about 30% in the second.

Speedup sympy-lambdified and vectorized function

I am using sympy to generate some functions for numerical calculations. To do so, I lambdify an expression and vectorize it so that it can be used with numpy arrays. Here is an example:
import numpy as np
import sympy as sp
def numpy_function():
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = (1 - np.cos(2*np.pi*x))*(1 - np.cos(2*np.pi*y))*np.sin(np.pi*z)*0.1
    return T
def sympy_function():
    x, y, z = sp.Symbol("x"), sp.Symbol("y"), sp.Symbol("z")
    T = (1 - sp.cos(2*sp.pi*x))*(1 - sp.cos(2*sp.pi*y))*sp.sin(sp.pi*z)*0.1
    lambda_function = np.vectorize(sp.lambdify((x, y, z), T, "numpy"))
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = lambda_function(x,y,z)
    return T
The problem with the sympy version, compared with the pure numpy version, is speed, i.e.
In [3]: timeit test.numpy_function()
100 loops, best of 3: 11.9 ms per loop
vs.
In [4]: timeit test.sympy_function()
1 loops, best of 3: 634 ms per loop
So is there any way to get closer to the speed of the numpy version?
I think np.vectorize is pretty slow but somehow some part of my code does not work without it. Thank you for any suggestions.
EDIT:
So I found the reason why the vectorize function is necessary, i.e:
In [35]: y = np.arange(10)
In [36]: f = sp.lambdify(x,sin(x),"numpy")
In [37]: f(y)
Out[37]:
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
-0.95892427, -0.2794155 , 0.6569866 , 0.98935825, 0.41211849])
this seems to work fine however:
In [38]: y = np.arange(10)
In [39]: f = sp.lambdify(x,1,"numpy")
In [40]: f(y)
Out[40]: 1
So for simple expression like 1 this function doesn't return an array.
Is there a way to fix this and isn't this some kind of bug or at least inconsistent design?
lambdify returns a single value for constants because no numpy functions are involved. This is because of the way lambdify works (see https://stackoverflow.com/a/25514007/161801).
But this is typically not a problem because a constant will automatically broadcast to the correct shape in any operation that you use it in with an array. On the other hand, if you explicitly worked with an array of the same constant, it would be much less efficient because you would compute the same operations multiple times.
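If a caller really does need an array of the input's shape, one option is to broadcast the result explicitly. The sketch below is only an illustration: as_array_like is a hypothetical helper, and a plain lambda stands in for sp.lambdify(x, 1, "numpy") so that the example needs nothing beyond NumPy.

```python
import numpy as np

def as_array_like(func, *args):
    """Call func and broadcast its result to the common shape of the inputs."""
    result = func(*args)
    shape = np.broadcast_arrays(*args)[0].shape
    return np.broadcast_to(result, shape)

# Stand-in (assumption) for sp.lambdify(x, 1, "numpy"), which returns
# the plain constant 1 regardless of the input.
f_const = lambda x: 1

y = np.arange(10)
print(f_const(y) + y)                    # the scalar broadcasts in arithmetic
print(as_array_like(f_const, y).shape)   # explicit array when a shape is needed
```

np.broadcast_to returns a read-only view that does not copy the constant, so it avoids allocating a full array; call .copy() on the result if it has to be writable.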
Using np.vectorize() in this case amounts to a Python-level loop over the elements of x, y and z, and that's why it becomes slower. You don't need np.vectorize() if you tell lambdify() to use NumPy's functions, which is exactly what you are doing. Then, using:
def sympy_function():
    x, y, z = sp.Symbol("x"), sp.Symbol("y"), sp.Symbol("z")
    T = (1 - sp.cos(2*sp.pi*x))*(1 - sp.cos(2*sp.pi*y))*sp.sin(sp.pi*z)*0.1
    lambda_function = sp.lambdify((x, y, z), T, "numpy")
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = lambda_function(x,y,z)
    return T
makes the performance comparable:
In [26]: np.allclose(numpy_function(), sympy_function())
Out[26]: True
In [27]: timeit numpy_function()
100 loops, best of 3: 4.08 ms per loop
In [28]: timeit sympy_function()
100 loops, best of 3: 5.52 ms per loop

Python Multiple Simple Linear Regression

Note this is not a question about multiple regression, it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).
I have two m x n arrays x and y. The rows correspond to each other, and each pair is the set of (x,y) points for a measurement. That is, plt.plot(x.T, y.T, '.') would plot each of m datasets/measurements.
I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress(). (Assume I don't want solutions based on doing linear algebra with the matrices but instead want to work with this function, or an equivalent black-box function.) I could try np.vectorize, but the docs indicate it also loops.
With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, %%timeit returns, using a small dataset (commented out):
(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop
To try magnifying this, I made a much bigger random dataset (dimension trials x trials):
(loop, trials = 1000) 1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000) 1 loops, best of 3: 256 ms per loop
(map, trials = 10000) 1 loops, best of 3: 2.37 s per loop
That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = np.shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + intercept)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
Single variable linear regression is simple enough to vectorize it manually:
def multiple_linregress(x, y):
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))
With some made up data:
m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)
it outperforms your looping options by a fair margin:
%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
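As a sanity check on the vectorized formula (slope = sum of centered x times centered y, divided by sum of centered x squared, per row), the sketch below compares it with one np.polyfit call per row. The small 5x20 random inputs and the fixed seed are assumptions chosen to keep the check quick; np.polyfit is used as the reference here instead of scipy.stats.linregress only to keep the example NumPy-only.

```python
import numpy as np

def multiple_linregress(x, y):
    """Row-wise least-squares line fit; returns columns (slope, intercept)."""
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))

rng = np.random.RandomState(42)
x = rng.rand(5, 20)
y = rng.rand(5, 20)

fast = multiple_linregress(x, y)
# Reference: a degree-1 fit per row; polyfit returns (slope, intercept).
slow = np.array([np.polyfit(xr, yr, 1) for xr, yr in zip(x, y)])
print(np.allclose(fast, slow))
```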
