possibly speed up matrix multiplications in loop - python

I asked a question here with the details: https://math.stackexchange.com/questions/4381785/possibly-speed-up-matrix-multiplications
In short, I am trying to create a P x N matrix, X, with typical element \sum_{j,k; j,k \neq i} w_{jp} A_{jk} Y_{kp}, where w is P x N, A is N x N, and Y is P x N. See the link above for a rendered version of that formula.
I'm providing an MWE here to see how I can correct the code (the calculations seem correct, just incomplete; see below) and, more importantly, how to speed this up however possible:
import numpy as np

w = np.array([[2,1],[3,7]])
A = np.array([[2,1],[9,-1]])
Y = np.array([[6,2],[11,8]])
N = w.shape[1]
P = w.shape[0]
X = np.zeros((P, N))
for p in range(P):
    for i in range(N-1):
        for j in range(N-1):
            X[p,i] = np.delete(w,i,1)[i,p]*np.delete(np.delete(A,i,0),i,1)[i,j]*np.delete(Y.T,i,0)[j,p]
The output looks like:
array([[ -2.,   0.],
       [-56.,   0.]])
If we take the (1,1) element of X_{ip}, the value can be understood using the formula provided above:
\sum_{j,k; j,k \neq i} w_{j1} A_{jk} Y_{k1} = w_{12} A_{22} Y_{12} = 1 * -1 * 2 = -2, as it is in the output.
The (1,2) element of X_{ip} should be:
\sum_{j,k; j,k \neq i} w_{j2} A_{jk} Y_{k2} = w_{22} A_{22} Y_{22} = 7 * -1 * 8 = -56, as it is in the output.
But I am not getting the correct answer for the final column of X, because my range runs to N-1 rather than N (with N I received an IndexError: out of bounds). More importantly, here N = P = 2, but I have large N and P, and the code as written takes a very long time to run. Any suggestions would be greatly appreciated.

Since the delete functions depend only on i, I factored them out, and reordered the loops. Also corrected the w1 index order.
In [274]: w = np.array([[2,1],[3,7]])
     ...: A = np.array([[2,1],[9,-1]])
     ...: Y = np.array([[6,2],[11,8]])
     ...: N = w.shape[1]
     ...: P = w.shape[0]
     ...: X = np.zeros((P, N))
     ...: for i in range(N-1):
     ...:     print('i',i)
     ...:     w1 = np.delete(w,i,1)
     ...:     a1 = np.delete(np.delete(A,i,0),i,1)
     ...:     y1 = np.delete(Y.T,i,0)
     ...:     print(w1.shape, a1.shape, y1.shape)
     ...:     print(w1@a1@y1)
     ...:     print(np.einsum('ij,jk,li->i',w1,a1,y1))
     ...:     for p in range(P):
     ...:         for j in range(N-1):
     ...:             X[p,i] = w1[p,i]*a1*y1[j,p]
     ...:
i 0
(2, 1) (1, 1) (1, 2)
[[ -2  -8]
 [-14 -56]]
[ -2 -56]
In [275]: X
Out[275]:
array([[ -2.,   0.],
       [-56.,   0.]])
Your [-2, -56] values are the diagonal of w1@a1@y1, or equivalently the einsum result. The 0's are left over from the original np.zeros, because i only runs over range(1).
This should be faster because the delete is not repeated unnecessarily. np.delete is still relatively expensive, but I haven't tried to figure out exactly what you are doing.
Didn't your question initially have (2,3) and (3,3) arrays? That, or something a bit larger, may be more general and informative.
edit
I think this is closer to the math expressions:
def foo(w,A,Y):
    P, N = w.shape
    X = np.zeros((P, N))
    for i in range(N):
        #w1 = np.delete(w,i,1)
        a1 = np.delete(A,i,1)
        y1 = np.delete(Y.T,i,0)
        print(a1.shape, y1.shape)
        for p in range(P):
            for j in range(N-1):
                X[p,i] += w[p,i]*a1[i,j]*y1[j,p]
    return X
we can get rid of the j loop with:
def foo1(w,A,Y):
    P, N = w.shape
    X = np.zeros((P, N))
    for i in range(N):
        a1 = np.delete(A,i,1)
        y1 = np.delete(Y.T,i,0)
        print(a1.shape, y1.shape)
        for p in range(P):
            X[p,i] = w[p,i]*np.dot(a1[i,:],y1[:,p])
    return X
with
...: w3 = np.array([[2,1,0],[3,7,0.5]])
...: A3 = np.array([[2,1,0],[9,0,8],[1,2,5]])
...: Y3 = np.array([[6,2,-1],[11,8,-7]])
...: w2 = np.array([[2,1],[3,7]])
...: A2 = np.array([[2,1],[9,-1]])
...: Y2 = np.array([[6,2],[11,8]])
both produce
In [372]: foo(w2,A2,Y2)
(2, 1) (1, 2)
...
Out[372]:
array([[  4.,  54.],
       [ 24., 693.]])
In [373]: foo(w3,A3,Y3)
(3, 2) (2, 2)
...
Out[373]:
array([[  4. ,  46. ,   0. ],
       [ 24. , 301. ,  13.5]])
and after more fiddling:
def foo4(w,A,Y):
    P, N = w.shape
    X = np.zeros((P, N))
    for i in range(N):
        a1 = np.delete(A,i,1)
        y1 = np.delete(Y.T,i,0)
        X[:,i] = np.einsum('j,jp->p',a1[i,:],y1)
        # X[:,i] = a1[i,:]@y1
    return X*w
I suspect it is possible to do w*(A@Y.T) and then subtract an array that involves A[:,i] and Y[:,i], but haven't figured out that array.
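Following up on that hunch: the array to subtract only involves the diagonal of A. From foo1, X[p,i] = w[p,i] * sum_{j != i} A[i,j]*Y[p,j], which is the full row product minus the j = i term. A sketch of a fully vectorized version (my own derivation, worth double-checking):

def foo5(w, A, Y):
    # X[p,i] = w[p,i] * sum_{j != i} A[i,j]*Y[p,j]
    #        = w[p,i] * ((Y @ A.T)[p,i] - A[i,i]*Y[p,i])
    return w * (Y @ A.T - Y * np.diag(A))

In a quick check this reproduces the foo outputs above for both the (w2,A2,Y2) and (w3,A3,Y3) test sets, with no Python loop and no np.delete.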

Related

rewriting loop functions in numpy without using for or while

I'm trying to reproduce the following functions using the numpy library. I want to produce an equivalent definition without using the keywords for or while. I'm guessing you need to use broadcasting, newaxis, and reshape from numpy, but I'm new to numpy, and doing loops without "for" or "while" has been a mind-bender for me, especially when trying to work with nested loops.
import numpy as np

def _bcast(x):
    x1, x2 = x
    y = np.empty(x1.shape)
    for i in range(x1.shape[0]):
        for j in range(x1.shape[1]):
            for k in range(x1.shape[2]):
                y[i,j,k] = (x1[i,j,k]+4)*(4*x2[j,k] - 4)
    return y

def _bcast_ax(x):
    x1, x2 = x
    y = np.empty((x1.shape[0], x2.shape[0], x2.shape[1]))
    for i in range(x1.shape[0]):
        for j in range(x2.shape[0]):
            for k in range(x2.shape[1]):
                y[i,j,k] = (4+x1[i,k])*(4*x2[j,k]-4)
    return y

def bcast(x):
    return (x1+4) * (4*x2 - 4)

def bcast_ax(x):
    return (x**2)*(x[1]*2)*(x[2]**4)
I tried doing the following for these two functions, but they are not working. Just to clarify, I need this test to pass, with both _bcast and bcast producing the same result, and likewise for _bcast_ax and bcast_ax:
def test_bcast(self):
    def _bcast(x):
        x1, x2 = x
        y = np.empty(x1.shape)
        for i in range(x1.shape[0]):
            for j in range(x1.shape[1]):
                for k in range(x1.shape[2]):
                    y[i,j,k] = (x1[i,j,k]+4)*(4*x2[j,k] - 4)
        return y
    X = [(np.random.randn(3,4,5), np.random.randn(4,5)) for _ in range(3)]
    self._test_fun(ac.bcast, _bcast, X)
Focusing on the
y[i,j,k] = (x1[i,j,k]+4)*(4*x2[j,k] - 4)
That means y and x1 have the same shape, and x2 has the same last 2 dimensions. We can give x2 a new leading dimension with x2[None,...]:
y = (x1+4)*(4*x2[None,...] - 4)
but by the rules of broadcasting new leading dimensions are automatic
y = (x1+4)*(4*x2-4)
should work.
The key is to understand broadcasting.
testing
In [169]: x1, x2 = np.arange(24).reshape(2,3,4), np.arange(12).reshape(3,4)
In [170]: y = np.empty(x1.shape)
     ...: for i in range(x1.shape[0]):
     ...:     for j in range(x1.shape[1]):
     ...:         for k in range(x1.shape[2]):
     ...:             y[i,j,k] = (x1[i,j,k]+4)*(4*x2[j,k] - 4)
     ...:
In [171]: y
Out[171]:
array([[[ -16.,    0.,   24.,   56.],
        [  96.,  144.,  200.,  264.],
        [ 336.,  416.,  504.,  600.]],

       [[ -64.,    0.,   72.,  152.],
        [ 240.,  336.,  440.,  552.],
        [ 672.,  800.,  936., 1080.]]])
In [172]: (x1+4)*(4*x2-4)
Out[172]:
array([[[ -16,    0,   24,   56],
        [  96,  144,  200,  264],
        [ 336,  416,  504,  600]],

       [[ -64,    0,   72,  152],
        [ 240,  336,  440,  552],
        [ 672,  800,  936, 1080]]])
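The second loop function, _bcast_ax, needs one explicit new axis, since x1 and x2 contribute different output dimensions. A sketch along the same lines (my own, assuming x1 is (I, K) and x2 is (J, K), as the loops imply):

def bcast_ax(x):
    x1, x2 = x
    # x1[:, None, :] is (I, 1, K); x2[None, :, :] is (1, J, K);
    # broadcasting gives the (I, J, K) result of the triple loop
    return (4 + x1[:, None, :]) * (4 * x2[None, :, :] - 4)

# quick check against the loop version
x1, x2 = np.random.randn(3, 5), np.random.randn(4, 5)
y = np.empty((3, 4, 5))
for i in range(3):
    for j in range(4):
        for k in range(5):
            y[i, j, k] = (4 + x1[i, k]) * (4 * x2[j, k] - 4)
assert np.allclose(bcast_ax((x1, x2)), y)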

Protection against "index 0 is out of bounds for axis 0 with size 0" error in Python

I have code which produces a specific distribution of points on the graph of the function tan(), bounded below and above by straight lines:
import matplotlib.pyplot as plt
import numpy as np
import sys
import itertools
import multiprocessing
import tqdm

ic = range(1,10)
jc = range(1,10)
paramlist = list(itertools.product(ic,jc))

def func(params):
    ic = params[0]
    jc = params[1]
    fig = plt.figure(1, figsize=(10,6))
    x_all = np.linspace(0, 10*np.pi, 10000, endpoint=False)
    x_above = x_all[ (-0.01)*ic*x_all < np.tan(x_all) ]
    x = x_above[ np.tan(x_above) < 0.01*jc*x_above ]
    y = np.tan(x)
    y2 = 0.01*jc*x
    y3 = (-0.01)*ic*x
    y_up = np.diff(y) > 0
    y_diff = np.where( y_up, np.diff(y), 0 )
    x_diff = np.where( y_up, np.diff(x), 0 )
    diffs = np.sqrt( x_diff**2 + y_diff**2 )
    length = diffs.sum()
    numbers = [2,4,6,8,10,12,14,16,18,20]
    p2 = []
    for d in range(len(numbers)):
        cumlenth = np.cumsum(diffs)
        s = np.abs(np.diff(np.sign(cumlenth-numbers[d]))).astype(bool)
        c = np.argwhere(s)[0][0]
        p = x[c], y[c]
        p2.append(p)
    p3 = sorted(p2, key=lambda x: x[0])
    x_max = p3[len(p3)-1][0]
    p4 = sorted(p2, key=lambda x: x[1])
    y_min = p4[0][1]
    y_max = p4[len(p3)-1][1]
    for b in range(len(p2)):
        plt.scatter( p2[b][0], p2[b][1], color="crimson", s=8)
    plt.plot(x, np.tan(x))
    plt.plot(x, y2)
    plt.plot(x, y3)
    ax = plt.gca()
    ax.set_xlim([0, x_max+0.5])
    ax.set_ylim([y_min-0.5, y_max+0.5])
    plt.savefig('C:\\Users\\tkp\\Desktop\\wykresy_4\\i='+str(ic)+'_j='+str(jc)+'.png', bbox_inches='tight')
    plt.show()

if __name__ == '__main__':
    p = multiprocessing.Pool(4)
    for params in tqdm.tqdm(p.imap_unordered(func, paramlist), total=len(paramlist)):
        #pass
        sys.stdout.write('\r'+ str(params))
        sys.stdout.flush()
    p.close()
    p.join()
For example, I get a plot like this (image omitted).
The problem is that if I set the range in x_all = np.linspace(0, 10*np.pi, 10000, endpoint=False) too small, I get the error index 0 is out of bounds for axis 0 with size 0. How can I protect myself against this? Or maybe in this case I can set a variable range in the "linspace" function?
Where does this error occur? That's a fundamental piece of information - for us, but especially for you!
@edison says it's in the argwhere expression. I'll try to recreate that step, starting with a guess as to what diffs looks like:
In [8]: x = np.ones(5)*.1
In [9]: x
Out[9]: array([0.1, 0.1, 0.1, 0.1, 0.1])
In [10]: s = np.cumsum(x)
In [11]: s
Out[11]: array([0.1, 0.2, 0.3, 0.4, 0.5])
In [12]: s-1
Out[12]: array([-0.9, -0.8, -0.7, -0.6, -0.5])
In [13]: np.sign(s-1)
Out[13]: array([-1., -1., -1., -1., -1.])
In [14]: np.diff(np.sign(s-1))
Out[14]: array([0., 0., 0., 0.])
In [15]: np.abs(np.diff(np.sign(s-1)))
Out[15]: array([0., 0., 0., 0.])
In [16]: np.abs(np.diff(np.sign(s-1))).astype(bool)
Out[16]: array([False, False, False, False])
Regardless of the details to this point, it's a good guess that s is an array of all False. where finds the True elements in that array; there are none.
In [17]: np.where(_)
Out[17]: (array([], dtype=int64),)
argwhere is the transpose of this - one column for each dimension, and one row for each found item.
In [18]: np.argwhere(_)
Out[18]: array([], shape=(0, 2), dtype=int64)
In [19]: _[0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-aa79beb95eae> in <module>
----> 1 _[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
So your first line of defense is to check the shape of the returned array:
c = np.argwhere(s)
if c.shape[0] > 0:
    c = c[0,0]
    p = x[c], y[c]
else:
    # what do you want to do if none of `s` are true?
    pass
You can work backwards from there, taking care to ensure that the diffs or numbers are correct, and always find a valid c. But regardless, when using where or argwhere, be careful about assuming it has found a given number of items.
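For a 1-D mask like s here, np.flatnonzero is a slightly leaner way to write the same guard (a sketch, not part of the original code):

idx = np.flatnonzero(s)    # 1-D indices of the True elements
if idx.size:
    c = idx[0]
    p = x[c], y[c]
else:
    pass                   # handle the no-match case explicitly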

Numpy outer addition of subarrays

Is there a way, in numpy, to perform what amounts to an outer addition of subarrays?
That is to say, I have 2 arrays of the form 2x2xNxM, which may each be considered a stack of 2x2 matrices N high and M wide. I would like to add each of these matrices to each matrix from the other array, to form a 2x2xNxMxNxM array in which the last four indices correspond to the indices in my initial two arrays so that I can index output[:,:,x1,y1,x2,y2] == a1[:,:,x1,y1] + a2[:,:,x2,y2].
If these were arrays of scalars, it would be trivial; all I'd have to do is:
A, B = a1.ravel(), a2.ravel()
four_D = (A[:, np.newaxis] + B).reshape(*a1.shape, *a2.shape)
for (x1, y1, x2, y2), added in np.ndenumerate(four_D):
    assert added == a1[x1,y1] + a2[x2,y2]
However, this doesn't work when a1 and a2 consist of matrices. I could, of course, use nested for loops, but my dataset is going to be fairly large, and I'm expecting to run this over multiple datasets.
Is there an efficient way to do this?
Extend arrays to have more dimensions and then leverage broadcasting -
output = a1[...,None,None] + a2[...,None,None,:,:]
Sample run -
In [38]: # Setup input arrays
    ...: N = 3
    ...: M = 4
    ...: a1 = np.random.rand(2,2,N,M)
    ...: a2 = np.random.rand(2,2,N,M)
    ...:
    ...: output = np.zeros((2,2,N,M,N,M))
    ...: for x1 in range(N):
    ...:     for x2 in range(N):
    ...:         for y1 in range(M):
    ...:             for y2 in range(M):
    ...:                 output[:,:,x1,y1,x2,y2] = a1[:,:,x1,y1] + a2[:,:,x2,y2]
    ...:
    ...: output1 = a1[...,None,None] + a2[...,None,None,:,:]
    ...:
    ...: print np.allclose(output, output1)
True
Just as for scalars, inserting additional axes works for higher-dimensional arrays too (this is called broadcasting):
import numpy as np
a1 = np.random.randn(2, 2, 3, 4)
a2 = np.random.randn(2, 2, 3, 4)
added = a1[..., np.newaxis, np.newaxis] + a2[..., np.newaxis, np.newaxis, :, :]
print(added.shape) # (2, 2, 3, 4, 3, 4)
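As a quick sanity check (my addition, reusing added, a1, a2 from the snippet above), the indexing identity from the question holds for the broadcast result:

# spot-check one arbitrary index combination
x1, y1, x2, y2 = 1, 2, 0, 3
assert np.allclose(added[:, :, x1, y1, x2, y2],
                   a1[:, :, x1, y1] + a2[:, :, x2, y2])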

Numpy array thresholding acceleration

I want to construct a np.array from another np.array using a conditional. For each value, if the condition is met, one operation has to be applied, otherwise another. The calculation I have written is ugly due to the conversion to a list and back. Can it be improved in terms of speed, by not converting to a list?
import numpy as np

THR = 1.0
THR_REZ = 1.0 / THR**2

def thresholded_function(x):
    if x < THR:
        return THR_REZ
    else:
        return 1.0 / x**2

rad2 = .....some_np_array.....
rez = np.array([thresholded_function(r2) for r2 in rad2])
Use np.where -
np.where(x < THR, THR_REZ, 1.0/x**2) # x is input array
Sample run -
In [267]: x = np.array([3,7,2,1,8])
In [268]: THR, THR_REZ = 5, 0
In [269]: np.where(x < THR, THR_REZ, 1.0/x**2)
Out[269]: array([ 0.        ,  0.02040816,  0.        ,  0.        ,  0.015625  ])
In [270]: def thresholded_function(x, THR, THR_REZ):
     ...:     if x < THR:
     ...:         return THR_REZ
     ...:     else:
     ...:         return 1.0 / x**2
In [272]: [thresholded_function(i,THR, THR_REZ) for i in x]
Out[272]: [0, 0.02040816326530612, 0, 0, 0.015625]
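One caveat worth knowing: np.where evaluates both branches over the whole array, so 1.0/x**2 is computed even where x < THR (and warns if x contains zeros). A masked-assignment sketch (my addition) that computes the expensive branch only where needed:

def thresholded_vec(x, THR=1.0):
    out = np.full(x.shape, 1.0 / THR**2)   # default: below-threshold value
    mask = x >= THR
    out[mask] = 1.0 / x[mask]**2           # compute only where needed
    return out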

Radial profile of 2D matrix with float indexes

I have a 2D data array and I'm trying to get a profile of values about its center in an efficient manner. The output should be two one-dimensional arrays: one with the distances from the center, the other with the mean of all the values in the original 2D array that lie at that distance from the center.
Each index has a non-integer distance from the center, which prevents me from using some already known solutions for the problem. Allow me to explain.
Consider these matrices
data = np.random.randn(5,5)
L = 2
x = np.arange(-L,L+1,1)*2.5
y = np.arange(-L,L+1,1)*2.5
xx, yy = np.meshgrid(x, y)
r = np.sqrt(xx**2. + yy**2.)
So the matrices are
In [30]: r
Out[30]:
array([[ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781],
       [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
       [ 5.        ,  2.5       ,  0.        ,  2.5       ,  5.        ],
       [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
       [ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781]])
In [31]: data
Out[31]:
array([[ 1.27603322,  1.33635284,  1.93093228,  0.76229675, -0.00956535],
       [ 0.69556071, -1.70829753,  1.19615919, -1.32868665,  0.29679494],
       [ 0.13097791, -1.33302719,  1.48226442, -0.76672223, -1.01836614],
       [ 0.51334771, -0.83863115, -0.41541794,  0.34743342,  0.1199237 ],
       [-1.02042539,  0.90739383, -2.4858624 , -0.07417987,  0.90748933]])
For this case the expected output should be array([ 0. , 2.5 , 3.53553391, 5. , 5.59016994, 7.07106781]) for the index of distances, and a second array of same length with the mean of all the values that are at those corresponding distances: array([ 0.98791323, -0.32496927, 0.37221219, -0.6209728 , 0.27986926, 0.04060628]).
From this answer there is a very nice function to compute the profile about any arbitrary point. However, the problem with his approach is that it approximates the distance r by the index distance. So his r for my case would be this:
array([[2, 2, 2, 2, 2],
       [2, 1, 1, 1, 2],
       [2, 1, 0, 1, 2],
       [2, 1, 1, 1, 2],
       [2, 2, 2, 2, 2]])
which is a pretty big difference for me, since I'm working with small matrices. This approximation, however, allows him to use np.bincount, which is pretty handy (but won't work for me).
I've been trying to expand this for float distance, like my version r, but so far no luck. bincount doesn't work with floats and histogram needs equally-spaced bins, which is not the case. Any suggestion?
Approach #1
def radial_profile_app1(data, r):
    mid = data.shape[0]//2
    ids = np.rint((r**2)/r[mid-1,mid]**2).astype(int).ravel()
    count = np.bincount(ids)
    R = data.shape[0]//2 # Radial profile radius
    R0 = R+1
    dists = np.unique(r[:R0,:R0][np.tril(np.ones((R0,R0),dtype=bool))])
    mean_data = (np.bincount(ids, data.ravel())/count)[count!=0]
    return dists, mean_data
For the given sample data -
In [475]: radial_profile_app1(data, r)
Out[475]:
(array([ 0.        ,  2.5       ,  3.53553391,  5.        ,  5.59016994,
         7.07106781]),
 array([ 1.48226442  , -0.3297520425, -0.8820454775, -0.3605795875,
         0.5696863263,  0.2883829525]))
Approach #2
def radial_profile_app2(data, r):
    R = data.shape[0]//2 # Radial profile radius
    range_arr = np.arange(-R,R+1)
    ids = (range_arr[:,None]**2 + range_arr**2).ravel()
    count = np.bincount(ids)
    R0 = R+1
    dists = np.unique(r[:R0,:R0][np.tril(np.ones((R0,R0),dtype=bool))])
    mean_data = (np.bincount(ids, data.ravel())/count)[count!=0]
    return dists, mean_data
Runtime test -
In [562]: # Setup inputs
     ...: N = 2001
     ...: data = np.random.randn(N,N)
     ...: L = (N-1)//2
     ...: x = np.arange(-L,L+1,1)*2.5
     ...: y = np.arange(-L,L+1,1)*2.5
     ...: xx, yy = np.meshgrid(x, y)
     ...: r = np.sqrt(xx**2. + yy**2.)
     ...:
In [563]: out01, out02 = radial_profile_app1(data, r)
     ...: out11, out12 = radial_profile_app2(data, r)
     ...:
     ...: print np.allclose(out01, out11)
     ...: print np.allclose(out02, out12)
     ...:
True
True
In [566]: %timeit radial_profile_app1(data, r)
     ...: %timeit radial_profile_app2(data, r)
     ...:
10 loops, best of 3: 114 ms per loop
10 loops, best of 3: 91.2 ms per loop
Got what I was expecting with this function:
def radial_prof(data, r):
    uniq = np.unique(r)
    prof = np.array([np.mean(data[r == un]) for un in uniq])
    return uniq, prof
But I'm still not happy with the fact that I had to use list comprehension (or a python loop), since it might be slow for very large matrices.
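One loop-free alternative (a sketch of my own, not from the answers above): np.unique with return_inverse=True assigns every element an integer label, namely the index of its distance in the unique array, and those integer labels make np.bincount usable even though r holds floats:

def radial_prof_vec(data, r):
    uniq, inv = np.unique(r, return_inverse=True)
    inv = inv.ravel()                      # one integer label per element of r
    # per-label sums divided by per-label counts = per-distance means
    prof = np.bincount(inv, weights=data.ravel()) / np.bincount(inv)
    return uniq, prof

This groups by exact float equality, just like r == un in the list comprehension, but does all the averaging in two bincount calls.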
Here is an indirect-sorting approach that should scale well if the batch size and/or the number of bins are large. The sorting is O(n log n); all the histogramming is O(n). I've also added a little unscientific speed test. For the speed test I use flat indexing, but I left the 2d index code in because it's more flexible when dealing with images of different sizes etc.
import numpy as np

# this need only be run once per batch
def r_to_ind(r, dist_bins="auto"):
    f = np.argsort(r.ravel())
    if dist_bins == "auto":
        rs = r.ravel()[f]
        bins = np.where(np.r_[True, rs[1:] != rs[:-1]])[0]
        dist_bins = rs[bins]
    else:
        bins = np.searchsorted(r.ravel()[f], dist_bins)
    denom = np.diff(np.r_[bins, r.size])
    return f, np.unravel_index(f, r.shape), bins, denom, dist_bins

# this is with adjustable offset
def profile_xy(image, yx, ij, bins, nynx, denom):
    (y, x), (i, j), (ny, nx) = yx, ij, nynx
    return np.add.reduceat(image[i + y - ny//2, j + x - nx//2], bins) / denom

# this is fixed
def profile_xy_no_offset(image, ij, bins, denom):
    return np.add.reduceat(image[ij], bins) / denom

# this is fixed and flat
def profile_xy_no_offset_flat(image, k, bins, denom):
    return np.add.reduceat(image.ravel()[k], bins) / denom

data = np.array([[ 1.27603322,  1.33635284,  1.93093228,  0.76229675, -0.00956535],
                 [ 0.69556071, -1.70829753,  1.19615919, -1.32868665,  0.29679494],
                 [ 0.13097791, -1.33302719,  1.48226442, -0.76672223, -1.01836614],
                 [ 0.51334771, -0.83863115, -0.41541794,  0.34743342,  0.1199237 ],
                 [-1.02042539,  0.90739383, -2.4858624 , -0.07417987,  0.90748933]])
r = np.array([[ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781],
              [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
              [ 5.        ,  2.5       ,  0.        ,  2.5       ,  5.        ],
              [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
              [ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781]])

f, (i, j), bins, denom, dist_bins = r_to_ind(r)
result = profile_xy(data, (2, 2), (i, j), bins, (5, 5), denom)
print(dist_bins)
# [ 0.          2.5         3.53553391  5.          5.59016994  7.07106781]
print(result)
# [ 1.48226442 -0.32975204 -0.88204548 -0.36057959  0.56968633  0.28838295]

#########################

from timeit import timeit

n = 2001
batch = 100
fake = 10
a = np.random.random((fake, n, n))
l = np.linspace(-1, 1, n)**2
r = sum(np.ix_(l, l))

def run_all():
    f, ij, bins, denom, dist_bins = r_to_ind(r)
    for b in range(batch):
        profile_xy_no_offset_flat(a[b % fake], f, bins, denom)

print(timeit(run_all, number=10))
# 47.4157 (for 10 batches of 100 images of size 2001x2001)
# and my computer is slower than Divakar's ;-)
I've made some more benchmarks comparing mine to @Divakar's approach 3, stripping out everything precomputable into a run-once-per-batch function. The general finding: they are similar; mine has a higher upfront cost but is then faster. But they only cross over at around 100 pictures per batch.
