Fastest way to compute upper-triangular matrix of geometric series (Python)

Thanks in advance for the help.
Using Python (mostly numpy), I am trying to compute an upper-triangular matrix where each row j contains the leading terms of a geometric series, shifted right so the 1 sits on the diagonal, with all rows using the same parameter.
For example, if my parameter is B (where abs(B) <= 1, i.e. B in [-1, 1]), then row 1 would be [1 B B^2 B^3 ... B^(N-1)], row 2 would be [0 1 B B^2 ... B^(N-2)], ..., and row N would be [0 0 0 ... 1].
This computation is key to a Bayesian Metropolis-Gibbs sampler, and so needs to be done thousands of times for new values of "B".
I have currently tried this two ways:
Method 1 - Mostly Vectorized:
B_Matrix = np.triu(np.dot(np.reshape(B**(-1*np.array(range(N))),(N,1)),np.reshape(B**(np.array(range(N))),(1,N))))
Essentially, this is the upper triangle part of a product of an Nx1 and 1xN set of matrices:
upper triangle ([1 B^(-1) B^(-2) ... B^(-(N-1))]' * [1 B B^2 B^3 ... B^(N-1)])
This works well for small N (algebraically it is correct), but for large N it errors out, and it also errors out for B=0 (which should be allowed). I believe this stems from B^(-N) overflowing to inf for small B and large N.
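For reference, the overflow is easy to reproduce in isolation (a minimal standalone sketch with an arbitrary small B and large N, separate from the sampler code):
import numpy as np

B, N = 0.1, 400
neg_powers = B ** (-1.0 * np.arange(N))   # 0.1**(-399) ~ 1e399, well past the float64 max (~1.8e308)
print(np.isinf(neg_powers).any())         # True -> the outer product in Method 1 then contains inf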
Method 2:
B_Matrix = np.zeros((N,N))
B_Row_1 = B**(np.array(range(N)))
for n in range(N):
    B_Matrix[n,n:] = B_Row_1[0:N-n]
So that just fills in the matrix row by row, but uses a loop which slows things down.
I was wondering if anyone had run into this before, or had any better ideas on how to compute this matrix in a faster way.
I've never posted on stackoverflow before, but didn't see this question anywhere, and thought I'd ask.
Let me know if there's a better place to ask this, and if I should provide anymore detail.

You could use scipy.linalg.toeplitz:
In [12]: n = 5
In [13]: b = 0.5
In [14]: toeplitz(b**np.arange(n), np.zeros(n)).T
Out[14]:
array([[ 1. , 0.5 , 0.25 , 0.125 , 0.0625],
[ 0. , 1. , 0.5 , 0.25 , 0.125 ],
[ 0. , 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 0. , 1. ]])
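For the sampler loop, a thin wrapper around this is probably all that is needed (a sketch; geo_upper is just an illustrative name):
import numpy as np
from scipy.linalg import toeplitz

def geo_upper(b, n):
    # The first column holds the geometric series, the zero second argument keeps the
    # Toeplitz matrix lower-triangular, and .T flips it to the desired upper-triangular form.
    return toeplitz(b**np.arange(n), np.zeros(n)).T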
If your use of the array is strictly "read only", you can play tricks with numpy strides to quickly create an array that uses only 2*n-1 elements (instead of n^2):
In [55]: from numpy.lib.stride_tricks import as_strided
In [56]: def make_array(b, n):
   ....:     vals = np.zeros(2*n - 1)
   ....:     vals[n-1:] = b**np.arange(n)
   ....:     a = as_strided(vals[n-1:], shape=(n, n), strides=(-vals.strides[0], vals.strides[0]))
   ....:     return a
   ....:
In [57]: make_array(0.5, 4)
Out[57]:
array([[ 1. , 0.5 , 0.25 , 0.125],
[ 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 1. ]])
If you will modify the array in-place, make a copy of the result returned by make_array(b, n). That is, arr = make_array(b, n).copy().
The function make_array2 incorporates the suggestion @Jaime made in the comments:
In [30]: def make_array2(b, n):
   ....:     vals = np.zeros(2*n-1)
   ....:     vals[n-1] = 1
   ....:     vals[n:] = b
   ....:     np.cumproduct(vals[n:], out=vals[n:])
   ....:     a = as_strided(vals[n-1:], shape=(n, n), strides=(-vals.strides[0], vals.strides[0]))
   ....:     return a
   ....:
In [31]: make_array2(0.5, 4)
Out[31]:
array([[ 1. , 0.5 , 0.25 , 0.125],
[ 0. , 1. , 0.5 , 0.25 ],
[ 0. , 0. , 1. , 0.5 ],
[ 0. , 0. , 0. , 1. ]])
make_array2 is more than twice as fast as make_array:
In [35]: %timeit make_array(0.99, 600)
10000 loops, best of 3: 23.4 µs per loop
In [36]: %timeit make_array2(0.99, 600)
100000 loops, best of 3: 10.7 µs per loop

Related

apply diminishing returns across 2 axis with numpy

How can I use numpy to apply a level of diminishing returns across 2 axes? I'm working with temperature model data for a fixed (x,y) location, so the axes I'm working with are t_axis (time) and z_axis (vertical atmosphere).
The values below don't really apply to what would make sense for the normal atmosphere, but let's pretend.
a1=np.arange(16).reshape(4,4)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
Assume the information above is the current forecast model data for my location, and it is predicting a temp of 12°C at the surface right now. But when I walk outside it's actually 10°C, so I want to adjust the model data and make that temperature 10°C.
z_axis=3
t_axis=0
a1[z_axis,t_axis]=10
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[10 13 14 15]]
But really what I want to do is apply a level of correction based on 2 variables: t_mod (diminished returns over time) and z_mod (diminished returns through the vertical atmosphere).
correction = -2
t_mod=.05#50%
z_mod=0.25#25%
# how can i generate this array from modifiers
a2=np.array([
[0,0,0,0],#6k feet above ground level (agl)
[0,0,0,0],#4k feet agl
[.25,.13,0,0],#2k feet agl
[1,.5,.25,0]#surface
# ^ ^ ^ ^__ +3 hour
# | | L__ +2 hour
# | L__ +1 hour
# L__ zero hour
])
a1+(a2*correction )
[[ 0. 1. 2. 3. ]
[ 4. 5. 6. 7. ]
[ 7.5 8.74 8.8 11. ]
[10. 12. 13.5 15. ]]
Is this the approach I should be using? If so, how can I generate a2 from the z and t axis modifiers?
How about this: we use linear stepping in the t and z directions and multiply the t and z factors for points inside the matrix:
import numpy as np

def shock_2d(t_mod, z_mod, n=4):
    ts = np.maximum(1 - np.arange(n)*t_mod, 0)
    zs = np.maximum(1 - np.arange(n)*z_mod, 0)
    shock = zs.reshape(-1,1) @ ts.reshape(1,-1)
    return np.flipud(shock)
eg
shock_2d(t_mod = 0.5, z_mod = 0.25)
Out:
array([[0.25 , 0.125, 0. , 0. ],
[0.5 , 0.25 , 0. , 0. ],
[0.75 , 0.375, 0. , 0. ],
[1. , 0.5 , 0. , 0. ]])
and
shock_2d(t_mod = 0.05, z_mod = 0.25)
Out:
array([[0.25 , 0.2375, 0.225 , 0.2125],
[0.5 , 0.475 , 0.45 , 0.425 ],
[0.75 , 0.7125, 0.675 , 0.6375],
[1. , 0.95 , 0.9 , 0.85 ]])
The last argument, n, is the size of the matrix.
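A minimal sketch of plugging this back into the question's correction step (values taken from the question; assumes numpy is imported and shock_2d is defined as above):
a1 = np.arange(16).reshape(4, 4).astype(float)
correction = -2
a2 = shock_2d(t_mod=0.05, z_mod=0.25)   # surface / zero-hour cell gets the full correction
adjusted = a1 + a2 * correction         # adjusted[3, 0] becomes 10.0, as in the question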

Scipy Sparse Matrix Loop Taking Forever - Need to make more efficient

What is the most efficient way, time- and memory-wise, of writing this loop with a sparse matrix (currently using csc_matrix)?
for j in range(0, reducedsize):
    xs = sum(X[:, j])
    X[:, j] = X[:, j] / xs.data[0]
example:
reduced size (int) - 2500
X (csc_matrix) - 908x2500
The loop does iterate but it takes a very long time compared to just using numpy.
In [388]: from scipy import sparse
Make a sample matrix:
In [390]: M = sparse.random(10,8,.2, 'csc')
Matrix sum:
In [393]: M.sum(axis=0)
Out[393]:
matrix([[1.95018736, 0.90924629, 1.93427113, 2.38816133, 1.08713479,
0. , 2.45435481, 0. ]])
Those 0's produce a warning when dividing, and nan in the results:
In [394]: M/_
/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py:599: RuntimeWarning: invalid value encountered in true_divide
return np.true_divide(self.todense(), other)
Out[394]:
matrix([[0. , 0. , 0. , 0. , 0.27079623,
nan, 0.13752665, nan],
[0. , 0. , 0. , 0. , 0. ,
nan, 0.32825122, nan],
[0. , 0. , 0. , 0. , 0. ,
nan, 0. , nan],
...
nan, 0. , nan]])
The 0s also cause a problem with your approach:
In [395]: for i in range(8):
     ...:     xs = sum(M[:,i])
     ...:     M[:,i] = M[:,i]/xs.data[0]
     ...:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-395-0195298ead19> in <module>
1 for i in range(8):
2 xs = sum(M[:,i])
----> 3 M[:,i] = M[:,i]/xs.data[0]
4
IndexError: index 0 is out of bounds for axis 0 with size 0
But if we compare the columns without 0 sum the values match:
In [401]: Out[394][:,:5]
Out[401]:
matrix([[0. , 0. , 0. , 0. , 0.27079623],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.49648886, 0.25626608, 0. , 0.19162678, 0.72920377],
[0. , 0. , 0.30200765, 0. , 0. ],
[0.50351114, 0. , 0.30445113, 0.41129367, 0. ],
[0. , 0.74373392, 0. , 0. , 0. ],
[0. , 0. , 0.39354122, 0. , 0. ],
[0. , 0. , 0. , 0.39707955, 0. ]])
In [402]: M.A[:,:5]
Out[402]:
array([[0. , 0. , 0. , 0. , 0.27079623],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.49648886, 0.25626608, 0. , 0.19162678, 0.72920377],
[0. , 0. , 0.30200765, 0. , 0. ],
[0.50351114, 0. , 0.30445113, 0.41129367, 0. ],
[0. , 0.74373392, 0. , 0. , 0. ],
[0. , 0. , 0.39354122, 0. , 0. ],
[0. , 0. , 0. , 0.39707955, 0. ]])
Back in [394] I should have first converted the matrix sum to sparse, so the result would also be sparse. Sparse matrices don't have an elementwise divide, so I had to take the elementwise reciprocal of the dense sum first. The 0s are still a nuisance.
In [409]: M.multiply(sparse.csr_matrix(1/Out[393]))
...
Out[409]:
<10x8 sparse matrix of type '<class 'numpy.float64'>'
with 16 stored elements in Compressed Sparse Column format>
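Since those zero-sum columns are the remaining nuisance, one possible guard (a sketch in the same session: replace zero sums by 1, so empty columns simply stay empty instead of turning into nan):
s = np.asarray(M.sum(axis=0)).ravel()
s[s == 0] = 1.0                                   # leave empty columns untouched
M_scaled = M.multiply(sparse.csr_matrix(1.0 / s))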
If you want to do it without any memory overhead (in-place)
Always think about how the data is actually stored.
A small example on a csc matrix.
shape=(5,5)
X=sparse.random(shape[0], shape[1], density=0.5, format='csc')
print(X.todense())
[[0.12146814 0. 0. 0.04075121 0.28749552]
[0. 0.92208639 0. 0.44279661 0. ]
[0.63509196 0.42334964 0. 0. 0.99160443]
[0. 0. 0.25941113 0.44669367 0.00389409]
[0. 0. 0. 0. 0.83226886]]
i=0 #first column
print(X.data[X.indptr[i]:X.indptr[i+1]])
[0.12146814 0.63509196]
A Numpy solution
So the only thing we want to do here is to modify the nonzero entries column by column, in place. This can easily be done with a partly vectorized numpy solution: data is the array that contains all non-zero values, and indptr stores where each column begins and ends.
def Numpy_csc_norm(data,indptr):
    for i in range(indptr.shape[0]-1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        # Modify the view in place
        data[indptr[i]:indptr[i+1]] /= xs
Regarding performance, this in-place solution is already not too bad. If you want to improve performance further, you could use Cython, Numba, or some other compiled code that can be wrapped up in Python more or less easily.
A Numba solution
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def Numba_csc_norm(data, indptr):
    for i in nb.prange(indptr.shape[0]-1):
        acc = 0
        for j in range(indptr[i], indptr[i+1]):
            acc += data[j]
        for j in range(indptr[i], indptr[i+1]):
            data[j] /= acc
Performance
#Create a not too small example matrix
shape=(50_000,10_000)
X=sparse.random(shape[0], shape[1], density=0.001, format='csc')
#Not in-place from hpaulj
def hpaulj(X):
    acc=X.sum(axis=0)
    return X.multiply(sparse.csr_matrix(1./acc))
%timeit X2=hpaulj(X)
#6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Both variants are in-place,
#but this shouldn't have an influence on the timings
%timeit Numpy_csc_norm(X.data,X.indptr)
#79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#parallel=False -> faster on tiny matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#parallel=True -> faster on larger matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
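If the zero-sum columns discussed in the other answer are a concern here too, a guarded variant of the in-place NumPy loop might look like this (a sketch, not benchmarked):
import numpy as np

def Numpy_csc_norm_safe(data, indptr):
    for i in range(indptr.shape[0] - 1):
        col = data[indptr[i]:indptr[i+1]]    # view into the stored values of column i
        xs = col.sum()
        if xs != 0:                          # skip empty (or zero-sum) columns
            col /= xs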

Numpy dot product for group of rows

I am trying to calculate a dot product between two matrices, for each couple of rows.
I have matrix D with (u x 2) dimensions and matrix R with (u*2 x c) dimensions.
Below an example:
D = np.array([[0.02747092, 0.11233295],
[0.02747092, 0.07295284],
[0.01245856, 0.19935923],
[0.01245856, 0.13520913],
[0.11233295, 0.07295284]])
R = np.array([[-3. , 0. , 1. , -1. ],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[-2.33333333, -0.33333333, 1.66666667, -1.33333333],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[ 0. , -2. , 2. , -4. ],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[ 0.66666667, -3.33333333, 2.66666667, -4.33333333],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[-2.33333333, -0.33333333, 1.66666667, -1.33333333],
[-3. , 0. , 1. , -1. ]])
The result should be matrix M with dimensions (u x c) as follows (example of first row):
M = np.array([[-0.2185, 0.0825, 0.2195, -0.1645],
[...]])
Which is the result of the dot product between the first row of D and the first two rows of matrix R, as such:
D_ = np.array([[0.027, 0.11]])
R_ = np.array([[-3., 0., 1., -1.],
[-1.25, 0.75, 1.75, -1.25]])
D_.dot(R_)
I tried various ways of using np.tensordot after reshaping the D matrix into a tensor, but without any luck. I am looking for a vectorized solution that avoids loops (my current approach uses a loop and is quite slow).
Reshape R to 3D and use np.einsum -
np.einsum('ijk,ij->ik',R.reshape(len(D),2,-1),D)
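As a sanity check, and as an equivalent batched-matmul formulation (a sketch assuming D and R are defined as in the question):
import numpy as np

R3 = R.reshape(len(D), 2, -1)                 # (u, 2, c): the two R rows that belong to each D row
M_einsum = np.einsum('ijk,ij->ik', R3, D)
M_matmul = (D[:, None, :] @ R3)[:, 0, :]      # same contraction written as a batched matrix product
assert np.allclose(M_einsum, M_matmul)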

Linear least-squares solution for 3d inputs

Problem
Say I have two arrays with the following shapes:
y.shape is (z, b). Picture this as a collection of z (b,) y vectors.
x.shape is (z, b, c). Picture this as a collection of z (b, c) multivariate x matrices.
My intent is to find the z independent vectors of least-squares coefficient solutions. I.e. the first solution is from regressing y[0] on x[0], where those inputs have shape (b, ) and (b, c) respectively. (b observations, c features.) The result would be shape (z, c).
Some example data
np.random.seed(123)
x = np.random.randn(190, 20, 3)
y = np.random.randn(190, 20) # Assumes no intercept term
# First vector of coefficients
np.linalg.lstsq(x[0], y[0])[0]
# array([-0.12823781, -0.3055392 , 0.11602805])
# Last vector of coefficients
np.linalg.lstsq(x[-1], y[-1])[0]
# array([-0.02777503, -0.20425779, 0.22874169])
NumPy's least-squares solver lstsq can't operate on these. (With my intended result being shape (190, 3), or 190 vectors of 3 coefficients each. Each (3,) vector is one coefficient set from regressions with n=20.)
Is there a workaround to get to the coefficient matrices wrapped into one result array? I'm thinking possibly of the matrix formulation:
For a 1d y and 2d x this would just be:
def coefs(y, x):
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))
but I'm having trouble getting this to accept a 2d y and 3d x as above.
Lastly, I'm curious as to why lstsq has trouble here. Is there a simple answer as to why the inputs must be at most 2d?
Here is a small demo to demonstrate:
the problems mentioned in my comments
a mostly empirical analysis of looped-lstsq vs. one-step-embedded-lstsq
(with some surprising result at the end which is to be taken with a grain of salt):
Code
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_regression
from time import perf_counter as pc
np.set_printoptions(edgeitems=3,infstr='inf',
linewidth=160, nanstr='nan', precision=1,
suppress=False, threshold=1000, formatter=None)
""" Create task """
Z, B, C = 4, 3, 2
Zs = []
Bs = []
for i in range(Z):
    X, y, = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)
""" Independent looping """
print('LOOPED CALLS')
start = pc()
result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]
end = pc()
print('lhs-shape: ', Zs.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs) / np.product(Zs.shape))
print('used time: ', end-start)
print(result)
""" Embedding in one """
print('EMBEDDING INTO ONE CALL')
Zs_ = sp.block_diag([Zs[i] for i in range(Z)]).todense() # convenient to use scipy.sparse
# oops: there is a dense-one too:
# -> scipy.linalg.block_diag
Bs_ = Bs.flatten()
start = pc() # one could argue if transform above should be timed too!
result_ = np.linalg.lstsq(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', np.count_nonzero(Zs_) / np.product(Zs_.shape))
print('used time: ', end-start)
print(result_)
Output
LOOPED CALLS
lhs-shape: (4, 3, 2)
lhs-dense-fill-ratio: 1.0
used time: 0.0005415275241778155
[[ 89.2 43.8]
[ 68.5 41.9]
[ 61.9 20.5]
[ 5.1 44.1]]
EMBEDDING INTO ONE CALL
lhs-shape: (12, 8)
lhs-dense-fill-ratio: 0.25
used time: 0.00015907748341232328
[ 89.2 43.8 68.5 41.9 61.9 20.5 5.1 44.1]
lstsq problem-dimensions for each case
While the original data looks like:
[[[ 2.2 1. ]
[-1. 1.9]
[ 0.4 1.8]]
[[-1.1 -0.5]
[-2.3 0.9]
[-0.6 1.6]]
[[ 1.6 -2.1]
[-0.1 -0.4]
[-0.8 -1.8]]
[[-0.3 -0.4]
[ 0.1 -1.9]
[ 1.8 0.4]]]
[[ 242.7 -5.4 112.9]
[ -95.7 -121.4 26.2]
[ 57.9 -12. -88.8]
[ -17.1 -81.6 28.4]]
and each solve looks like:
LHS
[[ 2.2 1. ]
[-1. 1.9]
[ 0.4 1.8]]
RHS
[ 242.7 -5.4 112.9]
the embedded problem (one solving-step) looks like:
LHS
[[ 2.2 1. 0. 0. 0. 0. 0. 0. ]
[-1. 1.9 0. 0. 0. 0. 0. 0. ]
[ 0.4 1.8 0. 0. 0. 0. 0. 0. ]
[ 0. 0. -1.1 -0.5 0. 0. 0. 0. ]
[ 0. 0. -2.3 0.9 0. 0. 0. 0. ]
[ 0. 0. -0.6 1.6 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 1.6 -2.1 0. 0. ]
[ 0. 0. 0. 0. -0.1 -0.4 0. 0. ]
[ 0. 0. 0. 0. -0.8 -1.8 0. 0. ]
[ 0. 0. 0. 0. 0. 0. -0.3 -0.4]
[ 0. 0. 0. 0. 0. 0. 0.1 -1.9]
[ 0. 0. 0. 0. 0. 0. 1.8 0.4]]
RHS
[ 242.7 -5.4 112.9 -95.7 -121.4 26.2 57.9 -12. -88.8 -17.1 -81.6 28.4]
There is no way, given the assumptions / standard form of lstsq, to embed this independence assumption without introducing a lot of zeros!
lstsq is:
- not able to exploit sparsity, as the core algorithm is a dense one
  - take a look at the transformed shape: this will be heavy in terms of memory and computation!
- not able to use information from fit 0 to speed up something in fit 1
  - they are independent after all; no information gain in theory
- able to vectorize a lot (but that's not helping in general)
Your example-shapes
Trimmed output for your specific shapes; this time we also test a sparse solver:
Added code (at the end)
print('EMBEDDING -> sparse-solver')
Zs_ = sp.csc_matrix(Zs_) # sparse!
start = pc()
result__ = sp.linalg.lsmr(Zs_, Bs_)[0]
end = pc()
print('lhs-shape: ', Zs_.shape)
print('lhs-dense-fill-ratio: ', Zs_.nnz / np.product(Zs_.shape))
print('used time: ', end-start)
print(result__)
Output
LOOPED CALLS
lhs-shape: (190, 20, 3)
lhs-dense-fill-ratio: 1.0
used time: 0.01716980329027777
[ 11.9 31.8 29.6]
...
[ 44.8 28.2 62.3]]
EMBEDDING INTO ONE CALL
lhs-shape: (3800, 570)
lhs-dense-fill-ratio: 0.00526315789474
used time: 0.6774500271820254
[ 11.9 31.8 29.6 ... 44.8 28.2 62.3]
EMBEDDING -> sparse-solver
lhs-shape: (3800, 570)
lhs-dense-fill-ratio: 0.00526315789474
used time: 0.0038423098412817547 # a bit of a surprise
[ 11.9 31.8 29.6 ... 44.8 28.2 62.3]
Conclusion
In general: solve independently!
In some cases, the task above will be solved faster when using the sparse-solver approach, but the analysis here is hard as we are comparing two completely different algorithms (direct vs. iterative), and the results might change in some dramatic way for other data.
Here is the linear algebra solution, with the speed right on par with @sascha's looped version for smaller arrays.
print('Matrix formulation')
start = pc()
result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1,2), Zs)),
                              np.matmul(Zs.swapaxes(1,2), np.atleast_3d(Bs))))
end = pc()
print('used time: ', end-start)
print(result)
Output:
Matrix formulation
used time: 0.00015713176480858237
[[ 89.2 43.8]
[ 68.5 41.9]
[ 61.9 20.5]
[ 5.1 44.1]]
However, @sascha's answer wins out easily for much larger inputs, especially as the size of the third dimension grows (number of exogenous variables/features).
Z, B, C = 400, 300, 20
Zs = []
Bs = []
for i in range(Z):
    X, y, = make_regression(n_samples=B, n_features=C, random_state=i)
    Zs.append(X)
    Bs.append(y)
Zs = np.array(Zs)
Bs = np.array(Bs)
# --------
print('Matrix formulation')
start = pc()
result = np.squeeze(np.matmul(np.linalg.inv(np.matmul(Zs.swapaxes(1,2), Zs)),
                              np.matmul(Zs.swapaxes(1,2), np.atleast_3d(Bs))))
end = pc()
print('used time: ', end-start)
print(result)
# --------
print('Looped calls')
start = pc()
result = np.empty((Z, C))
for z in range(Z):
    result[z] = np.linalg.lstsq(Zs[z], Bs[z])[0]
end = pc()
print('used time: ', end-start)
print(result)
Output:
Matrix formulation
used time: 0.24000779996413257
[[ 1.2e+01 1.3e-15 6.3e+01 ..., -8.9e-15 5.3e-15 -1.1e-14]
[ 5.8e+01 2.7e-14 -4.8e-15 ..., 8.5e+01 -1.5e-14 1.8e-14]
[ 1.2e+01 -1.2e-14 4.4e-16 ..., 6.0e-15 8.6e+01 6.0e+01]
...,
[ 2.9e-15 6.6e+01 1.1e-15 ..., 9.8e+01 -2.9e-14 8.4e+01]
[ 2.8e+01 6.1e+01 -1.2e-14 ..., -2.5e-14 6.3e+01 5.9e+01]
[ 7.0e+01 3.3e-16 8.4e+00 ..., 4.1e+01 -6.2e-15 5.8e+01]]
Looped calls
used time: 0.17400113389658145
[[ 1.2e+01 7.1e-15 6.3e+01 ..., -2.8e-14 1.1e-14 -4.8e-14]
[ 5.8e+01 -5.7e-14 -4.9e-14 ..., 8.5e+01 -5.3e-15 6.8e-14]
[ 1.2e+01 3.6e-14 4.5e-14 ..., -3.6e-15 8.6e+01 6.0e+01]
...,
[ 6.3e-14 6.6e+01 -1.4e-13 ..., 9.8e+01 2.8e-14 8.4e+01]
[ 2.8e+01 6.1e+01 -2.1e-14 ..., -1.4e-14 6.3e+01 5.9e+01]
[ 7.0e+01 -1.1e-13 8.4e+00 ..., 4.1e+01 -9.4e-14 5.8e+01]]
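A commonly suggested tweak to the matrix formulation above (a sketch, not timed here) is to replace the explicit inverse with np.linalg.solve, which handles the batched normal equations directly and is usually better conditioned:
XtX = np.matmul(Zs.swapaxes(1, 2), Zs)                  # (Z, C, C)
Xty = np.matmul(Zs.swapaxes(1, 2), np.atleast_3d(Bs))   # (Z, C, 1)
result = np.squeeze(np.linalg.solve(XtX, Xty))          # (Z, C), one coefficient vector per problem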

python numpy - improve efficiency on column-wise cosine similarity

I am fairly new to programming and have never used numpy before.
So, I have a matrix with 19001 x 19001 dimensions. It contains a lot of zeros, so it is relatively sparse. I wrote some code to compute the pairwise cosine similarity of the columns if the item in the row is non-zero. I add all the pairwise similarity values of one row and do some mathematical operations on them to obtain one value for each row of the matrix in the end (see code below). It does what it is supposed to; however, as it deals with a great number of dimensions, it is really slow. Is there any way to modify my code to make it more efficient?
import numpy as np
from scipy.spatial.distance import cosine
row_number = 0
out_file = open('outfile.txt', 'w')
for row in my_matrix:
    non_zeros = np.nonzero(my_matrix[row_number])[0]
    non_zeros = list(non_zeros)
    cosine_sim = []
    for item in non_zeros:
        if len(non_zeros) <= 1:
            break
        x = non_zeros[0]
        y = non_zeros[1]
        similarity = 1 - cosine(my_matrix[:, x], my_matrix[:, y])
        cosine_sim.append(similarity)
        non_zeros.pop(0)
    summing = np.sum(cosine_sim)
    mean = summing / len(cosine_sim)
    log = np.log(mean)
    out_file_value = log * -1
    out_file.write(str(row_number) + " " + str(out_file_value) + "\n")
    if row_number <= 19000:
        row_number += 1
    else:
        break
I know that there is a function to actually compute the cosine similarity even between columns (from sklearn.metrics.pairwise import cosine_similarity), so I tried it. However, the output is kind of the same, but at the same time really confusing to me, even though I read the documentation and the posts on this page referring to the issue.
For instance:
my_matrix =[[0. 0. 7. 0. 5.]
[0. 0. 11. 0. 0.]
[0. 2. 0. 0. 0.]
[0. 0. 2. 11. 5.]
[0. 0. 5. 0. 0.]]
transposed = np.transpose(my_matrix)
sim_matrix = cosine_similarity(transposed)
# resulting similarity matrix
sim_matrix =[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0.14177624 0.45112924]
[0. 0. 0.14177624 1. 0.70710678]
[0. 0. 0.45112924 0.70710678 1.]]
If I compute the cosine similarity with my code above, it returns 0.45112924 for the 1st row ([0]) and 0.14177624 and 0.70710678 for row 4 ([3]).
out_file.txt
0 0.796001425306
1 nan
2 nan
3 0.856981065776
4 nan
I greatly appreciate any help or suggestions to my question!
You can consider using scipy instead. However, it doesn't take sparse matrix input; you have to provide a numpy array.
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import cdist

X = np.random.randn(10000, 10000)
D = cdist(X, X.T, metric='cosine') # cosine distance matrix between 2 columns
Here is the speed that I got for 10000 x 10000 random array.
%timeit cdist(X, X.T, metric='cosine')
16.4 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try it on a small array:
X = np.array([[1,0,1], [0, 3, 2], [1,0,1]])
D = cdist(X, X.T, metric='cosine')
This will give
[[ 1.11022302e-16 1.00000000e+00 4.22649731e-01]
[ 6.07767730e-01 1.67949706e-01 9.41783727e-02]
[ 1.11022302e-16 1.00000000e+00 4.22649731e-01]]
For example D[0, 2] is the cosine distance between column 0 and 2
from numpy.linalg import norm
1 - np.dot(X[:, 0], X[:,2])/(norm(X[:, 0]) * norm(X[:,2])) # give 0.422649
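For reference, here is a sketch of a more direct vectorization of the question's own loop, reusing sklearn's cosine_similarity (already mentioned in the question) so the column-vs-column similarities are computed only once; it assumes my_matrix is a dense numpy array and keeps the consecutive-pair averaging that the original loop performs:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(my_matrix.T)             # similarity between every pair of columns
out = np.full(my_matrix.shape[0], np.nan)
for i, row in enumerate(my_matrix):
    nz = np.nonzero(row)[0]
    if len(nz) > 1:
        pair_sims = sim[nz[:-1], nz[1:]]         # consecutive nonzero-column pairs, as in the loop above
        out[i] = -np.log(pair_sims.mean())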
