Scipy Sparse Matrix special subtraction - python

I'm doing a project and I'm doing a lot of matrix computation in it.
I'm looking for a smart way to speed up my code. In my project, I'm dealing with a sparse matrix of size 100Mx1M with around 10M non-zero values. The example below just illustrates my point.
Let's say I have:
A vector v of size (2)
A vector c of size (3)
A sparse matrix X of size (2,3)
import numpy as np
import scipy.sparse
from scipy.sparse import coo_matrix

v = np.asarray([10, 20])
c = np.asarray([2, 3, 4])
data = np.array([1, 1, 1, 1])
row = np.array([0, 0, 1, 1])
col = np.array([1, 2, 0, 2])
X = coo_matrix((data, (row, col)), shape=(2, 3))
X.todense()
# matrix([[0, 1, 1],
#         [1, 0, 1]])
Currently I'm doing:
result = np.zeros_like(v)
d = scipy.sparse.lil_matrix((v.shape[0], v.shape[0]))
d.setdiag(v)
tmp = d * X
print(tmp.todense())
# matrix([[  0.,  10.,  10.],
#         [ 20.,   0.,  20.]])
# At this point tmp is a CSR sparse matrix
for i in range(tmp.shape[0]):
    x_i = tmp.getrow(i)
    # I only want to do the subtraction on non-zero elements
    result += x_i.data * (c[x_i.indices] - x_i.data)
print(result)
# array([-430, -380])
My problem is the for loop, and especially the subtraction.
I would like to vectorize this operation by subtracting only on the non-zero elements, and get the result of the subtraction directly as a sparse matrix:
matrix([[  0.,  -7.,  -6.],
        [-18.,   0., -16.]])
Is there a way to do this smartly?

You don't need to loop over the rows to do what you are already doing. And you can use a similar trick to perform the multiplication of the rows by the first vector:
import numpy as np
import scipy.sparse as sps

# This assumes X is in CSR format; the question's X is COO, so convert first:
# X = X.tocsr()

# number of nonzero entries per row of X
nnz_per_row = np.diff(X.indptr)

# multiply every row by the corresponding entry of v
# You could do this in-place as:
#     X.data *= np.repeat(v, nnz_per_row)
Y = sps.csr_matrix((X.data * np.repeat(v, nnz_per_row), X.indices, X.indptr),
                   shape=X.shape)

# subtract from the non-zero entries the corresponding column value in c...
Y.data -= np.take(c, Y.indices)
# ...and multiply by -1 to get the value you are after
Y.data *= -1
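The same row scaling can also be written with a sparse diagonal matrix, which sidesteps the indptr bookkeeping; a minimal sketch, assuming a SciPy version that provides sps.diags and reusing v, c and a CSR X from above:
import numpy as np
import scipy.sparse as sps

# scale row i of X by v[i] via a sparse diagonal matrix
Y = sps.diags(v).dot(X).tocsr()
# overwrite the stored entries with c[j] - v[i]*x_ij, as above
Y.data = np.take(c, Y.indices) - Y.data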
To see that it works, set up some dummy data
rows, cols = 3, 5
v = np.random.rand(rows)
c = np.random.rand(cols)
X = sps.rand(rows, cols, density=0.5, format='csr')
and, after running the code above, compare with an equivalent dense computation:
>>> x = X.toarray()
>>> mask = x == 0
>>> x *= v[:, np.newaxis]
>>> x = c - x
>>> x[mask] = 0
>>> x
array([[ 0.79935123,  0.        ,  0.        , -0.0097763 ,  0.59901243],
       [ 0.7522559 ,  0.        ,  0.67510109,  0.        ,  0.36240006],
       [ 0.        ,  0.        ,  0.72370725,  0.        ,  0.        ]])
>>> Y.toarray()
array([[ 0.79935123,  0.        ,  0.        , -0.0097763 ,  0.59901243],
       [ 0.7522559 ,  0.        ,  0.67510109,  0.        ,  0.36240006],
       [ 0.        ,  0.        ,  0.72370725,  0.        ,  0.        ]])
The way you are accumulating your result requires that every row have the same number of non-zero entries, which seems a pretty weird thing to do. Are you sure that is what you are after? If that's really what you want you could get that value with something like:
result = np.sum(Y.data.reshape(Y.shape[0], -1), axis=0)
but I have trouble believing that is really what you are after...
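For the record, a sketch that reproduces the question's exact array([-430, -380]) (note the extra factor of the scaled data, which the snippet above omits); it assumes, as above, that every row has the same number of non-zeros, and reuses nnz_per_row and the non-in-place Y from earlier:
scaled = X.data * np.repeat(v, nnz_per_row)   # v_i * x_ij at each stored entry
per_entry = scaled * Y.data                   # v_i*x_ij * (c_j - v_i*x_ij)
result = per_entry.reshape(X.shape[0], -1).sum(axis=0)
# array([-430, -380]) for the question's data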

Related

How to keep a matrix unchanged

I am trying to calculate the inverse matrix using the Gauss-Jordan Method. For that, I need to find the solution X to A.X = I (A and X being N x N matrices, and I the identity matrix).
However, for every column vector of the solution matrix X I calculate in the first loop, I have to use the original matrix A, but I don't know why A keeps changing even though I made a copy of it at the beginning.
import numpy as np

def SolveGaussJordanInvMatrix(A):
    N = len(A[:, 0])
    I = np.identity(N)
    X = np.zeros([N, N], float)
    A_orig = A.copy()
    for m in range(N):
        x = np.zeros(N, float)
        v = I[:, m]
        A = A_orig
        for p in range(N):  # Gauss-Jordan elimination
            A[p, :] /= A[p, p]
            v[p] /= A[p, p]
            for i in range(p):  # Cancel elements above the diagonal element
                v[i] -= v[p] * A[i, p]
                A[i, p:] -= A[p, p:] * A[i, p]
            for i in range(p + 1, N):  # Cancel elements below the diagonal element
                v[i] -= v[p] * A[i, p]
                A[i, p:] -= A[p, p:] * A[i, p]
        X[:, m] = v  # Add column vector to the solution matrix
    return X

A = np.array([[2,  1,  4,  1],
              [3,  4, -1, -1],
              [1, -4,  7,  5],
              [2, -2,  1,  3]], float)
SolveGaussJordanInvMatrix(A)
Does anyone know how to turn A back to its original form after the Gauss-Jordan elimination loop?
I'm getting
array([[ 228.1,    0. ,    0. ,    0. ],
       [-219.9,    1. ,    0. ,    0. ],
       [ -14.5,    0. ,    1. ,    0. ],
       [-176.3,    0. ,    0. ,    1. ]])
and expect
[[ 1.36842105 -0.89473684 -1.05263158  1.        ]
 [-1.42105263  1.23684211  1.13157895 -1.        ]
 [ 0.42105263 -0.23684211 -0.13157895 -0.        ]
 [-2.          1.5         1.5        -1.        ]]
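For reference, a minimal sketch (my illustration, not from the original question) of the NumPy semantics at play: plain assignment only binds a second name to the same array, so A = A_orig inside the loop does not restore A, and v = I[:, m] is a view that writes back into I; only .copy() gives independent data:
import numpy as np

A_orig = np.arange(4.0)
A = A_orig           # binds a second name to the SAME array
A[0] = 99.0
print(A_orig[0])     # 99.0 -- A_orig changed too
A = A_orig.copy()    # an independent copy
A[0] = -1.0
print(A_orig[0])     # still 99.0 -- A_orig is no longer affected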

Efficiently index 2d numpy array using two 1d arrays

I have a large 2d numpy array and two 1d arrays that represent x/y indexes within the 2d array. I want to use these 1d arrays to perform an operation on the 2d array.
I can do this with a for loop, but it's very slow when working on a large array. Is there a faster way? I tried using the 1d arrays simply as indexes but that didn't work. See this example:
import numpy as np

# Two example 2d arrays
cnt_a = np.zeros((4, 4))
cnt_b = np.zeros((4, 4))

# 1d arrays holding x and y indices
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]

# This method works, but is very slow for a large array
for i in range(0, len(xpos)):
    cnt_a[xpos[i], ypos[i]] = cnt_a[xpos[i], ypos[i]] + 1

# This method is fast, but gives an incorrect answer
cnt_b[xpos, ypos] = cnt_b[xpos, ypos] + 1

# Print the results
print('Good:')
print(cnt_a)
print('')
print('Bad:')
print(cnt_b)
The output from this is:
Good:
[[ 2.  1.  2.  1.]
 [ 0.  3.  1.  2.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
Bad:
[[ 1.  1.  1.  1.]
 [ 0.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
For the cnt_b array numpy is obviously not summing correctly, but I'm unsure how to fix this without resorting to the (v. inefficient) for loop used to calculate cnt_a.
Another approach, using 1D indexing (suggested by @Shai), extended to answer the actual question:
>>> out = np.zeros((4, 4))
>>> idx = np.ravel_multi_index((xpos, ypos), out.shape) # extract 1D indexes
>>> x = np.bincount(idx, minlength=out.size)
>>> out.flat += x
np.bincount counts how many times each flat index appears in (xpos, ypos) and stores the counts in x.
Or, as suggested by @Divakar, in one line:
>>> out.flat += np.bincount(idx, minlength=out.size)
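As a quick sanity check (assuming the question's xpos, ypos and the loop-built cnt_a are in scope):
>>> np.array_equal(out, cnt_a)
True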
We could compute the linear indices, then accumulate into a zeros-initialized output array with np.add.at. Thus, with xpos and ypos as arrays, here's one implementation -
m,n = xpos.max()+1, ypos.max()+1
out = np.zeros((m,n),dtype=int)
np.add.at(out.ravel(), xpos*n+ypos, 1)
Sample run -
In [95]: # 1d arrays holding x and y indices
    ...: xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
    ...: ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
    ...:
In [96]: cnt_a = np.zeros((4,4))
In [97]: # This method works, but is very slow for a large array
    ...: for i in range(0,len(xpos)):
    ...:     cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
    ...:
In [98]: m,n = xpos.max()+1, ypos.max()+1
    ...: out = np.zeros((m,n),dtype=int)
    ...: np.add.at(out.ravel(), xpos*n+ypos, 1)
    ...:
In [99]: cnt_a
Out[99]:
array([[ 2.,  1.,  2.,  1.],
       [ 0.,  3.,  1.,  2.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  0.]])
In [100]: out
Out[100]:
array([[2, 1, 2, 1],
       [0, 3, 1, 2],
       [1, 1, 1, 1],
       [1, 0, 0, 0]])
You can iterate over both lists at once and increment for each pair (if you are not used to it, zip combines lists):
for x, y in zip(xpos, ypos):
    cnt_b[x][y] += 1
But this will be about the same speed as your cnt_a solution.
If your lists xpos/ypos are of length n, I don't see how you can update your matrix in less than O(n), since you'll have to check each pair one way or another.
Other solution: you could count the repeated index pairs (e.g. (0, 3), etc.), possibly with collections.Counter, and update the matrix once per distinct pair with the count value, as sketched below. But I doubt it would be much faster, since the time gained updating the matrix would be lost counting the multiple occurrences.
Maybe I am totally wrong though, in which case I'd be curious too to see a sub-O(n) answer.
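For what it's worth, a sketch of that Counter idea (same inputs as the question; not benchmarked):
from collections import Counter

# count duplicate (x, y) pairs once, then apply each count in a single update
for (x, y), n in Counter(zip(xpos, ypos)).items():
    cnt_b[x][y] += n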
I think you are looking for the np.ravel_multi_index function:
lidx = np.ravel_multi_index((xpos, ypos), cnt_a.shape)
which converts the (x, y) pairs to "flattened" 1D indices into cnt_a and cnt_b. Then accumulate on the flattened array:
np.add.at(cnt_b.ravel(), lidx, 1)

Testing similarity of several datasets by producing a cross-correlation matrix

I am trying to compare several datasets and basically test whether they show the same feature, although this feature might be shifted, reversed or attenuated.
A very simple example below:
import numpy as np

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])
x = np.arange(0, len(A), 1)
I thought the best way to do it would be to normalize these signals and get absolute values (their attenuation is not important for me at this stage, I am interested in the position... but I might be wrong, so I will welcome thoughts about this concept too) and calculate the area where they overlap. I am following up on this answer - the solution looked very elegant and simple, but I may be implementing it wrongly.
def normalize(sig):
    # ns = sig / max(np.abs(sig))
    ns = sig / sum(sig)
    return ns

a = normalize(A)
b = normalize(B)
c = normalize(C)
d = normalize(D)
which then look like this (plot of the normalized signals omitted):
But then, when I try to implement the solution from the answer, I run into problems.
OLD
for c1, w1 in enumerate([a, b, c, d]):
    for c2, w2 in enumerate([a, b, c, d]):
        w1 = np.abs(w1)
        w2 = np.abs(w2)
        M[c1, c2] = integrate.trapz(min(np.abs(w2).any(), np.abs(w1).any()))
print(M)
This produces TypeError: 'numpy.bool_' object is not iterable or IndexError: list assignment index out of range. I only included the .any() because without it I was getting ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
EDIT - NEW
(thanks @Kody King)
The new code is now:
M = np.zeros([4, 4])
SH = np.zeros([4, 4])
for c1, w1 in enumerate([a, b, c, d]):
    for c2, w2 in enumerate([a, b, c, d]):
        crossCorrelation = np.correlate(w1, w2, 'full')
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1, c2] = similarity
        SH[c1, c2] = actualShift
M = M / M.max()
print(M, '\n', SH)
And the output:
And the output:
[[ 1.          1.          0.95454545  0.63636364]
 [ 1.          1.          0.95454545  0.63636364]
 [ 0.95454545  0.95454545  0.95454545  0.63636364]
 [ 0.63636364  0.63636364  0.63636364  0.54545455]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
The matrix of shifts looks OK now, but the actual correlation matrix does not. I am really puzzled by the fact that the lowest correlation value is for correlating d with itself. What I would like to achieve now is that:
1. The correlation value is highest when correlating a signal with itself (i.e. the largest values sit on the main diagonal).
2. The correlation values are in the range between 0 and 1, so the main diagonal is all 1s and the other entries are 0.x.
I was hoping M = M/M.max() would do the job, but only if condition no. 1 is fulfilled, which it currently isn't.
EDIT - UPDATE
Following the advice, I used the recommended normalization formula (dividing each signal by its sum), but the problem wasn't solved, just reversed. Now the correlation of d with itself is 1, but none of the other signals correlate maximally with themselves.
New output:
[[ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.5         0.5         0.57142857  0.66666667]
 [ 0.58333333  0.58333333  0.66666667  1.        ]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
As ssm said, numpy's correlate function works well for this problem. You mentioned that you are interested in the position. The correlate function can also tell you how far one sequence is shifted from another.
import numpy as np

def compare(a, b):
    # 'full' pads the sequences with 0's so they are correlated
    # with as little as 1 actual element overlapping.
    crossCorrelation = np.correlate(a, b, 'full')
    bestShift = np.argmax(crossCorrelation)
    # This reverses the effect of the padding.
    actualShift = bestShift - len(b) + 1
    similarity = crossCorrelation[bestShift]
    print('Shift: ' + str(actualShift))
    print('Similarity: ' + str(similarity))
    return {'shift': actualShift, 'similarity': similarity}

print('\nExpected shift: 0')
compare([0,0,1,0,0], [0,0,1,0,0])
print('\nExpected shift: 2')
compare([0,0,1,0,0], [1,0,0,0,0])
print('\nExpected shift: -2')
compare([1,0,0,0,0], [0,0,1,0,0])
Edit:
You need to normalize each sequence before correlating them, or the larger sequences will have a very high correlation with all the other sequences.
A property of cross-correlation is that, for non-negative sequences,
(f ⋆ g)[n] = Σ_m f[m] · g[m + n] ≤ (Σ_m f[m]) · (Σ_m g[m]).
So if you normalize by dividing each sequence by its sum, the similarity will always be between 0 and 1.
I recommend you don't take the absolute value of a sequence. That changes the shape, not just the scale. For instance np.abs([1, -2]) == [1, 2]. Normalizing will already ensure that the sequence is mostly positive and adds up to 1.
Second Edit:
I had a realization. Think of the signals as vectors. Normalized vectors always have a maximal dot product with themselves. Cross-correlation is just a dot product calculated at various shifts. If you normalize the signals like you would a vector (divide s by sqrt(s dot s)), the self-correlations will always be maximal and equal to 1.
import numpy as np

def normalize(s):
    magSquared = np.correlate(s, s)  # s dot itself
    return s / np.sqrt(magSquared)

a = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
b = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
c = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
d = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])

a = normalize(a)
b = normalize(b)
c = normalize(c)
d = normalize(d)

M = np.zeros([4, 4])
SH = np.zeros([4, 4])
for c1, w1 in enumerate([a, b, c, d]):
    for c2, w2 in enumerate([a, b, c, d]):
        # Taking the absolute value catches signals which are flipped.
        crossCorrelation = np.abs(np.correlate(w1, w2, 'full'))
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1, c2] = similarity
        SH[c1, c2] = actualShift
print(M, '\n', SH)
Outputs:
[[ 1.          1.          0.97700842  0.86164044]
 [ 1.          1.          0.97700842  0.86164044]
 [ 0.97700842  0.97700842  1.          0.8819171 ]
 [ 0.86164044  0.86164044  0.8819171   1.        ]]
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]
You want to use a cross-correlation between the vectors:
https://en.wikipedia.org/wiki/Cross-correlation
https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
For example:
>>> np.correlate(A,B)
array([ 31.])
>>> np.correlate(A,C)
array([ 19.])
>>> np.correlate(A,D)
array([-28.])
If you don't care about the sign, you can simply take the absolute value ...
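For example, with the question's D (continuing the session above, where np.correlate(A, D) gave -28):
>>> np.abs(np.correlate(A, D))
array([ 28.])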

Calculate the triangular matrix of distances between NumPy array of coordinates

I have a NumPy array of coordinates. For example purposes, I will use this:
In [1]: np.random.seed(123)
In [2]: coor = np.random.randint(10, size=12).reshape(-1,3)
In [3]: coor
Out[3]:
array([[2, 2, 6],
       [1, 3, 9],
       [6, 1, 0],
       [1, 9, 0]])
I want the triangular matrix of distances between all coordinates. A simple approach would be to code a double loop over all coordinates
In [4]: n_coor = len(coor)
In [5]: dist = np.zeros((n_coor, n_coor))
In [6]: for j in xrange(n_coor):
   ...:     for k in xrange(j+1, n_coor):
   ...:         dist[j, k] = np.sqrt(np.sum((coor[j] - coor[k]) ** 2))
with the result being an upper triangular matrix of the distances
In [7]: dist
Out[7]:
array([[ 0.        ,  3.31662479,  7.28010989,  9.2736185 ],
       [ 0.        ,  0.        , 10.48808848, 10.81665383],
       [ 0.        ,  0.        ,  0.        ,  9.43398113],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])
Leveraging NumPy, I can avoid looping using
In [8]: dist = np.sqrt(((coor[:, None, :] - coor) ** 2).sum(-1))
but the result is the entire matrix
In [9]: dist
Out[9]:
array([[ 0.        ,  3.31662479,  7.28010989,  9.2736185 ],
       [ 3.31662479,  0.        , 10.48808848, 10.81665383],
       [ 7.28010989, 10.48808848,  0.        ,  9.43398113],
       [ 9.2736185 , 10.81665383,  9.43398113,  0.        ]])
This one-line version takes roughly half the time when I use 2048 coordinates (4 s instead of 10 s), but it does twice as many calculations as it needs, since the matrix is symmetric. Is there a way to adjust the one-line command to only get the triangular matrix (and the additional 2x speedup, i.e. 2 s)?
We can use SciPy's pdist function to get those distances. So, we just need to initialize the output array and then set the upper-triangular values with those distances:
from scipy.spatial.distance import pdist
n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
row,col = np.triu_indices(n_coor,1)
dist[row,col] = pdist(coor)
Alternatively, we can use boolean-indexing to assign values, replacing the last two lines
dist[np.arange(n_coor)[:,None] < np.arange(n_coor)] = pdist(coor)
Runtime test
Functions:
def subscripted_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    row, col = np.triu_indices(n_coor, 1)
    dist[row, col] = pdist(coor)
    return dist

def boolean_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    r = np.arange(n_coor)
    dist[r[:, None] < r] = pdist(coor)
    return dist
Timings:
In [110]: # Setup input array
...: coor = np.random.randint(0,10, (2048,3))
In [111]: %timeit subscripted_indexing(coor)
10 loops, best of 3: 91.4 ms per loop
In [112]: %timeit boolean_indexing(coor)
10 loops, best of 3: 47.8 ms per loop
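As an aside, if the full symmetric matrix (what the question's one-liner produces) is acceptable, SciPy's squareform builds it directly from the condensed pdist output:
from scipy.spatial.distance import pdist, squareform

dist_full = squareform(pdist(coor))  # full symmetric (n, n) distance matrix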

Elementwise division, disregarding zeros

It's a Python question: let's say I have an (m+1)-dimensional NumPy array a consisting of non-negative numbers, and I would like to obtain an array b of the same size where the entries along the last axis are normalized so that they sum up to 1, or are all zero in case all of them were zeros. For example, if m = 2, my code would be as follows:
import numpy as np

a = np.array([[[ 0.34,  0.66],
               [ 0.75,  0.25]],
              [[ 0.  ,  0.  ],
               [ 1.  ,  0.  ]]])

for i1 in range(len(a)):
    for i2 in range(len(a[i1])):
        s = a[i1][i2].sum()
        if s > 0:
            a[i1][i2] = a[i1][i2] / s
However, I find this method sloppy. Also, it only works for a fixed m.
This can be done by broadcasting. There are several ways to take into account the zero-sum exception. Without taking it into account, you could write
import numpy as np
shape = (2, 3, 4)
X = np.random.randn(*shape) ** 2
sums = X.sum(-1)
Y = X / sums[..., np.newaxis]
Now, in order to take into account potential zero-sum-ness of some lines, we set one line of the data to 0:
X[0, 0, :] = 0
sums = X.sum(-1)
nnz = sums != 0
Y = np.zeros_like(X)
Y[nnz, :] = X[nnz, :] / sums[nnz, np.newaxis]
You will observe that Y.sum(axis=-1) has the entry 0 in coordinate (0,0) reflecting the zero-ness of the corresponding line.
EDIT: Application to the concrete example
X = np.array([[[ 0.34,  0.66],
               [ 0.75,  0.25]],
              [[ 0.  ,  0.  ],
               [ 1.  ,  0.  ]]])
sums = X.sum(-1)
nnz = sums != 0
Y = np.zeros_like(X)
Y[nnz, :] = X[nnz, :] / sums[nnz, np.newaxis]
yields Y == X (because along the last axis each sum is already one or zero).
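A variant of the same idea that handles any m without building the boolean mask explicitly, using np.divide's where argument (a sketch; out/where support for ufuncs assumes a reasonably recent NumPy):
import numpy as np

s = X.sum(-1, keepdims=True)
# divide only where the sum along the last axis is nonzero;
# entries under a zero sum stay 0 in the preallocated output
Y = np.divide(X, s, out=np.zeros_like(X), where=(s != 0))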
