I have the following function that calculates the Euclidean distance between all combinations of the vectors in matrix A and matrix B:
import numpy as np

def distance_matrix(A, B):
    n = A.shape[1]
    m = B.shape[1]
    C = np.zeros((n, m))
    # compare every column of A with every column of B
    for ai, a in enumerate(A.T):
        for bi, b in enumerate(B.T):
            C[ai, bi] = np.linalg.norm(a - b)
    return C
This works fine and creates an n*m matrix from a d*n matrix and a d*m matrix, containing the Euclidean distance between all combinations of the column vectors.
>>> print(A)
[[-1 -1 1 1 2]
[ 1 -1 2 -1 1]]
>>> print(B)
[[-2 -1 1 2]
[-1 2 1 -1]]
>>> print(distance_matrix(A,B))
[[2.23606798 1. 2. 3.60555128]
[1. 3. 2.82842712 3. ]
[4.24264069 2. 1. 3.16227766]
[3. 3.60555128 2. 1. ]
[4.47213595 3.16227766 1. 2. ]]
I spent some time looking for a numpy or scipy function to achieve this in a more efficient way. Is there such a function, or what would be the vectorized way to do this?
You can use:
np.linalg.norm(A[:,:,None]-B[:,None,:],axis=0)
or (totally equivalent, but without the built-in function)
((A[:,:,None]-B[:,None,:])**2).sum(axis=0)**0.5
We need a 5x4 final array so we extend our array this way:
A[:,:,None]               -> 2,5,1
                               ↑ ↓
B[:,None,:]               -> 2,1,4
A[:,:,None] - B[:,None,:] -> 2,5,4
and we apply our sum over the axis 0 to finally get a 5,4 ndarray.
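The shape bookkeeping above can be checked directly. The following is a minimal sketch, assuming the same A and B as in the question; it prints the intermediate shapes and compares the broadcast result with an explicit double loop.

```python
import numpy as np

A = np.array([[-1, -1, 1, 1, 2], [1, -1, 2, -1, 1]])   # d x n = 2 x 5
B = np.array([[-2, -1, 1, 2], [-1, 2, 1, -1]])          # d x m = 2 x 4

# (2, 5, 1) - (2, 1, 4) broadcasts to (2, 5, 4)
diff = A[:, :, None] - B[:, None, :]

# summing the squared differences over axis 0 (the d axis) leaves (5, 4)
dist = np.sqrt((diff ** 2).sum(axis=0))

# reference result from an explicit loop over column pairs
loop = np.array([[np.linalg.norm(a - b) for b in B.T] for a in A.T])

print(diff.shape, dist.shape)
print(np.allclose(dist, loop))
```

The intermediate array holds d*n*m values, so for large inputs memory use may matter more than the loop overhead.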
Yes, you can broadcast your vectors:
A = np.array([[-1, -1, 1, 1, 2], [ 1, -1, 2, -1, 1]])
B = np.array([[-2, -1, 1, 2], [-1, 2, 1, -1]])
C = np.linalg.norm(A.T[:, None, :] - B.T[None, :, :], axis=-1)
print(C)
array([[2.23606798, 1. , 2. , 3.60555128],
[1. , 3. , 2.82842712, 3. ],
[4.24264069, 2. , 1. , 3.16227766],
[3. , 3.60555128, 2. , 1. ],
[4.47213595, 3.16227766, 1. , 2. ]])
You can get an explanation of how it works here:
https://sparrow.dev/pairwise-distance-in-numpy/
I am following this example here: https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/
from numpy import array, mean, cov
from numpy.linalg import eig

A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
which gives:
original data
[[1 2]
[3 4]
[5 6]]
column mean
[ 3. 4.]
centered matrix
[[-2. -2.]
[ 0. 0.]
[ 2. 2.]]
covariance matrix
[[ 4. 4.]
[ 4. 4.]]
vectors
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
values
[ 8. 0.]
projected data
[[-2.82842712 0. ]
[ 0. 0. ]
[ 2.82842712 0. ]]
If I want to find the first principal direction, do I simply take the eigenvector that corresponds to the largest eigenvalue? That is: [0.70710678, 0.70710678]?
Building upon this, is the first principal component the data projected onto that eigenvector? Something like:
vectors[:,:1].T.dot(C.T)
which gives:
array([[-2.82842712, 0. , 2.82842712]])
I just fear I have the terminology confused, or I'm oversimplifying things. Thanks in advance!
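In the usual terminology, the first principal direction is the eigenvector that belongs to the largest eigenvalue, and the first principal component is the centered data projected onto that direction. The sketch below illustrates this on the same A; it swaps in np.linalg.eigh (intended for symmetric matrices) instead of eig, and the variable names first_direction and first_component are made up for illustration. Note that the sign of an eigenvector is arbitrary, so the result may come out negated.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
C = A - A.mean(axis=0)                 # center each column
V = np.cov(C.T)                        # 2 x 2 covariance matrix

# eigh returns the eigenvalues of a symmetric matrix in ascending order
values, vectors = np.linalg.eigh(V)
order = np.argsort(values)[::-1]       # reorder: largest eigenvalue first
values, vectors = values[order], vectors[:, order]

first_direction = vectors[:, 0]        # eigenvector of the largest eigenvalue
first_component = C @ first_direction  # data projected onto that direction

print(values)
print(first_direction)
print(first_component)
```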
I am trying to interpolate a 2D numpy matrix with the dimensions (5, 3) to a matrix with the dimensions (7, 3), i.e. along axis 0 (down each column). Obviously, the wrong approach would be to insert rows at arbitrary positions between the rows of the original matrix; see the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Target (terrible interpolation -> not wanted!):
[[0, 1, 1]
[0, 1.5, 0.5]
[0, 2, 0]
[0, 3, 1]
[0, 3.5, 0.5]
[0, 4, 0]
[0, 5, 1]]
The correct approach would be to take every row into account and interpolate between all of them to expand the source matrix to a (7, 3) matrix. I am aware of the scipy.interpolate.interp1d and scipy.interpolate.interp2d methods, but could not get them to work by following other Stack Overflow posts or websites. I would appreciate any tips or tricks.
Update #1: The expected values should be equally spaced.
Update #2:
What I want to do is basically use the separate columns of the original matrix, expand the length of the column to 7 and interpolate between the values of the original column. See the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Split into 3 separate Columns:
[0 [1 [1
0 2 0
0 3 1
0 4 0
0] 5] 1]
Expand length to 7 and interpolate between them, example for second column:
[1
1.66
2.33
3
3.66
4.33
5]
It seems like each column can be treated completely independently, but for each column you need to define essentially an "x" coordinate so that you can fit some function "f(x)" from which you generate your output matrix.
Unless the rows in your matrix are associated with some other data structure (e.g. a vector of timestamps), an obvious set of x values is just the row number:
x = numpy.arange(0, Source.shape[0])
You can then construct an interpolating function:
fit = scipy.interpolate.interp1d(x, Source, axis=0)
and use that to construct your output matrix:
Target = fit(numpy.linspace(0, Source.shape[0]-1, 7))
which produces:
array([[ 0. , 1. , 1. ],
[ 0. , 1.66666667, 0.33333333],
[ 0. , 2.33333333, 0.33333333],
[ 0. , 3. , 1. ],
[ 0. , 3.66666667, 0.33333333],
[ 0. , 4.33333333, 0.33333333],
[ 0. , 5. , 1. ]])
By default, scipy.interpolate.interp1d uses piecewise-linear interpolation. There are many more exotic options within scipy.interpolate, based on higher order polynomials, etc. Interpolation is a big topic in itself, and unless the rows of your matrix have some particular properties (e.g. being regular samples of a signal with a known frequency range), there may be no "truly correct" way of interpolating. So, to some extent, the choice of interpolation scheme will be somewhat arbitrary.
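For the piecewise-linear case specifically, the same result can be reproduced with NumPy alone. This is only a sketch of that alternative, assuming the row number is used as the x coordinate as above; np.interp handles 1-D data, so each column is interpolated separately.

```python
import numpy as np

source = np.array([[0, 1, 1],
                   [0, 2, 0],
                   [0, 3, 1],
                   [0, 4, 0],
                   [0, 5, 1]], dtype=float)

x = np.arange(source.shape[0])                   # original row positions 0..4
x_new = np.linspace(0, source.shape[0] - 1, 7)   # 7 equally spaced positions

# interpolate one column at a time and stack the results side by side
target = np.column_stack([np.interp(x_new, x, col) for col in source.T])

print(target.shape)
print(target)
```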
You can do this as follows:
from scipy.interpolate import interp1d
import numpy as np
a = np.array([[0, 1, 1],
[0, 2, 0],
[0, 3, 1],
[0, 4, 0],
[0, 5, 1]])
x = np.array(range(a.shape[0]))
# define new x range, we need 7 equally spaced values
xnew = np.linspace(x.min(), x.max(), 7)
# apply the interpolation to each column
f = interp1d(x, a, axis=0)
# get final result
print(f(xnew))
This will print
[[ 0. 1. 1. ]
[ 0. 1.66666667 0.33333333]
[ 0. 2.33333333 0.33333333]
[ 0. 3. 1. ]
[ 0. 3.66666667 0.33333333]
[ 0. 4.33333333 0.33333333]
[ 0. 5. 1. ]]
I am trying to compare several datasets and basically test whether they show the same feature, although this feature might be shifted, reversed or attenuated.
A very simple example below:
A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])
x = np.arange(0,len(A),1)
I thought the best way to do it would be to normalize these signals and get absolute values (their attenuation is not important for me at this stage, I am interested in the position... but I might be wrong, so I will welcome thoughts about this concept too) and calculate the area where they overlap. I am following up on this answer - the solution looked very elegant and simple, but I may be implementing it wrongly.
def normalize(sig):
    #ns = sig/max(np.abs(sig))
    ns = sig/sum(sig)
    return ns
a = normalize(A)
b = normalize(B)
c = normalize(C)
d = normalize(D)
which then look like this:
But then, when I try to implement the solution from the answer, I run into problems.
OLD
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        w1 = np.abs(w1)
        w2 = np.abs(w2)
        M[c1,c2] = integrate.trapz(min(np.abs(w2).any(),np.abs(w1).any()))
print(M)
Produces TypeError: 'numpy.bool_' object is not iterable or IndexError: list assignment index out of range. But I only included the .any() because without them, I was getting the ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
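The ValueError comes from Python's built-in min, which tries to compare two whole arrays at once. A possible way around it, sketched here under the assumption that the overlap is meant as the area under the element-wise minimum of the two curves, is np.minimum. The helpers normalize and overlap_area are illustrative names, and the trapezoidal rule is written out by hand for unit sample spacing.

```python
import numpy as np

def normalize(sig):
    # assumed normalization: absolute values scaled to sum to 1
    sig = np.abs(sig)
    return sig / sig.sum()

def overlap_area(w1, w2):
    # element-wise minimum of the two curves, integrated with the
    # trapezoidal rule for a sample spacing of 1
    y = np.minimum(w1, w2)
    return 0.5 * (y[:-1] + y[1:]).sum()

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])

a, b = normalize(A), normalize(B)
print(overlap_area(a, a))   # a signal compared with itself
print(overlap_area(a, b))   # a signal compared with its shifted copy
```

This measure is sensitive to shifts: a shifted copy of a signal overlaps less than the signal does with itself.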
EDIT - NEW
(thanks @Kody King)
The new code is now:
M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        crossCorrelation = np.correlate(w1,w2, 'full')
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1,c2] = similarity
        SH[c1,c2] = actualShift
M = M/M.max()
print(M, '\n', SH)
And the output:
[[ 1. 1. 0.95454545 0.63636364]
[ 1. 1. 0.95454545 0.63636364]
[ 0.95454545 0.95454545 0.95454545 0.63636364]
[ 0.63636364 0.63636364 0.63636364 0.54545455]]
[[ 0. -2. 1. 0.]
[ 2. 0. 3. 2.]
[-1. -3. 0. -1.]
[ 0. -2. 1. 0.]]
The matrix of shifts looks ok now, but the actual correlation matrix does not. I am really puzzled by the fact that the lowest correlation value is for correlating d with itself. What I would like to achieve now is that:
1. The correlation value should be highest for correlating a signal with itself (i.e. to have the highest values on the main diagonal).
2. The correlation values should lie in the range between 0 and 1, so that as a result I would have 1s on the main diagonal and other numbers (0.x) elsewhere.
I was hoping the M = M/M.max() would do the job, but only if condition no. 1 is fulfilled, which it currently isn't.
EDIT - UPDATE
Following on the advice, I used the recommended normalization formula (dividing the signal by its sum), but the problem wasn't solved, just reversed. Now the correlation of d with d is 1, but all the other signals don't correlate with themselves.
New output:
[[ 0.45833333 0.45833333 0.5 0.58333333]
[ 0.45833333 0.45833333 0.5 0.58333333]
[ 0.5 0.5 0.57142857 0.66666667]
[ 0.58333333 0.58333333 0.66666667 1. ]]
[[ 0. -2. 1. 0.]
[ 2. 0. 3. 2.]
[-1. -3. 0. -1.]
[ 0. -2. 1. 0.]]
As ssm said, numpy's correlate function works well for this problem. You mentioned that you are interested in the position. The correlate function can also help you tell how far one sequence is shifted from another.
import numpy as np
def compare(a, b):
    # 'full' pads the sequences with 0's so they are correlated
    # with as little as 1 actual element overlapping.
    crossCorrelation = np.correlate(a, b, 'full')
    bestShift = np.argmax(crossCorrelation)
    # This reverses the effect of the padding.
    actualShift = bestShift - len(b) + 1
    similarity = crossCorrelation[bestShift]
    print('Shift: ' + str(actualShift))
    print('Similarity: ' + str(similarity))
    return {'shift': actualShift, 'similarity': similarity}
print('\nExpected shift: 0')
compare([0,0,1,0,0], [0,0,1,0,0])
print('\nExpected shift: 2')
compare([0,0,1,0,0], [1,0,0,0,0])
print('\nExpected shift: -2')
compare([1,0,0,0,0], [0,0,1,0,0])
Edit:
You need to normalize each sequence before correlating them, or the larger sequences will have a very high correlation with all the other sequences.
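A property of cross-correlation is that its sum over all shifts equals the product of the sums of the two sequences: $\sum_k (a \star b)[k] = \left(\sum_i a_i\right)\left(\sum_j b_j\right)$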
So if you normalize by dividing each sequence by its sum, the similarity will always be between 0 and 1.
I recommend you don't take the absolute value of a sequence. That changes the shape, not just the scale. For instance np.abs([1, -2]) == [1, 2]. Normalizing will already ensure that sequence is mostly positive and adds up to 1.
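One property that can be checked numerically is that the 'full' cross-correlation, summed over all shifts, equals the product of the two sequence sums. The sketch below uses two small made-up non-negative sequences (not taken from the question) to show why dividing by the sum keeps every correlation value between 0 and 1 in that case.

```python
import numpy as np

# illustrative non-negative sequences (assumed data, not from the question)
a = np.array([0., 1., 2., 1., 0.])
b = np.array([0., 0., 1., 3., 1.])

cc = np.correlate(a, b, 'full')
# the sum over all shifts equals the product of the sums
print(np.isclose(cc.sum(), a.sum() * b.sum()))

# after dividing each sequence by its sum, the correlation values are
# non-negative and add up to 1, so none of them can exceed 1
a_n, b_n = a / a.sum(), b / b.sum()
cc_n = np.correlate(a_n, b_n, 'full')
print(cc_n.min() >= 0, cc_n.max() <= 1)
```

The bound relies on the sequences being non-negative; with mixed signs the individual values are no longer guaranteed to stay in that range.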
Second Edit:
I had a realization. Think of the signals as vectors. Normalized vectors always have a max dot product with themselves. Cross-Correlation is just a dot product calculated at various shifts. If you normalize the signals like you would a vector (divide s by sqrt(s dot s)), the self correlations will always be maximal and 1.
import numpy as np
def normalize(s):
    magSquared = np.correlate(s, s) # s dot itself
    return s / np.sqrt(magSquared)
a = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
b = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
c = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
d = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])
a = normalize(a)
b = normalize(b)
c = normalize(c)
d = normalize(d)
M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        # Taking the absolute value catches signals which are flipped.
        crossCorrelation = np.abs(np.correlate(w1, w2, 'full'))
        bestShift = np.argmax(crossCorrelation)
        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]
        M[c1,c2] = similarity
        SH[c1,c2] = actualShift
print(M, '\n', SH)
Outputs:
[[ 1. 1. 0.97700842 0.86164044]
[ 1. 1. 0.97700842 0.86164044]
[ 0.97700842 0.97700842 1. 0.8819171 ]
[ 0.86164044 0.86164044 0.8819171 1. ]]
[[ 0. -2. 1. 0.]
[ 2. 0. 3. 2.]
[-1. -3. 0. -1.]
[ 0. -2. 1. 0.]]
You want to use a cross-correlation between the vectors:
https://en.wikipedia.org/wiki/Cross-correlation
https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
For example:
>>> np.correlate(A,B)
array([ 31.])
>>> np.correlate(A,C)
array([ 19.])
>>> np.correlate(A,D)
array([-28.])
If you don't care about the sign, you can simply take the absolute value ...
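It may help to see what these single numbers are. With its default mode ('valid') and two inputs of equal length, np.correlate returns exactly one value, the zero-shift dot product. A short sketch using the arrays from the question:

```python
import numpy as np

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])

for name, other in (("B", B), ("C", C), ("D", D)):
    zero_lag = np.correlate(A, other)   # default mode='valid': one value
    print(name, zero_lag, np.dot(A, other))
```

Because only the zero shift is evaluated, a shifted copy of a signal scores lower here than it would with mode 'full', which scans all shifts.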
I have data in a file in the following form:
user_id, item_id, rating
1, abc,5
1, abcd,3
2, abc, 3
2, fgh, 5
So, the matrix I want to form for the above data is the following:
# item_ids
# abc abcd fgh
[[5, 3, 0] # user_id 1
[3, 0, 5]] # user_id 2
where missing data is replaced by 0.
From this, I want to create both a user-to-user similarity matrix and an item-to-item similarity matrix.
How do I do that?
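As a starting point, the rating matrix itself can be built from the triples. This is a minimal sketch, assuming the file content is available as a string in the format shown above; the names raw, user_index and item_index are illustrative only.

```python
import numpy as np

# the sample data from above, as it might be read from the file
raw = """user_id, item_id, rating
1, abc,5
1, abcd,3
2, abc, 3
2, fgh, 5"""

# skip the header, split on commas and strip stray whitespace
rows = [[field.strip() for field in line.split(",")]
        for line in raw.splitlines()[1:]]

users = sorted({u for u, _, _ in rows})
items = sorted({i for _, i, _ in rows})
user_index = {u: k for k, u in enumerate(users)}
item_index = {i: k for k, i in enumerate(items)}

# missing ratings stay at 0
ratings = np.zeros((len(users), len(items)))
for u, i, r in rows:
    ratings[user_index[u], item_index[i]] = float(r)

print(items)
print(ratings)
```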
Technically, this is not a programming problem but a math problem. But I think you are better off using a variance-covariance matrix, or a correlation matrix if the scales of the values are very different, say, if instead of having:
>>> x
array([[5, 3, 0],
[3, 0, 5],
[5, 5, 0],
[1, 1, 7]])
You have:
>>> x
array([[5, 300, 0],
[3, 0, 5],
[5, 500, 0],
[1, 100, 7]])
To get a variance-cov matrix:
>>> np.cov(x)
array([[ 6.33333333, -3.16666667, 6.66666667, -8. ],
[ -3.16666667, 6.33333333, -5.83333333, 7. ],
[ 6.66666667, -5.83333333, 8.33333333, -10. ],
[ -8. , 7. , -10. , 12. ]])
Or the correlation matrix:
>>> np.corrcoef(x)
array([[ 1. , -0.5 , 0.91766294, -0.91766294],
[-0.5 , 1. , -0.80295507, 0.80295507],
[ 0.91766294, -0.80295507, 1. , -1. ],
[-0.91766294, 0.80295507, -1. , 1. ]])
This is the way to look at it: a diagonal cell, e.g. the (0,0) cell, is the correlation of the 1st vector in x with itself, so it is 1. An off-diagonal cell, e.g. the (0,1) cell, is the correlation between the 1st and 2nd vectors in x; they are negatively correlated. Similarly, the 1st and 3rd vectors are positively correlated.
A covariance matrix or correlation matrix avoids the zero problem pointed out by @Akavall.
See this question: What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
Having:
import numpy as np
from sklearn.metrics import pairwise_distances

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])
dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out
Results in:
array([[ 1. , 0.40824829, 0.40824829],
[ 0.40824829, 1. , 0.33333333],
[ 0.40824829, 0.33333333, 1. ]])
But that works for a dense matrix. For a sparse one you have to develop your own solution.
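For reference, the same cosine similarities can be computed with NumPy alone. The helper cosine_similarity_matrix below is a hypothetical name for a sketch that handles dense arrays only. If rows are users and columns are items, applying it to the matrix gives user-to-user similarity and applying it to the transpose gives item-to-item similarity.

```python
import numpy as np

def cosine_similarity_matrix(X):
    # hypothetical helper: cosine similarity between all pairs of rows of X
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = X / norms                 # rows scaled to unit length
    return unit @ unit.T

A = np.array(
    [[0, 1, 0, 0, 1],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 1, 0]])

row_sim = cosine_similarity_matrix(A)     # 3 x 3, similarity between rows
col_sim = cosine_similarity_matrix(A.T)   # 5 x 5, similarity between columns

print(row_sim)
print(col_sim.shape)
```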