Scipy library similarity score calculations - python

I'm trying to compute similarity scores using vectors:
from scipy.spatial import distance
x = [1,2,4]
y = [1,3,5]
d = distance.cdist(x, y, 'seuclidean', V=None)
However, I keep getting this error:
ValueError: XA must be a 2-dimensional array.

First off, you need to pass NumPy arrays as inputs, and the error is telling you they must be 2-D (column vectors in this case). So:
from scipy.spatial import distance
import numpy as np
x = [1,2,4]
y = [1,3,5]
x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)
x
array([[1],
       [2],
       [4]])
y
array([[1],
       [3],
       [5]])
d = distance.cdist(x, y, 'seuclidean', V=None)
d
array([[ 0.        ,  1.22474487,  2.44948974],
       [ 0.61237244,  0.61237244,  1.83711731],
       [ 1.83711731,  0.61237244,  0.61237244]])
There are many methods in the distance module that calculate a similarity (distance) between two vectors. A common example is the cosine distance.
x = [1,2,4]
y = [1,3,5]
distance.cosine(x, y)
0.0040899966895213691
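Note that scipy's cosine is a distance, not a similarity; if you want the similarity itself, subtract the distance from 1. A minimal sketch:
from scipy.spatial import distance

x = [1, 2, 4]
y = [1, 3, 5]
print(1 - distance.cosine(x, y))  # ~0.99591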

Related

return Cosine Similarity not as single value

How can I write a pure NumPy function that returns an array with the cosine similarities of all pairwise comparisons of the rows of the two input arrays?
I don't want to return a single value.
dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
def cosine_similarity(list1, list2):
    # How to?
    pass
print(cosine_similarity(dataSet1, dataSet2))
You can use scipy for this as stated in this answer.
from scipy import spatial
dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
result = 1 - spatial.distance.cosine(dataSet1, dataSet2)
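If you want the pure-NumPy pairwise version the question asks for, one option (a sketch; cosine_similarity_matrix is a hypothetical helper) is to normalize the rows of each input and take a dot product:
import numpy as np

def cosine_similarity_matrix(A, B):
    # Pairwise cosine similarities between the rows of A and the rows of B
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

print(cosine_similarity_matrix([5, 6, 7, 2], [2, 3, 1, 15]))
# [[0.3938...]]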
You can also use the cosine_similarity function from sklearn.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer  # if the documents are text
from sklearn.metrics.pairwise import cosine_similarity

def cos(docs):
    # Pairwise cosine similarities between a list of text documents
    if len(docs) == 1:
        return []
    count_vectorizer = CountVectorizer()
    # replace missing documents with a placeholder token
    docs = ['missing' if x is np.nan else x for x in docs]
    count_vec = count_vectorizer.fit_transform(docs)
    cosine_sim_matrix = cosine_similarity(count_vec)
    return cosine_sim_matrix
What you are searching for is cosine_similarity from the sklearn library.
Here is a simple example:
Let's say we have x, which contains three 5-dimensional vectors, and y, which contains a single 5-dimensional vector. We can compute cosine similarity as follows:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
x = np.random.rand(3,5)
y = np.random.rand(1,5)
# >>> x
# array([[0.21668023, 0.05705532, 0.6391782 , 0.97990692, 0.90601101],
#        [0.82725409, 0.30221347, 0.98101159, 0.13982621, 0.88490538],
#        [0.09895812, 0.19948788, 0.12710054, 0.61409403, 0.56001643]])
# >>> y
# array([[0.70531146, 0.10222257, 0.6027328 , 0.87662291, 0.27053804]])
cosine_similarity(x, y)
The output is then the cosine similarity of each of the 3 vectors in x with the 1 vector in y, so it has 3x1 values:
array([[0.84139047],
       [0.75146312],
       [0.75255157]])

Efficiently generating a Cauchy matrix from two Numpy arrays

A Cauchy matrix (Wikipedia article) is a matrix determined by two vectors (arrays of numbers). Given two vectors x and y, the Cauchy matrix C generated by them is defined entry-wise as
C[i][j] := 1/(x[i] - y[j])
Given two Numpy arrays x and y, what is an efficient way to generate a Cauchy matrix?
This is the most efficient way I found, using array broadcasting to take advantage of vectorization.
1.0 / (x.reshape((-1,1)) - y)
Edit: @HYRY and @shx2 have suggested that, instead of x.reshape((-1,1)), which makes a copy, you can use x[:,np.newaxis], which returns a view of the same array. @HYRY also suggests 1.0/np.subtract.outer(x,y), which is slightly slower for me but maybe more explicit.
Example:
>>> x = numpy.array([1,2,3,4]) #x
>>> y = numpy.array([5,6,7]) #y
>>>
>>> #transpose x, to nx1
... x = x.reshape((-1,1))
>>> x
array([[1],
       [2],
       [3],
       [4]])
>>>
>>> #array of differences x[i] - y[j]
... #an nx1 array minus a 1xm array is an nxm array
... diff_matrix = x - y
>>> diff_matrix
array([[-4, -5, -6],
       [-3, -4, -5],
       [-2, -3, -4],
       [-1, -2, -3]])
>>>
>>> #apply the multiplicative inverse to each entry
... cauchym = 1.0/diff_matrix
>>> cauchym
array([[-0.25      , -0.2       , -0.16666667],
       [-0.33333333, -0.25      , -0.2       ],
       [-0.5       , -0.33333333, -0.25      ],
       [-1.        , -0.5       , -0.33333333]])
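For reference, the view-based and outer-subtract variants mentioned in the edit above produce the same matrix (a quick sketch):
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7])

C1 = 1.0 / (x.reshape((-1, 1)) - y)  # reshape makes a copy of x
C2 = 1.0 / (x[:, np.newaxis] - y)    # newaxis gives a view instead
C3 = 1.0 / np.subtract.outer(x, y)   # explicit outer difference

assert np.allclose(C1, C2) and np.allclose(C1, C3)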
I tried a few other methods, all of which were significantly slower.
This is the naive approach, using a nested list comprehension:
cauchym = numpy.array([[ 1.0/(x_i-y_j) for y_j in y] for x_i in x])
This one generates the matrix as a 1-dimensional array (saving the cost of nested Python lists) and reshapes it to a matrix afterward. It also moves the division to a single Numpy operation:
cauchym = 1.0/numpy.array([(x_i-y_j) for x_i in x for y_j in y]).reshape([len(x),len(y)])
Using numpy.repeat and numpy.tile (numpy.repeat repeats each element of the array; numpy.tile repeats the whole array). This approach makes unnecessary copies:
lenx = len(x)
leny = len(y)
xm = numpy.repeat(x, leny)  # each x_i is repeated leny times
ym = numpy.tile(y, lenx)
cauchym = (1.0/(xm - ym)).reshape([lenx, leny])
I created a function; I hope it helps you understand it in a better way.
import numpy as np

# Creating a function in order to form a Cauchy matrix
def cauchy_matrix(arr1, arr2):
    """
    Enter two 1-D arrays in order to get a Cauchy matrix.
    arr1 = first 1-D array
    arr2 = second 1-D array
    Returns the Cauchy matrix of shape m*n, where m is the size of arr1
    and n is the size of arr2.
    """
    my_list = []
    try:
        for i in range(len(arr1)):
            for j in range(len(arr2)):
                z = 1/(arr1[i] - arr2[j])
                my_list.append(z)
        return np.array(my_list).reshape(arr1.shape[0], arr2.shape[0])
    except ZeroDivisionError:
        print("Division by zero: arr1 and arr2 share a common element, "
              "so some arr1[i] - arr2[j] is 0.")

How to normalize a NumPy array to a unit vector?

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has the norm value of 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but it fails when v has length (norm) 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # _d is a (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses NumPy's peak-to-peak function, np.ptp.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0)  # array([1., 1., 1.]), the columns sum to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0)  # array([1., 1., 1.]), the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned scikit-learn, so I want to share another solution.
scikit-learn's MinMaxScaler
In scikit-learn, there is an API called MinMaxScaler which lets you customize the value range as you like.
It also deals with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array that we want to normalize by the last axis, while some rows have zero norm.
import numpy as np
arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)  # np.float is deprecated; the builtin float works
lengths = np.linalg.norm(arr, axis=-1)
print(lengths)  # [ 3.74165739  0.         10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
#  [0.         0.         0.        ]
#  [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn, using just numpy.
Just define a function, assuming that the rows are the variables and the columns the samples (axis=1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
       [4, 5, 6]])
stdmtx(X)
array([[-1.,  0.,  1.],
       [-1.,  0.,  1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
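For example, a minimal sketch with rows of known norms:
import numpy as np

a = np.array([[3.0, 4.0],
              [6.0, 8.0]])
print(a / np.linalg.norm(a, axis=1, keepdims=True))
# [[0.6 0.8]
#  [0.6 0.8]]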
If you want all values in [0, 1] for a 1-D array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
where a is your 1-D array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values, there is a restriction: the 1-D array must have at least one 0 and consist only of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x

linear interpolation in numpy

I have 2 numpy arrays
X = [[2 3 6], [7 2 9], [7 1 4]]
a = [0 0.0005413307 0.0010949014 0.0015468832 0.0027740823 0.0033288284]
b = [0 0.0050251256 0.0100502513 0.0150753769 0.0201005025 0.0251256281]
new = []
for z in range(3):
    new.append(interp1d(a, z[0], b, 'linear'))
I am getting error as :
if xi is not None and shape[axis] != len(xi):
TypeError: tuple indices must be integers, not str
I need to find the linear interpolation of the same. How can I find that?
I have values X with respect to time a, but I want to find the interpolation for time b.
Will linear interpolation give me 3 points, as in X, for every a[i] and b[i]?
You put the arguments in the wrong order. Following is the help message of interp1d; check it out:
interp1d(x, y, kind='linear', axis=-1, copy=True, bounds_error=True, fill_value=np.nan)
Interpolate a 1-D function.
x and y are arrays of values used to approximate some function f:
y = f(x) .
This class returns a function whose call method uses interpolation
to find the value of new points.
interp1d is a function whose return value is a new function. This new function can then be called with values in the given interpolation range:
from scipy.interpolate import interp1d
x1 = [ 0., 0.04007922, 0.04723573, 0.05440107, 0.06178645, 0.06837938]
x2 = [ 0., 0.00502513, 0.01005025, 0.01507538, 0.0201005, 0.02512563]
f = interp1d(x1, x2)
f([0.0, 0.01, 0.02, 0.03, 0.068])
#array([ 0. , 0.0012538 , 0.0025076 , 0.0037614 , 0.02483647])
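Applied to the question's setup (a sketch with made-up data, since the shapes of X and a in the question don't match): interp1d accepts an axis argument, so you can interpolate every row of X from the times in a to the times in b in a single call.
from scipy.interpolate import interp1d
import numpy as np

a = np.linspace(0.0, 0.0033288284, 6)  # known sample times (made up)
b = np.linspace(0.0, 0.0033288284, 6)  # times to interpolate at (made up)
X = np.random.rand(3, 6)               # one row of samples per signal

f = interp1d(a, X, kind='linear', axis=1)  # interpolate along the time axis
X_at_b = f(b)                              # shape (3, 6): each row evaluated at b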

Numpy meshgrid in 3D

Numpy's meshgrid is very useful for converting two vectors to a coordinate grid. What is the easiest way to extend this to three dimensions? So given three vectors x, y, and z, construct 3x3D arrays (instead of 2x2D arrays) which can be used as coordinates.
Numpy (as of 1.8, I think) now supports higher-than-2D generation of position grids with meshgrid. One important addition which really helped me is the ability to choose the indexing order (either xy or ij for Cartesian or matrix indexing, respectively), which I verified with the following example:
import numpy as np
x_ = np.linspace(0., 1., 10)
y_ = np.linspace(1., 2., 20)
z_ = np.linspace(3., 4., 30)
x, y, z = np.meshgrid(x_, y_, z_, indexing='ij')
assert np.all(x[:,0,0] == x_)
assert np.all(y[0,:,0] == y_)
assert np.all(z[0,0,:] == z_)
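As a quick follow-up sketch, the three grids can then be used to evaluate a function over the whole volume:
# evaluate an arbitrary scalar function on the full 10x20x30 grid
f = np.sin(x) * np.cos(y) + z
print(f.shape)  # (10, 20, 30)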
Here is the source code of meshgrid (from an older, 2-D-only version of NumPy):
def meshgrid(x, y):
    """
    Return coordinate matrices from two coordinate vectors.

    Parameters
    ----------
    x, y : ndarray
        Two 1-D arrays representing the x and y coordinates of a grid.

    Returns
    -------
    X, Y : ndarray
        For vectors `x`, `y` with lengths ``Nx=len(x)`` and ``Ny=len(y)``,
        return `X`, `Y` where `X` and `Y` are ``(Ny, Nx)`` shaped arrays
        with the elements of `x` and y repeated to fill the matrix along
        the first dimension for `x`, the second for `y`.

    See Also
    --------
    index_tricks.mgrid : Construct a multi-dimensional "meshgrid"
                         using indexing notation.
    index_tricks.ogrid : Construct an open multi-dimensional "meshgrid"
                         using indexing notation.

    Examples
    --------
    >>> X, Y = np.meshgrid([1,2,3], [4,5,6,7])
    >>> X
    array([[1, 2, 3],
           [1, 2, 3],
           [1, 2, 3],
           [1, 2, 3]])
    >>> Y
    array([[4, 4, 4],
           [5, 5, 5],
           [6, 6, 6],
           [7, 7, 7]])

    `meshgrid` is very useful to evaluate functions on a grid.

    >>> x = np.arange(-5, 5, 0.1)
    >>> y = np.arange(-5, 5, 0.1)
    >>> xx, yy = np.meshgrid(x, y)
    >>> z = np.sin(xx**2+yy**2)/(xx**2+yy**2)
    """
    x = asarray(x)
    y = asarray(y)
    numRows, numCols = len(y), len(x)  # yes, reversed
    x = x.reshape(1, numCols)
    X = x.repeat(numRows, axis=0)
    y = y.reshape(numRows, 1)
    Y = y.repeat(numCols, axis=1)
    return X, Y
It is fairly simple to understand. I extended the pattern to an arbitrary number of dimensions. This code is by no means optimized (and not thoroughly error-checked either), but you get what you pay for. Hope it helps:
import numpy as np

def meshgrid2(*arrs):
    arrs = tuple(reversed(arrs))  # edit
    lens = list(map(len, arrs))   # list() so it can be indexed on Python 3
    dim = len(arrs)
    ans = []
    for i, arr in enumerate(arrs):
        slc = [1] * dim
        slc[i] = lens[i]
        arr2 = np.asarray(arr).reshape(slc)
        for j, sz in enumerate(lens):
            if j != i:
                arr2 = arr2.repeat(sz, axis=j)
        ans.append(arr2)
    return tuple(ans)
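A usage sketch; because of the reversed(arrs) on the first line, the grids come back in reversed argument order:
x = np.array([1, 2])
y = np.array([10, 20, 30])
z = np.array([100, 200, 300, 400])
zz, yy, xx = meshgrid2(x, y, z)
print(xx.shape)  # (4, 3, 2), i.e. (len(z), len(y), len(x))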
Can you show us how you are using np.meshgrid? There is a very good chance that you really don't need meshgrid because numpy broadcasting can do the same thing without generating a repetitive array.
For example,
import numpy as np
x=np.arange(2)
y=np.arange(3)
[X,Y] = np.meshgrid(x,y)
S=X+Y
print(S.shape)
# (3, 2)
# Note that meshgrid associates y with the 0-axis, and x with the 1-axis.
print(S)
# [[0 1]
# [1 2]
# [2 3]]
s=np.empty((3,2))
print(s.shape)
# (3, 2)
# x.shape is (2,).
# y.shape is (3,).
# x's shape is broadcasted to (3,2)
# y varies along the 0-axis, so to get its shape broadcasted, we first upgrade it to
# have shape (3,1), using np.newaxis. Arrays of shape (3,1) can be broadcasted to
# arrays of shape (3,2).
s=x+y[:,np.newaxis]
print(s)
# [[0 1]
# [1 2]
# [2 3]]
The point is that S=X+Y can and should be replaced by s=x+y[:,np.newaxis] because
the latter does not require (possibly large) repetitive arrays to be formed. It also generalizes to higher dimensions (more axes) easily. You just add np.newaxis where needed to effect broadcasting as necessary.
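For example, the 3-D analogue needs no meshgrid at all (a minimal sketch):
import numpy as np

x = np.arange(2)
y = np.arange(3)
z = np.arange(4)
s = x + y[:, np.newaxis] + z[:, np.newaxis, np.newaxis]
print(s.shape)  # (4, 3, 2)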
See http://www.scipy.org/EricsBroadcastingDoc for more on numpy broadcasting.
I think what you want is
X, Y, Z = numpy.mgrid[-10:10:100j, -10:10:100j, -10:10:100j]
for example.
Here is a multidimensional version of meshgrid that I wrote:
import numpy as np

def ndmesh(*args):
    args = list(map(np.asarray, args))  # list() for Python 3
    return np.broadcast_arrays(*[x[(slice(None),) + (None,)*i] for i, x in enumerate(args)])
Note that the returned arrays are views of the original array data, so changing the original arrays will affect the coordinate arrays.
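A quick check of what it returns (note that the broadcasting puts the first argument along the last axis):
xx, yy = ndmesh(np.array([1, 2, 3]), np.array([4, 5]))
print(xx.shape, yy.shape)  # (2, 3) (2, 3)
print(xx)
# [[1 2 3]
#  [1 2 3]]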
Instead of writing a new function, numpy.ix_ should do what you want.
Here is an example from the documentation:
>>> ixgrid = np.ix_([0,1], [2,4])
>>> ixgrid
(array([[0],
        [1]]), array([[2, 4]]))
>>> ixgrid[0].shape, ixgrid[1].shape
((2, 1), (1, 2))
You can achieve that by changing the order:
import numpy as np
xx = np.array([1,2,3,4])
yy = np.array([5,6,7])
zz = np.array([9,10])
y, z, x = np.meshgrid(yy, zz, xx)
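A quick shape check (with NumPy's default 'xy' indexing, the first two axes come back swapped, which is why reordering the inputs works):
print(x.shape, y.shape, z.shape)  # (2, 3, 4) each, i.e. (len(zz), len(yy), len(xx))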
