I want to measure the Euclidean distance between pairs of 5-dimensional points.
It looks like this:
center x
0 [0.09771348879, 1.856078237, 2.100760575, 9.25... [-1.35602640228e-12, -2.94706481441e-11, -6.51...
1 [8.006780488, 1.097849488, 0.6275244427, 0.572... [4.99212418613, 5.01853294023, -0.014304672946...
2 [-1.40785823, -1.714959744, -0.5524032233, -0.... [-1.61000102139e-11, -4.680034138e-12, 1.96087...
The first column is the index, the second is one point (center), and the third is the other point (x); all points are 5-dimensional.
I want to use pdist since it works for n-dimensional data. But the problem is that pdist expects the points arranged as m n-dimensional row vectors in a matrix X, whereas what I have above is only the data format and not such a matrix, and it also contains the index, which it should not.
My code is (S is the DataFrame shown above):
S = pd.DataFrame(paired_data, columns=['x','center'])
print (S.to_string())
Y = pdist(S[1:], 'euclidean')
print(Y)
This seems to work:
for i in range(S.shape[0]):
    M = np.matrix([S['x'][i], S['center'][i]])
    print(pdist(M, 'euclidean'))
or with iterrows():
for row in S.iterrows():
    M = np.matrix([row[1]['x'], row[1]['center']])
    print(pdist(M, 'euclidean'))
Note that creating a matrix isn't necessary; pdist handles a plain Python list of lists just fine:
for row in S.iterrows():
    print(pdist([row[1]['x'], row[1]['center']], 'euclidean'))
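As a further note (my own addition, assuming every entry in S['x'] and S['center'] is a list of the same length), the whole column of distances can also be computed without a Python loop by stacking the list-valued columns into 2-D arrays and taking the row-wise norm of their difference:

import numpy as np

# Stack the list-valued columns into (n_rows, 5) arrays.
X = np.vstack(S['x'].tolist())
C = np.vstack(S['center'].tolist())

# Row-wise Euclidean distance between each x and its center.
D = np.linalg.norm(X - C, axis=1)
print(D)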
P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I have the above dataset where X corresponds to the rows and Y corresponds to the columns. I was wondering how I can find the covariance of X and Y. Is it as simple as running np.cov()?
It is as simple as doing np.cov(matrix).
# P is the array from the question above
covariance_matrix = np.cov(P)
print(covariance_matrix)
array([[ 2.24741487e-04, 6.99919604e-05, 2.57114780e-05,
-2.82152656e-05, 1.06129995e-04],
[ 6.99919604e-05, 2.26110038e-04, 9.53538651e-07,
8.16500154e-05, -2.01348493e-05],
[ 2.57114780e-05, 9.53538651e-07, 7.92448292e-05,
1.35747682e-05, -8.11832888e-05],
[-2.82152656e-05, 8.16500154e-05, 1.35747682e-05,
2.03852891e-04, -1.26682381e-04],
[ 1.06129995e-04, -2.01348493e-05, -8.11832888e-05,
-1.26682381e-04, 2.37225703e-04]])
Unfortunately, it is not as simple as running np.cov(), at least not in your case.
For the given problem, the table P has only non-negative entries and sums to 1.0. Moreover, since the table is called P and you refer to the random variables X and Y, I'm fairly certain that it is the joint probability table of a discrete, bivariate distribution of the random vector (X, Y). In that case np.cov(P) is not correct, because np.cov computes the empirical covariance matrix of a table of data points (where, by default, each row represents one variable and each column one observation).
However, you provided the probabilities rather than actual data. This source provides an example of a bivariate probability table where the values of X and Y are actually provided, enabling the computation of Cov(X,Y). Additionally, this reference elaborates on such tables of smaller size.
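As a quick illustration of that point (my own, not part of the original answer): with np.cov's default rowvar=True, each row is treated as one variable and each column as one observation of it, which is not what a joint probability table represents.

import numpy as np

# Two variables ("rows") observed four times each ("columns");
# np.cov returns their 2x2 empirical covariance matrix.
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])
print(np.cov(data))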
Since no values are provided, I assume that X takes the values 0,...,4 and Y takes the values 0,...,8. With $\mu_X$ and $\mu_Y$ denoting the expectations of X and Y, and $f(x,y)$ denoting the entries of your table P, the covariance is defined as

$$\mathrm{Cov}(X,Y) = \sum_{x}\sum_{y} (x - \mu_X)\,(y - \mu_Y)\, f(x,y),$$

which can be computed efficiently via
import numpy as np

# values the random variables can take
X = np.array([0, 1, 2, 3, 4])
Y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

# expectations (X indexes the rows of P, Y the columns)
mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))

# Covariance by loop
Cov = 0.0
for i in range(P.shape[0]):
    for j in range(P.shape[1]):
        Cov += (X[i] - mu_X) * (Y[j] - mu_Y) * P[i, j]
or, directly via NumPy as
# Covariance by matrix multiplication
mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))
Cov = np.sum(np.multiply(np.outer(X-mu_X, Y-mu_Y), P))
Naturally, both results coincide (up to a floating-point error).
If you replace X and Y with the actual values the random variables can take, you can simply rerun the code and compute the new covariance value Cov.
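For example (these particular values are made up purely to illustrate the substitution):

# hypothetical values for X (rows) and Y (columns); replace with your own
X = np.array([10, 20, 30, 40, 50])
Y = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))
Cov = np.sum(np.multiply(np.outer(X - mu_X, Y - mu_Y), P))
print(Cov)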
I'm trying to convert this version of interp2 from Matlab to Python.
In MATLAB it is used as
Vq = interp2(V,k)
which performs interpolation over a matrix V in which each original interval has been recursively subdivided k times, adding a total of 2^k - 1 interpolated points to each original interval.
However, I haven't found a Python equivalent of this function. I tried scipy.interpolate.interp2d, but it requires three arrays (x, y, and z) rather than a single matrix.
I found this alternative in a forum; it looks like an email transcription, but I'll paste the answer here anyway.
import numpy as np

def interp2d_interleave(z, n):
    '''performs linear interpolation on a grid

    all points are interpolated in one step, not recursively

    Parameters
    ----------
    z : 2d array (M,N)
    n : int
        number of points interpolated between each pair of values

    Returns
    -------
    zi : 2d array ((M-1)*n+M, (N-1)*n+N)
        original and linearly interpolated values
    '''
    frac = np.atleast_2d(np.arange(0, n + 1) / (1.0 + n)).T
    zi1 = np.kron(z[:, :-1], np.ones(len(frac))) + np.kron(np.diff(z), frac.T)
    zi1 = np.hstack((zi1, z[:, -1:]))
    zi2 = np.kron(zi1.T[:, :-1], np.ones(len(frac))) + np.kron(np.diff(zi1.T), frac.T)
    zi2 = np.hstack((zi2, zi1.T[:, -1:]))
    return zi2.T

def interp2d_interleave_recursive(z, n):
    '''interpolates by recursively interleaving n times'''
    zi = z.copy()
    for ii in range(1, n + 1):
        zi = interp2d_interleave(zi, 1)
    return zi
This should be used as follows
xyz = np.zeros((2, 2))
xyz = interp2d_interleave_recursive(xyz, 1)
And the result would be a (3, 3) array (all zeros here, since the input was all zeros):

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
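As an aside, here is a rough sketch (mine, not from the forum answer) of how the same interp2(V, k) behaviour could be reproduced with SciPy's RegularGridInterpolator, evaluating V on a grid in which every original interval is split into 2**k pieces; interp2_like is just a made-up helper name:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

def interp2_like(V, k):
    # Linear interpolation of V on a refined grid, giving an output of shape
    # ((M-1)*2**k + 1, (N-1)*2**k + 1), like MATLAB's interp2(V, k).
    M, N = V.shape
    rgi = RegularGridInterpolator((np.arange(M), np.arange(N)), V, method="linear")
    yi = np.linspace(0, M - 1, (M - 1) * 2**k + 1)
    xi = np.linspace(0, N - 1, (N - 1) * 2**k + 1)
    YI, XI = np.meshgrid(yi, xi, indexing="ij")
    return rgi(np.stack([YI, XI], axis=-1))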
I have a 2-D numpy array that can be subdivided into 64 boxes (think of a chessboard).
The goal is a function that returns the position and value of the maximum in each box. Something like:
FindRefs(array) --> [(argmaxX00, argmaxY00, Max00), ...,(argmaxX63, argmaxY63, Max63)]
where argmaxXnn and argmaxYnn are the indexes of the whole array (not of the box), and Maxnn is the max value in each box. In other words,
Maxnn = array[argmaxYnn, argmaxXnn]
I've tried the obvious nested-for solution:
def FindRefs(a):
    Height, Width = a.shape
    plumx = []
    plumy = []
    lum = []
    w = int(Width / 8)
    h = int(Height / 8)
    for n in range(0, 8):        # iterate over box columns
        x0 = n * w
        x1 = (n + 1) * w
        for m in range(0, 8):    # iterate over box rows
            y0 = m * h
            y1 = (m + 1) * h
            subflatind = a[y0:y1, x0:x1].argmax()  # flattened index of the max inside the box
            y, x = np.unravel_index(subflatind, (h, w))
            X = x0 + x
            Y = y0 + y
            lum.append(a[Y, X])
            plumx.append(X)
            plumy.append(Y)
    refs = []
    for pt in range(0, len(plumx)):
        ptx = plumx[pt]
        pty = plumy[pt]
        refs.append((ptx, pty, lum[pt]))
    return refs
It works, but it is neither elegant nor efficient.
So I've tried this more Pythonic version:
def FindRefs(a):
    h, w = a.shape[0] // 8, a.shape[1] // 8
    box = [(n * w, m * h) for n in range(0, 8) for m in range(0, 8)]
    flatinds = [a[b[1]:h + b[1], b[0]:w + b[0]].argmax() for b in box]
    unravels = np.unravel_index(flatinds, (h, w))
    ur = [(unravels[1][n], unravels[0][n]) for n in range(0, len(box))]
    absinds = [list(map(sum, zip(box[n], ur[n]))) for n in range(0, len(box))]
    refs = [(absinds[n][0], absinds[n][1], a[absinds[n][1], absinds[n][0]]) for n in range(0, len(box))]
    return refs
It works fine but, to my surprise, is not more efficient than the previous version!
The question is: Is there a more clever way to do the task?
Note that efficiency matters, as I have many large arrays for processing.
Any clue is welcome. :)
Try this:
from numpy.lib.stride_tricks import as_strided as ast
import numpy as np

def FindRefs3(a):
    box = tuple(x // 8 for x in a.shape)
    z = ast(a,
            shape=(8, 8) + box,
            strides=(a.strides[0] * box[0], a.strides[1] * box[1]) + a.strides)
    v3 = np.max(z, axis=-1)
    i3r = np.argmax(z, axis=-1)
    v2 = np.max(v3, axis=-1)
    i2 = np.argmax(v3, axis=-1)
    i2x = np.indices(i2.shape)
    i3 = i3r[np.ix_(*[np.arange(x) for x in i2.shape]) + (i2,)]
    i3x = np.indices(i3.shape)
    ix0 = i2x[0] * box[0] + i2
    ix1 = i3x[1] * box[1] + i3
    return list(zip(np.ravel(ix0), np.ravel(ix1), np.ravel(v2)))
Note that your first FindRefs reverses indices, so that for a tuple (i1,i2,v), a[i1,i2] won't return the right value, whereas a[i2,i1] will.
So here's what the code does:
It first calculates the dimensions that each box needs to have (box) given the size of your array. Note that this doesn't do any checking: you need to have an array that can be divided evenly into an 8 by 8 grid.
Then z with ast is the messiest bit. It takes the 2d array, and turns it into a 4d array. The 4d array has dimensions (8,8,box[0],box[1]), so it lets you choose which box you want (the first two axes) and then what position you want in the box (the next two). This lets us deal with all the boxes at once by doing operations on the last two axes.
v3 gives us the maximum values along the last axis: in other words, it contains the maximum of each column in each box. i3r contains the index of which row in the box contained that max value.
v2 takes the maximum of v3 along its own last axis, which is now dealing with rows in the box: it takes the column maxes, and finds the maximum of them, so that v2 is a 2d array containing the maximum value of each box. If all you wanted were the maximums, this is all you'd need.
i2 is the index of the column in the box that holds the maximum value.
Now we need to get the index of the row in the box... that's trickier. i3r contains the row index of the max of each column in the box, but we want the row for the specific column that's specified in i2. We do this by choosing an element from i3r using i2, which gives us i3.
At this point, i2 and i3 are 8 by 8 arrays containing the row and column indexes of the maximums relative to each box. We want the absolute indexes. So we create i2x and i3x (actually, this is pointless; we could just create one, as they are the same), which are just arrays of what the indexes for i2 and i3 are (0, 1, 2, ..., 7 in one dimension, and so on). We then multiply these by the box sizes, and add the relative max indexes, to get the absolute max indexes.
We then combine these to get the same output that you had. Note that if you keep them as arrays, though, instead of making tuples, it's much faster.
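For comparison, here is a reshape-based sketch (my own, not from the original answer) that avoids stride_tricks entirely; like the code above it assumes the array divides evenly into an 8 by 8 grid of boxes, and it returns (x, y, value) tuples in the question's format (find_refs_reshape is a made-up name):

import numpy as np

def find_refs_reshape(a):
    H, W = a.shape
    h, w = H // 8, W // 8
    # View the array as an (8, 8, h, w) stack of boxes.
    boxes = a.reshape(8, h, 8, w).transpose(0, 2, 1, 3)
    flat = boxes.reshape(8, 8, h * w)
    idx = flat.argmax(axis=-1)                 # flat index of the max inside each box
    by, bx = np.unravel_index(idx, (h, w))     # row/column of the max within its box
    gy = np.arange(8)[:, None] * h + by        # absolute row indices
    gx = np.arange(8)[None, :] * w + bx        # absolute column indices
    vals = flat.max(axis=-1)
    return list(zip(gx.ravel(), gy.ravel(), vals.ravel()))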
I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the two series in Python. In MATLAB, the xcorr() function returns it fine.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but they differ from what comes out of MATLAB: Python gives me integer values greater than 1, whereas MATLAB gives actual correlation coefficients between 0 and 1.
I have tried normalizing the two arrays first ((value - mean) / SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag at which the cross-correlation is greatest. I assume it is easy to do this using indices, but what's the most appropriate way if my arrays contain tens of thousands of values?
I would like to mimic MATLAB's xcorr() function; any thoughts on how I would do that in Python?
numpy.correlate(arr1,arr2,"full")
gave me the same output as
xcorr(arr1,arr2)
gives in MATLAB.
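If you also want MATLAB-style 'coeff' scaling (values normalized so a perfect match at zero lag would be 1) and the lag of the peak, here is a hedged sketch of one way to do it (not from the answers above); xcorr_coeff is a made-up helper name:

import numpy as np

def xcorr_coeff(x, y):
    # Full cross-correlation, scaled like MATLAB's xcorr(x, y, 'coeff').
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.correlate(x, y, mode="full")
    corr = corr / np.sqrt(np.dot(x, x) * np.dot(y, y))   # normalize by the signal energies
    lags = np.arange(-(len(y) - 1), len(x))               # lag of each element of corr
    return lags, corr

lags, corr = xcorr_coeff(data1, data2)
best_lag = lags[np.argmax(corr)]   # lag at which the cross-correlation peaks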
Implementation of MATLAB's xcorr(x,y) and comparison of the result with an example.
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt

def xcorr(x, y):
    """
    Perform cross-correlation on x and y.

    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr
n = np.arange(0, 15)
x = 0.84**n
y = np.roll(x, 5)
lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags,c)
plt.show()
This code will help in finding the delay between two channels in an audio file:
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')
frame_len = int(fs * 5 * 1e-3)   # 5 ms frame length
dim_x = xin.shape
M = dim_x[0]  # no. of rows (samples)
N = dim_x[1]  # no. of columns (channels)
sample_lim = frame_len * 100
tau = [0]
M_lim = 20000  # limit for testing, as processing takes time
for i in range(1, N):
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1
    tau.append(pos)
print(tau)
I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y will have different lengths than cdf_b.x, cdf_b.y, and I would like these to be the same length, i.e. use the same bins to compute the CDFs so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs do not have the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable, same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we use it in some goodness-of-fit tests is to stack the arrays, so both empirical CDFs are evaluated at all points from both arrays.
Then use np.searchsorted to get the ranking: the number of points in dataset 1 at or below x, and the number of points in dataset 2 at or below x.
If I remember correctly, look at scipy.stats.ks_2samp:
data1 = np.sort(data1)
data2 = np.sort(data2)
n1, n2 = len(data1), len(data2)
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
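Putting it together, here is a minimal sketch (my own; it assumes the statsmodels ECDF objects are callable on an array of evaluation points, which they are) that evaluates both ECDFs on a grid spanning the combined data range and stores them in one DataFrame:

import numpy as np
import pandas as pd
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
cdf_a, cdf_b = ECDF(a), ECDF(b)

# Common evaluation grid covering both samples, so the vectors have equal length.
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 100)
df = pd.DataFrame({"x": grid, "cdf_a": cdf_a(grid), "cdf_b": cdf_b(grid)})
print(df.head())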