I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the two series in Python. In MATLAB, the xcorr() function returns this directly.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from Python are different from what comes out of MATLAB. Python gives me integer values > 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the two arrays first ((value - mean) / SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag at which the cross-correlation is greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic the xcorr() function that MATLAB has. Any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full") gave me the same output as xcorr(arr1, arr2) gives in MATLAB.
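To also get values scaled like MATLAB's xcorr(x, y, 'coeff') and the lag of the peak, here is a minimal sketch (the helper name xcorr_coeff is just for illustration; the scaling divides by the zero-lag autocorrelation energies, and the lag indexing follows numpy's convention, which can differ from MATLAB's at the edges where MATLAB zero-pads the shorter input):

import numpy as np

def xcorr_coeff(x, y):
    """Cross-correlation scaled like MATLAB's xcorr(x, y, 'coeff') (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.correlate(x, y, mode='full')
    # 'coeff'-style scaling: a perfectly matching alignment gives 1.0
    c /= np.sqrt(np.dot(x, x) * np.dot(y, y))
    lags = np.arange(-(len(y) - 1), len(x))  # lag of each element of c
    return lags, c

data1 = [0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
data2 = [0, 1, 1, 0, 1, 0, 0, 1]
lags, c = xcorr_coeff(data1, data2)
print(c.max(), lags[np.argmax(c)])  # peak value in [0, 1] and the lag where it occurs

This is fine for tens of thousands of samples; for very long signals, scipy.signal.correlate with method='fft' computes the same correlation faster.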
Implementation of MATLAB's xcorr(x,y), with an example to compare the result.
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x, y):
    """
    Perform cross-correlation on x and y.
    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr

n = np.arange(0, 15)
x = 0.84**n
y = np.roll(x, 5)

lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags, c)
plt.show()
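As a quick check on this example, the peak of the stem plot should sit at a lag of -5 under this convention, since y was obtained by delaying x by five samples:

print(lags[np.argmax(c)])  # -5: y lags x by 5 samples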
This code will help in finding the delay between two channels in an audio file:
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')
frame_len = int(fs * 5 * 1e-3)  # 5 ms frame length
dim_x = xin.shape
M = dim_x[0]  # no. of rows (samples)
N = dim_x[1]  # no. of columns (channels)
sample_lim = frame_len * 100
tau = [0]
M_lim = 20000  # limit for testing, as processing the full file takes time

for i in range(1, N):
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1
    tau.append(pos)

print(tau)
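For comparison, the same delay estimate can be written more compactly with scipy.signal.correlation_lags (scipy >= 1.6); a sketch under the same assumptions as above (two-channel 'recording1.wav', first 20000 samples, channel 0 as the reference):

import numpy as np
import scipy.signal as signal
import soundfile as sf

xin, fs = sf.read('recording1.wav')
ref = xin[:20000, 0]    # reference channel
other = xin[:20000, 1]  # channel to compare against

c = signal.correlate(ref, other, mode='full')
lags = signal.correlation_lags(len(ref), len(other), mode='full')
delay = lags[np.argmax(c)]
print(delay, delay / fs)  # delay in samples and in seconds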
I would like to fit a copula to a dataframe with 2 columns, a and b. Then I have to calculate the conditional probability of a < 0 given b < -2, i.e. P(a<0 | b<-2).
I have tried the following code in Python using the copulas library; I am able to fit the copula to the data, but I am not sure how to calculate the CDF:
import pandas
from copulas.multivariate import GaussianMultivariate

df = pandas.read_csv("filename")
cop = GaussianMultivariate()
cop.fit(df)
I know the function cdf can calculate the conditional probability but I am not fully sure how to use that here.
The cdf method takes an array of input points and returns the cumulative probability evaluated at each point.
Give this code a try:
import numpy as np
# the array of inputs where b<-2 and a<0
inputs = np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2 and x<0])
# Pass the inputs...
conditional_prob = cop.cdf(inputs)
Another possible approach (a bit more formal, but longer):
# inputs
pdf = cop.pdf(inputs)
# pass the inputs where b < -2 to the copula's pdf method to calculate the probability density function of B
pdf_b = cop.pdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A and B)
p_a_and_b = pdf * pdf_b
# Calculate P(B)
p_b = cop.cdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A|B)
conditional_prob = p_a_and_b / p_b
let us know if it works for you. cheers.
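For comparison, here is a sketch of the textbook route via P(a<0 | b<-2) = P(a<0, b<-2) / P(b<-2), using the copula's joint CDF for the numerator and a simple empirical estimate for the denominator (this reuses cop and df from the question and assumes the columns are ordered a, b):

import numpy as np

# P(a < 0, b < -2) from the fitted copula's joint CDF
p_joint = cop.cdf(np.array([[0.0, -2.0]]))

# P(b < -2), estimated empirically from the data
p_b = (df['b'] < -2).mean()

conditional_prob = p_joint / p_b
print(conditional_prob)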
I'm trying to cross-correlate two sets of data by taking the Fourier transform of both, multiplying the conjugate of the first FFT with the second FFT, and then transforming back to the time domain. To test my code, I am comparing the output with the output of numpy.correlate. However, when I plot the result (restricted to a certain window), the two signals seem to go in opposite directions / are mirrored about zero.
This is what my output looks like
My code:
import numpy as np
import matplotlib.pyplot as plt

dt = 1  # sample spacing used for the lag axis

phl_data = np.sin(np.arange(0, 10, 0.1))
mlac_data = np.cos(np.arange(0, 10, 0.1))
N = phl_data.size
zeroes = np.zeros(N-1)
phl_data = np.append(phl_data, zeroes)
mlac_data = np.append(mlac_data, zeroes)

# cross-correlate x = phl_data, y = mlac_data:
# take FFTs:
phl_fft = np.fft.fft(phl_data)
mlac_fft = np.fft.fft(mlac_data)
# fft of cross-correlation
Cw = np.conj(phl_fft)*mlac_fft
#Cw = np.fft.fftshift(Cw)
# transform back to time space:
Cxy = np.fft.fftshift(np.fft.ifft(Cw))
times = np.append(np.arange(-N+1, 0, dt), np.arange(0, N, dt))

plt.plot(times, Cxy.real)
plt.xlim(-250, 250)

# test against convolving:
c = np.correlate(phl_data, mlac_data, mode='same')
plt.plot(times, c)
plt.show()
(both data sets have been padded with N-1 zeroes)
The documentation for numpy.correlate explains this:
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
and:
Notes
The definition of correlation above is not unique and sometimes correlation may be defined differently. Another common definition is:
c'_{av}[k] = sum_n a[n] conj(v[n+k])
which is related to c_{av}[k] by c'_{av}[k] = c_{av}[-k].
Thus, there is not a unique definition, and the two common definitions lead to a reversed output.
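In practice, this means the FFT route matches numpy's convention if you conjugate the second spectrum instead of the first. A minimal standalone sketch with the same sine/cosine test signals as the question:

import numpy as np

a = np.sin(np.arange(0, 10, 0.1))
v = np.cos(np.arange(0, 10, 0.1))
N = a.size
L = 2 * N - 1  # length needed for linear (non-circular) correlation

# conjugate the SECOND spectrum to get c_av[k] = sum_n a[n+k] * conj(v[n])
A = np.fft.fft(a, L)
V = np.fft.fft(v, L)
r = np.fft.ifft(A * np.conj(V)).real

# reorder from FFT order (lags 0..N-1, then -(N-1)..-1) to -(N-1)..N-1
r = np.concatenate((r[-(N - 1):], r[:N]))

print(np.allclose(r, np.correlate(a, v, mode='full')))  # True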
I have a numpy array whose values are distributed in the following manner
From this array I need to get a random sub-sample which is normally distributed.
I need to get rid of the values from the array which are above the red line in the picture, i.e. I need to get rid of some occurrences of certain values so that the distribution gets smoothed out once the abrupt peaks are removed.
And my array's distribution should become like this:
Can this be achieved in Python without manually looking for the entries corresponding to the peaks and removing some occurrences of them? Is there a simpler way to do this?
The following kind of works; it is rather aggressive, though:
It works by ordering the samples, transforming to uniform and then trying to select a regular grid-like subsample. If you feel it is too aggressive, you could increase ns, which is essentially the number of samples kept.
Also, please note that it requires knowledge of the true distribution. In the case of a normal distribution you should be fine using the sample mean and the unbiased variance estimate (the one with n-1).
Code (without plotting):
import scipy.stats as ss
import numpy as np

# synthetic data: a normal sample with artificial "peaks" added to ~40% of the points
a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000) < 0.4
a[b] += 0.1 * np.sin(10 * a[b])

def smooth(a, gran=25):
    # sort the sample and map it to [0, 1] through the (assumed known) true CDF
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])
    # number of samples to keep, based on the widest window of gran consecutive samples
    ns = int(gran / np.max(s[gran:] - s[:-gran]))
    # regular grid on [0, 1], evaluated at cell midpoints
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp / 2
    # pick one sample per grid cell, nudging forward on collisions
    idx = np.searchsorted(s, grid)
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c + 1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]

ap = a[smooth(a)]

# histograms of the original and the subsampled data on the same bins
c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)
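To see the effect, a small matplotlib sketch comparing the two histograms computed above (c for the original sample, cp for the subsample; note cp has fewer counts overall since only a subsample is kept):

import matplotlib.pyplot as plt

centers = (b[:-1] + b[1:]) / 2  # bin centers from the shared edges
plt.plot(centers, c, label='original')
plt.plot(centers, cp, label='subsampled')
plt.legend()
plt.show()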
Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on the eigenvalues of the design matrix). How can I capture high-multicollinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are collinear.
If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:
>>> import numpy as np
>>> xs = np.random.randn(100, 5)      # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if, say, x[4] - 2 * x[0] - 3 * x[2] ≈ 0 (as constructed below with a little noise), then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the near-zero coefficients, the above basically says that x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies by this eigenvector, the result will hover around zero with a small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigenvalue that was close to zero.
So, in order to capture high multicollinearity, look at the eigenvalues of the correlation matrix.
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity, and options include the determinant and condition number of the correlation matrix.
According to one of the R answers, the determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                 # 0.988532159861
print(np.linalg.det(multicollinear_corr))  # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                 # 1.23116253259
print(np.linalg.cond(multicollinear_corr))  # 6.19985218873e+15
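Back to the original question of capturing this in a variable: the number quoted in the summary warning is the condition number of the design matrix, which you can recompute from the fitted results object. A sketch, where mod is the fitted OLS result from the question (the condition_number and eigenvals attributes should be available on reasonably recent statsmodels versions):

import numpy as np

# condition number of the design matrix, recomputed directly
cond = np.linalg.cond(mod.model.exog)

# recent statsmodels also exposes these on the results object
print(mod.condition_number)  # the value shown in the summary warning
print(mod.eigenvals)         # eigenvalues of X'X used to compute it

# simple programmatic check
high_collinearity = cond > 1000  # choose whatever threshold you consider "large"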
I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y will be of different lengths than cdf_b.x, cdf_b.y, and I would like these to be the same length, i.e. use the same number of bins to compute the CDFs so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs are not of the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable, same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we use it in some goodness-of-fit tests is to stack the arrays, so that both ECDFs are evaluated at all points, i.e. the points from both arrays.
Then use np.searchsorted to get, for each point x, the number of points in dataset 1 below x and the number of points in dataset 2 below x.
If I remember correctly, look at scipy.stats.ks_2samp.
data1 = np.sort(data1)
data2 = np.sort(data2)
n1, n2 = len(data1), len(data2)
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
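Putting it together, a small end-to-end sketch: evaluate both ECDFs on one shared grid spanning the combined data range, so the resulting columns have equal length and a common x scale (np.linspace(0, 1, 10) from above only covers part of the range of randn data):

import numpy as np
import pandas as pd
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
cdf_a, cdf_b = ECDF(a), ECDF(b)

# one shared grid covering both samples
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 100)
df = pd.DataFrame({"x": grid, "cdf_a": cdf_a(grid), "cdf_b": cdf_b(grid)})

df.plot(x="x")  # both ECDFs, same length, same scale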