Checking for Multicollinearity in Python [duplicate] - python

Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high-multi-collinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?

You can detect high-multi-collinearity by inspecting the eigen values of correlation matrix. A very low eigen value shows that the data are collinear, and the corresponding eigen vector shows which variables are collinear.
If there is no collinearity in the data, you would expect that none of the eigen values are close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigen values (here the very first one), is close to zero. The corresponding eigen vector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring almost zero coefficients, above basically says x[0], x[2] and x[4] are colinear (as expected). If one standardizes xs values and multiplies by this eigen vector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigen value which was close to zero.
So, in order to capture high multi-linearity, look at the eigen values of correlation matrix.

Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity, and options include the determinant and condition number of the correlation matrix.
According to one of the R answers, determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print np.linalg.det(corr) . # 0.988532159861
print np.linalg.det(multicollinear_corr) . # 2.97779797328e-16
And similarly, the condition number of the covariance matrix will approach infinity with perfect linear dependence.
print np.linalg.cond(corr) . # 1.23116253259
print np.linalg.cond(multicollinear_corr) . # 6.19985218873e+15

Related

How to fit and calculate conditional probability of copula in Python

I would like to fit a copula to a dataframe with 2 columns: a and b. Then I have to calculate the conditional probability of a < 0, when b<-2 (i.e. P(a<0|b<-1).
I have tried the following code in python using the library copula; I am able to fit the copula to the data but I am not sure about calculating cdf :
import copula
df = pandas.read_csv("filename")
cop = copulas.multivariate.GaussianMultivariate()
cop.fit(df)
I know the function cdf can calculate the conditional probability but I am not fully sure how to use that here.
The cdf method takes in an array of inputs and returns an array of the same shape, being the cumulative probability of each input value.
give a try to this code:
import numpy as np
# the array of inputs where b<-2 and a<0
inputs = np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2 and x<0])
# Pass the inputs...
conditional_prob = cop.cdf(inputs)
another possible approach (a bit more formal, but longer)
# inputs
pdf = cop.pdf(inputs)
# pass the inputs where b < -2 to the copula's pdf method to calculate the probability density function of B
pdf_b = cop.pdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A and B)
p_a_and_b = pdf * pdf_b
# Calculate P(B)
p_b = cop.cdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A|B)
conditional_prob = p_a_and_b / p_b
let us know if it works for you. cheers.

Calculating Covariance of datasets

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I have the above dataset where X corresponds to the rows and Y corresponds to the columns. I was wondering how I can find the covariance of X and Y. is it as simple as running np.cov()?
It is as simple as doing np.cov(matrix).
P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
covariance_matrix = np.cov(P)
print(covariance_matrix)
array([[ 2.24741487e-04, 6.99919604e-05, 2.57114780e-05,
-2.82152656e-05, 1.06129995e-04],
[ 6.99919604e-05, 2.26110038e-04, 9.53538651e-07,
8.16500154e-05, -2.01348493e-05],
[ 2.57114780e-05, 9.53538651e-07, 7.92448292e-05,
1.35747682e-05, -8.11832888e-05],
[-2.82152656e-05, 8.16500154e-05, 1.35747682e-05,
2.03852891e-04, -1.26682381e-04],
[ 1.06129995e-04, -2.01348493e-05, -8.11832888e-05,
-1.26682381e-04, 2.37225703e-04]])
Unfortunately, it is not as simple as running np.cov(); at least in your case.
For the given problem, the table P has only non-negative entries and sums to 1.0. Moreover, since the table is called P and you invoke the random variables X and Y I'm somewhat certain that you present the joint probability table of a discrete, bivariate probability distribution of a random vector (X, Y). In turn, np.cov(X) is not correct as it computes the empirical covariance matrix of a table of datapoints (where each row represents an observation and each column refers to a single feature).
However, you provided the probabilities rather than actual data. This source provides an example of a bivariate probability table where the values of X and Y are actually provided, enabling the computation of Cov(X,Y). Additionally, this reference elaborates on such tables of smaller size.
Since no values are provided, I assume that X takes values 0,...,4 and Y takes values 0,...,8. Given $mu_{X}$ and $mu_{Y}$ as the expectations of X and Y, and f(x,y) as the entries in your table P, the definition of the covariance is given by
can be efficiently computed via
import numpy as np
# values the random variables can take
X = np.array([0,1,2,3,4])
Y = np.array([0,1,2,3,4,5,6,7,8])
# expectation
mu_X = np.dot(Y, np.sum(P,0))
mu_Y = np.dot(X, np.sum(P,1))
# Covriance by loop
Cov = 0.0
for i in range(P.shape[0]):
for j in range(P.shape[1]):
Cov_1 += (X[i] - mu_X)*(Y[j] - mu_Y)*P[i,j]
or, directly via NumPy as
# Covariance by matrix multiplication
mu_X = np.dot(Y, np.sum(P,0))
mu_Y = np.dot(X, np.sum(P,1))
Cov = np.sum(np.multiply(np.outer(X-mu_X, Y-mu_Y), P))
Naturally, both results coincide (up to a floating-point error).
If you replace X and Y with the actual values the random variable can take, yu can simply rerun the code and compute the new covariance value Cov.

Calculating cross-correlation with fft returning backwards output

I'm trying to cross correlate two sets of data, by taking the fourier transform of both and multiplying the conjugate of the first fft with the second fft, before transforming back to time space. In order to test my code, I am comparing the output with the output of numpy.correlate. However, when I plot my code, (restricted to a certain window), it seems the two signals go in opposite directions/are mirrored about zero.
This is what my output looks like
My code:
import numpy as np
import pyplot as plt
phl_data = np.sin(np.arange(0, 10, 0.1))
mlac_data = np.cos(np.arange(0, 10, 0.1))
N = phl_data.size
zeroes = np.zeros(N-1)
phl_data = np.append(phl_data, zeroes)
mlac_data = np.append(mlac_data, zeroes)
# cross-correlate x = phl_data, y = mlac_data:
# take FFTs:
phl_fft = np.fft.fft(phl_data)
mlac_fft = np.fft.fft(mlac_data)
# fft of cross-correlation
Cw = np.conj(phl_fft)*mlac_fft
#Cw = np.fft.fftshift(Cw)
# transform back to time space:
Cxy = np.fft.fftshift(np.fft.ifft(Cw))
times = np.append(np.arange(-N+1, 0, dt),np.arange(0, N, dt))
plt.plot(times, Cxy)
plt.xlim(-250, 250)
# test against convolving:
c = np.correlate(phl_data, mlac_data, mode='same')
plt.plot(times, c)
plt.show()
(both data sets have been padded with N-1 zeroes)
The documentation to numpy.correlate explains this:
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
and:
Notes
The definition of correlation above is not unique and sometimes correlation may be defined differently. Another common definition is:
c'_{av}[k] = sum_n a[n] conj(v[n+k])
which is related to c_{av}[k] by c'_{av}[k] = c_{av}[-k].
Thus, there is not a unique definition, and the two common definitions lead to a reversed output.

Covariance matrix from np.polyfit() has negative diagonal?

Problem: the cov=True option of np.polyfit() produces a diagonal with non-sensical negative values.
UPDATE: after playing with this some more, I am really starting to suspect a bug in numpy? Is that possible? Deleting any pair of 13 values from the dataset will fix the problem.
I am using np.polyfit() to calculate the slope and intercept coefficients of a dataset. A plot of the values produces a very linear (but not perfectly) linear graph. I am attempting to get the standard deviation on these coefficients with np.sqrt(np.diag(cov)); however, this throws an error because the diagonal contains negative values.
It should be mathematically impossible to produce a covariate matrix with a negative diagonal--what is numpy doing wrong?
Here is a snippet that reproduces the problem:
import numpy as np
x = [1476728821.797, 1476728821.904, 1476728821.911, 1476728821.920, 1476728822.031, 1476728822.039,
1476728822.047, 1476728822.153, 1476728822.162, 1476728822.171, 1476728822.280, 1476728822.289,
1476728822.297, 1476728822.407, 1476728822.416, 1476728822.423, 1476728822.530, 1476728822.539,
1476728822.547, 1476728822.657, 1476728822.666, 1476728822.674, 1476728822.759, 1476728822.788,
1476728822.797, 1476728822.805, 1476728822.915, 1476728822.923, 1476728822.931, 1476728823.038,
1476728823.047, 1476728823.054, 1476728823.165, 1476728823.175, 1476728823.182, 1476728823.292,
1476728823.300, 1476728823.308, 1476728823.415, 1476728823.424, 1476728823.432, 1476728823.551,
1476728823.559, 1476728823.567, 1476728823.678, 1476728823.689, 1476728823.697, 1476728823.808,
1476728823.828, 1476728823.837, 1476728823.947, 1476728823.956, 1476728823.964, 1476728824.074,
1476728824.083, 1476728824.091, 1476728824.201, 1476728824.209, 1476728824.217, 1476728824.324,
1476728824.333, 1476728824.341, 1476728824.451, 1476728824.460, 1476728824.468, 1476728824.579,
1476728824.590, 1476728824.598, 1476728824.721, 1476728824.730, 1476728824.788]
y = [6309927, 6310105, 6310116, 6310125, 6310299, 6310317, 6310326, 6310501, 6310513, 6310523, 6310688,
6310703, 6310712, 6310875, 6310891, 6310900, 6311058, 6311069, 6311079, 6311243, 6311261, 6311272,
6311414, 6311463, 6311479, 6311490, 6311665, 6311683, 6311692, 6311857, 6311867, 6311877, 6312037,
6312054, 6312065, 6312230, 6312248, 6312257, 6312430, 6312442, 6312455, 6312646, 6312665, 6312675,
6312860, 6312879, 6312894, 6313071, 6313103, 6313117, 6313287, 6313304, 6313315, 6313489, 6313505,
6313518, 6313675, 6313692, 6313701, 6313875, 6313888, 6313898, 6314076, 6314093, 6314104, 6314285,
6314306, 6314321, 6314526, 6314541, 6314638]
z, cov = np.polyfit(np.asarray(x), np.asarray(y), 1, cov=True)
std = np.sqrt(np.diag(cov))
print z
print cov
print std
It looks like it's related to your x values: they have a total range of about 3, with an offset of about 1.5 billion.
In your code
np.asarray(x)
converts the x values in a ndarray of float64. While this is fine to correctly represent the x values themselves, it might not be enough to carry on the required computations to get the covariance matrix.
np.asarray(x, dtype=np.float128)
would solve the problem, but polyfit can't work with float128 :(
TypeError: array type float128 is unsupported in linalg
As a workaround, you can subtract the offset from x and then using polyfit. This produces a covariance matrix with positive diagonal:
x1 = x - np.mean(x)
z1, cov1 = np.polyfit(np.asarray(x1), np.asarray(y), 1, cov=True)
std1 = np.sqrt(np.diag(cov1))
print z1 # prints: array([ 1.56607841e+03, 6.31224162e+06])
print cov1 # prints: array([[ 4.56066546e+00, -2.90980285e-07],
# [ -2.90980285e-07, 3.36480951e+00]])
print std1 # prints: array([ 2.13557146, 1.83434171])
You'll have to rescale the results accordingly.

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the max cross correlation of the 2 series in python. In matlab, the xcorr() function will return it OK
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from python are different from what comes out of matlab. Python gives me integers values > 1, whereas matlab gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first (value-mean/SD), but the cross correlation values I get are in the thousands which doesnt seem correct.
Matlab will also give you a lag value at which the cross correlation is the greatest. I assume it is easy to do this using indices but whats the most appropriate way of doing this if my arrays contain 10's of thousands of values?
I would like to mimic the xcorr() function that matlab has, any thoughts on how I would do that in python?
numpy.correlate(arr1,arr2,"full")
gave me same output as
xcorr(arr1,arr2)
gives in matlab
Implementation of MATLAB xcorr(x,y) and comparision of result with example.
import scipy.signal as signal
def xcorr(x,y):
"""
Perform Cross-Correlation on x and y
x : 1st signal
y : 2nd signal
returns
lags : lags of correlation
corr : coefficients of correlation
"""
corr = signal.correlate(x, y, mode="full")
lags = signal.correlation_lags(len(x), len(y), mode="full")
return lags, corr
n = np.array([i for i in range(0,15)])
x = 0.84**n
y = np.roll(x,5);
lags,c = xcorr(x,y);
plt.figure()
plt.stem(lags,c)
plt.show()
This code will help in finding the delay between two channels in audio file
xin, fs = sf.read('recording1.wav')
frame_len = int(fs*5*1e-3)
dim_x =xin.shape
M = dim_x[0] # No. of rows
N= dim_x[1] # No. of col
sample_lim = frame_len*100
tau = [0]
M_lim = 20000 # for testing as processing takes time
for i in range(1,N):
c = np.correlate(xin[0:M_lim,0],xin[0:M_lim,i],"full")
maxlags = M_lim-1
c = c[M_lim -1 -maxlags: M_lim + maxlags]
Rmax_pos = np.argmax(c)
pos = Rmax_pos-M_lim+1
tau.append(pos)
print(tau)

Categories

Resources