Calculating Covariance of datasets - python

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I have the above dataset where X corresponds to the rows and Y corresponds to the columns. I was wondering how I can find the covariance of X and Y. is it as simple as running np.cov()?

It is as simple as doing np.cov(matrix).
P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
covariance_matrix = np.cov(P)
print(covariance_matrix)
array([[ 2.24741487e-04, 6.99919604e-05, 2.57114780e-05,
-2.82152656e-05, 1.06129995e-04],
[ 6.99919604e-05, 2.26110038e-04, 9.53538651e-07,
8.16500154e-05, -2.01348493e-05],
[ 2.57114780e-05, 9.53538651e-07, 7.92448292e-05,
1.35747682e-05, -8.11832888e-05],
[-2.82152656e-05, 8.16500154e-05, 1.35747682e-05,
2.03852891e-04, -1.26682381e-04],
[ 1.06129995e-04, -2.01348493e-05, -8.11832888e-05,
-1.26682381e-04, 2.37225703e-04]])

Unfortunately, it is not as simple as running np.cov(); at least in your case.
For the given problem, the table P has only non-negative entries and sums to 1.0. Moreover, since the table is called P and you invoke the random variables X and Y I'm somewhat certain that you present the joint probability table of a discrete, bivariate probability distribution of a random vector (X, Y). In turn, np.cov(X) is not correct as it computes the empirical covariance matrix of a table of datapoints (where each row represents an observation and each column refers to a single feature).
However, you provided the probabilities rather than actual data. This source provides an example of a bivariate probability table where the values of X and Y are actually provided, enabling the computation of Cov(X,Y). Additionally, this reference elaborates on such tables of smaller size.
Since no values are provided, I assume that X takes values 0,...,4 and Y takes values 0,...,8. Given $mu_{X}$ and $mu_{Y}$ as the expectations of X and Y, and f(x,y) as the entries in your table P, the definition of the covariance is given by
can be efficiently computed via
import numpy as np
# values the random variables can take
X = np.array([0,1,2,3,4])
Y = np.array([0,1,2,3,4,5,6,7,8])
# expectation
mu_X = np.dot(Y, np.sum(P,0))
mu_Y = np.dot(X, np.sum(P,1))
# Covriance by loop
Cov = 0.0
for i in range(P.shape[0]):
for j in range(P.shape[1]):
Cov_1 += (X[i] - mu_X)*(Y[j] - mu_Y)*P[i,j]
or, directly via NumPy as
# Covariance by matrix multiplication
mu_X = np.dot(Y, np.sum(P,0))
mu_Y = np.dot(X, np.sum(P,1))
Cov = np.sum(np.multiply(np.outer(X-mu_X, Y-mu_Y), P))
Naturally, both results coincide (up to a floating-point error).
If you replace X and Y with the actual values the random variable can take, yu can simply rerun the code and compute the new covariance value Cov.

Related

How to fit and calculate conditional probability of copula in Python

I would like to fit a copula to a dataframe with 2 columns: a and b. Then I have to calculate the conditional probability of a < 0, when b<-2 (i.e. P(a<0|b<-1).
I have tried the following code in python using the library copula; I am able to fit the copula to the data but I am not sure about calculating cdf :
import copula
df = pandas.read_csv("filename")
cop = copulas.multivariate.GaussianMultivariate()
cop.fit(df)
I know the function cdf can calculate the conditional probability but I am not fully sure how to use that here.
The cdf method takes in an array of inputs and returns an array of the same shape, being the cumulative probability of each input value.
give a try to this code:
import numpy as np
# the array of inputs where b<-2 and a<0
inputs = np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2 and x<0])
# Pass the inputs...
conditional_prob = cop.cdf(inputs)
another possible approach (a bit more formal, but longer)
# inputs
pdf = cop.pdf(inputs)
# pass the inputs where b < -2 to the copula's pdf method to calculate the probability density function of B
pdf_b = cop.pdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A and B)
p_a_and_b = pdf * pdf_b
# Calculate P(B)
p_b = cop.cdf(np.array([[x, y] for x, y in zip(df['a'], df['b']) if y<-2]))
# Calculate P(A|B)
conditional_prob = p_a_and_b / p_b
let us know if it works for you. cheers.

What is the inverse operation of np.log() and np.diff()?

I have used the statement dataTrain = np.log(mdataTrain).diff() in my program. I want to reverse the effects of the statement. How can it be done in Python?
The reverse will involve taking the cumulative sum and then the exponential. Since pd.Series.diff loses information, namely the first value in a series, you will need to store and reuse this data:
np.random.seed(0)
s = pd.Series(np.random.random(10))
print(s.values)
# [ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 0.64589411
# 0.43758721 0.891773 0.96366276 0.38344152]
t = np.log(s).diff()
t.iat[0] = np.log(s.iat[0])
res = np.exp(t.cumsum())
print(res.values)
# [ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 0.64589411
# 0.43758721 0.891773 0.96366276 0.38344152]
Pandas .diff() and .cumsum() are easy ways to perform finite difference calcs. And as a matter of fact, .diff() is default to .diff(1) - first element of pandas series or dataframe will be a nan; whereas .diff(-1) will loose the last element as nan.
x = pd.Series(np.linspace(.1,2,100)) # uniformly spaced x = mdataTrain
y = np.log(x) # the logarithm function
dx = x.diff() # x finite differences - this vector is a constant
dy = y.diff() # y finite differences
dy_dx_apprx = dy/dx # approximate derivative of logarithm function
dy_dx = 1/x # exact derivative of logarithm function
cs_dy = dy.cumsum() + y[0] # approximate "integral" of approximate "derivative" of y... adding the constant, and reconstructing y
x_invrtd = np.exp(cs_dy) # inverting the log function with exp...
rx = x - x_invrtd # residual values due to computation processess...
abs(rx).sum()
(x-x').sum = 2.0e-14 , two orders above python float EPS [1e-16], will be the sum of residues of the inversion process described below:
x -> exp(cumsum(diff(log(x))) -> x'
The finite difference of log(x) can also be compared with it's exact derivative, 1/x.
There will be a significant error or residue once the discretization of x is a gross one, only 100 points between .1 and 2.

Covariance matrix from np.polyfit() has negative diagonal?

Problem: the cov=True option of np.polyfit() produces a diagonal with non-sensical negative values.
UPDATE: after playing with this some more, I am really starting to suspect a bug in numpy? Is that possible? Deleting any pair of 13 values from the dataset will fix the problem.
I am using np.polyfit() to calculate the slope and intercept coefficients of a dataset. A plot of the values produces a very linear (but not perfectly) linear graph. I am attempting to get the standard deviation on these coefficients with np.sqrt(np.diag(cov)); however, this throws an error because the diagonal contains negative values.
It should be mathematically impossible to produce a covariate matrix with a negative diagonal--what is numpy doing wrong?
Here is a snippet that reproduces the problem:
import numpy as np
x = [1476728821.797, 1476728821.904, 1476728821.911, 1476728821.920, 1476728822.031, 1476728822.039,
1476728822.047, 1476728822.153, 1476728822.162, 1476728822.171, 1476728822.280, 1476728822.289,
1476728822.297, 1476728822.407, 1476728822.416, 1476728822.423, 1476728822.530, 1476728822.539,
1476728822.547, 1476728822.657, 1476728822.666, 1476728822.674, 1476728822.759, 1476728822.788,
1476728822.797, 1476728822.805, 1476728822.915, 1476728822.923, 1476728822.931, 1476728823.038,
1476728823.047, 1476728823.054, 1476728823.165, 1476728823.175, 1476728823.182, 1476728823.292,
1476728823.300, 1476728823.308, 1476728823.415, 1476728823.424, 1476728823.432, 1476728823.551,
1476728823.559, 1476728823.567, 1476728823.678, 1476728823.689, 1476728823.697, 1476728823.808,
1476728823.828, 1476728823.837, 1476728823.947, 1476728823.956, 1476728823.964, 1476728824.074,
1476728824.083, 1476728824.091, 1476728824.201, 1476728824.209, 1476728824.217, 1476728824.324,
1476728824.333, 1476728824.341, 1476728824.451, 1476728824.460, 1476728824.468, 1476728824.579,
1476728824.590, 1476728824.598, 1476728824.721, 1476728824.730, 1476728824.788]
y = [6309927, 6310105, 6310116, 6310125, 6310299, 6310317, 6310326, 6310501, 6310513, 6310523, 6310688,
6310703, 6310712, 6310875, 6310891, 6310900, 6311058, 6311069, 6311079, 6311243, 6311261, 6311272,
6311414, 6311463, 6311479, 6311490, 6311665, 6311683, 6311692, 6311857, 6311867, 6311877, 6312037,
6312054, 6312065, 6312230, 6312248, 6312257, 6312430, 6312442, 6312455, 6312646, 6312665, 6312675,
6312860, 6312879, 6312894, 6313071, 6313103, 6313117, 6313287, 6313304, 6313315, 6313489, 6313505,
6313518, 6313675, 6313692, 6313701, 6313875, 6313888, 6313898, 6314076, 6314093, 6314104, 6314285,
6314306, 6314321, 6314526, 6314541, 6314638]
z, cov = np.polyfit(np.asarray(x), np.asarray(y), 1, cov=True)
std = np.sqrt(np.diag(cov))
print z
print cov
print std
It looks like it's related to your x values: they have a total range of about 3, with an offset of about 1.5 billion.
In your code
np.asarray(x)
converts the x values in a ndarray of float64. While this is fine to correctly represent the x values themselves, it might not be enough to carry on the required computations to get the covariance matrix.
np.asarray(x, dtype=np.float128)
would solve the problem, but polyfit can't work with float128 :(
TypeError: array type float128 is unsupported in linalg
As a workaround, you can subtract the offset from x and then using polyfit. This produces a covariance matrix with positive diagonal:
x1 = x - np.mean(x)
z1, cov1 = np.polyfit(np.asarray(x1), np.asarray(y), 1, cov=True)
std1 = np.sqrt(np.diag(cov1))
print z1 # prints: array([ 1.56607841e+03, 6.31224162e+06])
print cov1 # prints: array([[ 4.56066546e+00, -2.90980285e-07],
# [ -2.90980285e-07, 3.36480951e+00]])
print std1 # prints: array([ 2.13557146, 1.83434171])
You'll have to rescale the results accordingly.

Checking for Multicollinearity in Python [duplicate]

Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high-multi-collinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
You can detect high-multi-collinearity by inspecting the eigen values of correlation matrix. A very low eigen value shows that the data are collinear, and the corresponding eigen vector shows which variables are collinear.
If there is no collinearity in the data, you would expect that none of the eigen values are close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigen values (here the very first one), is close to zero. The corresponding eigen vector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring almost zero coefficients, above basically says x[0], x[2] and x[4] are colinear (as expected). If one standardizes xs values and multiplies by this eigen vector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigen value which was close to zero.
So, in order to capture high multi-linearity, look at the eigen values of correlation matrix.
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity, and options include the determinant and condition number of the correlation matrix.
According to one of the R answers, determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print np.linalg.det(corr) . # 0.988532159861
print np.linalg.det(multicollinear_corr) . # 2.97779797328e-16
And similarly, the condition number of the covariance matrix will approach infinity with perfect linear dependence.
print np.linalg.cond(corr) . # 1.23116253259
print np.linalg.cond(multicollinear_corr) . # 6.19985218873e+15

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the max cross correlation of the 2 series in python. In matlab, the xcorr() function will return it OK
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from python are different from what comes out of matlab. Python gives me integers values > 1, whereas matlab gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first (value-mean/SD), but the cross correlation values I get are in the thousands which doesnt seem correct.
Matlab will also give you a lag value at which the cross correlation is the greatest. I assume it is easy to do this using indices but whats the most appropriate way of doing this if my arrays contain 10's of thousands of values?
I would like to mimic the xcorr() function that matlab has, any thoughts on how I would do that in python?
numpy.correlate(arr1,arr2,"full")
gave me same output as
xcorr(arr1,arr2)
gives in matlab
Implementation of MATLAB xcorr(x,y) and comparision of result with example.
import scipy.signal as signal
def xcorr(x,y):
"""
Perform Cross-Correlation on x and y
x : 1st signal
y : 2nd signal
returns
lags : lags of correlation
corr : coefficients of correlation
"""
corr = signal.correlate(x, y, mode="full")
lags = signal.correlation_lags(len(x), len(y), mode="full")
return lags, corr
n = np.array([i for i in range(0,15)])
x = 0.84**n
y = np.roll(x,5);
lags,c = xcorr(x,y);
plt.figure()
plt.stem(lags,c)
plt.show()
This code will help in finding the delay between two channels in audio file
xin, fs = sf.read('recording1.wav')
frame_len = int(fs*5*1e-3)
dim_x =xin.shape
M = dim_x[0] # No. of rows
N= dim_x[1] # No. of col
sample_lim = frame_len*100
tau = [0]
M_lim = 20000 # for testing as processing takes time
for i in range(1,N):
c = np.correlate(xin[0:M_lim,0],xin[0:M_lim,i],"full")
maxlags = M_lim-1
c = c[M_lim -1 -maxlags: M_lim + maxlags]
Rmax_pos = np.argmax(c)
pos = Rmax_pos-M_lim+1
tau.append(pos)
print(tau)

Categories

Resources