Covariance matrix from np.polyfit() has negative diagonal? - python

Problem: the cov=True option of np.polyfit() produces a covariance matrix whose diagonal contains nonsensical negative values.
UPDATE: after playing with this some more, I am really starting to suspect a bug in numpy. Is that possible? Deleting any pair of 13 values from the dataset will fix the problem.
I am using np.polyfit() to calculate the slope and intercept coefficients of a dataset. A plot of the values produces a very (though not perfectly) linear graph. I am attempting to get the standard deviation of these coefficients with np.sqrt(np.diag(cov)); however, this throws an error because the diagonal contains negative values.
It should be mathematically impossible to produce a covariance matrix with a negative diagonal, so what is numpy doing wrong?
Here is a snippet that reproduces the problem:
import numpy as np
x = [1476728821.797, 1476728821.904, 1476728821.911, 1476728821.920, 1476728822.031, 1476728822.039,
1476728822.047, 1476728822.153, 1476728822.162, 1476728822.171, 1476728822.280, 1476728822.289,
1476728822.297, 1476728822.407, 1476728822.416, 1476728822.423, 1476728822.530, 1476728822.539,
1476728822.547, 1476728822.657, 1476728822.666, 1476728822.674, 1476728822.759, 1476728822.788,
1476728822.797, 1476728822.805, 1476728822.915, 1476728822.923, 1476728822.931, 1476728823.038,
1476728823.047, 1476728823.054, 1476728823.165, 1476728823.175, 1476728823.182, 1476728823.292,
1476728823.300, 1476728823.308, 1476728823.415, 1476728823.424, 1476728823.432, 1476728823.551,
1476728823.559, 1476728823.567, 1476728823.678, 1476728823.689, 1476728823.697, 1476728823.808,
1476728823.828, 1476728823.837, 1476728823.947, 1476728823.956, 1476728823.964, 1476728824.074,
1476728824.083, 1476728824.091, 1476728824.201, 1476728824.209, 1476728824.217, 1476728824.324,
1476728824.333, 1476728824.341, 1476728824.451, 1476728824.460, 1476728824.468, 1476728824.579,
1476728824.590, 1476728824.598, 1476728824.721, 1476728824.730, 1476728824.788]
y = [6309927, 6310105, 6310116, 6310125, 6310299, 6310317, 6310326, 6310501, 6310513, 6310523, 6310688,
6310703, 6310712, 6310875, 6310891, 6310900, 6311058, 6311069, 6311079, 6311243, 6311261, 6311272,
6311414, 6311463, 6311479, 6311490, 6311665, 6311683, 6311692, 6311857, 6311867, 6311877, 6312037,
6312054, 6312065, 6312230, 6312248, 6312257, 6312430, 6312442, 6312455, 6312646, 6312665, 6312675,
6312860, 6312879, 6312894, 6313071, 6313103, 6313117, 6313287, 6313304, 6313315, 6313489, 6313505,
6313518, 6313675, 6313692, 6313701, 6313875, 6313888, 6313898, 6314076, 6314093, 6314104, 6314285,
6314306, 6314321, 6314526, 6314541, 6314638]
z, cov = np.polyfit(np.asarray(x), np.asarray(y), 1, cov=True)
std = np.sqrt(np.diag(cov))
print(z)
print(cov)
print(std)

It looks like it's related to your x values: they have a total range of about 3, with an offset of about 1.5 billion.
In your code
np.asarray(x)
converts the x values into an ndarray of float64. While this is fine to represent the x values themselves, it might not provide enough precision to carry out the computations required to get the covariance matrix.
np.asarray(x, dtype=np.float128)
would solve the problem, but polyfit can't work with float128 :(
TypeError: array type float128 is unsupported in linalg
As a workaround, you can subtract the offset from x and then use polyfit. This produces a covariance matrix with a positive diagonal:
x1 = np.asarray(x) - np.mean(x)   # remove the large offset before fitting
z1, cov1 = np.polyfit(x1, np.asarray(y), 1, cov=True)
std1 = np.sqrt(np.diag(cov1))
print(z1)    # prints: array([ 1.56607841e+03, 6.31224162e+06])
print(cov1)  # prints: array([[ 4.56066546e+00, -2.90980285e-07],
             #                [ -2.90980285e-07, 3.36480951e+00]])
print(std1)  # prints: array([ 2.13557146, 1.83434171])
You'll have to shift the fitted intercept back accordingly; the slope is unaffected by the offset.
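For completeness, a minimal sketch of undoing the shift, using z1 and x from above: subtracting the mean of x leaves the slope untouched, so only the intercept needs to be mapped back.
slope, intercept_shifted = z1                       # coefficients of the fit on x - mean(x)
intercept = intercept_shifted - slope * np.mean(x)  # intercept on the original x scale
print(slope, intercept)                             # the original line y = slope*x + intercept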

Related

format Error in scipy interpn when passing grid points and values from griddata interpolation of sparse data

Hi and thanks in advance for your time,
I'm trying to use scipy interpolation/extrapolation on data with 3D coordinates + a value (accuracy).
The purpose is to use interpn as the function to later run in a global optimizer, to try to speed up a hyperparameter tuning task.
The strategy is:
1. inputs are a sparse dataset of 4 dims: 3 parameters + accuracy
2. from the min and max of each parameter, create boundaries that define a rectangular grid, and use the mean accuracy as fill_value
3. create a value-filled grid using scipy.interpolate.griddata
4. run scipy.interpolate.interpn, passing the grid points, values and the desired point (interpn can both interpolate and extrapolate)
Here is the relevant documentation:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.griddata.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interpn.html
https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html
Problem: everything works up to point 4, but interpn won't accept the format of the grid points/values I'm using.
#python3
import numpy as np
from scipy.interpolate import griddata,interpn
experimentsString = "3,1,8,0.636;3,5,16,0.741;3,10,32,0.680;20,1,8,0.715;20,5,16,0.719;20,10,32,0.693;40,1,8,0.500;40,5,16,0.504;40,10,32,0.500;3,1,8,0.715;3,1,16,0.746;3,1,32,0.724;3,1,8,0.667;3,1,16,0.662;3,1,32,0.728;3,1,8,0.750;3,1,16,0.711;3,1,32,0.719;3,1,8,0.750;3,1,16,0.750;3,1,32,0.671;3,1,8,0.737;3,1,16,0.680;3,1,32,0.711;3,1,8,0.737;3,1,16,0.724;3,1,32,0.728;3,1,8,0.737;3,1,16,0.728;3,1,32,0.724;3,1,8,0.702;"
experimentsRows = experimentsString.split(";")
print(*experimentsRows, sep= "\n")
sequenceLength=[]
sampleRate=[]
fullyConnected=[]
accuracy=[]
zippedDataPts=[]
for row in experimentsRows:
    if len(row) > 1:
        values = row.split(",")
        sequenceLength.append(int(values[0]))
        sampleRate.append(int(values[1]))
        fullyConnected.append(int(values[2]))
        accuracy.append(float(values[3]))
        point = np.array([int(values[0]), int(values[1]), int(values[2])])
        zippedDataPts.append(point)
zippedDataPtsCopy=zippedDataPts.copy()
zippedDataPts = np.array(zippedDataPtsCopy,dtype=float)
unZippedDataPts=(np.array(sequenceLength),np.array(sampleRate),np.array(fullyConnected))
minSequenceLength=min(sequenceLength)
maxSequenceLength=max(sequenceLength)
print("sequenceLength Bounds: ",minSequenceLength,maxSequenceLength)
minSampleRate=min(sampleRate)
maxSampleRate=max(sampleRate)
print("sampleRate Bounds: ",minSampleRate,maxSampleRate)
minFullyConnected=min(fullyConnected)
maxFullyConnected=max(fullyConnected)
print("fullyConnected Bounds: ",minFullyConnected,maxFullyConnected)
meanAccuracy=np.mean(accuracy)
print("Mean Accuracy: ",meanAccuracy)
accuracyArr=np.array(accuracy,dtype=float)
print("accuracyArr:",np.shape(accuracyArr))
x=np.linspace(minSequenceLength,maxSequenceLength,num=int(maxSequenceLength-minSequenceLength),dtype=int)
print("LINSPACE x")
print(x)
y=np.linspace(minSampleRate,maxSampleRate,num=int(maxSampleRate-minSampleRate),dtype=int)
print("LINSPACE y")
print(y)
z=np.linspace(minFullyConnected,maxFullyConnected,num=int(maxFullyConnected-minFullyConnected),dtype=int)
print("LINSPACE z")
print(z)
X,Y,Z = np.meshgrid(x,y,z)
X=X.astype(float)
Y=Y.astype(float)
Z=Z.astype(float)
print("X",np.shape(X))
print("Y",np.shape(Y))
print("Z",np.shape(Z))
XX, YY, ZZ = np.array(X.ravel()), np.array(Y.ravel()), np.array(Z.ravel())
print("XX",np.shape(XX))
print("YY",np.shape(YY))
print("ZZ",np.shape(ZZ))
dataGridValues1D = griddata(zippedDataPts,accuracyArr,(XX,YY,ZZ),method='linear',fill_value=meanAccuracy)
dataGridValues3D = griddata(zippedDataPts,accuracyArr,(X,Y,Z),method='linear',fill_value=meanAccuracy)
# dataGridValuesArr = np.array(dataGridValues)
print("dataGridValues1D:",np.shape(dataGridValues1D))
print("dataGridValues3D:",np.shape(dataGridValues3D))
xc = x.copy()
yc = y.copy()
zc = z.copy()
xf = xc.astype(float)
yf = yc.astype(float)
zf = zc.astype(float)
testPoint=np.array([16.0,6.0,32.0],dtype=float)
I conducted the following experiments with the interpn function and got these error messages:
guess = interpn((xf,yf,zf),dataGridValues1D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 3 point arrays, but values has 1 dimensions
guess = interpn((xf,yf,zf),dataGridValues3D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 37 points and 9 values in dimension 0
guess = interpn((X,Y,Z),dataGridValues1D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 3 point arrays, but values has 1 dimensions
guess = interpn((X,Y,Z),dataGridValues3D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: The points in dimension 0 must be strictly ascending
guess = interpn((XX,YY,ZZ),dataGridValues1D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 3 point arrays, but values has 1 dimensions
guess = interpn((XX,YY,ZZ),dataGridValues3D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: The points in dimension 0 must be strictly ascending
guess = interpn(zippedGridPoints,dataGridValues1D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 7992 point arrays, but values has 1 dimensions
guess = interpn(zippedGridPoints,dataGridValues3D,testPoint,method='linear',fill_value=None,bounds_error=False)
#ValueError: There are 7992 point arrays, but values has 3 dimensions
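For what it's worth, here is a minimal sketch (not a tested answer, just the usual way the call is wired up) of making the shapes line up: interpn expects a tuple of 1-D, strictly ascending axis arrays together with a values array of shape (len(x), len(y), len(z)). Because np.meshgrid defaults to indexing='xy', which swaps the first two axes, building the grid with indexing='ij' makes the griddata output match the axis tuple. The snippet reuses x, y, z, zippedDataPts, accuracyArr, meanAccuracy and testPoint from above.
Xi, Yi, Zi = np.meshgrid(x, y, z, indexing='ij')   # 'ij' keeps the axis order (x, y, z)
gridValues = griddata(zippedDataPts, accuracyArr, (Xi, Yi, Zi),
                      method='linear', fill_value=meanAccuracy)
# gridValues.shape == (len(x), len(y), len(z)), matching the axis tuple below
guess = interpn((x.astype(float), y.astype(float), z.astype(float)),
                gridValues, testPoint,
                method='linear', bounds_error=False, fill_value=None)
print(guess)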

Calculating Covariance of datasets

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I have the above dataset where X corresponds to the rows and Y corresponds to the columns. I was wondering how I can find the covariance of X and Y. Is it as simple as running np.cov()?
It is as simple as doing np.cov(matrix).
# using the same P as defined in the question above
covariance_matrix = np.cov(P)
print(covariance_matrix)
array([[ 2.24741487e-04, 6.99919604e-05, 2.57114780e-05,
-2.82152656e-05, 1.06129995e-04],
[ 6.99919604e-05, 2.26110038e-04, 9.53538651e-07,
8.16500154e-05, -2.01348493e-05],
[ 2.57114780e-05, 9.53538651e-07, 7.92448292e-05,
1.35747682e-05, -8.11832888e-05],
[-2.82152656e-05, 8.16500154e-05, 1.35747682e-05,
2.03852891e-04, -1.26682381e-04],
[ 1.06129995e-04, -2.01348493e-05, -8.11832888e-05,
-1.26682381e-04, 2.37225703e-04]])
Unfortunately, it is not as simple as running np.cov(); at least in your case.
For the given problem, the table P has only non-negative entries and sums to 1.0. Moreover, since the table is called P and you invoke the random variables X and Y, I'm fairly certain that you are presenting the joint probability table of a discrete, bivariate distribution of a random vector (X, Y). In that case, np.cov(P) is not correct, as it computes the empirical covariance matrix from a table of observed data points (where, with the default rowvar=True, each row holds the observations of one variable), not from a table of probabilities.
However, you provided the probabilities rather than actual data. This source provides an example of a bivariate probability table where the values of X and Y are actually provided, enabling the computation of Cov(X,Y). Additionally, this reference elaborates on such tables of smaller size.
Since no values are provided, I assume that X takes values 0,...,4 and Y takes values 0,...,8. Given $\mu_X$ and $\mu_Y$ as the expectations of X and Y, and f(x,y) as the entries in your table P, the covariance is defined as
$\mathrm{Cov}(X,Y) = \sum_{x}\sum_{y} (x - \mu_X)(y - \mu_Y)\, f(x,y)$,
which can be efficiently computed via
import numpy as np
# values the random variables can take
X = np.array([0, 1, 2, 3, 4])
Y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
# expectations (marginals: rows of P correspond to X, columns to Y)
mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))
# covariance by explicit loop
Cov = 0.0
for i in range(P.shape[0]):
    for j in range(P.shape[1]):
        Cov += (X[i] - mu_X) * (Y[j] - mu_Y) * P[i, j]
or, directly via NumPy as
# Covariance by matrix multiplication
mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))
Cov = np.sum(np.multiply(np.outer(X - mu_X, Y - mu_Y), P))
Naturally, both results coincide (up to a floating-point error).
If you replace X and Y with the actual values the random variables can take, you can simply rerun the code and compute the new covariance value Cov.
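As a quick sanity check (a sketch using the arrays defined above), the same number also follows from the identity Cov(X,Y) = E[XY] - E[X]E[Y]:
E_XY = X @ P @ Y                    # sum_ij X[i] * Y[j] * P[i, j]
Cov_check = E_XY - mu_X * mu_Y
print(np.isclose(Cov, Cov_check))   # True, up to floating-point error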

logarithmic rebinning of 2D array

I have a 1D array containing data that looks like this (48000 points), spaced by one wavenumber (R = 1 cm^-1). The shape of the x and y arrays is (48000, 1), and I want to rebin both in the same way:
xarr=[50000,9999,9998,....,2000]
yarr=[0.1,0.02,0.8,0.5....0.1]
I wish to decrease the resolution, let's say to R = 10 cm^-1, so I want ten times fewer points (4800), from 50000 to 2000, and do the same for the y array.
How to start?
I tried taking the natural log of the wavelength scale, then re-binning this onto a new log-wavelength scale generated using np.linspace():
xi=np.log(xarr[0])
xf=np.log(xarr[-1])
xnew=np.linspace(xi, xf, num=4800)
Now I need to recast the y array onto this xnew grid. I am thinking of using rebin, a 2D rebin, but I'm not sure how to use it. Any suggestions?
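One simple alternative worth noting (a sketch, assuming you just want to average every 10 consecutive samples rather than interpolate onto a log grid; xarr and yarr below are hypothetical stand-ins for your data, and arrays of shape (48000, 1) should be ravel()-ed first):
import numpy as np
xarr = np.linspace(50000, 2000, 48000)     # stand-in wavenumber axis
yarr = np.random.rand(48000)               # stand-in data
factor = 10                                # 48000 -> 4800 points
x_rebinned = xarr.reshape(-1, factor).mean(axis=1)
y_rebinned = yarr.reshape(-1, factor).mean(axis=1)
print(x_rebinned.shape, y_rebinned.shape)  # (4800,) (4800,)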
import numpy as np
arr1=[2,3,65,3,5...,32,2]
series=np.array(arr1)
print(series[:3])
I tried this and it seems to work!
import numpy as np
import scipy.stats as stats
#irregular x and y arrays
yirr= np.random.randint(1,101,10)
xirr=np.arange(10)
nbins=5
bin_means, bin_edges, binnumber = stats.binned_statistic(xirr,yirr, 'mean', bins=nbins)
yreg=bin_means # <== regularized yarr
xi=xirr[0]
xf=xirr[-1]
xreg=np.linspace(xi, xf, num=nbins)
print('yreg',yreg)
print('xreg',xreg) # <== regularized xarr
If anyone can find an improvement or see a problem with this, please post!
I'll try it on my logarithmically scaled data now
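For the logarithmic case, the same recipe can be applied to the log of the wavenumber axis; a sketch with hypothetical stand-in arrays:
import numpy as np
import scipy.stats as stats
xarr = np.linspace(50000, 2000, 48000)   # stand-in descending wavenumber axis
yarr = np.random.rand(48000)             # stand-in data
nbins = 4800
logx = np.log(xarr)
bin_means, bin_edges, _ = stats.binned_statistic(logx, yarr,
                                                 statistic='mean', bins=nbins)
xreg = np.exp(0.5 * (bin_edges[:-1] + bin_edges[1:]))  # bin centres, back on a linear scale
yreg = bin_means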

Checking for Multicollinearity in Python [duplicate]

Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high-multi-collinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are collinear.
If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the almost-zero coefficients, the above basically says x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies them by this eigenvector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigenvalue that was close to zero.
So, in order to capture high multicollinearity, look at the eigenvalues of the correlation matrix.
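Wrapped up as a small helper (a sketch of the check described above; the 0.05 and 0.1 thresholds are arbitrary choices, not established cutoffs):
import numpy as np
def collinearity_report(xs, eig_tol=0.05, load_tol=0.1):
    """Print which columns of xs look collinear, based on near-zero
    eigenvalues of the correlation matrix."""
    corr = np.corrcoef(xs, rowvar=0)
    w, v = np.linalg.eig(corr)
    for eigval, eigvec in zip(w, v.T):     # columns of v are the eigenvectors
        if eigval < eig_tol:
            cols = np.where(np.abs(eigvec) > load_tol)[0]
            print("eigenvalue %.4f: columns %s look collinear" % (eigval, cols))
collinearity_report(xs)                    # using the xs array from the example above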
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity, and options include the determinant and condition number of the correlation matrix.
According to one of the R answers, the determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                 # 0.988532159861
print(np.linalg.det(multicollinear_corr))  # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                 # 1.23116253259
print(np.linalg.cond(multicollinear_corr))  # 6.19985218873e+15

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the max cross-correlation of the 2 series in Python. In MATLAB, the xcorr() function returns it fine.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but they are different from what comes out of MATLAB. Python gives me integer values > 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first ((value - mean)/SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag value at which the cross-correlation is greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic MATLAB's xcorr() function; any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
gives in MATLAB.
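To also get a peak on a correlation-coefficient scale and the lag at which it occurs, here is a rough sketch (MATLAB's xcorr(..., 'coeff') normalization differs slightly for unequal-length inputs, so treat this as approximate):
import numpy as np
data1 = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=float)
data2 = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)
a = (data1 - data1.mean()) / (data1.std() * len(data1))   # zero-mean, roughly unit-normalized
b = (data2 - data2.mean()) / data2.std()
c = np.correlate(a, b, mode='full')
lag = int(np.argmax(c)) - (len(data2) - 1)                # lag of the maximum correlation
print(c.max(), lag)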
Implementation of MATLAB xcorr(x,y) and comparison of the result with an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x, y):
    """
    Perform cross-correlation on x and y
    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr
n = np.array([i for i in range(0,15)])
x = 0.84**n
y = np.roll(x, 5)
lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags,c)
plt.show()
This code will help in finding the delay between two channels in an audio file:
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')
frame_len = int(fs * 5 * 1e-3)
dim_x = xin.shape
M = dim_x[0]   # no. of rows (samples)
N = dim_x[1]   # no. of columns (channels)
sample_lim = frame_len * 100
tau = [0]
M_lim = 20000  # limit for testing, as processing takes time
for i in range(1, N):
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1
    tau.append(pos)
print(tau)
