What is the inverse operation of np.log() and np.diff()? - python

I have used the statement dataTrain = np.log(mdataTrain).diff() in my program. I want to reverse the effects of the statement. How can it be done in Python?

The reverse will involve taking the cumulative sum and then the exponential. Since pd.Series.diff loses information, namely the first value in a series, you will need to store and reuse this data:
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.random(10))
print(s.values)
# [ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 0.64589411
# 0.43758721 0.891773 0.96366276 0.38344152]
t = np.log(s).diff()
t.iat[0] = np.log(s.iat[0])
res = np.exp(t.cumsum())
print(res.values)
# [ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 0.64589411
# 0.43758721 0.891773 0.96366276 0.38344152]
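If the round trip is needed in more than one place, the same idea can be wrapped in a small pair of helpers. A minimal sketch (the function names are my own, not from the original post):
import numpy as np
import pandas as pd

def log_diff(s):
    """Apply np.log followed by .diff(), returning the transformed series
    together with the first log value needed to undo the transform."""
    logged = np.log(s)
    return logged.diff(), logged.iloc[0]

def invert_log_diff(t, first_log_value):
    """Undo log_diff: restore the first value, cumulatively sum, exponentiate."""
    t = t.copy()
    t.iloc[0] = first_log_value
    return np.exp(t.cumsum())

s = pd.Series([0.5, 0.7, 0.6, 0.55])
t, first = log_diff(s)
print(np.allclose(invert_log_diff(t, first), s))  # True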

Pandas .diff() and .cumsum() are easy ways to perform finite-difference calculations. Note that .diff() defaults to .diff(1), so the first element of the resulting Series or DataFrame is NaN; .diff(-1), by contrast, leaves the last element as NaN.
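For reference, a quick check of where the NaN ends up (a minimal sketch, assuming pandas is imported as usual):
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])
print(s.diff())    # .diff(1): NaN in the first position
print(s.diff(-1))  # .diff(-1): NaN in the last position
The example below uses the same building blocks to differentiate log(x) and then reconstruct it.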
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(.1, 2, 100))  # uniformly spaced x = mdataTrain
y = np.log(x)                    # the logarithm of x
dx = x.diff()                    # finite differences of x (constant, since x is uniformly spaced)
dy = y.diff()                    # finite differences of y
dy_dx_apprx = dy / dx            # approximate derivative of the logarithm
dy_dx = 1 / x                    # exact derivative of the logarithm
cs_dy = dy.cumsum() + y.iloc[0]  # approximate "integral" of the approximate "derivative", plus the constant, reconstructing y
x_invrtd = np.exp(cs_dy)         # inverting the log function with exp
rx = x - x_invrtd                # residuals due to the numerical process
abs(rx).sum()
abs(rx).sum() comes out around 2.0e-14, roughly two orders of magnitude above the double-precision float epsilon (~2.2e-16). It is the accumulated residual of the inversion process described below:
x -> exp(cumsum(diff(log(x)))) -> x'
The finite difference of log(x) can also be compared with its exact derivative, 1/x.
There the error is significant, because the discretization of x is coarse: only 100 points between .1 and 2.
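As a rough check of that last point, refining the grid shrinks the gap between the finite-difference estimate and 1/x (a sketch along the same lines as the code above):
import numpy as np
import pandas as pd

for n_points in (100, 10000):
    x = pd.Series(np.linspace(.1, 2, n_points))
    y = np.log(x)
    dy_dx_apprx = y.diff() / x.diff()   # backward finite difference
    dy_dx = 1 / x                       # exact derivative
    # the worst-case error shrinks roughly in proportion to the grid spacing
    print(n_points, (dy_dx_apprx - dy_dx).abs().max())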

Related

Calculating Covariance of datasets

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
I have the above dataset where X corresponds to the rows and Y corresponds to the columns. I was wondering how I can find the covariance of X and Y. Is it as simple as running np.cov()?
It is as simple as doing np.cov(matrix).
# P as defined in the question
covariance_matrix = np.cov(P)
print(covariance_matrix)
array([[ 2.24741487e-04, 6.99919604e-05, 2.57114780e-05,
-2.82152656e-05, 1.06129995e-04],
[ 6.99919604e-05, 2.26110038e-04, 9.53538651e-07,
8.16500154e-05, -2.01348493e-05],
[ 2.57114780e-05, 9.53538651e-07, 7.92448292e-05,
1.35747682e-05, -8.11832888e-05],
[-2.82152656e-05, 8.16500154e-05, 1.35747682e-05,
2.03852891e-04, -1.26682381e-04],
[ 1.06129995e-04, -2.01348493e-05, -8.11832888e-05,
-1.26682381e-04, 2.37225703e-04]])
Unfortunately, it is not as simple as running np.cov(); at least in your case.
For the given problem, the table P has only non-negative entries and sums to 1.0. Moreover, since the table is called P and you invoke the random variables X and Y, I'm fairly certain that you are presenting the joint probability table of a discrete, bivariate distribution of a random vector (X, Y). In that case np.cov(P) is not what you want: np.cov computes the empirical covariance matrix of a table of data points (by default each row represents a variable and each column a single observation).
However, you provided the probabilities rather than actual data. This source provides an example of a bivariate probability table where the values of X and Y are actually provided, enabling the computation of Cov(X,Y). Additionally, this reference elaborates on such tables of smaller size.
Since no values are provided, I assume that X takes the values 0,...,4 and Y the values 0,...,8. With $\mu_X$ and $\mu_Y$ the expectations of X and Y, and $f(x,y)$ the entries of your table P, the covariance is defined as
$$\mathrm{Cov}(X,Y) = \sum_{x}\sum_{y} (x - \mu_X)(y - \mu_Y)\, f(x,y),$$
which can be computed with an explicit double loop
import numpy as np

# values the random variables can take
X = np.array([0, 1, 2, 3, 4])
Y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

# expectations from the marginal distributions
mu_X = np.dot(X, np.sum(P, 1))   # row marginal of P
mu_Y = np.dot(Y, np.sum(P, 0))   # column marginal of P

# covariance by an explicit double loop
Cov = 0.0
for i in range(P.shape[0]):
    for j in range(P.shape[1]):
        Cov += (X[i] - mu_X) * (Y[j] - mu_Y) * P[i, j]
or, directly via NumPy as
# covariance by vectorized matrix operations
mu_X = np.dot(X, np.sum(P, 1))
mu_Y = np.dot(Y, np.sum(P, 0))
Cov = np.sum(np.multiply(np.outer(X - mu_X, Y - mu_Y), P))
Naturally, both results coincide (up to a floating-point error).
If you replace X and Y with the actual values the random variables can take, you can simply rerun the code to compute the new covariance value Cov.
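As a quick sanity check (a sketch that assumes P, X, Y, mu_X and mu_Y are already defined as above), you can confirm that the table is a valid joint pmf and that the loop and the vectorized computation agree:
cov_loop = sum((X[i] - mu_X) * (Y[j] - mu_Y) * P[i, j]
               for i in range(P.shape[0])
               for j in range(P.shape[1]))
cov_vec = np.sum(np.outer(X - mu_X, Y - mu_Y) * P)

print(np.isclose(P.sum(), 1.0))       # True: the entries sum to 1
print(np.isclose(cov_loop, cov_vec))  # True: both computations coincide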

Sum of Gaussian random variables using python

Given two independent Gaussian random variables X and Y, with probability density functions pdf1 and pdf2, I want to calculate Z = X + Y ~ PDF(Z).
The probability density function of Z is given by the convolution of pdf1 and pdf2.
I have taken the code base (see scipy - Python: How to get the convolution of two continuous distributions? - Stack Overflow) and adapted it.
First, I tested the solution with mean=0 and sigma²=1 for both pdf1 and pdf2. I got the correct solution.
E(Z)=E(X)+E(Y)=0 and Var(Z)=Var(X)+Var(Y)=2
Second, I tested the solution with mean=2 and sigma²=8 for both pdf1 and pdf2. Here I got an approximate solution with large errors: the result was E(Z)=3.21 and Var(Z)=12.21, but I expected E(Z)=E(X)+E(Y)=4.0 and Var(Z)=Var(X)+Var(Y)=16.0.
The critical part of the code is the convolution of pmf1 and pmf2. The sum of the convolved pmf should be 1.0 and not 0.93.
Hint: I used a reference implementation based on the "openturns" library to verify my results.
# given two independent Gaussian variables X, Y, calculate Z = X + Y ~ PDF(Z)
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.stats import norm

delta = 1e-4
big_grid = np.arange(-10, 10, delta)
mean = 2           # E(X) = E(Y) = 2
std = np.sqrt(8)   # Var(X) = Var(Y) = 8
X = norm(loc=mean, scale=std)
Y = norm(loc=mean, scale=std)
pmf1 = X.pdf(big_grid) * delta
print("Sum of gaussian pmf1: " + str(sum(pmf1)))
pmf2 = Y.pdf(big_grid) * delta
print("Sum of gaussian pmf2: " + str(sum(pmf2)))
conv_pmf = signal.fftconvolve(pmf1,pmf2,'same') #convolution of pmf1 and pmf2
print("Sum of convoluted pmf: "+str(sum(conv_pmf)))
pdf1 = pmf1/delta
pdf2 = pmf2/delta
conv_pdf = conv_pmf/delta
print("Integration of convoluted pdf: " + str(np.trapz(conv_pdf, big_grid)))
plt.plot(big_grid, pdf1, label='Gaussian PDF1')
plt.plot(big_grid, pdf2, label='Gaussian PDF2')
plt.plot(big_grid, conv_pdf, label='Sum')
plt.legend(loc='best'), plt.suptitle('PDFs')
plt.show()
Mean and variance of convoluted PDF
#E(Z)=E(X)+E(Y); Var(Z)=Var(X)+Var(Y); if E(X)=E(Y)=2 and Var(X)=Var(Y)=8 it follows E(Z)=4 and Var(Z)=16
E_Z = (big_grid * conv_pmf).sum(); E_Z #E(Z) = Σ z . P(z): sum(z[j] * p(z[j])) expected: E(Z)=4
E_Z_squared = (big_grid**2 * conv_pmf).sum(); E_Z_squared #E(Z²) = Σ z² . P(z): sum(z[j]² * p(z[j]))
Var_Z = E_Z_squared - (E_Z)**2; Var_Z #Var(Z) = E(Z²) - E(Z)²; expected: Var(Z)=16
This is the output I get.
Sum of gaussian pmf1: 0.9976499589626819
Sum of gaussian pmf2: 0.9976499589626819
Sum of convoluted pmf: 0.9321607580277965
Integration of convoluted pdf: 0.9321591482687606
E_Z = 3.210819533318452
E_Z_squared = 22.52303025237063
Var_Z = 12.21366817683131
So what is going wrong here? How can I adapt the code to get correct results?
The results you have now are fine. There is no reason to believe the sums you are printing here would be equal to 1. Although it is true that the integral of the PDF over its entire support (from negative to positive infinity) is 1, this does not have to hold for the discretised version, because it is only an approximation.
Remember also that your grid is arange(-10, 10, delta), and that a significant proportion of the total probability of norm(4, 4) lies outside of that range.
Luckily, you know the PDF for the sum of normal variables, so you can check your results yourself using the CDF of the real distribution.
from scipy import stats

def realcdf(x):
    return stats.norm(loc=4, scale=4).cdf(x)

print("Supposed to be: " + str(realcdf(max(big_grid)) - realcdf(min(big_grid))))
With output:
Supposed to be: 0.9329569316499936
Which is not 1. In fact the fftconvolve approximation is quite close. Errors arising from floating point arithmetic and the discretisation onto the grid likely account for the relatively small difference between the two.
As for the statistics at the end, enlarging the size of the grid should help. For example, on the grid:
big_grid = np.arange(-20,20,delta)
Produces statistics closer to the truth:
E_Z = 3.9994379102826576
Var_Z = 15.991432657282482
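For completeness, a self-contained sketch of that adjustment (the renormalization line is my own addition and is optional; it only removes the small truncation loss):
import numpy as np
from scipy import signal
from scipy.stats import norm

delta = 1e-4
big_grid = np.arange(-20, 20, delta)   # wider grid: covers nearly all of norm(4, 4)

pmf1 = norm(loc=2, scale=np.sqrt(8)).pdf(big_grid) * delta
pmf2 = norm(loc=2, scale=np.sqrt(8)).pdf(big_grid) * delta
conv_pmf = signal.fftconvolve(pmf1, pmf2, 'same')
conv_pmf = conv_pmf / conv_pmf.sum()   # optional renormalization

E_Z = (big_grid * conv_pmf).sum()
Var_Z = (big_grid**2 * conv_pmf).sum() - E_Z**2
print(E_Z, Var_Z)                      # close to the exact 4 and 16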

Numpy symmetric matrix becomes asymmetric when I applied min-max scaling

I have a symmetric matrix (1877 x 1877), here is the matrix file. I am trying to scale the values to the range 0-1, but after I apply this method the matrix is no longer symmetric. Any help is appreciated.
from sklearn import preprocessing

print((dist.transpose() == dist).all())  # this prints 'True'

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler()
    return minmax_scale.fit_transform(X)

sci_dist_scaled = sci_minmax(dist)
(sci_dist_scaled.transpose() == sci_dist_scaled).all()  # this prints 'False'
sci_dist_scaled.dtype, dist.dtype  # (dtype('float64'), dtype('float64'))
Looking at this description, MinMaxScaler works column by column, so, naturally, you can't expect it to preserve symmetry.
What's best to do in your case depends a bit on what you are trying to achieve, really. If having the values between 0 and 1 is all you require you can rescale by hand:
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
but depending on your ultimate problem this may be too simplistic...
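A quick way to confirm that the manual rescaling keeps the matrix symmetric (a sketch with a small random symmetric matrix standing in for dist):
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 5))
dist = (a + a.T) / 2                  # small symmetric stand-in for the real dist

mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)

print(np.allclose(dist01, dist01.T))  # True: a global rescaling preserves symmetry
print(dist01.min(), dist01.max())     # 0.0 1.0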

Checking for Multicollinearity in Python [duplicate]

Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high-multicollinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are involved.
If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the near-zero coefficients, this basically says that x[0], x[2] and x[4] are collinear (as expected). If one standardizes the values of xs and multiplies by this eigenvector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is essentially the eigenvalue that was close to zero.
So, to capture high multicollinearity, look at the eigenvalues of the correlation matrix.
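If you want that check as a reusable function, here is a minimal sketch (the function name and the tolerance are my own choices; the tolerance would need tuning for real data):
import numpy as np

def collinear_directions(xs, tol=0.05):
    """Return (eigenvalue, eigenvector) pairs of the correlation matrix of xs
    (columns = variables) whose eigenvalue falls below tol; each eigenvector
    points at the columns involved in a near-linear dependence."""
    corr = np.corrcoef(xs, rowvar=False)
    w, v = np.linalg.eigh(corr)                 # eigh: corr is symmetric
    return [(w[i], v[:, i]) for i in np.where(w < tol)[0]]

# Example: column 4 is (almost) a linear combination of columns 0 and 2.
xs = np.random.randn(100, 5)
xs[:, 4] = 2 * xs[:, 0] + 3 * xs[:, 2] + 0.5 * np.random.randn(100)
for eigval, eigvec in collinear_directions(xs):
    print(eigval, np.round(eigvec, 2))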
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity, and options include the determinant and condition number of the correlation matrix.
According to one of the R answers, determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                 # 0.988532159861
print(np.linalg.det(multicollinear_corr))  # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                 # 1.23116253259
print(np.linalg.cond(multicollinear_corr))  # 6.19985218873e+15

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the two series in Python. In MATLAB, the xcorr() function returns it directly.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but they differ from what comes out of MATLAB. Python gives me integer values greater than 1, whereas MATLAB gives correlation values between 0 and 1.
I have tried normalizing the two arrays first ((value - mean)/SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you the lag at which the cross-correlation is greatest. I assume it is easy to find this using indices, but what is the most appropriate way of doing it if my arrays contain tens of thousands of values?
I would like to mimic MATLAB's xcorr() function; any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
in MATLAB.
An implementation of MATLAB's xcorr(x,y), with a comparison of the result against an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x, y):
    """
    Perform cross-correlation on x and y.
    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr

n = np.arange(0, 15)
x = 0.84**n
y = np.roll(x, 5)
lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags, c)
plt.show()
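The question also asks for MATLAB-style coefficient values and the lag of the peak. MATLAB's xcorr(x, y, 'coeff') divides by the square root of the product of the zero-lag autocorrelations, i.e. norm(x) * norm(y); a sketch of that normalization on top of the helper above:
import numpy as np
from scipy import signal

def xcorr_coeff(x, y):
    """Cross-correlation scaled like MATLAB's xcorr(x, y, 'coeff'):
    values are bounded by 1 in magnitude."""
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr / (np.linalg.norm(x) * np.linalg.norm(y))

data1 = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=float)
data2 = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)
lags, corr = xcorr_coeff(data1, data2)
print(lags[np.argmax(corr)], corr.max())  # lag of the strongest correlation and its value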
This code will help in finding the delay between two channels of an audio file:
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')
frame_len = int(fs * 5 * 1e-3)
dim_x = xin.shape
M = dim_x[0]  # no. of rows (samples)
N = dim_x[1]  # no. of columns (channels)
sample_lim = frame_len * 100
tau = [0]
M_lim = 20000  # limit for testing, as processing takes time
for i in range(1, N):
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1
    tau.append(pos)
print(tau)
