scipy.pdist() returns NaN values - python

I'm trying to cluster time series. The intra-cluster elements have the same shape but different scales, so I would like to use a correlation measure as the metric for clustering. I'm trying correlation distance or Pearson coefficient distance (any suggestion or alternative is welcome).
However, the following code fails when I run Z = linkage(dist), because dist contains some NaN values. There are no NaN values in time_series, which is confirmed by
np.any(np.isnan(time_series))
which returns False
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage
dist = pdist(time_series, metric='correlation')
Z = linkage(dist)
fig = plt.figure()
dn = dendrogram(Z)
plt.show()
As an alternative, I tried Pearson distance:
from scipy.stats import pearsonr
def pearson_distance(a, b):
    return 1 - pearsonr(a, b)[0]
dist = pdist(time_series, pearson_distance)
but this generates some runtime warnings and takes a lot of time.

scipy.spatial.distance.pdist(time_series, metric='correlation')
If you take a look at the manual, the correlation distance centers each series and divides by the product of their norms (i.e. by their standard deviations). So if one of your time series is constant, that norm is zero, and dividing zero by zero gives NaN.
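If that is what is happening, you can confirm it by looking for rows with (near) zero variance before calling pdist. A minimal sketch, assuming time_series is a 2-D array with one series per row:
import numpy as np
from scipy.spatial.distance import pdist
# rows whose standard deviation is (near) zero lead to 0/0 = NaN in the correlation distance
constant_rows = np.where(np.std(time_series, axis=1) < 1e-12)[0]
print("constant series at rows:", constant_rows)
# dropping them (or adding a tiny amount of noise) removes the NaNs
dist = pdist(np.delete(time_series, constant_rows, axis=0), metric='correlation')
print("any NaN left:", np.isnan(dist).any())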


How to find the lag between two time series using cross-correlation

Say the two series are:
x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]
Series x clearly lags y by 12 time periods.
However, using the following code as suggested in Python cross correlation:
import numpy as np
c = np.correlate(x, y, "full")
lag = np.argmax(c) - c.size/2
leads to an incorrect lag of -0.5.
What's wrong here?
If you want to do it the easy way, you should simply use scipy.signal.correlation_lags.
Also, remember to subtract the mean from the inputs.
import numpy as np
from scipy import signal
x = np.array([4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4])
y = np.array([4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4])
correlation = signal.correlate(x - np.mean(x), y - np.mean(y), mode="full")
lags = signal.correlation_lags(len(x), len(y), mode="full")
lag = lags[np.argmax(abs(correlation))]
This gives lag = -12, which is the difference between the index of the first 6 in x and the index of the first 6 in y; if you swap the inputs it gives +12.
Edit
Why subtract the mean?
If the signals have a non-zero mean, the terms at the center of the correlation become larger, because there you have a larger number of overlapping samples contributing to the sum. Furthermore, for very long signals, subtracting the mean makes the calculation numerically more accurate.
Here is an illustration of what would happen if the mean were not subtracted for this example.
import matplotlib.pyplot as plt
plt.plot(abs(correlation))
plt.plot(abs(signal.correlate(x, y, mode="full")))
plt.plot(abs(signal.correlate(np.ones_like(x)*np.mean(x), np.ones_like(y)*np.mean(y), mode="full")))
plt.legend(['subtracting mean', 'keeping the mean', 'constant signal'])
plt.show()
Notice that the maximum on the blue curve (at 10) does not coincide with the maximum of the orange curve.
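As a quick sanity check for this example (a sketch that relies on both tails of the series being constant, so the circular shift from np.roll does not matter), shifting y by the recovered lag should line it up with x:
# lag = -12, so x[n] == y[n - lag]; rolling y left by 12 reproduces x
print(np.allclose(np.roll(y, lag), x))  # True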

Pinescript correlation(source_a, source_b, length) -> to python

I need help translating the Pine correlation function to Python. I've already translated the stdev and swma functions, but this one is a bit confusing for me.
I've also found this explanation but didn't quite understand how to implement it:
In Python, try using pandas with .rolling(window).corr, where window is the correlation coefficient period; pandas allows you to compute any rolling statistic with rolling(). The correlation coefficient in Pine is calculated as sma(x*y, length) - sma(x, length)*sma(y, length) divided by stdev(x, length)*stdev(y, length), where stdev is based on the naïve algorithm.
Pine documentation for this function:
> Correlation coefficient. Describes the degree to which two series tend
> to deviate from their sma values.
> correlation(source_a, source_b, length) → series[float]
> RETURNS: Correlation coefficient.
> ARGUMENTS:
> source_a (series) Source series.
> source_b (series) Target series.
> length (integer) Length (number of bars back).
Using pandas is indeed the best option; TA-Lib also has a CORREL function. To give you a better idea of how the correlation function in Pine is implemented, here is Python code using numpy; note that this is not an efficient solution.
import numpy as np
from matplotlib import pyplot as plt

def sma(src, m):
    # simple moving average via convolution with a flat kernel of length m
    coef = np.ones(m) / m
    return np.convolve(src, coef, mode="valid")

def stdev(src, m):
    # naive rolling standard deviation: sqrt(E[x^2] - E[x]^2)
    a = sma(src * src, m)
    b = np.power(sma(src, m), 2)
    return np.sqrt(a - b)

def correlation(x, y, m):
    # rolling Pearson correlation over m bars
    cov = sma(x * y, m) - sma(x, m) * sma(y, m)
    den = stdev(x, m) * stdev(y, m)
    return cov / den

ts = np.random.normal(size=500).cumsum()
n = np.linspace(0, 1, len(ts))
cor = correlation(ts, n, 14)
plt.subplot(2, 1, 1)
plt.plot(ts)
plt.subplot(2, 1, 2)
plt.plot(cor)
plt.show()
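For comparison, here is a minimal pandas sketch of the same rolling correlation on the same kind of data; the column names are just made up for the example:
import numpy as np
import pandas as pd
ts = np.random.normal(size=500).cumsum()
n = np.linspace(0, 1, len(ts))
df = pd.DataFrame({"ts": ts, "n": n})
# rolling Pearson correlation over a 14-bar window, like correlation(source_a, source_b, 14) in Pine
rolling_cor = df["ts"].rolling(14).corr(df["n"])
The first 13 values are NaN because the window is not yet full; apart from that, it should match the numpy result above up to floating-point differences.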

Python - Generating numbers according to a correlation matrix

Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
import numpy as np
from scipy.linalg import cholesky
# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)
# What should be here? The mu and sigma of one row of the table?
rnd = np.random.normal(2.57, 0.78, size=(10, 7))
# Finally, compute the inner product of rnd and upper_chol
ans = rnd @ upper_chol
My question is what the values of mu and sigma should be, and how to resolve the error shown above.
Thanks!
P.S. I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases that replicate the patterns found in these patients.
Thank you for answering my question about what data you have access to. The error you received was generated when you called cholesky. cholesky requires that your matrix be positive definite, and one way to check that is to see whether all of its eigenvalues are strictly greater than zero. One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternative decomposition.
For your question on the generation of multivariate normals: the random normals that you generate should be standard random normals, i.e. with a mean of 0 and a standard deviation of 1. Numpy provides a standard normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
import numpy as np
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
                    'Affecting mobility',
                    'Affecting social life/hobbies',
                    'Affecting work',
                    'Mood',
                    'Pain Score',
                    'Range of motion in Doc']
# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10
# generate using an affine transformation
# c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)
# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)
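Either way, it is worth checking that the synthetic data reproduces the target statistics. A small sketch, reusing mu and cov from above and a larger sample size than 10 so the estimates are stable:
s_check = np.random.multivariate_normal(mu, cov, size=10000)
print(s_check.mean(axis=0) - mu)            # should be close to zero
print(np.cov(s_check, rowvar=False) - cov)  # should be close to zero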

Python generate random right skewed gaussian with constraints

I need to generate a unit curve that is going to look like a right-skewed Gaussian, and I have the following constraints:
The X axis is Days (variable but usually 45+)
All values on the Y axis sum to 1
The peak will always occur around day 4 or 5
Example:
Is there a way to do this programmatically in python?
As noted by @Severin, a gamma distribution looks to be a reasonable fit, e.g.:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sps
x = np.linspace(0, 75)
plt.plot(x, sps.gamma.pdf(x, 4), '.-')
plt.show()
If they really need to sum to 1, rather than integrate to 1, I'd evaluate the cdf and then use np.diff on the result.
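A minimal sketch of that idea, assuming a 45-day axis and a gamma shape of 5 (with scale 1 the mode is shape - 1 = 4, so the peak lands around day 4-5):
import numpy as np
import scipy.stats as sps
days = np.arange(0, 46)                 # day edges 0, 1, ..., 45
y = np.diff(sps.gamma.cdf(days, a=5))   # probability mass falling in each day
y = y / y.sum()                         # renormalise the truncated tail so the values sum to exactly 1
print(y.sum(), np.argmax(y))            # 1.0, with the peak around day 4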

Get statistical difference of correlation coefficient in python

To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned by pearsonr() is only reasonable for datasets larger than 500 or so. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import norm  # scipy.stats.zprob is no longer available; norm.sf(z) is equivalent to zprob(-z)
n = len(x)
z = np.log((1 + cor) / (1 - cor)) * 0.5 * np.sqrt(n - 3)
p = norm.sf(z)
It works. However, I am not sure whether it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
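For reference, here is a self-contained sketch of the Fisher-transform approach described above, using scipy.stats.norm instead of the removed zprob and a two-sided p-value (the two-sided convention is an assumption, and the sample arrays are made up for illustration):
import numpy as np
from scipy.stats import norm, pearsonr

def fisher_p(x, y):
    r, _ = pearsonr(x, y)
    z = np.arctanh(r) * np.sqrt(len(x) - 3)  # arctanh(r) == 0.5 * log((1 + r) / (1 - r))
    return r, 2 * norm.sf(abs(z))            # two-sided p-value from the normal approximation

x, y = [2, 4, 5, 7, 9], [1, 3, 3, 6, 8]      # hypothetical example data
print(fisher_p(x, y), pearsonr(x, y))        # compare with pearsonr's own p-value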
