Pine Script correlation(source_a, source_b, length) -> Python

I need help translating the Pine Script correlation function to Python. I've already translated the stdev and swma functions, but this one is a bit confusing for me.
I also found this explanation but didn't quite understand how to implement it:
In Python, try using pandas with .rolling(window).corr(), where window is the correlation coefficient period; pandas lets you compute any rolling statistic via rolling(). The correlation coefficient in Pine is calculated as (sma(y*x, length) - sma(y, length)*sma(x, length)) / (stdev(y, length)*stdev(x, length)), where stdev is based on the naïve algorithm.
Pine documentation for this function:
> Correlation coefficient. Describes the degree to which two series tend
> to deviate from their sma values.
> correlation(source_a, source_b, length) → series[float]
> RETURNS: Correlation coefficient.
> ARGUMENTS:
> source_a (series) Source series.
> source_b (series) Target series.
> length (integer) Length (number of bars back).

Using pandas is indeed the best option; TA-Lib also has a CORREL function. To give you a better idea of how the correlation function in Pine is implemented, here is Python code using NumPy. Note that this is not an efficient solution.
import numpy as np
from matplotlib import pyplot as plt

def sma(src, m):
    # simple moving average via convolution with a uniform kernel
    coef = np.ones(m) / m
    return np.convolve(src, coef, mode="valid")

def stdev(src, m):
    # naive rolling standard deviation: sqrt(E[x^2] - E[x]^2)
    a = sma(src * src, m)
    b = np.power(sma(src, m), 2)
    return np.sqrt(a - b)

def correlation(x, y, m):
    # rolling Pearson correlation: covariance / (stdev_x * stdev_y)
    cov = sma(x * y, m) - sma(x, m) * sma(y, m)
    den = stdev(x, m) * stdev(y, m)
    return cov / den

ts = np.random.normal(size=500).cumsum()
n = np.linspace(0, 1, len(ts))
cor = correlation(ts, n, 14)

plt.subplot(2, 1, 1)
plt.plot(ts)
plt.subplot(2, 1, 2)
plt.plot(cor)
plt.show()
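If you just need the values rather than a reference implementation, the rolling Pearson correlation is a one-liner in pandas. This is a minimal sketch with hypothetical series a and b and window length (substitute your own data); aside from floating-point details it should agree with the formula above:

import numpy as np
import pandas as pd

# hypothetical example series; replace with your own data
a = pd.Series(np.random.normal(size=500).cumsum())
b = pd.Series(np.linspace(0, 1, 500))

length = 14
cor = a.rolling(length).corr(b)  # rolling Pearson correlation over `length` bars
print(cor.tail())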

Related

Why does the statsmodels acf function give different answers from scipy's pearsonr?

I calculated the ACF at particular lags of a time series "manually" by shifting in pandas and using scipy.stats.pearsonr(), but got an answer visibly different from what was shown in the ACF plot from statsmodels.graphics.tsaplots.plot_acf().
Looking into it more, I calculated the ACF with statsmodels.api.tsa.acf(), and while the value there agrees with the ACF plot (as I'd expect, since that's what plot_acf() is plotting!), it's substantially different from the Pearson correlation. An MWE (data file and Jupyter notebook, plus a .py file with the same code in case you prefer) is available at: https://github.com/MultiverseHG/acf_problem. The code is also shown below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

# read in data
bus = pd.read_csv('cleaned_bus.csv', index_col='Month', parse_dates=True)

# compute first 12 lags of ACF with statsmodels
sm_acf = sm.tsa.acf(bus, nlags=12, fft=False)
print(sm_acf)

# compute ACF by shifting and using scipy.stats.pearsonr
# drop nulls that arise from shifting and corresponding rows of unshifted data
trimmed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).iloc[lag:]
    trimmed = bus.riders.iloc[lag:]
    corr = pearsonr(shifted, trimmed)[0]  # [0] to grab r ([1] is the p-value)
    trimmed_acf.append(corr)
trimmed_acf = np.array(trimmed_acf)
print(trimmed_acf)

# not the same - how different are they?
print(sm_acf - trimmed_acf)

# maybe acf is filling in the missing values with zeroes?
# from looking at the source code, doesn't seem like it, but let's try
zeroed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).fillna(0)
    corr = pearsonr(shifted, bus.riders)[0]
    zeroed_acf.append(corr)
zeroed_acf = np.array(zeroed_acf)
print(zeroed_acf)

# different again!
print(sm_acf - zeroed_acf)

# so why does statsmodels give a different ACF than calculating directly with Pearson's r?
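One plausible source of the discrepancy (a hedged sketch, not a verified answer for this dataset): the standard ACF estimator uses a single full-series mean and a single full-series sum of squares for every lag, whereas pearsonr on the shifted, trimmed pairs re-estimates a separate mean and standard deviation at each lag. A minimal sketch of the textbook estimator, which statsmodels' acf follows up to implementation details:

import numpy as np

def acf_textbook(x, nlags):
    # ACF with one global mean and one global sum of squares,
    # so every lag shares the same centering and normalization
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    denom = np.sum((x - xbar) ** 2)
    return np.array([np.sum((x[k:] - xbar) * (x[:n - k] - xbar)) / denom
                     for k in range(nlags + 1)])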

Apply SciPy newton method to optimize a pandas dataframe Weibull sum

I'm a novice programmer, but I know my way around Excel. However, I'm trying to teach myself Python so that I can work with much larger datasets and, primarily, because I'm finding it really interesting and enjoyable.
I'm trying to figure out how to recreate the Excel Goal Seek function (I believe scipy.optimize.newton should be equivalent) within the script I have written. However, instead of defining a simple function f(x) to find the root of, I'm trying to find the root of the sum of a dataframe column, and I have no idea how to approach this.
My code up to the goal-seek part is as follows:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min
# need to use a gamma function later on, so import math
import math
%matplotlib inline

# create dataframe using lidar experimental data
df = pd.read_csv(r'C:\Users\Latitude\Documents\Coursera\Wind Resource\Proj'
                 r'ect\Wind_Lidar_40and140.txt',
                 sep=' ',
                 header=None,
                 names=['Year', 'Month', 'Day', 'Hour', 'v_40', 'v_140'])

# add in columns for velocity cubed
df['v_40_cubed'] = df['v_40']**3
df['v_140_cubed'] = df['v_140']**3

# calculate mean wind speed, mean cubed wind speed, mean wind speed cubed
# use these to calculate energy pattern factor, c and k
v_40_bar = df['v_40'].mean()
v_40_cubed_bar = df['v_40_cubed'].mean()
v_40_bar_cubed = v_40_bar ** 3

# energy pattern factor = epf
epf = v_40_cubed_bar / v_40_bar_cubed
# shape parameter = k
k_40 = 1 + 3.69/epf**2
# scale factor = c
# use imported math library to use gamma function math.gamma
c_40 = v_40_bar / math.gamma(1 + 1/k_40)

# create new dataframe from current, using bins of 0.25, and generate frequency for these bins
bins_1 = np.linspace(0, 16, 65, endpoint=True)
freq_df = df.apply(pd.Series.value_counts, bins=bins_1)

# tidy up the dataframe by dropping superfluous columns and adding in a % time column for frequency
freq_df_tidy = freq_df.drop(['Year', 'Month', 'Day', 'Hour', 'v_40_cubed', 'v_140_cubed'], axis=1)
freq_df_tidy['v_40_%time'] = freq_df_tidy['v_40'] / freq_df_tidy['v_40'].sum()

# add in usable bin value for potential calculation of weibull
freq_df_tidy['windspeed_bin'] = np.linspace(0, 16, 64, endpoint=False)

# calculate weibull column and wind power density from the weibull fit
freq_df_tidy['Weibull_40'] = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c_40) / 4
freq_df_tidy['Wind_Power_Density_40'] = 0.5 * 1.225 * freq_df_tidy['Weibull_40'] * freq_df_tidy['windspeed_bin']**3

# calculate wind power density from experimental data
df['Wind_Power_Density_40'] = 0.5 * 1.225 * df['v_40']**3
At this stage, the result from the Weibull fit, round(freq_df_tidy['Wind_Power_Density_40'].sum(), 2), gives 98.12.
The result from the experimental data, round(df['Wind_Power_Density_40'].mean(), 2), gives 101.14.
My aim now is to optimise the parameter c_40, which feeds into the Weibull-based wind power density (98.12), so that round(freq_df_tidy['Wind_Power_Density_40'].sum(), 2) comes out close to the experimental wind power density (101.14).
Any help on this would be hugely appreciated. Apologies if I've included too much code - I wanted to provide as much detail as possible. From my research, I think the SciPy newton method should do the trick, but I can't figure out how to apply it here.
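One way to mimic Goal Seek here (a sketch, not a tested solution; it assumes the freq_df_tidy, k_40 and c_40 objects from the code above and the experimental target of 101.14 quoted in the post) is to wrap the whole Weibull power-density calculation in a function of the scale factor c and hand its residual to scipy.optimize.newton:

from scipy.optimize import newton
from scipy.stats import weibull_min

target = 101.14  # experimental wind power density quoted in the post

def wpd_residual(c):
    # rebuild the Weibull-based wind power density for a trial scale factor c
    weibull = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c) / 4
    wpd = 0.5 * 1.225 * weibull * freq_df_tidy['windspeed_bin']**3
    return wpd.sum() - target  # zero when the Weibull sum matches the target

c_40_optimised = newton(wpd_residual, x0=c_40)  # start from the analytic estimate
print(c_40_optimised)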

scipy.pdist() returns NaN values

I'm trying to cluster time series. The intra-cluster elements have the same shapes but different scales, so I would like to use a correlation measure as the clustering metric. I'm trying correlation or Pearson coefficient distance (any suggestion or alternative is welcome).
However, the following code returns an error when I run Z = linkage(dist), because there are some NaN values in dist. There are no NaN values in time_series, which is confirmed by np.any(np.isnan(time_series)) returning False:
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage

dist = pdist(time_series, metric='correlation')
Z = linkage(dist)

fig = plt.figure()
dn = dendrogram(Z)
plt.show()
As an alternative, I tried a Pearson distance:
from scipy.stats import pearsonr

def pearson_distance(a, b):
    return 1 - pearsonr(a, b)[0]

dist = pdist(time_series, pearson_distance)
but this generates some runtime warnings and takes a long time.
scipy.spatial.distance.pdist(time_series, metric='correlation')
If you take a look at the manual, the correlation distance divides by the norms of the mean-centred series. So it could be that one of your time series is constant: its centred norm is zero, and dividing zero by zero gives NaN.
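A quick way to check that hypothesis (a sketch; time_series is assumed to be a 2-D array with one series per row):

import numpy as np

# rows whose values never change have zero standard deviation,
# which makes the correlation distance 0/0 = NaN
constant_rows = np.where(np.std(time_series, axis=1) == 0)[0]
print(constant_rows)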

Is seaborn's confidence interval computed correctly?

First, I must admit that my statistics knowledge is rusty at best: even when it was shiny and new, it's not a discipline I particularly liked, which means I had a hard time making sense of it.
Nevertheless, I took a look at how barplot calculates its error bars, and was surprised to find a "confidence interval" (CI) used instead of the more common standard deviation. Researching CIs led me to this Wikipedia article, which seems to say that, basically, a CI is computed as mean(a) ± 1.96 * std(a) / sqrt(len(a)).
Or, in pseudocode:
def ci_wp(a):
    """calculate confidence interval using Wikipedia's formula"""
    m = np.mean(a)
    s = 1.96 * np.std(a) / np.sqrt(len(a))
    return m - s, m + s
But what we find in seaborn/utils.py is:
def ci(a, which=95, axis=None):
    """Return a percentile range from an array of values."""
    p = 50 - which / 2, 50 + which / 2
    return percentiles(a, p, axis)
Now maybe I'm missing something completely, but this seems like an entirely different calculation from the one proposed by Wikipedia. Can anyone explain this discrepancy?
To give another example, from the comments, why do we get such different results between:
>>> sb.utils.ci(np.arange(100))
array([ 2.475, 96.525])
>>> ci_wp(np.arange(100))
[43.842250270646467, 55.157749729353533]
And to compare with other statistical tools:
def ci_std(a):
    """calculate margin of error using standard deviation"""
    m = np.mean(a)
    s = np.std(a)
    return m - s, m + s

def ci_sem(a):
    """calculate margin of error using standard error of the mean"""
    m = np.mean(a)
    s = sp.stats.sem(a)
    return m - s, m + s
Which gives us:
>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)
>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)
Or with a random sample:
rng = np.random.RandomState(10)
a = rng.normal(size=100)
print(sb.utils.ci(a))
print(ci_wp(a))
print(ci_sem(a))
print(ci_std(a))
... which yields:
[-1.9667006 2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)
Why are Seaborn's numbers so radically different from the other results?
Your calculation using the Wikipedia formula is completely right. Seaborn just uses another method: bootstrapping (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). It's well described by Dragicevic [1]:
[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.
In Seaborn's source code, a barplot uses estimate_statistic, which bootstraps the data and then computes the confidence interval on the bootstrap distribution:
>>> sb.utils.ci(sb.algorithms.bootstrap(np.arange(100)))
array([43.91, 55.21025])
The result is consistent with your calculation.
[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.
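For intuition, here is a minimal sketch of a percentile bootstrap confidence interval for the mean (not Seaborn's exact implementation, which lives in sb.algorithms.bootstrap):

import numpy as np

def bootstrap_ci(a, which=95, n_boot=10000, seed=0):
    # resample with replacement, recompute the mean each time,
    # then take the outer percentiles of the resampled means
    rng = np.random.default_rng(seed)
    a = np.asarray(a)
    boot_means = np.array([rng.choice(a, size=len(a), replace=True).mean()
                           for _ in range(n_boot)])
    p = (100 - which) / 2
    return np.percentile(boot_means, [p, 100 - p])

print(bootstrap_ci(np.arange(100)))  # roughly [44, 55], in line with the values above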
You need to check the code of percentiles. The seaborn ci code you posted simply computes percentile limits: with the default which=95, it returns the 2.5th and 97.5th percentiles, i.e. an interval centred on the median (the 50th percentile) rather than one built from the mean and standard deviation. Anything involving the actual mean or standard deviation would have to happen inside the percentiles routine.

Calculate moments (mean, variance) of distribution in python

I have two arrays: x is the independent variable, and counts is the number of times each x occurs, like a histogram. I know I can calculate the mean by defining a function:
def mean(x, counts):
    return np.sum(x * counts) / np.sum(counts)
Is there a general function I can use to calculate each moment of the distribution defined by x and counts? I would also like to compute the variance.
You could use the moment function from scipy.stats. It calculates the n-th central moment of your data.
You could also define your own function, which could look something like this:
def nmoment(x, counts, c, n):
    return np.sum(counts * (x - c)**n) / np.sum(counts)
In this function, c is the point around which the moment is taken and n is the order. So to get the variance you could do nmoment(x, counts, np.average(x, weights=counts), 2).
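For example, with hypothetical values and counts (not from the original question):

import numpy as np

def nmoment(x, counts, c, n):
    # weighted n-th moment of the values in x about the point c
    return np.sum(counts * (x - c)**n) / np.sum(counts)

x = np.array([1.0, 2.0, 3.0, 4.0])       # values
counts = np.array([3, 5, 8, 4])          # how often each value occurs

mean = nmoment(x, counts, 0, 1)          # first raw moment = mean
variance = nmoment(x, counts, mean, 2)   # second central moment = variance
print(mean, variance)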
import scipy as sp
from scipy import stats

stats.moment(counts, moment=2)  # variance
stats.moment returns the n-th central moment.
NumPy now also provides these statistics routines directly (https://numpy.org/doc/stable/reference/routines.statistics.html):
np.average
np.std
np.var
etc.
