Why does the statsmodels acf function give different answers from scipy's pearsonr? - python

I calculated the ACF at particular lags of a time series "manually" by shifting in Pandas and using scipy.stats.pearsonr(), but got an answer visibly different from what is shown in the ACF plot from statsmodels.graphics.tsaplots.plot_acf().
Looking into it more, I calculated the ACF with statsmodels.api.tsa.acf(), and while the value there agrees with the ACF plot (as I'd expect, since that's what plot_acf() is plotting!), it's substantially different from the Pearson correlation. An MWE (data file and Jupyter notebook, plus a .py file with the same code in case you prefer!) is available at: https://github.com/MultiverseHG/acf_problem. The code is also shown below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr
# read in data
bus = pd.read_csv('cleaned_bus.csv', index_col='Month', parse_dates=True)
# compute first 12 lags of ACF with statsmodels
sm_acf = sm.tsa.acf(bus, nlags=12, fft=False)
print(sm_acf)
# compute ACF by shifting and using scipy.stats.pearsonr
# drop nulls that arise from shifting and corresponding rows of unshifted data
trimmed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).iloc[lag:]
    trimmed = bus.riders.iloc[lag:]
    corr = pearsonr(shifted, trimmed)[0]  # [0] to grab r ([1] is the p-value)
    trimmed_acf.append(corr)
trimmed_acf = np.array(trimmed_acf)
print(trimmed_acf)
# not the same - how different are they?
print(sm_acf - trimmed_acf)
# maybe acf is filling in the missing values with zeroes?
# from looking at the source code, doesn't seem like it, but let's try
zeroed_acf = []
for lag in range(13):
    shifted = bus.riders.shift(lag).fillna(0)
    corr = pearsonr(shifted, bus.riders)[0]
    zeroed_acf.append(corr)
zeroed_acf = np.array(zeroed_acf)
print(zeroed_acf)
# different again!
print(sm_acf - zeroed_acf)
# so why does statsmodels give a different ACF than calculating directly with Pearson's r?
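For comparison, here is a sketch of the textbook ACF estimator (full-series mean, single lag-0 denominator), which I believe is what statsmodels implements, in case the difference simply lies there:
def acf_textbook(x, nlags):
    # uses the mean of the whole series and one common denominator,
    # instead of re-standardising each trimmed pair the way pearsonr does
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    denom = ((x - xbar) ** 2).sum()
    return np.array([((x[:n - k] - xbar) * (x[k:] - xbar)).sum() / denom
                     for k in range(nlags + 1)])
print(acf_textbook(bus.riders, 12))  # compare against sm_acf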

Related

Apply SciPy newton method to optimize a pandas dataframe Weibull sum

I'm a novice programmer, but I know my way around Excel. However, I'm trying to teach myself Python to enable myself to work with much larger datasets and, primarily, because I'm finding it really interesting and enjoyable.
I'm trying to figure out how to recreate the Excel Goal Seek function (I believe SciPy's newton should be equivalent) within the script I have written. However, instead of being able to define a simple function f(x) to find the root of, I'm trying to find the root of the sum of a dataframe column, and I have no idea how to approach this.
My code up until the goal seek part is as follows:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min
# need to use a gamma function later on, so import math
import math
%matplotlib inline
# create dataframe using lidar experimental data
df = pd.read_csv(r'C:\Users\Latitude\Documents\Coursera\Wind Resource\Project\Wind_Lidar_40and140.txt',
                 sep=' ',
                 header=None,
                 names=['Year', 'Month', 'Day', 'Hour', 'v_40', 'v_140'])
# add in columns for velocity cubed
df['v_40_cubed'] = df['v_40']**3
df['v_140_cubed'] = df['v_140']**3
# calculate mean wind speed, mean cubed wind speed, mean wind speed cubed
# use these to calculate energy patter factor, c and k
v_40_bar = df['v_40'].mean()
v_40_cubed_bar = df['v_40_cubed'].mean()
v_40_bar_cubed = v_40_bar ** 3
# energy pattern factor = epf
epf = v_40_cubed_bar / v_40_bar_cubed
# shape parameter = k
k_40 = 1 + 3.69/epf**2
# scale factor = c
# use imported math library to use gamma function math.gamma
c_40 = v_40_bar / math.gamma(1+1/k_40)
# create new dataframe from current, using bins of 0.25, and generate frequencies for these bins
bins_1 = np.linspace(0,16,65,endpoint=True)
freq_df = df.apply(pd.Series.value_counts, bins=bins_1)
# tidy up the dataframe by dropping superfluous columns and adding in a % time column for
# frequency
freq_df_tidy = freq_df.drop(['Year','Month','Day','Hour','v_40_cubed','v_140_cubed'], axis=1)
freq_df_tidy['v_40_%time'] = freq_df_tidy['v_40']/freq_df_tidy['v_40'].sum()
# add in usable bin value for potential calculation of weibull
freq_df_tidy['windspeed_bin'] = np.linspace(0,16,64,endpoint=False)
# calculate weibull column and wind power density from the weibull fit
freq_df_tidy['Weibull_40'] = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c_40)/4
freq_df_tidy['Wind_Power_Density_40'] = 0.5 * 1.225 * freq_df_tidy['Weibull_40'] * freq_df_tidy['windspeed_bin']**3
# calculate wind power density from experimental data
df['Wind_Power_Density_40'] = 0.5 * 1.225 * df['v_40']**3
At this stage, the result from the Weibull data, round(freq_df_tidy['Wind_Power_Density_40'].sum(),2) gives 98.12.
The result from the experimental data, round(df['Wind_Power_Density_40'].mean(),2) gives 101.14.
My aim now is to optimise the parameter c_40, which is used in the Weibull-calculated wind power density (98.12), so that the result of round(freq_df_tidy['Wind_Power_Density_40'].sum(), 2) comes close to the experimental wind power density (101.14).
Any help on this would be hugely appreciated. Apologies if I've entered too much code into the request - I wanted to provide as much detail as possible. From my research, I think the SciPy newton method should do the trick, but I can't figure out how to apply it here.
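A minimal sketch of how scipy.optimize.newton could be wired up here (the helper name wpd_residual is a placeholder of mine; the idea is simply to wrap the Weibull power-density recalculation in a function of c and drive its difference from the experimental value to zero):
from scipy.optimize import newton

experimental_wpd = df['Wind_Power_Density_40'].mean()  # ~101.14 from the data above

def wpd_residual(c):
    # recompute the Weibull-based wind power density for a trial scale factor c
    weibull = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c) / 4
    wpd = 0.5 * 1.225 * weibull * freq_df_tidy['windspeed_bin'] ** 3
    return wpd.sum() - experimental_wpd  # newton finds the c that makes this zero

c_40_optimised = newton(wpd_residual, x0=c_40)  # start the search at the analytic c_40
print(c_40_optimised, wpd_residual(c_40_optimised))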

How to set order AND seasonal params for SARIMA

Hi, can someone give me a dummies' guide to setting the order params and seasonal order params in the SARIMA model from statsmodels?
Are those numbers obtained from the ACF and PACF plots? If yes, how do you get the numbers for AR and MA from those plots? I know the differencing (I) cannot be derived from those two plots; how should I decide whether to set it to 0, 1, or 2?
Thank you
You can use pmdarima (the package formerly published as pyramid-arima):
pip install pmdarima
or, if you are on the older package name:
pip install pyramid-arima
Please look at the pyramid auto-arima documentation.
import matplotlib.pyplot as plt
from pyramid.arima.stationarity import ADFTest

# `series` is your time series
adf_test = ADFTest(alpha=0.05)
adf_test.is_stationary(series)

train, test = series[1:500], series[501:910]
train.shape
test.shape

plt.plot(train)
plt.plot(test)
plt.title("Pyramid")
plt.show()
For the actual fit, statsmodels' SARIMAX does something like:
from statsmodels.tsa.statespace.sarimax import SARIMAX

arima_fit = SARIMAX(data_set, order=(1, 0, 1), seasonal_order=(0, 1, 0, 50), trend='c').fit()
prediction = arima_fit.predict(start, end, dynamic=True)  # start/end are your prediction bounds
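If you would rather let the library search the orders for you, a rough sketch with pmdarima's auto_arima (m=12 below is just an assumed monthly seasonal period, so adjust it to your data):
import pmdarima as pm

# search (p,d,q)(P,D,Q,m) automatically; lower-AIC candidates win
model = pm.auto_arima(train, seasonal=True, m=12,
                      stepwise=True, suppress_warnings=True, trace=True)
print(model.order, model.seasonal_order)
forecast = model.predict(n_periods=len(test))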
About your question regarding ACF and PACF:
The ACF stands for autocorrelation function, and the PACF for partial autocorrelation function. Looking at these two plots together can help us form an idea of which models to fit. Autocorrelation is the correlation between observations of a time series separated by k time units, and the ACF plots these autocorrelations across lags.
Similarly, partial autocorrelations measure the strength of the relationship with other terms accounted for, in this case the intervening lags present in the model. For example, the partial autocorrelation at lag 4 is the correlation at lag 4, accounting for the correlations at lags 1, 2, and 3. In Minitab these plots are generated via Stat > Time Series > Autocorrelation or Stat > Time Series > Partial Autocorrelation.
(source: Fitting an ARIMA model)
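The statsmodels equivalent of those plots, as a quick sketch (assuming series is your time series):
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# significant spikes in the PACF suggest the AR order p,
# significant spikes in the ACF suggest the MA order q
plot_acf(series, lags=40)
plot_pacf(series, lags=40)
plt.show()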

Pinescript correlation(source_a, source_b, length) -> to python

I need help with translating pine correlation function to python, I've already translated stdev and swma functions, but this one is a bit confusing for me.
I've also found this explanation but didn't quite understand how to implement it:
In python, try using pandas with .rolling(window).corr, where window is the correlation coefficient period; pandas lets you compute any rolling statistic via rolling(). The correlation coefficient in pine is calculated as sma(y*x, length) - sma(y, length)*sma(x, length), divided by stdev(y, length)*stdev(x, length), where stdev is based on the naïve algorithm.
Pine documentation for this function:
> Correlation coefficient. Describes the degree to which two series tend to deviate from their sma values.
> correlation(source_a, source_b, length) → series[float]
> RETURNS: Correlation coefficient.
> ARGUMENTS:
> source_a (series) Source series.
> source_b (series) Target series.
> length (integer) Length (number of bars back).
Using pandas is indeed the best option (TA-Lib also has a CORREL function). To give you a better idea of how the correlation function in pine is implemented, here is a python version using numpy; note that this is not an efficient solution.
import numpy as np
from matplotlib import pyplot as plt

def sma(src, m):
    # simple moving average via convolution
    coef = np.ones(m) / m
    return np.convolve(src, coef, mode="valid")

def stdev(src, m):
    # "naive" rolling standard deviation: sqrt(E[x^2] - E[x]^2)
    a = sma(src * src, m)
    b = np.power(sma(src, m), 2)
    return np.sqrt(a - b)

def correlation(x, y, m):
    # rolling covariance divided by the product of rolling standard deviations
    cov = sma(x * y, m) - sma(x, m) * sma(y, m)
    den = stdev(x, m) * stdev(y, m)
    return cov / den

ts = np.random.normal(size=500).cumsum()
n = np.linspace(0, 1, len(ts))
cor = correlation(ts, n, 14)

plt.subplot(2, 1, 1)
plt.plot(ts)
plt.subplot(2, 1, 2)
plt.plot(cor)
plt.show()
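And the pandas route mentioned above, which is usually the most convenient in practice (a sketch on the same simulated series; the values should line up with the numpy version, just with NaN padding at the start instead of a shorter array):
import pandas as pd

s1 = pd.Series(ts)
s2 = pd.Series(n)
# rolling Pearson correlation over a 14-bar window, analogous to correlation(ts, n, 14)
rolling_corr = s1.rolling(14).corr(s2)
print(rolling_corr.tail())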

Genextreme fit not working for some datasets

I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here are the relevant bits of code (the values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)

# GEV parameters from fit
c, loc, scale = fit
fit_mean = loc
min_extreme, max_extreme = gev.interval(0.99, c, loc, scale)

# evenly spread x axis values for pdf plot
x = np.linspace(min(season_temp), max(season_temp), 200)

# plot distribution
fig, ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp, 30, density=True, alpha=0.3)  # density (formerly normed) for a normalised histogram
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successfully fitted distribution has a location parameter of 1.066, compared to a data mean of 2.395. The one that failed has a fitted location parameter of 12.202, compared to a data mean of 2.138.
Thanks in advance for your help!
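One way to investigate (a diagnostic sketch, not a guaranteed fix): compare the negative log-likelihood of the automatic fit with a fit that is seeded with rough moment-based starting values, to see whether the optimiser is simply landing in a poor local optimum for those regions:
# automatic fit, as above
auto_fit = gev.fit(season_temp)

# same fit, but seeded with crude starting guesses for loc and scale
seeded_fit = gev.fit(season_temp, loc=np.mean(season_temp), scale=np.std(season_temp))

# lower nnlf = higher likelihood; a large gap suggests the default fit got stuck
print('auto  :', auto_fit, gev.nnlf(auto_fit, season_temp))
print('seeded:', seeded_fit, gev.nnlf(seeded_fit, season_temp))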

scipy/numpy FFT on data from file

I looked into many examples of scipy.fft and numpy.fft. Specifically this example Scipy/Numpy FFT Frequency Analysis is very similar to what I want to do. Therefore, I used the same subplot positioning and everything looks very similar.
I want to import data from a file, which contains just one column to make my first test as easy as possible.
My code looks like this:
import numpy as np
import scipy as sy
import scipy.fftpack as syfp
import pylab as pyl
# Read in data from file here
array = np.loadtxt("data.csv")
length = len(array)
# Create time data for x axis based on array length
x = sy.linspace(0.00001, length*0.00001, num=length)
# Do FFT analysis of array
FFT = sy.fft(array)
# Getting the related frequencies
freqs = syfp.fftfreq(array.size, d=(x[1]-x[0]))
# Create subplot windows and show plot
pyl.subplot(211)
pyl.plot(x, array)
pyl.subplot(212)
pyl.plot(freqs, sy.log10(FFT), 'x')
pyl.show()
The problem is that I will always get my peak at exactly zero, which should not be the case at all. It really should appear at around 200 Hz.
With a smaller plotting range, the biggest peak is still at zero.
As already mentioned, it seems like your signal has a DC component, which will cause a peak at f=0. Try removing the mean with, e.g., arr2 = array - np.mean(array).
Furthermore, for analyzing signals, you might want to try plotting the power spectral density:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlb

Fs = 1. / (x[1] - x[0])  # sampling frequency from your time vector
plt.psd(array, Fs=Fs, detrend=mlb.detrend_mean)
plt.show()
Take a look at the documentation of plt.psd(), since there are quite a lot of options to fiddle with. For investigating how the spectrum changes over time, plt.specgram() comes in handy.
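For example, a minimal specgram call reusing the Fs computed above:
plt.specgram(array, NFFT=256, Fs=Fs, detrend=mlb.detrend_mean)
plt.xlabel('time [s]')
plt.ylabel('frequency [Hz]')
plt.show()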
