I have been working with time series in Python, using sm.tsa.seasonal_decompose. The docs introduce the function like this:
We added a naive seasonal decomposition tool in the same vein as R’s decompose.
Here is a copy of the code from the docs and its output:
import statsmodels.api as sm
dta = sm.datasets.co2.load_pandas().data
# deal with missing values. see issue
dta.co2.interpolate(inplace=True)
res = sm.tsa.seasonal_decompose(dta.co2)
res.plot()
They say it is naive but there is no disclaimer about what is wrong with it. Does anyone know?
I did some (ahem... naive) research and, according to the reference, it seems that statsmodels uses the classic moving-average method to estimate the trend and then applies classical seasonal decomposition (you can check more here, specifically the sections on Moving Averages and Classical Decomposition).
However, more advanced seasonal decomposition techniques are available, such as STL decomposition, which also has some Python implementations. (UPDATE 11/04/2019: as pointed out in the comments by @squarespiral, such an implementation appears to have been merged into the master branch of statsmodels.)
At the above links, you can find a complete reference on the advantages and disadvantages of each one of the proposed methods.
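For what it's worth, here is a rough sketch of calling the STL implementation that was merged into statsmodels (available from 0.11 onwards), reusing the CO2 example from the question; period=52 is my assumption for the weekly sampling of that dataset:
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import STL

dta = sm.datasets.co2.load_pandas().data
dta.co2.interpolate(inplace=True)

# period=52 assumes one seasonal cycle per year of weekly observations
res = STL(dta.co2, period=52).fit()
res.plot()
plt.show()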
Hope it helps!
I want to cluster stars based on their given positions (X, Y, Z) using DBSCAN, but I do not know how to tune the parameters to get the right number of clusters so I can plot them afterward.
This is what the data looks like:
What are the right parameters for these data?
The number of rows is 1.202672e+06.
import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.read_csv('datasets/full_dataset.csv')

clusters = DBSCAN(eps=0.5, min_samples=40, metric="euclidean", algorithm="auto").fit(data)
min_samples is arguably one of the tougher ones to choose, but you can decide that by just looking at the results and deciding how much noise you are okay with.
Choosing eps can be aided by running k-NN to understand the density distribution of your data. I believe the DBSCAN paper discusses this in more detail. There might even be a way to plot this in Python (in R it is kNNdistplot); a rough sketch is below.
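Something along these lines should work as the Python analogue of kNNdistplot; the column names X, Y, Z are my assumption for how the positions are stored:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

data = pd.read_csv('datasets/full_dataset.csv')
coords = data[['X', 'Y', 'Z']]           # assumed column names for the positions

k = 40                                   # match your min_samples
nbrs = NearestNeighbors(n_neighbors=k).fit(coords)
distances, _ = nbrs.kneighbors(coords)   # first column is the point itself (distance 0)

# Sort each point's distance to its k-th neighbour and look for the "elbow";
# that distance is a reasonable candidate for eps.
plt.plot(np.sort(distances[:, -1]))
plt.xlabel('points sorted by k-distance')
plt.ylabel(f'distance to {k}-th nearest neighbour')
plt.show()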
I would prefer to use OPTICS, which essentially considers all eps values simultaneously. However, I haven't found a decent implementation of it in either Python or R. In fact, there is an incorrect implementation in Python which doesn't follow the original OPTICS paper at all.
If you really want to use OPTICS, I recommend the Java implementation available in ELKI.
If anyone else has heard of a proper python implementation, I'd love to hear it.
If you want to go the trial-and-error route, start with a much smaller eps and go from there.
A mixed-effects regression model is used when I believe there is a dependency within particular groups of a feature. I've attached the Wikipedia link because it explains this better than I can: https://en.wikipedia.org/wiki/Mixed_model
Although I believe there are many occasions on which we need to account for mixed effects, there aren't many modules that support this.
R has lme4, and Python seems to have a similar module, but they are both statistics-driven; they do not use cost-function-based algorithms such as gradient boosting.
In a machine learning setting, how would you handle a situation where you need to consider mixed effects? Are there any other models that can handle longitudinal data with mixed (random) effects?
(R seems to have a package that supports mixed effects, https://rd.springer.com/article/10.1007%2Fs10994-011-5258-3, but I am looking for a Python solution.)
There are at least two ways to handle longitudinal data with mixed effects in Python:
statsmodels, for linear mixed effects;
MERF, for mixed-effects random forests.
If you go for statsmodels, I'd recommend working through some of the examples provided here. If you go for MERF, I'd say that the best starting point is here.
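If it helps, here is a minimal statsmodels MixedLM sketch; the file name and the column names y, x and group are placeholders for your own longitudinal data:
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('my_longitudinal_data.csv')   # hypothetical file

# Fixed effect for x, random intercept per group (e.g. per subject)
model = smf.mixedlm('y ~ x', data=df, groups=df['group'])
result = model.fit()
print(result.summary())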
I hope it helps!
I am studying the correlation between a set of input variables and a response variable, price. These are all time series.
1) Is it necessary that I smooth out the curve where the input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
Eg: "Once X increases >10% then there is an 2% increase in y 6 months later."
Which python libraries should I be looking at to implement this - in particular to figure out the lag time between two correlated occurrences?
I have already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In scipy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics based, the bit about how to do it in Python seems at home here. I see that you've since decided to do this in R from looking at your question on Cross Validated, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series - I presume this means removing both trend and any seasonality or periodicity. The following link is ultimately readying a model for forecasting, but it discusses the Box-Jenkins approach for building a model, including making it stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
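For reference, a minimal sketch of those stattools calls; the file name and the column names 'x' and 'price' are placeholders for your own series:
import pandas as pd
from statsmodels.tsa.stattools import acf, adfuller, ccf, pacf

df = pd.read_csv('my_series.csv', index_col=0, parse_dates=True)  # hypothetical file

x = df['x'].diff().dropna()        # difference away the trend
price = df['price'].diff().dropna()

print(adfuller(x)[1])              # ADF p-value; small -> series looks stationary
print(acf(x, nlags=24))            # autocorrelations
print(pacf(x, nlags=24))           # partial autocorrelations
print(ccf(x, price)[:12])          # cross-correlations by lag, a rough guide to the lag time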
The above link also discusses using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating lag order in var_model, using select_order.
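And a rough sketch of a VAR fit with AIC-based lag selection (again, the file name is a placeholder, and the series are assumed to have been differenced into stationarity first):
import pandas as pd
from statsmodels.tsa.api import VAR

# Hypothetical file holding the input variables and price, differenced once
df = pd.read_csv('my_series.csv', index_col=0, parse_dates=True).diff().dropna()

model = VAR(df)
print(model.select_order(maxlags=12).summary())  # AIC/BIC/HQIC for each lag order

results = model.fit(maxlags=12, ic='aic')        # choose the lag order by AIC
print(results.summary())

# Impulse responses trace how a shock in one variable feeds into the others over
# subsequent periods, which gets at the "X moves, y follows later" question.
results.irf(12).plot()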
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
I am trying to use GARCH(1,1) to find the hedge ratio as described in this paper: http://search.livjm.ac.uk/AFE/AFE_docs/cibef0402.pdf. However, Python does not offer packages for GARCH(1,1), so I think I have to implement it myself.
The data I have for the index and the futures are their daily returns. I would like to write a function that takes in the daily returns and outputs the beta of the GARCH model as the hedging ratio. However, I am at a loss as to where to start writing the GARCH function. Could anyone outline, step by step, the algorithm for GARCH(1,1) in this case?
There is an implementation of this in the Python statsmodels library. The source code is available here.
There are also ARCH models in Python.
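A minimal sketch of fitting a GARCH(1,1) with the third-party arch package (which I believe is what the line above refers to); the returns here are synthetic placeholders, and getting the actual hedge ratio from the paper would still require modelling the index and futures returns jointly:
import numpy as np
from arch import arch_model

# Placeholder daily returns (in percent); swap in your index/futures return series
np.random.seed(0)
returns = 100 * np.random.normal(0, 0.01, 1000)

# GARCH(1,1) with a constant mean
am = arch_model(returns, mean='Constant', vol='GARCH', p=1, q=1)
res = am.fit(disp='off')
print(res.summary())
# Fitted conditional volatilities are available as res.conditional_volatility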
Thanks in advance for any answers. I want to conduct a 2-way repeated-measures ANOVA in Python, where one IV has 5 levels and the other has 4 levels, with one DV. I've tried looking around in the scipy documentation and a few online blogs but can't seem to find anything.
You can use the rm_anova function of the Pingouin package (of which I am the creator), which works directly with pandas DataFrames, e.g.:
import pingouin as pg
# Compute the 2-way repeated measures ANOVA. This will return a dataframe.
pg.rm_anova(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
# Optional post-hoc tests
pg.pairwise_ttests(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
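In case it helps, here is a hypothetical long-format DataFrame shaped the way the call above expects; the column names id, iv1, iv2 and dv are just the placeholders used above, and the numbers are synthetic:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 12 hypothetical subjects, fully crossed with 5 levels of iv1 and 4 levels of iv2
subjects = np.repeat(np.arange(12), 5 * 4)
iv1 = np.tile(np.repeat(np.arange(5), 4), 12)
iv2 = np.tile(np.arange(4), 12 * 5)

df = pd.DataFrame({'id': subjects, 'iv1': iv1, 'iv2': iv2,
                   'dv': rng.normal(size=subjects.size)})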
This is an old question, but I will provide an answer.
You could take a look at pyvttbl. Using this library (it can be installed via pip) you can carry out n-way ANOVAs for both independent and repeated measures (and mixed designs). Note that it seems you will have to use pyvttbl's own data frame class to hold your data.
It is pretty simple:
dataframe.anova('dv', sub='id', wfactors=['iv1', 'iv2'])
You can see my blog post for a more elaborate example of how to carry out a 2-way repeated-measures ANOVA.