I've been trying to identify seasonal customers in a dataset. As a first approach, I used the seasonal_decompose() function from the statsmodels package - this is useful for visualizing specific customers, but it won't work for the whole dataset, as I have almost 8000 different time series, one for each client.
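For reference, a minimal sketch of that per-customer decomposition, assuming a DataFrame df with client_id, date, and sales columns (all names are assumptions; some_id is a placeholder) and a regular monthly frequency:

from statsmodels.tsa.seasonal import seasonal_decompose

# One customer's series; the index must be a DatetimeIndex with a regular frequency
one_customer = df[df['client_id'] == some_id].set_index('date')['sales'].asfreq('M')
result = seasonal_decompose(one_customer, model='additive', period=12)
result.plot()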
Then I decided to try the ADF test. The problem here is that it only detects whether my series is stationary or not, and because of the trend it won't work in my case.
I also tried combining this with the KPSS test (which tests for trend-stationarity), but the results were still poor.
Now, I have thought about four alternatives:
Find a way to evaluate it manually using a mean/variance approach
Try using CHTest
Try using the Darts package
Detrend my data and apply those tests (or others) again (a sketch of this route follows after the list)
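A minimal sketch of the fourth route, assuming the series are held in a dict of per-client arrays (the container and names are assumptions):

import numpy as np
from scipy.signal import detrend
from statsmodels.tsa.stattools import adfuller, kpss

results = {}
for client_id, series in all_series.items():
    y = detrend(np.asarray(series, dtype=float))       # remove a linear trend
    adf_p = adfuller(y)[1]                             # H0: unit root (non-stationary)
    kpss_p = kpss(y, regression='c', nlags='auto')[1]  # H0: (level-)stationary
    results[client_id] = (adf_p, kpss_p)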
The thing is that I couldn't find good examples of any of this in Python... most of the
solutions I found for my problem are developed in R. Is there a suitable way of
doing this in Python or should I give up, export my series and try using R?
Could you help me with some tips? I would really appreciate reading suggestions too. Thanks!
I am trying to find the cross-correlation between two time series and it turns out that they are auto-correlated and non-stationary. Reading about them, it seems that you have to pre-whiten the series before finding the correlation. But there is no clear process mentioned anywhere on how to pre-whiten a time series. I'm going to do it in Python so if anyone can mention even the steps, that would be great. R has a nice "prewhiten()" function for this purpose and I was wondering how to implement it in Python.
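For what it's worth, here is a minimal sketch of the usual Box-Jenkins pre-whitening recipe in Python; x and y stand for the two series (names assumed) and the AR order p is purely illustrative:

import numpy as np
from scipy.signal import lfilter
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import ccf

p = 5                                # AR order: pick via AIC in practice
ar_fit = AutoReg(x, lags=p).fit()
phi = ar_fit.params[1:]              # AR coefficients (skip the intercept)

filt = np.r_[1, -phi]                # AR polynomial 1 - phi_1*B - ... - phi_p*B^p
x_pw = lfilter(filt, [1.0], x)[p:]   # pre-whitened x (drop the start-up values)
y_pw = lfilter(filt, [1.0], y)[p:]   # y filtered with the SAME coefficients

xcorr = ccf(x_pw, y_pw)              # cross-correlation of the whitened series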
So a mixed-effects regression model is used when I believe there is a dependency within a particular group of a feature. I've attached the Wikipedia link because it explains this better than I can. (https://en.wikipedia.org/wiki/Mixed_model)
Although I believe there are many occasions in which we need to consider mixed effects, there aren't many modules that support this.
R has lme4, and Python seems to have a similar module, but they are both statistics driven; they do not use cost-function algorithms such as gradient boosting.
In a machine learning setting, how would you handle a situation where you need to consider mixed effects? Are there any other models that can handle longitudinal data with mixed (random) effects?
(R seems to have a package that supports mixed effects: https://rd.springer.com/article/10.1007%2Fs10994-011-5258-3 but I am looking for a Python solution.)
There are, at least, two ways to handle longitudinal data with mixed-effects in Python:
statsmodels for linear mixed effects;
MERF for mixed-effects random forests.
If you go for statsmodels, I'd recommend working through some of the examples provided here. If you go for MERF, I'd say the best starting point is here.
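As a quick illustration of the statsmodels route (the formula, column names, and grouping variable are all assumptions):

import statsmodels.formula.api as smf

# Random intercept for each group; df holds one row per observation
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
result = model.fit()
print(result.summary())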
I hope it helps!
I am studying the correlation between a set of input variables and a response variable, price. These are all time series.
1) Is it necessary that I smooth out the curve where the input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
E.g.: "Once X increases by more than 10%, there is a 2% increase in y six months later."
Which python libraries should I be looking at to implement this - in particular to figure out the lag time between two correlated occurrences?
I already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In SciPy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics based, the bit about how to do it in Python seems at home here. I see that you've since decided to do this in R from looking at your question on Cross Validated, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series - I presume this means removing both trend and any seasonality or periodicity. The following link is ultimately readying a model for forecasting, but it discusses the Box-Jenkins approach for building a model, including making it stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
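A minimal sketch of those tools in action; series, x, and y are assumed to be one-dimensional arrays:

import numpy as np
from statsmodels.tsa.stattools import acf, pacf, adfuller, ccf

adf_stat, pvalue = adfuller(series)[:2]  # small p-value: evidence of stationarity
autocorr = acf(series, nlags=24)         # autocorrelations up to lag 24
partial = pacf(series, nlags=24)         # partial autocorrelations

# For two stationary series, the index of the largest absolute cross-correlation
# is a candidate for the lag between them
xcorr = ccf(x, y)
lag = int(np.argmax(np.abs(xcorr)))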
The same lecture notes also discuss using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating the lag order in var_model, using select_order.
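For example, a hypothetical order selection and fit, assuming data is a DataFrame with one stationary series per column (in current statsmodels releases the VAR class is importable from statsmodels.tsa.api):

from statsmodels.tsa.api import VAR

model = VAR(data)                          # data: one column per (stationary) series
print(model.select_order(maxlags=12))      # AIC/BIC/HQIC/FPE for each candidate lag
results = model.fit(maxlags=12, ic='aic')  # fit using the AIC-chosen lag order
print(results.summary())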
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Thanks in advance for any answers. I want to conduct a 2-way repeated-measures ANOVA in Python, where one IV has 5 levels and the other 4 levels, with one DV. I've tried looking around in the SciPy documentation and a few online blogs but can't seem to find anything.
You can use the rm_anova function in the Pingouin package (of which I am the creator), which works directly with pandas DataFrames, e.g.:
import pingouin as pg
# Compute the 2-way repeated measures ANOVA. This will return a dataframe.
pg.rm_anova(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
# Optional post-hoc tests
pg.pairwise_ttests(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
This is an old question, but I will provide an answer.
You could take a look at pyvttbl. Using this library (it can be installed via pip), you can carry out n-way ANOVAs for both independent and repeated measures (and mixed designs). Note that you will have to use pyvttbl's own data frame method to handle your data.
It is pretty simple:
dataframe.anova('dv', sub='id', wfactors=['iv1', 'iv2'])
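A slightly fuller sketch, assuming your data live in a CSV file (the file name is made up):

from pyvttbl import DataFrame

df = DataFrame()
df.read_tbl('data.csv')     # pyvttbl's own data frame, not pandas
aov = df.anova('dv', sub='id', wfactors=['iv1', 'iv2'])
print(aov)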
You can see my blog post for a more elaborate example of how to carry out a 2-way repeated-measures ANOVA.
I took a scientific programming course this semester that I really enjoyed and experimented with a lot. We used Python and all the related modules. I am taking a physics lab next semester, and I just wanted to hear from some of you how Python can help me in ways that Excel can't, or in ways that are better than Excel's capabilities. I use Mathematica for symbolic stuff, so I would use Python for data purposes.
Off the top of my head, here are the related things I can do:
All of the things you would expect in an intro course (loops, arrays, slicing arrays, etc.).
Reading data from a text file.
Plotting scatter, line, and bar graphs.
Learning how to plot a linear regression, though I haven't totally figured it out.
I have done 7 of the problems on Project Euler (nothing to brag about, but it might give you a better idea of where I stand in skills).
Looking forward to hearing from some of you. You don't have to explain how to use the things you mention, I could look up the documentation.
The paper "Python - all a scientist needs" comes to mind. I hope you can make the needed transformations from biology to physics.
SciPy will also be useful to you, as it includes many more advanced analysis tools. For example, SciPy includes linear regression, and it gets more interesting from there. Along with the other tools you mentioned, you'll probably find most of your needs covered.
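For instance, a least-squares line fit is one call with scipy.stats.linregress (the data here are made up):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(slope, intercept, r_value**2)  # fit parameters and R^2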
Other notes on tool selection:
Mathematica is a great tool, if you can afford it. I've played around with the other options, like SymPy, and sadly, they don't come close to being as useful as Mathematica.
I can't imagine using Excel for any serious scientific work. If you're planning to continue forward using the tools that you learn in class, you might as well start with tools that offer you that potential.
Don't reject Excel outright. It's still great for doing simple data analysis and plotting. Excel also has the considerable advantage of being installed on most engineers' and scientists' computers, making it a lot easier to share your work with colleagues.
That said, I do use Python when Excel just won't cut it; times when I've had to:
color the points in a scatter plot based on a third column (a sketch of this one follows after the list)
plot a field of vectors
extract a few values from each of several thousand data files to do statistical process control
generate dozens of scatter plots over different dimensions of a large data set to find which variables are important
solve a nonlinear equation at several intermediate points of a calculation, not just as the final result.
accept variable length input from a user to define a problem
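To give a flavor of the first item, a minimal matplotlib sketch with made-up data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.random(100), rng.random(100), rng.random(100)

plt.scatter(x, y, c=z, cmap='viridis')  # color each point by the third column z
plt.colorbar(label='z')
plt.show()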
VBA in Excel can do a lot of those things too, but it becomes painful fast in such a primitive language. I dream that Microsoft will make IronPython a first-class scripting language in the next version of Excel. Until then, you might want to try Resolver One.
I can recall two presentations by Jan Martinek at EuroSciPy 2008; he's a PhD candidate and presented some fun experiments with physics in the background. The abstracts are here, and I'm sure he wouldn't mind sharing more if you contact him directly. Also, take a look at the other presentations from EuroSciPy; there are some more physics-related ones.