Two-way repeated measures ANOVA Python function

Thanks in advance for any answers. I want to conduct a two-way repeated measures ANOVA in Python, where one IV has 5 levels and the other has 4 levels, with a single DV. I've tried looking through the scipy documentation and a few online blogs but can't seem to find anything.

You can use the rm_anova function in the Pingouin package (of which I am the creator), which works directly with pandas DataFrames, e.g.:
import pingouin as pg
# Compute the 2-way repeated measures ANOVA. This will return a dataframe.
pg.rm_anova(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
# Optional post-hoc tests
pg.pairwise_ttests(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
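For reference, here is a minimal sketch of the long-format layout rm_anova expects; the column names (id, iv1, iv2, dv) and the simulated values are hypothetical, just mirroring the call above:

import itertools
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: each subject is measured once per
# (iv1, iv2) cell, i.e. 5 x 4 = 20 rows per subject.
subjects = range(10)
levels_iv1 = ['a', 'b', 'c', 'd', 'e']   # 5 levels
levels_iv2 = ['w', 'x', 'y', 'z']        # 4 levels
rows = [(s, l1, l2, np.random.normal())
        for s, l1, l2 in itertools.product(subjects, levels_iv1, levels_iv2)]
df = pd.DataFrame(rows, columns=['id', 'iv1', 'iv2', 'dv'])

aov = pg.rm_anova(dv='dv', within=['iv1', 'iv2'], subject='id', data=df)
print(aov)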

This is an old question, but I will provide an answer.
You could take a look at pyvttbl. Using this library (it can be installed via pip) you can carry out n-way ANOVAs for both independent and repeated measures (and mixed designs). Note that you will have to use pyvttbl's own DataFrame class to handle your data.
It is pretty simple:
dataframe.anova('dv', sub='id', wfactors=['iv1', 'iv2'])
You can see my blog post for a more elaborate example of how to carry out a two-way ANOVA for repeated measures.
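For context, a minimal sketch of how that call fits together, assuming a CSV with hypothetical columns id, iv1, iv2 and dv (pyvttbl is unmaintained, so this may not run on recent Python versions):

from pyvttbl import DataFrame

df = DataFrame()
df.read_tbl('data.csv')  # hypothetical file with columns id, iv1, iv2, dv
aov = df.anova('dv', sub='id', wfactors=['iv1', 'iv2'])
print(aov)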

Related

Detect Seasonality in Time Series with Python

I've been trying to identify seasonal customers in a dataset. As a first approach, I used the seasonal_decompose() function of the statsmodels package; this is useful for visualizing specific customers, but it won't work for the whole dataset, as I have almost 8000 different time series, one for each client.
Then I decided to try the ADF test. The problem here is that it only detects whether my series is stationary or not, and because of the trend it won't work in my case. I also tried combining it with the KPSS test (which tests for trend-stationarity), but the results were still bad.
Now, I have thought about four alternatives:
- Find a way to evaluate it manually using a mean/variance approach (see the sketch at the end of this question)
- Try using CHTest
- Try using the Darts package
- Detrend my data and apply those tests (or others) again
The thing is that I couldn't find good examples of any of this in Python; most of the solutions I found for my problem are developed in R. Is there a suitable way of doing this in Python, or should I give up, export my series, and try using R?
Could you help me with some tips? I would really appreciate reading suggestions too. Thanks!
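Since no answer is recorded here, a minimal sketch of the mean/variance idea from the list above: decompose each client's series and score how strong its seasonal component is relative to the residual (the seasonal-strength formula is Hyndman's; the period, threshold, and data layout are assumptions):

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_strength(series, period=12):
    # Hyndman-style measure: 1 - Var(residual) / Var(seasonal + residual)
    res = seasonal_decompose(series.dropna(), period=period)
    detrended = res.seasonal + res.resid
    return max(0.0, 1 - np.nanvar(res.resid) / np.nanvar(detrended))

# wide: hypothetical DataFrame with one column per client (~8000 columns)
# strengths = wide.apply(seasonal_strength)
# seasonal_clients = strengths[strengths > 0.6]  # threshold is a judgment call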

Statsmodels seasonal_decompose - what is naive about it?

I have been working with time series in Python, using sm.tsa.seasonal_decompose. In the docs they introduce the function like this:
We added a naive seasonal decomposition tool in the same vein as R’s decompose.
Here is a copy of the code from the docs and its output:
import statsmodels.api as sm
dta = sm.datasets.co2.load_pandas().data
# deal with missing values. see issue
dta.co2.interpolate(inplace=True)
res = sm.tsa.seasonal_decompose(dta.co2)
res.plot()
They say it is naive but there is no disclaimer about what is wrong with it. Does anyone know?
I did some (ahem... naive) research and, according to the reference, it seems that statsmodels uses the classic moving-average method to estimate the trend and then applies classical seasonal decomposition (you can check more here, specifically the sections on Moving Average and Classical Decomposition).
However, more advanced seasonal decomposition techniques are available, such as STL decomposition, which also has some Python implementations. (UPDATE 11/04/2019: as pointed out in the comments by #squarespiral, such implementations appear to have been merged into the master branch of statsmodels.)
At the links above, you can find a complete reference on the advantages and disadvantages of each of the proposed methods.
Hope it helps!
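Following up on that update, a minimal sketch of STL applied to the same co2 series from the question, assuming a statsmodels version recent enough to ship statsmodels.tsa.seasonal.STL:

import statsmodels.api as sm
from statsmodels.tsa.seasonal import STL

dta = sm.datasets.co2.load_pandas().data
co2 = dta.co2.interpolate()      # same missing-value handling as above
co2 = co2.resample('M').mean()   # monthly series, so STL infers period=12
res = STL(co2).fit()
res.plot()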

Which model to use when mixed-effects, random-effects added regression is needed

A mixed-effects regression model is used when I believe there is a dependency within particular groups of a feature. I've attached the Wiki link because it explains this better than I can: https://en.wikipedia.org/wiki/Mixed_model
Although I believe there are many occasions on which we need to consider mixed effects, there aren't many modules that support this.
R has lme4, and Python seems to have a similar module, but they are both statistics-driven; they do not use cost-function-based algorithms such as gradient boosting.
In a machine learning setting, how would you handle a situation where you need to consider mixed effects? Are there any other models that can handle longitudinal data with mixed effects (random effects)?
(R seems to have a package that supports mixed effects: https://rd.springer.com/article/10.1007%2Fs10994-011-5258-3, but I am looking for a Python solution.)
There are at least two ways to handle longitudinal data with mixed effects in Python:
statsmodels, for linear mixed-effects models;
MERF, for mixed-effects random forests.
If you go for statsmodels, I'd recommend working through some of the examples provided here. If you go for MERF, I'd say the best starting point is here.
I hope it helps!
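To make the statsmodels route concrete, a minimal sketch of a random-intercept model with MixedLM; the data here is simulated and the column names (y, x, subject) are hypothetical:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Hypothetical longitudinal data: 20 subjects, 10 observations each,
# with a subject-specific random intercept baked into y.
n_subj, n_obs = 20, 10
subject = np.repeat(np.arange(n_subj), n_obs)
x = rng.normal(size=n_subj * n_obs)
y = 2.0 * x + rng.normal(size=n_subj)[subject] + rng.normal(size=n_subj * n_obs)
data = pd.DataFrame({'y': y, 'x': x, 'subject': subject})

# Fixed effect for x, random intercept per subject
md = smf.mixedlm('y ~ x', data, groups=data['subject'])
print(md.fit().summary())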

Chi squared test in Python

I'd like to run a chi-squared test in Python. I've created code to do this, but I don't know if what I'm doing is right, because the scipy docs are quite sparse.
Background first: I have two groups of users. My null hypothesis is that device usage (desktop, mobile, or tablet) does not differ between the two groups.
These are the observed frequencies in the two groups:
[[u'desktop', 14452], [u'mobile', 4073], [u'tablet', 4287]]
[[u'desktop', 30864], [u'mobile', 11439], [u'tablet', 9887]]
Here is my code using scipy.stats.chi2_contingency:
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p)
This gives me a p-value of 2.02258737401e-38, which is clearly significant.
My question is: does this code look valid? In particular, I'm not sure whether I should be using scipy.stats.chi2_contingency or scipy.stats.chisquare, given the data I have.
I can't comment too much on the use of the function. However, the issue at hand may be statistical in nature. The very small p-value you are seeing is most likely a result of your data containing large frequencies (on the order of tens of thousands). When sample sizes are that large, even tiny differences become statistically significant, hence the small p-value. The tests you are using are very sensitive to sample size. See here for more details.
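One way to act on that point is to report an effect size alongside the p-value. A minimal sketch using Cramér's V (my suggestion, not part of the original question) on the same table:

import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
n = obs.sum()
# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))
print(cramers_v)  # a small V means a weak association despite the tiny p-value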
You are using chi2_contingency correctly. If you feel uncertain about the appropriate use of a chi-squared test or how to interpret its result (i.e. your question is about statistical testing rather than coding), consider asking it over at Cross Validated: https://stats.stackexchange.com/

Time series - correlation and lag time

I am studying the correlation between a set of input variables and a response variable, price. These are all time series.
1) Is it necessary to smooth out the curve where the input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
E.g.: "Once X increases by more than 10%, there is a 2% increase in y 6 months later."
Which Python libraries should I be looking at to implement this, in particular to figure out the lag time between two correlated occurrences?
I already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In scipy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics based, the bit about how to do it in Python seems at home here. I see that you've since decided to do this in R from looking at your question on Cross Validated, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series; I presume this means removing both trend and any seasonality or periodicity. The following link is ultimately about readying a model for forecasting, but it discusses the Box-Jenkins approach for building a model, including making it stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
The above link also discusses using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating lag order in var_model, using select_order.
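To make the select_order point concrete, here is a minimal sketch using the modern statsmodels VAR API (module paths have moved around over the years; statsmodels.tsa.api.VAR is the current entry point, and the data here is a simulated stand-in for your stationary series):

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
# Hypothetical stationary data: two differenced series, one column per variable
data = pd.DataFrame(rng.normal(size=(200, 2)), columns=['x', 'price'])

model = VAR(data)
print(model.select_order(maxlags=12).summary())  # AIC/BIC/FPE/HQIC per candidate lag
results = model.fit(maxlags=12, ic='aic')        # choose the lag order by AIC
print(results.summary())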
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
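For the specific question of finding the lag between two correlated series, a minimal sketch (my addition, not part of the original answer) using the cross-correlation function in statsmodels.tsa.stattools; x and y here are simulated stationary series:

import numpy as np
from statsmodels.tsa.stattools import ccf

rng = np.random.default_rng(0)
# Hypothetical example: y follows x with a lag of 6 steps
x = rng.normal(size=200)
y = np.roll(x, 6) + 0.5 * rng.normal(size=200)

# ccf(y, x)[k] estimates Corr(y[t + k], x[t]), so a peak at k
# means y lags x by roughly k steps
corr = ccf(y, x)[:24]
print(np.argmax(np.abs(corr)))  # expected to peak near lag 6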
