I'm not sure how to approach this; I'm new to both statistics and Python. A student is trying to decide between two Processing Units. He wants to use the Processing Unit to run high-performance algorithms for his research, so the only thing he is concerned with is speed. He picks a high-performance algorithm on a large data set and runs it on both Processing Units 10 times, timing each run in hours. The results are given in the lists TestSample1 and TestSample2 below.
from scipy import stats
import numpy as np
TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample2 = np.array([11, 13, 10, 13, 12, 9, 11, 12, 12, 11])
Assumption: both samples above are random, independent, and normally distributed, so parametric tests apply.
Hint: you can use the t-test functions from scipy.stats to perform the t-tests.
Question 1: One-sample t-test
Check whether the mean of TestSample1 is equal to zero.
Null hypothesis: the mean is equal to zero.
Alternative hypothesis: the mean is not equal to zero.
Question 2
Given:
1. Null hypothesis: there is no significant difference between the datasets.
2. Alternative hypothesis: there is a significant difference.
Perform a two-sample test and decide whether or not to reject the null hypothesis.
Question 3
He is trying a third Processing Unit; its timings are in TestSample3.
TestSample3 = np.array([9, 10, 9, 11, 10, 13, 12, 9, 12, 12])
Assumption: both datasets (TestSample1 and TestSample3) are random, independent, and normally distributed, so parametric tests apply.
Perform a two-sample test to check whether there is a significant difference between the speeds of the two samples, TestSample1 and TestSample3.
Question 1
The way to do this with SciPy would be this:
stats.ttest_1samp(TestSample1, popmean=0)
It is not a useful test to perform in this context, though, because we already know the null hypothesis must be false: negative times are impossible, so the only way for the population mean of the times to be zero would be if every measured time were exactly zero, which is clearly not the case.
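Still, for completeness, here is a minimal sketch of running that test and reading off the decision mechanically; the 5% threshold is my illustrative choice, not part of the question:
from scipy import stats
import numpy as np

TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])

# H0: the population mean run time is 0 hours; H1: it is not.
stat, p = stats.ttest_1samp(TestSample1, popmean=0)
alpha = 0.05  # illustrative significance level
if p < alpha:
    print(f"t = {stat:.2f}, p = {p:.3g}: reject H0")
else:
    print(f"t = {stat:.2f}, p = {p:.3g}: fail to reject H0")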
Question 2
Here's how to do a two-sample t-test for independent samples with SciPy:
stats.ttest_ind(TestSample1, TestSample2)
Output:
Ttest_indResult(statistic=-1.8325416653445783, pvalue=0.08346710398411555)
So the t-statistic is -1.8, but its deviation from zero is not formally significant (p = 0.08). The result is inconclusive. Of course, it would be better to have more precise measurements, not rounded to whole hours.
In any case, I would argue that, given your stated setting, you do not really need this test either. It is highly unlikely that two different CPUs perform exactly the same, and you just want to decide which one to go with. Simply choosing the one with the lower average time, regardless of significance test results, is clearly the right decision here.
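If you only want that decision-by-average, it is a one-liner (using the arrays from the question):
import numpy as np

TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample2 = np.array([11, 13, 10, 13, 12, 9, 11, 12, 12, 11])

# Pick the unit with the lower mean run time, whatever the t-test says.
print(TestSample1.mean(), TestSample2.mean())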
Question 3
This is analogous to Question 2.
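Concretely, it is the same call with the third sample, again assuming independent samples with equal variances as in Question 2:
from scipy import stats
import numpy as np

TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample3 = np.array([9, 10, 9, 11, 10, 13, 12, 9, 12, 12])

# H0: no difference between the mean run times of the two units.
stat, p = stats.ttest_ind(TestSample1, TestSample3)
print(stat, p)  # reject H0 only if p falls below your chosen significance level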
I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters of my model.
The problem is that this only seems to work in a low-dimensional regime (like 300 covariates and 40000 observations). Specifically, I get a warning that the maximum number of iterations has been reached, the log-likelihood is inf (i.e. it has diverged), and a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations) and got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However, it clearly will not help now, because my log-likelihood function is diverging. It is not just a matter of non-convergence within the maximum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers; however, there have been papers specifically investigating this high-dimensional regime, for example here, where p = 800 covariates and n = 4000 observations are used.
Granted, that paper used R rather than Python, and unfortunately I do not know R. However, I would think that Python's optimisation should be of comparable 'quality'.
My questions:
Might it be the case that R is better suited than Python's statsmodels to handle data in this high p/n regime? If so, why, and can R's techniques be used to modify the Python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?
In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method, which is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices, and it often results in numerical instability when the matrix is near-singular, as you have already witnessed. This is briefly explained on the relevant Wikipedia entry.
You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:
import statsmodels.api as sm

model = sm.Logit(y, X)
result = model.fit(method='bfgs')
This runs perfectly well for me even with n = 10000, p = 2000.
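If you want to reproduce that on synthetic data, here is a rough sketch; the data-generating process is my own illustrative choice, not your actual setup, and you can shrink n and p if it is slow on your machine:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 4000, 800                              # roughly the regime from the paper you cite
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)    # scaled so the linear predictor stays moderate
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

model = sm.Logit(y, X)
result = model.fit(method='bfgs', maxiter=500)
print(result.params[:5])                      # first few coefficient estimates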
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.
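One quick, informal way to spot (quasi-)separation is to check how many fitted probabilities are pushed essentially to 0 or 1; this is only a heuristic, not a formal test, and it reuses the result object from the fit above:
import numpy as np

p_hat = result.predict()                      # in-sample fitted probabilities
extreme = np.mean((p_hat < 1e-6) | (p_hat > 1 - 1e-6))
print(f"fraction of fitted probabilities pinned at 0 or 1: {extreme:.2%}")
# A large fraction, together with huge coefficient estimates, is a warning sign of
# quasi-separation, in which case the plain MLE is unreliable.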
Using Python, I want to compare a simulated light curve with the measured light curve. It should be mentioned that the measured data contain gaps and outliers, and the time steps are not constant. The model, however, uses constant time steps.
As a first step, I would like to use a statistical method to assess how similar the two light curves are. Which method is best suited for this?
As a second step, I would like to fit the model to my measurement data. However, the model data are not calculated in Python but in a separate piece of software. Basically, the model depends on four parameters, each limited to a certain range, which I currently feed into the software manually (automating this is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, speaking as a data science layperson, this may be a problem suited to gradient descent with a mean-squared-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
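As a very rough sketch of what that loop looks like when the model lives in external software (all names here are placeholders; simulate(params, t) stands in for a call to your external program):
import numpy as np

def mse(params, t_obs, y_obs, simulate):
    """Mean squared error between the observations and the simulated curve at the observed times."""
    y_model = simulate(params, t_obs)          # placeholder for the external model call
    return np.mean((y_obs - y_model) ** 2)

def fit_gradient_descent(params0, t_obs, y_obs, simulate, lr=1e-3, eps=1e-4, n_iter=500):
    params = np.asarray(params0, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros_like(params)
        for i in range(len(params)):           # finite-difference gradient, one parameter at a time
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (mse(params + step, t_obs, y_obs, simulate)
                       - mse(params - step, t_obs, y_obs, simulate)) / (2 * eps)
        params -= lr * grad                    # move downhill; may still end up in a local minimum
    return params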
Edit: I overlooked this part:
The simulation is not good yet, but by playing with the parameters one can get an acceptable result, meaning the phase and period match, the magnitude is of the same order, and even the specular flashes occur with the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
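If so, a rough sketch of pulling out the dominant period with numpy could look like this; it assumes an evenly sampled signal, so it applies directly to the simulated curve, while the gappy measurements would need resampling first:
import numpy as np

def dominant_period(y, dt):
    """Return the period of the strongest Fourier component of an evenly sampled signal."""
    y = np.asarray(y, dtype=float)
    spectrum = np.abs(np.fft.rfft(y - y.mean()))   # remove the mean so the DC bin does not dominate
    freqs = np.fft.rfftfreq(len(y), d=dt)
    peak = np.argmax(spectrum[1:]) + 1             # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Quick check on a synthetic sine with a known period of 2.5 time units
t = np.arange(0, 50, 0.01)
print(dominant_period(np.sin(2 * np.pi * t / 2.5), dt=0.01))   # ~2.5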
Is it possible to do clustering without providing any input apart from the data itself? The clustering method/algorithm should decide from the data how many logical groups the data can be divided into. It should not even require me to input a threshold Euclidean distance on which the clusters are built; that, too, should be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to try every number of clusters from 1 to n (where n could be defined in a config file, so that you avoid hard-coding it and only have to set it once)?
Once that is done, compute the clusterings for 1 to n clusters and choose the value that gives you the smallest mean squared error.
This requires some additional work by your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
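A sketch of that loop with scikit-learn's KMeans (my choice of library; the answer does not prescribe one). KMeans' inertia_ is the within-cluster sum of squared distances, i.e. n times the mean squared error referred to above:
import numpy as np
from sklearn.cluster import KMeans

def cluster_errors(data, n_max):
    """Within-cluster mean squared error for k = 1 .. n_max clusters."""
    data = np.asarray(data, dtype=float)
    errors = {}
    for k in range(1, n_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        errors[k] = km.inertia_ / len(data)   # average squared distance to the assigned centroid
    return errors

# Toy example: three well-separated blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])
print(cluster_errors(X, n_max=6))
# Note: this error always shrinks as k grows, so in practice you would pick the k where the
# curve stops improving much (the "elbow") rather than the literal minimum.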
Clustering is an exploratory technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about your data.
Methods that claim to be "parameter-free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.
I have a question regarding the "correct" way to run the Augmented Dickey–Fuller (ADF) test with sm.tsa.stattools.adfuller in Python / IPython. I'm running the ADF test on a time series of temperatures.
According to the documentation, the test provides an autolag argument:
autolag : {'AIC', 'BIC', 't-stat', None}
If None, then maxlag lags are used.
If 'AIC' or 'BIC', then the number of lags is chosen to minimize the corresponding information criterion.
If 't-stat', a t-statistic-based choice of maxlag: it starts with maxlag and drops a lag until the t-statistic on the last lag length is significant at the 95% level.
Which one is scientifically correct / accepted? (And what does AIC/BIC mean?)
Also, is it possible to calculate an approximate p-value for Z(t), like in some Stata tables? See Link
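For reference, a minimal sketch of calling the test with AIC-based lag selection (AIC and BIC are the Akaike and Bayesian information criteria; the temperature series below is a dummy stand-in). Note that the second value returned is already an approximate p-value (MacKinnon's), which is the same kind of approximation the Stata tables report:
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Dummy stationary series standing in for the temperature data
temperatures = np.random.default_rng(0).standard_normal(500)

adf_stat, pvalue, usedlag, nobs, crit_values, icbest = adfuller(temperatures, autolag='AIC')
print("ADF statistic:", adf_stat)
print("approximate p-value:", pvalue)   # MacKinnon approximate p-value, comparable to Stata's Z(t) output
print("lags used:", usedlag)
print("critical values:", crit_values)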
I'd like to run a chi-squared test in Python. I've created code to do this, but I don't know if what I'm doing is right, because the scipy docs are quite sparse.
Background first: I have two groups of users. My null hypothesis is that there is no difference between the two groups in how likely people are to use desktop, mobile, or tablet.
These are the observed frequencies in the two groups:
[[u'desktop', 14452], [u'mobile', 4073], [u'tablet', 4287]]
[[u'desktop', 30864], [u'mobile', 11439], [u'tablet', 9887]]
Here is my code using scipy.stats.chi2_contingency:
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p)
This gives me a p-value of 2.02258737401e-38, which clearly is significant.
My question is: does this code look valid? In particular, I'm not sure whether I should be using scipy.stats.chi2_contingency or scipy.stats.chisquare, given the data I have.
I can't comment too much on the use of the function. However, the issue at hand may be statistical in nature. The very small p-value you are seeing is most likely a result of your data containing large frequencies (on the order of tens of thousands). When sample sizes are that large, even small differences become statistically significant, hence the tiny p-value. The tests you are using are very sensitive to sample size. See here for more details.
You are using chi2_contingency correctly. If you feel uncertain about the appropriate use of a chi-squared test or how to interpret its result (i.e. your question is about statistical testing rather than coding), consider asking it over at the "CrossValidated" site: https://stats.stackexchange.com/
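If it helps to sanity-check the result, here is the same call with the expected counts and the per-group proportions printed; a highly significant p-value says nothing about how large the differences actually are, which is the point made in the previous answer:
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287],
                [30864, 11439, 9887]])

chi2, p, dof, expected = stats.chi2_contingency(obs)
print("chi2 =", chi2, "p =", p, "dof =", dof)
print("expected counts under H0:")
print(expected)

# Proportions of desktop / mobile / tablet within each group, to see the size of the differences
print(obs / obs.sum(axis=1, keepdims=True))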