Chi-squared test in Python

I'd like to run a chi-squared test in Python. I've created code to do this, but I don't know if what I'm doing is right, because the scipy docs are quite sparse.
Background first: I have two groups of users. My null hypothesis is that there is no significant difference in whether people in either group are more likely to use desktop, mobile, or tablet.
These are the observed frequencies in the two groups:
[[u'desktop', 14452], [u'mobile', 4073], [u'tablet', 4287]]
[[u'desktop', 30864], [u'mobile', 11439], [u'tablet', 9887]]
Here is my code using scipy.stats.chi2_contingency:
import numpy as np
from scipy import stats

obs = np.array([[14452, 4073, 4287], [30864, 11439, 9887]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(p)
This gives me a p-value of 2.02258737401e-38, which is clearly significant.
My question is: does this code look valid? In particular, I'm not sure whether I should be using scipy.stats.chi2_contingency or scipy.stats.chisquare, given the data I have.

I can't comment too much on the use of the function. However, the issue at hand may be statistical in nature. The very small p-value you are seeing is most likely a result of your data containing large frequencies (on the order of tens of thousands). When sample sizes are very large, even tiny differences become statistically significant, hence the small p-value. These tests are very sensitive to sample size. See here for more details.
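To put such a p-value in perspective, it can help to report an effect size alongside it; Cramér's V is one conventional choice for contingency tables. A minimal sketch using the observed table from the question (the formula is computed by hand here rather than relying on any particular scipy helper):

```python
import numpy as np
from scipy import stats

# Observed device counts for the two user groups (from the question).
obs = np.array([[14452, 4073, 4287],
                [30864, 11439, 9887]])

chi2, p, dof, expected = stats.chi2_contingency(obs)

# Cramér's V: chi-squared normalised by the sample size and table shape.
# With huge n, p can be astronomically small while V stays near zero,
# i.e. a real but practically weak association.
n = obs.sum()
v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))
print(p, v)
```

Here V comes out well under 0.1, so the group difference, while statistically unambiguous, is small in practical terms.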

You are using chi2_contingency correctly. If you feel uncertain about the appropriate use of a chi-squared test or how to interpret its result (i.e. your question is about statistical testing rather than coding), consider asking it over at the "CrossValidated" site: https://stats.stackexchange.com/

Related

How to estimate the optimal cutpoint for a binary outcome in python

I have a dataset of diabetic patients which has been used to train an xgboost model on several outcomes such as stroke, amputation, and more. Originally we used the continuous numeric variables as-is, but we found ambiguity in the results, since for example age gave us results where the older you get, the higher the risk of stroke.
But for us as physicians a narrower range is needed, so we divided those variables into bins. And indeed this gave us more insight. Nonetheless, we are seeing that some contiguous intervals appear pretty close together in our results.
Continuing the example above, bin(64-78) and bin(79-88) appear one after the other, and no other bin from the age variable appears. So we think the best approach in this case is to find the optimal cutpoint at which age starts to become a risk factor for stroke.
Then I came across this document (https://www.mayo.edu/research/documents/biostat-79pdf/doc-10027230) which explains in SAS how to find those cutpoints. I am not experienced enough to program this myself, so I want to know how I could find these cutpoints in Python.
Please do restrict answers to that language; I have already seen R, SAS, even SPSS examples, but none in Python. There must be some way to do this in Python.
It's very difficult to determine without seeing the data, but there are a few ways of doing it. One is to fit a logistic regression on your data, which gives you a predicted probability for the binary class; you can then use the Receiver Operating Characteristic (ROC) curve to determine the optimal threshold, depending on how important it is to you to prioritise the true positive rate over avoiding false positives.
You can find an article about this here
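To make that concrete, here is a hedged sketch with scikit-learn: a logistic regression on a single predictor (age, as in the question's example), with the cutpoint chosen by maximising Youden's J statistic (TPR minus FPR) over the ROC thresholds. The synthetic data and all variable names are illustrative, not from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Illustrative synthetic data: risk of the outcome rises with age,
# with a "true" midpoint around age 70.
rng = np.random.default_rng(0)
age = rng.uniform(40, 90, size=1000)
outcome = (rng.random(1000) < 1 / (1 + np.exp(-(age - 70) / 5))).astype(int)

# Fit a logistic model and get predicted probabilities.
model = LogisticRegression().fit(age.reshape(-1, 1), outcome)
probs = model.predict_proba(age.reshape(-1, 1))[:, 1]

# Youden's J: pick the ROC threshold where TPR - FPR is largest.
fpr, tpr, thresholds = roc_curve(outcome, probs)
best_prob = thresholds[np.argmax(tpr - fpr)]

# Translate the probability threshold back into an age cutpoint.
b0, b1 = model.intercept_[0], model.coef_[0, 0]
age_cutpoint = (np.log(best_prob / (1 - best_prob)) - b0) / b1
print(age_cutpoint)
```

With real data you would replace the synthetic arrays with your age and stroke columns; the recovered cutpoint should land near the point where risk starts climbing.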

One sample & Two Sample t-tests in Python

I'm not sure how to work this; I'm new to the world of stats and Python. A student is trying to decide between two processing units. He wants to use the processing unit for his research to run high-performance algorithms, so the only thing he is concerned with is speed. He picks a high-performance algorithm on a large dataset and runs it on both processing units 10 times, timing each run in hours. Results are given in the lists TestSample1 and TestSample2 below.
from scipy import stats
import numpy as nupy
TestSample1 = nupy.array([11,9,10,11,10,12,9,11,12,9])
TestSample2 = nupy.array([11,13,10,13,12,9,11,12,12,11])
Assumption: Both the dataset samples above are random, independent, parametric & normally distributed
Hint: You can import ttest function from scipy to perform t tests
First T test
One sample t-test
Check if the mean of the TestSample1 is equal to zero.
Null Hypothesis is that mean is equal to zero.
Alternate hypothesis is that it is not equal to zero.
Question 2
Given,
1. Null Hypothesis : There is no significant difference between datasets
2. Alternate Hypothesis : There is a significant difference
Do two-sample testing and check whether to reject Null Hypothesis or not.
Question 3 -
Do two-sample testing and check whether there is significant difference between speeds of two samples: - TestSample1 & TestSample3
He is trying a third Processing Unit - TestSample3.
TestSample3 = nupy.array([9,10,9,11,10,13,12,9,12,12])
Assumption: Both the datasets (TestSample1 & TestSample3) are random, independent, parametric & normally distributed
Question 1
The way to do this with SciPy would be this:
stats.ttest_1samp(TestSample1, popmean=0)
It is not a useful test to perform in this context though, because we already know that the null hypothesis must be false. Negative times are impossible, so the only way for the population mean of times to be zero would be if every time measured were always zero, which is clearly not the case.
Question 2
Here's how to do a two-sample t-test for independent samples with SciPy:
stats.ttest_ind(TestSample1, TestSample2)
Output:
Ttest_indResult(statistic=-1.8325416653445783, pvalue=0.08346710398411555)
So the t-statistic is -1.8, but its deviation from zero is not formally significant (p = 0.08). This result is inconclusive. Of course it would be better to have more precise measurements, not rounded to hours.
In any case, I would argue that given your stated setting, you do not really need this test either. It is highly unlikely that two different CPUs perform exactly the same, and you just want to decide which one to go with. Simply choosing the one with the lower average time, regardless of significance test results, is clearly the right decision here.
Question 3
This is analogous to Question 2.
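For reference, Question 3 spelled out with the arrays from the question (using the conventional np alias):

```python
import numpy as np
from scipy import stats

TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample3 = np.array([9, 10, 9, 11, 10, 13, 12, 9, 12, 12])

# Independent two-sample t-test, exactly as in Question 2.
t_stat, p_value = stats.ttest_ind(TestSample1, TestSample3)
print(t_stat, p_value)
```

The sample means (10.4 vs 10.7 hours) are close, and the p-value is large, so there is no evidence of a speed difference between these two units either.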

Approaches for using statistics packages for maximum likelihood estimation for hundreds of covariates

I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters of my model.
The problem is, this only seems to work in a low-dimensional regime (like 300 covariates and 40000 observations). Specifically, I get that the maximum number of iterations has been reached, the log-likelihood is inf, i.e. it has diverged, and a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations) and got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However, it clearly will not help now, because my log-likelihood function is diverging. It is not just a matter of non-convergence within the maximum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers, however there have been papers specifically investigating this high dimensional regime, say here where p=800 covariates and n=4000 observations are used.
Granted, this paper used R rather than python. Unfortunately I do not know R. However I should think that python optimisation should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited to handle data in this high p/n regime than python statsmodels? If so, why and can the techniques of R be used to modify the python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?
In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method. This is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices, and it often results in numerical instability when the matrix is near-singular, as you have already witnessed. This is briefly explained on the relevant Wikipedia entry.
You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:
model = sm.Logit(y, X)
result = model.fit(method='bfgs')
This runs perfectly well for me even with n = 10000, p = 2000.
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.

P value for Normality test very small despite normal histogram

I've looked over the normality tests in scipy.stats, both scipy.stats.mstats.normaltest and scipy.stats.shapiro, and it looks like they both assume the null hypothesis is that the data they're given are normal.
I.e., a p-value less than 0.05 would indicate that the data are not normal.
I'm doing a regression with LassoCV in SKLearn, and in order to give myself better results I log transformed the answers, which gives a histogram that looks like this:
Looks normal to me.
However, when I run the data through either of the two tests mentioned above I get very small p values that would indicate the data is not normal, and in a big way.
This is what I get when I use scipy.stats.shapiro
scipy.stats.shapiro(y)
Out[69]: (0.9919402003288269, 3.8889791653673456e-07)
And I get this when I run scipy.stats.mstats.normaltest:
scipy.stats.mstats.normaltest(y)
NormaltestResult(statistic=25.755128535282189, pvalue=2.5547293546709236e-06)
It seems implausible to me that my data would test out as being so far from normality with the histogram it has.
Is there something causing this discrepancy, or am I not interpreting the results correctly?
If the numbers on the vertical axis are the numbers of observations in the respective classes, then the sample size is about 1500. For such a large sample size, goodness-of-fit tests are rarely useful. But is it really necessary that your data be perfectly normally distributed? If you want to analyse the data with a statistical method, is that method perhaps robust to ("small") deviations from the normality assumption?
In practice the question is usually "Is the normal distribution assumption acceptable for my statistical analysis?". A perfectly normal distribution is very rarely available.
An additional comment on histograms: one has to be careful when interpreting them, because whether the data "look normal" or not may depend on the width of the histogram classes. Histograms are only hints, to be treated with caution.
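A small simulation illustrates the large-sample point: with n around 1500, even data a histogram would pass as "normal" can be firmly rejected by Shapiro-Wilk. The contamination level below is an arbitrary illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1500  # roughly the sample size read off the histogram's axis

# Mostly normal data with a mild heavy-tailed contamination:
# about 5% of points get three times the spread.
y = rng.normal(size=n)
y[rng.random(n) < 0.05] *= 3.0

stat, p = stats.shapiro(y)
print(p)
```

A histogram of y still looks bell-shaped, yet at this sample size Shapiro-Wilk reliably flags the mild deviation with a very small p-value.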
If you run this n times and take the mean of the p values, you will get what you expect. Run it in a loop in a Monte Carlo way.

Utilising Genetic algorithm to overcome different size datasets in model

So I realise the question I am asking here is large and complex.
A potential solution to variances in sizes of datasets:
In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following csv (ultra-simplified, but hopefully it demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes; the ultimate objective of the algorithm is to create as high as possible a correspondence between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing).
In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values compared with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y that overcomes these differences in dataset size, using a growth relationship defined by the equation:
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
Where α is the only dynamic number
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: the graph below shows an exponential growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, the above equation says that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger; whatever percentage is left is made up by the average xy. It could hypothetically be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y which clearly demonstrates the difference in reliability between the two samples.
To aid understanding: lowering α would produce the following relationship, whereby a larger dataset would be required to achieve the same "% of xy contributed".
Conversely, increasing α would produce the following relationship, whereby a smaller dataset would be required to achieve the same "% of xy contributed".
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of python. If additional detail is required or further clarification about the problem or methods please do ask, I really want to be able to solve this problem and future problems of this nature.
So after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm that iterates α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone is able to help in that department.
So to clarify, this post is no longer a discussion about methodology.
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
Where adjusted xy applies to each row of the csv. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is by each Unique class only) and Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different size datasets. If any more information is required please ask, I check this post about 20 times a day at the moment so should reply rather promptly. Many thanks SMNALLY.
The problem you are facing sounds to me, from a general point of view, like the "bias-variance dilemma". In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GAs but looking at instance-based learning and advanced regression techniques. Andrew Moore's page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small, (considered as less representative).
As a consequence you want to "weight" the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank), such that, per class, adjusted_xy is sorted like R.
To do so you suggest posing it as an optimisation problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the data set will later be used for supervised learning ).
One problem that I see in your approach (if I have understood it well) is that, at the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
That said, I think you surely know how GAs work. You have to:
define the content of the chromosome : this appears to be your alpha parameter.
define an appropriate fitness function
The fitness function for one individual can be a sum of distances over all examples of the dataset.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better adapted than a GA.
As solving optimisation problems is CPU-intensive, you might eventually consider C or Java instead of Python (the fitness function, at least, will otherwise be interpreted and thus cost a lot).
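Since the chromosome here is a single real parameter α, the evolutionary loop can even be replaced by a plain grid search for a first baseline. Below is a sketch of the fitness function on a hand-abbreviated version of the question's csv (the column names and the abbreviation are illustrative):

```python
import numpy as np
import pandas as pd

# Abbreviated, illustrative subset of the question's csv.
df = pd.DataFrame({
    'UniqueClass': ['C1', 'C1', 'C1', 'C5', 'C5', 'C5'],
    'AchievedRank': [1, 2, 3, 1, 2, 3],
    'x': [3000, 300, 1, 12, 1, 13],
    'y': [9610, 961, 3, 24, 2, 59],
})
AVERAGE_XY = 0.08527  # the fixed "Average xy" column from the csv

def fitness(alpha):
    # adjusted_xy = (1 - exp(-y*alpha)) * x/y + exp(-y*alpha) * Average_xy
    w = 1 - np.exp(-df['y'] * alpha)
    adjusted = w * (df['x'] / df['y']) + (1 - w) * AVERAGE_XY
    # Rank adjusted_xy within each class (rank 1 = largest value)...
    pred_rank = adjusted.groupby(df['UniqueClass']).rank(ascending=False)
    # ...and measure total disagreement with the achieved rank.
    return (pred_rank - df['AchievedRank']).abs().sum()

alphas = np.logspace(-6, 0, 200)
best_alpha = min(alphas, key=fitness)
print(best_alpha, fitness(best_alpha))
```

The same fitness function can be dropped into any GA/ES library, or into scipy.optimize.minimize_scalar, once it behaves sensibly on the full csv.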
Alternatively I would look at using Y as a weight to some supervised learning algorithm (if supervised learning is the target).
Let's start with the problem: you consider that some features lead some of your cases to be a 'strike'. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate in the first place. You also comment on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data, therefore you will in one way or another need to collect more to account for that variation. (That is, variation that is inherent to the problem).
The fact that in some cases the numbers end up in 'unusable cases' could also be down to outliers. That is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude them or re-adjust them. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates coming from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, the frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features with some more that do not appear all the time. The question now is which classifier you use to achieve this. Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayes classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a priori probabilities. When the classifier is 'running' it will use the features of new data to construct a likelihood that the patient who provided the data should be assigned to each class.
Perhaps the more common example for such a classifier is the spam-email detectors where the likelihood that an email is spam is judged on the existence of specific words in the email (and a suitable training dataset that provides a good starting point of course).
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, which could potentially help with those differences in dataset size. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book and community are very helpful.
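If staying inside Python is preferred, the same Naive Bayes idea can be sketched with scikit-learn instead of Weka (the data below is purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Illustrative: 200 items, 4 numeric features, 2 classes whose feature means differ.
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 4)) + labels[:, None] * 1.5

clf = GaussianNB().fit(features, labels)
# Posterior probability of each class for the first item.
print(clf.predict_proba(features[:1]))
```

In the real problem, the rows would be items (patients), the columns would be the features you derive strike rates from, and the class priors would come from the training data.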
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points. But the less this fit will mean. Especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It will be able to explain in hard probabilities how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple simulation runs for a population like your subjects. See if the data looks like the data you're looking at or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variable, and therefore hard to model, you could create multiple forests from different random samples of your large dataset, put a control layer on top to classify data against all the forests, and then take the best score. I don't write Python, but I found this link
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
which may give you something to play with.
Following Occam's razor, you must select a simpler model for small dataset and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but it can never tell you what an acceptable value of N is.
Thus, build several models and pick one with better tradeoff of predictive power and simplicity using Akaike information criterion. It has useful properties and not too hard to understand. :)
There are other tests of course, but AIC should get you started.
For a simple test, check out p-value
