Scipy and Sklearn chi2 implementations give different results - python

I am using sklearn.feature_selection.chi2 for feature selection and found some unexpected results (see the code below). Does anyone know the reason, or can someone point me to relevant documentation or a pull request?
I include a comparison of the results I got and the expected ones obtained by hand and using scipy.stats.chi2_contingency.
The code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2, SelectKBest
x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])
scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(x[:, i], y))
    scores.append(result[0])
sel = SelectKBest(score_func=chi2, k=3)
sel.fit(x, y)
print(scores)
print(sel.scores_)
print(sel.get_support())
The results are:
[6., 2.4, 6.0, 6.0, 0.0] (Expected)
[4. 2. 2. 2. 0.] (Unexpected)
[ True True False True False]
Using scipy, it keeps features 0, 2, 3, while with sklearn it keeps features 0, 1, 3.

First, you have the observed and expected values interchanged when calculating with the scipy implementation; it should be
scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(y, x[:, i]))
    scores.append(result[0])
So now the scipy results are:
[6.000000000000001, 2.4000000000000004, 6.000000000000001, 6.000000000000001, 0.0]
While the ones from sklearn's chi2 are
[4. 2. 2. 2. 0.]
Now, looking into the source code, the two libraries calculate the chi-square values a little differently.
The sklearn implementation
You can check line 171, where chi2 is defined; this is the sklearn implementation before the values are passed to _chisquare.
scipy implementation
You can view the scipy implementation here, which calls this function to finally calculate the chi-square values.
As you can see from the implementations, the difference in values comes from the transformations they perform on the observed and expected values before computing the chi-square statistic.
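To make the difference concrete, here is a condensed sketch of roughly what sklearn's chi2 computes (based on my reading of the source, not the exact code): it treats each feature column as a count, sums it per class to get the observed values, and builds the expected values from the class frequencies, so the "feature absent" row of the contingency table used by chi2_contingency never enters the calculation.
import numpy as np
from sklearn.preprocessing import LabelBinarizer

def chi2_sklearn_style(X, y):
    # Rough re-implementation of sklearn.feature_selection.chi2 (a sketch, not the exact source)
    Y = LabelBinarizer().fit_transform(y)   # one-hot labels, shape (n_samples, n_classes)
    if Y.shape[1] == 1:                      # binary targets come back as a single column
        Y = np.hstack([1 - Y, Y])
    observed = Y.T @ X                       # per-class sums of each feature ("feature present" counts only)
    feature_count = X.sum(axis=0)            # total count of each feature
    class_prob = Y.mean(axis=0)              # empirical class frequencies
    expected = np.outer(class_prob, feature_count)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

print(chi2_sklearn_style(x, y))              # [4. 2. 2. 2. 0.], matching sel.scores_
chi2_contingency, by contrast, builds the full (present/absent) x class table, which is where the 6.0 vs 4.0 discrepancy for feature 0 comes from.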
References:
chi square feature selection using scipy

Yes, they do give different results. And I think you should trust the results from scipy, and reject the results from sklearn.
But let me provide details of my reasoning, because I could be wrong.
I recently observed a similar effect to the one you describe, with a data set of 300 points: the results of the two chi2 implementations do indeed differ. In my case the difference was striking. I described the issue in detail in this article, followed by this Cross Validated discussion thread, and I also submitted a bug report to sklearn, available for review here.
The added value of my research, if any, seems to be that the results delivered by the scipy implementation appear correct, while the results from sklearn appear incorrect. Please see the article for the details. But I only focused on my sample, so the conclusion may not be universally true. Sadly, analysing the source code is beyond my capability, but I hope this input can help someone either improve the code or disprove my reasoning if I am wrong.

Related

How to derive odds-ratios and p-values to see if the difference is significant?

I have two different groups of samples: samples1 and samples2.
Moreover, I have 18 different elements and for each element, there is the corresponding score attained from using all samples of samples1 and samples2, respectively.
e.g.:
score_samples1[0] means the score for the first element attained by using all samples of samples1.
score_samples2[0] means the score for the first element attained by using all samples of samples2.
Now, I want to compare the difference between these two sample groups and find out whether this difference is statistically significant.
As shown below, I have applied a t-test to get a p-value to assess the significance.
My question is as follows:
Is there a way to derive odds-ratios and p-values to see if the difference is significant?
from scipy import stats
import statistics
score_samples1=[1.430442073, 1.347975371, 1.175088823, 1.20141693, 1.152665995, 1.105591463, 1.343297357, 1.251526193, 1.107442697, 1.302090741, 1.312426241, 1.24880381, 1.46855296, 1.12369795, 1.344426189, 1.24276685, 1.216269219, 1.172317535]
score_samples2=[1.663793448, 1.506660754, 1.387285644, 1.440433062, 1.367680224, 1.340102236, 1.632881551, 1.522894543, 1.137437101,1.581845495, 1.540401185, 1.549114159, 1.558038893, 1.392571495, 1.532717551, 1.451731862, 1.277597967, 1.336609308]
stats.ttest_ind(score_samples1,score_samples2)
stats.ttest_ind(score_samples1,score_samples2, equal_var=False)
Ttest_indResult(statistic=-5.03264933189511, pvalue=1.7512132919948795e-05)
#Paired t-test
stats.ttest_rel(score_samples1,score_samples2)
Ttest_relResult(statistic=-11.148411105604898, pvalue=3.0763665473016024e-09)
Assume that I categorize the scores as follows:
scores_ge_cutoff_samples1=[1 if x>=1.30 else 0 for x in score_samples1]
scores_ge_cutoff_samples2=[1 if x>=1.30 else 0 for x in score_samples2]
scores_ge_cutoff_samples1
[1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores_ge_cutoff_samples2
[1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
oddsratio, pvalue = stats.fisher_exact([[16, 2], [7, 11]])
pvalue
0.004510832141532924
oddsratio
12.571428571428571
Based on this analysis, the odds of a score >= 1.30 are about 12.57 times higher for samples2 than for samples1.
However, I was aiming to get an odds ratio for the difference between the samples1 and samples2 scores.
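For reference, the 2x2 table used above follows directly from the categorized lists; a small sketch with the same numbers:
from scipy import stats

n1, n2 = len(scores_ge_cutoff_samples1), len(scores_ge_cutoff_samples2)
# Rows: samples2, samples1; columns: score >= 1.30, score < 1.30
table = [
    [sum(scores_ge_cutoff_samples2), n2 - sum(scores_ge_cutoff_samples2)],
    [sum(scores_ge_cutoff_samples1), n1 - sum(scores_ge_cutoff_samples1)],
]
print(table)                                  # [[16, 2], [7, 11]]
oddsratio, pvalue = stats.fisher_exact(table)
print(oddsratio, pvalue)                      # ~12.57, ~0.0045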
You need to read about experimental procedure. "Is this significant?" is not something you decide with a computation afterwards; it's a critical parameter of your experimental design. You decide, before you do the experiment, what level of significance you'll accept as confirming the hypothesis you chose.
A one-tailed t-test requires a hypothesis that, say, sample 1 is greater than sample 2.
A two-tailed t-test requires a hypothesis that sample 1 and sample 2 are from different distributions -- but not which would be greater than the other, just that they're different.
Since you've done both tests, it appears that you failed to design your experiment at all. For scientific integrity, you now have to start over, design your experiment, and re-run your samples.
On the other hand, you're in very good shape for a reasonable result. Typical thresholds for a conclusion are confidence levels of 95%, 98%, and 99%, i.e. p-value cutoffs of 0.05, 0.02, and 0.01; these accept error rates of 5%, 2%, and 1%, respectively.
Your p-values are far below even the most stringent of these (e-5 versus e-2), so you shouldn't have any trouble with that part. The code is quite simple -- something such as this:
t_score, prob = stats.ttest_ind(score_samples1, score_samples2)
if prob <= 0.01:
    print("The hypothesis is confirmed")
else:
    print("The hypothesis is not confirmed")

difference between accuracy_score and jaccard_similarity_score

What is the difference between sklearn.metrics.jaccard_similarity_score and sklearn.metrics.accuracy_score ?
1. When do we use accuracy_score?
2. When do we use jaccard?
3. I know the formula. Could someone explain the algorithm behind these metrics?
4. How can I calculate jaccard on my dataframes?
array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
thanks
The accuracy_score is straightforward, which is one of the reasons it is a common choice. It's the number of correctly classified samples divided by the total, so in your case:
from sklearn.metrics import jaccard_score, accuracy_score
print(a)
array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]])
accuracy_score(a[0,:], a[1,:])
# 0.25
Which is the same as doing:
(a[0,:] == a[1,:]).sum()/a.shape[1]
# 0.25
The jaccard_score is specially suited to certain problems, such as object detection. You can get a better understanding by taking a look at the Jaccard index, also known as intersection over union, which measures the overlap of two sample sets divided by the size of their union (the combined size minus the intersection).
Note that sklearn.metrics.jaccard_similarity_score is deprecated, and you should probably be looking at sklearn.metrics.jaccard_score. The latter has several averaging modes, depending on what you're most interested in. The default is average='binary', which you should change since you're dealing with multiple labels; see the sketch below.
So depending on your application you'll be more interested in one or the other. Though if you aren't sure, I'd suggest you go with the simpler of the two, which is the accuracy score.
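For instance, on the two rows from the question, a multiclass call could look like this (average='macro' and average='weighted' are chosen just for illustration; the printed values are approximate):
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score

a = np.array([[1, 1, 1, 1, 2, 0, 1, 0],
              [2, 1, 1, 0, 1, 1, 0, 1]])

print(accuracy_score(a[0, :], a[1, :]))                    # 0.25
# The labels are multiclass (0, 1, 2), so an explicit averaging mode is needed
print(jaccard_score(a[0, :], a[1, :], average='macro'))    # ~0.083
print(jaccard_score(a[0, :], a[1, :], average='weighted')) # ~0.156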

What is the difference between fit() and fit_predict() in SpectralClustering

I am trying to understand and use the spectral clustering from sklearn.
Let us say we have X matrix input and we create a spectral clustering object as follows:
clustering = SpectralClustering(n_clusters=2,
                                assign_labels="discretize",
                                random_state=0)
Then, we call a fit_predict using the spectral cluster object.
clusters = clustering.fit_predict(X)
What confuses me is: when is 'the affinity matrix for X using the selected affinity' created? Because, as per the documentation, the
fit_predict() method 'Performs clustering on X and returns cluster labels.' But it doesn't explicitly say that it also computes 'the affinity matrix for X using the selected affinity' before clustering.
I appreciate any help or tips.
As already implied in another answer, fit_predict is just a convenience method in order to return the cluster labels. According to the documentation, fit
Creates an affinity matrix for X using the selected affinity, then applies spectral clustering to this affinity matrix.
while fit_predict
Performs clustering on X and returns cluster labels.
Here, Performs clustering on X should be understood as what is described for fit, i.e. Creates an affinity matrix [...].
It is not difficult to verify that calling fit_predict is equivalent to getting the labels_ attribute from the object after fit; using some dummy data, we have
from sklearn.cluster import SpectralClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [10, 0],
              [10, 2], [10, 4], [1, 0]])
# 1st way - use fit and get the labels_
clustering = SpectralClustering(n_clusters=2,
                                assign_labels="discretize",
                                random_state=0)
clustering.fit(X)
clustering.labels_
# array([1, 1, 0, 0, 0, 1])
# 2nd way - using fit_predict
clustering2 = SpectralClustering(n_clusters=2,
                                 assign_labels="discretize",
                                 random_state=0)
clustering2.fit_predict(X)
# array([1, 1, 0, 0, 0, 1])
np.array_equal(clustering.labels_, clustering2.fit_predict(X))
# True
Looking at the source code of fit_predict(), it seems that it's just a convenience method: it literally just calls fit() and returns the labels from the object.
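In effect, the wrapper boils down to something like this (a simplified sketch, not the literal sklearn source):
def fit_predict_sketch(estimator, X):
    # Roughly what fit_predict does for clusterers such as SpectralClustering:
    # fit() builds the affinity matrix and runs spectral clustering, then the labels are returned.
    estimator.fit(X)
    return estimator.labels_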

Map test data using sklearn TSNE

Is there a way to extract the mapping procedure in sklearn.manifold.TSNE in python so that you can map new data into the reduced dimensional space?
Importantly, I mean without having to retrain on the new data as well here.
For example say you trained a TSNE map as follows:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_embedded = TSNE(n_components=2).fit_transform(X)
As seen in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Can you extract the transformation so that you can map new data into the same space:
Y = np.array([[0, 0.8, 0.8], [0.1, 0, 1], [1.2, 0.2, 1], [1, 1.1, 1]])
Any help on this matter would be greatly appreciated!
tSNE is a non-linear, non-parametric embedding.
So there is no "closed form" way of updating it with new points. Even worse: adding new points may require existing points to move.
Because of this, making t-SNE apply to new data will require substantial changes to the method; it won't be the original t-SNE anymore.
Parametric t-SNE has the option to embed test data, but it is not available in sklearn; see the referenced issue.
That said, it should be mentioned that it is implemented elsewhere, here.

Adding new points to the t-SNE model

I try to use t-SNE algorithm in the scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599,  0.00003993],   #1
       [ 0.00009891,  0.00021913],
       [ 0.00018554, -0.00009357],
       [ 0.00009528, -0.00001407]])  #2
After that I try to add some points with the coordinates exactly like in the first array X to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882,  0.00004002],   #1
       [ 0.00009546,  0.00022409]])  #2
But the coordinates in the second array are not equal to the first and last coordinates of the first array.
I understand that this is the right behaviour, but how can I add new coordinates to the model and get the same output coordinates for the same input coordinates?
Also, I still need to be able to find the closest points even after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
Also, this answer on stats.stackexchange.com contains further ideas, a link to a very nice and very fast recent Python implementation of t-SNE, https://github.com/pavlin-policar/openTSNE, which allows embedding of new points out of the box, and a link to https://github.com/berenslab/rna-seq-tsne/.
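As an illustration of the regressor workaround mentioned in the quote, here is a minimal sketch (KNeighborsRegressor is an arbitrary choice; any multi-output regressor would do, and the quality of the approximation is not guaranteed):
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
X_embedded = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

# Train a regressor that maps input-space coordinates to t-SNE map coordinates
mapper = KNeighborsRegressor(n_neighbors=2).fit(X, X_embedded)

# Approximate map locations for new points, without re-running t-SNE
X_new = np.array([[0, 0, 0], [1, 1, 1]])
print(mapper.predict(X_new))
The openTSNE library linked above takes a more principled approach and exposes a transform method on fitted embeddings for exactly this purpose.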
