difference between accuracy_score and jaccard_similarity_score - python

What is the difference between sklearn.metrics.jaccard_similarity_score and sklearn.metrics.accuracy_score?
1. When do we use accuracy_score?
2. When do we use jaccard?
3. I know the formula. Could someone explain the algorithm behind these metrics?
4. How can I calculate jaccard on my dataframes?
array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
thanks

The accuracy_score is straightforward, which is one of the reasons it is a common choice. It is the number of correctly classified samples divided by the total number of samples, so in your case:
from sklearn.metrics import jaccard_score, accuracy_score
print(a)
array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]])
accuracy_score(a[0,:], a[1,:])
# 0.25
Which is the same as doing:
(a[0,:] == a[1,:]).sum()/a.shape[1]
# 0.25
The jaccard_score is especially suited to certain problems, such as object detection. You can get a better understanding by taking a look at the Jaccard index, which is also known as intersection over union: the size of the intersection of two sample sets divided by the size of their union (the combined sample size minus the intersection).
Note that sklearn.metrics.jaccard_similarity_score is deprecated, and you should probably be looking at sklearn.metrics.jaccard_score instead. The latter has several averaging modes, depending on what you're most interested in. The default is binary, which you should change since you're dealing with multiple labels.
So depending on your application you'll be more interested in one or the other. If you aren't sure, I'd suggest going with the simpler of the two, the accuracy score.
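For illustration, here is a sketch of the averaging modes on the same array; the numbers in the comments are what the per-label Jaccard indices (0.0 for label 0, 0.25 for label 1, 0.0 for label 2) should work out to, so treat them as a worked example rather than authoritative output:
import numpy as np
from sklearn.metrics import jaccard_score

a = np.array([[1, 1, 1, 1, 2, 0, 1, 0],
              [2, 1, 1, 0, 1, 1, 0, 1]])

# 'macro' averages the per-label Jaccard indices over the three labels.
jaccard_score(a[0, :], a[1, :], average='macro')
# ~0.083

# 'weighted' weights each label's score by its support in the first row.
jaccard_score(a[0, :], a[1, :], average='weighted')
# ~0.156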

Related

How to derive odds-ratios and p-values to see if the difference is significant?

I have two different groups of samples: samples1 and samples2.
Moreover, I have 18 different elements, and for each element there is a corresponding score attained by using all samples of samples1 and of samples2, respectively.
e.g.:
score_samples1[0] means the score for the first element attained by using all samples of samples1.
score_samples2[0] means the score for the first element attained by using all samples of samples2.
Now, I want to compare the difference between these two sample groups and find out whether this difference is statistically significant.
As shown below, I have applied a t-test to get a p-value to assess the significance.
My question is as follows:
Is there a way to derive odds-ratios and p-values to see if the difference is significant?
from scipy import stats
score_samples1=[1.430442073, 1.347975371, 1.175088823, 1.20141693, 1.152665995, 1.105591463, 1.343297357, 1.251526193, 1.107442697, 1.302090741, 1.312426241, 1.24880381, 1.46855296, 1.12369795, 1.344426189, 1.24276685, 1.216269219, 1.172317535]
score_samples2=[1.663793448, 1.506660754, 1.387285644, 1.440433062, 1.367680224, 1.340102236, 1.632881551, 1.522894543, 1.137437101,1.581845495, 1.540401185, 1.549114159, 1.558038893, 1.392571495, 1.532717551, 1.451731862, 1.277597967, 1.336609308]
# Independent two-sample t-test, then Welch's version (unequal variances):
stats.ttest_ind(score_samples1, score_samples2)
stats.ttest_ind(score_samples1, score_samples2, equal_var=False)
Ttest_indResult(statistic=-5.03264933189511, pvalue=1.7512132919948795e-05)
# Paired t-test
stats.ttest_rel(score_samples1, score_samples2)
Ttest_relResult(statistic=-11.148411105604898, pvalue=3.0763665473016024e-09)
Assume that I categorize the scores as follows:
scores_ge_cutoff_samples1=[1 if x>=1.30 else 0 for x in score_samples1]
scores_ge_cutoff_samples2=[1 if x>=1.30 else 0 for x in score_samples2]
scores_ge_cutoff_samples1
[1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores_ge_cutoff_samples2
[1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
oddsratio, pvalue = stats.fisher_exact([[16, 2], [7, 11]])
pvalue
0.004510832141532924
oddsratio
12.571428571428571
Based on this analysis, we can conclude that the odds of a score >= 1.30 are about 12.57 times higher for samples2 than for samples1.
However, I was aiming to get an odds ratio for the difference between the samples1 and samples2 scores.
You need to read about experimental procedure. "Is this significant" is not something you decide with some computation afterwards; it's a critical parameter of your experimental design. You decide, before you do the experiment, just what level of significance you'll accept as confirming the hypothesis you chose.
A one-tailed t-test requires a hypothesis that, say, sample 1 is greater than sample 2.
A two-tailed t-test requires a hypothesis that sample 1 and sample 2 are from different distributions -- but not which would be greater than the other, just that they're different.
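In scipy terms, the two kinds of hypothesis look like this (a sketch; the alternative keyword assumes scipy >= 1.6):
from scipy import stats

# Two-tailed (default): H1 only says the two means differ.
stats.ttest_ind(score_samples1, score_samples2)

# One-tailed: H1 says the samples1 mean is smaller than the samples2 mean.
stats.ttest_ind(score_samples1, score_samples2, alternative='less')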
Since you've done both tests, it appears that you failed to design your experiment at all. For scientific integrity, you now have to start over, design your experiment, and re-run your samples.
On the other hand, you're in very good shape for a reasonable result. Typical confidence levels for a conclusion are 0.95, 0.98, and 0.99; these correspond to p-value thresholds of 0.05, 0.02, and 0.01, i.e. accepted error rates of 5%, 2%, and 1%, respectively.
Your p-values are far below even the most stringent of these (on the order of 1e-5 versus 1e-2), so you shouldn't have any trouble with that part. The code is quite simple -- something such as this:
t_score, prob = stats.ttest_ind(score_samples1, score_samples2)
if prob <= 0.01:
    print("The hypothesis is confirmed")
else:
    print("The hypothesis is not confirmed")

Turning off x% of ACTIVE bits in a binary numpy matrix as opposed to turning off x% of ALL bits

I have a binary numpy matrix and I want to randomly turn off 30% of the matrix, where turning off 30% means replacing 30% of the 1s with 0s. In general, I want to do this many times, so if I do this 5 times I expect the final matrix to have 100*(1-0.3)^5 ≈ 17% of its bits set to 1, given that the matrix is all 1s initially.
The important thing is that I want to turn off 30% of the active bits (the ones), as opposed to turning off 30% of the whole matrix (the ones and the zeros, where turning off zero is just zero).
I came up with a method to do this, but it doesn't seem to achieve the above: after 5 rounds of turning off 30%, the matrix is 23% 1s as opposed to ~17% 1s.
To illustrate through example, my approach is as follows:
>>> import numpy as np
>>> mask = np.array([[1,1,1,1,1],[1,1,1,0,0],[1,1,0,0,0]])
>>> mask
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0],
       [1, 1, 0, 0, 0]])
>>> np.where(mask==0, np.zeros_like(mask), mask * np.random.binomial(1, 0.7, mask.shape))
array([[1, 1, 1, 0, 0],
       [0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0]])
The code above gives a new matrix where if a bit is 0 it remains 0, and if it is 1 it is turned off 30% of the time.
In this small example everything seems to work fine, as I have removed exactly 30% of the ones (I had 10, now I have 7). But I don't think my approach generalizes well to a large matrix. I believe this is due to the following:
Although the Bernoulli trials are supposed to be independent of each other, numpy probably tries to ensure that, overall, 30% of all the trials are tails. But in my code "all trials" equals the size of the full matrix, not the number of ones in my matrix, and this is what causes the problem.
What is a clean, Pythonic way to nullify 30% of the active bits, as opposed to 30% of all the bits?
It seems I figured it out. Basically, I create a new zero matrix of the same size and only edit the fields where the initial matrix has 1s; the number of Bernoulli trials performed is then the number of 1s in the initial matrix, as opposed to the full size of the matrix, as desired.
>>> mask = np.array([[1,1,1,1,1],[1,1,1,0,0],[1,1,0,0,0]])
>>> new_mask = np.zeros_like(mask)
>>> new_mask[mask==1] = np.random.binomial(1, 0.7, mask[mask==1].shape) * mask[mask==1]
>>> new_mask
array([[0, 1, 0, 1, 1],
       [1, 1, 1, 0, 0],
       [1, 0, 0, 0, 0]])
This works for large matrices, and it does give me a final density of ~17% 1s (see the question body for context).
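For completeness: a Bernoulli mask only removes 30% of the ones in expectation, with some variance per draw. If you want to turn off exactly 30% of the active bits on every round, one sketch (turn_off_fraction is a made-up helper name) is to sample without replacement from the indices of the ones:
import numpy as np

def turn_off_fraction(mask, frac=0.3, rng=None):
    # Set exactly round(frac * number_of_ones) of the 1s in `mask` to 0.
    rng = np.random.default_rng() if rng is None else rng
    new_mask = mask.copy()
    ones = np.flatnonzero(mask)              # flat indices of the active bits
    n_off = int(round(frac * ones.size))     # how many of them to switch off
    off = rng.choice(ones, size=n_off, replace=False)
    new_mask.flat[off] = 0
    return new_mask

mask = np.array([[1,1,1,1,1],[1,1,1,0,0],[1,1,0,0,0]])
print(turn_off_fraction(mask))  # exactly 7 of the 10 ones survive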

Automatically detecting clusters in a 2d array/heatmap

I have run object detection on a video file and summed the seconds each pixel is activated, to find the amount of time an object is shown in each area; this gives me a 2d array of time values. Since these objects are in the same position of the video most of the time, some areas of the screen have much higher activation than others. Now I would like to find a way to automatically detect "clusters" without knowing the number of clusters beforehand. I have considered something like k-means, and have also read a little about finding local maxima, but I can't quite figure out how to put all this together or which method is best. Also, the objects vary in size, so I'm not sure the local-maximum method applies.
The final result would be a list of ids and the maximum time value for each cluster.
[[3, 3, 3, 0, 0, 0, 0, 0, 0],
 [3, 3, 3, 0, 0, 0, 2, 2, 2],
 [3, 3, 3, 0, 0, 0, 2, 2, 2],
 [0, 0, 0, 0, 0, 0, 2, 2, 2]]
From this example array I would end up with a list:
id | Seconds
1 | 3
2 | 2
I haven't tried much since I have no clue where to start, and any recommendations of methods, with code examples or links, would be greatly appreciated! :)
You could look at the different methods for clustering in: https://scikit-learn.org/stable/modules/clustering.html
If you do not know the number of clusters beforehand, you might want to use an algorithm other than k-means, i.e. one that does not depend on the number of clusters. I would suggest reading about DBSCAN and HDBSCAN for this task. Good luck :)
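Since the clusters in the example are contiguous blocks of active cells, a minimal sketch with connected-component labelling from scipy.ndimage (an alternative to the clustering algorithms above, assuming contiguity is what defines a cluster) would already produce the id/seconds list:
import numpy as np
from scipy import ndimage

heat = np.array([[3, 3, 3, 0, 0, 0, 0, 0, 0],
                 [3, 3, 3, 0, 0, 0, 2, 2, 2],
                 [3, 3, 3, 0, 0, 0, 2, 2, 2],
                 [0, 0, 0, 0, 0, 0, 2, 2, 2]])

labels, n = ndimage.label(heat > 0)           # one label per connected region
peaks = ndimage.maximum(heat, labels, index=range(1, n + 1))
for cluster_id, peak in zip(range(1, n + 1), peaks):
    print(cluster_id, int(peak))              # prints "1 3", then "2 2"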

Identifying Outliers in a Set of 1-D Binary Vectors in Python

I am investigating the optimal way to identify outlier vectors when you have m 1-D binary vectors with n features, for example:
a = [[1, 0, 1, 1, 1, 0, 1],
     [0, 0, 0, 1, 1, 1, 0],
     [0, 1, 1, 0, 0, 1, 1]]
In my case n and m are in the 100s. I would like to identify which vectors are outliers in the population. I have found some information on using the Mahalanobis distance in SciPy and on packages like HDBSCAN (note: I will be clustering these outliers after they are identified, to see if there are any further patterns in them). In both cases the examples are limited, and I also don't know whether these are the best methods to use with binary vectors. Any advice, examples, or references would be much appreciated.
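One possible starting point, sketched here under the assumption that the Hamming distance (the fraction of mismatching features) is a sensible metric for 0/1 vectors, is scikit-learn's LocalOutlierFactor; the random matrix below is only a stand-in for the real data:
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 100))   # stand-in for an m x n binary matrix

# The Hamming metric counts feature mismatches, which suits binary vectors.
lof = LocalOutlierFactor(n_neighbors=20, metric='hamming')
flags = lof.fit_predict(X)                # -1 marks an outlier, 1 an inlier
outlier_rows = np.flatnonzero(flags == -1)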

argrelextrema and flat extrema

The function argrelextrema from scipy.signal does not detect flat extrema.
Example:
import numpy as np
from scipy.signal import argrelextrema
data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
argrelextrema(data, np.greater)
(array([2]),)
The first maximum (2) is detected; the second, flat maximum (3, 3) is not.
Any workaround for this behaviour?
Thanks.
Short answer: argrelextrema will probably not be flexible enough for your task. Consider writing your own function matching your needs.
Longer answer: Are you bound to use argrelextrema? If yes, you can play around with the comparator and the order arguments of argrelextrema (see the reference).
For your easy example, it would be enough to choose np.greater_equal as the comparator.
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
>>> print(argrelextrema(data, np.greater_equal, order=1))
(array([2, 6, 7]),)
Note, however, that in this way
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 4, 1, 0 ])
>>> print(argrelextrema(data, np.greater_equal, order=1))
(array([2, 6, 8]),)
it behaves differently than you would probably like, finding both the first 3 and the 4 as maxima, since argrelextrema now treats everything that is greater than or equal to its two nearest neighbors as a maximum. You can use the order argument to decide how many neighbors this comparison must hold for; choosing order=2 changes the example above so that, of the flat region, only the 4 is found as a maximum.
>>> print(argrelextrema(data, np.greater_equal, order=2))
(array([2, 8]),)
There is, however, a downside to this. Let's change the data once more:
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 4, 1, 5 ])
>>> print(argrelextrema(data, np.greater_equal, order=2))
(array([ 2, 10]),)
Adding another peak as the last value keeps you from finding the peak at 4, as argrelextrema now sees a second neighbor that is greater than 4 (which can be useful for noisy data, but is not necessarily the behavior expected in all cases).
Using argrelextrema, you will always be limited to binary operations between a fixed number of neighbors. Note, however, that all argrelextrema does in your example above is return n if data[n] > data[n-1] and data[n] > data[n+1]. You could easily implement this yourself and then refine the rules, for example by checking the second neighbor in case the first neighbor has the same value; a sketch of that idea follows.
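A minimal hand-rolled sketch (flat_maxima is a made-up name, not a scipy function): report a point as a maximum when a strictly rising edge is followed, possibly after a plateau, by a strictly falling edge.
import numpy as np

def flat_maxima(data):
    # Indices of local maxima, where a plateau of equal values counts as a
    # single maximum, reported at its left edge.
    data = np.asarray(data)
    idx = []
    n = len(data)
    i = 1
    while i < n - 1:
        if data[i] > data[i - 1]:            # strictly rising edge
            j = i
            while j < n - 1 and data[j + 1] == data[i]:
                j += 1                       # walk along the plateau
            if j < n - 1 and data[j + 1] < data[i]:
                idx.append(i)                # strictly falling edge: a maximum
            i = j + 1
        else:
            i += 1
    return np.array(idx)

data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
print(flat_maxima(data))  # [2 6]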
For the sake of completeness, there seems to be a more elaborate function in scipy.signal, find_peaks_cwt. I have no experience using it, however, and can therefore not give you more details about it.
I'm really surprised that no one has figured out an answer to this. All you need to do is preprocess the array to remove duplicates that are located next to each other, and then you can run argrelextrema like so:
import numpy as np
from scipy.signal import argrelextrema

data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
# Mark every element that equals its left neighbour ...
filter_table = [False] + list(np.equal(data[:-1], data[1:]))
# ... and drop it, collapsing each plateau to a single value.
data = np.array([x for idx, x in enumerate(data) if not filter_table[idx]])
argrelextrema(data, np.greater)
# (array([2, 6]),) -- note that the indices refer to the de-duplicated array
