Automatically detecting clusters in a 2d array/heatmap - python

I have run object detection on a video file and summed the seconds each pixel is activated to find how long an object is shown in each area, which gives me a 2D array of time values. Since these objects are in the same position of the video most of the time, some areas of the screen end up with much higher activation than others. Now I would like to find a way to automatically detect "clusters" without knowing the number of clusters beforehand. I have considered using something like k-means, and I have also read a little about finding local maxima, but I can't quite figure out how to put all this together or which method is best to go with. Also, the objects vary in size, so I'm not sure the local-maximum method would work?
The final result would be a list of ids and maximum time value for each cluster.
[[3, 3, 3, 0, 0, 0, 0, 0, 0],
 [3, 3, 3, 0, 0, 0, 2, 2, 2],
 [3, 3, 3, 0, 0, 0, 2, 2, 2],
 [0, 0, 0, 0, 0, 0, 2, 2, 2]]
From this example array I would end up with a list:
id | Seconds
1 | 3
2 | 2
I haven't tried much since I have no clue where to start, and any recommendations of methods, with code examples or links to where I can find them, would be greatly appreciated! :)

You could look at different methods for clustering in: https://scikit-learn.org/stable/modules/clustering.html
If you do not know the number of clusters beforehand, you might want to use an algorithm other than k-means (one that does not depend on the number of clusters). I would suggest reading about DBSCAN and HDBSCAN for this task. Good luck :)
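As a starting point, here is a minimal sketch of the DBSCAN route (not from the original post; the eps and min_samples values are illustrative choices): it clusters the coordinates of the non-zero cells and then reports the maximum time value per cluster.
import numpy as np
from sklearn.cluster import DBSCAN

heat = np.array([[3, 3, 3, 0, 0, 0, 0, 0, 0],
                 [3, 3, 3, 0, 0, 0, 2, 2, 2],
                 [3, 3, 3, 0, 0, 0, 2, 2, 2],
                 [0, 0, 0, 0, 0, 0, 2, 2, 2]])

# Coordinates of the active (non-zero) pixels.
coords = np.column_stack(np.nonzero(heat))

# eps=1.5 links diagonally adjacent pixels; min_samples=1 keeps every active pixel.
labels = DBSCAN(eps=1.5, min_samples=1).fit_predict(coords)

# Maximum time value per cluster.
for cluster_id in np.unique(labels):
    rows, cols = coords[labels == cluster_id].T
    print(cluster_id + 1, heat[rows, cols].max())
On the example array this should print ids 1 and 2 with maximum values 3 and 2, matching the table in the question.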

Related

Finding FFT Coefficients With Python

I am a new user of Python. I have a signal that contains 16 data points.
For example:
a = [1, 2, 3, 4, 1, 1, 1, 1, 1, 1, 2, 3, 4, 1, 1]
I tried numpy.fft.fft, but I cannot figure out how to sum these frequencies and calculate the Fourier coefficients.
Thank you.
The numpy docs include a helpful example at the end of the page for np.fft.fft (https://numpy.org/doc/stable/reference/generated/numpy.fft.fft.html)
Basically, you want to use np.fft.fft(a) to transform your data, in tandem with np.fft.fftfreq(np.shape(a)[-1]) to figure out which frequencies your transform corresponds to.
Check out the docs for np.fft.fftfreq as well (https://numpy.org/doc/stable/reference/generated/numpy.fft.fftfreq.html#numpy.fft.fftfreq)
See here (https://dsp.stackexchange.com/questions/26927/what-is-a-frequency-bin) for a discussion on frequency bins and here (https://realpython.com/python-scipy-fft/) for a solid tutorial on scipy/numpy fft.
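Putting the two calls together, a small sketch (not from the original answer; the loop is just to show which coefficient belongs to which frequency):
import numpy as np

a = [1, 2, 3, 4, 1, 1, 1, 1, 1, 1, 2, 3, 4, 1, 1]

# Complex Fourier coefficients of the signal.
coeffs = np.fft.fft(a)

# Frequency (in cycles per sample) that each coefficient corresponds to.
freqs = np.fft.fftfreq(np.shape(a)[-1])

for f, c in zip(freqs, coeffs):
    print(f"{f:+.3f} cycles/sample:", np.round(c, 3))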

How to derive odds-ratios and p-values to see if the difference is significant?

I have two different groups of samples: samples1 and samples2.
Moreover, I have 18 different elements, and for each element there is a corresponding score attained by using all samples of samples1 and all samples of samples2, respectively.
e.g.:
score_samples1[0] means the score for the first element attained by using all samples of samples1.
score_samples2[0] means the score for the first element attained by using all samples of samples2.
Now, I want to compare the difference between these two sample groups and find out whether this difference is statistically significant.
As shown below, I have applied a t-test to get a p-value to assess the significance.
My question is as follows:
Is there a way to derive odds-ratios and p-values to see if the difference is significant?
from scipy import stats
import statistics
score_samples1=[1.430442073, 1.347975371, 1.175088823, 1.20141693, 1.152665995, 1.105591463, 1.343297357, 1.251526193, 1.107442697, 1.302090741, 1.312426241, 1.24880381, 1.46855296, 1.12369795, 1.344426189, 1.24276685, 1.216269219, 1.172317535]
score_samples2=[1.663793448, 1.506660754, 1.387285644, 1.440433062, 1.367680224, 1.340102236, 1.632881551, 1.522894543, 1.137437101,1.581845495, 1.540401185, 1.549114159, 1.558038893, 1.392571495, 1.532717551, 1.451731862, 1.277597967, 1.336609308]
stats.ttest_ind(score_samples1,score_samples2)
stats.ttest_ind(score_samples1,score_samples2, equal_var=False)
Ttest_indResult(statistic=-5.03264933189511, pvalue=1.7512132919948795e-05)
#Paired t-test
stats.ttest_rel(score_samples1,score_samples2)
Ttest_relResult(statistic=-11.148411105604898, pvalue=3.0763665473016024e-09)
Assume that I categorize the scores as follows:
scores_ge_cutoff_samples1=[1 if x>=1.30 else 0 for x in score_samples1]
scores_ge_cutoff_samples2=[1 if x>=1.30 else 0 for x in score_samples2]
scores_ge_cutoff_samples1
[1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores_ge_cutoff_samples2
[1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
oddsratio, pvalue = stats.fisher_exact([[16, 2], [7, 11]])
pvalue
0.004510832141532924
oddsratio
12.571428571428571
Based on this analysis, we can conclude that the odds of a score >= 1.30 are 12.57 times higher for samples2 than for samples1.
However, I was aiming a get an odds ratio for the difference between samples1 and samples2 scores.
You need to read about experimental procedure. "Is this significant" is not something you decide with some computation afterwards; it's a critical parameter of your experimental design. You decide, before you do the experiment, what level of significance you'll accept as confirming the hypothesis you chose.
A one-tailed t-test requires a hypothesis that, say, sample 1 is greater than sample 2.
A two-tailed t-test requires a hypothesis that sample 1 and sample 2 are from different distributions -- but not which would be greater than the other, just that they're different.
Since you've done both tests, it appears that you failed to design your experiment at all. For scientific integrity, you now have to start over, design your experiment, and re-run your samples.
On the other hand, you're in very good shape for a reasonable result. Typical thresholds for a conclusion are confidence levels of 95%, 98%, and 99%; these accept error rates of 5%, 2%, and 1% (p <= 0.05, 0.02, and 0.01), respectively.
Your p-values are far below even the most stringent of these (on the order of 1e-5 versus 1e-2), so you shouldn't have any trouble with that part. The code is quite simple -- something such as this:
from scipy import stats

t_score, prob = stats.ttest_ind(score_samples1, score_samples2)
if prob <= 0.01:
    print("The hypothesis is confirmed")
else:
    print("The hypothesis is not confirmed")

Turning off x% of ACTIVE bits in a binary numpy matrix as opposed to turning off x% of ALL bits

I have a binary numpy matrix and I want to randomly turn off 30% of the matrix, where turning off 30% means to replace 30% of the 1s with 0s. In general, I want to do this many times, so if I do this 5 times I expect the final matrix to have 100*(1-0.3)^5 = 16% 1s of the original matrix, which is all 1s initially.
The important thing is that I want to turn off 30% of the active bits (the ones), as opposed to turning off 30% of the whole matrix (the ones and the zeros, where turning off zero is just zero).
I came up with a method to do this, but it doesn't seem to achieve the above, because, after 5 sessions of turning off 30%, the matrix is 23% 1s as opposed to 16% 1s.
To illustrate through example, my approach is as follows:
>>> mask=np.array([[1,1,1,1,1],[1,1,1,0,0],[1,1,0,0,0]])
>>> mask
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]])
>>> np.where(mask==0, np.zeros_like(mask), mask * np.random.binomial(1, 0.7, mask.shape))
array([[1, 1, 1, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 0, 0]])
The code above gives a new matrix where if a bit is 0 it remains 0, and if it is 1 it is turned off 30% of the time.
In this small example, everything seems to work fine as I have removed exactly 30% of the ones (I had 10, now I have 7). But I don't think my approach generalizes well for a large matrix. I believe this is due to the following:
Although the Bernoulli trials are supposed to be independent of each other, numpy probably tries to ensure that overall, 30% of all the trials are Tails. But in my code "all trials" equals the size of the full matrix, and not the number of ones in my matrix, and this is what causes the problem.
What is a clean pythonic way to nullify 30% of the active bits, as opposed to 30% of all the bits?
It seems I figured it out. Basically, I create a new zero matrix of the same size and only edit the fields where the initial matrix has 1s, so the Bernoulli trial is performed once per 1 in the initial matrix rather than once per entry of the full matrix, as desired.
>>> mask=np.array([[1,1,1,1,1],[1,1,1,0,0],[1,1,0,0,0]])
>>> new_mask = np.zeros_like(mask)
>>> new_mask[mask==1] = np.random.binomial(1,0.7,mask[mask==1].shape)* mask[mask==1]
>>> new_mask
array([[0, 1, 0, 1, 1],
[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0]])
This works for large matrices and it does give me a final size of ~16% 1s (see question body for context).
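If you ever need to drop exactly 30% of the active bits in a round (rather than each active bit independently with probability 0.3), a possible variant (not from the original post) is to sample indices of the 1s without replacement:
import numpy as np

mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [1, 1, 0, 0, 0]])
rng = np.random.default_rng()

# Flat indices of the active bits (the 1s).
active = np.flatnonzero(mask)

# Choose exactly 30% of them (rounded down) and turn them off.
off = rng.choice(active, size=int(0.3 * active.size), replace=False)
new_mask = mask.copy()
new_mask.flat[off] = 0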

difference between accuracy_score and jaccard_similarity_score

What is the difference between sklearn.metrics.jaccard_similarity_score and sklearn.metrics.accuracy_score?
1. When do we use accuracy_score?
2. When do we use jaccard?
3. I know the formula. Could someone explain the algorithm behind these metrics?
4. How can I calculate jaccard on my dataframes?
array([[1, 1, 1, 1, 2, 0, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
thanks
The accuracy_score is straightforward, which is one of the reasons why it is a common choice. It's the number of correctly classified samples divided by the total, so in your case:
from sklearn.metrics import jaccard_score, accuracy_score
print(a)
array([[1, 1, 1, 1, 2, 0, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1]])
accuracy_score(a[0,:], a[1,:])
# 0.25
Which is the same as doing:
(a[0,:] == a[1,:]).sum()/a.shape[1]
# 0.25
The jaccard_score is especially suited to certain problems, such as object detection. You can get a better understanding by looking at the Jaccard index, also known as intersection over union: the size of the intersection of two sample sets divided by the size of their union (the combined sample size minus the intersection).
Note that sklearn.metrics.jaccard_similarity_score is deprecated, and you should probably be looking at sklearn.metrics.jaccard_score. The latter has several averaging modes, depending on what you're most interested in. The default is binary, which you should change since you're dealing with multiple labels.
So depending on your application you'll be more interested in one or the other. Though if you aren't sure I'd suggest you to go with the simpler, which is the accuracy score.
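For example, a short sketch (not from the original answer; average='macro' is just one reasonable choice for these multi-class labels):
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score

a = np.array([[1, 1, 1, 1, 2, 0, 1, 0],
              [2, 1, 1, 0, 1, 1, 0, 1]])

# Fraction of positions where the two rows agree.
print(accuracy_score(a[0, :], a[1, :]))   # 0.25

# Per-class Jaccard indices, averaged over the classes 0, 1 and 2.
print(jaccard_score(a[0, :], a[1, :], average='macro'))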

Identifying Outliers in a Set of 1-D Binary Vectors in Python

I am investigating the optimal way to identify outlier vectors when you have m 1-D binary vectors with n features, for example:
a = [[1, 0, 1, 1, 1, 0, 1],
     [0, 0, 0, 1, 1, 1, 0],
     [0, 1, 1, 0, 0, 1, 1]]
In my case n and m are in the hundreds. I would like to identify which vectors are outliers in the population. I have found some information on using Mahalanobis distance in SciPy and packages like HDBSCAN (note: I will be clustering these outliers after they are identified to see if there are any further patterns in them). In both cases the examples are limited, and I also don't know if this is the best method to use with binary vectors. Any advice, examples, or references would be much appreciated.
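Not an answer from the thread, but as one possible starting point: a sketch (assuming scikit-learn is available) that scores each vector by its mean Hamming distance to all the others, so the highest-scoring vectors are the outlier candidates:
import numpy as np
from sklearn.metrics import pairwise_distances

a = np.array([[1, 0, 1, 1, 1, 0, 1],
              [0, 0, 0, 1, 1, 1, 0],
              [0, 1, 1, 0, 0, 1, 1]])

# Pairwise Hamming distances: fraction of features in which two vectors differ.
dist = pairwise_distances(a, metric='hamming')

# Mean distance of each vector to all the others; larger means more atypical.
scores = dist.sum(axis=1) / (len(a) - 1)
print(np.argsort(scores)[::-1])   # vectors ordered from most to least outlier-like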
