Two-sample Kolmogorov-Smirnov Test in Python Scipy on brain data - python

I have brain anatomy measurements from 2 different groups of individuals. One group has more individuals than the other (say n and m individuals each). I have to run the KS test on this data. I am a little unclear about the arguments to pass to the scipy two-sample KS test. Should the arguments be every individual from group 1 against every individual in group 2 in a for loop? Or every feature in group 1 against every other feature in group 2?
I wrote this code but it's obviously wrong, as I am using iteritems() to loop over the columns when perhaps it should be n*m comparisons?
for item1, item2 in zip(group1.transpose().iteritems(),
                        group2.transpose().iteritems()):
    value, pvalue = ks_2samp(np.array(item1[1]), np.array(item2[1]))
    print(value, pvalue)
    if pvalue > 0.05:
        print('Samples are likely drawn from the same distribution '
              '(fail to reject H0)')
    else:
        print('Samples are likely drawn from different distributions '
              '(reject H0)')

Let's say one of the measurements is brain mass. Gather all the brain mass measurements for group 1 into a sequence (or 1-d array), and do the same for group 2. Pass these two sequences to ks_2samp. That will test whether the brain masses of the two groups come from the same distribution.
For example, if group1 and group2 are Pandas DataFrames with a row for each individual and with columns for the different measurements associated with each individual, including one called "mass" for brain mass, you would do:
value, pvalue = ks_2samp(group1['mass'].to_numpy(), group2['mass'].to_numpy())
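If you want to repeat that per-measurement comparison for every column, a minimal sketch could look like the following (the column names and the generated data are assumptions for illustration; in practice group1 and group2 are your real DataFrames with one row per individual):

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical example frames with two measurements each.
rng = np.random.default_rng(0)
group1 = pd.DataFrame({'mass': rng.normal(1350, 100, 60),
                       'volume': rng.normal(1200, 90, 60)})
group2 = pd.DataFrame({'mass': rng.normal(1400, 100, 45),
                       'volume': rng.normal(1210, 95, 45)})

# One KS test per measurement: group1's column against group2's column.
for col in group1.columns:
    stat, pvalue = ks_2samp(group1[col].to_numpy(), group2[col].to_numpy())
    print(f'{col}: KS statistic={stat:.3f}, p-value={pvalue:.3f}')

Note that the two samples may have different lengths (n and m), which ks_2samp handles directly; no n*m loop over individuals is needed.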

Related

randomisation in python within a group for A/B test

I have a python list of size 67 with three unique values with following distribution
A - 20 occurrences randomly spread in the list
B - 36 occurrences randomly spread in the list
C - 11 occurrences randomly spread in the list
I want to perform random selection at 10% within each group to perform a special treatment on the values selected from randomisation.
Based on the occurrences in the list shown above, 2 treatments for group A, 3 treatments for B and 1 treatment for C should have been done.
Selection for treatment need not be done exactly on the 10th occurrence of a value but the ratio of treatment to values should be maintained at approximately 10%.
Right now, I have this code
import random
if random.random() <= 0.1:
    # do something (apply the special treatment)
Using this code doesn't give me the required number of treatments at a group level; instead it randomly picks treatments across all groups. I want to constrain the random selection at a group level. How do I do that?
Also, if this list were dynamic and kept getting bigger and bigger, populated with more values of A, B, and C at run time (albeit with different distributions), how can I still maintain the randomisation at a group (unique value in the list) level?
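A minimal sketch of one way to do the per-group selection (the list contents, the variable name values, and the 10% rate are assumptions for illustration, not the asker's actual data):

import random
from collections import defaultdict

values = ['A'] * 20 + ['B'] * 36 + ['C'] * 11   # hypothetical list of 67 labels
random.shuffle(values)

# Group the indices of each unique value, then sample ~10% within each group.
groups = defaultdict(list)
for idx, label in enumerate(values):
    groups[label].append(idx)

selected = []
for label, indices in groups.items():
    k = max(1, int(0.1 * len(indices)))   # ~10% per group, at least one
    selected.extend(random.sample(indices, k))

print(sorted(selected))   # indices chosen for the special treatment

For the dynamic case, one option is to keep a per-label count of seen vs. treated items and only treat a new item when treated/seen for that label would stay at or below roughly 10%.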

How to iterate pandas dataframe without memory error

I have a CSV file with 140K rows and I am working with the pandas library.
The problem is that I have to compare each row with every other row, and it is taking too much time.
At the same time, I am creating another column where I append data for each row based on the comparison, and there I get a memory error.
What is the optimal solution, at least for the memory error?
I am working with 12 GB of RAM on Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now I need to find the distance from each row to every other row, and if the distance is within a certain threshold, I assign a new ID to the two rows that are within that distance. So, in my case, if ID 1 and ID 2 are within the threshold I assign a to both, and if ID 2 and ID X are within the threshold I assign b as a new matched ID, like below:
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For the distance I am using √((xi − xj)² + (yi − yj)²).
Threshold can be anything. Say m unit.
This reads like you attempt to hold the complete square distance matrix in memory, which obviously doesn't scale very well, as you have noticed.
I'd suggest you read up on how DBSCAN clustering approaches the problem, compared to, e.g., hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, they seem to
put the data into a spatial database (for efficient neighborhood queries with a threshold) and then
iterate the points to identify the neighbors and the relevant distances on the fly.
Unfortunately I can't point you to readily available code or pandas functionality to support this though.
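For illustration only, a rough sketch of that neighborhood-query idea using scipy.spatial.cKDTree (my own choice of tool, not something the original answer referenced; the DataFrame and threshold are made up) might look like:

import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical data; in practice df comes from the 140K-row CSV.
df = pd.DataFrame({'ID': [1, 2, 3],
                   'x_coordinate': [2, 3, 50],
                   'y_coordinate': [3, 4, 60]})
threshold = 2.0   # the "m units" from the question

tree = cKDTree(df[['x_coordinate', 'y_coordinate']].to_numpy())
pairs = tree.query_pairs(r=threshold)   # set of (i, j) index pairs within the threshold

# Give every close pair its own matched id and collect them per row,
# without ever building the full 140K x 140K distance matrix.
matched = [[] for _ in range(len(df))]
for new_id, (i, j) in enumerate(sorted(pairs)):
    matched[i].append(new_id)
    matched[j].append(new_id)
df['Matched ID'] = matched
print(df)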

Python - NaN values in df.corr

I am finishing a piece of work and I am trying to check the correlation between some variables.
Basically, I have data from the survivors of an incident and I want to know the correlation between the other variables and their survival.
So I have the main df with all the information, then:
# creating a df for those who did not survive (0) and another for those who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")

df_s0.corr()
Based on correlation formula:
cor(a,b) = cov(a,b)/(stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between the two is not defined (the division by zero produces NaNs).
In your example, the Survived column of df_s0 is constant (all zeros) and hence correlation is undefined for this column with other columns.
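As a small illustration (made-up data, not the asker's DataFrame), a constant column produces NaN entries in the correlation matrix:

import pandas as pd

df_s0 = pd.DataFrame({'Survived': [0, 0, 0, 0],          # constant: zero variance
                      'Age':      [22, 38, 26, 35],
                      'Fare':     [7.25, 71.28, 7.93, 53.10]})

print(df_s0.corr())
# The 'Survived' row and column are all NaN because its standard deviation is 0.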
If you want to figure out the relationship between a discrete variable (Survived) and the rest of your features, you can look at the box plots (to be able to compare different statistics like mean, IQR,...) of your features across different groups of Survived 0 and 1. If you want to go a step further you can use ANOVA to characterize the importance of your features based on their variance within and across different groups!

How to find the alignment of two data sets in pandas

Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross-correlation?
I would first convert the string to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using the string's replace method (e.g. signal.replace("-", "1")).
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your two signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths.
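Putting those pieces together, a minimal end-to-end sketch (the signal strings come from the question; the lag-to-offset conversion assumes mode "full") could be:

import numpy as np

signal1 = "_-__-_--___----_-__--_-__---__"   # DS1.event
signal2 = "__--_-__--"                       # DS2.event

# Encode the event strings as 0/1 arrays.
event1 = np.array([1 if c == "-" else 0 for c in signal1])
event2 = np.array([1 if c == "-" else 0 for c in signal2])

# Cross-correlate and find the lag with the highest match score.
xcor = np.correlate(event1, event2, "full")
best = np.argmax(xcor)

# With mode "full", index len(event2) - 1 corresponds to zero offset,
# so the best starting position of DS2 inside DS1 is:
offset = best - (len(event2) - 1)
print("max correlation:", xcor[best], "at offset", offset)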

Correlation of two samples with replicates

I have expression values (log2) for 200 genes in two conditions, treated and untreated, and for each condition I have 20 replicates. I want to calculate the correlation between the two conditions for each gene and rank the genes from highest to lowest.
This is more of a biostats problem, but I still think it is an important one for biologists/bio-programmers, as many of us encounter it.
The dataset looks like this:
Gene UT1 UT2 T1 T2
DDR1 8.111795978 7.7606511867 7.9362235824 7.5974674936
RFC2 10.2418824097 9.7752152714 10.0085488406 9.5723427524
HSPA6 6.5850239731 6.7916563534 6.6883401632 7.3659252344
PAX8 9.2965160827 9.2031177653 9.249816924 8.667772504
GUCA1A 5.4828021059 5.3797749957 5.4312885508 5.1297319374
I have shown only two replicates for each sample in the sample data.
I am looking for a solution in R or python.
The cor function in R does not give me what I want.
If I understand your question correctly, you need to calculate the correlation between UT1 and T1 and between UT2 and T2 for all the genes.
There is a way to do it in R:
df <- data.frame(Gene = c("DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
                 UT1 = c(8.111796, 10.241882, 6.585024, 9.296516, 5.482802),
                 UT2 = c(7.760651, 9.775215, 6.791656, 9.203118, 5.379775),
                 T1  = c(7.936224, 10.008549, 6.688340, 9.249817, 5.431289),
                 T2  = c(7.597467, 9.572343, 7.365925, 8.667773, 5.129732))
make a matrix like this :
mat1 <- cbind(df$UT1, df$T1)
initialize a correlation matrix:
cor1 <- matrix(0, length(df$Gene), length(df$Gene))
then calculate the correlation of all genes against all genes like this:
for (i in 1:length(df$Gene)) cor1[i, ] <- apply(mat1, 1, function(x) cor(x, mat1[i, ]))
I hope this helps.
All sources I've read indicate that you need to create an average measure for each replicate. I've seen both mean and median used, although you may want to look into more advanced pre-processing/normalization methods (like RMA). Once you've done that you can calculate the correlation between untreated and treated.
There is no way to calculate correlation in the way that you're looking for. Any method that would do so will ultimately boil down to summarizing the information across the two conditions through getting a summary probe measure across the replicates (as above).
Alternatively you could do something like calculate the correlation between each treated and untreated replicate for each probe, and take the average correlation.
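For completeness, a rough pandas sketch of that per-gene idea (the column layout and the pairing of replicate i in UT with replicate i in T are assumptions; the pairing only makes sense if the replicates really are paired):

import numpy as np
import pandas as pd

# Hypothetical expression matrix: one row per gene, 20 UT and 20 T replicate columns.
rng = np.random.default_rng(1)
genes = [f"gene{i}" for i in range(200)]
ut = pd.DataFrame(rng.normal(8, 1, (200, 20)), index=genes,
                  columns=[f"UT{i + 1}" for i in range(20)])
t = pd.DataFrame(rng.normal(8, 1, (200, 20)), index=genes,
                 columns=[f"T{i + 1}" for i in range(20)])

# Pearson correlation between the UT and T replicate vectors of each gene, ranked.
per_gene_cor = pd.Series(
    {g: np.corrcoef(ut.loc[g], t.loc[g])[0, 1] for g in genes}
).sort_values(ascending=False)

print(per_gene_cor.head())   # genes ranked from highest to lowest correlation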
Assuming that the first column contains the gene names (used as row names) and that the rest of your data contains only numeric values, you can simply do the following in R, which will give you an n x n matrix with all pairwise correlations between genes. Note that cor works column-wise, so the data needs to be transposed to correlate genes (rows):
cor(t(data))
You may want to specify what type of correlation you want to use... What is the length of the time-series? There are whole studies developed to address the issue of selecting an appropriate measure, e.g., see:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, and Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013.
