I have expression values (log2) for 200 genes in two conditions, treated and untreated, and for each condition I have 20 replicates. I want to calculate the correlation between the two conditions for each gene and rank the genes from highest to lowest.
This is more of a biostats problem, but I still think it is an important one for biologists/bio-programmers, since many of us encounter it.
The dataset looks like this:
Gene UT1 UT2 T1 T2
DDR1 8.111795978 7.7606511867 7.9362235824 7.5974674936
RFC2 10.2418824097 9.7752152714 10.0085488406 9.5723427524
HSPA6 6.5850239731 6.7916563534 6.6883401632 7.3659252344
PAX8 9.2965160827 9.2031177653 9.249816924 8.667772504
GUCA1A 5.4828021059 5.3797749957 5.4312885508 5.1297319374
I have shown only two replicates per condition in the sample data.
I am looking for a solution in R or Python.
The cor function in R does not give me what I want.
If I understand your question correctly, you need to calculate the correlation between UT1 and T1, and between UT2 and T2, for all the genes.
There is a way to do it in R:
df <- data.frame(Gene = c("DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
                 UT1  = c(8.111796, 10.241882, 6.585024, 9.296516, 5.482802),
                 UT2  = c(7.760651, 9.775215, 6.791656, 9.203118, 5.379775),
                 T1   = c(7.936224, 10.008549, 6.688340, 9.249817, 5.431289),
                 T2   = c(7.597467, 9.572343, 7.365925, 8.667773, 5.129732))
Make a matrix of the replicate pair (note that the data frame is called df, not file):
mat1 <- cbind(df$UT1, df$T1)
Initialize a correlation matrix:
cor1 <- matrix(0, length(df$Gene), length(df$Gene))
Then calculate the correlation of all genes against all genes like this (indexing rows by position rather than by gene name, since mat1 has no row names):
for (i in 1:length(df$Gene)) cor1[i, ] <- apply(mat1, 1, function(x) cor(x, mat1[i, ]))
I hope this helps.
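If you go the Python route instead, a per-gene version of the original question could look like the sketch below. The column names and the randomly generated values are assumptions for illustration; with 20 replicates per condition, each gene gets one correlation value computed across its replicate pairs, which can then be ranked:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 5 genes x 40 columns (UT1..UT20, T1..T20); values are
# made-up log2 expression levels standing in for the real data.
rng = np.random.default_rng(0)
genes = ["DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"]
ut_cols = [f"UT{i}" for i in range(1, 21)]
t_cols = [f"T{i}" for i in range(1, 21)]
df = pd.DataFrame(rng.normal(8, 1, (5, 40)), index=genes,
                  columns=ut_cols + t_cols)

ut = df[ut_cols].to_numpy()
t = df[t_cols].to_numpy()

# Row-wise Pearson correlation: for each gene, correlate its 20 untreated
# values with its 20 treated values (this assumes replicate i of UT pairs
# with replicate i of T).
ut_c = ut - ut.mean(axis=1, keepdims=True)
t_c = t - t.mean(axis=1, keepdims=True)
r = (ut_c * t_c).sum(axis=1) / np.sqrt((ut_c ** 2).sum(axis=1) *
                                       (t_c ** 2).sum(axis=1))

# Rank the genes from highest to lowest correlation.
ranked = pd.Series(r, index=df.index).sort_values(ascending=False)
print(ranked)
```

Note that this only makes sense if the replicates are actually paired; with unpaired replicates you would summarize each condition first, as the other answers suggest.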
All sources I've read indicate that you need to create an average measure across the replicates. I've seen both the mean and the median used, although you may want to look into more advanced pre-processing/normalization methods (like RMA). Once you've done that, you can calculate the correlation between untreated and treated.
There is no way to calculate correlation in exactly the way you're looking for. Any method that did so would ultimately boil down to summarizing the information across the two conditions by computing a summary probe measure across the replicates (as above).
Alternatively, you could calculate the correlation between each treated and untreated replicate pair for each probe and take the average correlation.
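As a minimal sketch of the summarizing approach in Python (values rounded from the question's sample data; the column names are assumptions): average each condition's replicates per gene, then take one correlation between the two condition summaries across genes.

```python
import pandas as pd

# Toy data: rows are genes, columns are two replicates of each condition.
df = pd.DataFrame({
    "UT1": [8.11, 10.24, 6.59, 9.30, 5.48],
    "UT2": [7.76, 9.78, 6.79, 9.20, 5.38],
    "T1":  [7.94, 10.01, 6.69, 9.25, 5.43],
    "T2":  [7.60, 9.57, 7.37, 8.67, 5.13],
}, index=["DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"])

# Summarize each condition by the mean across its replicates...
ut_mean = df[["UT1", "UT2"]].mean(axis=1)
t_mean = df[["T1", "T2"]].mean(axis=1)

# ...then a single correlation between the two summaries across genes.
r = ut_mean.corr(t_mean)
print(round(r, 3))
```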
Assuming that the first column holds the row names (the gene names), i.e., that the remaining data contains only numeric values, you can simply do the following in R. Since cor correlates columns, transpose first to get an n x n matrix with all pairwise correlations between genes:
cor(t(data))
You may want to specify what type of correlation you want to use... What is the length of the time series? There are whole studies devoted to selecting an appropriate measure, e.g., see:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
I have this dataframe. I would like to find a way to make a correlation matrix between each hour and the same hour of the day before (for example, H01 of 28/09 vs H01 of 27/09).
I thought about two different approaches:
1) Take the correlation matrix of the transposed dataframe.
dft = df.transpose()
dft.corr()
2) Create a copy of the dataframe lagged by 1 day/row and then use .corrwith() to compare them.
With the first approach I obtain weird results (for example, rows like 634 and 635 come out with low correlation even though their values are very similar); with the second approach I obtain all ones. Ideally, I am actually looking for the correlation between days close to each other. Send help please.
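For what it's worth, the second approach can be sketched like this (the index, column names, and random values are assumptions): pairing each row with the previous day's row via shift(1) before calling corrwith avoids comparing a column with itself, which is what produces an all-ones result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row per day, one column per hour (H01..H24).
rng = np.random.default_rng(1)
idx = pd.date_range("2021-09-01", periods=30, freq="D")
df = pd.DataFrame(rng.normal(size=(30, 24)), index=idx,
                  columns=[f"H{h:02d}" for h in range(1, 25)])

# Shift by one day so each row is paired with the previous day's row,
# then correlate column by column (H01 today vs H01 yesterday, etc.).
lagged = df.shift(1)
corr_by_hour = df.corrwith(lagged)
print(corr_by_hour)
```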
I am finishing a project and I am trying to check the correlation between some variables.
Basically, I have data on the survivors of an incident, and I want to know how the other variables correlate with survivability.
So, I have the main df with all the information, then:
# create a df listing those who did not survive (0) and another listing those who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")
df_s0.corr()
Based on the correlation formula:
cor(a, b) = cov(a, b) / (stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between the two is not defined (the division by zero produces NaNs).
In your example, the Survived column of df_s0 is constant (all zeros), and hence its correlation with the other columns is undefined.
If you want to figure out the relationship between a discrete variable (Survived) and the rest of your features, you can look at the box plots (to be able to compare different statistics like mean, IQR,...) of your features across different groups of Survived 0 and 1. If you want to go a step further you can use ANOVA to characterize the importance of your features based on their variance within and across different groups!
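A minimal reproduction of the NaN behavior, with made-up column names and values, followed by the kind of group comparison suggested above:

```python
import pandas as pd

# After filtering on Survived, that column is constant, so its variance is
# zero and every correlation involving it is NaN.
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 0],
    "Age":      [22, 38, 26, 35, 28],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05],
})

df_s0 = df.query("Survived == 0")
print(df_s0.corr())  # the Survived row and column come out as NaN

# More useful: correlate the features with Survived on the full frame,
# or compare group statistics directly.
print(df.corr()["Survived"])
print(df.groupby("Survived")[["Age", "Fare"]].mean())
```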
I have brain anatomy measurements from 2 different groups of individuals. One group has more individuals than the other (say n and m individuals each). I have to run the KS test on this data. I am a little unclear about the arguments to pass to the scipy two-sample KS test. Should I pass every individual from group 1 against every individual in group 2 in a for loop? Or is it every feature in group 1 against every other feature in group 2?
I wrote this code, but it's obviously wrong, as I am using iteritems() to loop over the columns when perhaps it should be n*m comparisons?
for item1, item2 in zip(group1.transpose().iteritems(),
                        group2.transpose().iteritems()):
    value, pvalue = ks_2samp(np.array(item1[1]), np.array(item2[1]))
    print(value, pvalue)
    if pvalue > 0.05:
        print('Samples are likely drawn from the same distribution '
              '(fail to reject H0)')
    else:
        print('Samples are likely drawn from different distributions '
              '(reject H0)')
Let's say one of the measurements is brain mass. Gather all the brain mass measurements for group 1 into a sequence (or 1-d array), and do the same for group 2. Pass these two sequences to ks_2samp. That will test whether the brain masses of the two groups come from the same distribution.
For example, if group1 and group2 are Pandas DataFrames with a row for each individual and with columns for the different measurements associated with each individual, including one called "mass" for brain mass, you would do:
value, pvalue = ks_2samp(group1['mass'].to_numpy(), group2['mass'].to_numpy())
I have 140 CSV files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total, I have 138,000 * 138,000 (= 19+ billion) rows, one for every possible bilateral combination of IDs, divided across these 140 files.)
Research question: given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in pandas) where:
1. the researcher inputs a distance,
2. the program looks over all the files and keeps the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance,
3. outputs a single dataframe (DISTANCE can be dropped at this point),
4. merges the dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point),
5. collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables).
(In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for ID and 20 columns for the different treatment types. So, for example, I will be able to answer the question, "Within 2000 meters, how many neighbors of ID 500 are in the 'treatment_media' category?")
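A sketch of that pipeline in pandas, reading each large file in chunks so it never has to fit in memory. The file names, column names, and chunk size here are assumptions for illustration:

```python
import glob
import pandas as pd

def neighbor_counts(distance_cutoff, pattern="distances_*.csv",
                    treatment_file="treatment_data.csv", chunksize=5_000_000):
    """Count, per ID_FROM, neighbors of each treatment type within a cutoff.

    File names, column names, and the chunk size are illustrative assumptions.
    """
    treatment = pd.read_csv(treatment_file)  # columns: ID + 20 treatment flags
    treat_cols = [c for c in treatment.columns if c != "ID"]
    pieces = []
    for path in glob.glob(pattern):
        for chunk in pd.read_csv(path, chunksize=chunksize,
                                 usecols=["ID_FROM", "ID_TO", "DISTANCE"]):
            # Keep only pairs closer than the cutoff, then attach treatments.
            near = chunk.loc[chunk["DISTANCE"] < distance_cutoff,
                             ["ID_FROM", "ID_TO"]]
            merged = near.merge(treatment, left_on="ID_TO", right_on="ID")
            pieces.append(merged.groupby("ID_FROM")[treat_cols].sum())
    if not pieces:
        return pd.DataFrame(columns=treat_cols)
    # Partial sums from each chunk/file are combined by one final group-by.
    return pd.concat(pieces).groupby(level=0).sum()
```

Each chunk is filtered, merged, and aggregated immediately, so only the small per-chunk partial sums are kept around until the final group-by combines them.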
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of the observations. But sometimes he/she will check longer distances too, so I have to keep all the observations available. Otherwise, I could simply have filtered out DISTANCE greater than 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted by ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows by DISTANCE and index them with some kind of dictionary system, then the program would not need to go over all 140 files; most of the time, the researcher will only be looking at the bottom 2 percent of the DISTANCE range. It seems like a colossal waste of time to loop over all 140 files. But this is a secondary thought. Please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata; it takes 11+ hours to complete the task, which is not acceptable, as the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
CREATE TABLE Distances (
    DistanceId INTEGER PRIMARY KEY,
    IdFrom     TEXT,
    IdTo       TEXT,
    Distance   INTEGER
);

CREATE INDEX idx_distances_from ON Distances(IdFrom, Distance);

CREATE TABLE TreatmentData (
    TreatmentId   INTEGER PRIMARY KEY,
    Id            TEXT,
    TreatmentType TEXT
);

CREATE INDEX idx_treatment_id ON TreatmentData(Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = '500'
  AND d.Distance <= 2000
  AND td.TreatmentType = 'treatment_media'
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
Presented as an example.
Two data sets. One collected over a 1-hour period; one collected over a 20-minute period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most likely (top x many) to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross-correlation?
I would first convert the strings to a numeric representation, so replace your - and _ with 1 and 0.
You can do that with the string replace method (e.g. signal = signal.replace("-", "1").replace("_", "0")).
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross-correlation value at each time lag. You just want to find the largest value, and the lag at which it happens:
nR = max(xcor)
maxIndex = np.argmax(xcor)  # I imported numpy as np here
maxLag = maxIndex - (len(event2) - 1)  # in "full" mode, index 0 corresponds to lag -(len(event2) - 1)
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. The lag essentially tells you how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your 2 signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths.
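Putting the pieces together on the toy strings from the question, and converting the argmax index into a signed offset (in "full" mode, index 0 corresponds to DS2 sliding in from the left of DS1). One caveat not handled here: since DS2 was sampled at roughly 30 Hz and DS1 at 1 Hz, you would want to resample DS2 down to DS1's rate first, so that one array element means the same amount of time in both signals.

```python
import numpy as np

# Toy signals from the question: '-' means the event is on, '_' means off.
ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

event1 = np.array([1 if c == "-" else 0 for c in ds1])
event2 = np.array([1 if c == "-" else 0 for c in ds2])

# Full cross-correlation and the signed lag axis that goes with it.
xcor = np.correlate(event1, event2, "full")
lags = np.arange(-(len(event2) - 1), len(event1))

# Best alignment: the lag with the highest overlap count.
best = np.argmax(xcor)
print("best match value:", xcor[best], "at offset", lags[best])
```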