I am finishing a project and I am trying to check the correlation between some pieces of information.
Basically, I have data on the survivors of an incident and I want to know how the other variables correlate with survival.
So, I have the main df with all the information, then:
# creating one df for passengers who did not survive (0) and another for those who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")
df_s0.corr()
Based on correlation formula:
cor(a,b) = cov(a,b)/(stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between those two is not defined (the division by zero produces NaN).
In your example, the Survived column of df_s0 is constant (all zeros), and hence its correlation with every other column is undefined.
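You can see this directly (a minimal sketch with made-up numbers; Age and Fare just stand in for your other columns):

import pandas as pd

# toy frame playing the role of df_s0: Survived is constant, so it has zero variance
df_s0 = pd.DataFrame({
    "Survived": [0, 0, 0, 0],
    "Age":      [22, 35, 41, 19],
    "Fare":     [7.25, 8.05, 13.0, 9.5],
})

print(df_s0.corr())
# The Survived row and column come back as NaN because std(Survived) is 0,
# while Age and Fare still correlate with each other normally.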
If you want to examine the relationship between a discrete variable (Survived) and the rest of your features, you can look at box plots of each feature across the Survived = 0 and Survived = 1 groups (so you can compare statistics like the mean, IQR, ...). If you want to go a step further, you can use ANOVA to gauge the importance of your features based on their variance within and across the groups!
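For example (a rough sketch, assuming your df has a numeric column such as Age; swap in whichever feature you care about):

import matplotlib.pyplot as plt
from scipy import stats

# box plot of one numeric feature split by the two Survived groups
df.boxplot(column="Age", by="Survived")
plt.show()

# one-way ANOVA for the same feature: a small p-value suggests the group means differ
f_stat, p_value = stats.f_oneway(
    df.loc[df["Survived"] == 0, "Age"].dropna(),
    df.loc[df["Survived"] == 1, "Age"].dropna(),
)
print(f_stat, p_value)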
The problem is that I am trying to take a specific row I choose and calculate what percentage its value is away from the intended output's mean (which is already calculated from another column), i.e. what percentage it deviates from that mean.
I want to run each item individually like so:
Below I made a dataframe column to store the result
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean'])/df['ab roller']['mean']) * 100
For example, let's say the mean is 10 and I know that the item is 8 dollars: figure out what percentage away from the mean that product is, and return that number for each item of the dataset.
Keep in mind, the problem is not solved by a loop, because I am sure pandas has something more practical to calculate the % difference (not pct_change).
I also thought that maybe, to get at a very specific row, I could make a column to serve as an index, use it to access any row within the columns, and from that indexing do whatever operation I want, for example calculating the percentage difference between two rows.
I thought maybe of indexing by the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
"""
Calculating the percentage difference between two specific rows in a dataframe
"""
return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
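If the goal is just "how far is each price from one precomputed mean", a vectorized column assignment does it without any loop (a sketch with hypothetical column names, since I don't know your exact layout):

import pandas as pd

# hypothetical data: one price per item
df = pd.DataFrame({"item":  ["ab roller", "yoga mat", "kettlebell"],
                   "price": [8.0, 12.0, 10.0]})

mean_price = df["price"].mean()   # or a mean you have already computed elsewhere (here 10.0)

# one percentage per row: the 8-dollar item comes out as -20%
df["pct difference"] = (df["price"] - mean_price) / mean_price * 100
print(df)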
In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to obtain them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the high indices of the array.
For instance, at time t = 136 I have only 4 valid values. Then array[t, i] > 0 for i <= 3 and array[t, i] = 0 for i > 3.
How can I cluster the points in a way that I get "continuous" time series, i.e. at time t = 136, array[136, 0] should go into d, array[136, 1] into e, array[136, 2] into f, and array[136, 3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then begin in a place where there are several adjacent columns with full spans of real data, and work away from that location, first to the left and then to the right, one column at a time: if a column contains no zeros, it is OK. If it contains zeros, compute local row averages of the immediately adjacent columns using only non-zero values (how many columns to use depends on the density of missing data and the resolution between the signals), then put each valid value in the current column into the row with the closest local row average, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, it is convenient to first convert the zeros to NaNs, because numpy.nanmean() ignores NaNs.
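A rough numpy sketch of that idea (simplified: it uses a fixed window of neighbouring columns instead of working outward from a clean region, and both `original_array` and the row/column orientation are assumptions on my part):

import numpy as np

# assumed orientation: 8 rows (series) x 400 columns (time steps),
# i.e. the transpose of the (400, 8) array from the question
arr = original_array.T.astype(float)
arr[arr == 0] = np.nan                 # treat zeros as missing
arr = np.sort(arr, axis=0)             # sort each column; NaNs sink to the bottom rows

n_rows, n_cols = arr.shape
window = 3                              # how many neighbouring columns to average over
result = np.full_like(arr, np.nan)

for c in range(n_cols):
    col = arr[:, c]
    if not np.isnan(col).any():
        result[:, c] = col              # complete column: sorted order is already fine
        continue
    lo, hi = max(0, c - window), min(n_cols, c + window + 1)
    local_avg = np.nanmean(arr[:, lo:hi], axis=1)   # local row averages, ignoring NaN
    for v in col[~np.isnan(col)]:
        r = np.nanargmin(np.abs(local_avg - v))     # row whose local average is closest
        result[r, c] = v
        local_avg[r] = np.nan                       # at most one value per row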
I am trying to align my data so that when I use another comparison method the two data sets are aligned in the way that makes them most similar. So far I have cross-correlated the two Pandas Series and found the lag position of highest correlation. How can I then shift my data so that, when the Series are cross-correlated again, the new highest-correlation lag position is 0?
I have 4 fairly large Pandas Series. One of these Series is a Query to be compared to the other 3 Series and itself.
To find the offset of highest correlation between a query-target series pair, I have used np.correlate() and calculated the lag position of the maximum. Having found this lag position, I have tried to incorporate it into each of the series so that, once a cross-correlation is recalculated, the new lag of highest correlation is 0. Unfortunately, this has not been very successful.
I feel there are a few ways I could be going wrong in my methodology here, I'm very new to coding, so any pointers will be very much appreciated.
What I Have So Far
Producing a DataFrame containing the original lag positions for highest correlation in each comparison.
lags_o = pd.DataFrame({"a": [np.correlate(s4, s1, mode='full').argmax() - np.correlate(s4, s1, mode='full').size/2],
                       "b": [np.correlate(s4, s2, mode='full').argmax() - np.correlate(s4, s2, mode='full').size/2],
                       "c": [np.correlate(s4, s3, mode='full').argmax() - np.correlate(s4, s3, mode='full').size/2],
                       "d": [np.correlate(s4, s4, mode='full').argmax() - np.correlate(s4, s4, mode='full').size/2]})
When this is run I get the expected value of 0 for the "d" column, indicating that the two series are optimally aligned (which makes sense). The other columns return non-zero values, so now I want to incorporate these required shifts into the new cross-correlation.
# shifting the series by the lag given in lags_o for that comparison
s1_lagged = s1.shift(lags_o["a"].item())
# selecting all non-NaN values in the series for the next correlation
s1_lagged = s1_lagged[~np.isnan(s1_lagged)]
# this is repeated for the other routes - selecting the appropriate df column
What I expected to get back, when the query route and the newly shifted target series were passed to the cross-correlation again, was that each lag position in lags_n (the recalculated lag DataFrame) would be 0. However, this is not what I am getting at all. Even more confusingly, the new lag position does not seem to relate to the old lag position (the lag does not shift along with the shift applied to the series). I have tried shifting both the query and target series in turn but have not managed to get the required value.
So my question is: how should I correctly manipulate these Series so that I can align these data sets? Happy New Year, and thank you for your time and any suggestions you may have.
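One possible explanation (a sketch, assuming s1 and s4 are equal-length numeric Series): for mode='full' the zero-lag position sits at index len(target) - 1, not at size / 2, and .shift() followed by dropping the NaNs hands np.correlate essentially the same values again (it ignores the pandas index), so the recomputed lag does not move to 0. Slicing both series to their overlapping region keeps the alignment:

import numpy as np

def align_by_xcorr(query, target):
    """Trim two 1-D arrays to the overlap that maximises their cross-correlation,
    so that re-correlating the trimmed pair gives a lag of (close to) 0."""
    xcorr = np.correlate(query, target, mode="full")
    lag = int(xcorr.argmax() - (len(target) - 1))   # zero lag sits at index len(target) - 1
    if lag >= 0:
        return query[lag:], target[:len(target) - lag]
    return query[:len(query) + lag], target[-lag:]

q_al, t_al = align_by_xcorr(s4.to_numpy(), s1.to_numpy())
new_lag = np.correlate(q_al, t_al, mode="full").argmax() - (len(t_al) - 1)
print(new_lag)   # expected to be 0, or very close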
Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most likely (the top x candidates) to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the strings to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using the string's replace method (e.g. signal.replace("-", "1")).
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your 2 signals.
You might want to take a look at the docs for np.correlate and np.convolve to decide which mode (full, same, or valid) you want to use, as that is determined by the length of your data and what you want to happen if your signals are different lengths.
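Putting those pieces together on the example strings from the question (a sketch; it ignores the 1 Hz vs 30 Hz sampling difference, which you would want to resolve by resampling one series before correlating):

import numpy as np

signal1 = "_-__-_--___----_-__--_-__---__"   # DS1.event
signal2 = "__--_-__--"                        # DS2.event

event1 = [int(x) for x in signal1.replace("-", "1").replace("_", "0")]
event2 = [int(x) for x in signal2.replace("-", "1").replace("_", "0")]

xcor = np.correlate(event1, event2, "full")
print("Cross correlation value:", xcor.max())
print("Raw argmax:", int(np.argmax(xcor)))
# converting the raw argmax into an offset of DS2 into DS1, in samples:
print("Offset into DS1:", int(np.argmax(xcor)) - (len(event2) - 1))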
I have expression values (log2) for 200 genes in two conditions, treated and untreated, and for each condition I have 20 replicates. I want to calculate the correlation between the conditions for each gene and rank the genes from highest to lowest.
This is more of a biostats problem, but I still think it is an important one for biologists/bio-programmers, since many of us encounter it.
The dataset looks like this:
Gene UT1 UT2 T1 T2
DDR1 8.111795978 7.7606511867 7.9362235824 7.5974674936
RFC2 10.2418824097 9.7752152714 10.0085488406 9.5723427524
HSPA6 6.5850239731 6.7916563534 6.6883401632 7.3659252344
PAX8 9.2965160827 9.2031177653 9.249816924 8.667772504
GUCA1A 5.4828021059 5.3797749957 5.4312885508 5.1297319374
I have shown only two replicates for each sample in the sample data.
I am looking for a solution in R or python.
The cor function in R does not give me what I want.
If I understand your question correctly, you need to calculate the correlation between UT1 and T1 and between UT2 and T2 for all the genes.
There is a way to do it in R:
df <- data.frame(Gene = c("DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
                 UT1 = c(8.111796, 10.241882, 6.585024, 9.296516, 5.482802),
                 UT2 = c(7.760651, 9.775215, 6.791656, 9.203118, 5.379775),
                 T1  = c(7.936224, 10.008549, 6.688340, 9.249817, 5.431289),
                 T2  = c(7.597467, 9.572343, 7.365925, 8.667773, 5.129732))
make a matrix like this:
mat1 <- cbind(df$UT1, df$T1)
initialize a correlation matrix:
cor1 <- matrix(0, length(df$Gene), length(df$Gene))
then calculate the correlation of all genes against all genes like this:
for (i in 1:length(df$Gene)) cor1[i, ] <- apply(mat1, 1, function(x) cor(x, mat1[i, ]))
I hope this helps.
All sources I've read indicate that you need to create an average measure across the replicates of each condition. I've seen both the mean and the median used, although you may want to look into more advanced pre-processing/normalization methods (like RMA). Once you've done that, you can calculate the correlation between untreated and treated.
There is no way to calculate correlation in exactly the way that you're looking for. Any method that did so would ultimately boil down to summarizing the information across the two conditions by computing a summary probe measure across the replicates (as above).
Alternatively, you could calculate the correlation between each treated and untreated replicate for each probe, and take the average correlation.
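In pandas, both ideas might look something like this (a sketch; expr is a hypothetical DataFrame indexed by Gene with columns UT1..UT20 and T1..T20, and the second block is one reading of the "average the pairwise replicate correlations" suggestion):

import pandas as pd

ut_cols = [c for c in expr.columns if c.startswith("UT")]
t_cols  = [c for c in expr.columns if c.startswith("T")]   # UT columns start with "U", so no overlap

# option 1: summarise each condition by the mean over its replicates, then one
# overall correlation between the two condition profiles (computed across genes)
ut_mean = expr[ut_cols].mean(axis=1)
t_mean  = expr[t_cols].mean(axis=1)
print(ut_mean.corr(t_mean))

# option 2: correlate every untreated/treated replicate pair across genes,
# then average the resulting coefficients
pairwise = [expr[u].corr(expr[t]) for u in ut_cols for t in t_cols]
print(sum(pairwise) / len(pairwise))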
Assuming that the first column is used for the row names (the gene names), i.e., that data itself contains only numeric values, you can simply do the following in R, which will give you an n x n matrix with all pairwise correlations between genes (note that cor() correlates columns, so transpose with t(data) first if the genes are stored as rows).
cor(data)
You may want to specify what type of correlation you want to use... What is the length of the time-series? There are whole studies developed to address the issue of selecting an appropriate measure, e.g., see:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013.