Compare two dataframe columns with binary data - python

I have two columns with binary data (1s and 0s), and I want to check the percentage similarity between one column and the other. Since the data are binary, it is important that the match is based on the position of each cell, not on the overall counts of 0s and 1s. For example:
column_1  column_2
0         1
1         1
0         0
1         0
In that case, both columns contain the same number of 0s and 1s (which would suggest a 100% match); however, taking the position of each value into account, there is only a 50% match. That last figure is the one I'm trying to compute.
I know I could do it with a loop... however, for larger lists that could be a problem.

This builds a boolean vector that is True where column_1 equals column_2 and False elsewhere, sums it up, and divides by the number of samples.
sim = sum(df.column_1 == df.column_2) / len(df.column_1)
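A minimal sketch of the same idea, assuming the toy data from the question; the boolean comparison is element-wise, so position matters, and the mean of a boolean Series is directly the fraction of matching positions:
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({"column_1": [0, 1, 0, 1], "column_2": [1, 1, 0, 0]})

# Fraction of rows where the two columns agree position by position.
similarity = (df["column_1"] == df["column_2"]).mean()
print(similarity)  # 0.5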


Custom column on dataframe based on other columns

I have a dataframe as seen below:
df =
class
0
1
1
...
0
It has 113,269 rows, of which 46,337 are ones and 66,932 are zeros.
What I would like to do is create a feature named id with random numbers from 0 to 50.
As a result, each id will have some 0s and some 1s assigned to it.
Through tests I noticed that each id has a distribution of 1s and 0s similar to that of the whole original dataset (zeros/ones = 1.444).
What I want is to be able to manually change this ratio for as many clients as possible.
Any ideas?
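A minimal sketch of the assignment step described above, using hypothetical stand-in data (the real dataframe and the target per-id ratios are not shown in the question):
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 113,269 rows of 0/1 labels
# with roughly the stated zeros/ones ratio of 1.444.
rng = np.random.default_rng(0)
df = pd.DataFrame({"class": rng.choice([0, 1], size=113_269, p=[0.59, 0.41])})

# Assign each row a random id from 0 to 50, as described in the question.
df["id"] = rng.integers(0, 51, size=len(df))

# With a uniform random assignment, each id inherits roughly the global distribution.
print(df.groupby("id")["class"].mean().head())  # share of 1s per id
Changing the ratio for a given id would then mean reassigning some of its rows rather than drawing ids uniformly, which is the part that has to be controlled manually.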

Correlations for multiple indexes

I'm new to Python, and I'm having trouble calculating correlation coefficients for multiple participants.
I've got a dataframe just like this:
|Index|Participant|Condition|ReactionTime1|ReactionTime2|
|:---:|:---------:|:-------:|:-----------:|:-------------:|
|1|1|A|320|542|
|2|1|A|250|623|
|3|1|B|256|547|
|4|1|B|301|645|
|5|2|A|420|521|
|6|2|A|123|456|
|7|2|B|265|362|
|8|2|B|402|631|
I am wondering how to calculate the correlation coefficient between ReactionTime1 and ReactionTime2 for Participant 1 and Participant 2 in each condition. My real dataset is much bigger than this (hundreds of reaction times per participant, and many participants too). Is there a general way to calculate this and put the coefficients in a new df like this?
|Index|Participant|Condition|Correlation coeff|
|:---:|:---------:|:-------:|:-----------:|
|1|1|A|?|
|2|1|B|?|
|3|2|A|?|
|4|2|B|?|
Thanks :)
You can try groupby and apply with np.corrcoef, and reset_index afterwards:
import numpy as np

result = (df.groupby(["Participant", "Condition"])
            .apply(lambda gr: np.corrcoef(gr["ReactionTime1"], gr["ReactionTime2"])[0, 1])
            .reset_index(name="Correlation coeff"))
which gives
   Participant Condition  Correlation coeff
0            1         A               -1.0
1            1         B                1.0
2            2         A                1.0
3            2         B                1.0
We index the return value of np.corrcoef with [0, 1] because it returns a symmetric matrix whose diagonal elements are normalized to 1 and whose off-diagonal elements are equal, each giving the desired coefficient (so we could just as well index with [1, 0]). That is,
array([[1.        , 0.25691558],
       [0.25691558, 1.        ]])
is an example returned value and we are interested in the off-diagonal entry.
Why it returned all ±1 in your case: each participant & condition pair has only 2 entries per reaction time, so the two values are always perfectly correlated, and the sign is determined by their orientation, i.e. whether the second variable increases or decreases as the first one increases.
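An equivalent sketch using pandas' built-in Series.corr (Pearson's r by default), which avoids building the full correlation matrix; the dataframe below is rebuilt from the table in the question:
import pandas as pd

df = pd.DataFrame({
    "Participant":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Condition":     ["A", "A", "B", "B", "A", "A", "B", "B"],
    "ReactionTime1": [320, 250, 256, 301, 420, 123, 265, 402],
    "ReactionTime2": [542, 623, 547, 645, 521, 456, 362, 631],
})

# Same grouping as above, but the per-group coefficient comes from Series.corr.
result = (df.groupby(["Participant", "Condition"])
            .apply(lambda gr: gr["ReactionTime1"].corr(gr["ReactionTime2"]))
            .reset_index(name="Correlation coeff"))
print(result)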

Balance dataset using pandas

This is for a machine learning program.
I am working with a dataset that has a csv which contains an id, for a .tif image in another directory, and a label, 1 or 0. There are 220,025 rows in the csv. I have loaded this csv as a pandas dataframe. Currently in the dataframe, there are 220,025 rows, with 130,908 rows with label 0 and 89,117 rows with label 1.
There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.
Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 with label 1.
How can I do this using pandas?
I have already looked at using .groupby and then .sample, but that drops an equal number of rows from both labels, while I only want to drop rows from one label.
Sample of the csv:
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
Personally, I would break it up into the following steps:
Since you have more 0s than 1s, we're first going to even out the number of each. Here, I'm using the sample data you pasted in as df.
Count the number of 1s (since this is our smaller value)
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)
print(number_of_1s)
3
Sample only the zeros to match the number of 1s (number_of_1s):
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)
print(sampled_zeros)
Stick these two chunks (all of the 1s from ones_subset and our matched sampled_zeros) together to make one clean dataframe that has an equal number of 1 and 0 labels:
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
print(clean_df)
id label
0 c18f2d887b7ae4f6742ee445113fa1aef383ed77 1
1 a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da 1
2 7f6ccae485af121e0b6ee733022e226ee6b0c65f 1
3 559e55a64c9ba828f700e948f6886f4cea919261 0
4 f38a6374c348f90b587e046aac6079959adf3835 0
5 068aba587a4950175d04c680d38943fd488d6a9d 0
Now that we have a cleaned-up dataset, we can proceed with the last step:
Use the groupby(...).sample(...) approach you mentioned to further downsample this dataset, taking it from a dataset that has 3 of each label (three 1s and three 0s) to a smaller matched size (two 1s and two 0s):
downsampled_df = clean_df.groupby("label").sample(2)
print(downsampled_df)
id label
4 f38a6374c348f90b587e046aac6079959adf3835 0
5 068aba587a4950175d04c680d38943fd488d6a9d 0
1 a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da 1
0 c18f2d887b7ae4f6742ee445113fa1aef383ed77 1
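The same two steps at full scale would look roughly like the sketch below, using the counts from the question; the CSV filename is an assumption, so substitute your real path:
import pandas as pd

df = pd.read_csv("labels.csv")  # hypothetical path; 220,025 rows: 130,908 label 0, 89,117 label 1

# Step 1: equalize the classes at 89,117 rows each (same pattern as above).
ones = df.loc[df["label"] == 1]
zeros = df.loc[df["label"] == 0].sample(len(ones), random_state=0)
balanced = pd.concat([ones, zeros], ignore_index=True)

# Step 2: downsample to 25,000 rows per label, 50,000 rows total.
final = balanced.groupby("label").sample(25_000, random_state=0)
Strictly speaking, the final groupby(...).sample(25_000) alone would reach the 50,000-row target, since both labels already have at least 25,000 rows; the intermediate balancing step just mirrors the two-stage approach above.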

Finding rows with highest means in dataframe

I am trying to find the rows with the highest mean in a very large dataframe.
Reason: I scan something with laser trackers and use a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column and gives out a column with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
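As a further sketch, assuming df is the original numeric dataframe from the question (the toy values below are hypothetical), the highest-mean rows can also be picked straight from the Series returned by df.mean(axis=1), without building a separate frame:
import pandas as pd

# Hypothetical numeric dataframe standing in for the scan data.
df = pd.DataFrame({"x": [4.1, 4.4, 4.9], "y": [4.7, 4.5, 4.6], "z": [4.4, 4.5, 4.7]})

row_means = df.mean(axis=1)     # per-row means, as a Series
print(row_means.idxmax())       # index label of the row with the highest mean
print(row_means.nlargest(3))    # the three highest means, if several are needed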
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

How do I create a pivot table in Pandas where one column is the mean of some values, and the other column is the sum of others?

Basically, how would I create a pivot table that consolidates data, where one of the columns it represents, say a likelihood percentage (0.0-1.0), is calculated by taking the mean, and another, say a number ordered, is calculated by summing all of its values?
Right now I can specify values=... to indicate what should make up one of the two, but then when I specify the aggfunc=... I don't know how the two interoperate.
In my head I'd specify two values for values=... (likelihood percentage and number ordered) and two values for aggfunc=..., but this does not seem to be working.
You could supply aggfunc with a dictionary of column:function (key:value) pairs:
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'a'], 'm': [1, 2, 3], 's': [1, 2, 3]})
print(df)
   a  m  s
0  a  1  1
1  a  2  2
2  a  3  3
df.pivot_table(index='a', values=['m', 's'], aggfunc={'m': pd.Series.mean, 's': sum})
   m  s
a
a  2  6
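For reference, the same dictionary can also use pandas' string aggregation names instead of function objects; a minimal sketch assuming the toy df above:
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'a'], 'm': [1, 2, 3], 's': [1, 2, 3]})

# 'mean' and 'sum' are resolved by pandas to the same aggregations as above.
pivot = df.pivot_table(index='a', values=['m', 's'], aggfunc={'m': 'mean', 's': 'sum'})
print(pivot)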
