Replacing dataframe values by median value of group - python

Apologies if this is a repeat, I didn't find a similar answer.
Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median (or whatever metric) from the entire column. My data has labels and I would like to replace each NaN value with the median value for that column from other samples belonging to that label only.
I can do this by splitting the df into one df for each label, imputing over each of those dfs, and combining, but this logic doesn't scale well. I could have up to 20 classes, and I fundamentally don't believe this is the 'right' way to do it.
I would like to do this without copying my df, by using a groupby object in a split-apply-combine technique (or another technique you think would work). I appreciate your help.
Example df:
r1 r2 r3 label
0 12 NaN 58 0
1 34 52 24 1
2 32 4 NaN 1
3 7 89 2 0
4 22 19 12 1
Here, I would like the NaN value at (0, r2) to equal the median of that column for label 0, which is the value 89 (from 3, r2).
I would like the NaN value at (2,r3) to equal the median of that column for label 1, which is median(24, 12), or 18.
Example successful result:
r1 r2 r3 label
0 12 89 58 0
1 34 52 24 1
2 32 4 18 1
3 7 89 2 0
4 22 19 12 1

In [158]: df.groupby('label', group_keys=False) \
.apply(lambda x: x.fillna(x.median()).astype(int))
Out[158]:
r1 r2 r3 label
0 12 89 58 0
3 7 89 2 0
1 34 52 24 1
2 32 4 18 1
4 22 19 12 1
or using transform:
In [149]: df[['label']].join(df.groupby('label')
.transform(lambda x: x.fillna(x.median())).astype(int))
Out[149]:
label r1 r2 r3
0 0 12 89 58
1 1 34 52 24
2 1 32 4 18
3 0 7 89 2
4 1 22 19 12
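
A minimal, self-contained sketch of the same per-group median fill (the frame is rebuilt from the example above; the names medians and filled are just illustrative), using a single transform so the original row order is kept:

import numpy as np
import pandas as pd

df = pd.DataFrame({'r1': [12, 34, 32, 7, 22],
                   'r2': [np.nan, 52, 4, 89, 19],
                   'r3': [58, 24, np.nan, 2, 12],
                   'label': [0, 1, 1, 0, 1]})

# per-label medians broadcast back to the original shape ('label' is the group key)
medians = df.groupby('label').transform('median')

# fill the NaNs from the aligned medians frame; 'label' itself is left untouched
filled = df.fillna(medians).astype(int)
print(filled)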

Related

Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already found the summed values of the Allele_Count and Coverage_Depth columns across df1 and df2, but I need to divide the resulting Alt_Allele_Count by the resulting Coverage_Depth to find the total allele frequency (AF). I tried dividing the two variables and got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left it as a df:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. A Series is a 1-dimensional structure like a single column, while a DataFrame is a 2-dimensional object like a table. Arithmetic between two Series aligns on the index and produces another Series, but arithmetic between two DataFrames also aligns on the column labels, so dividing a single-column DataFrame named Alt_Allele_Count by one named Coverage_Depth gives you a two-column frame full of NaN instead of the ratios you want.
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> Dataframe
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] is a single-column DataFrame rather than a Series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
should return the correct result here. The same goes for the rest of the columns you're combining.
This can be fixed by using one set of brackets '[]' when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
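
As a quick, made-up illustration of the alignment behaviour described above (the numbers below are not the poster's data): dividing two single-column DataFrames with different column names aligns on the labels and returns NaN, while dividing the corresponding Series gives the ratios.

import pandas as pd

alt = pd.DataFrame({'Alt_Allele_Count': [17, 110]})
cov = pd.DataFrame({'Coverage_Depth': [129, 215]})

# DataFrame / DataFrame aligns on column labels -> two all-NaN columns
print(alt / cov)

# Series / Series aligns on the index -> the element-wise ratios
print(alt['Alt_Allele_Count'] / cov['Coverage_Depth'])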

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure if what I want to do is possible or not, but I want to compare every three rows of Column1, take the highest 2 out of those three rows, and assign the corresponding 2 Column2 values to a new column. It does not matter whether the values in Column3 are joined or arranged in any particular order, because I know every 2 rows of Column3 belong to every 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array which groups every 3 values.
Then use GroupBy.nlargest, extract the indices of those values using pd.Index.get_level_values, and assign them to Column3; pandas handles the index alignment.
import numpy as np

# one group label per block of three rows
n_grps = len(df) // 3
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get output like the one shown in the question, with the NaN pushed to the end of each group, you have to perform this additional sorting step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x: x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
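
An alternative sketch of the same selection, building Column3 from scratch with a rank over each group of three instead of nlargest (it reuses the grouping array g defined above and relies on the question's statement that the arrangement within a group does not matter):

# rank Column1 within each group of three, keep Column2 where the rank is in the top 2
mask = df.groupby(g)['Column1'].rank(method='first', ascending=False) <= 2
df['Column3'] = df['Column2'].where(mask)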

Pandas Q-cut: Binning Data using an Expanding Window Approach

This question is somewhat similar to a 2018 question I have found on an identical topic.
I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix to the issue that I am currently facing:
I have a timeseries dataframe named "df", which is roughly structured as follows:
V_1 V_2 V_3 V_4
1/1/2000 17 77 15 88
1/2/2000 85 78 6 59
1/3/2000 31 9 49 16
1/4/2000 81 55 28 33
1/5/2000 8 82 82 4
1/6/2000 89 87 57 62
1/7/2000 50 60 54 49
1/8/2000 65 84 29 26
1/9/2000 12 57 53 84
1/10/2000 6 27 70 56
1/11/2000 61 6 38 38
1/12/2000 22 8 82 58
1/13/2000 17 86 65 42
1/14/2000 9 27 42 86
1/15/2000 63 78 18 35
1/16/2000 73 13 51 61
1/17/2000 70 64 75 83
If I wanted to use all the columns to produce daily quantiles, I would follow this approach:
quantiles = df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
The output looks like this:
V_1 V_2 V_3 V_4
2000-01-01 1 3 0 4
2000-01-02 4 3 0 3
2000-01-03 2 0 2 0
2000-01-04 4 1 0 0
2000-01-05 0 4 4 0
2000-01-06 4 4 3 3
2000-01-07 2 2 3 2
2000-01-08 3 4 1 0
2000-01-09 0 2 2 4
2000-01-10 0 1 4 2
2000-01-11 2 0 1 1
2000-01-12 1 0 4 2
2000-01-13 1 4 3 1
2000-01-14 0 1 1 4
2000-01-15 3 3 0 1
2000-01-16 4 0 2 3
2000-01-17 3 2 4 4
What I want to do:
I would like to produce quantiles of the data in "df" using observations that occurred before and at a specific point in time. I do not want to include observations that occurred after the specific point in time.
For instance:
To calculate the bins for the 2nd of January 2000, I would like to just use observations from the 1st and 2nd of January 2000; and, nothing after the dates;
To calculate the bins for the 3rd of January 2000, I would like to just use observations from the 1st, 2nd and 3rd of January 2000; and, nothing after the dates;
To calculate the bins for the 4th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd and 4th of January 2000; and, nothing after the dates;
To calculate the bins for the 5th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000; and, nothing after the dates;
Otherwise put, I would like to use this approach to calculate the bins for ALL the datapoints in "df". That is, to calculate bins from the 1st of January 2000 to the 17th of January 2000.
In short, what I want to do is to conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.
This code block below is wrong, but it illustrates exactly what I am trying to accomplish:
quantiles = df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
Does anyone have any ideas on how to do this in a simpler fashion than this?
I am new so take this with a grain of salt, but when broken down I believe your question is a duplicate because it requires simple datetime index slicing answered HERE.
lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
print(lt_jan_5)
V_1 V_2 V_3 V_4
2000-01-01 1 2 1 4
2000-01-02 4 3 0 3
2000-01-03 2 0 3 1
2000-01-04 3 1 2 2
2000-01-05 0 4 4 0
Hope this is helpful
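
If you want the expanding-window bins for every date in one pass, one straightforward (if not especially fast) sketch is to loop over the index, qcut each expanding slice, and keep only the last row of each result. This is my own illustration of the idea, assuming df has the DatetimeIndex shown above; the very first date has too few observations for five bins, so with duplicates='drop' its row should simply come out as NaN rather than raising.

rows = []
for end in df.index:
    # bin everything up to and including this date, then keep the bins for this date
    sliced = df.loc[:end].apply(
        lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
    rows.append(sliced.iloc[-1])

expanding_quantiles = pd.DataFrame(rows)
print(expanding_quantiles)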

Extract Features Using two Column Values in a dataframe from another dataframe in python

I have a dataframe df1 which contains two columns, node1_id and node2_id.
And I have another dataframe df2 which contains 14 columns, including node_id and 13 anonymous features.
This is my df1
df1.head()
node1_id node2_id
6 5
5 2
4 6
6 2
2 3
This is my df2
df2.head()
node_id f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13
0 2 14 14 14 12 12 12 7 7 7 0 0 0 15
1 3 31 9 7 31 16 12 31 15 12 31 15 12 8
2 4 0 0 0 0 0 0 0 0 0 0 0 0 7
3 5 31 4 1 31 7 1 31 9 1 31 9 0 15
4 6 31 27 20 31 24 14 31 20 10 31 20 5 7
I want to add these f1...f13 columns to df1 based on some similarity, i.e. for the 1st row of df1 I would compare nodes 6 and 5 using the features of 6 and 5. How can I add that in that row of the dataframe?
def similarity_score(v1, v2):
    # calculate your similarity score here
    return score

def similarity(id_1, id_2):
    # extract the feature rows from df2 corresponding to the given ids,
    # drop the node_id column and flatten them to 1-d numpy arrays,
    # then calculate the similarity score
    feature_vector1 = df2.loc[df2['node_id'] == id_1].drop(columns='node_id').to_numpy().ravel()
    feature_vector2 = df2.loc[df2['node_id'] == id_2].drop(columns='node_id').to_numpy().ravel()
    return similarity_score(feature_vector1, feature_vector2)

df1['similarity'] = df1.apply(lambda ids: similarity(*ids), axis=1)
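
The score itself is up to you; as one hypothetical placeholder (the question does not specify a metric), a plain-numpy cosine similarity could look like this:

import numpy as np

def similarity_score(v1, v2):
    # cosine similarity between two 1-d feature vectors; 0.0 if either vector is all zeros
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0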

Checking if values of a row are consecutive

I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
Where all the values in each row are different and have an increasing order.
I would like to create a new column with 1 or 0 if there are at least 2 consecutive numbers.
For example the second and third row have 10 and 11, and 26 and 27. Is there a more pythonic way than using an iterator?
Thanks
Use DataFrame.diff for the differences between consecutive values in each row, compare with 1, check if there is at least one True per row, and finally cast to integers:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
To improve performance, use numpy on the raw values (before the check column is added):
arr = df.values
df['check'] = np.any((arr[:, 1:] - arr[:, :-1]) == 1, axis=1).astype(int)
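
A self-contained check using the sample rows above (the integer column names 1-6 are an assumption read off the printed output):

import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 10, 12, 35, 70, 80],
                   [10, 11, 23, 40, 42, 47],
                   [5, 26, 27, 38, 60, 65]],
                  columns=[1, 2, 3, 4, 5, 6])

# numpy variant on the raw values, before the check column exists
arr = df.values
df['check'] = np.any((arr[:, 1:] - arr[:, :-1]) == 1, axis=1).astype(int)
print(df)   # check -> 0, 1, 1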
