Let's say I have two dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'person':[1,1,2,2,3], 'sub_id':[20,21,21,21,21], 'otherval':[np.nan, np.nan, np.nan, np.nan, np.nan], 'other_stuff':[1,1,1,1,1]}, columns=['person','sub_id','otherval','other_stuff'])
df2 = pd.DataFrame({'sub_id':[20,21,22,23,24,25], 'otherval':[8,9,10,11,12,13]})
I want each level of person in df1 to have all levels of sub_id (including any duplicates) and their respective otherval from df2. In other words, my merged result should look like:
person sub_id otherval other_stuff
1 20 8 1
1 21 9 NaN
1 22 10 NaN
1 23 11 NaN
1 24 12 NaN
1 25 13 NaN
2 20 8 NaN
2 21 9 1
2 21 9 1
2 22 10 NaN
2 23 11 NaN
2 24 12 NaN
2 25 13 NaN
3 20 8 NaN
3 21 9 1
3 22 10 NaN
3 23 11 NaN
3 24 12 NaN
3 25 13 NaN
Notice how person==2 has two rows where sub_id==21.
You can get your desired output with the following:
df3 = df1.groupby('person').apply(lambda x: pd.merge(x, df2, on='sub_id', how='right')).reset_index(level=(0, 1), drop=True)
df3.person = df3.person.ffill().astype(int)
print(df3)
That should yield:
# person sub_id otherval_x other_stuff otherval_y
# 0 1 20 NaN 1.0 8
# 1 1 21 NaN 1.0 9
# 2 1 22 NaN NaN 10
# 3 1 23 NaN NaN 11
# 4 1 24 NaN NaN 12
# 5 1 25 NaN NaN 13
# 6 2 21 NaN 1.0 9
# 7 2 21 NaN 1.0 9
# 8 2 20 NaN NaN 8
# 9 2 22 NaN NaN 10
# 10 2 23 NaN NaN 11
# 11 2 24 NaN NaN 12
# 12 2 25 NaN NaN 13
# 13 3 21 NaN 1.0 9
# 14 3 20 NaN NaN 8
# 15 3 22 NaN NaN 10
# 16 3 23 NaN NaN 11
# 17 3 24 NaN NaN 12
# 18 3 25 NaN NaN 13
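If you want the result to match the layout in the question exactly (a single otherval column, sorted by person and sub_id), you could optionally tidy it up afterwards. A small follow-up sketch, assuming the df3 produced above:
df3 = (df3.drop(columns='otherval_x')
          .rename(columns={'otherval_y': 'otherval'})
          .sort_values(['person', 'sub_id'])
          .reset_index(drop=True))
print(df3)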
I hope that helps.
How can I replace values in a pandas dataframe with values from another dataframe based on common columns?
I need to replace NaN values in df1 based on the common columns "types" and "o_periods". Any suggestions?
df1
types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
8 2 1 1 44882.0 39.0
9 2 1 2 17176.0 29.0
10 2 2 1 28609.0 58.0
11 2 2 2 20370.0 53.0
12 2 3 1 7064.0 12.0
13 2 3 2 13099.0 44.0
14 2 4 1 NaN NaN
15 2 4 2 7117.0 18.0
16 3 1 1 1179.0 1.0
17 3 1 2 552.0 1.0
18 3 2 1 781.0 0.0
19 3 2 2 676.0 1.0
20 3 3 1 783.0 6.0
21 3 3 2 1948.0 2.0
22 3 4 1 NaN NaN
23 3 4 2 274.0 1.0
24 4 1 1 251.0 0.0
25 4 1 2 105.0 0.0
26 4 2 1 288.0 0.0
27 4 2 2 192.0 0.0
28 4 3 1 349.0 2.0
29 4 3 2 1208.0 11.0
30 4 4 1 NaN NaN
31 4 4 2 2051.0 4.0
32 5 1 1 45.0 0.0
33 5 1 2 NaN NaN
34 5 2 1 789.0 7.0
35 5 2 2 437.0 7.0
36 5 3 1 1157.0 5.0
37 5 3 2 2161.0 12.0
38 5 4 1 NaN NaN
39 5 4 2 542.0 1.0
df2
types o_periods s_months incidents
0 1 1 911.0 3.0
1 1 2 1689.0 8.0
2 2 1 26852.0 36.0
3 2 2 14440.0 36.0
4 3 1 914.0 2.0
5 3 2 862.0 1.0
6 4 1 296.0 1.0
7 4 2 889.0 4.0
8 5 1 664.0 4.0
9 5 2 1047.0 7.0
df3: rows with NaN
types c_years o_periods s_months incidents
6 1 4 1 NaN NaN
14 2 4 1 NaN NaN
22 3 4 1 NaN NaN
30 4 4 1 NaN NaN
33 5 1 2 NaN NaN
38 5 4 1 NaN NaN
I have tried to merge df2 with df3 but the indexing seems to reset.
First, separate the rows that have NaN values into a new dataframe called df3, and drop those rows from df1.
Then do a left join of the new dataframe against df2:
df4 = pd.merge(df3, df2, how='left', on=['types', 'o_periods'])
After that is done, append the rows from df4 back into df1.
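A minimal end-to-end sketch of those steps, assuming df1 and df2 are the frames shown above (the all-NaN s_months/incidents columns are dropped from df3 before the merge so the joined values come straight from df2, and concat is used for the final append):
# rows of df1 that still need values, keyed only by the join columns
df3 = df1[df1['s_months'].isnull()].drop(columns=['s_months', 'incidents'])
# fill them from df2 via a left join on the shared key columns
df4 = pd.merge(df3, df2, how='left', on=['types', 'o_periods'])
# put the filled rows back together with the complete rows of df1
df1 = (pd.concat([df1.dropna(subset=['s_months', 'incidents']), df4])
         .sort_values(['types', 'c_years', 'o_periods'])
         .reset_index(drop=True))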
Another way is to combine the two key columns you want to look up on into a single column:
df1["types_o"] = df1["types"].astype(str) + df1["o_periods"].astype(str)
df2["types_o"] = df2["types"].astype(str) + df2["o_periods"].astype(str)
Then you can do a lookup on the missing values:
# map each row's combined key to the matching s_months/incidents in df2
lookup_s_months = df2.set_index('types_o')['s_months']
lookup_incidents = df2.set_index('types_o')['incidents']
df1.loc[df1['s_months'].isnull(), 's_months'] = df1['types_o'].map(lookup_s_months)
df1.loc[df1['incidents'].isnull(), 'incidents'] = df1['types_o'].map(lookup_incidents)
You didn't paste any code or an easily reproducible example of your data, so this is the best I can do.
If I have a pandas dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 4 5
B 10 10 10 8 1 1
C 8 8 0 9 6 3
D 10 10 11 4 2 9
E 0 9 1 5 8 3
If I set a threshold of 7, how do I go through each row and set every value after the last value that meets the threshold to np.nan, so that I get a dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2 9
E 0 9 1 5 8 NaN
Where everything after the last number greater than 7 is set equal to np.nan.
Let's try this:
df.where(df.where(df > 7).bfill(axis=1).notna())
Output:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
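Reading the one-liner from the inside out, here's a quick breakdown (a sketch, assuming the frame from the question is rebuilt as df):
import pandas as pd
df = pd.DataFrame([[5, 5, 10, 9, 4, 5],
                   [10, 10, 10, 8, 1, 1],
                   [8, 8, 0, 9, 6, 3],
                   [10, 10, 11, 4, 2, 9],
                   [0, 9, 1, 5, 8, 3]],
                  index=list('ABCDE'))
above = df.where(df > 7)            # keep only values above the threshold; everything else becomes NaN
keep = above.bfill(axis=1).notna()  # backfill along rows: True wherever a value > 7 sits at or to the right
print(df.where(keep))               # mask the cells to the right of the last value > 7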
Create a mask m by using df.where on df.gt(7), then bfill and notna. Finally, index df with m:
m = df.where(df.gt(7)).bfill(axis=1).notna()
df[m]
Out[24]:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
A very nice question: reverse the column order, then take a cumulative sum; any cell whose running count is still 0 (no value greater than 7 at or to its right) should become NaN.
df.where(df.iloc[:,::-1].gt(7).cumsum(1).ne(0))
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
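To see why that works, you can peek at the intermediate counts (using the same df as in the sketch above); the cells whose running count of values > 7, taken from the right, is still 0 are exactly the ones that get masked:
seen = df.iloc[:, ::-1].gt(7).cumsum(axis=1).iloc[:, ::-1]  # how many values > 7 sit at or to the right of each cell
print(seen)
print(df.where(seen.ne(0)))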
Here's my dataset
id feature_1 feature_2 feature_3 feature_4 feature_5
1 10 15 10 15 20
2 10 NaN 10 NaN 20
3 10 NaN 10 NaN 20
4 10 46 NaN 23 20
5 10 NaN 10 NaN 20
Here's what I need: I want to sort the rows by completeness level (the higher the percentage of non-NaN data, the higher the completeness level), ascending from least to most complete, so it is easier for me to impute the missing values:
id feature_1 feature_2 feature_3 feature_4 feature_5
2 10 NaN 10 NaN 20
3 10 NaN 10 NaN 20
5 10 NaN 10 NaN 20
4 10 46 NaN 23 20
1 10 15 10 15 20
Best Regards,
Try this:
import pandas as pd
import numpy as np
d = ({
    'A': ['X', np.nan, np.nan, 'X', 'Y', np.nan, 'X', 'X', np.nan, 'X', 'X'],
    'B': ['Y', np.nan, 'X', 'Val', 'X', 'X', np.nan, 'X', 'X', 'X', 'X'],
    'C': ['Y', 'X', 'X', np.nan, 'X', 'X', 'Val', 'X', 'X', np.nan, np.nan],
})
df = pd.DataFrame(data=d)
df.T.isnull().sum()
Out[72]:
0 0
1 2
2 1
3 1
4 0
5 1
6 1
7 0
8 1
9 1
10 1
dtype: int64
df['is_null'] = df.T.isnull().sum()
df.sort_values('is_null', ascending=False)
Out[77]:
A B C is_null
1 NaN NaN X 2
2 NaN X X 1
3 X Val NaN 1
5 NaN X X 1
6 X NaN Val 1
8 NaN X X 1
9 X X NaN 1
10 X X NaN 1
0 X Y Y 0
4 Y X X 0
7 X X X 0
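Applied to the frame in the question (assuming it is loaded as a DataFrame df with the id column shown), the same idea, with the most incomplete rows first, would look roughly like:
df['is_null'] = df.isnull().sum(axis=1)   # number of missing values per row
df = df.sort_values('is_null', ascending=False).drop(columns='is_null')
print(df)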
If you want to sort by the column with the maximal number of NaNs:
c = df.isnull().sum().idxmax()
print (c)
feature_2
df = df.sort_values(c, na_position='first', ascending=False)
print (df)
id feature_1 feature_2 feature_3 feature_4 feature_5
1 2 10 NaN 10.0 NaN 20
2 3 10 NaN 10.0 NaN 20
4 5 10 NaN 10.0 NaN 20
3 4 10 46.0 NaN 23.0 20
0 1 10 15.0 10.0 15.0 20
I want to ignore NaN values in my selected dataframe columns when normalizing with sklearn.preprocessing.normalize. Column example:
0 12.0
1 12.0
2 3.0
3 NaN
4 3.0
5 3.0
6 NaN
7 NaN
8 3.0
9 3.0
10 3.0
11 4.0
12 10.0
You can make use of dropna(). It returns the same dataframe with the rows containing NaN removed.
>>> a.dropna()
0 12.0
1 12.0
2 3.0
4 3.0
5 3.0
8 3.0
9 3.0
10 3.0
11 4.0
12 10.0
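If the normalized values then need to stay aligned with the original index (so the NaN rows are simply left alone), one option is to normalize only the non-NaN part and write it back. A minimal sketch, assuming the data sits in a single-column DataFrame a with a hypothetical column name 'col':
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize
a = pd.DataFrame({'col': [12, 12, 3, np.nan, 3, 3, np.nan, np.nan, 3, 3, 3, 4, 10]})
mask = a['col'].notna()
# L2-normalize the non-NaN values down the column (axis=0) and put them back in place;
# rows that were NaN keep their NaN and their original index
a.loc[mask, 'col'] = normalize(a.loc[mask, ['col']], axis=0).ravel()
print(a)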
Suppose I have a pandas dataframe as follows,
data
id A B C D E
1 NaN 1 NaN 1 1
1 NaN 2 NaN 2 2
1 NaN 3 NaN NaN 3
1 NaN 4 NaN NaN 4
1 NaN 5 NaN NaN 5
2 NaN 6 NaN NaN 6
2 NaN 7 5 NaN 7
2 NaN 8 6 2 8
2 NaN 9 NaN NaN 9
2 NaN 10 NaN 4 10
3 NaN 11 NaN NaN 11
3 NaN 12 NaN NaN 12
3 NaN 13 NaN NaN 13
3 NaN 14 NaN NaN 14
3 NaN 15 NaN NaN 15
3 NaN 16 NaN NaN 16
I am using the following command,
pd.DataFrame(data.count().sort_values(ascending=False)).reset_index()
and get the following output,
index 0
0 E 16
1 B 16
2 id 16
3 D 4
4 C 2
5 A 0
I want the following output,
columns count unique(id) count
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
where count is the same as before, and unique(id) count is the number of unique ids that have at least one non-null value in each column. I want both as two fields.
Can anybody help me in doing this?
Thanks
Starting with:
In [7]: df
Out[7]:
id A B C D E
0 1 NaN 1 NaN 1.0 1
1 1 NaN 2 NaN 2.0 2
2 1 NaN 3 NaN NaN 3
3 1 NaN 4 NaN NaN 4
4 1 NaN 5 NaN NaN 5
5 2 NaN 6 NaN NaN 6
6 2 NaN 7 5.0 NaN 7
7 2 NaN 8 6.0 2.0 8
8 2 NaN 9 NaN NaN 9
9 2 NaN 10 NaN 4.0 10
10 3 NaN 11 NaN NaN 11
11 3 NaN 12 NaN NaN 12
12 3 NaN 13 NaN NaN 13
13 3 NaN 14 NaN NaN 14
14 3 NaN 15 NaN NaN 15
15 3 NaN 16 NaN NaN 16
Here is a rather inelegant way:
In [8]: (df.groupby('id').count() > 0).sum()
Out[8]:
A 0
B 3
C 1
D 2
E 3
dtype: int64
So, simply make your base DataFrame as you've specified:
In [9]: counts = (df[['A','B','C','D','E']].count().sort_values(ascending=False)).to_frame(name='counts')
In [10]: counts
Out[10]:
counts
E 16
B 16
D 4
C 2
A 0
And then simply:
In [11]: counts['unique(id) counts'] = (df.groupby('id').count() > 0).sum()
In [12]: counts
Out[12]:
counts unique(id) counts
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
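The two pieces can also be built in one go; a compact variant of the same idea, assuming the df shown at the top of this answer:
value_cols = ['A', 'B', 'C', 'D', 'E']
summary = pd.concat(
    {
        'counts': df[value_cols].count(),                                        # non-null count per column
        'unique(id) counts': (df.groupby('id')[value_cols].count() > 0).sum(),   # ids with at least one non-null value
    },
    axis=1,
).sort_values('counts', ascending=False)
print(summary)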