Is there a better way to handle NaN values? - python

I have an input dataframe
KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
A (C602+C603) C601 75
B (C605+C606) C602 NaN
C 75 L239+C602 NaN
D (32*(C603+44)) 75 NaN
E L239 NaN C601
I have an Indicator df
99 75 C604 C602 C601 C603 C605 C606 44 L239 32
PatientID
1 1 0 1 0 1 0 0 0 1 0 1
2 0 0 0 0 0 0 1 1 0 0 0
3 1 1 1 1 0 1 1 1 1 1 1
4 0 0 0 0 0 1 0 1 0 1 0
5 1 0 1 1 1 1 0 1 1 1 1
source:
input_df = pd.DataFrame({'KPI_ID': ['A','B','C','D','E'],
                         'KPI_Key1': ['(C602+C603)','(C605+C606)','75','(32*(C603+44))','L239'],
                         'KPI_Key2': ['C601','C602','L239+C602','75',np.nan],
                         'KPI_Key3': ['75',np.nan,np.nan,np.nan,'C601']})
indicator_df = pd.DataFrame({'PatientID': [1,2,3,4,5],
                             '99': ['1','0','1','0','1'],
                             '75': ['0','0','1','0','0'],
                             'C604': ['1','0','1','0','1'],
                             'C602': ['0','0','1','0','1'],
                             'C601': ['1','0','0','0','1'],
                             'C603': ['0','0','1','1','1'],
                             'C605': ['0','1','1','0','0'],
                             'C606': ['0','1','1','1','1'],
                             '44': ['1','0','1','0','1'],
                             'L239': ['0','0','1','1','1'],
                             '32': ['1','0','1','0','1']}).set_index('PatientID')
My goal is to create an output df like this (by evaluating the input_df against indicator_df):
final_out_df:
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
1 A 0 1 0
2 A 0 0 0
3 A 2 0 1
4 A 1 0 0
5 A 2 1 0
1 B 0 0 0
2 B 2 0 0
3 B 2 1 0
... ... ... ... ...
I am VERY close and my logic works fine, except that I am unable to handle the NaN values in the input_df. I am able to generate the output for KPI_ID 'A' since none of the three formulas (KPI_Key1, KPI_Key2, KPI_Key3 for 'A') are null, but I fail to generate it for 'B'. Is there anything I can do instead of using a dummy variable in place of NaN and creating that row in indicator_df?
Here is what I did so far:
indicator_df = indicator_df.astype('int32')
final_out_df = pd.DataFrame()
out_df = pd.DataFrame(index=indicator_df.index)
out_df.reset_index(level=0, inplace=True)
final_out_df = pd.DataFrame()
# running loop only for 'A' so it won't fail
for i in range(0, len(input_df)-4):
    for j in ['KPI_Key1','KPI_Key2','KPI_Key3']:
        exp = input_df[j].iloc[i]
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
    final_out_df = final_out_df.append(out_df)
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)

Replace NaN by None and create a dict of local variables to allow a correct evaluation with pd.eval:
def eval_kpi(row):
    kpi = row.filter(like='KPI_Key').fillna('None')
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)

final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')
final_out_df.update(final_out_df.apply(eval_kpi, axis=1))
final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)
Output:
>>> final_out_df
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
0 1 A 0.0 1.0 75.0
1 2 A 0.0 0.0 75.0
2 3 A 2.0 0.0 75.0
3 4 A 1.0 0.0 75.0
4 5 A 2.0 1.0 75.0
5 1 B 0.0 0.0 NaN
6 2 B 2.0 0.0 NaN
7 3 B 2.0 1.0 NaN
8 4 B 1.0 0.0 NaN
9 5 B 1.0 1.0 NaN
10 1 C 75.0 0.0 NaN
11 2 C 75.0 0.0 NaN
12 3 C 75.0 2.0 NaN
13 4 C 75.0 1.0 NaN
14 5 C 75.0 2.0 NaN
15 1 D 1408.0 75.0 NaN
16 2 D 1408.0 75.0 NaN
17 3 D 1440.0 75.0 NaN
18 4 D 1440.0 75.0 NaN
19 5 D 1440.0 75.0 NaN
20 1 E 0.0 NaN 1.0
21 2 E 0.0 NaN 0.0
22 3 E 1.0 NaN 0.0
23 4 E 1.0 NaN 0.0
24 5 E 1.0 NaN 1.0

I was able to solve it by adding:
if exp == exp:
before parsing the exp through the regex (NaN is the only value that is not equal to itself, so this check skips the missing formulas).
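For reference, a minimal sketch of where that check would sit in the inner loop from the question (a formula that is NaN is simply never evaluated or merged, so its column shows up as NaN after the append):
for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
    exp = input_df[j].iloc[i]
    if exp == exp:  # False only when exp is NaN
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')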

How to subtract two successive rows in a dataframe?

I have a dataFrame (python) like this:
x y z time
0 0.730110 4.091428 7.833503 1618237788537
1 0.691825 4.024428 7.998608 1618237788537
2 0.658325 3.998107 8.195119 1618237788537
3 0.658325 4.002893 8.408080 1618237788537
4 0.677468 4.017250 8.561220 1618237788537
I want to add a column called computed to this dataFrame. This column includes values computed as:
row 0: (0.730110-0)^2 +(4.091428-0)^2 +(7.833503-0)^2
row 1: (0.691825 -0.730110)^2 +(4.024428- 4.091428)^2 +(7.998608-7.833503)^2
etc
How can I do that, please?
TL;DR:
df['computed'] = df.diff().pow(2).sum(axis=1)
df.at[0, 'computed'] = df.loc[0].pow(2).sum()
Step by step:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 1, 2, 3, 5, 8], 'c': [1, 4, 9, 16, 25, 36]})
df
# a b c
# 0 1 1 1
# 1 2 1 4
# 2 3 2 9
# 3 4 3 16
# 4 5 5 25
# 5 6 8 36
df.diff()
# a b c
# 0 NaN NaN NaN
# 1 1.0 0.0 3.0
# 2 1.0 1.0 5.0
# 3 1.0 1.0 7.0
# 4 1.0 2.0 9.0
# 5 1.0 3.0 11.0
df.diff().pow(2)
# a b c
# 0 NaN NaN NaN
# 1 1.0 0.0 9.0
# 2 1.0 1.0 25.0
# 3 1.0 1.0 49.0
# 4 1.0 4.0 81.0
# 5 1.0 9.0 121.0
df.diff().pow(2).sum(axis=1)
# 0 0.0
# 1 10.0
# 2 27.0
# 3 51.0
# 4 86.0
# 5 131.0
df['computed'] = df.diff().pow(2).sum(axis=1)
df
# a b c computed
# 0 1 1 1 0.0
# 1 2 1 4 10.0
# 2 3 2 9 27.0
# 3 4 3 16 51.0
# 4 5 5 25 86.0
# 5 6 8 36 131.0
df.at[0, 'computed'] = df.loc[0].pow(2).sum()
df
# a b c computed
# 0 1 1 1 3.0
# 1 2 1 4 10.0
# 2 3 2 9 27.0
# 3 4 3 16 51.0
# 4 5 5 25 86.0
# 5 6 8 36 131.0
Relevant documentation and related questions:
Difference between rows with .diff();
Square each cell with .pow(2);
Sum by row with .sum(axis=1);
How to calculate sum of Nth power of each cell for a column in dataframe?;
Set value for particular cell in pandas DataFrame?.
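Applied to the frame in the question, you would probably restrict the computation to the coordinate columns so that the time column does not enter the sum (a sketch, assuming the default RangeIndex shown above):
coords = df[['x', 'y', 'z']]
df['computed'] = coords.diff().pow(2).sum(axis=1)
df.at[0, 'computed'] = coords.loc[0].pow(2).sum()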

Combining two dataframes

I've tried merging two dataframes, but I can't seem to get it to work. Each time I merge, the rows where I expect values are all 0. Dataframe df1 already has some data in it, with some rows left blank. Dataframe df2 should populate those blank rows in df1 where the column names match, at each value of "TempBin" and "Month" in df1.
EDIT:
Both dataframes are in a for loop. df1 acts as my "storage", df2 changes for each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted in the matching df1 rows. If I use df1 = df1.append(df2) in the for loop, all of the rows from df2 keep inserting at the very end of df1 for each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df1 = pd.merge(df1, df2, on=["Month","TempBin","LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN
Here's some code that worked for me:
# Merge two df into one dataframe on the columns "TempBin" and "Month" filling nan values with 0.
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['TempBin', 'Month'],
                    right_on=['TempBin', 'Month'])
df_merge.fillna(0, inplace=True)
# add column LocationAA and fill it with the not null value from column LocationAA_x and LocationAA_y
df_merge['LocationAA'] = df_merge.apply(lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y']) else x['LocationAA_y'], axis=1)
# remove column LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1.0 2.0 0.0
1 1 1 0.0 89.0 0.0
2 1 2 23.0 38.0 0.0
3 1 3 14.0 17.0 0.0
4 1 4 9.0 14.0 0.0
5 1 5 8.0 99.0 0.0
6 13 0 0.0 0.0 11.0
7 13 1 0.0 0.0 22.0
8 13 2 0.0 0.0 33.0
9 13 3 0.0 0.0 44.0
10 13 4 0.0 0.0 55.0
11 13 5 0.0 0.0 66.0
Let me know if there's something you don't understand in the comments :)
PS: Sorry for the extra comments. But I left them there for some more explanations.
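One thing to watch in the code above: because fillna(0) runs before the coalesce, LocationAA_y is already 0 (not NaN) for the Month 1 rows, so LocationAA comes out as 0 there rather than the original 7, 98, 12, ... from the desired output. A sketch of the reordering that would keep those values (same variable names as above):
df_merge = pd.merge(df1, df2, how='left', on=['TempBin', 'Month'])
# coalesce first: take df2's value where present, otherwise keep df1's
df_merge['LocationAA'] = df_merge['LocationAA_y'].fillna(df_merge['LocationAA_x'])
df_merge = df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1).fillna(0)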
You need to use append to get the desired output:
df1 = df1.append(df2)
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)
Here is another way using combine_first()
i = ['Month','TempBin']
df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
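Since df1 is the storage frame inside a for loop (per the EDIT), a sketch of how that line could be applied once per iteration, assigning the result back to df1 so each location's block is filled in as it arrives; per_location_frames is a hypothetical stand-in for whatever produces df2 each time:
i = ['Month', 'TempBin']
for df2 in per_location_frames:  # hypothetical iterable of per-location results
    df1 = df2.set_index(i).combine_first(df1.set_index(i)).reset_index()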

How to change a non top 3 values columns in a dataframe in Python [duplicate]

This question already has an answer here:
How to find the top column values of each row in a pandas dataframe
(1 answer)
Closed 1 year ago.
I have a dataframe called df_BOW that was made out of bag-of-words (BOW) results.
The dataframe looks like this:
df_BOW
Out[42]:
blue drama this ... book mask
0 3 0 1 ... 1 0
1 0 1 0 ... 0 4
2 0 1 3 ... 6 0
3 6 0 0 ... 1 0
4 7 2 0 ... 0 0
... ... ... ... ... ... ...
81991 0 0 0 ... 0 1
81992 0 0 0 ... 0 1
81993 3 3 5 ... 4 1
81994 4 0 0 ... 0 0
81995 0 1 0 ... 9 2
This data frame has around 12,000 columns and 82,000 rows.
I want to reduce the number of columns by doing this:
for each row, keep only the top 3 columns and set everything else to 0.
So for row number 543 (the original record looks like this):
blue drama this ... book mask
543 1 11 21 ... 7 4
It should become like this
blue drama this ... book mask
543 0 11 21 ... 7 0
Only the top 3 columns are kept (drama, this, book); all other columns become zeros. Another example:
blue drama this ... book mask
929 5 3 2 ... 4 3
will become
blue drama this ... book mask
929 5 3 0 ... 4 0
At the end I should remove all columns that are zeros for all rows.
I started with this loop over all rows and all columns:
for i in range(0, len(df_BOW.index)):
    Col1No = 0
    Col1Val = 0
    Col2No = 0
    Col2Val = 0
    Col3No = 0
    Col3Val = 0
    for j in range(0, len(df_BOW.columns)):
        if (df_BOW.iloc[i,j] > min(Col1Val, Col2Val, Col3Val)):
            if (Col1Val <= Col2Val) & (Col1Val <= Col3Val):
                df_BOW.iloc[i,Col1No] = 0
                Col1Val = df_BOW.iloc[i,j]
                Col1No = j
            elif (Col2Val <= Col1Val) & (Col2Val <= Col3Val):
                df_BOW.iloc[i,Col2No] = 0
                Col2Val = df_BOW.iloc[i,j]
                Col2No = j
            elif (Col3Val <= Col1Val) & (Col3Val <= Col2Val):
                df_BOW.iloc[i,Col3No] = 0
                Col3Val = df_BOW.iloc[i,j]
                Col3No = j
I don't think this loop is the best way to do that. Besides, it will become impossible to do for the top 50 columns with this loop.
Is there a better way to do that?
You can use pandas.Series.nlargest, passing keep='first' to keep only the first occurrence when several values tie for the top 3 largest. Finally use fillna(0) to fill all the NaN cells with 0:
df.apply(lambda row: row.nlargest(3, keep='first'), axis=1).fillna(0)
OUTPUT:
blue book drama mask this
0 0.0 1.0 0.0 0.0 1.0
1 1.0 0.0 1.0 4.0 0.0
2 2.0 6.0 0.0 0.0 3.0
3 3.0 1.0 0.0 0.0 0.0
4 4.0 0.0 2.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0
6 0.0 0.0 0.0 1.0 0.0
7 3.0 4.0 0.0 0.0 5.0
8 4.0 0.0 0.0 0.0 0.0
9 0.0 9.0 1.0 2.0 0.0
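If the row-wise apply is too slow at 82,000 rows, a vectorized sketch using rank should behave like nlargest with keep='first' (method='first' breaks ties by position), and it also covers the final step of dropping the all-zero columns; the 3 here is just the example threshold and could be 50:
top = df_BOW.where(df_BOW.rank(axis=1, method='first', ascending=False) <= 3, 0)
top = top.loc[:, (top != 0).any()]  # drop columns that are zero in every row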

Pandas countif based on multiple conditions, result in new column

How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                   'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
                   'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
                   })
display(df)
mycols = ['val1', 'val2']
# if entry in mycols != np.nan, then df[row, 'countif'] =1; else 0
Desired output dataframe:
We do not need countif logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
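A small version note: newer pandas releases no longer accept the axis passed positionally to any, so depending on your version you may need the keyword form:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)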
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaNs in each row is greater than zero. Convert the resulting Boolean to integers.
df['countif']=df.iloc[:,1:].notna().sum(1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1

How to remove certain features that have a low completeness rate in a DataFrame (Python)

I have a DataFrame with more than 450 variables and more than 500,000 rows. However, some variables are over 90% null. I would like to delete features with more than 90% empty rows.
I made a description of my variables.
DataFrame:
df = pd.DataFrame({
    'A': list('abcdefghij'),
    'B': [4,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
    'C': [7,8,np.nan,4,2,3,6,5,4,6],
    'D': [1,3,5,np.nan,1,0,10,7,np.nan,5],
    'E': [5,3,6,9,2,4,7,3,5,9],
    'F': list('aaabbbckfr'),
    'G': [np.nan,8,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
print(df)
A B C D E F G
0 a 4.0 7 1 5 a NaN
1 b NaN 8 3 3 a 8.0
2 c NaN NaN 5 6 a NaN
3 d NaN 4 NaN 9 b NaN
4 e NaN 2 1 2 b NaN
5 f NaN 3 0 4 b NaN
6 g NaN 6 10 7 c NaN
7 h NaN 5 7 3 k NaN
8 i NaN 4 NaN 5 f NaN
9 j NaN 6 5 9 r NaN
Describe:
desc = df.describe(include = 'all')
d1 = desc.loc['varType'] = desc.dtypes
d3 = desc.loc['rowsNull'] = df.isnull().sum()
d4 = desc.loc['%rowsNull'] = round((d3/len(df))*100, 2)
print(desc)
A B C D E F G
count 10 1 10 10 10 10 1
unique 10 NaN NaN NaN NaN 6 NaN
top i NaN NaN NaN NaN b NaN
freq 1 NaN NaN NaN NaN 3 NaN
mean NaN 4 5.4 4.3 5.3 NaN 8
std NaN NaN 2.22111 3.16403 2.45176 NaN NaN
min NaN 4 2 0 2 NaN 8
25% NaN 4 4 1.5 3.25 NaN 8
50% NaN 4 5.5 4.5 5 NaN 8
75% NaN 4 6.75 6.5 6.75 NaN 8
max NaN 4 9 10 9 NaN 8
varType object float64 float64 float64 float64 object float64
rowsNull 0 9 1 2 0 0 9
%rowsNull 0 90 10 20 0 0 90
In this example we have just 2 features to delete: 'B' and 'G'.
But in my case I find 40 variables whose '%rowsNull' is greater than 90%. How can I avoid taking these variables into account in my modeling?
I have no idea how to do this.
Please help me.
Thanks.
First compare missing values and then take the mean (this works because True values are counted as 1s); then filter by boolean indexing with loc, since we are removing columns:
df = df.loc[:, df.isnull().mean() <.9]
print (df)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Detail:
print (df.isnull().mean())
A 0.0
B 0.9
C 0.1
D 0.2
E 0.0
F 0.0
G 0.9
dtype: float64
You can find columns with 90% or more null values and drop them:
cols_to_drop = df.columns[df.isnull().sum()/len(df) >= .90]
df.drop(cols_to_drop, axis = 1, inplace = True)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Based on your code, you could do something like
keepCols = desc.columns[desc.loc['%rowsNull'] < 90]
df = df[keepCols]
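Another option is DataFrame.dropna with axis=1 and a thresh, which keeps only the columns having at least a given number of non-null values; a sketch, assuming columns at exactly 90% null should also be dropped, as in the answers above:
thresh = int(len(df) * 0.1) + 1  # keep columns with strictly more than 10% non-null values
df = df.dropna(axis=1, thresh=thresh)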
