I have the data like this.
id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a
I want a one-hot-encoded-like result where feature_1 is encoded as 1 and feature_2 as 0.5, like the following table.
id    a    b    c    d    e
1     1    0    0    0    0.5
2     0    1    0.5  0    0
3     0    0    1    0.5  0
4     0    0.5  0    1    0
5     0.5  0    0    0    1
But when I apply sklearn.preprocessing.OneHotEncoder, it outputs 10 columns (one per feature/value combination), each holding plain 1s.
How can I achieve the weighted table above?
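For reference, a minimal sketch of the example data above (pandas assumed; the values are copied from the first table):

import pandas as pd

# Rebuild the example input shown above.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'feature_1': ['a', 'b', 'c', 'd', 'e'],
    'feature_2': ['e', 'c', 'd', 'b', 'a'],
})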
For the two columns, you can do:
pd.crosstab(df.id, df.feature_1) + pd.crosstab(df['id'], df['feature_2']) * .5
Output:
feature_1 a b c d e
id
1 1.0 0.0 0.0 0.0 0.5
2 0.0 1.0 0.5 0.0 0.0
3 0.0 0.0 1.0 0.5 0.0
4 0.0 0.5 0.0 1.0 0.0
5 0.5 0.0 0.0 0.0 1.0
If you have more than two features, with the weights defined, you can melt the frame and then map the features to their weights:
weights = {'feature_1': 1, 'feature_2': 0.5}
flatten = df.melt('id')

(flatten['variable'].map(weights)
 .groupby([flatten['id'], flatten['value']])
 .sum()
 .unstack('value', fill_value=0)
)
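For example, the same pattern extends unchanged to a third feature (the feature_3 values and the 0.25 weight below are made up purely for illustration):

# Hypothetical third feature and weight, only to illustrate the general pattern.
df['feature_3'] = ['c', 'd', 'e', 'a', 'b']
weights = {'feature_1': 1, 'feature_2': 0.5, 'feature_3': 0.25}

flatten = df.melt('id')
encoded = (flatten['variable'].map(weights)
           .groupby([flatten['id'], flatten['value']])
           .sum()
           .unstack('value', fill_value=0))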
I have an input dataframe
KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
A (C602+C603) C601 75
B (C605+C606) C602 NaN
C 75 L239+C602 NaN
D (32*(C603+44)) 75 NaN
E L239 NaN C601
I have an Indicator df
99 75 C604 C602 C601 C603 C605 C606 44 L239 32
PatientID
1 1 0 1 0 1 0 0 0 1 0 1
2 0 0 0 0 0 0 1 1 0 0 0
3 1 1 1 1 0 1 1 1 1 1 1
4 0 0 0 0 0 1 0 1 0 1 0
5 1 0 1 1 1 1 0 1 1 1 1
source:
input_df = pd.DataFrame({'KPI_ID': ['A','B','C','D','E'], 'KPI_Key1': ['(C602+C603)','(C605+C606)','75','(32*(C603+44))','L239'] , 'KPI_Key2' : ['C601','C602','L239+C602','75',np.NaN] , 'KPI_Key3' : ['75',np.NaN,np.NaN,np.NaN,'C601']})
indicator_df = pd.DataFrame({'PatientID': [1,2,3,4,5],'99' : ['1','0','1','0','1'],'75' : ['0','0','1','0','0'],'C604' : ['1','0','1','0','1'],'C602' : ['0','0','1','0','1'],'C601' : ['1','0','0','0','1'],'C603' : ['0','0','1','1','1'],'C605' : ['0','1','1','0','0'],'C606' : ['0','1','1','1','1'],'44' : ['1','0','1','0','1'],'L239' : ['0','0','1','1','1'], '32' : ['1','0','1','0','1'],}).set_index('PatientID')
My goal is to create an output df like this (by evaluating the input_df against indicator_df):
final_out_df:
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
1 A 0 1 0
2 A 0 0 0
3 A 2 0 1
4 A 1 0 0
5 A 2 1 0
1 B 0 0 0
2 B 2 0 0
3 B 2 1 0
... ... ... ... ...
I am VERY close and my logic works fine, except I am unable to handle the NaN values in input_df. I am able to generate the output for KPI_ID 'A', since none of its three formulas (KPI_Key1, KPI_Key2, KPI_Key3) are null, but I fail to generate it for 'B'. Is there anything I can do instead of using a dummy variable in place of NaN and creating that row in indicator_df?
Here is what I did so far:
import re
import pandas as pd

indicator_df = indicator_df.astype('int32')

out_df = pd.DataFrame(index=indicator_df.index)
out_df.reset_index(level=0, inplace=True)
final_out_df = pd.DataFrame()

# running the loop only for 'A' so it won't fail
for i in range(0, len(input_df) - 4):
    for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
        exp = input_df[j].iloc[i]
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
    final_out_df = final_out_df.append(out_df)
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)
Replace NaN with None and create a dict of local variables to allow a correct evaluation with pd.eval:
def eval_kpi(row):
    kpi = row.filter(like='KPI_Key').fillna('None')
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)

final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')

final_out_df.update(final_out_df.apply(eval_kpi, axis=1))

final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)
Output:
>>> final_out_df
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
0 1 A 0.0 1.0 75.0
1 2 A 0.0 0.0 75.0
2 3 A 2.0 0.0 75.0
3 4 A 1.0 0.0 75.0
4 5 A 2.0 1.0 75.0
5 1 B 0.0 0.0 NaN
6 2 B 2.0 0.0 NaN
7 3 B 2.0 1.0 NaN
8 4 B 1.0 0.0 NaN
9 5 B 1.0 1.0 NaN
10 1 C 75.0 0.0 NaN
11 2 C 75.0 0.0 NaN
12 3 C 75.0 2.0 NaN
13 4 C 75.0 1.0 NaN
14 5 C 75.0 2.0 NaN
15 1 D 1408.0 75.0 NaN
16 2 D 1408.0 75.0 NaN
17 3 D 1440.0 75.0 NaN
18 4 D 1440.0 75.0 NaN
19 5 D 1440.0 75.0 NaN
20 1 E 0.0 NaN 1.0
21 2 E 0.0 NaN 0.0
22 3 E 1.0 NaN 0.0
23 4 E 1.0 NaN 0.0
24 5 E 1.0 NaN 1.0
I was able to solve it by adding:
if exp == exp:
before parsing the exp through the regex.
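For context, a sketch of where that guard sits in the original loop (`exp == exp` is False only for a float NaN, so missing formulas are simply skipped):

for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
    exp = input_df[j].iloc[i]
    if exp == exp:  # NaN != NaN, so this skips missing formulas
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')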
I have a dataframe called df_BOW that was built from bag-of-words (BOW) results. The dataframe looks like this:
df_BOW
Out[42]:
blue drama this ... book mask
0 3 0 1 ... 1 0
1 0 1 0 ... 0 4
2 0 1 3 ... 6 0
3 6 0 0 ... 1 0
4 7 2 0 ... 0 0
... ... ... ... ... ... ...
81991 0 0 0 ... 0 1
81992 0 0 0 ... 0 1
81993 3 3 5 ... 4 1
81994 4 0 0 ... 0 0
81995 0 1 0 ... 9 2
This data frame has around 12,000 columns and 82,000 rows.
I want to reduce the number of columns by doing the following:
for each row, keep only the top 3 columns and set everything else to 0.
So for row number 543 (the original record looks like this):
blue drama this ... book mask
543 1 11 21 ... 7 4
It should become like this
blue drama this ... book mask
543 0 11 21 ... 7 0
Only the top 3 columns are kept (drama, this, book); all other columns become zeros. Similarly,
blue drama this ... book mask
929 5 3 2 ... 4 3
will become
blue drama this ... book mask
929 5 3 0 ... 4 0
At the end, I should remove all columns that are zero for every row.
I started with this loop over all rows and all columns:
for i in range(0, len(df_BOW.index)):
    Col1No = 0
    Col1Val = 0
    Col2No = 0
    Col2Val = 0
    Col3No = 0
    Col3Val = 0
    for j in range(0, len(df_BOW.columns)):
        if (df_BOW.iloc[i, j] > min(Col1Val, Col2Val, Col3Val)):
            if (Col1Val <= Col2Val) & (Col1Val <= Col3Val):
                df_BOW.iloc[i, Col1No] = 0
                Col1Val = df_BOW.iloc[i, j]
                Col1No = j
            elif (Col2Val <= Col1Val) & (Col2Val <= Col3Val):
                df_BOW.iloc[i, Col2No] = 0
                Col2Val = df_BOW.iloc[i, j]
                Col2No = j
            elif (Col3Val <= Col1Val) & (Col3Val <= Col2Val):
                df_BOW.iloc[i, Col3No] = 0
                Col3Val = df_BOW.iloc[i, j]
                Col3No = j
I don't think this loop is the best way to do that.
Besides, it would become impossible to do this for the top 50 columns with such a loop.
Is there a better way to do it?
You can use pandas.Series.nlargest, passing keep='first' so that when several values tie for the top 3, only the first occurrence is kept. Finally, use fillna(0) to fill all the NaN cells with 0:
df.apply(lambda row: row.nlargest(3, keep='first'), axis=1).fillna(0)
OUTPUT:
blue book drama mask this
0 0.0 1.0 0.0 0.0 1.0
1 1.0 0.0 1.0 4.0 0.0
2 2.0 6.0 0.0 0.0 3.0
3 3.0 1.0 0.0 0.0 0.0
4 4.0 0.0 2.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0
6 0.0 0.0 0.0 1.0 0.0
7 3.0 4.0 0.0 0.0 5.0
8 4.0 0.0 0.0 0.0 0.0
9 0.0 9.0 1.0 2.0 0.0
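To also drop the columns that end up all-zero, the last step mentioned in the question, one possible follow-up is sketched below (assuming df_BOW as defined above):

# Keep each row's top 3 values, re-align to the original column order,
# then drop the columns that are zero in every row.
top3 = (df_BOW.apply(lambda row: row.nlargest(3, keep='first'), axis=1)
              .reindex(columns=df_BOW.columns)
              .fillna(0))
top3 = top3.loc[:, (top3 != 0).any(axis=0)]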
I have a list of dfs, df_list:
[ CLASS IDX A B C D
0 1 1 1.0 0.0 0.0 0.0
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 NaN NaN NaN NaN
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 0.900 0.100 0.0 0.0
1 1 2 1.000 0.000 0.0 0.0
2 1 3 0.999 0.001 0.0 0.0]
I would like to summarize the data into one df based on conditions and values in the individual dfs. Each df has 4 columns of interest, A, B, C and D. If the value in for example column A is >= 0.1 in df_list[0], I want to print 'A' in the summary df. If two columns, for example A and B, have values >= 0.1, I want to print 'A/B'. The final summary df for this data should be:
CLASS IDX 0 1 2
0 1 1 A NaN A/B
1 1 2 A A A
2 1 3 A A A
In the summary df, the column labels (0,1,2) represent the position of the df in the df_list.
I am starting with this
for index, values in enumerate(df_list):
# summarize the data
But not sure what would be the best way to continue..
Any help greatly appreciated!
Here is one approach:
cols = ['A', 'B', 'C', 'D']

def join_func(df):
    m = df[cols].ge(0.1)
    return (df[cols].mask(m, cols)
                    .where(m, np.nan)
                    .apply(lambda x: '/'.join(x.dropna()), axis=1))

result = (df_list[0].loc[:, ['CLASS', 'IDX']]
                    .assign(**{str(i): join_func(df)
                               for i, df in enumerate(df_list)}))
print(result)
CLASS IDX 0 1 2
0 1 1 A A/B
1 1 2 A A A
2 1 3 A A A
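For reference, df_list as printed at the top can be rebuilt like this so the snippet runs end-to-end (a sketch based on the values shown above):

import numpy as np
import pandas as pd

df_list = [
    pd.DataFrame({'CLASS': [1, 1, 1], 'IDX': [1, 2, 3],
                  'A': [1.0, 1.0, 1.0], 'B': [0.0, 0.0, 0.0],
                  'C': [0.0, 0.0, 0.0], 'D': [0.0, 0.0, 0.0]}),
    pd.DataFrame({'CLASS': [1, 1, 1], 'IDX': [1, 2, 3],
                  'A': [np.nan, 1.0, 1.0], 'B': [np.nan, 0.0, 0.0],
                  'C': [np.nan, 0.0, 0.0], 'D': [np.nan, 0.0, 0.0]}),
    pd.DataFrame({'CLASS': [1, 1, 1], 'IDX': [1, 2, 3],
                  'A': [0.900, 1.000, 0.999], 'B': [0.100, 0.000, 0.001],
                  'C': [0.0, 0.0, 0.0], 'D': [0.0, 0.0, 0.0]}),
]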
I have a dataframe in the following format:
import pandas as pd
d1 = {'ID': ['A','A','A','B','B','B','B','B','C'],
'Time':
['1/18/2016','2/17/2016','2/16/2016','1/15/2016','2/14/2016','2/13/2016',
'1/12/2016','2/9/2016','1/11/2016'],
'Product_ID': ['2','1','1','1','1','2','1','2','2'],
'Var_1': [0.11,0.22,0.09,0.07,0.4,0.51,0.36,0.54,0.19],
'Var_2': [1,0,1,0,1,0,1,0,1],
'Var_3': ['1','1','1','1','0','1','1','0','0']}
df1 = pd.DataFrame(d1)
Where df1 is of the form:
ID Time Product_ID Var_1 Var_2 Var_3
A 1/18/2016 2 0.11 1 1
A 2/17/2016 1 0.22 0 1
A 2/16/2016 1 0.09 1 1
B 1/15/2016 1 0.07 0 1
B 2/14/2016 1 0.4 1 0
B 2/13/2016 2 0.51 0 1
B 1/12/2016 1 0.36 1 1
B 2/9/2016 2 0.54 0 0
C 1/11/2016 2 0.19 1 0
where Time is in 'MM/DD/YYYY' format.
This is what I have to do:
1) I would like to group IDs and Product_IDs by Time (specifically, by each month).
2) I then want to carry out the following column operations:
a) find the sum of the columns Var_2 and Var_3, and
b) find the mean of the column Var_1.
3) Then, I would like to create a column with the count of each ID and Product_ID for each month.
4) And finally, I would also like to include the ID and Product_ID combinations for which there are no entries.
For example, for ID = A and Product_ID = 1 in Time = 2016-1 (January 2016), there are no observations and thus all variables take the value of 0.
Again, for ID = A and Product_ID = 1 in Time = 2016-2 (February 2016), Var_1 = (0.22+0.09)/2 = 0.155, Var_2 = 0+1 = 1, Var_3 = 1+1 = 2, and finally Count = 2.
This is the output that I would like.
ID Product_ID Time Var_1 Var_2 Var_3 Count
A 1 2016-1 0 0 0 0
A 1 2016-2 0.155 1 2 2
B 1 2016-1 0.215 1 2 2
B 1 2016-2 0.4 1 0 1
C 1 2016-1 0 0 0 0
C 1 2016-2 0 0 0 0
A 2 2016-1 0.11 1 1 1
A 2 2016-2 0 0 0 0
B 2 2016-1 0 0 0 0
B 2 2016-2 0.525 0 1 2
C 2 2016-1 0.19 1 0 1
C 2 2016-2 0 0 0 0
This is a little beyond my programming capabilities (I know the groupby function exists, but I could not figure out how to incorporate the rest of the changes). Please let me know if you have questions.
Any help will be appreciated. Thanks.
I'll break down the steps:
df1.Time = pd.to_datetime(df1.Time)
df1.Time = df1.Time.dt.month + df1.Time.dt.year * 100
df1['Var_3'] = df1['Var_3'].astype(int)
output = df1.groupby(['ID', 'Product_ID', 'Time']).agg({'Var_1': 'mean', 'Var_2': 'sum', 'Var_3': 'sum'})
output = output.unstack(2).stack(dropna=False).fillna(0)  # add back the missing (ID, Product_ID, Time) rows
output['Count'] = output.max(1)
output.reset_index().sort_values(['Product_ID', 'ID'])
Out[1032]:
ID Product_ID Time Var_3 Var_2 Var_1 Count
0 A 1 201601 0.0 0.0 0.000 0.0
1 A 1 201602 2.0 1.0 0.155 2.0
4 B 1 201601 2.0 1.0 0.215 2.0
5 B 1 201602 0.0 1.0 0.400 1.0
2 A 2 201601 1.0 1.0 0.110 1.0
3 A 2 201602 0.0 0.0 0.000 0.0
6 B 2 201601 0.0 0.0 0.000 0.0
7 B 2 201602 1.0 0.0 0.525 1.0
8 C 2 201601 0.0 1.0 0.190 1.0
9 C 2 201602 0.0 0.0 0.000 0.0
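If a true per-group row count (rather than the max of the sums) and a 2016-01 style month label are wanted, a possible variation is sketched below (untested; it uses named aggregation, so pandas 0.25 or newer is assumed, and it is not part of the answer above):

df1['Time'] = pd.to_datetime(df1['Time']).dt.to_period('M')  # prints as 2016-01, 2016-02, ...
df1['Var_3'] = df1['Var_3'].astype(int)

output = (df1.groupby(['ID', 'Product_ID', 'Time'])
             .agg(Var_1=('Var_1', 'mean'),
                  Var_2=('Var_2', 'sum'),
                  Var_3=('Var_3', 'sum'),
                  Count=('Var_1', 'size'))      # actual number of rows per group
             .unstack('Time')
             .stack(dropna=False)               # reinstate the missing months as NaN
             .fillna(0)
             .reset_index()
             .sort_values(['Product_ID', 'ID']))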
I have a merged_dataframe whose id comes from dataframe X and value from dataframe Y.
I want to drop ids like A, whose last row has value 1.
How do I do it so that in dataframe X the A rows are dropped?
id value
A 0
A 1
B 0
C 0
To check the last value per id, is it done by using the following?
merged_dataframe = merged_dataframe.groupby('id').nth(-1)
get_last_value = merged_dataframe['value']
Here's one method of doing it:
mask = df.groupby('id',as_index=False)['value'].nth(-1) == 1
df.loc[mask[mask].index,'value'] = np.nan
ndf = df.dropna()
Output:
id value
0 A 0.0
2 B 0.0
3 C 0.0
If you have a dataframe like
id value
0 A 1.0
1 A 0.0
2 A 1.0
3 B 1.0
4 B 0.0
5 B 1.0
6 C 0.0
Then the output is:
id value
0 A 1.0
1 A 0.0
3 B 1.0
4 B 0.0
6 C 0.0
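If the intent is instead to drop every row of an id whose last value is 1 (another possible reading of the question), a groupby/transform sketch (assuming the same df and that pandas is imported) would be:

# Mark ids whose last 'value' is 1, then keep only the rows of the other ids.
last_is_one = df.groupby('id')['value'].transform('last') == 1
ndf = df[~last_is_one]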