I currently have the below code to one-hot encode a pandas dataframe using a dictionary where keys are feature names and values are lists of possible values for that feature.
def dummy_encode_dataframe(self, df, dummy_values_dict):
    for (feature, dummy_values) in sorted(dummy_values_dict.items()):
        for dummy_value in sorted(dummy_values):
            dummy_name = u'%s_%s' % (feature, dummy_value)
            df[dummy_name] = (df[feature] == dummy_value).astype(float)
        del df[feature]
    return df
The dummy_values_dict has the structure:
feature name (key)   list of possible values (strings)
------------------   ---------------------------------
F1                   ['A', 'B', 'C', 'MISSING']
F2                   ['D', 'E', 'F', 'MISSING']
F3                   ['G', 'H', 'I']
with sample input/output:
df (one row):
====
F1    F2    F3
---   ---   ---
'A'   'Q'   'H'

expected output:
df_output:
====
F1_A  F1_B  F1_C  F1_MISSING  F2_D  F2_E  F2_F  F2_MISSING  F3_G  F3_H  F3_I
----  ----  ----  ----------  ----  ----  ----  ----------  ----  ----  ----
1     0     0     0           0     0     0     0           0     1     0
The problem is that the for-loops take too long to run. Any way to optimize this?
UPDATE 1: From the comment about using OneHotEncoder in scikit-learn...
Can you elaborate on this piece of code to get the desired output?
import pandas as pd
df = pd.DataFrame(columns=['F1', 'F2', 'F3'])
df.loc[0] = ['A', 'Q', 'H']
dummy_values_dict = { 'F1': ['A', 'B', 'C', 'MISSING'], 'F2': ['D', 'E', 'F', 'MISSING'], 'F3': ['G', 'H', 'I'] }
# import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
categorical_cols = sorted(dummy_values_dict.keys())
# instantiate OneHotEncoder
# todo: encoding...
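One possible completion of the stub (a sketch, untested; sparse_output and get_feature_names_out assume scikit-learn >= 1.2 and >= 1.0 respectively -- on older versions use sparse=False and get_feature_names):
encoder = OneHotEncoder(
    categories=[sorted(dummy_values_dict[c]) for c in categorical_cols],
    handle_unknown='ignore',  # unseen values like 'Q' encode as all zeros
    sparse_output=False,
)
encoded = encoder.fit_transform(df[categorical_cols])
df_output = pd.DataFrame(encoded,
                         columns=encoder.get_feature_names_out(categorical_cols),
                         index=df.index)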
Maybe the question was not well phrased. I managed to find a more optimized implementation (there are probably better ones) using the below code:
import pandas as pd
import numpy as np
def dummy_encode_dataframe_optimized(df, dummy_values_dict):
    # build the full, sorted list of dummy column headers
    column_headers = np.concatenate(
        [np.array([k + '_value_' + s
                   for s in sorted(dummy_values_dict[k])])
         for k in sorted(dummy_values_dict.keys())], axis=0)
    # headers that correspond to the (single) row's actual values
    feature_values = [str(feature) + '_value_' + str(df[feature][0])
                      for feature in dummy_values_dict.keys()]
    one_hot_encode_vector = np.vectorize(
        lambda x: 1.0 if x in feature_values else 0.0)(column_headers)
    # keep the columns that are not being encoded
    # (df.ix was removed in modern pandas; drop by column name instead)
    untouched_df = df.drop(columns=list(dummy_values_dict.keys()))
    hot_encoded_df = pd.concat(
        [
            untouched_df,
            pd.DataFrame(
                [one_hot_encode_vector],
                index=untouched_df.index,
                columns=column_headers
            )
        ], axis=1
    )
    return hot_encoded_df
df = pd.DataFrame(columns=['F1', 'F2', 'F3'])
df.loc[0] = ['A', 'Q', 'H']
dummy_values_dict = { 'F1': ['A', 'B', 'C', 'MISSING'], 'F2': ['D', 'E', 'F', 'MISSING'], 'F3': ['G', 'H', 'I'] }
result = dummy_encode_dataframe_optimized(df, dummy_values_dict)
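For reference, the single-row assumption (df[feature][0]) can be lifted with numpy broadcasting; here is a multi-row sketch of the same idea (untested beyond this sample):
def dummy_encode_dataframe_vectorized(df, dummy_values_dict):
    pieces = [df.drop(columns=list(dummy_values_dict.keys()))]
    for feature in sorted(dummy_values_dict.keys()):
        values = sorted(dummy_values_dict[feature])
        # compare the whole column against every allowed value at once
        block = (df[feature].to_numpy()[:, None] == np.array(values)).astype(float)
        pieces.append(pd.DataFrame(
            block,
            columns=[feature + '_value_' + v for v in values],
            index=df.index))
    return pd.concat(pieces, axis=1)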
pd.get_dummies should work in your case, but first we need to set all the values not in the dictionary to NaN.
df = pd.DataFrame({'F1': ['A', 'B', 'C', 'MISSING'],
                   'F2': ['Q', 'E', 'F', 'MISSING'],
                   'F3': ['G', 'H', 'I', 5]})
#         F1       F2 F3
# 0        A        Q  G
# 1        B        E  H
# 2        C        F  I
# 3  MISSING  MISSING  5
dummy_values_dict = {'F1': ['A', 'B', 'C', 'MISSING'],
                     'F2': ['D', 'E', 'F', 'MISSING'],
                     'F3': ['G', 'H', 'I']}
We can set every other value to np.nan (this assumes numpy is imported as np):
for col in df.columns:
df.loc[~df[col].isin(dummy_values_dict[col]), col] = np.nan
print(df)
# F1 F2 F3
# 0 A NaN G
# 1 B E H
# 2 C F I
# 3 MISSING MISSING NaN
Then we can use pd.get_dummies to do the job:
print(pd.get_dummies(df))
# F1_A F1_B F1_C F1_MISSING F2_E F2_F F2_MISSING F3_G F3_H F3_I
# 0 1 0 0 0 0 0 0 1 0 0
# 1 0 1 0 0 1 0 0 0 1 0
# 2 0 0 1 0 0 1 0 0 0 1
# 3 0 0 0 1 0 0 1 0 0 0
Be aware that if a value never occurs (for example 'D' in column 'F2'), there won't be an 'F2_D' column, but that can be fixed quite easily if you do need the column, as shown below.
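One way to do that (a sketch, assuming you want every column from dummy_values_dict present and zero-filled when missing):
expected = ['%s_%s' % (col, val)
            for col in sorted(dummy_values_dict)
            for val in sorted(dummy_values_dict[col])]
result = pd.get_dummies(df).reindex(columns=expected, fill_value=0)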
My dataframe is
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1', 'action1', 'action2'],
'col3': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y']})
it looks like
col1 col2 col3
0 A action1 X
1 A action2 X
2 B action1 X
3 B action3 X
4 C action2 X
5 C action1 X
6 A action1 Y
7 A action2 Y
I would like to aggregate them into
col1 col2 col3
0 A,C action1,action2 X
1 B action1,action3 X
2 A action1,action2 Y
Order of items within a column does not matter. Basically, I would like to aggregate col1 and col2, but keep the aggregations separate when col3 differs.
What is the approach I should take?
There are probably many ways to do this, but here's a solution that uses groupby twice: once to build the joined set of actions per col1/col3 pair, and again to group on the joined actions and col3.
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
'col2': ['action1', 'action2', 'action1', 'action3', 'action2', 'action1', 'action1', 'action2'],
'col3': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y']})
df = df.sort_values(by='col2')
df = df.groupby(['col3', 'col1'], as_index=False)['col2'].agg(','.join)
df = df.groupby(['col3', 'col2'], as_index=False)['col1'].agg(','.join).sort_index(axis=1)
Output
col1 col2 col3
0 A,C action1,action2 X
1 B action1,action3 X
2 A action1,action2 Y
IIUC, you want to group together the col1 values that share the same set of col2 values.
For this you need to set up a helper grouper (frozenset is hashable, so it can serve as a group key):
m = df.groupby('col1')['col2'].apply(frozenset)  # col1 -> set of its actions
(df.groupby([df['col1'].map(m), 'col3'], as_index=False)
   .aggregate(lambda x: ','.join(set(x)))
)
output:
col3 col1 col2
0 X A,C action1,action2
1 Y A action1,action2
2 X B action1,action3
I am trying to rename a column and combine that renamed column with others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df.rename(columns={'Col_one' : 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
'Col_2': ['D', 'E', 'F', '-', '-', '-'],})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt and pivot after you have renamed:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value").fillna('-'))
out = out.rename_axis(index=None, columns=None)
print(out)
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using append (note: DataFrame.append was removed in pandas 2.0; see the pd.concat version below) without modifying the actual dataframe:
result = (df[['Col_1', 'Col_2']]
          .append(df[['Col_one']]
                  .rename(columns={'Col_one': 'Col_1'}), ignore_index=True)
          .fillna('-'))
OUTPUT:
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
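On pandas 2.0+, where DataFrame.append no longer exists, the equivalent with pd.concat would be (a sketch):
result = pd.concat(
    [df[['Col_1', 'Col_2']],
     df[['Col_one']].rename(columns={'Col_one': 'Col_1'})],
    ignore_index=True,
).fillna('-')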
This might be a slightly longer method than the other answers, but the below delivered the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
# Keep the values we want to retain (a Series, despite the name)
TempList = df['Col_one']
# Append the existing dataframe with the values from the Series
# (DataFrame.append was removed in pandas 2.0; use pd.concat there)
df = df.append(pd.DataFrame({'Col_1': TempList}), ignore_index=True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using concat should work.
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis=0)
df2 = df2.reset_index(drop=True)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
I have a given dataframe:
           Id Direction Load Unit
1  CN05059815   LoadFWD  0,0  NaN
2  CN05059815   LoadBWD  0,0  NaN
4 ....
....
and a given list:
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by the given list.
For example, the new data should follow exactly the ordering of the list: the first rows would hold CN05059815, which doesn't belong to the list, and then come CN05059830, CN05059946, ..., which do belong to the list, while keeping the rest of the data unchanged.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
# (note: set difference has no defined order; wrap it in sorted() if that matters)
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
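Applied to the question's data, the same idea would look something like this (a sketch; df['Id'] and lst stand in for the question's column and list, and dict.fromkeys deduplicates the Ids while preserving order, since categories must be unique):
ids_not_in_list = [i for i in dict.fromkeys(df['Id']) if i not in lst]
df['Id'] = pd.Categorical(df['Id'], categories=ids_not_in_list + lst, ordered=True)
df = df.sort_values('Id')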
Consider the below approach and example:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then in order to sort the df with the list and its ordering:
df.reindex(df['col']
           .apply(lambda x: list_.index(x) if x in list_ else -1)
           .sort_values()
           .index)
Output:
col
2 c
4 e
3 d
1 b
0 a
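On pandas 1.1+, sort_values accepts a key callable, which makes the same logic a bit more direct (a sketch):
df.sort_values('col', key=lambda s: s.apply(lambda x: list_.index(x) if x in list_ else -1))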
Right now I do it like this:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
df['column'] = df['column'].str.replace('A', 'cat').replace('B', 'rabit').replace('C', 'octpath').replace('D', 'spider').replace('E', 'mammoth').replace('F', 'snake').replace('G', 'starfish')
But I think this is long and unreadable.
Do you know a simple solution?
Here is another approach using pandas.Series.replace, which leaves values without a match (like '-') untouched:
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].replace(d)
Output:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
You can define a dict of your replacement values and call map on the column, passing in your dict. Values that are not present in the dict are mapped to NaN (na_action='ignore' simply skips the mapping for entries that are already NaN), so to keep your existing values you can call fillna with the original column:
In[60]:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].map(d, na_action='ignore').fillna(df['column'])
df
Out[60]:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
mapper = {'A': 'cat', 'B': 'rabit', 'C': 'octpath', 'D': 'spider', 'E': 'mammoth'}
df['column'] = df.column.apply(lambda x: mapper.get(x))
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 None
6 None
7 None
In case you want to set a default value, get's second argument does it directly:
df['column'] = df.column.apply(lambda x: mapper.get(x, "pandas"))
df.column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 pandas
6 pandas
7 pandas
Greetings from Shibuya
I have two dataframes, df and df2.
df column FOUR matches with df2 column LOOKUP COL.
I need to match df column FOUR with df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS.
The resulting dataframe could overwrite df, but I have it listed as result below.
NOTE: THE INDEX DOES NOT MATCH ON EACH OF THE DATAFRAMES
df = pd.DataFrame([['a', 'b', 'c', 'd'],
['e', 'f', 'g', 'h'],
['j', 'k', 'l', 'm'],
['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
ONE TWO THREE FOUR
0 a b c d
1 e f g h
2 j k l m
3 x y z w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
['e', 'f', 'h', '2'],
['j', 'k', 'm', '3'],
['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
X1 Y2 LOOKUP COL RETURN THIS
0 a b d 1
1 e f h 2
2 j k m 3
3 x y w 4
RESULTING DF
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here but the index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
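A merge-based alternative (a sketch; it assumes the LOOKUP COL values are unique, as in the sample data, and note that merge resets the index):
result = (df.merge(df2[['LOOKUP COL', 'RETURN THIS']],
                   left_on='FOUR', right_on='LOOKUP COL', how='left')
            .drop(columns=['FOUR', 'LOOKUP COL'])
            .rename(columns={'RETURN THIS': 'FOUR'}))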
df['FOUR'] = [df2.loc[df2['LOOKUP COL'] == i, 'RETURN THIS'].iloc[0] for i in df['FOUR']]
Something like this should be sufficient to do the trick (note the .iloc[0]: the boolean filter returns a Series, so we take its single value). There's probably a more pandas-native way to do it.
Basically, list comprehension: we generate a new list of df2['RETURN THIS'] values by using the lookup column as we iterate over each i in df['FOUR'].