Preserve original column name in pd.get_dummies() - python

I have a list of columns whose values are all strings. I need to one hot encode them with pd.get_dummies().
I want to keep the original name of those columns along with the value.
So let's say I have a column named Street, and its values are Paved and Not Paved.
After running get_dummies(), I would like the two resulting columns to be named Street_Paved and Street_Not_Paved. Is this possible? Basically, the format I want for the new column names is {i}_{value}, where i is the column name, as in the common for i in cols idiom.
My code is:
cols = ['Street', 'Alley', 'CentralAir', 'Utilities', 'LandSlope', 'PoolQC']
pd.get_dummies(df, columns = cols, prefix = '', prefix_sep = '')

If you remove the prefix = '' and prefix_sep = '' parameters, get_dummies() uses the column names as the default prefix, with the default separator _:
df = pd.DataFrame({'Street' : ['Paved','Paved','Not Paved','Not Paved'],
'Alley':list('acca')})
cols = ['Street','Alley']
df = pd.get_dummies(df, columns = cols)
print (df)
Street_Not Paved Street_Paved Alley_a Alley_c
0 0 1 1 0
1 0 1 0 1
2 1 0 0 1
3 1 0 1 0
If you need to replace all spaces with _, add rename (starting again from the original df):
cols = ['Street','Alley']
df = pd.get_dummies(df, columns = cols).rename(columns=lambda x: x.replace(' ', '_'))
print (df)
Street_Not_Paved Street_Paved Alley_a Alley_c
0 0 1 1 0
1 0 1 0 1
2 1 0 0 1
3 1 0 1 0
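The same idea as a self-contained sketch, assuming a pandas version where get_dummies accepts a dict for prefix and a dtype argument. The variable name dummies is only illustrative; the prefix dict spells out the default behaviour explicitly and is not required.

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["Paved", "Paved", "Not Paved", "Not Paved"],
    "Alley": list("acca"),
})
cols = ["Street", "Alley"]

# Map each encoded column to its own name as prefix (this is also the default).
dummies = pd.get_dummies(
    df,
    columns=cols,
    prefix={c: c for c in cols},
    prefix_sep="_",
    dtype=int,  # 0/1 integers instead of booleans
)

# Replace spaces that come from the category values themselves.
dummies.columns = dummies.columns.str.replace(" ", "_")
print(dummies)
```

Passing a dict is mostly useful when some columns should get a prefix different from their own name.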

Related

Change the form of existing DataFrame

I want to change the form of existing dataframe to a new dataframe such that the value in the new dataframe matches the relationship of the existing two columns. Hence, in the new dataframe, "1" means there is a record in the existing dataframe and "0" means no record.
This is what I did so far. Basically through manual judging but this won't work when I have more than 1000 rows.
Existing dataframe:
series_1 = [[19,"a"],[20,"d"],[31,"d"],[31,"c"],[51,"d"]]
a_df = pd.DataFrame(series_1)
Desired dataframe:
cols = ["a","c","d"]
series_3 = [1,0,0,
0,0,1,
0,1,1,
0,0,1]
np_series = np.array(series_3).reshape(4,3)
c_df = pd.DataFrame(np_series,index = [19,20,31,51],columns=cols)
I'm wondering what are some good ways to transform the dataframe according to above request. Thank you!
Try this:
pd.crosstab(a_df[0], a_df[1])
Result:
1 a c d
0
19 1 0 0
20 0 0 1
31 0 1 1
51 0 0 1
Quick Answer to your Question
import pandas as pd
dic = {
    '0': [19, 20, 31, 31, 51],
    '1': ['a', 'd', 'd', 'c', 'd']
}
df = pd.DataFrame(dic)  # Creating a dataframe
unique_vals = df['1'].unique().tolist()  # Finding unique values from desired column
for val in unique_vals:
    df[val] = list(map(lambda item: 1 if item == val else 0, df['1']))  # Mapping to a new column
df.set_index('0', inplace=True)  # Setting index
df.drop(['1'], axis=1, inplace=True)  # Only use this line if you want to delete the '1' column
print(df)
Output
0 a d c
19 1 0 0
20 0 1 0
31 0 1 0
31 0 0 1
51 0 1 0
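Note that the loop above leaves two rows for index 31, while the question asks for one row per index value. A minimal sketch that collapses them, assuming the same input as the question (c_df is the name used in the question; the get_dummies plus groupby combination is a swapped-in technique, not part of either answer):

```python
import pandas as pd

series_1 = [[19, "a"], [20, "d"], [31, "d"], [31, "c"], [51, "d"]]
a_df = pd.DataFrame(series_1)

# One indicator column per letter, then one row per id:
# max() yields 1 if the id/letter pair occurs at least once.
c_df = pd.get_dummies(a_df[1], dtype=int).groupby(a_df[0]).max()
print(c_df)
```

Using max() rather than sum() keeps the result at 0/1 even if the same pair appears more than once.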

Update row in a dataframe based on a second one

I have the following dataframe, df1 :
AS AT CH TR
James Robert/01/08/2019 0 0 0 1
James Robert/18/08/2019 0 0 0 1
John Smith/01/08/2019 1 0 0 0
John Smith/02/08/2019 0 1 0 0
And df2 :
TIME
Andrew Johnson/08/08/2019 1
James Robert/01/08/2019 0.5
John Smith/02/08/2019 1
If an index value is present in both dataframes (example : James Robert/01/08/2019 and John Smith/02/08/2019), I would like to delete the row in df1 if the value of df1["Column with a value"] - df2['TIME'] = 0 otherwise I would like to update the value.
The desired output would be :
AS AT CH TR
James Robert/01/08/2019 0 0 0 0.5
James Robert/18/08/2019 0 0 0 1
John Smith/01/08/2019 1 0 0 0
If a row is in both dataframes, I'm able to delete it from df1, but I can't find a way to add this particular condition : "df1["Column with a value"]"
Thanks
Instead of using the index, use it as a column. Put the df2['index'] column in a list and pass that list to the isin method on df1.
df2['index'] = df2.index
df1['index'] = df1.index
# .copy() avoids a SettingWithCopyWarning when a column is added below
filtered_df1 = df1[df1['index'].isin(df2['index'].values.tolist())].copy()
Create a dictionary from your 'index' column and the values of the 'TIME' column in df2, then map it onto filtered_df1.
your_dict = dict(zip(df2['index'], df2['TIME']))
filtered_df1['Subtract Value'] = filtered_df1['index'].map(your_dict).fillna(value = 0)
Then do the subtraction on the numeric columns only (the 'index' column holds strings and cannot be subtracted from).
value_cols = ['AS', 'AT', 'CH', 'TR']
final_df = filtered_df1[value_cols].sub(filtered_df1['Subtract Value'], axis=0)
Hope this helps.
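For reference, a minimal sketch that reproduces the desired output from the question directly. It assumes each df1 row has its value in a single column and that the index labels are unique; the names time, matched and result are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"AS": [0, 0, 1, 0], "AT": [0, 0, 0, 1], "CH": [0, 0, 0, 0], "TR": [1, 1, 0, 0]},
    index=[
        "James Robert/01/08/2019",
        "James Robert/18/08/2019",
        "John Smith/01/08/2019",
        "John Smith/02/08/2019",
    ],
    dtype=float,
)
df2 = pd.DataFrame(
    {"TIME": [1, 0.5, 1]},
    index=[
        "Andrew Johnson/08/08/2019",
        "James Robert/01/08/2019",
        "John Smith/02/08/2019",
    ],
)

# Align TIME with df1's index; rows that are absent from df2 subtract nothing.
time = df2["TIME"].reindex(df1.index).fillna(0)

# Keep zeros as they are and subtract TIME only from the non-zero cell of each row.
result = df1.where(df1.eq(0), df1.sub(time, axis=0))

# Drop rows that are present in df2 and are now entirely zero.
matched = df1.index.isin(df2.index)
result = result[~(matched & result.eq(0).all(axis=1))]
print(result)
```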

Convert pandas DataFrame column of comma separated strings to one-hot encoded

I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Note that you're not dealing with OHEs.
str.split + stack + get_dummies + sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .groupby(level=0)
   .sum())
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply + value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, axis=1)
   .iloc[:, 1:]
   .fillna(0)
   .astype(int))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
I figured there is a simpler answer, or at least one that felt simpler to me than chaining multiple operations.
Make sure the column contains unique values separated by commas.
Use the built-in sep parameter of str.get_dummies to specify the comma as the separator. The default separator is the pipe (|).
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh=sof_df.mesh.str.replace(' ','')
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If the categories are controlled (you know how many there are and what they are), the best answer is the one by @Tejeshar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:
import numpy as np
import pandas as pd
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
B C
0 1 1
1 1 1
2 0 0
3 0 0
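One caveat with str.contains is that it matches substrings, so a category such as 'B' would also match a value like 'AB'. A hedged alternative sketch that matches whole tokens instead, using str.get_dummies and then keeping only the columns of interest (wanted and dummies are illustrative names, not from the answer above):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
wanted = ['B', 'C']

# Encode every token, then keep only the wanted categories.
# reindex also creates an all-zero column for a wanted category that never occurs.
dummies = s.str.get_dummies(sep=',').reindex(columns=wanted, fill_value=0)
print(dummies)
```

The missing value in row 2 ends up as a row of zeros, and the empty token after the trailing comma in row 0 does not produce a column.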

Keyword search between two DataFrames using python pandas

Hi I have two DataFrames like below
DF1
Alpha | Numeric | Special
and, or | 1,2,3,4,5| #,$,&
and
DF2 with single column
Content |
boy or girl |
school # morn|
I want to check whether any of the keywords in any column of DF1 appears in the Content column of DF2, and the output should be in a new DF:
output_DF
output_column|
Alpha |
Special |
Can someone help me with this?
The solution is a bit complicated, because for multiple matches (row 2) only the first matched column of df1 is needed:
df1 = pd.DataFrame({'Alpha':['and','or', None, None,None],
'Numeric':['1','2','3','4','5'],
'Special':['#','$','&', None, None]})
print (df1)
Alpha Numeric Special
0 and 1 #
1 or 2 $
2 None 3 &
3 None 4 None
4 None 5 None
df2 = pd.DataFrame({'Content':['boy or girl','school # morn',
'1 school # morn', 'Pechi']})
print (df2)
Content
0 boy or girl
1 school # morn
2 1 school # morn
3 Pechi
#reshape df1
import numpy as np
df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
           .reset_index(level=2,drop=True)
           .rename_axis(('col_order','col_name'))
           .dropna()
           .reset_index(name='val'))
print (df11)
col_order col_name val
0 0 Alpha and
1 0 Alpha or
2 1 Numeric 1
3 1 Numeric 2
4 1 Numeric 3
5 1 Numeric 4
6 1 Numeric 5
7 2 Special #
8 2 Special $
9 2 Special &
#split column by whitespaces, reshape
df22 = (df2['Content'].str.split(expand=True)
                      .stack()
                      .rename('val')
                      .reset_index(level=1,drop=True)
                      .rename_axis('idx')
                      .reset_index())
print (df22)
idx val
0 0 boy
1 0 or
2 0 girl
3 1 school
4 1 #
5 1 morn
6 2 1
7 2 school
8 2 #
9 2 morn
10 3 Pechi
#left join dataframes, remove non match values by dropna
#also for multiple match get always first - use sorting with drop_duplicates
df = (pd.merge(df22, df11, on='val', how='left')
        .dropna(subset=['col_name'])
        .sort_values(['idx','col_order'])
        .drop_duplicates(['idx']))
#if necessary get values from df2
#if no value matched add Other category
df = (pd.concat([df2, df.set_index('idx')], axis=1)
        .fillna({'col_name':'Other'})[['val','col_name','Content']])
print (df)
val col_name Content
0 or Alpha boy or girl
1 # Special school # morn
2 1 Numeric 1 school # morn
3 NaN Other Pechi
EDIT:
To match keywords case-insensitively, compare on a lowercased helper column:
df1 = pd.DataFrame({'Alpha':['and','or', None, None,None],
'Numeric':['1','2','3','4','5'],
'Special':['#','$','&', None, None]})
df2 = pd.DataFrame({'Content':['boy OR girl','school # morn',
'1 school # morn', 'Pechi']})
#if the df1 Alpha values are not lowercase, uncomment the next line
#df1['Alpha'] = df1['Alpha'].str.lower()
df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
.reset_index(level=2,drop=True)
.rename_axis(('col_order','col_name'))
.dropna()
.reset_index(name='val_low'))
df22 = (df2['Content'].str.split(expand=True)
.stack()
.rename('val')
.reset_index(level=1,drop=True)
.rename_axis('idx')
.reset_index())
#store the lowercased values in a new column
df22['val_low'] = df22['val'].str.lower()
df = (pd.merge(df22, df11, on='val_low', how='left')
.dropna(subset=['col_name'])
.sort_values(['idx','col_order'])
.drop_duplicates(['idx']))
df = (pd.concat([df2, df.set_index('idx')], axis=1)
.fillna({'col_name':'Other'})[['val','col_name','Content']])
print (df)
val col_name Content
0 OR Alpha boy OR girl
1 # Special school # morn
2 1 Numeric 1 school # morn
3 NaN Other Pechi
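If only the matched column name is needed, a shorter plain-Python sketch may be easier to follow. It assumes the keywords are kept in a dict whose insertion order gives the column priority; keywords and first_matching_column are hypothetical names introduced here for illustration, and this is a swapped-in technique rather than the merge-based approach above.

```python
import pandas as pd

# Column name -> keywords; dict order defines which column wins on multiple matches.
keywords = {
    "Alpha": ["and", "or"],
    "Numeric": ["1", "2", "3", "4", "5"],
    "Special": ["#", "$", "&"],
}
df2 = pd.DataFrame({"Content": ["boy OR girl", "school # morn",
                                "1 school # morn", "Pechi"]})

def first_matching_column(text):
    """Return the first column whose keywords share a token with text."""
    tokens = set(text.lower().split())  # whitespace tokens, case-insensitive
    for col, words in keywords.items():
        if tokens & set(words):
            return col
    return "Other"

df2["col_name"] = df2["Content"].map(first_matching_column)
print(df2)
```

Unlike the merge-based answer, this does not report which token matched, only the column it belongs to.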

Add column to pandas without headers

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end.
With headers I can do it this way:
df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)
Every non-empty DataFrame has columns, an index and some values.
You can use the next integer position as the column label and create a new column filled with a scalar:
df[len(df.columns)] = 0
Sample:
df = pd.DataFrame({0:[1,2,3],
1:[4,5,6]})
print (df)
0 1
0 1 4
1 2 5
2 3 6
df[len(df.columns)] = 0
print (df)
0 1 2
0 1 4 0
1 2 5 0
2 3 6 0
Also, the simplest way to create a new column with a name is:
df['new'] = 1
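A minimal end-to-end sketch, assuming the headerless data comes from a CSV file read with header=None (the in-memory raw buffer stands in for a real file and is only illustrative):

```python
import io
import pandas as pd

raw = io.StringIO("1,4\n2,5\n3,6\n")  # stand-in for a headerless CSV file

# header=None gives the columns integer labels 0, 1, ...
df = pd.read_csv(raw, header=None)

# The next free integer label equals the current number of columns.
df[len(df.columns)] = 0

# Write back without the header row or the index.
out = df.to_csv(header=False, index=False)
print(out)
```

This relies on the existing labels being 0..n-1; if columns were dropped or renamed earlier, len(df.columns) could collide with an existing label and overwrite that column.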
