Decide which category to drop in pandas get_dummies() - python

Let's say I have the following df:
data = [{'c1': 'a', 'c2': 'x'}, {'c1': 'b', 'c2': 'y'}, {'c1': 'c', 'c2': 'z'}]
df = pd.DataFrame(data)
Output:
  c1 c2
0  a  x
1  b  y
2  c  z
Now I want to use pd.get_dummies() to one-hot encode the two categorical columns c1 and c2 and drop the first category of each column: pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True). How can I choose which category gets dropped, without relying on the rows' order? Is there a parameter I missed?
EDIT:
So my goal would be to drop, e.g., category b from c1 and z from c2.
Output:
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0

One trick is to replace the values you want dropped with NaN, since get_dummies ignores missing values - here one value is removed per column:
#value to drop per column
d = {'c1': 'b', 'c2': 'z'}
d1 = {k: {v: np.nan} for k, v in d.items()}
df = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print (df)
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0
If you need to remove multiple values per column, use lists:
d = {'c1': ['b', 'c'], 'c2': ['z']}
d1 = {k: {x: np.nan for x in v} for k, v in d.items()}
print (d1)
{'c1': {'b': nan, 'c': nan}, 'c2': {'z': nan}}
df = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print (df)
   a  x  y
0  1  1  0
1  0  0  1
2  0  0  0
EDIT:
If the values are unique across columns, it is simpler to drop the unwanted dummy columns in a last step:
df = (pd.get_dummies(df, columns=['c1', 'c2'], prefix='', prefix_sep='')
        .drop(['b', 'z'], axis=1))
print (df)
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0
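Another option, as a sketch: make the category you want removed the *first* category via a Categorical dtype, so that drop_first=True drops exactly that one ('b' and 'z' here mirror the example above):

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['x', 'y', 'z']})

# Reorder each column's categories so the value to drop comes first;
# drop_first=True then removes exactly that category's dummy column.
to_drop = {'c1': 'b', 'c2': 'z'}
for col, cat in to_drop.items():
    cats = [cat] + [c for c in df[col].unique() if c != cat]
    df[col] = pd.Categorical(df[col], categories=cats)

out = pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True,
                     prefix='', prefix_sep='')
print(out)
```

This keeps the familiar drop_first workflow while making the dropped category an explicit choice rather than an accident of row order.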

I'd highly recommend using sklearn instead! https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
You can view the categories via the <your_fitted_instance_name>.categories_ attribute after you've fitted the encoder, and it also has an inverse_transform() method to reverse the one-hot encoding!
As for column dropping: the default is to drop none, but you can pass OneHotEncoder(drop='first') to drop the first category of each feature.
Edit: Also note that sklearn offers Pipelines which can help you ensure consistent pre-processing throughout your project!
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Related

Pandas multiindex: drop rows with group-specific condition

I have created a data frame as follows:
np.random.seed(0)
lvl0 = ['A', 'B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0, 2, size=df.shape)
for v in lvl0:
    df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)
Which gives
         c1   c2
A mode  NaN  NaN
  x       0    1
  y       1    0
  z       1    1
B mode  NaN  NaN
  x       1    1
  y       1    1
  z       1    0
I then calculate the modes of groups A and B for both columns as follows:
for c in cols:
    modes = df.groupby(level=0)[c].agg(pd.Series.mode)
    df.loc[(lvl0, 'mode'), c] = modes.values
which results in
        c1 c2
A mode   1  1
  x      0  1
  y      1  0
  z      1  1
B mode   1  1
  x      1  1
  y      1  1
  z      1  0
I now want to remove all rows where the values are equal to the mode of the corresponding first level group (A-B), which in this case should result in
        c1 c2
A mode   1  1
  x      0  1
  y      1  0
B mode   1  1
  z      1  0
I could achieve the desired result by looping over lvl0, but I was wondering if there was a more elegant solution, perhaps using groupby.
Additionally, I wonder if there is a way to add the mode rows at the moment the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'mode'), c] = modes.values gives me a KeyError on 'mode'.
Use:
np.random.seed(0)
lvl0 = ['A', 'B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0, 2, size=df.shape)
df.sort_index(inplace=True)

#get mode per group (first one if there are multiple modes)
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])

#compare rows against their group's mode (second level removed) and prepend the mode rows
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True),
                 df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
        .sort_index())
print (df)
print (df)
         c1 c2
   tmp
A  mode   1  1
   x      0  1
   y      1  0
B  mode   1  1
   z      1  0
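To see why the ne/any mask works, here is a minimal sketch of just the filtering step, using the same seed and data as the question:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(0, 2, size=(6, 2)),
                  index=idx, columns=['c1', 'c2'])

# Per-group modes (first mode if there are ties)
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])

# Dropping the second index level aligns every row with its group's
# mode row, so ne() compares element-wise and any(axis=1) flags rows
# that differ from the mode in at least one column.
mask = df.droplevel(1).ne(modes).any(axis=1)
kept = df[mask.to_numpy()]
print(kept)
```

The mask's index no longer matches df's MultiIndex, which is why it is converted to a plain numpy array before the boolean selection.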

Pandas dataframes - Match two columns in the two dataframes to change the value of a third column

I have two dataframes, df1 and df2. The x,y values in df2 are a subset of the x,y values in df1. For each x,y row in df2, I want to set the knn column in df1 to 0 wherever df2[x] == df1[x] and df2[y] == df1[y]. In the example below the x,y values (1,1) and (1,2) are common, so the knn column in df1 should become [0,0,0,0]. The last line in the code below is not working. I would appreciate any guidance.
import pandas as pd
import numpy as np

df1_dict = {'x': ['1', '1', '1', '1'],
            'y': [1, 2, 3, 4],
            'knn': [1, 1, 0, 0]}
df2_dict = {'x': ['1', '1'],
            'y': [1, 2]}
df1 = pd.DataFrame(df1_dict, columns=['x', 'y', 'knn'])
df2 = pd.DataFrame(df2_dict, columns=['x', 'y'])
df1['knn'] = np.where((df1['x'] == df2['x']) and df1['y'] == df2['y'], 0)  # this line fails
You can use merge here:
u = df1.merge(df2, on=['x', 'y'], how='left', indicator=True)
u = (u.assign(knn=np.where(u['_merge'].eq("both"), 0, u['knn']))
      .reindex(columns=df1.columns))
print(u)
   x  y  knn
0  1  1    0
1  1  2    0
2  1  3    0
3  1  4    0
You can use MultiIndex.isin:
c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0
   x  y  knn
0  1  1    0
1  1  2    0
2  1  3    0
3  1  4    0

Create dummy variable of multiple columns with python

I am working with a dataframe containing two columns of ID numbers. For further research I want to create dummy variables covering both ID columns together. My current code, however, does not merge the dummies coming from the two columns. How can I combine the two columns and create the dummy variables?
Dataframe
import pandas as pd
import numpy as np
d = {'ID1': [1,2,3], 'ID2': [2,3,4]}
df = pd.DataFrame(data=d)
Current code
pd.get_dummies(df, prefix = ['ID1', 'ID2'], columns=['ID1', 'ID2'])
Desired output
p = {'1': [1,0,0], '2': [1,1,0], '3': [0,1,1], '4': [0,0,1]}
df2 = pd.DataFrame(data=p)
df2
If you need indicators in the output use max, or if you need counts use sum, after calling get_dummies with empty prefixes and casting the values to strings:
df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').max(level=0, axis=1)
#count alternative
#df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').sum(level=0, axis=1)
print (df)
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
There are different ways of skinning a cat; here's how I'd do it, with an additional groupby:
# pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).sum()
pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).max()
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
Another option is stacking, if you like conciseness:
# pd.get_dummies(df.stack()).sum(level=0)
pd.get_dummies(df.stack()).max(level=0)
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
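Note that max(level=...) / sum(level=...) have been removed in recent pandas (2.x). As a sketch, the same stack idea can be written with an explicit groupby, which works on both older and newer versions:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [1, 2, 3], 'ID2': [2, 3, 4]})

# Stack both ID columns into one Series, one-hot encode, then collapse
# back to one row per original index with a level-0 groupby.
res = pd.get_dummies(df.astype(str).stack()).groupby(level=0).max()
print(res)
```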

Selecting groups formed by the groupby function

My dataframe:
df1
group ordercode  quantity
0     A          1
      B          3
1     C          1
      E          2
      D          1
I have formed the groups with the groupby function, and I need to extract the data by group number.
My desired output:
In:  get group 0
Out:
ordercode  quantity
A          1
B          3
or
group ordercode  quantity
0     A          1
      B          3
Any suggestion would be appreciated.
Use DataFrame.xs; it is also possible to keep the group level with the parameter drop_level=False:
#remove the original level
df1 = df.xs(0)
print (df1)
           quantity
ordercode
A                 1
B                 3
#keep the original level
df1 = df.xs(0, drop_level=False)
print (df1)
                 quantity
group ordercode
0     A                 1
      B                 3
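For completeness, a self-contained sketch that reconstructs a small frame like the question's (the quantity values are hypothetical) and applies xs both ways:

```python
import pandas as pd

# Hypothetical reconstruction of the question's grouped frame
df = pd.DataFrame(
    {'quantity': [1, 3, 1, 2, 1]},
    index=pd.MultiIndex.from_tuples(
        [(0, 'A'), (0, 'B'), (1, 'C'), (1, 'E'), (1, 'D')],
        names=['group', 'ordercode']),
)

print(df.xs(0))                    # group level dropped
print(df.xs(0, drop_level=False))  # group level kept
```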
EDIT: if you have a list of DataFrames with a group column, filter each one:
dfs = [df1, df2, df3]
dfs = [x[x['group'] == 0] for x in dfs]
print (dfs)
In [131]: df.loc[pd.IndexSlice[0, :]]
Out[131]:
           quantity
ordercode
A                 1
B                 3
or
In [130]: df.loc[pd.IndexSlice[0, :], :]
Out[130]:
                 quantity
group ordercode
0.0   A                 1
      B                 3
You can use GroupBy.get_group after specifying columns. Here's a demo:
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': np.random.rand(6),
                   'C': np.arange(6)})
gb = df.groupby('A')
print(gb[gb.obj.columns].get_group('bar'))
     A         B  C
1  bar  0.523248  1
3  bar  0.575946  3
5  bar  0.318569  5

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

df2 = df1.reset_index(drop=False)
#    index  0  1  2
# 0      0  A  B  C
# 1      1  D  E  F
df3 = pd.melt(df2, id_vars=['index'])
#    index variable value
# 0      0        0     A
# 1      1        0     D
# 2      0        1     B
# 3      1        1     E
# 4      0        2     C
# 5      1        2     F
result = pd.merge(df, df3, left_on=[0, 1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
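As a shorter alternative (a sketch, not part of the original answer): since the two columns of df are literally (column, row) coordinates into df1, NumPy fancy indexing can do the whole lookup in one vectorized step:

```python
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# df1[c][r] means row r, column c of df1's underlying array, so index
# that array with (row, column) = (df[1], df[0]) pairs all at once.
df['value'] = df1.values[df[1].to_numpy(), df[0].to_numpy()]
print(df)
```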
