Decide which category to drop in pandas get_dummies() - python

Let's say I have the following df:
data = [{'c1': 'a', 'c2': 'x'}, {'c1': 'b', 'c2': 'y'}, {'c1': 'c', 'c2': 'z'}]
df = pd.DataFrame(data)
Output:
  c1 c2
0  a  x
1  b  y
2  c  z
Now I want to use pd.get_dummies() to one-hot encode the two categorical columns c1 and c2 and drop the first category of each column: pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True). How can I choose which category gets dropped, without relying on the rows' order? Is there a parameter I missed?
EDIT:
So my goal would be to drop, e.g., category b from c1 and z from c2.
Output:
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0

One trick is to replace the values you want dropped with NaN, since get_dummies ignores missing values - here one value is removed per column:
#value to drop per column
d = {'c1': 'b', 'c2': 'z'}
d1 = {k: {v: np.nan} for k, v in d.items()}
df = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print (df)
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0
If you need to remove multiple values per column, use lists:
d = {'c1': ['b', 'c'], 'c2': ['z']}
d1 = {k: {x: np.nan for x in v} for k, v in d.items()}
print (d1)
{'c1': {'b': nan, 'c': nan}, 'c2': {'z': nan}}
df = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print (df)
   a  x  y
0  1  1  0
1  0  0  1
2  0  0  0
EDIT:
If the values are unique across columns, it is simpler to drop the unwanted dummy columns in a last step:
df = (pd.get_dummies(df, columns=['c1', 'c2'], prefix='', prefix_sep='')
        .drop(['b', 'z'], axis=1))
print (df)
   a  c  x  y
0  1  0  1  0
1  0  0  0  1
2  0  1  0  0
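Another option, as a sketch: make the category you want removed the *first* category via a Categorical dtype, so that drop_first=True drops exactly that one ('b' and 'z' here mirror the example above):

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['x', 'y', 'z']})

# Reorder each column's categories so the value to drop comes first;
# drop_first=True then removes exactly that category's dummy column.
to_drop = {'c1': 'b', 'c2': 'z'}
for col, cat in to_drop.items():
    cats = [cat] + [c for c in df[col].unique() if c != cat]
    df[col] = pd.Categorical(df[col], categories=cats)

out = pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True,
                     prefix='', prefix_sep='')
print(out)
```

This keeps the familiar drop_first workflow while making the dropped category an explicit choice rather than an accident of row order.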

I'd highly recommend using sklearn instead! https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
You can view the categories via the <your_fitted_instance_name>.categories_ attribute after you've fitted the encoder, and it also has an inverse_transform() method to reverse the one-hot encoding!
As for column dropping: the default is to drop none, but you can pass OneHotEncoder(drop='first') to drop the first category of each feature.
Edit: Also note that sklearn offers Pipelines which can help you ensure consistent pre-processing throughout your project!
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Related

Pandas multiindex: drop rows with group-specific condition

I have created a data frame as follows:
np.random.seed(0)
lvl0 = ['A', 'B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0, 2, size=df.shape)
for v in lvl0:
    df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)
Which gives
         c1   c2
A mode  NaN  NaN
  x       0    1
  y       1    0
  z       1    1
B mode  NaN  NaN
  x       1    1
  y       1    1
  z       1    0
I then calculate the modes of groups A and B for both columns as follows:
for c in cols:
    modes = df.groupby(level=0)[c].agg(pd.Series.mode)
    df.loc[(lvl0, 'mode'), c] = modes.values
which results in
        c1 c2
A mode   1  1
  x      0  1
  y      1  0
  z      1  1
B mode   1  1
  x      1  1
  y      1  1
  z      1  0
I now want to remove all rows where the values are equal to the mode of the corresponding first level group (A-B), which in this case should result in
        c1 c2
A mode   1  1
  x      0  1
  y      1  0
B mode   1  1
  z      1  0
I could achieve the desired result by looping over lvl0, but I was wondering if there was a more elegant solution, perhaps using groupby.
Additionally, I wonder if there is a way to add the mode rows at the moment the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'mode'), c] = modes.values gives me a KeyError on 'mode'.
Use:
np.random.seed(0)
lvl0 = ['A', 'B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0, 2, size=df.shape)
df.sort_index(inplace=True)

#get mode per group (first one if there are multiple modes)
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])

#compare rows against their group's mode (second level removed) and prepend the mode rows
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True),
                 df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
        .sort_index())
print (df)
print (df)
         c1 c2
   tmp
A  mode   1  1
   x      0  1
   y      1  0
B  mode   1  1
   z      1  0
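To see why the ne/any mask works, here is a minimal sketch of just the filtering step, using the same seed and data as the question:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(0, 2, size=(6, 2)),
                  index=idx, columns=['c1', 'c2'])

# Per-group modes (first mode if there are ties)
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])

# Dropping the second index level aligns every row with its group's
# mode row, so ne() compares element-wise and any(axis=1) flags rows
# that differ from the mode in at least one column.
mask = df.droplevel(1).ne(modes).any(axis=1)
kept = df[mask.to_numpy()]
print(kept)
```

The mask's index no longer matches df's MultiIndex, which is why it is converted to a plain numpy array before the boolean selection.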

Pandas dataframes - Match two columns in the two dataframes to change the value of a third column

I have two dataframes, df1 and df2. The x,y values in df2 are a subset of the x,y values in df1. For each x,y row in df2, I want to set the knn column in df1 to 0 wherever df2[x] == df1[x] and df2[y] == df1[y]. In the example below the x,y values (1,1) and (1,2) are common, so the knn column in df1 should become [0,0,0,0]. The last line in the code below is not working. I would appreciate any guidance.
import pandas as pd
import numpy as np

df1_dict = {'x': ['1', '1', '1', '1'],
            'y': [1, 2, 3, 4],
            'knn': [1, 1, 0, 0]}
df2_dict = {'x': ['1', '1'],
            'y': [1, 2]}
df1 = pd.DataFrame(df1_dict, columns=['x', 'y', 'knn'])
df2 = pd.DataFrame(df2_dict, columns=['x', 'y'])
df1['knn'] = np.where((df1['x'] == df2['x']) and df1['y'] == df2['y'], 0)  # this line fails
You can use merge here:
u = df1.merge(df2, on=['x', 'y'], how='left', indicator=True)
u = (u.assign(knn=np.where(u['_merge'].eq("both"), 0, u['knn']))
      .reindex(columns=df1.columns))
print(u)
   x  y  knn
0  1  1    0
1  1  2    0
2  1  3    0
3  1  4    0
You can use MultiIndex.isin:
c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0
   x  y  knn
0  1  1    0
1  1  2    0
2  1  3    0
3  1  4    0

Create dummy variable of multiple columns with python

I am working with a dataframe containing two columns of ID numbers. For further research I want to create dummy variables covering both ID columns together. My current code, however, does not merge the dummies coming from the two columns. How can I combine the two columns and create the dummy variables?
Dataframe
import pandas as pd
import numpy as np
d = {'ID1': [1,2,3], 'ID2': [2,3,4]}
df = pd.DataFrame(data=d)
Current code
pd.get_dummies(df, prefix = ['ID1', 'ID2'], columns=['ID1', 'ID2'])
Desired output
p = {'1': [1,0,0], '2': [1,1,0], '3': [0,1,1], '4': [0,0,1]}
df2 = pd.DataFrame(data=p)
df2
If you need indicators in the output use max, or if you need counts use sum, after calling get_dummies with empty prefixes and casting the values to strings:
df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').max(level=0, axis=1)
#count alternative
#df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').sum(level=0, axis=1)
print (df)
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
There are different ways of skinning a cat; here's how I'd do it, with an additional groupby:
# pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).sum()
pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).max()
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
Another option is stacking, if you like conciseness:
# pd.get_dummies(df.stack()).sum(level=0)
pd.get_dummies(df.stack()).max(level=0)
   1  2  3  4
0  1  1  0  0
1  0  1  1  0
2  0  0  1  1
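Note that max(level=...) / sum(level=...) have been removed in recent pandas (2.x). As a sketch, the same stack idea can be written with an explicit groupby, which works on both older and newer versions:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [1, 2, 3], 'ID2': [2, 3, 4]})

# Stack both ID columns into one Series, one-hot encode, then collapse
# back to one row per original index with a level-0 groupby.
res = pd.get_dummies(df.astype(str).stack()).groupby(level=0).max()
print(res)
```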

Selecting groups formed by the groupby function

My dataframe:
df1
group ordercode  quantity
0     A          1
      B          3
1     C          1
      E          2
      D          1
I have formed the groups with the groupby function, and I need to extract the data by group number.
My desired output:
In:  get group 0
Out:
ordercode  quantity
A          1
B          3
or
group ordercode  quantity
0     A          1
      B          3
Any suggestion would be appreciated.
Use DataFrame.xs; it is also possible to keep the group level with the parameter drop_level=False:
#remove the original level
df1 = df.xs(0)
print (df1)
           quantity
ordercode
A                 1
B                 3
#keep the original level
df1 = df.xs(0, drop_level=False)
print (df1)
                 quantity
group ordercode
0     A                 1
      B                 3
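For completeness, a self-contained sketch that reconstructs a small frame like the question's (the quantity values are hypothetical) and applies xs both ways:

```python
import pandas as pd

# Hypothetical reconstruction of the question's grouped frame
df = pd.DataFrame(
    {'quantity': [1, 3, 1, 2, 1]},
    index=pd.MultiIndex.from_tuples(
        [(0, 'A'), (0, 'B'), (1, 'C'), (1, 'E'), (1, 'D')],
        names=['group', 'ordercode']),
)

print(df.xs(0))                    # group level dropped
print(df.xs(0, drop_level=False))  # group level kept
```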
EDIT: if you have a list of DataFrames with a group column, filter each one:
dfs = [df1, df2, df3]
dfs = [x[x['group'] == 0] for x in dfs]
print (dfs)
In [131]: df.loc[pd.IndexSlice[0, :]]
Out[131]:
           quantity
ordercode
A                 1
B                 3
or
In [130]: df.loc[pd.IndexSlice[0, :], :]
Out[130]:
                 quantity
group ordercode
0.0   A                 1
      B                 3
You can use GroupBy.get_group after specifying columns. Here's a demo:
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': np.random.rand(6),
                   'C': np.arange(6)})
gb = df.groupby('A')
print(gb[gb.obj.columns].get_group('bar'))
     A         B  C
1  bar  0.523248  1
3  bar  0.575946  3
5  bar  0.318569  5

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

df2 = df1.reset_index(drop=False)
#    index  0  1  2
# 0      0  A  B  C
# 1      1  D  E  F
df3 = pd.melt(df2, id_vars=['index'])
#    index variable value
# 0      0        0     A
# 1      1        0     D
# 2      0        1     B
# 3      1        1     E
# 4      0        2     C
# 5      1        2     F
result = pd.merge(df, df3, left_on=[0, 1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
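As a shorter alternative (a sketch, not part of the original answer): since the two columns of df are literally (column, row) coordinates into df1, NumPy fancy indexing can do the whole lookup in one vectorized step:

```python
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# df1[c][r] means row r, column c of df1's underlying array, so index
# that array with (row, column) = (df[1], df[0]) pairs all at once.
df['value'] = df1.values[df[1].to_numpy(), df[0].to_numpy()]
print(df)
```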
