How to iterate over columns with string values using get_dummies? - python

I have a dataframe with 45 columns. Most hold string values, so I'm trying to use pd.get_dummies to turn the strings into numbers with df = pd.get_dummies(df, drop_first=True); however, the columns without string values are removed from my dataframe. I don't want to have to type out 40 or so column names. How can I iterate over every column, ignoring the ones without strings, and still have them remain after the get_dummies call?

Columns can be filtered by dtype to programmatically determine which ones to pass to get_dummies, namely only the object- or category-typed columns:
new_df = pd.get_dummies(
    df,
    columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
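An alternative way to collect the same set of columns, if you prefer, is DataFrame.select_dtypes; a minimal sketch on a made-up frame (the column names here are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],                                     # numeric, left alone
    'B': pd.Series(['x', 'y', 'x'], dtype='category'),  # category, encoded
    'D': ['p', 'q', 'p'],                               # object, encoded
})

# select_dtypes picks out the object/category columns in one call
cat_cols = df.select_dtypes(include=['object', 'category']).columns
new_df = pd.get_dummies(df, columns=cat_cols)
```

The numeric column A survives untouched while B and D are expanded into dummy columns.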
Sample Data:
import numpy as np
import pandas as pd

np.random.seed(5)
n = 10
df = pd.DataFrame({
    'A': np.random.randint(1, 100, n),
    'B': pd.Series(np.random.choice(list("ABCD"), n), dtype='category'),
    'C': np.random.random(n) * 100,
    'D': np.random.choice(list("EFGH"), n)
})

new_df = pd.get_dummies(
    df,
    columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
df:
A B C D
0 79 A 76.437261 G
1 62 D 11.090076 E
2 17 B 20.415475 E
3 74 B 11.909536 E
4 9 D 87.790307 G
5 63 A 52.367529 E
6 28 D 49.213600 F
7 31 A 73.187110 H
8 81 B 1.458075 H
9 8 D 9.336303 H
df.dtypes:
A int32
B category
C float64
D object
dtype: object
new_df:
A C B_A B_B B_D D_E D_F D_G D_H
0 79 76.437261 1 0 0 0 0 1 0
1 62 11.090076 0 0 1 1 0 0 0
2 17 20.415475 0 1 0 1 0 0 0
3 74 11.909536 0 1 0 1 0 0 0
4 9 87.790307 0 0 1 0 0 1 0
5 63 52.367529 1 0 0 1 0 0 0
6 28 49.213600 0 0 1 0 1 0 0
7 31 73.187110 1 0 0 0 0 0 1
8 81 1.458075 0 1 0 0 0 0 1
9 8 9.336303 0 0 1 0 0 0 1

Related

How to move all items of one column to columns in pandas?

I am new to pandas. I am trying to move the items of a column into columns of the dataframe. I have been struggling for hours but have not managed to do it.
MWE
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'X': [10, 20, 30, 40, 50],
    'Y': [list('abd'), list(), list('ab'), list('abefc'), list('e')]
})
print(df)
X Y
0 10 [a, b, d]
1 20 []
2 30 [a, b]
3 40 [a, b, e, f, c]
4 50 [e]
How can I get a result like this:
X a b c d e
0 10 1 1 0 1 0
1 20 0 0 0 0 0
2 30 1 1 0 0 0
3 40 1 1 1 0 1
4 50 0 0 0 0 1
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# the fit_transform call on the right-hand side runs first, so
# mlb.classes_ is already populated when the new columns are assigned
df[mlb.classes_] = mlb.fit_transform(df['Y'])
Pandas alternative
df.join(df['Y'].explode().str.get_dummies().groupby(level=0).max())
X Y a b c d e f
0 10 [a, b, d] 1 1 0 1 0 0
1 20 [] 0 0 0 0 0 0
2 30 [a, b] 1 1 0 0 0 0
3 40 [a, b, e, f, c] 1 1 1 0 1 1
4 50 [e] 0 0 0 0 1 0
You can try pandas.Series.str.get_dummies
out = df[['X']].join(df['Y'].apply(','.join).str.get_dummies(sep=','))
print(out)
X a b c d e f
0 10 1 1 0 1 0 0
1 20 0 0 0 0 0 0
2 30 1 1 0 0 0 0
3 40 1 1 1 0 1 1
4 50 0 0 0 0 1 0
My straightforward solution: for each candidate column, check whether it appears in that row's Y list and write 1, otherwise 0:
for col in ['a', 'b', 'c', 'd', 'e']:
    df[col] = pd.Series([1 if col in df["Y"][x] else 0 for x in range(len(df.index))])
df = df.drop('Y', axis=1)
print(df)
Edit: Okay, the groupby is cleaner
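A further pandas-only variant, sketched here as my own addition rather than one of the answers above, is to explode the lists and feed them to pd.crosstab; the reindex restores rows whose list was empty, since crosstab drops the NaN that exploding an empty list produces:

```python
import pandas as pd

df = pd.DataFrame({
    'X': [10, 20, 30, 40, 50],
    'Y': [list('abd'), [], list('ab'), list('abefc'), list('e')],
})

exploded = df['Y'].explode()  # one row per list element; [] becomes NaN
wide = (pd.crosstab(exploded.index, exploded)
          .reindex(df.index, fill_value=0))
out = df[['X']].join(wide)
```

Row 1 (the empty list) comes back as all zeros thanks to the reindex.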

Add a column with sequence values if a condition on another column with binary values is satisfied

I have a dataframe df with a column A of random numbers and a column B of categories. I obtain another column C using the code below:
df.loc[df['A'] >= 50, 'C'] = 1
df.loc[df['A'] < 50, 'C'] = 0
I want to obtain a column 'D' that numbers the C == 1 rows sequentially within each category of B and holds 0 wherever C is 0. The required dataframe is given below.
Required df
A B C D
17 a 0 0
88 a 1 1
99 a 1 2
76 a 1 3
73 a 1 4
23 b 0 0
36 b 0 0
47 b 0 0
74 b 1 1
80 c 1 1
77 c 1 2
97 d 1 1
30 d 0 0
80 d 1 2
Use GroupBy.cumcount with Series.mask:
df['D'] = df.groupby(['B', 'C']).cumcount().add(1).mask(df['C'].eq(0), 0)
print(df)
A B C D
17 a 0 0
88 a 1 1
99 a 1 2
76 a 1 3
73 a 1 4
23 b 0 0
36 b 0 0
47 b 0 0
74 b 1 1
80 c 1 1
77 c 1 2
97 d 1 1
30 d 0 0
80 d 1 2
Or numpy.where:
df['D'] = np.where(df['C'].eq(0), 0, df.groupby(['B', 'C']).cumcount().add(1))
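As a self-contained check of the groupby approach (the data below is retyped from the question's table, so treat it as an approximation):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [17, 88, 99, 76, 73, 23, 36, 47, 74, 80, 77, 97, 30, 80],
    'B': list('aaaaabbbbccddd'),
})
df['C'] = (df['A'] >= 50).astype(int)

# number the rows within each (B, C) pair, then zero out the C == 0 rows
df['D'] = df.groupby(['B', 'C']).cumcount().add(1).mask(df['C'].eq(0), 0)
```

Note that cumcount numbers rows in their original order even though the groups themselves are keyed by (B, C), which is why the counter within B == 'd' continues across the interrupting 0 row.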

Drop rows of a Pandas dataframe if the values of a range of columns are 0

Dataframe:
df = pd.DataFrame({'a': ['NA', 'W', 'Q', 'M'], 'b': [0, 0, 4, 2], 'c': [0, 12, 0, 2],
                   'd': [22, 3, 34, 12], 'e': [0, 0, 3, 6], 'f': [0, 2, 0, 0], 'h': [0, 1, 1, 0]})
df
a b c d e f h
0 NA 0 0 22 0 0 0
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
I want to drop the entire row if column b and all columns from e onward contain 0
Basically I want to get something like this
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
If you want to test the columns from e to the end, together with column b (added via DataFrame.assign), use DataFrame.loc for the selection, test for inequality with DataFrame.ne, keep rows where any value is non-zero (i.e., not all 0) with DataFrame.any, and finally filter by boolean indexing:
df = df[df.loc[:, 'e':].assign(b = df['b']).ne(0).any(axis=1)]
print(df)
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
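The same filter, reproduced end-to-end as a quick runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['NA', 'W', 'Q', 'M'], 'b': [0, 0, 4, 2],
                   'c': [0, 12, 0, 2], 'd': [22, 3, 34, 12],
                   'e': [0, 0, 3, 6], 'f': [0, 2, 0, 0], 'h': [0, 1, 1, 0]})

# keep a row when column b, or any column from e to the end, is non-zero
mask = df.loc[:, 'e':].assign(b=df['b']).ne(0).any(axis=1)
out = df[mask]
```

Only row 0 has b == 0 and zeros all the way from e onward, so it is the only row dropped.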

Pandas DataFrame - count 0s in every row

I have dataframe that looks like this
x = pd.DataFrame.from_dict({'A':[1,2,0,4,0,6], 'B':[0, 0, 0, 44, 48, 81], 'C':[1,0,1,0,1,0]})
(assume it might have other columns).
I want to add a column that specifies, for each row, how many 0s there are in the specific columns A, B, C.
A B C num_zeros
0 1 0 1 1
1 2 0 0 2
2 0 0 1 2
3 4 44 0 1
4 0 48 1 1
5 6 81 0 1
Create a boolean dataframe using ==, then use sum with axis=1:
x['num_zeros'] = (x == 0).sum(axis=1)
Output:
A B C num_zeros
0 1 0 1 1
1 2 0 0 2
2 0 0 1 2
3 4 44 0 1
4 0 48 1 1
5 6 81 0 1
Now, if you want to define explicitly which columns to count (for example, only B and C), you can use this:
x['Num_zeros_in_BC'] = (x == 0)[['B', 'C']].sum(axis=1)
Output:
A B C num_zeros Num_zeros_in_BC
0 1 0 1 1 1
1 2 0 0 2 2
2 0 0 1 2 1
3 4 44 0 1 1
4 0 48 1 1 0
5 6 81 0 1 1
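Both counts together as a runnable sketch (using the keyword form axis=1, since the positional form is deprecated in newer pandas):

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 0, 4, 0, 6],
                  'B': [0, 0, 0, 44, 48, 81],
                  'C': [1, 0, 1, 0, 1, 0]})

# a boolean frame marks every zero; summing across columns counts them per row
x['num_zeros'] = (x == 0).sum(axis=1)
x['Num_zeros_in_BC'] = (x[['B', 'C']] == 0).sum(axis=1)
```

Computing num_zeros first is safe for the second line because the comparison there is restricted to B and C before summing.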

Need to transpose a pandas dataframe

I have a DataFrame that looks like this:
col1 id
0 a 10
1 b 20
2 c 30
3 b 10
4 d 10
5 a 30
6 e 40
My desired output is this:
a b c d e
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
I tried this code:
import pandas as pd

df['dummies'] = 1
df.pivot(index='id', columns='col1', values='dummies')
I get an error:
137
138 if mask.sum() < len(self.index):
--> 139 raise ValueError('Index contains duplicate entries, '
140 'cannot reshape')
141
ValueError: Index contains duplicate entries, cannot reshape
There are duplicate id's because multiple values in col1 can be attributed to a single id.
How can I achieve the desired output?
Thanks!
You could use pd.crosstab
In [329]: pd.crosstab(df.id, df.col1)
Out[329]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
Or, use pd.pivot_table
In [336]: df.pivot_table(index='id', columns='col1', aggfunc=len, fill_value=0)
Out[336]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
Or, use groupby and unstack
In [339]: df.groupby(['id', 'col1']).size().unstack(fill_value=0)
Out[339]:
col1 a b c d e
id
10 1 1 0 1 0
20 0 1 0 0 0
30 1 0 1 0 0
40 0 0 0 0 1
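The crosstab route, exercised on the question's data as a sketch; the .gt(0) step at the end is my addition, not part of the answer, for the case where an (id, col1) pair can repeat and you want a strict 0/1 indicator rather than a count:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcbdae'),
                   'id': [10, 20, 30, 10, 10, 30, 40]})

table = pd.crosstab(df.id, df.col1)   # counts occurrences per (id, col1)
indicator = table.gt(0).astype(int)   # cap counts at 1
```

With no duplicate pairs in this sample, table and indicator are identical.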
