Loop column names in get_dummies for pandas? - python

For pandas I have written the code below in order to convert all categorical features. However after I run it on my data set and check data types, nothing changes.
Thank you in advance.
Code:
def dummy_conv(data):
    names = data.select_dtypes(exclude=['number']).columns
    for c in names:
        data = pd.get_dummies(data, columns=[c], drop_first=True)
dummy_conv(data_train)
data_train.dtypes # object features are not converted

Looping is not necessary: filter by the list of column names, and do not forget to return the result:
data_train = pd.DataFrame({'A': list('abcdef'),
                           'B': [4, 5, 4, 5, 5, 4],
                           'C': [7, 8, 9, 4, 2, 3],
                           'D': [1, 3, 5, 7, 1, 0],
                           'E': [5, 3, 6, 9, 2, 4],
                           'F': list('aaabbb')})
print (data_train)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
def dummy_conv(data):
    names = data.select_dtypes(exclude=['number']).columns
    return pd.get_dummies(data[names], drop_first=True)
df = dummy_conv(data_train)
print (df)
A_b A_c A_d A_e A_f F_b
0 0 0 0 0 0 0
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 1
4 0 0 0 1 0 1
5 0 0 0 0 1 1
To keep the numeric columns and convert only the non-numeric ones, pass the whole DataFrame:
def dummy_conv(data):
    return pd.get_dummies(data, drop_first=True)
    # same output as:
    # names = data.select_dtypes(exclude=['number']).columns
    # return pd.get_dummies(data, columns=names, drop_first=True)
df = dummy_conv(data_train)
print (df)
B C D E A_b A_c A_d A_e A_f F_b
0 4 7 1 5 0 0 0 0 0 0
1 5 8 3 3 1 0 0 0 0 0
2 4 9 5 6 0 1 0 0 0 0
3 5 4 7 9 0 0 1 0 0 1
4 5 2 1 2 0 0 0 1 0 1
5 4 3 0 4 0 0 0 0 1 1
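A note on why the function in the question appeared to do nothing: pd.get_dummies returns a new DataFrame rather than modifying its input, and assigning to the parameter name inside a function only rebinds a local variable. The sketch below uses a shortened version of the sample data; the name dummy_conv_no_return is made up for illustration, and dtype=int is passed only so the dummy columns hold 0/1 on pandas versions whose default dummy dtype is boolean.

```python
import pandas as pd

data_train = pd.DataFrame({'A': list('abcdef'),
                           'B': [4, 5, 4, 5, 5, 4],
                           'F': list('aaabbb')})

def dummy_conv_no_return(data):
    # Rebinds the local name only; the caller's DataFrame is untouched
    data = pd.get_dummies(data, drop_first=True, dtype=int)

def dummy_conv(data):
    # Returns the new DataFrame so the caller can keep it
    return pd.get_dummies(data, drop_first=True, dtype=int)

dummy_conv_no_return(data_train)
print(data_train.columns.tolist())  # still the original columns

converted = dummy_conv(data_train)
print(converted.columns.tolist())
```

The caller has to keep the returned value, for example with data_train = dummy_conv(data_train).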

Related

How to one-hot-encode matrix of sentences at the character level?

There is a dataframe:
0 1 2 3
0 a c e NaN
1 b d NaN NaN
2 b c NaN NaN
3 a b c d
4 a b NaN NaN
5 b c NaN NaN
6 a b NaN NaN
7 a b c e
8 a b c NaN
9 a c e NaN
I would like to one-hot encode it like this:
a c e b d
0 1 1 1 0 0
1 0 0 0 1 1
2 0 1 0 1 0
3 1 1 0 1 1
4 1 0 0 1 0
5 0 1 0 1 0
6 1 0 0 1 0
7 1 1 1 1 0
8 1 1 0 1 0
9 1 1 1 0 0
pd.get_dummies does not work here, because it actually encodes each column independently. How can I get this? By the way, the order of the columns doesn't matter.
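Try this (pandas 2.0 removed the level argument of max, so group by the index level instead of calling max(level=0)):
df.stack().str.get_dummies().groupby(level=0).max()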
Out[129]:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
One way is to use str.join and str.get_dummies:
one_hot = df.apply(lambda x: "|".join([i for i in x if pd.notna(i)]), axis=1).str.get_dummies()
print(one_hot)
Output:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
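For reference, pd.get_dummies itself can also be made to work here if the frame is first stacked into a single Series, so that all values share one set of categories. This is a minimal sketch on a three-row subset of the data; dtype=int is an assumption made so the result holds 0/1 rather than booleans on newer pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 'c', 'e', np.nan],
                   ['b', 'd', np.nan, np.nan],
                   ['a', 'b', 'c', 'd']])

# One entry per (row, column) cell; missing cells produce no indicator
stacked = df.stack()

# Encode every cell, then collapse back to one line per original row
one_hot = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()
print(one_hot)
```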

Python pandas: add new columns based on the values of an existing column, and set the new columns to 1 or 0

I have a dataframe named df as following:
ticker class_n
1 a
2 b
3 c
4 d
5 e
6 f
7 a
8 b
............................
I want to add new columns to this dataframe, one for each unique value of class_n (that is, with no repeats). Each new column should hold 1 where class_n equals the column name and 0 otherwise.
For example, I want to get the following dataframe:
ticker class_n a b c d e f
1 a 1 0 0 0 0 0
2 b 0 1 0 0 0 0
3 c 0 0 1 0 0 0
4 d 0 0 0 1 0 0
5 e 0 0 0 0 1 0
6 f 0 0 0 0 0 1
7 a 1 0 0 0 0 0
8 b 0 1 0 0 0 0
My code is following:
lst_class = list(set(list(df['class_n'])))
for cla in lst_class:
    df[c] = 0
    df.loc[df['class_n'] is cla, cla] =1
but there is error:
KeyError: 'cannot use a single bool to index into setitem'
Thanks!
Use pd.get_dummies
df.join(pd.get_dummies(df.class_n))
ticker class_n a b c d e f
0 1 a 1 0 0 0 0 0
1 2 b 0 1 0 0 0 0
2 3 c 0 0 1 0 0 0
3 4 d 0 0 0 1 0 0
4 5 e 0 0 0 0 1 0
5 6 f 0 0 0 0 0 1
6 7 a 1 0 0 0 0 0
7 8 b 0 1 0 0 0 0
Or the same thing but a little more manually
f, u = pd.factorize(df.class_n.values)
d = pd.DataFrame(np.eye(u.size, dtype=int)[f], df.index, u)
df.join(d)
ticker class_n a b c d e f
0 1 a 1 0 0 0 0 0
1 2 b 0 1 0 0 0 0
2 3 c 0 0 1 0 0 0
3 4 d 0 0 0 1 0 0
4 5 e 0 0 0 0 1 0
5 6 f 0 0 0 0 0 1
6 7 a 1 0 0 0 0 0
7 8 b 0 1 0 0 0 0
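As for the loop in the question: the error comes from `is`, which tests object identity and yields a single bool, whereas `==` compares element by element and yields a boolean Series. The loop also assigns to df[c] while the loop variable is cla. A minimal corrected sketch of the loop, assuming the eight-row sample shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'ticker': [1, 2, 3, 4, 5, 6, 7, 8],
                   'class_n': list('abcdefab')})

for cla in sorted(df['class_n'].unique()):
    # Elementwise comparison gives a boolean Series; cast it to 0/1
    df[cla] = (df['class_n'] == cla).astype(int)

print(df)
```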

Convert pandas dataframe to series

Is there a way to convert a pandas dataframe to a series with a multiindex? The dataframe's columns could be multi-indexed too.
The code below works, but only for a multiindex with labels (named levels).
In [163]: d
Out[163]:
a  0     1
b  0  1  0  1
a  0  0  0  0
b  1  2  3  4
c  2  4  6  8
In [164]: d.stack(d.columns.names)
Out[164]:
   a  b
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    2
   1  0    3
      1    4
c  0  0    2
      1    4
   1  0    6
      1    8
dtype: int64
I think you can use nlevels to find the number of levels in the MultiIndex, then pass a range of that length to stack:
print (d.columns.nlevels)
2
#for python 3 add `list`
print (list(range(d.columns.nlevels)))
[0, 1]
print (d.stack(list(range(d.columns.nlevels))))
   a  b
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    2
   1  0    3
      1    4
c  0  0    2
      1    4
   1  0    6
      1    8
dtype: int64
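A self-contained version of the above, rebuilding the example frame so it can be run directly. The construction of d is an assumption reconstructed from the printed output; stacking by level position rather than by name is what makes it independent of whether the column levels are named.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['a', 'b'])
d = pd.DataFrame([[0, 0, 0, 0], [1, 2, 3, 4], [2, 4, 6, 8]],
                 index=['a', 'b', 'c'], columns=columns)

# Level positions work whether or not the column levels are named
s = d.stack(list(range(d.columns.nlevels)))
print(s)
```

Depending on the pandas version, this call may emit a warning about a newer stack implementation; the resulting Series is indexed by (row label, level a, level b) either way.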

Sequence number groupby ID with reset

I am looking for a way to generate a sequence of numbers that resets on every break.
Example
ID VAR
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
B 1
B 1
B 1
B 0
B 0
B 0
B 0
Each time VAR is 1 and the ID is the same as before, the counter starts counting up.
But if the ID changes or VAR is 0, it starts again from 0.
Desired output
ID VAR DESIRED
A 0 0
A 0 0
A 1 1
A 1 2
A 0 0
A 0 0
A 1 1
A 1 2
B 1 1
B 1 2
B 1 3
B 0 0
B 0 0
B 0 0
B 0 0
You can create an intermediate index that increments whenever VAR changes, then group by this index and ID and take the cumulative sum of VAR:
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()
df['DESIRED'] = df.groupby(['ID','ix'])['VAR'].cumsum()
In [21]: df
Out[21]:
ID VAR ix DESIRED
0 A 0 0 0
1 A 0 0 0
2 A 1 1 1
3 A 1 1 2
4 A 0 2 0
5 A 0 2 0
6 A 1 3 1
7 A 1 3 2
8 B 1 3 1
9 B 1 3 2
10 B 1 3 3
11 B 0 4 0
12 B 0 4 0
13 B 0 4 0
14 B 0 4 0
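A variant of the same idea that avoids the helper column, sketched here on the sample data from the question. The name block is made up for illustration, and the cumulative sum only acts as a counter because VAR is assumed to contain nothing but 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'ID': list('AAAAAAAA') + list('BBBBBBB'),
                   'VAR': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]})

# A new block starts whenever VAR differs from the previous row
block = df['VAR'].ne(df['VAR'].shift()).cumsum()

# Grouping by ID as well restarts the count when the ID changes
df['DESIRED'] = df.groupby([df['ID'], block])['VAR'].cumsum()
print(df)
```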

Convert a pandas data frame to a pandas data frame with another style

I have a data frame containing the IDs of animals and the types they belong to, as given below:
ID Class
1 1
2 1
3 0
4 4
5 3
6 2
7 1
8 0
I want to convert it to a new style with the classes on the header row, as follows:
ID  0  1  2  3  4
1      1
2      1
3   1
4               1
5            1
6         1
7      1
8   1
Can you help me do it with Python?
See get_dummies():
>>> print(df)
ID Class
0 1 1
1 2 1
2 3 0
3 4 4
4 5 3
5 6 2
6 7 1
7 8 0
>>> df2 = pd.get_dummies(df, columns=['Class'])
>>> print(df2)
ID Class_0 Class_1 Class_2 Class_3 Class_4
0 1 0 1 0 0 0
1 2 0 1 0 0 0
2 3 1 0 0 0 0
3 4 0 0 0 0 1
4 5 0 0 0 1 0
5 6 0 0 1 0 0
6 7 0 1 0 0 0
7 8 1 0 0 0 0
And if you want to get rid of "Class_" in the column headers, set both prefix and prefix_sep to the empty string:
df2 = pd.get_dummies(df, columns=['Class'], prefix='', prefix_sep='')
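If the goal is a table with ID down the side and the classes across the top, pd.crosstab is another option. This is a different technique from get_dummies, sketched here on the data from the question: it counts occurrences of each (ID, Class) pair, which amounts to a 0/1 indicator only as long as each ID appears once.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Class': [1, 1, 0, 4, 3, 2, 1, 0]})

# Rows are IDs, columns are the distinct classes, cells are counts
table = pd.crosstab(df['ID'], df['Class'])
print(table)
```

Here the column labels stay integers and ID becomes the index; call table.reset_index() if ID should be an ordinary column again.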
