Convert a categorical valued column to its statistical values in Python

I have a dataframe whose sample is given below.
import pandas as pd
data = {'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
        'Gender': ['Man', 'Woman', 'Transgender', 'Non-binary,Transgender',
                   'Woman,Non-binary', 'Man,Non-binary,Transgender']}
df = pd.DataFrame(data)
df
Now I want to create a column for each distinct value in the 'Gender' column; if the value is present in a row, the new column should contain 1, otherwise be empty. The final form required is shown below.
I cannot use pd.get_dummies() directly, as many rows hold multiple values (e.g. 'Non-binary,Transgender').
I thought of manually hardcoding a column for each value, but wanted to know if there is a way to automate the process.
Any help is greatly appreciated. Thanks.

You can split on ',' and explode the result to get back to a situation where get_dummies works:
>>> df_split = df[['ID']].join(df['Gender'].str.split(',')).explode('Gender')
>>> df_split
ID Gender
0 A Man
1 B Woman
2 C Transgender
3 D Non-binary
3 D Transgender
4 E Woman
4 E Non-binary
5 F Man
5 F Non-binary
5 F Transgender
>>> dummies = pd.get_dummies(df_split['Gender']).groupby(df_split['ID']).max().reset_index()
>>> dummies
ID Man Non-binary Transgender Woman
0 A 1 0 0 0
1 B 0 0 0 1
2 C 0 0 1 0
3 D 0 1 1 0
4 E 0 1 0 1
5 F 1 1 1 0
>>> df.merge(dummies, on='ID')
ID Gender Man Non-binary Transgender Woman
0 A Man 1 0 0 0
1 B Woman 0 0 0 1
2 C Transgender 0 0 1 0
3 D Non-binary,Transgender 0 1 1 0
4 E Woman,Non-binary 0 1 0 1
5 F Man,Non-binary,Transgender 1 1 1 0
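As an alternative (not from the answer above): once the split/explode step is in place, pd.crosstab can build the same indicator table in one call. A sketch on the question's data; clip(upper=1) is a guard that keeps the 0/1 indicator form even if an ID repeats a value:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Gender': ['Man', 'Woman', 'Transgender', 'Non-binary,Transgender',
                              'Woman,Non-binary', 'Man,Non-binary,Transgender']})

# One (ID, Gender) pair per row, as in the explode step above
df_split = df[['ID']].join(df['Gender'].str.split(',')).explode('Gender')

# crosstab counts the pairs; clip(upper=1) flattens any duplicates to 1
dummies = pd.crosstab(df_split['ID'], df_split['Gender']).clip(upper=1).reset_index()
out = df.merge(dummies, on='ID')
```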

Use Series.str.get_dummies, which lets you specify a separator when a cell holds multiple values, then join the result back.
pd.concat([df, df['Gender'].str.get_dummies(',').add_prefix('Gender_')], axis=1)
ID Gender Gender_Man Gender_Non-binary Gender_Transgender Gender_Woman
0 A Man 1 0 0 0
1 B Woman 0 0 0 1
2 C Transgender 0 0 1 0
3 D Non-binary,Transgender 0 1 1 0
4 E Woman,Non-binary 0 1 0 1
5 F Man,Non-binary,Transgender 1 1 1 0

Related

iterate through columns pandas dataframe and create another column based on a condition

I have a dataframe df
ID ID2 escto1 escto2 escto3
1 A 1 0 0
2 B 0 1 0
3 C 0 0 3
4 D 0 2 0
I would like to set a helper column, either using positional indexing or a wildcard on the column names ('escto*'), with something like
if df.iloc[:, 2:] > 0 then df.helper = 1
or
df.loc[df.iloc[:, 2:] > 0, 'helper'] = 1
so that the output becomes
ID ID2 escto1 escto2 escto3 helper
1 A 1 0 0 1
2 B 0 1 0 1
3 C 0 0 3 1
4 D 0 2 0 1
One option is to use the boolean output (note any(axis=1): the positional axis argument was removed in pandas 2.0):
df.assign(helper=df.filter(like='escto').gt(0).any(axis=1).astype(int))
ID ID2 escto1 escto2 escto3 helper
0 1 A 1 0 0 1
1 2 B 0 1 0 1
2 3 C 0 0 3 1
3 4 D 0 2 0 1
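A runnable sketch of the same idea, with an extra all-zero row added (an assumption, not in the question) to show the helper flag going to 0:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'ID2': list('ABCDE'),
                   'escto1': [1, 0, 0, 0, 0],
                   'escto2': [0, 1, 0, 2, 0],
                   'escto3': [0, 0, 3, 0, 0]})

# filter(like='escto') selects the wildcard columns; any(axis=1) flags rows
# where at least one of them is positive
df['helper'] = df.filter(like='escto').gt(0).any(axis=1).astype(int)
```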

Create a feature table in Python from a df

I have the following df:
id step1 step2 step3 step4 .... stepn-1, stepn, event
1 a b c null null null 1
2 b d f null null null 0
3 a d g h l m 1
Where the id is a session, the steps represent a certain path, and event is whether something specific happened
I want to create a feature store where we take all the possible steps (a, b, c, ... up to some arbitrary number) and make them the columns. The id column should remain, and each step column should hold 1 or 0 depending on whether that session hit that step. The result is below:
id a b c d e f g ... n event
1 1 1 1 0 0 0 0 0 1
2 0 1 0 0 0 1 0 0 0
3 1 0 0 1 0 0 1 1 1
I have a unique list of all the possible steps, which I assume will be used to construct the new table, but beyond that I am struggling to see how to create it.
What you are looking for is often used in machine learning, and is called one-hot encoding.
There is a pandas function specifically designed for this purpose, called pd.get_dummies().
step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]
new_df = pd.get_dummies(df[step_cols].stack()).groupby(level=0).max()
new_df[other_cols] = df[other_cols]
Output:
>>> new_df
a b c d f g h l m id event
0 1 1 1 0 0 0 0 0 0 1 1
1 0 1 0 1 1 0 0 0 0 2 0
2 1 0 0 1 0 1 1 1 1 3 1
Probably not the most elegant way:
step_cols = [col for col in df.columns if col.startswith("step")]
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
df1 = pd.DataFrame(
    (values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
    columns=values
).astype(int)
df = pd.concat([df.id, df1, df.event], axis=1)
Result for
df =
id step1 step2 step3 step4 event
0 1 a b c NaN 1
1 2 b d f NaN 0
2 3 a d g h 1
is
id a b c d f g h event
0 1 1 1 1 0 0 0 0 1
1 2 0 1 0 1 1 0 0 0
2 3 1 0 0 1 0 1 1 1
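A further option (an assumption, not from either answer above): melt the step columns into long form and pivot back with pd.crosstab, which drops the NaN padding along the way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'step1': ['a', 'b', 'a'],
                   'step2': ['b', 'd', 'd'],
                   'step3': ['c', 'f', 'g'],
                   'step4': [np.nan, np.nan, 'h'],
                   'event': [1, 0, 1]})

step_cols = [c for c in df.columns if c.startswith('step')]

# Long form: one (id, step value) pair per row; dropna removes the NaN padding
long = df.melt(id_vars='id', value_vars=step_cols).dropna(subset=['value'])

# crosstab pivots back to an id-by-step indicator table
steps = pd.crosstab(long['id'], long['value']).clip(upper=1)
out = steps.reset_index().merge(df[['id', 'event']], on='id')
```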

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If the is_something column contains only 0 and 1 values, just use sum with GroupBy.transform to fill a new column with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If other values are possible, first compare with 1, convert the booleans to integers, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype(int).groupby(df['name']).transform('sum')
Or map the per-name counts of 1s back onto the name column (names that never have a 1 will map to NaN):
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
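The multiple-values variant from the first answer can be sketched on a small frame where a plain sum would over-count (the data here is an illustration, not from the question):

```python
import pandas as pd

# 'is_something' contains a 2, so transform('sum') alone would give the wrong count
df = pd.DataFrame({'name': list('abca'), 'is_something': [2, 1, 0, 1]})

# eq(1) marks exact matches, astype(int) turns the booleans into 0/1,
# and transform('sum') broadcasts the per-name count back to every row
df['new'] = df['is_something'].eq(1).astype(int).groupby(df['name']).transform('sum')
```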

Pandas DataFrame update one column using another column

I have a two-column DataFrame df; its columns are phone and label, where label can only be 0 or 1.
Here is an example:
phone label
a 0
b 1
a 1
a 0
c 0
b 0
What I want to do is calculate the number of 1s for each type of 'phone', and replace the 'phone' column with that number.
What I came up with is groupby, but I am not familiar with it.
The answer should be:
Count the number of 1s for each 'phone':
phone count
a 1
b 1
c 0
Replace 'phone' with 'count' in the original table:
phone
1
1
1
1
0
1
Taking into account that the label column can only contain 0 or 1, you can use the .transform('sum') method:
In [4]: df.label = df.groupby('phone')['label'].transform('sum')
In [5]: df
Out[5]:
phone label
0 a 1
1 b 1
2 a 1
3 a 1
4 c 0
5 b 1
Explanation:
In [2]: df
Out[2]:
phone label
0 a 0
1 b 1
2 a 1
3 a 0
4 c 0
5 b 0
In [3]: df.groupby('phone')['label'].transform('sum')
Out[3]:
0 1
1 1
2 1
3 1
4 0
5 1
dtype: int64
You can filter and group data in pandas. For your case it would look like this. Assume the data is:
phone label
0 a 0
1 b 1
2 a 1
3 a 1
4 c 1
5 d 1
6 a 0
7 c 0
8 b 0
df.groupby(['phone','label'])['label'].count()
phone label
a 0 2
1 2
b 0 1
1 1
c 0 1
1 1
d 1 1
If you need the per-phone count of rows where label == 1, do this:
#first filter to get only label==1 rows
phone_rows_label_one_df = df[df.label==1]
#then do groupby
phone_rows_label_one_df.groupby(['phone'])['label'].count()
phone
a 2
b 1
c 1
d 1
To get the count as a new column in the dataframe, do this:
phone_rows_label_one_df.groupby(['phone'])['label'].count().reset_index(name='count')
phone count
0 a 2
1 b 1
2 c 1
3 d 1

how to use get_dummies in Pandas when different categories are concatenated into a string without a separator?

consider the following example
df=pd.DataFrame({'col':['ABC','BDE','DE',np.nan,]})
df
Out[216]:
col
0 ABC
1 BDE
2 DE
3 NaN
I want to create a dummy variable for each letter in col.
In this example, we thus have 5 dummies: A,B,C,D,E. Indeed, in the first row 'ABC' corresponds to category A and category B and category C.
Using get_dummies with an empty separator fails; each whole string is treated as a single category:
df.col.str.get_dummies(sep='')
Out[217]:
ABC BDE DE
0 1 0 0
1 0 1 1
2 0 0 1
3 0 0 0
Indeed, expected output for the first row should be
A B C D E
0 1 1 1 0 0
Do you have other ideas?
Thanks!
You can use Series.str.join to introduce a separator between each character, then use get_dummies.
df.col.str.join('|').str.get_dummies()
The resulting output:
A B C D E
0 1 1 1 0 0
1 0 1 0 1 1
2 0 0 0 1 1
3 0 0 0 0 0
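An equivalent route (an assumption, not from the answer above): turn each string into a list of characters, explode, and aggregate the dummies back per row. The NaN row comes out as all zeros:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['ABC', 'BDE', 'DE', np.nan]})

# One character per list entry; NaN cells become empty lists,
# which explode turns back into NaN rows that keep their index
chars = df['col'].apply(lambda s: list(s) if isinstance(s, str) else [])
out = (pd.get_dummies(chars.explode())
         .groupby(level=0).max()
         .astype(int))
```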
