I have the following df:
id step1 step2 step3 step4 ... stepn-1 stepn event
1 a b c null null null 1
2 b d f null null null 0
3 a d g h l m 1
Here id is a session, the steps represent a certain path, and event is whether something specific happened.
I want to create a feature store where we take all the possible steps (a, b, c, ... up to some arbitrary number) and make them the columns. The id column should remain, and each step column should hold a 1 or 0 depending on whether that session hit that step. The result is below:
id a b c d e f g ... n event
1 1 1 1 0 0 0 0 0 1
2 0 1 0 0 0 1 0 0 0
3 1 0 0 1 0 0 1 1 1
I have a unique list of all the possible steps, which I assume will be used to construct the new table, but beyond that I am struggling to see how to create this.
What you are looking for is often used in machine learning, and is called one-hot encoding.
There is a pandas function specifically designed for this purpose, called pd.get_dummies().
import pandas as pd

step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]

# one-hot encode the stacked step values, then collapse back to one row per session
new_df = pd.get_dummies(df[step_cols].stack()).groupby(level=0).max()
new_df[other_cols] = df[other_cols]
Output:
>>> new_df
a b c d f g h l m id event
0 1 1 1 0 0 0 0 0 0 1 1
1 0 1 0 1 1 0 0 0 0 2 0
2 1 0 0 1 0 1 1 1 1 3 1
Probably not the most elegant way:
import pandas as pd

step_cols = [col for col in df.columns if col.startswith("step")]
# all distinct step values across the step columns
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
df1 = pd.DataFrame(
    (values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
    columns=values
).astype(int)
df = pd.concat([df.id, df1, df.event], axis=1)
Result for
df =
id step1 step2 step3 step4 event
0 1 a b c NaN 1
1 2 b d f NaN 0
2 3 a d g h 1
is
id a b c d f g h event
0 1 1 1 1 0 0 0 0 1
1 2 0 1 0 1 1 0 0 0
2 3 1 0 0 1 0 1 1 1
I have a dataframe whose sample is given below.
import pandas as pd
data = {'ID':['A','B','C','D','E','F'],
'Gender':['Man', 'Woman', 'Transgender', 'Non-binary,Transgender', 'Woman,Non-binary',
'Man,Non-binary,Transgender']}
df = pd.DataFrame(data)
df
Now, I want to create a column for each distinct value in the 'Gender' column; if the value is present in a row, the new column should hold a 1, otherwise it should be empty. So the final form required is one indicator column per gender value.
I cannot use pd.get_dummies() directly, as many rows contain multiple comma-separated values (e.g. 'Non-binary,Transgender').
I thought of manually hardcoding a column for each value, but wanted to know if there is a way to automate the process.
Any help is greatly appreciated. Thanks.
Well you can split on , to easily come back to a situation where you can use get_dummies:
>>> df_split = df[['ID']].join(df['Gender'].str.split(',')).explode('Gender')
>>> df_split
ID Gender
0 A Man
1 B Woman
2 C Transgender
3 D Non-binary
3 D Transgender
4 E Woman
4 E Non-binary
5 F Man
5 F Non-binary
5 F Transgender
>>> dummies = pd.get_dummies(df_split['Gender']).groupby(df_split['ID']).max().reset_index()
>>> dummies
ID Man Non-binary Transgender Woman
0 A 1 0 0 0
1 B 0 0 0 1
2 C 0 0 1 0
3 D 0 1 1 0
4 E 0 1 0 1
5 F 1 1 1 0
>>> df.merge(dummies, on='ID')
ID Gender Man Non-binary Transgender Woman
0 A Man 1 0 0 0
1 B Woman 0 0 0 1
2 C Transgender 0 0 1 0
3 D Non-binary,Transgender 0 1 1 0
4 E Woman,Non-binary 0 1 0 1
5 F Man,Non-binary,Transgender 1 1 1 0
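Put together as a runnable script (same data as the question; the .astype(int) cast is only needed on newer pandas, where get_dummies returns booleans):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Gender': ['Man', 'Woman', 'Transgender',
                              'Non-binary,Transgender', 'Woman,Non-binary',
                              'Man,Non-binary,Transgender']})

# Split the comma-separated values and put each on its own row
df_split = df[['ID']].join(df['Gender'].str.split(',')).explode('Gender')

# One-hot encode, then collapse back to one row per ID
dummies = (pd.get_dummies(df_split['Gender'])
             .groupby(df_split['ID']).max()
             .astype(int)
             .reset_index())

result = df.merge(dummies, on='ID')
```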
Use Series.str.get_dummies, which allows you to specify a separator in the case of multiple values in a string, then join the result back.
pd.concat([df, df['Gender'].str.get_dummies(',').add_prefix('Gender_')], axis=1)
ID Gender Gender_Man Gender_Non-binary Gender_Transgender Gender_Woman
0 A Man 1 0 0 0
1 B Woman 0 0 0 1
2 C Transgender 0 0 1 0
3 D Non-binary,Transgender 0 1 1 0
4 E Woman,Non-binary 0 1 0 1
5 F Man,Non-binary,Transgender 1 1 1 0
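As a quick runnable check of this one-liner on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Gender': ['Man', 'Woman', 'Transgender',
                              'Non-binary,Transgender', 'Woman,Non-binary',
                              'Man,Non-binary,Transgender']})

# str.get_dummies splits each string on ',' before encoding,
# so multi-valued rows get a 1 in every matching column
out = pd.concat([df, df['Gender'].str.get_dummies(',').add_prefix('Gender_')], axis=1)
```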
I have the following dataframes:
The MAIN dataframe A:
A B
0 1 0
1 1 0
The second dataframe B:
A B
0 0 1
1 1 0
The third dataframe C:
A B C
0 1 0 0
1 0 1 1
2 0 0 1
In pandas, I want to add A, B, and C such that the resulting dataframe D keeps the same column and row structure as the MAIN dataframe A, while the values of the common rows/columns are added.
A + B + C
A B
0 2 1
1 2 1
And by union addition, I mean that any value > 1 becomes 1. So the final A + B + C is:
A B
0 1 1
1 1 1
As you can see, the structure of the first dataframe A is maintained while the values from common rows and columns are added. The common rows and columns are variable, so I need code that detects them automatically. Any ideas how to do this?
UPDATE
Please note that the dataframes can have a MultiIndex:
For example:
A
A B
0 a 2 1
1 a 2 1
C
A B C
0 a 1 0 0
0 b 1 0 0
0 b 1 0 0
1 a 0 1 1
2 c 0 0 1
In this case I am expecting: A + C to be:
A B
0 a 3 1
1 a 2 2
Thereby keeping the structure of MAIN dataframe A. Then 'binarized' to
A B
0 a 1 1
1 a 1 1
((dfA+dfB+dfC).reindex(index=dfA.index,columns=dfA.columns)>=1).astype(int)
Out[252]:
A B
0 1 1
1 1 1
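Sketched end to end with the frames from the first part of the question (reindex trims the summed result back to dfA's rows and columns, and >= 1 does the 'union' binarization):

```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 1], 'B': [0, 0]})
dfB = pd.DataFrame({'A': [0, 1], 'B': [1, 0]})
dfC = pd.DataFrame({'A': [1, 0, 0], 'B': [0, 1, 0], 'C': [0, 1, 1]})

# Addition aligns on both axes (rows/columns missing from any frame
# become NaN); reindex keeps only dfA's shape, then >= 1 binarizes
result = ((dfA + dfB + dfC)
          .reindex(index=dfA.index, columns=dfA.columns) >= 1).astype(int)
```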
Updated :
(A+C).reindex(A.index,columns=A.columns)
Out[297]:
A B
0 a 3.0 1.0
1 a 2.0 2.0
IIUC:
In [56]: (d1 + d2 + d3).dropna(how='all').dropna(axis=1, how='all').ne(0).astype(np.int8)
Out[56]:
A B
0 1 1
1 1 1
UPDATE:
In [129]: idx = A.index.intersection(C.index)
In [131]: (A.loc[idx] | C.loc[idx, A.columns]).gt(0).astype('int8')
Out[131]:
A B
0 a 1 1
1 a 1 1
Depending a bit on how much of your given structure will generalize,
In [50]: df_a | df_b | df_c.loc[df_a.index, df_a.columns]
Out[50]:
A B
0 1 1
1 1 1
Hi everybody, I need some help with Python.
I'm working with an Excel file with several rows; some of these rows have zeros in all the columns, so I need to delete those rows.
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
a 0 1 5 0
d 0 0 0 1
e 1 0 0 1
I tried something like selecting the rows that do not contain zeros, but it does not work, because it deletes every row that contains any zero at all:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC != 0) & (df.TM != 0) & (df.Lease != 0) & (df.Maint != 0) & (df.Support != 0) & (df.Other != 0)]
Then I tried the opposite, selecting only the rows that are all zeros:
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
b 0 0 0 0
c 0 0 0 0
So I made a little change and ended up with this:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC == 0) & (df.TM == 0) & (df.Lease == 0) & (df.Maint == 0) & (df.Support == 0) & (df.Other == 0)]
In this way I get just the all-zero rows. I need a way to remove these 2 rows from the original input and receive the output without them. Thanks, and sorry for the bad English, I'm working on that too.
Given your input, you can group rows by whether all their (non-id) columns are zero, then access each group, e.g.:
groups = df.groupby((df.drop('id', axis=1) == 0).all(axis=1))
all_zero = groups.get_group(True)
non_all_zero = groups.get_group(False)
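A runnable sketch using the small id/a/b/c/d sample from the question (the real frame uses columns like OTC and TM instead):

```python
import pandas as pd

df = pd.DataFrame({'id': list('abcde'),
                   'a': [0, 0, 0, 0, 1],
                   'b': [1, 0, 0, 0, 0],
                   'c': [5, 0, 0, 0, 0],
                   'd': [0, 0, 0, 1, 1]})

# True for rows where every numeric column is zero
mask = (df.drop('id', axis=1) == 0).all(axis=1)
groups = df.groupby(mask)
all_zero = groups.get_group(True)       # rows b, c
non_all_zero = groups.get_group(False)  # rows a, d, e
```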
For this dataframe:
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 f 0 0 0 0 0
6 g 0 2 1 0 2
7 h 0 0 0 0 0
8 i 1 2 2 0 2
9 j 2 2 1 2 1
Temporarily set the index:
df = df.set_index('id')
Drop rows containing all zeros and reset the index:
df = df[~(df==0).all(axis=1)].reset_index()
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 g 0 2 1 0 2
6 i 1 2 2 0 2
7 j 2 2 1 2 1
Consider the following example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['ABC', 'BDE', 'DE', np.nan]})
df
Out[216]:
col
0 ABC
1 BDE
2 DE
3 NaN
I want to create a dummy variable for each letter in col.
In this example, we thus have 5 dummies: A, B, C, D, E. For instance, in the first row 'ABC' corresponds to categories A, B, and C.
Using get_dummies fails
df.col.str.get_dummies(sep='')
Out[217]:
ABC BDE DE
0 1 0 0
1 0 1 1
2 0 0 1
3 0 0 0
Indeed, expected output for the first row should be
A B C D E
0 1 1 1 0 0
Do you have other ideas?
Thanks!
You can use Series.str.join to introduce a separator between each character, then use get_dummies.
df.col.str.join('|').str.get_dummies()
The resulting output:
A B C D E
0 1 1 1 0 0
1 0 1 0 1 1
2 0 0 0 1 1
3 0 0 0 0 0
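A runnable check, including the NaN row (str.join leaves NaN as NaN, so get_dummies gives it an all-zero row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['ABC', 'BDE', 'DE', np.nan]})

# Insert '|' between characters ('ABC' -> 'A|B|C'), then one-hot
# encode with get_dummies' default '|' separator
out = df.col.str.join('|').str.get_dummies()
```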
I have a DataFrame of authors and their papers:
author paper
0 A z
1 B z
2 C z
3 D y
4 E y
5 C y
6 F x
7 G x
8 G w
9 B w
I want to get a matrix of how many papers each pair of authors has together.
A B C D E F G
A
B 1
C 1 1
D 1 0 1
E 0 0 1 1
F 0 0 0 0 0
G 0 1 0 0 0 1
Is there a way to transform the DataFrame using pandas to get this result? Or is there a more efficient way (like with numpy) to do this so that it scales?
get_dummies, which I first reached for, isn't as convenient here as hoped; it needed an extra groupby. Instead, it's actually simpler to add a dummy column or use a custom aggfunc. For example, if we start from a df like this (note that I've added an extra paper a so that there's at least one pair that has written more than one paper together)
>>> df
author paper
0 A z
1 B z
2 C z
[...]
10 A a
11 B a
We can add a dummy tick column, pivot, and then use the "it's simply a dot product" observation from this question:
>>> df["dummy"] = 1
>>> dm = df.pivot(index="author", columns="paper").fillna(0)
>>> dout = dm.dot(dm.T)
>>> dout
author A B C D E F G
author
A 2 2 1 0 0 0 0
B 2 3 1 0 0 0 1
C 1 1 2 1 1 0 0
D 0 0 1 1 1 0 0
E 0 0 1 1 1 0 0
F 0 0 0 0 0 1 1
G 0 1 0 0 0 1 2
where the diagonal counts how many papers an author has written. If you really want to obliterate the diagonal and above, we can do that too:
>>> dout.values[np.triu_indices_from(dout)] = 0
>>> dout
author A B C D E F G
author
A 0 0 0 0 0 0 0
B 2 0 0 0 0 0 0
C 1 1 0 0 0 0 0
D 0 0 1 0 0 0 0
E 0 0 1 1 0 0 0
F 0 0 0 0 0 0 0
G 0 1 0 0 0 1 0
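A self-contained sketch of the same dot-product idea on the original ten rows. It uses pd.crosstab as an equivalent shortcut to the dummy-column + pivot route above, and rebuilds the lower-triangle frame with where rather than assigning through .values (which copy-on-write in recent pandas can make unreliable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'author': list('ABCDECFGGB'),
                   'paper':  list('zzzyyyxxww')})

# 0/1 incidence matrix: authors x papers
m = pd.crosstab(df['author'], df['paper'])

# Dot product counts shared papers; the diagonal is papers per author
co = m.dot(m.T)

# Keep only the strictly lower triangle, matching the desired output
tri = np.tril(np.ones(co.shape, dtype=bool), k=-1)
co_lower = co.where(tri, 0)
```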