dataframe to presence/absence dataframe with 2 columns as comma separated strings

I have a dataframe with 3 columns. The first (annotations) is what I want to measure presence/absence on, and the other two (categories, CUI_desc) contain comma-separated strings of factors that I would like to become the columns of a presence/absence dataframe.
Currently, the data looks like this:
annotations                categories                               CUI_desc
heroine                    ['heroic', 'opioid', 'substance_abuse']  ['C0011892___heroin']
heroin                     ['heroic', 'opioid', 'substance_abuse']  ['C0011892___heroin']
he smoked two packs a day  ['opioid', 'substance_abuse']            ['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']
And I would like it to look like this:
annotations                heroic  opioid  substance_abuse  C0011892___heroin  C0439234___year  C0748223___QUIT  C0028040___nicotine
heroine                    1       1       1                1                  0                0                0
heroin                     1       1       1                1                  0                0                0
he smoked two packs a day  0       1       1                0                  1                1                1
I used this line of code from a similar question:
from collections import Counter
test = pd.DataFrame({k:Counter(v) for k, v in master.items()}).T.fillna(0).astype(int)
But got an undesired output:
             heroine  heroin  he smoked two packs a day
annotations  1        1       1
categories   0        0       0
CUI_desc     0        0       0
It seems to be counting how many times each annotation shows up in my dataframe. This is likely because the code above is written for a dictionary, not a dataframe.
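To see why the snippet produces that shape, note that `DataFrame.items()` iterates over (column label, column Series) pairs, so the dictionary comprehension builds one `Counter` per column rather than one per row. A minimal sketch, assuming a small stand-in frame named `master` with the same three columns (the data here is shortened for illustration):

```python
from collections import Counter

import pandas as pd

# Stand-in data (assumed): list-like cells stored as strings, as in the question.
master = pd.DataFrame({
    "annotations": ["heroine", "heroin"],
    "categories": ["['heroic', 'opioid']", "['heroic', 'opioid']"],
    "CUI_desc": ["['C0011892___heroin']", "['C0011892___heroin']"],
})

# items() yields one (column label, Series) pair per column.
labels = [label for label, _ in master.items()]
print(labels)

# Counter therefore tallies whole cell values within a column,
# not the individual tags inside each cell.
per_column = {label: Counter(col) for label, col in master.items()}
print(per_column["categories"])
```

Because each cell is counted as one opaque string, the individual tags never become columns; the cells have to be parsed and split first.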

Edit: OP clarified that each cell is a string so we need to convert it into a list first before calling explode.
Assuming the index is unique:
from ast import literal_eval
categories = pd.get_dummies(master['categories'].apply(literal_eval).explode()).groupby(level=0).sum()
cui_desc = pd.get_dummies(master['CUI_desc'].apply(literal_eval).explode()).groupby(level=0).sum()
pd.concat([master['annotations'], categories, cui_desc], axis=1)
Output:
annotations                heroic  opioid  substance_abuse  C0011892___heroin  C0028040___nicotine  C0439234___year  C0748223___QUIT
heroine                    1       1       1                1                  0                    0                0
heroin                     1       1       1                1                  0                    0                0
he smoked two packs a day  0       1       1                0                  1                    1                1
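For readers who want to run this end to end, here is a self-contained sketch of the same explode/get_dummies idea using the sample rows from the question. The helper name `presence_absence` is hypothetical, and the sketch uses `max()` instead of `sum()` when collapsing back to one row per index, on the assumption that a tag repeated inside one cell should still be reported as 1:

```python
from ast import literal_eval

import pandas as pd

# Sample data copied from the question; each list is stored as a string.
master = pd.DataFrame({
    "annotations": ["heroine", "heroin", "he smoked two packs a day"],
    "categories": [
        "['heroic', 'opioid', 'substance_abuse']",
        "['heroic', 'opioid', 'substance_abuse']",
        "['opioid', 'substance_abuse']",
    ],
    "CUI_desc": [
        "['C0011892___heroin']",
        "['C0011892___heroin']",
        "['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']",
    ],
})


def presence_absence(column):
    """Hypothetical helper: one 0/1 column per distinct tag in `column`."""
    tags = column.apply(literal_eval).explode()  # one row per (index, tag)
    dummies = pd.get_dummies(tags)               # one indicator column per tag
    # max() collapses back to one row per original index and caps at 1,
    # so a tag repeated within a cell still counts as present once.
    return dummies.groupby(level=0).max().astype(int)


result = pd.concat(
    [master["annotations"]]
    + [presence_absence(master[name]) for name in ("categories", "CUI_desc")],
    axis=1,
)
print(result)
```

As in the answer, this relies on the index of `master` being unique, since the indicator rows are regrouped by index label.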

Here is another approach using Series.value_counts
import ast
def row_value_counts(row):
    return row.apply(ast.literal_eval).explode().value_counts()
test = (
    df.set_index("annotations")
    .apply(row_value_counts, axis=1)
    .fillna(0)
    .astype(int)
    .reset_index()
)

Related

Column header equals column value pandas [duplicate]

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1

Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
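The question also mentions null values. A small sketch, assuming the missing entry is a NaN in the third row (the position is made up for illustration), suggests that `str.get_dummies` simply leaves such rows as all zeros:

```python
import numpy as np
import pandas as pd

# Toy column from the question with one entry replaced by NaN (assumed position).
df = pd.DataFrame({"col": ["a", "a,b,c", np.nan, "d", "c,d"]})

# Missing values contribute no labels, so their row has no indicator set.
dummies = df["col"].str.get_dummies(sep=",")
print(dummies)
```

If missing rows should be distinguishable from rows with no labels, one option is to keep a separate `df["col"].isna()` column alongside the dummies.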
Edit: Updating the answer to address some questions.
Qn 1: Why does the Series method get_dummies not accept the argument prefix=... while pandas.get_dummies() does?
Series.str.get_dummies is a Series-level method (as the name suggests!). We are one-hot encoding values in a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter identifies the original column.
If you want to apply a prefix to str.get_dummies, you can always use DataFrame.add_prefix:
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pd.concat to merge the one-hot encoded columns with the rest of the columns in the dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
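The two points above (prefixing and merging back) can be combined. This is a sketch under the same toy data; it drops the original column with the keyword form `drop(columns=...)`, which recent pandas versions require in place of a positional axis argument:

```python
import pandas as pd

df = pd.DataFrame({
    "other": ["x", "y", "x", "x", "q"],
    "col": ["a", "a,b,c", "a,b,d", "d", "c,d"],
})

# Build prefixed indicator columns so their origin stays visible.
dummies = df["col"].str.get_dummies(sep=",").add_prefix("col_")

# Attach them by index and drop the original string column by keyword.
out = pd.concat([df.drop(columns="col"), dummies], axis=1)
print(out)
```

Since `pd.concat(..., axis=1)` aligns on the index, this assumes `df` has not been reindexed between building the dummies and joining them.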

The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

Create new categorical variable based on multiple binary columns

I have a data frame with many binary variables and I would like to create a new variable with categorical values based on many of these binary variables
My dataframe looks like this
gov_winner corp_winner in part
1 0 0
0 1 0
0 0 1
The variable I would like to create is called winning_party and would look like this:
gov_winner corp_winner in part winning_party
1 0 0 gov
0 1 0 corp
0 0 1 in part
I started trying the following code but haven't had success yet:
harrington_citations = harrington_citations.assign(winning_party=lambda x: x['gov_winner']
== 1 then x = 'gov' else x == 0)
Using anky_91's answer I get the following error:
TypeError: can't multiply sequence by non-int of type 'str'

You can use a dot product:
df.assign(Winner_Party=df.dot(df.columns))
#df.assign(Winner_Party=df @ df.columns)
gov_winner corp_winner in_part Winner_Party
0 1 0 0 gov_winner
1 0 1 0 corp_winner
2 0 0 1 in_part

How about idxmax? Note that this selects only the first maximum; if multiple cells equal 1 in a row, you may want to try Jez's solution.
df['Winner_Party']=df.eq(1).idxmax(1)

If there is always exactly one 1 per row, use DataFrame.dot; you can also filter to only the 0/1 columns first:
df1 = df.loc[:, df.isin([0,1,'0','1']).all()].astype(int)
df['Winner_Party'] = df1.dot(df1.columns)
But if there are multiple 1s per row and you need all matched values, add a separator and then strip it:
df['Winner_Party'] = df1.dot(df1.columns + ',').str.rstrip(',')
print (df)
gov_winner corp_winner in part Winner_Party
0 1 0 0 gov_winner
1 0 1 0 corp_winner
2 0 0 1 in part
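The difference between "first match" and "all matches" is easiest to see on rows that break the one-1-per-row assumption. The sketch below uses an assumed toy frame with two extra edge-case rows, and it swaps the dot-product form for an explicit per-row join of column labels, which is a different technique that should give the same result for 0/1 data:

```python
import pandas as pd

# Toy indicator frame (assumed); the last two rows are edge cases.
df = pd.DataFrame({
    "gov_winner": [1, 0, 0, 1, 0],
    "corp_winner": [0, 1, 0, 1, 0],
    "in_part": [0, 0, 1, 0, 0],
})

flags = df.eq(1)

# idxmax returns the first column holding the row maximum, so ties
# and all-zero rows both fall back to the first matching column.
first_match = flags.idxmax(axis=1)

# Joining the labels of every flagged column keeps all matches
# and leaves all-zero rows as an empty string.
all_matches = flags.apply(lambda row: ",".join(row.index[row.to_numpy()]), axis=1)

print(pd.DataFrame({"first_match": first_match, "all_matches": all_matches}))
```

Note that `idxmax` reports `gov_winner` for the all-zero row, so it may be worth checking `flags.any(axis=1)` before trusting its result.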

Getting maximum values in a column

My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.

Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2

First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country':'first',
                   'Code':'first',
                   'Duration':'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
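The key step in these answers is turning "Code changed from the previous row" into a run label. A self-contained sketch with the data from the question, where the intermediate names `starts` and `run_id` are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["A"] * 9,
    "Code": [1, 1, 1, 1, 2, 2, 1, 1, 1],
    "Duration": [0, 1, 2, 3, 0, 1, 0, 1, 2],
})

# True wherever Code differs from the previous row (the first row is
# always True because shift() introduces a missing value there).
starts = df["Code"].ne(df["Code"].shift())

# A running total of the run starts labels each consecutive block.
run_id = starts.cumsum().rename("run")
print(run_id.tolist())  # [1, 1, 1, 1, 2, 2, 3, 3, 3]

# Aggregate per run instead of per Code value.
out = (
    df.groupby(run_id)
    .agg(Country=("Country", "first"), Code=("Code", "first"), Duration=("Duration", "max"))
    .reset_index(drop=True)
)
print(out)
```

Grouping by `run_id` rather than by `Code` is what keeps the two separate runs of Code 1 apart.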

The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series.

You might want to check the question "pandas groupby where you get the max of one column and the min of another column"; it might be the answer you're looking for. It goes like this:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
