Pandas/Numpy shift rows into column based on existence - python

I have a dataframe like so:
col_a | col b
0 1
0 2
0 3
1 1
1 2
I want to convert it to:
col_a | 1 | 2 | 3
0 1 1 1
1 1 1 0
Unfortunately, most questions/answers revolving around this topic simply pivot it
Background: For Scikit, I want to use the existence of values in column b as an attribute/feature (like a sort of manual CountVectorizer, but for row values in this case instead of text)

Use get_dummies with creating first column to index, last use max per index for return only 1/0 values in output:
df = pd.get_dummies(df.set_index('col_a')['col b'], prefix='', prefix_sep='').max(level=0)
print (df)
1 2 3
col_a
0 1 1 1
1 1 1 0

You can use Groupby.cumcount and use it as columns for a pivoted dataframe, which can be obtained using pd.croostab and by default computes a frequency table of the factors :
cols = df.groupby('col_a').cumcount()
pd.crosstab(index = df.col_a, columns = cols)
col_0 0 1 2
col_a
0 1 1 1
1 1 1 0

Related

Column header equals column value pandas [duplicate]

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it
Series.str.get_dummies is a series level method (as the name suggests!). We are one hot encoding values in one Series (or a DataFrame column) and hence there is no need to use prefix. Pandas.get_dummies on the other hand can one hot encode multiple columns. In which case, the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use DataFrame.concat to merge one hot encoded columns with the rest of the columns in dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis = 1).drop('col', 1)
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept prefix parameter, but you can rename the column names of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

Get Dummies for Column with Multiple Variables [duplicate]

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it
Series.str.get_dummies is a series level method (as the name suggests!). We are one hot encoding values in one Series (or a DataFrame column) and hence there is no need to use prefix. Pandas.get_dummies on the other hand can one hot encode multiple columns. In which case, the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use DataFrame.concat to merge one hot encoded columns with the rest of the columns in dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis = 1).drop('col', 1)
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept prefix parameter, but you can rename the column names of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

python pandas group by and aggregate columns

I am using panda version 0.23.0. I want to use data frame group by function to generate new aggregated columns using [lambda] functions..
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 2 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUNM of amount when Flag = 0
Flag1_Amount_SUM is SUNM of amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new a new data frame. This means I will have to this for all columns and them merge them together into a new data frame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg by dictionary by column names with aggregate function, then reshape by unstack, flatten MultiIndex of columns, rename columns and last reset_index:
df = (df.groupby(["ID", "Flag"])
.agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
.unstack(fill_value=0))
#python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
#python below
#df.columns = [f'{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
'Flag1':'Flag1_Count',
'Amount0':'Flag0_Amount_SUM',
'Amount1':'Flag1_Amount_SUM',
'User0':'Flag0_User_Count',
'User1':'Flag1_User_Count',
}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0

Encoding a dataframe of lists [duplicate]

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it
Series.str.get_dummies is a series level method (as the name suggests!). We are one hot encoding values in one Series (or a DataFrame column) and hence there is no need to use prefix. Pandas.get_dummies on the other hand can one hot encode multiple columns. In which case, the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use DataFrame.concat to merge one hot encoded columns with the rest of the columns in dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis = 1).drop('col', 1)
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept prefix parameter, but you can rename the column names of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row, and append that info as a new column. I would also like to know the name of the column where the maximum value is located. And I would like to add another column to the existing dataframe containing the name of the column where the max value can be found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the add another column to the existing dataframe containing the name of the column part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.ix[:,:'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x==x.max()]),axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a

Categories

Resources