I'm trying to create a new column group consisting of 3 sub-columns after using pivot on a dataframe, but the result is only one column.
Let's say I have the following dataframe that I pivot:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': [1, 2, 3, 4, 5, 6]})
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
Now I want an extra column group that is the sum of the two value columns baz and zoo.
My output:
df.loc[:, "baz+zoo"] = df.loc[:,'baz'] + df.loc[:,'baz']
The desired output:
I know that performing the sum and then concatenating will do the trick, but I was hoping for a neater solution.
I think if many rows or mainly many columns is better/faster create new DataFrame and add first level of MultiIndex by MultiIndex.from_product and add to original by DataFrame.join:
df1 = df.loc[:,'baz'] + df.loc[:,'zoo']
df1.columns = pd.MultiIndex.from_product([['baz+zoo'], df1.columns])
print (df1)
baz+zoo
A B C
foo
one 2 4 6
two 8 10 12
df = df.join(df1)
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
Another solution is loop by second levels and select MultiIndex by tuples, but if large DataFrame performance should be worse, the best test with real data:
for x in df.columns.levels[1]:
df[('baz+zoo', x)] = df[('baz', x)] + df[('zoo', x)]
print (df)
baz zoo baz+zoo
bar A B C A B C A B C
foo
one 1 2 3 1 2 3 2 4 6
two 4 5 6 4 5 6 8 10 12
I was able to do it this way too. I'm not sure I understand the theory, but...
df['baz+zoo'] = df['baz']+df['zoo']
df.pivot(index='foo', columns='bar', values=['baz','zoo','baz+zoo'])
Related
I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just sum and set param axis=1 to sum the rows, this will ignore none numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of column names you want to add up.
df['total']=df.loc[:,list_name].sum(axis=1)
If you want the sum for certain rows, specify the rows using ':'
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
return(frame)
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
Following syntax helped me when I have columns in sequence
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggragate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 3 years ago.
I am trying to get the index (or running count if you will) of each individual record in a groupby object into a column. I doesn't have to be a groupby, but the order has to remain the same, so for example, I want to sort and reindex by column C:
df = pd.DataFrame([[1, 2, 'Foo'],
[1, 3, 'Foo'],
[4, 6,'Bar'],
[7,8,'Bar']],
columns=['A', 'B', 'C'])
Out[72]:
A B C
0 1 2 Foo
1 1 3 Foo
2 4 6 Bar
3 7 8 Bar
My desired output would be:
Out[75]:
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
It seems like this should be really easy, but nothing I've tried really comes close without looping through the entire data frame, which I would prefer to avoid. Thanks
Try with cumcount:
>>> df = pd.DataFrame([[1, 2, 'Foo'],
... [1, 3, 'Foo'],
... [4, 6,'Bar'],
... [7,8,'Bar']],
... columns=['A', 'B', 'C'])
>>> df["sorted"]=df.groupby("C").cumcount()+1
>>> df
A B C sorted
0 1 2 Foo 1
1 1 3 Foo 2
2 4 6 Bar 1
3 7 8 Bar 2
Pretty sure this is very simple.
I am reading a csv file and have the dataframe:
Attribute A B C
a 1 4 7
b 2 5 8
c 3 6 9
I want to do a transpose to get
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9
However, when I do df.T, it results in
0 1 2
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9`
How do I get rid of the indexes on top?
You can set the index to your first column (or in general, the column you want to use as as index) in your dataframe first, then transpose the dataframe. For example if the column you want to use as index is 'Attribute', you can do:
df.set_index('Attribute',inplace=True)
df.transpose()
Or
df.set_index('Attribute').T
It works for me:
>>> data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
>>> df = pd.DataFrame(data, index=['a', 'b', 'c'])
>>> df.T
a b c
A 1 2 3
B 4 5 6
C 7 8 9
If your index column 'Attribute' is really set to index before the transpose, then the top row after the transpose is not the first row, but a title row. if you don't like it, I would first drop the index, then rename them as columns after the transpose.
Similar to the documentation example, I want to pivot the following dataframe:
foo extra bar baz
0 one x A 1
1 one x B 2
2 one x C 3
3 two y A 4
4 two y B 5
5 two y C 6
The result should be
extra A B C
one x 1 2 3
two y 4 5 6
Can this be done in a shorter way than
splitting the extra column off before pivoting
deduplicating it separately
merging it back to the pivoted data?
(I expected the pivot command to be able to do this, but my tries failed.)
Here's the code for the dataframe to play with it:
df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
'extra': ['x','x','x','y','y','y'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6]})
You can use pivot_table, pivot only accepts one column as index, column and value while pivot_table can accept multiple columns:
df.pivot_table('baz', ['foo', 'extra'], 'bar').reset_index()
#bar foo extra A B C
# 0 one x 1 2 3
# 1 two y 4 5 6
Use set_index and unstack
In [2087]: df.set_index(['foo', 'extra', 'bar'])['baz'].unstack().reset_index()
Out[2087]:
bar foo extra A B C
0 one x 1 2 3
1 two y 4 5 6
I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just sum and set param axis=1 to sum the rows, this will ignore none numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of column names you want to add up.
df['total']=df.loc[:,list_name].sum(axis=1)
If you want the sum for certain rows, specify the rows using ':'
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
return(frame)
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
Following syntax helped me when I have columns in sequence
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggragate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')