Pandas: Replace a set of rows by their sum

I'm sure there is a neat way of doing this, but haven't had any luck finding it yet.
Suppose I have a data frame:
f = pd.DataFrame({'A':[1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C':[100, 200, 300, 400]}).T
that is, with rows indexed A, B and C.
Now suppose I want to take rows A and B, and replace them both by a single row that is their sum; and, moreover, that I want to assign a given index (say 'sum') to that replacement row (note the order of indices doesn't matter).
At the moment I'm having to do:
f.append(pd.DataFrame(f.ix[['A','B']].sum()).T).drop(['A','B'])
followed by something equally clunky to set the index of the replacement row. However, I'm curious to know if there's an elegant, one-line way of doing both of these steps?

Do this (DataFrame.append was removed in pandas 2.0, so this needs an older version):
In [79]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop([0, 1]).set_index(pd.Index(['C', 'sumAB']))
Out[79]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44
Alternatively you can use Index.get_indexer for an even uglier one-liner:
In [96]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop(f.index.get_indexer(['A', 'B'])).set_index(pd.Index(['C', 'sumAB']))
Out[96]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44

Another option is to use concat:
In [11]: AB = list('AB')
First select the rows you wish to sum:
In [12]: f.loc[AB]
Out[12]:
0 1 2 3
A 1 2 3 4
B 10 20 30 40
In [13]: f.loc[AB].sum()
Out[13]:
0 11
1 22
2 33
3 44
dtype: int64
and as a row in a DataFrame (Note: this step may not be necessary in future versions...):
In [14]: pd.DataFrame({'sumAB': f.loc[AB].sum()}).T
Out[14]:
0 1 2 3
sumAB 11 22 33 44
and we want to concat with all the remaining rows:
In [15]: f.loc[f.index.difference(AB)]
Out[15]:
0 1 2 3
C 100 200 300 400
In [16]: pd.concat([pd.DataFrame({'sumAB': f.loc[AB].sum()}).T,
                    f.loc[f.index.difference(AB)]],
                   axis=0)
Out[16]:
0 1 2 3
sumAB 11 22 33 44
C 100 200 300 400
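On pandas 2.0 and later, where DataFrame.append is no longer available, the same idea can be sketched with DataFrame.drop and pd.concat. The helper name replace_rows_by_sum is hypothetical and only used here for illustration; it is not a pandas function.

```python
import pandas as pd

def replace_rows_by_sum(frame, labels, new_label):
    """Return a copy of `frame` with rows `labels` replaced by one summed row.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Sum the selected rows column-wise, name the result, and make it a 1-row frame.
    summed = frame.loc[labels].sum().rename(new_label).to_frame().T
    # Keep every row whose label is not being replaced.
    remaining = frame.drop(index=labels)
    return pd.concat([remaining, summed])

f = pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': [10, 20, 30, 40],
                  'C': [100, 200, 300, 400]}).T
result = replace_rows_by_sum(f, ['A', 'B'], 'sumAB')
print(result)
```

The summed row is appended last, which is fine here because the question says the order of the indices does not matter.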

Related

How to create a new column from merging two or more column?

I created a sample data frame like this:
A B A+B
0 1 2 3
1 9 60 69
2 20 400 420
I want to display the calculation like this (as in my last question, but without the rolling-window part this time):
A B A+B Equation
0 1 2 3 1+2
1 9 60 69 9+60 #the expectation
2 20 400 420 20+400
Assuming column A and Column B is created from separated columns like this:
d={'A':[1,9,20],'B':[2,60,400]}
And here's the code that I tried:
df['A+B']=df['A']+df['B']
df['Process']=str(df['A'])+str(df['B'])
Here's the output:
A B
0 1 2
1 9 60
2 20 400
A B A+B Process
0 1 2 3 0 1\n1 9\n2 20\nName: A, dtype: int...
1 9 60 69 0 1\n1 9\n2 20\nName: A, dtype: int...
2 20 400 420 0 1\n1 9\n2 20\nName: A, dtype: int...
Is there any step that I missed?
As Henry suggested, the best way to achieve what you want is:
df['Process'] = df['A'].astype(str) + '+' + df['B'].astype(str)
df
A B A+B Process
0 1 2 3 1+2
1 9 60 69 9+60
2 20 400 420 20+400
Alternatively, you can use DataFrame.apply:
df['Process']= df.apply(lambda row : f"{row['A']}+{row['B']}", axis=1)
It works for me.
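As for why the original attempt fails: str(df['A']) turns the whole column into a single string (its printed representation), and that one string is then assigned to every row. astype(str) converts element by element instead. A minimal sketch, using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 9, 20], 'B': [2, 60, 400]})

# str() on a Series gives ONE string: the printed representation of the column.
whole_column = str(df['A'])

# astype(str) converts each element, so '+' concatenates row by row.
df['Process'] = df['A'].astype(str) + '+' + df['B'].astype(str)
print(df['Process'].tolist())
```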

Keep some groups from GroupBy using list of indices

Hello StackOverflowers!
I have a pandas DataFrame
df = pd.DataFrame({
'A':[1,1,2,1,3,3,1,6,3,5,1],
'B':[10,10,300,10,30,40,20,10,30,45,20],
'C':[20,20,20,20,15,20,15,15,15,15,15],
'D':[10,20,30,40,80,10,20,50,30,10,70],
'E':[10,10,10,22,22,3,4,5,9,0,1]
})
Then I groupby it on some columns
groups = df.groupby(['A', 'B', 'C'])
I would like to select/filter the original data based on the groupby indices.
For example I would like to get 3 random combinations out of the groupby
Any ideas?
What I could do is
groups = df.groupby(['A', 'B', 'C'])
indices = [1, 4, 3]
pd.concat([[df_group for names, df_group in groups][i] for i in indices])
which results in:
Out[24]:
A B C D E
6 1 20 15 20 4
10 1 20 15 70 1
5 3 40 20 10 3
4 3 30 15 80 22
8 3 30 15 30 9
I wonder if there is a more elegant way, maybe one already implemented in pd.groupby()?
Instead of iterating along all groups len(indices) times and indexing on the respective indices value each time, get a list of the groups' keys from the dictionary returned by GroupBy.groups, and do single calls to GroupBy.get_group for each index:
keys = list(groups.groups.keys())
# [(1, 10, 20), (1, 20, 15), (2, 300, 20)...
pd.concat([groups.get_group(keys[i]) for i in indices])
A B C D E
6 1 20 15 20 4
10 1 20 15 70 1
5 3 40 20 10 3
4 3 30 15 80 22
8 3 30 15 30 9
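For the "3 random combinations" part of the question, one possible sketch is to sample the group keys and then fetch each chosen group with GroupBy.get_group. The seed and the variable names chosen and selected are assumptions made for this example only.

```python
import random
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'B': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'C': [20, 20, 20, 20, 15, 20, 15, 15, 15, 15, 15],
    'D': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'E': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})
groups = df.groupby(['A', 'B', 'C'])

# Choose 3 distinct group keys at random (seeded only for reproducibility).
rng = random.Random(0)
chosen = rng.sample(list(groups.groups.keys()), k=3)

# Collect the rows of the chosen groups, keeping the original row labels.
selected = pd.concat([groups.get_group(key) for key in chosen])
print(selected)
```

Sampling keys rather than rows guarantees that every selected group comes back complete.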

Compute number of occurrences of each value and sum another column in Pandas

I have a pandas dataframe with some columns in it. The column I am interested in is something like this,
df['col'] = ['A', 'A', 'B', 'C', 'B', 'A']
I want to make another column, say col_count, that shows how many times the value in col occurs from that index to the end of the column.
The first A in the column should have the value 3 because there are 3 occurrences of A from that index onward. The second A will have the value 2, and so on.
Finally, I want to get the following result,
col col_count
0 A 3
1 A 2
2 B 2
3 C 1
4 B 1
5 A 1
How can I do this efficiently in pandas? I was able to do it by looping through the dataframe and counting that value in a sliced dataframe.
Is there an efficient method to do this, preferably without loops?
Another part of the question is, I have another column like this along with col,
df['X'] = [10, 40, 10, 50, 30, 20]
I want to sum up this column in the same fashion I wanted to count the column col.
For instance, At index 0, I will have 10 + 40 + 20 as the sum. At index 1, the sum will be 40 + 20. In short, instead of counting, I want to sum up another column.
The result will be like this,
col col_count X X_sum
0 A 3 10 70
1 A 2 40 60
2 B 2 10 40
3 C 1 50 50
4 B 1 30 30
5 A 1 20 20
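Use groupby on the reversed frame with cumcount and cumsum. The results align back to df by index, so reversing the rows does not change where the values land.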
g = df[::-1].groupby('col')
df['col_count'] = g.cumcount().add(1)
df['X_sum'] = g['X'].cumsum()
print(df)
Output:
col X col_count X_sum
0 A 10 3 70
1 A 40 2 60
2 B 10 2 40
3 C 50 1 50
4 B 30 1 30
5 A 20 1 20

Groupby in pandas between rows based on a condition

Let's say that I have the following dataframe:
name number
0 A 100
1 B 200
2 B 30
3 A 20
4 B 30
5 A 40
6 A 50
7 A 100
8 B 10
9 B 20
10 B 30
11 A 40
What I would like to do is to merge all the successive rows where name == 'B', between two rows with name == 'A' and get the corresponding sum. So, I would like my final output to look like that:
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
We can use a little groupby trick here. Create a mask of A's and then shift each subsequent run of B's into its own group. This answer assumes that your name Series contains just A's and B's.
c = df['name'].eq('A')
m1 = c.cumsum()
m = m1.where(c, m1 + m1.max())
df.groupby(m, sort=False, as_index=False).agg({'name': 'first', 'number': 'sum'})
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
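To see how the grouping key is built, here is a self-contained trace of the intermediate values on the sample data. It is only a sketch of the same trick: the key is renamed to block (a name chosen for this example) and reset_index is used instead of as_index=False, to avoid any clash with the existing name column.

```python
import pandas as pd

df = pd.DataFrame({
    'name': list('ABBABAAABBBA'),
    'number': [100, 200, 30, 20, 30, 40, 50, 100, 10, 20, 30, 40],
})

c = df['name'].eq('A')          # True on every A row
m1 = c.cumsum()                 # running count of A's, so each A gets its own id
m = m1.where(c, m1 + m1.max())  # B rows: id of the preceding A, moved past all A ids

print(m.tolist())  # [1, 7, 7, 2, 8, 3, 4, 5, 11, 11, 11, 6]

out = (df.groupby(m.rename('block'), sort=False)
         .agg({'name': 'first', 'number': 'sum'})
         .reset_index(drop=True))
print(out)
```

Consecutive B rows share the id of the A that precedes them, so they collapse into one group, while every A row keeps a unique id and stays on its own.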
A clumsier attempt - but since I've done it might as well post.
This is just a basic for loop with a while:
for i in df.index:
    if i in df.index and df.loc[i, 'name'] == 'B':
        # Absorb each following B row into row i; stop at the last row.
        while i + 1 in df.index and df.loc[i + 1, 'name'] == 'B':
            df.loc[i, 'number'] += df.loc[i + 1, 'number']
            df = df.drop(i + 1).reset_index(drop=True)
It's very straightforward (and hence inefficient I imagine): if B, if next row is also B, add next row to this row's number and delete next row.

How to sort a dataframe based on idxmax?

I have a dataframe like this:
A B C
0 1 2 1
1 3 -8 10
2 10 3 -20
3 50 7 1
I would like to rearrange its columns based on the index of the maximal absolute value in each column. In column A, the maximal absolute value is in row 3, in B it is row 1 and in C it is row 2 which means that my new dataframe should be in the order B C A.
Currently I do this as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})
indMax = abs(df).idxmax(axis=0)
df = df.iloc[:, np.argsort(indMax.values)]
So I first determine the indices of the maximal value per column which are stored in indMax, then I sort it and rearrange the dataframe accordingly which gives me the desired output:
B C A
0 2 1 1
1 -8 10 3
2 3 -20 10
3 7 1 50
My question is whether there is the possibility to pass the function idxmax directly to a sort function and change the dataframe inplace.
IIUC the following does what you want:
In [69]:
df.loc[:,df.abs().idxmax().sort_values().index]
Out[69]:
B C A
0 2 1 1
1 -8 10 3
2 3 -20 10
3 7 1 50
Here we determine the idxmax in the abs values, sort the values and pass the index to index the df.
As to sorting in place you can just assign back to the df.
For a pre 0.17.0 version the following works:
In [75]:
df.ix[:,df.abs().idxmax().sort(inplace=False).index]
Out[75]:
B C A
0 2 1 1
1 -8 10 3
2 3 -20 10
3 7 1 50
This is ugly, but it seems to work using reindex_axis (removed in pandas 1.0; reindex(columns=...) is the replacement):
>>> import numpy as np
>>> df.reindex_axis(df.columns[list(np.argsort(abs(df).idxmax(axis=0)))], axis=1)
B C A
0 2 1 1
1 -8 10 3
2 3 -20 10
3 7 1 50
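Since .ix and reindex_axis are no longer available in pandas 1.0 and later, here is a minimal sketch of the same reordering for current versions. It uses plain column selection and assigns the result back, as suggested above for the in-place case; the variable name order is chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})

# Row label of the largest absolute value in each column, sorted ascending;
# the index of the sorted Series is then the desired column order.
order = df.abs().idxmax().sort_values().index
df = df[order]
print(df.columns.tolist())  # ['B', 'C', 'A']
```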
