How to sum values of pandas dataframe by rows? - python

I have a data frame like this:
A
0
1
0
2
and I would like to sum the values "so far" of the dataframe in a cumulative format, so if A increases by 1 then I would like the sum to increase by 1 as well, as so:
A Sum
0 0
1 1
0 1
2 2
I have to keep a record of when this change occurs for the analysis, so I can't just sum the entire column at once.
I thought about doing:
df = df.assign(A_before=df.A.shift(1))
df['change'] = (df.A - df.A_before)
df['sum'] = df['A'] + df['A_before']
but it's not adding the sum values from the previous rows as well, only the values in the same rows.
Any solutions? Thank you.

You can do diff with cumsum
df.A.diff().ge(1).cumsum()
0 0
1 1
2 1
3 2
Name: A, dtype: int64
df['sum']=df.A.diff().ge(1).cumsum()

Related

Incrementing row values without looping [duplicate]

I have a pandas dataframe defined as:
A B SUM_C
1 1 10
1 2 20
I would like to do a cumulative sum of SUM_C and add it as a new column to the same dataframe. In other words, my end goal is to have a dataframe that looks like below:
A B SUM_C CUMSUM_C
1 1 10 10
1 2 20 30
Using cumsum in pandas on group() shows the possibility of generating a new dataframe where column name SUM_C is replaced with cumulative sum. However, my ask is to add the cumulative sum as a new column to the existing dataframe.
Thank you
Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:
df['CUMSUM_C'] = df['SUM_C'].cumsum()
Result:
df
Out[34]:
A B SUM_C CUMSUM_C
0 1 1 10 10
1 1 2 20 30

Pandas groupby cumulative sum ignore current row

I know there's some questions about this topic (like Pandas: Cumulative sum of one column based on value of another) however, none of them fuull fill my requirements.
Let's say I have a dataframe like this one
.
I want to compute the cumulative sum of Cost grouping by month, avoiding taking into account the current value, in order to get the Desired column.By using groupby and cumsum I obtain colum CumSum
.
The DDL to generate the dataframe is
df = pd.DataFrame({'Month': [1,1,1,2,2,1,3],
'Cost': [5,8,10,1,3,4,1]})
IIUC you can use groupby.cumsum and then just subtract cost;
df['cumsum_'] = df.groupby('Month').Cost.cumsum().sub(df.Cost)
print(df)
Month Cost cumsum_
0 1 5 0
1 1 8 5
2 1 10 13
3 2 1 0
4 2 3 1
5 1 4 23
6 3 1 0
You can do the following:
df['agg']=df.groupby('Month')['Cost'].shift().fillna(0)
df['Cumsum']=df['Cost']+df['agg']

Getting maximum values in a column

My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Using idxmax after create another group key by diff and cumsum
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country':'first',
'Code':'first',
'Duration':'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series.
You might wanna check this link , it might be the answer you're looking for :
pandas groupby where you get the max of one column and the min of another column . It goes as :
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()

pandas: adding rows in a column

i have a dataframe like this,
Count
1
0
1
1
1
I want to add N and N+1 in count column and store it in N, is it possible to do in pandas way?
result should like this, technically it is cumulative sum:
Counts
1
1
2
3
4
You can use the cumulative sum function, cumsum().
df = pd.DataFrame([1, 0, 1, 1,1], columns=['Count'])
df['Counts'] = df['Count'].cumsum()
print(df)
giving you the desired output.
Count Counts
0 1 1
1 0 1
2 1 2
3 1 3
4 1 4

Group by value of sum of columns with Pandas

I got lost in Pandas doc and features trying to figure out a way to groupby a DataFrame by the values of the sum of the columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have columns labels equals to the sum of the columns it summed. Like this :
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the good direction ? Thanks in advance !
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and, c) and 9 (column d) . You want to group the columns (axis=1), and take the sum of each group.
Because pandas is designed with database concepts in mind, it's really expected information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print df.groupby('totals').sum().transpose()
#totals 1 9
#0 2 2
#1 1 3
#2 0 4

Categories

Resources