Cumulative Sum that resets based on specific condition - python

Let's say I have the following data:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","First","First","First","Second","Second","Second","Second"],
'Payments':[1,2,3,4,9,3,1,6]})
I want to create a cumulative sum of Payments, but it has to reset when Flag turns from First to Second. Any help?
The output that I'm looking for is a running total that restarts at the first Second row.

Not sure if this is what you want since you didn't provide an expected output, but try this:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","Second","First","Second","First","Second","Second","First"],
'Payments':[1,2,3,4,9,3,1,6]})
# make groups using consecutive Flags
groups = df.Flag.shift().ne(df.Flag).cumsum()
# groupby the groups and cumulatively sum payments
df['cumsum'] = df.groupby(groups).Payments.cumsum()
df
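For reference, a quick way to inspect the intermediate grouper and the result for that alternating data (rows 5 and 6 share a group because the two consecutive Second flags form one run):
print(groups.tolist())        # [1, 2, 3, 4, 5, 6, 6, 7]
print(df['cumsum'].tolist())  # [1, 2, 3, 4, 9, 3, 4, 6]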

You can use df['Flag'].ne(df['Flag'].shift()).cumsum() to generate a grouper that will group by changes in the Flag column. Then, group by that, and cumsum:
df['cumsum'] = df['Payments'].groupby(df['Flag'].ne(df['Flag'].shift()).cumsum()).cumsum()
Output:
>>> df
Days Flag Payments cumsum
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
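If it helps to see the intermediate step, the grouper for the question's data is simply one id per consecutive run of Flag, which is what makes the cumulative sum restart at the First/Second boundary:
df['Flag'].ne(df['Flag'].shift()).cumsum().tolist()
# [1, 1, 1, 1, 2, 2, 2, 2]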

What is wrong with
df['Cumulative Payments'] = df.groupby('Flag')['Payments'].cumsum()
Days Flag Payments Cumulative Payments
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
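For this sample nothing differs, because each Flag value occupies a single consecutive block. If the same value can reappear later (say First, Second, First), grouping on the raw column would continue the earlier First total instead of resetting; the run-based grouper from the answers above avoids that. A minimal sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'Flag': ['First', 'First', 'Second', 'First'],
                   'Payments': [1, 2, 3, 4]})

# group by the column values: the last row continues the earlier First sum
df.groupby('Flag')['Payments'].cumsum().tolist()   # [1, 3, 3, 7]

# group by consecutive runs: the last row starts a fresh sum
runs = df['Flag'].ne(df['Flag'].shift()).cumsum()
df.groupby(runs)['Payments'].cumsum().tolist()     # [1, 3, 3, 4]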

Related

How to get number of rows since last peak Pandas

I would like to get a running count of how many rows have passed since the last peak (the running maximum). Example:
Value | Rows since Peak
-----------------------
1 0
3 0
1 1
2 2
1 3
4 0
6 0
5 1
You can compare the values to the cummax and use it for a groupby.cumcount:
df['Rows since Peak'] = (
    df.groupby(df['Value'].eq(df['Value'].cummax()).cumsum())
      .cumcount()
)
How it works:
Every time a value equals the cumulative max (df['Value'].eq(df['Value'].cummax())) we start a new group (using cumsum to assign the group ids). cumcount then numbers the rows from the start of each group.
Output:
Value Rows since Peak
0 1 0
1 3 0
2 1 1
3 2 2
4 1 3
5 4 0
6 6 0
7 5 1
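To see how the grouper is built, here is a sketch that breaks the one-liner into its intermediate pieces (values shown as comments, assuming the sample data above):
is_peak = df['Value'].eq(df['Value'].cummax())   # True where the value ties the running max
# [True, True, False, False, False, True, True, False]

group_id = is_peak.cumsum()                      # each peak starts a new group
# [1, 2, 2, 2, 2, 3, 4, 4]

df.groupby(group_id).cumcount()                  # rows counted from the start of each group
# [0, 0, 1, 2, 3, 0, 0, 1]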

How to create new column in Pandas dataframe where each row is product of previous rows

I have the following DataFrame dt:
a
0 1
1 2
2 3
3 4
4 5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I hear a lot that you shouldn't loop through rows in pandas; however, it seems to me that I would have to loop through each row and build the value recursively, as I would in plain Python.
You could use cumprod:
dt['b'] = dt['a'].cumprod()
Output:
a b
0 1 1
1 2 2
2 3 6
3 4 24
4 5 120
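Note that cumprod answers the title (a running product). The formula in the question body, B_row(t) = A_row(t-1) + A_row(t-2) + 3, only looks back a fixed number of rows, so it can be vectorized with shift rather than a loop; a minimal sketch:
# look back one and two rows; the first two rows come out as NaN (the / placeholders above)
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3
# b: [NaN, NaN, 6.0, 8.0, 10.0]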

Reverse cumulative sum until condition is met in a dataframe

I have the following dataframe:
frame=pd.DataFrame(columns=["a","b"], data=[(2,5),(2,6),(1,8),(1,1),(3,5),(3,2),(3,3)])
which looks like this:
a b
0 2 5
1 2 6
2 1 8
3 1 1
4 3 5
5 3 2
6 3 3
I want to do a reverse cumulative sum of column "b" until the condition is no longer met, namely that column "a" stays the same number (in this particular example, 3). The desired output is 10.
Based on your logic:
blocks = frame['a'].ne(frame['a'].shift()).cumsum()
frame.loc[blocks==blocks.iloc[-1], 'b'].sum()
# 10
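For clarity, blocks labels each consecutive run of "a", so selecting the last label keeps only the trailing run; a sketch of the intermediate values:
blocks = frame['a'].ne(frame['a'].shift()).cumsum()
blocks.tolist()                                   # [1, 1, 2, 2, 3, 3, 3]
blocks.iloc[-1]                                   # 3 -> the trailing run of equal "a" values
frame.loc[blocks == blocks.iloc[-1], 'b'].sum()   # 5 + 2 + 3 = 10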

How to remove duplicate rows from a dataframe when the order of the values is not important

I have a dataframe like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to remove the duplicate rows, but the order of the source and target columns is not important: rows containing the same pair in either order count as duplicates and should be removed. In this case, the expected result would be
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there any way to do this without loops?
Use frozenset and duplicated
df[~df[['source', 'target']].apply(frozenset, 1).duplicated()]
source target weight
0 1 2 5
4 3 1 6
5 1 1 6
If you want to account for unordered source/target and weight
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to be more explicit, here is more readable code.
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Should be fairly easy.
data = [[1,2,5],
        [2,1,5],
        [1,2,5],
        [1,2,7],
        [3,1,6],
        [1,1,6],
        [1,3,6],
        ]
df = pd.DataFrame(data, columns=['source','target','weight'])
You can drop the exact duplicates using drop_duplicates (keep=False removes every copy of a fully duplicated row):
print(df.drop_duplicates(keep=False))
would result in:
source target weight
1 2 1 5
3 1 2 7
4 3 1 6
5 1 1 6
6 1 3 6
Because you also want to handle the unordered source/target issue, sort each pair first:
def pair(row):
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row
df = df.apply(pair,axis=1)
and then you can use df.drop_duplicates()
source target weight
0 1 2 5
3 1 2 7
4 1 3 6
5 1 1 6
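If you want to avoid the row-wise apply, one alternative (a sketch, not from the answers above, assuming df is the original dataframe from the question) is to sort the pair columns with NumPy and drop duplicates on the normalized frame:
import numpy as np

# normalize each (source, target) pair so the smaller value always comes first
normalized = df.copy()
normalized[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)

# keep the first occurrence of each (pair, weight) combination, with the original values
df[~normalized.duplicated()]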

computing sum of pandas dataframes

I have two dataframes that I want to add bin-wise. That is, given
dfc1 = pd.DataFrame(list(zip(range(10),np.zeros(10))), columns=['bin', 'count'])
dfc2 = pd.DataFrame(list(zip(range(0,10,2), np.ones(5))), columns=['bin', 'count'])
which gives me this
dfc1:
bin count
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
dfc2:
bin count
0 0 1
1 2 1
2 4 1
3 6 1
4 8 1
I want to generate this:
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
where I've added the count columns where the bin columns matched.
In fact, it turns out that I only ever add 1 (that is, count in dfc2 is always 1). So an alternate version of the question is "given an array of bin values (dfc2.bin), how can I add one to each of their corresponding count values in dfc1?"
My only solution thus far feels grossly inefficient (and slightly unreadable in the end): doing an outer join on the two bin columns, which creates a third dataframe on which I do the computation and then drop the unneeded column.
Suggestions?
First set bin as the index in both dataframes, then you can use add; fill_value is needed so that zero is used when a bin is missing from one of the dataframes:
dfc1 = dfc1.set_index('bin')
dfc2 = dfc2.set_index('bin')
result = dfc1.add(dfc2, fill_value=0)
Pandas automatically sums up rows with equal index.
By the way, if you need to perform this operation frequently, I strongly recommend numpy.bincount, which even allows repeated bin values within one dataframe.
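A minimal sketch of the bincount idea, assuming (as in the example) that dfc1 still has its bin column running 0..N-1 in row order, i.e. before any set_index call:
import numpy as np

# count how many times each bin occurs in dfc2, padded out to dfc1's full range
extra = np.bincount(dfc2['bin'], minlength=len(dfc1))
dfc1['count'] = dfc1['count'] + extra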
Since the dfc1 index is the same as your "bin" value, you could simply do the following:
dfc1.loc[dfc2['bin'], 'cnt'] += 1
Notice that I renamed your "count" column to "cnt", since count is also a pandas method, which can cause confusion and errors with attribute-style access!
As an alternative to @Alleo's answer, you can use the method combineAdd to simply add the 2 dataframes together, with set_index('bin') at the same time so that their indexes are matched by bin (note that combineAdd has since been deprecated and removed in newer pandas versions; an add equivalent is shown after the output below):
dfc1.set_index('bin').combineAdd(dfc2.set_index('bin')).reset_index()
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
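If your pandas version no longer has combineAdd (it was deprecated and eventually removed), essentially the same one-liner can be written with add, as in the first answer:
dfc1.set_index('bin').add(dfc2.set_index('bin'), fill_value=0).reset_index()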
