I have the following dataframe:
frame=pd.DataFrame(columns=["a","b"], data=[(2,5),(2,6),(1,8),(1,1),(3,5),(3,2),(3,3)])
which looks like this:
a b
0 2 5
1 2 6
2 1 8
3 1 1
4 3 5
5 3 2
6 3 3
I want to compute a reverse cumulative sum of column "b": start from the last row and keep adding while column "a" holds the same value (3 in this example). The desired output is 10.
Based on your logic:
blocks = frame['a'].ne(frame['a'].shift()).cumsum()
frame.loc[blocks==blocks.iloc[-1], 'b'].sum()
# 10
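To see why this works, it can help to print the intermediate block labels. A minimal self-contained sketch using only the frame from the question (the names last_block and total are mine, added for illustration):

```python
import pandas as pd

frame = pd.DataFrame(columns=["a", "b"],
                     data=[(2, 5), (2, 6), (1, 8), (1, 1), (3, 5), (3, 2), (3, 3)])

# A new block starts wherever "a" differs from the previous row;
# cumsum() turns those change points into consecutive block labels.
blocks = frame["a"].ne(frame["a"].shift()).cumsum()
print(blocks.tolist())  # [1, 1, 2, 2, 3, 3, 3]

# Keep only the rows belonging to the last block and sum "b" over them.
last_block = blocks == blocks.iloc[-1]
total = frame.loc[last_block, "b"].sum()
print(total)  # 10
```

Because the labels come from changes between neighbouring rows, an earlier run of 3s elsewhere in the frame would get a different label and would not be included in the sum.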
Let's say I have the following data:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","First","First","First","Second","Second","Second","Second"],
'Payments':[1,2,3,4,9,3,1,6]})
I want to create a cumulative sum for payments, but it has to reset when flag turns from first to second. Any help?
The output that I'm looking for is the following:
Not sure if this is what you want, since you didn't provide an output, but try this:
df=pd.DataFrame({'Days':[1,2,3,4,1,2,3,4],
'Flag':["First","Second","First","Second","First","Second","Second","First"],
'Payments':[1,2,3,4,9,3,1,6]})
# make groups using consecutive Flags
groups = df.Flag.shift().ne(df.Flag).cumsum()
# groupby the groups and cumulatively sum payments
df['cumsum'] = df.groupby(groups).Payments.cumsum()
df
You can use df['Flag'].ne(df['Flag'].shift()).cumsum() to generate a grouper that will group by changes in the Flag column. Then, group by that, and cumsum:
df['cumsum'] = df['Payments'].groupby(df['Flag'].ne(df['Flag'].shift()).cumsum()).cumsum()
Output:
>>> df
Days Flag Payments cumsum
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
What is wrong with
df['Cumulative Payments'] = df.groupby('Flag')['Payments'].cumsum()
Days Flag Payments Cumulative Payments
0 1 First 1 1
1 2 First 2 3
2 3 First 3 6
3 4 First 4 10
4 1 Second 9 9
5 2 Second 3 12
6 3 Second 1 13
7 4 Second 6 19
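One possible catch with grouping on Flag directly: it matches the run-based result here only because each flag forms a single consecutive run. A small sketch with the alternating flags from the earlier answer shows where the two approaches diverge (the column names by_flag and by_run are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Days': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Flag': ["First", "Second", "First", "Second",
                            "First", "Second", "Second", "First"],
                   'Payments': [1, 2, 3, 4, 9, 3, 1, 6]})

# Groups by flag value: all "First" rows share one running total.
df['by_flag'] = df.groupby('Flag')['Payments'].cumsum()

# Groups by consecutive run: the total resets whenever Flag changes.
runs = df['Flag'].ne(df['Flag'].shift()).cumsum()
df['by_run'] = df.groupby(runs)['Payments'].cumsum()

print(df[['Flag', 'Payments', 'by_flag', 'by_run']])
```

So nothing is wrong with groupby('Flag') as long as each flag appears in one contiguous stretch; if a flag can reappear later and the sum should restart, the run-based grouper is the safer choice.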
I have the following DataFrame dt:
a
0 1
1 2
2 3
3 4
4 5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I hear a lot that we mustn't loop through rows in pandas; however, it seems to me that I should approach this by looping through each row and building a sort of recursive loop, as I would in regular Python.
You could use cumprod:
dt['b'] = dt['a'].cumprod()
Output:
a b
0 1 1
1 2 2
2 3 6
3 4 24
4 5 120
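Note that cumprod coincides with the requested values only at row 2 (1*2*3 = 6); the later rows come out as 24 and 120 rather than 8 and 10. If the goal is the formula as stated, B_row(t) = A_row(t-1)+A_row(t-2)+3, a sketch using shift, with no explicit loop, could look like this (rows without two predecessors become NaN, which stands in for the "/" in the question):

```python
import pandas as pd

dt = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# shift(1) is the previous row's value, shift(2) the one before that.
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3
print(dt)
```

This works because b depends only on earlier values of a. A formula in which b depends on earlier values of b itself would be a different situation and may not vectorise this simply.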
Group Code
1 2
1 2
1 4
1 1
2 4
2 1
2 2
2 3
2 1
2 1
2 3
Within each group there are pairs of consecutive rows. In Group 1, for example, the pairs are (2,2), (2,4) and (4,1).
I want to filter these pairs based on code number 2 OR 4 being present at the END of the pair. In Group 1, for example, only (2,2) and (2,4) will be kept, while (4,1) will be filtered out.
The code I am using to detect the code number at the beginning of a pair is
df[df.groupby("Group")['Code'].shift().isin([2,4])|df['Code'].isin([2,4])]
Expected output:
Group Code
1 2
1 2
1 4
2 1
2 2
Using your own suggested code, you can modify it to achieve your goal:
idx = df.groupby("Group")['Code'].shift(-1).isin([2,4])
df[idx | idx.shift()]
First you group by 'Group', shift one row up, and check for the values 2 or 4. This marks the first row of every pair whose end is 2 or 4 (i.e. idx). Finally, you keep both that first row and the end of the pair (i.e. idx.shift()).
output:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
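To make the two masks easier to follow, here is a self-contained sketch that rebuilds the data from the question and names each mask separately. The names starts and ends are mine, and fill_value=False is my addition: it keeps the shifted mask boolean instead of introducing a NaN in the first row.

```python
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'Code':  [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3]})

# True on the first row of every pair whose second row is 2 or 4.
# The last row of a group has no successor, so it is never marked.
starts = df.groupby('Group')['Code'].shift(-1).isin([2, 4])

# Shifting down by one marks the second row of those same pairs.
# Nothing leaks across groups because a group's last row is never a start.
ends = starts.shift(fill_value=False)

result = df[starts | ends]
print(result)
```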
Assuming the data is sorted by Group, you can also do it without groupby(), which avoids the grouping overhead:
m = df['Code'].isin([2,4]) & df['Group'].eq(df['Group'].shift())
df[m | m.shift(-1)]
Result:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
I want to find the values in one column for which no value in another column is greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of column 'a' for which the values of 'c' are greater than 3.
I think groupby is the correct way to do it. The code below comes close.
df.groupby('a')['c'].max()>3
a
1 True
2 False
3 True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1,3]
Is there a better, more efficient way to do this on a very large DataFrame (more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
And now take a look at each group.
Only group 2 meets the condition that each element in c column
is less than 3.
So actually you need a groupby and filter, passing only groups
meeting the above condition:
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only values from a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code: df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True groups for which max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
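One boundary case may be worth checking: "no value greater than 3" allows a value of exactly 3, whereas lt(3) and max() < 3 exclude it. With the modified data above this makes no difference, but with the DataFrame from the question (where group 2 holds only 3s) it does. A sketch comparing the two comparisons on the original data (the names group_max, strict and inclusive are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'b': [4, 5, 6, 4, 5, 6, 4, 5, 6],
                   'c': [4, 3, 5, 4, 3, 5, 4, 3, 3]})

# One aggregated value per group: 1 -> 4, 2 -> 3, 3 -> 5.
group_max = df.groupby('a')['c'].max()

strict = group_max[group_max < 3].index.tolist()      # every c below 3
inclusive = group_max[group_max <= 3].index.tolist()  # no c above 3

print(strict)     # []
print(inclusive)  # [2]
```

Comparing the per-group maximum also avoids calling a Python lambda once per group, which is likely to matter on a frame with tens of millions of rows, though that is an expectation rather than something measured here.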
import pandas as pd
# Create the DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
# Return the first value greater than 3, or None if the group has none
def grt(x):
    for i in x:
        if i > 3:
            return i
# Group by column a and apply grt to column c
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)
Hi, I will show what I'm trying to do through an example:
I start with a dataframe like this:
> pd.DataFrame({'A':['a','a','a','c'],'B':[1,1,2,3], 'count':[5,6,1,7]})
A B count
0 a 1 5
1 a 1 6
2 a 2 1
3 c 3 7
I need to find a way to get all the unique combinations of columns A and B and merge the rows that share a combination. The count values of the merged rows should be summed, so the result should look like this:
A B count
0 a 1 11
1 a 2 1
2 c 3 7
Thanks for any help.
Use groupby with aggregating sum:
print (df.groupby(['A','B'], as_index=False)['count'].sum())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
print (df.groupby(['A','B'])['count'].sum().reset_index())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
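The two variants differ only in where A and B end up before the final step. A self-contained sketch, rebuilding the DataFrame from the question (the names flat and indexed are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'c'],
                   'B': [1, 1, 2, 3],
                   'count': [5, 6, 1, 7]})

# as_index=False keeps A and B as ordinary columns in the result.
flat = df.groupby(['A', 'B'], as_index=False)['count'].sum()

# Without it, A and B become the levels of a MultiIndex on the result,
# which reset_index() then turns back into columns.
indexed = df.groupby(['A', 'B'])['count'].sum()

print(list(indexed.index.names))  # ['A', 'B']
print(flat)
```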