Sum of previous rows' values - python

How can I sum the previous rows' values and the current row's value into a new column?
My current output:
index,value
0,1
1,2
2,3
3,4
4,5
My goal output is:
index,value,sum
0,1,1
1,2,3
2,3,6
3,4,10
4,5,15
I know that this is easy to do with Excel, but I'm looking for a solution using pandas.
My code:
import pandas as pd

recordlist = [1, 2, 3, 4, 5]
df = pd.DataFrame(recordlist, columns=["value"])

Use cumsum:
df.assign(sum=df.value.cumsum())

       value  sum
index
0          1    1
1          2    3
2          3    6
3          4   10
4          5   15
Or
df['sum'] = df.value.cumsum()
df
       value  sum
index
0          1    1
1          2    3
2          3    6
3          4   10
4          5   15
If df is a Series:
pd.DataFrame(dict(value=df, sum=df.cumsum()))

As already used in the previous answers, df.assign is a great function.
If you want a little more flexibility here, you can use a lambda function, like so:
df.assign(sum=lambda d: d['value'].cumsum())
Just to do the summing, this can even be shortened to:
df.assign(sum=df['value'].cumsum())
Note that sum (before the = sign) is not a function or variable, but the name of the new column. So this could also be df.assign(mylongersumlabel=...)
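Putting the pieces together, a minimal runnable sketch (the column is named value so the snippets above apply unchanged):
import pandas as pd

recordlist = [1, 2, 3, 4, 5]
df = pd.DataFrame(recordlist, columns=["value"])

# assign returns a new DataFrame with the running total as an extra column
df = df.assign(sum=df["value"].cumsum())
print(df)
#    value  sum
# 0      1    1
# 1      2    3
# 2      3    6
# 3      4   10
# 4      5   15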

Related

Python Pandas: Passing previous, current and next row to a function

I have a dataframe and a function that I want to apply elementwise.
Of course I can iterate over the dataframe but this is very slow and I want to find a quicker way.
My dataframe:
A  B  C  D  pattern
4  4  5  6  0
6  4  1  2  0
5  2  2  1  0
5  6  7  9  0
My function takes three rows as an input and returns a value that I want to store in the pattern column of current_row.
def findPattern(previous_row, current_row, next_row):
    ...
    return "pattern"
How can I apply this function to my dataframe without iterating over it with a for loop?
Thanks for any help :)
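No answer was recorded here, but a common approach (a sketch, not from this thread) is to build shifted copies of the frame so every row can see its neighbours; if the pattern test can be phrased as column arithmetic, it vectorises without a Python-level loop. The rule below (A larger than both neighbours' A) is purely illustrative:
import pandas as pd

df = pd.DataFrame({'A': [4, 6, 5, 5], 'B': [4, 4, 2, 6],
                   'C': [5, 1, 2, 7], 'D': [6, 2, 1, 9]})

# Shifted copies: row i of prev_rows holds row i-1 of df, and so on.
prev_rows = df.shift(1)   # the first row's "previous" is all-NaN
next_rows = df.shift(-1)  # the last row's "next" is all-NaN

# Illustrative rule: flag rows whose A exceeds both neighbours' A
# (comparisons against NaN are False, so the edge rows get 0).
df['pattern'] = ((df['A'] > prev_rows['A']) & (df['A'] > next_rows['A'])).astype(int)
print(df)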

How to find the values of one column such that no value in another column is greater than 3

I want to find the values in one column such that no value in another column is greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which there are values of 'c' greater than 3.
I think groupby is the correct way to do it. My below code comes closer to it.
df.groupby('a')['c'].max() > 3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1, 3].
Is there a better, more efficient way to get this on a very large dataframe (more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in the c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
   a  b  c
0  1  4  4
1  2  5  1
2  3  6  5
3  1  4  4
4  2  5  2
5  3  6  5
6  1  4  4
7  2  5  2
8  3  6  3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
   a  b  c
0  1  4  4
3  1  4  4
6  1  4  4

Group: 2
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2

Group: 3
   a  b  c
2  3  6  5
5  3  6  5
8  3  6  3
Now take a look at each group. Only group 2 meets the condition that every element in the c column is less than 3.
So you actually need groupby and filter, passing only the groups that meet the above condition.
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2
But you want only the values from the a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True the groups for which the max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max() < 3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'b': [4, 5, 6, 4, 5, 6, 4, 5, 6],
                   'c': [4, 3, 5, 4, 3, 5, 4, 3, 3]})

# Return the first value in the group greater than 3, if one is found
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)
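For the sample data, this returns 4 for group 1 (the first c value above 3), None for group 2 (no c value above 3) and 5 for group 3. Note that it reports the offending values themselves rather than the group keys, so it answers a slightly different question than the filter-based solutions above.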

Efficient STAR selection in pandas

There is a type of selection called STAR, which is an acronym for "Score then Automatic Runoff". This is used in a number of algorithmic methods, but the typical example is voting. In pandas, it is used to select a single column under this metric. The standard "score" selection is to select the column of the dataframe with the highest sum. This can be accomplished simply by
df.sum().idxmax()
What is the most efficient pythonic way to do a STAR selection? The method works by first taking the two columns with the highest sums, then taking the winner as the column which has the higher value more often between those two. I can't seem to write this in a clean way.
Here's my take on it.
Sample df
Out[1378]:
   A  B  C  D
0  5  5  1  5
1  0  1  5  5
2  3  3  1  3
3  4  5  0  4
4  5  5  1  1
Step 1: Use sum, nlargest, and slice columns for the Score step
df_tops = df[df.sum().nlargest(2, keep='all').index]
Out[594]:
   B  D
0  5  5
1  1  5
2  3  3
3  5  4
4  5  1
Step 2: compare df_tops against the max of df_tops to create a boolean result. Finally, sum it and call idxmax on it
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().idxmax()
Out[608]: 'B'
Or you may use idxmax and mode for step 2. This returns a Series holding the top column name
finalist = df_tops.idxmax(1).mode()
Out[621]:
0    B
dtype: object
After you have the top column, just slice it out
df[finalist]
Out[623]:
   B
0  5
1  1
2  3
3  5
4  5
Note: in case the runner-up columns sum to the same number, step 2 picks only one column. If you want it to pick all runner-up columns with the same ranking/votes, you need to use nlargest and index instead of idxmax, and the output will be an array
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().nlargest(1, keep='all').index.values
Out[615]: array(['B'], dtype=object)
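For reuse, the two steps can be wrapped into one helper. A sketch consolidating the answer above (the name star_select is mine):
import pandas as pd

def star_select(df):
    # Score: keep the two columns with the largest sums (ties kept via keep='all').
    tops = df[df.sum().nlargest(2, keep='all').index]
    # Automatic Runoff: the finalist holding the row-wise maximum most often wins.
    return tops.eq(tops.max(axis=1), axis=0).sum().idxmax()

df = pd.DataFrame({'A': [5, 0, 3, 4, 5], 'B': [5, 1, 3, 5, 5],
                   'C': [1, 5, 1, 0, 1], 'D': [5, 5, 3, 4, 1]})
print(star_select(df))  # 'B'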

Getting maximum values in a column

My dataframe looks like this:
Country  Code  Duration
A        1     0
A        1     1
A        1     2
A        1     3
A        2     0
A        2     1
A        1     0
A        1     1
A        1     2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country  Code  Duration
A        1     3
A        2     1
A        1     2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating another group key by diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
  Country  Code  Duration
3       A     1         3
5       A     2         1
8       A     1         2
First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
  Country  Code  Duration
0       A     1         3
1       A     2         1
2       A     1         2
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))
Returns:
  Country  Code  series  Duration
0       A     1       1         3
1       A     1       3         2
2       A     2       2         1
You can then drop the series column.
You might want to check this link; it might be the answer you're looking for:
pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()

Select rows based on frequency of values in a column; one-liner or faster way?

I want to do a splitting task, but that requires a minimum number of samples per class, so I want to filter a DataFrame by a column that identifies class labels: if the frequency of a class is below some threshold, it should be filtered out.
>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6
>>> filter_on_col(df, col=2, threshold=6)  # Removes first row
   0  1  2
0  4  5  6
1  0  0  6
I can do something like df[2].value_counts() to get the frequency of each value in column 2, and then I can figure out which values meet my threshold simply by:
>>> df[2].value_counts() >= 2
6     True
3    False
and then the logic for figuring out the rest is pretty easy.
But I feel like there's an elegant pandas one-liner for this, or maybe a more efficient method.
My question is pretty similar to: Select rows from a DataFrame based on values in a column in pandas, but the tricky part is that I'm relying on value frequency rather than the values themselves.
So this is a one-liner:
# Assuming the parameters of your specific example posed above.
col=2; thresh=2
df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x: x].index)]
Out[303]:
   0  1  2
1  4  5  6
2  0  0  6
Or another one-liner:
df[df.groupby(col)[col].transform('count') >= thresh]
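For completeness, here is that one-liner wrapped as the filter_on_col helper the question asked for, a sketch assuming the intended semantics are "keep rows whose value in col occurs at least thresh times":
import pandas as pd

def filter_on_col(df, col, thresh):
    # Keep rows whose value in `col` appears at least `thresh` times.
    return df[df.groupby(col)[col].transform('count') >= thresh].reset_index(drop=True)

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
print(filter_on_col(df, col=2, thresh=2))
#    0  1  2
# 0  4  5  6
# 1  0  0  6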
