Subtracting min value from previous value in pandas DataFrame - python

I want to subtract the minimum value of a column in a DataFrame from the value just above it. In R I would do this:
df <- data.frame(a=1:5, b=c(5,6,7,4,9))
df
a b
1 1 5
2 2 6
3 3 7
4 4 4
5 5 9
df$b[which.min(df$b)-1] - df$b[which.min(df$b)]
[1] 3
How can I do the same thing in pandas? More generally, how can I extract the row number in a pandas DataFrame where a certain condition is met?

You can use argmin to find the index of the minimum value (the first one if there are ties), then do the subtraction based on that location:
index = df.b.argmin()
df.b[index-1] - df.b[index]
# 3
In case the index is not made up of consecutive integers, work positionally with iat:
i_index = df.b.values.argmin()
df.b.iat[i_index-1] - df.b.iat[i_index]
# 3
Or less efficiently:
-df.b.diff()[df.b.argmin()]
# 3.0
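Putting the positional variant together as a self-contained sketch (column names follow the question's example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [5, 6, 7, 4, 9]})

# Positional index of the minimum, independent of the index labels
i = df['b'].to_numpy().argmin()
result = df['b'].iat[i - 1] - df['b'].iat[i]
print(result)  # 3
```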


Pandas groupby cumulative sum ignore current row

I know there are some questions about this topic (like Pandas: Cumulative sum of one column based on value of another); however, none of them fulfils my requirements.
Let's say I have a dataframe like the one generated below. I want to compute the cumulative sum of Cost grouped by Month, without taking the current row's value into account, in order to get the Desired column. By using groupby and cumsum I obtain the column CumSum instead.
The DDL to generate the dataframe is
df = pd.DataFrame({'Month': [1, 1, 1, 2, 2, 1, 3],
                   'Cost': [5, 8, 10, 1, 3, 4, 1]})
IIUC you can use groupby.cumsum and then just subtract Cost:
df['cumsum_'] = df.groupby('Month').Cost.cumsum().sub(df.Cost)
print(df)
Month Cost cumsum_
0 1 5 0
1 1 8 5
2 1 10 13
3 2 1 0
4 2 3 1
5 1 4 23
6 3 1 0
You can also do the following, shifting the group-wise cumulative sum so that the current row is excluded:
df['Cumsum'] = df.groupby('Month')['Cost'].transform(lambda s: s.cumsum().shift(fill_value=0))
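For reference, a runnable sketch of the subtraction approach on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Month': [1, 1, 1, 2, 2, 1, 3],
                   'Cost': [5, 8, 10, 1, 3, 4, 1]})

# Group-wise cumulative sum, then remove the current row's contribution
df['cumsum_'] = df.groupby('Month')['Cost'].cumsum().sub(df['Cost'])
print(df['cumsum_'].tolist())  # [0, 5, 13, 0, 1, 23, 0]
```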

Efficient STAR selection in pandas

There is a type of selection called STAR, which is an acronym for "Score Then Automatic Runoff". It is used in a number of algorithmic methods, but the typical example is voting. In pandas, the goal is to select a single column under this metric. The standard "score" selection picks the column of the dataframe with the highest sum, which can simply be accomplished by
df.sum().idxmax()
What is the most efficient pythonic way to do a STAR selection? The method works by first taking the two columns with the highest sums, then taking the winner as the column which has the higher value more often between those two. I can't seem to write this in a clean way.
Here's my take on it.
Sample df
Out[1378]:
A B C D
0 5 5 1 5
1 0 1 5 5
2 3 3 1 3
3 4 5 0 4
4 5 5 1 1
Step 1: use sum, nlargest, and column slicing for the Score step
df_tops = df[df.sum().nlargest(2, keep='all').index]
Out[594]:
B D
0 5 5
1 1 5
2 3 3
3 5 4
4 5 1
Step 2: compare df_tops against the row-wise max of df_tops to create a boolean result; finally, sum it and call idxmax on it
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().idxmax()
Out[608]: 'B'
Or you may use idxmax and mode for step 2. This returns a Series holding the top column name:
finalist = df_tops.idxmax(1).mode()
Out[621]:
0 B
dtype: object
After you have the top column, just slice it out
df[finalist]
Out[623]:
B
0 5
1 1
2 3
3 5
4 5
Note: in case the runner-up columns sum to the same number, step 2 picks only one column. If you want to pick every equally ranked runner-up column, use nlargest and index instead of idxmax; the output will then be an array:
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().nlargest(1, keep='all').index.values
Out[615]: array(['B'], dtype=object)
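The two steps can be wrapped into a small helper (the function name star_select is my own, not from the question):

```python
import pandas as pd

def star_select(df):
    """Score step: keep the two columns with the highest sums.
    Runoff step: return the one that holds the row-wise max more often."""
    tops = df[df.sum().nlargest(2, keep='all').index]
    return tops.eq(tops.max(axis=1), axis=0).sum().idxmax()

df = pd.DataFrame({'A': [5, 0, 3, 4, 5],
                   'B': [5, 1, 3, 5, 5],
                   'C': [1, 5, 1, 0, 1],
                   'D': [5, 5, 3, 4, 1]})
print(star_select(df))  # B
```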

Python Pandas: Rolling backward function

I have a dataframe with two columns, audit_value and rolling_sum_3, where rolling_sum_3 holds the rolling sum of the last 3 audit values (Fixed_audit shows the desired result). The dataframe is shown below:
df1
audit_value rolling_sum_3 Fixed_audit
0 4 NA 3
1 5 NA 3
2 3 12 3
3 1 9 1
4 2 6 2
5 1 4 1
6 4 7 3
Now I want to apply a condition on the rolling_sum_3 column: whenever its value is greater than 5, look at the last 3 values of audit_value (the rows that make up that sum) and find the ones that are greater than 3. If any of those 3 audit_value entries is greater than 3, replace it with 3 and place it in a new column (called fixed_audit); otherwise retain the old audit_value in the new column. I couldn't find any built-in pandas function that performs this backward-looking fix. Could anyone suggest an easy and efficient way of doing it?
df1['fixed_audit'] = df1['audit_value']
for i in range(2, len(df1)):
    if df1.loc[i, 'rolling_sum_3'] > 5:
        # Cap the three rows that make up this window (i-2 .. i)
        for j in range(i - 2, i + 1):
            if df1.loc[j, 'audit_value'] > 3:
                df1.loc[j, 'fixed_audit'] = 3
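If you prefer to avoid the explicit loop, a vectorized sketch is possible, assuming the rule is "cap at 3 every audit_value above 3 that falls inside any 3-row window whose sum exceeds 5" (the reverse-rolling trick below is my own formulation, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'audit_value': [4, 5, 3, 1, 2, 1, 4]})
df1['rolling_sum_3'] = df1['audit_value'].rolling(3).sum()

# Rows whose 3-row window (i-2 .. i) sums to more than 5
over = df1['rolling_sum_3'] > 5
# Propagate each hit backwards over its window: reverse the series,
# take a rolling max, and reverse back
hit = over.astype(int)[::-1].rolling(3, min_periods=1).max()[::-1].astype(bool)

# Cap audit values above 3 inside any qualifying window
df1['fixed_audit'] = df1['audit_value'].mask(hit & (df1['audit_value'] > 3), 3)
print(df1['fixed_audit'].tolist())  # [3, 3, 3, 1, 2, 1, 3]
```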

Sum of previous rows values

How can I sum the previous rows' values together with the current row's value into a new column?
My current output:
index,value
0,1
1,2
2,3
3,4
4,5
My goal output is:
index,value,sum
0,1,1
1,2,3
2,3,6
3,4,10
4,5,15
I know that this is easy to do with Excel, but I'm looking for a solution using pandas.
My code:
import pandas
recordlist = [1, 2, 3, 4, 5]
df = pandas.DataFrame(recordlist, columns=["value"])
Use cumsum:
df.assign(sum=df.value.cumsum())
value sum
index
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
Or
df['sum'] = df.value.cumsum()
df
value sum
index
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
If df is a Series:
pd.DataFrame(dict(value=df, sum=df.cumsum()))
As already used in the previous posts, df.assign is a great function.
If you want a little bit more flexibility here, you can pass a lambda that receives the DataFrame, like so:
df.assign(sum=lambda d: d['value'].cumsum())
Just to do the summing, this can be shortened to
df.assign(sum=df['value'].cumsum())
Note that sum (before the = sign) is not the built-in function or a variable, but the name of the new column. So this could also be df.assign(mylongersumlabel=...).
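A runnable sketch of the assign approach (using the lowercase column name value that the answers assume):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})

# assign returns a new DataFrame; the keyword becomes the column name
out = df.assign(sum=lambda d: d['value'].cumsum())
print(out['sum'].tolist())  # [1, 3, 6, 10, 15]
```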

Pandas checking if values in multiple column exists in other columns

I am trying to check whether the values in each row of Dataframe "Actual" exist in the same row of Dataframe "Estimate". Column position is not important; the value just needs to exist at the same row level in the other dataframe. The Dataframes can be concatenated/merged if need be.
My code is below:
Actual=pd.DataFrame([[4,7,2,8,1],[1,5,7,9,8]], columns=['Actual1','Actual2','Actual3','Actual4','Actual5'])
estimate=pd.DataFrame([[1,2,7,9,3],[0,8,2,5,9]], columns=['estimate1','estimate2','estimate3','estimate4','estimate5'])
Actual
Actual1 Actual2 Actual3 Actual4 Actual5
0 4 7 2 8 1
1 1 5 7 9 8
estimate
estimate1 estimate2 estimate3 estimate4 estimate5
0 1 2 7 9 3
1 0 8 2 5 9
My attempt using pandas:
for loop1 in range(1, 6):
    for loop2 in range(1, 6):
        Actual['want' + str(loop1)] = np.where(Actual['Actual' + str(loop1)] == estimate['estimate' + str(loop2)], 1, 0)
and finally, the output that I would like:
want=pd.DataFrame([[0,1,1,0,1],[0,1,0,1,1]], columns=['want1','want2','want3','want4','want5'])
want
want1 want2 want3 want4 want5
0 0 1 1 0 1
1 0 1 0 1 1
So, as I was mentioning earlier: since the value 4 in the first row of Dataframe "Actual" does not exist anywhere in the first row of dataframe "estimate", column "want1" is assigned the value 0. Conversely, the first row of "Actual" holds value 1 in its fifth column; since this value exists in the first row of "estimate" (column location does not matter), column 'want5' is assigned the value 1.
Thanks.
Assuming that the indices in your Actual and estimate DataFrames are the same, one approach would be to just apply a check along the columns with isin.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Actual1 Actual2 Actual3 Actual4 Actual5
0 0 1 1 0 1
1 0 1 0 1 1
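A self-contained version of the isin approach on the question's data:

```python
import pandas as pd

Actual = pd.DataFrame([[4, 7, 2, 8, 1], [1, 5, 7, 9, 8]],
                      columns=['Actual1', 'Actual2', 'Actual3', 'Actual4', 'Actual5'])
estimate = pd.DataFrame([[1, 2, 7, 9, 3], [0, 8, 2, 5, 9]],
                        columns=['estimate1', 'estimate2', 'estimate3', 'estimate4', 'estimate5'])

# For each row of Actual, test membership in the same row of estimate
want = Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype(int)
print(want.values.tolist())  # [[0, 1, 1, 0, 1], [0, 1, 0, 1, 1]]
```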
