I have a dataframe which has two columns (i.e. audit_value and rolling_sum). Rolling_sum_3 column contains the rolling sum of last 3 audit values. Dataframe is shown below:
df1
audit_value rolling_sum_3 Fixed_audit
0 4 NA 3
1 5 NA 3
2 3 12 3
3 1 9 1
4 2 6 2
5 1 4 1
6 4 7 3
Now I want to apply condition on rolling_sum_3 column and find if the value is greater than 5, if yes, then look at the last 3 values of audit_value and find the values which are greater than 3. If the any value among the last 3 values of audit_value is greater than 3 then replace those value with 3 and place in a new column (called fixed_audit), otherwise retain the old value of audit_value in new column. I couldn't find any builtin function in pandas that perform rolling back functionality. Could anyone suggest easy and efficient way of performing rolling back functionality on certain column?
df1['fixed_audit'] = df1['audit_value']
for i in range(3, len(df1)):
if(df1.iloc[i].rolling_sum_3 > 5):
df1.loc[i-1,'fixed_audit'] = 3 if df1.loc[i-1,'audit_value'] > 3 else df1.loc[i-1,'audit_value']
df1.loc[i-2,'fixed_audit'] = 3 if df1.loc[i-2,'audit_value'] > 3 else df1.loc[i-2,'audit_value']
df1.loc[i-3,'fixed_audit'] = 3 if df1.loc[i-3,'audit_value'] > 3 else df1.loc[i-3,'audit_value']
Related
I want to find the values corresponding to a column such that no values in another column takes value greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which all the values of 'c' which are greater than 3.
I think groupby is the correct way to do it. My below code comes closer to it.
df.groupby('a')['c'].max()>3
a
1 True
2 False
3 True
4 False
Name: c, dtype: bool
The above code gives me a boolean frame. How can I get the values of 'a' such that it is true.
I want my output to be [1,3]
Is there a better and efficient way to get this on a very large data frame (with more than 30 million rows).
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
And now take a look at each group.
Only group 2 meets the condition that each element in c column
is less than 3.
So actually you need a groupby and filter, passing only groups
meeting the above condition:
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only values from a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code: df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True groups for which max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
import pandas as pd
#Create DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
#Write a function to find values greater than 3 if found return.
def grt(x):
for i in x:
if i>3:
return(i)
#Groupby column a and call function grt
p = {'c':grt}
grp = df.groupby(['a']).agg(p)
print(grp)
There is a type of selection called STAR which is an acronym for "Score then Automatic Runoff". This is used in a number of algorithmic methods but the typical example is voting. In pandas, this is use to select a single column under this metric. The standard "score" selection is to select the column of the dataframe with the highest sum. This can simply accomplished by
df.sum().idxmax()
What is the most efficient pythonic way to do a STAR selection? The method works but first taking the two columns with the highest sum then taking the winner as the column which has the higher value more often between those two. I can't seem to write this in a clean way.
Here my take on it
Sample df
Out[1378]:
A B C D
0 5 5 1 5
1 0 1 5 5
2 3 3 1 3
3 4 5 0 4
4 5 5 1 1
Step 1: Use sum, nlargest, and slice columns for Score step
df_tops = df[df.sum().nlargest(2, keep='all').index]
Out[594]:
B D
0 5 5
1 1 5
2 3 3
3 5 4
4 5 1
Step 2: compare df_tops agains max of df_tops to create boolean result. finally, sum and call idxmax on it
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().idxmax()
Out[608]: 'B'
Or you may use idxmax and mode for step 2. This returns a series of top column name
finalist = df_tops.idxmax(1).mode()
Out[621]:
0 B
dtype: object
After you have the top column, just slice it out
df[finalist]
Out[623]:
B
0 5
1 1
2 3
3 5
4 5
Note: in case runner-up columns are summing to the same number, step 2 picks only one column. If you want it to pick both same ranking/votes runner-up columns, you need use nlargest and index instead of idxmax and the output will be array
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().nlargest(1, keep='all').index.values
Out[615]: array(['B'], dtype=object)
I want to subtract the minimum value of a column in a DataFrame from the value just above it. In R I would do this:
df <- data.frame(a=1:5, b=c(5,6,7,4,9))
df
a b
1 1 5
2 2 6
3 3 7
4 4 4
5 5 9
df$b[which.min(df$b)-1] - df$b[which.min(df$b)]
[1] 3
How can I do the same thing in pandas? More generally, how can I extract the row number in a pandas DataFrame where a certain condition is met?
You can use argmin to find out the index of the minimum value(the first one if there are ties), then you can do the subtraction based on the location:
index = df.b.argmin()
df.b[index-1] - df.b[index]
# 3
In case the index is not consecutive numbers:
i_index = df.b.values.argmin()
df.b.iat[i_index-1] - df.b.iat[i_index]
# 3
Or less efficiently:
-df.b.diff()[df.b.argmin()]
# 3.0
I am trying to check if values for each row noted in Dataframe "Actual" match values under the same row in Dataframe "Estimate". Column position is not important. The value just needs to exist on the same row level between the different dataframes. The Dataframes can be concat/ merged, if need be.
I present below my code::
Actual=pd.DataFrame([[4,7,2,8,1],[1,5,7,9,8]], columns=['Actual1','Actual2','Actual3','Actual4','Actual5'])
estimate=pd.DataFrame([[1,2,7,9,3],[0,8,2,5,9]], columns=['estimate1','estimate2','estimate3','estimate4','estimate5'])
Actual
Actual1 Actual2 Actual3 Actual4 Actual5
0 4 7 2 8 1
1 1 5 7 9 8
estimate
estimate1 estimate2 estimate3 estimate4 estimate5
0 1 2 7 9 3
1 0 8 2 5 9
My attempt using Pandas::
for loop1 in range(1,6,1):
for loop2 in range(1,6,1):
Actual['want'+str(loop1)]=np.where(Actual['Actual'+ str(loop1)] == estimate['estimate' + str(loop2)],1,0)
and finally, my output that I would like::
want=pd.DataFrame([[0,1,1,0,1],[0,1,0,1,1]], columns=['want1','want2','want3','want4','want5'])
want
want1 want2 want3 want4 want5
0 0 1 1 0 1
1 0 1 0 1 1
So, as I was mentioning earlier, since from Dataframe "Actual" value 4 does not exist on the whole first row of dataframe "estimate", column "want1" has been assigned value 0. Once again, considering the first row of Dataframe "Actual" column 5 where value=1, since this value exists in the same first row of dataframe "estimate" (column location does not matter) column 'want5' has been assigned value 1.
Thanks.
Assuming that the indices in your Actual and estimate DataFrames are the same, one approach would be to just apply a check along the columns with isin.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Actual1 Actual2 Actual3 Actual4 Actual5
0 0 1 1 0 1
1 0 1 0 1 1
So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and for each row assign the sum of it's unique occurences, so the table would look like:
values unique_sum
1 3
1 3
1 3
2 2
2 2
I have seen some examples in R, but for python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts() and I have tried groupbyroutines but cannot fathom it.
Just use map to map your column onto its value_counts:
>>> x
A
0 1
1 1
2 1
3 2
4 2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
A unique
0 1 3
1 1 3
2 1 3
3 2 2
4 2 2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values --- you'd have to use x['values'] instead.)