Efficient STAR selection in pandas - python

There is a type of selection called STAR, which is an acronym for "Score Then Automatic Runoff". This is used in a number of algorithmic methods, but the typical example is voting. In pandas, I use it to select a single column under this metric. The standard "score" selection is to select the column of the dataframe with the highest sum. This can be simply accomplished by
df.sum().idxmax()
What is the most efficient, pythonic way to do a STAR selection? The method works by first taking the two columns with the highest sums, then taking the winner as the column which has the higher value more often between those two. I can't seem to write this in a clean way.

Here's my take on it.
Sample df:
Out[1378]:
   A  B  C  D
0  5  5  1  5
1  0  1  5  5
2  3  3  1  3
3  4  5  0  4
4  5  5  1  1
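To reproduce the sample frame above:
import pandas as pd

df = pd.DataFrame({'A': [5, 0, 3, 4, 5],
                   'B': [5, 1, 3, 5, 5],
                   'C': [1, 5, 1, 0, 1],
                   'D': [5, 5, 3, 4, 1]})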
Step 1: Use sum, nlargest, and slice columns for Score step
df_tops = df[df.sum().nlargest(2, keep='all').index]
Out[594]:
   B  D
0  5  5
1  1  5
2  3  3
3  5  4
4  5  1
Step 2: compare df_tops against the row-wise max of df_tops to create a boolean result. Finally, sum it and call idxmax on it
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().idxmax()
Out[608]: 'B'
Or you may use idxmax and mode for step 2. This returns a Series holding the top column name
finalist = df_tops.idxmax(1).mode()
Out[621]:
0 B
dtype: object
After you have the top column, just slice it out
df[finalist]
Out[623]:
   B
0  5
1  1
2  3
3  5
4  5
Note: in case the finalist columns tie in the runoff (the same number of row-wise wins), step 2 picks only one column. If you want it to pick all tied columns, use nlargest and index instead of idxmax, and the output will be an array
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().nlargest(1, keep='all').index.values
Out[615]: array(['B'], dtype=object)
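For reuse, the two steps fold into one helper. A minimal sketch (star_select is my own name, assuming an all-numeric frame):

def star_select(df):
    # Score round: keep the two columns with the highest sums (ties kept).
    tops = df[df.sum().nlargest(2, keep='all').index]
    # Runoff round: the winner is the column holding the row-wise max most often.
    return tops.eq(tops.max(axis=1), axis=0).sum().idxmax()

star_select(df)  # 'B' for the sample frame above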

Related

How to find the values of a column such that no value in another column is greater than 3

I want to find the values of a column such that no value in another column is greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My below code comes closer to it.
df.groupby('a')['c'].max()>3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1,3]
Is there a better and more efficient way to get this on a very large data frame (with more than 30 million rows)?
From your question I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
   a  b  c
0  1  4  4
1  2  5  1
2  3  6  5
3  1  4  4
4  2  5  2
5  3  6  5
6  1  4  4
7  2  5  2
8  3  6  3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
   a  b  c
0  1  4  4
3  1  4  4
6  1  4  4

Group: 2
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2

Group: 3
   a  b  c
2  3  6  5
5  3  6  5
8  3  6  3
And now take a look at each group.
Only group 2 meets the condition that each element in the c column is less than 3.
So actually you need groupby and filter, passing through only the groups meeting the above condition.
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2
But you want only values from the a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code df.groupby('a')['c'].max() > 3 is wrong, as it marks with True the groups whose max is greater than 3 (instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
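Given the 30+ million row concern, a fully vectorized variant (my own sketch, not taken from the answers above) avoids calling a Python lambda per group:

# Rows whose group's max of c is below 3, then the unique keys of those rows.
mask = df.groupby('a')['c'].transform('max').lt(3)
keys = df.loc[mask, 'a'].unique().tolist()  # [2] for the modified df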
import pandas as pd
#Create DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
#Write a function that returns the first value greater than 3, if found
def grt(x):
    for i in x:
        if i > 3:
            return i
#Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)

Python Pandas: Rolling backward function

I have a dataframe which has two columns (i.e. audit_value and rolling_sum_3). The rolling_sum_3 column contains the rolling sum of the last 3 audit values (the Fixed_audit column below shows the desired result). The dataframe is shown below:
df1
   audit_value  rolling_sum_3  Fixed_audit
0            4             NA            3
1            5             NA            3
2            3             12            3
3            1              9            1
4            2              6            2
5            1              4            1
6            4              7            3
Now I want to apply a condition on the rolling_sum_3 column: find if the value is greater than 5; if yes, then look at the last 3 values of audit_value (the ones covered by that window) and find the values which are greater than 3. If any value among those 3 audit values is greater than 3, replace it with 3 and place it in a new column (called fixed_audit); otherwise retain the old audit_value in the new column. I couldn't find any builtin function in pandas that performs this rolling-back functionality. Could anyone suggest an easy and efficient way of performing rolling back on a certain column?
df1['fixed_audit'] = df1['audit_value']
for i in range(2, len(df1)):
    if df1.loc[i, 'rolling_sum_3'] > 5:
        # The window ending at row i covers rows i, i-1 and i-2.
        for j in (i, i - 1, i - 2):
            if df1.loc[j, 'audit_value'] > 3:
                df1.loc[j, 'fixed_audit'] = 3
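A loop-free sketch of the same idea (my own; it assumes rolling_sum_3 came from rolling(3).sum(), i.e. each window ends at its own row):

hot = df1['rolling_sum_3'].gt(5)
# Row i sits in the windows ending at rows i, i+1 and i+2,
# so it needs fixing if any of those windows is "hot".
in_hot_window = hot | hot.shift(-1, fill_value=False) | hot.shift(-2, fill_value=False)
df1['fixed_audit'] = df1['audit_value'].mask(in_hot_window & df1['audit_value'].gt(3), 3)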

Select rows based on frequency of values in a column; one-liner or faster way?

I want to do a splitting task that requires a minimum number of samples per class, so I want to filter a DataFrame by a column that identifies class labels: if the frequency of occurrence of a class is below some threshold, I want to filter it out.
>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
2 0 0 6
>>> filter_on_col(df, col=2, threshold=2) # Removes first row
0 1 2
0 4 5 6
1 0 0 6
I can do something like df[2].value_counts() to get the frequency of each value in column 2, and then I can figure out which values meet my threshold simply by:
>>> df[2].value_counts() >= 2
3    False
6     True
and then the logic for figuring out the rest is pretty easy.
But I feel like there's an elegant Pandas one-liner here, or maybe a more efficient method.
My question is pretty similar to: Select rows from a DataFrame based on values in a column in pandas, but the tricky part is that I'm relying on value frequency rather than the values themselves.
So this is a one-liner:
# Assuming the parameters of your specific example posed above.
col=2; thresh=2
df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x: x].index)]
Out[303]:
0 1 2
1 4 5 6
2 0 0 6
Or another one-liner:
df[df.groupby(col)[col].transform('count') >= thresh]
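Wrapped up as the filter_on_col helper the question hypothesizes (a sketch; reset_index mirrors the renumbered output shown in the question):

def filter_on_col(df, col, threshold):
    # Keep rows whose value in `col` occurs at least `threshold` times.
    counts = df.groupby(col)[col].transform('count')
    return df[counts >= threshold].reset_index(drop=True)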

Sum of previous rows values

How can I sum the previous rows' values and the current row value into a new column?
My current output:
index,value
0,1
1,2
2,3
3,4
4,5
My goal output is:
index,value,sum
0,1,1
1,2,3
2,3,6
3,4,10
4,5,15
I know that this is easy to do with Excel, but I'm looking solution to do with pandas.
My code:
import pandas
recordlist = [1, 2, 3, 4, 5]
df = pandas.DataFrame(recordlist, columns=["value"])
Use cumsum:
df.assign(sum=df.value.cumsum())
   value  sum
0      1    1
1      2    3
2      3    6
3      4   10
4      5   15
Or
df['sum'] = df.value.cumsum()
df
   value  sum
0      1    1
1      2    3
2      3    6
3      4   10
4      5   15
If df is a Series:
pd.DataFrame(dict(value=df, sum=df.cumsum()))
As already used in the previous posts, df.assign is a great function.
If you want a little more flexibility here, you can use a lambda function, like so
df.assign(sum=lambda l: l['value'].cumsum())
Just to do the summing, this could even be shortened to
df.assign(sum=df['value'].cumsum())
Note that sum (before the = sign) is not a function or variable, but the name for the new column. So this could also be df.assign(mylongersumlabel=..)
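An equivalent, if slower, spelling of the same running total uses expanding (a sketch; astype(int) only restores integer dtype, since expanding sums come back as floats):

df['sum'] = df['value'].expanding().sum().astype(int)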

Select last observation per group

Someone asked how to select the first observation per group in a pandas df. I am interested in both first and last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify their example to show what I am looking for.
Basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than a list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df = df.groupby('group_id').tail(1)
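If you need the 0/1 indicator rather than the rows themselves, the tail result converts to a mask (a sketch):

last_rows = df.groupby('group_id').tail(1)
df['indicator'] = df.index.isin(last_rows.index).astype(int)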
You can groupby 'group_id' and call nth(-1) to get the last entry for each group, then use this to mask the df, setting 'indicator' to 1 for those rows and filling the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) equals the "size of the group - 1", which we calculate using transform so it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() column, but this can literally be any other column and it won't be affected, since it is only used to calculate the new indicator. Use .astype(int) to make it binary, and done.
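A variant of the same idea (my suggestion, not from the answer above): cumcount(ascending=False) numbers rows from the end of each group, so the last row of each group gets 0 and no second column is needed:

df['indicator'] = df.groupby('group_id').cumcount(ascending=False).eq(0).astype(int)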
