Pandas - Rolling and GroupBy apply based on condition

Input data:
no  Group  Value
1   A      5
2   B      10
3   A      7
4   B      20
5   A      8
6   B      30
7   A      NaN
8   B      NaN
9   A      90
10  B      105
How can I apply a custom Python function (let's call it "custom_fnc") only to the rows whose "Value" field is NaN, where that function receives a Series of the last two values within the group?
For example, I would like to calculate "Value" only for the 7th and 8th rows (for performance reasons I also don't want to calculate it for the whole dataset), so the function would work with this data:
For group A, it would need only the 3rd and 5th rows' "Value"
For group B, it would need only the 4th and 6th rows' "Value"
I was wondering how I can use groupby and rolling, but only for the filtered rows that have NaN as the Value?
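One possible approach, sketched under the assumption that a group's NaN rows never depend on each other's freshly computed values (true for the sample data): align each row with its group's previous two values via groupby().shift(), then call custom_fnc only on the NaN rows. The body of custom_fnc below is a hypothetical placeholder.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'no': range(1, 11),
    'Group': list('ABABABABAB'),
    'Value': [5, 10, 7, 20, 8, 30, np.nan, np.nan, 90, 105],
})

def custom_fnc(s):
    # hypothetical placeholder; the real logic is up to you
    return s.mean()

# Previous one and two values within each group, aligned to every row.
prev1 = df.groupby('Group')['Value'].shift(1)
prev2 = df.groupby('Group')['Value'].shift(2)

# Evaluate custom_fnc only on the NaN rows, each receiving a 2-value Series.
mask = df['Value'].isna()
df.loc[mask, 'Value'] = [
    custom_fnc(pd.Series([two, one]))
    for two, one in zip(prev2[mask], prev1[mask])
]
For row 7 (group A) this hands custom_fnc the values from rows 3 and 5, and for row 8 (group B) the values from rows 4 and 6, matching the example above.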

Related

How to average a DataFrame row with another row only if the first row's USER is a substring of the next row's USER

I have a dataframe called 'data':
USER VALUE
XOXO 21
ABC-1 2
ABC-1B 4
ABC-2 4
ABC-2B 6
PEPE 12
I want to combine 'ABC-1' with 'ABC-1B' into a single row using the first USER name and then averaging the two values to arrive here:
USER VALUE
XOXO 21
ABC-1 3
ABC-2 5
PEPE 12
The dataframe may not be in order, and there are other unrelated values in there that don't need averaging. I only want to average the two rows where the 'XXX-X' name is contained in the 'XXX-XB' name.
data = pd.DataFrame({'USER':['XOXO','ABC-1','ABC-1B','ABC-2','ABC-2B', 'PEPE'], 'VALUE':[21,2,4,4,6,12]})
Let's try stripping the "B" from the suffix with a regex, then grouping:
df = data.copy()
df.USER = df.USER.str.replace(r'(-\d)B', r'\1', regex=True)
df = df.groupby("USER", as_index=False, sort=False).VALUE.mean()
print(df)
USER VALUE
0 XOXO 21
1 ABC-1 3
2 ABC-2 5
3 PEPE 12

Efficient STAR selection in pandas

There is a type of selection called STAR, an acronym for "Score Then Automatic Runoff". It is used in a number of algorithmic methods, but the typical example is voting. In pandas terms, the goal is to select a single column under this metric. The standard "score" selection picks the column of the dataframe with the highest sum, which can be accomplished simply by
df.sum().idxmax()
What is the most efficient pythonic way to do a STAR selection? The method works by first taking the two columns with the highest sums, then taking the winner as the column which has the higher value more often between those two. I can't seem to write this in a clean way.
Here is my take on it.
Sample df
Out[1378]:
A B C D
0 5 5 1 5
1 0 1 5 5
2 3 3 1 3
3 4 5 0 4
4 5 5 1 1
Step 1: use sum, nlargest, and column slicing for the Score step
df_tops = df[df.sum().nlargest(2, keep='all').index]
Out[594]:
B D
0 5 5
1 1 5
2 3 3
3 5 4
4 5 1
Step 2: compare df_tops against the row-wise max of df_tops to create a boolean result; finally, sum it and call idxmax on it
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().idxmax()
Out[608]: 'B'
Or you may use idxmax and mode for step 2. This returns a Series containing the top column's name
finalist = df_tops.idxmax(1).mode()
Out[621]:
0 B
dtype: object
After you have the top column, just slice it out
df[finalist]
Out[623]:
B
0 5
1 1
2 3
3 5
4 5
Note: in case the runner-up columns sum to the same number, step 2 picks only one column. If you want it to pick all runner-up columns with the same ranking/votes, you need to use nlargest and index instead of idxmax, and the output will be an array
finalist = df_tops.eq(df_tops.max(1), axis=0).sum().nlargest(1, keep='all').index.values
Out[615]: array(['B'], dtype=object)
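Condensing the two steps into one helper (a sketch reusing the exact calls from the walkthrough above):
def star_winner(df):
    # Score: keep the two columns with the highest sums (ties kept).
    tops = df[df.sum().nlargest(2, keep='all').index]
    # Automatic Runoff: pick the finalist that holds the row-wise
    # maximum most often.
    return tops.eq(tops.max(axis=1), axis=0).sum().idxmax()

star_winner(df)   # 'B' for the sample frame above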

Python Pandas: Rolling backward function

I have a dataframe with an audit_value column and a rolling_sum_3 column, where rolling_sum_3 contains the rolling sum of the last 3 audit values; Fixed_audit is the desired result. The dataframe is shown below:
df1
audit_value rolling_sum_3 Fixed_audit
0 4 NA 3
1 5 NA 3
2 3 12 3
3 1 9 1
4 2 6 2
5 1 4 1
6 4 7 3
Now I want to apply a condition on the rolling_sum_3 column: where the value is greater than 5, look at the last 3 values of audit_value and find those greater than 3. If any of the last 3 audit_value entries is greater than 3, replace it with 3 in a new column (called fixed_audit); otherwise retain the old audit_value in the new column. I couldn't find any builtin pandas function that performs this backward-rolling behaviour. Could anyone suggest an easy and efficient way of applying it to a column?
df1['fixed_audit'] = df1['audit_value']
for i in range(2, len(df1)):
    # The 3-value window ending at row i covers rows i-2, i-1 and i.
    if df1.loc[i, 'rolling_sum_3'] > 5:
        for j in range(i - 2, i + 1):
            if df1.loc[j, 'audit_value'] > 3:
                df1.loc[j, 'fixed_audit'] = 3
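A vectorized alternative (a sketch building on the same logic, not from the original answer): a row needs capping exactly when at least one of the 3-row windows containing it sums to more than 5, and a rolling max over the reversed flag column propagates each qualifying window's flag back to every row it covers.
# Flag windows whose 3-value sum exceeds 5 (NaN sums count as False).
flag = df1['rolling_sum_3'].gt(5).astype(int)

# The windows containing row j end at rows j, j+1 and j+2, so reverse,
# take a 3-wide rolling max, and reverse back.
need_cap = flag[::-1].rolling(3, min_periods=1).max()[::-1].astype(bool)

df1['fixed_audit'] = df1['audit_value'].mask(need_cap & df1['audit_value'].gt(3), 3)
On the sample data this reproduces the Fixed_audit column shown above.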

Reduce the dataframe rows and lookup

Need help with the following please.
Suppose we have a dataframe:
import numpy as np
import pandas as pd

dictionary = {'Category': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
              'val1': [11,13,14,17,18,21,22,25,2,8,9,13,15,16,19],
              'val2': [1,0,5,1,4,3,5,9,4,1,5,2,4,0,3]}
df = pd.DataFrame(dictionary)
'val1' is always increasing within the same 'Category' value, i.e. the first and last rows of a category hold that category's min and max. There are too many rows per category, and I want to make a new dataframe that includes the min and max values of each category and contains, e.g., 5 equally spaced rows (including min and max) from each category.
I think numpy's linspace should be used to create an array of values for each category (e.g. linspace(min, max, 5)), then something similar to Excel's 'lookup' function should be used to get the closest values of 'val1' from df.
Or maybe there are some other better ways...
Many thanks for the help.
Is this what you need? With groupby and reindex:
l = []
for _, x in df.groupby('Category'):
    x.index = x['val1']
    # snap the 5 equally spaced targets onto the nearest existing val1
    y = x.reindex(np.linspace(x['val1'].min(), x['val1'].max(), 5), method='nearest')
    l.append(y)
pd.concat(l)
Out[330]:
Category val1 val2
val1
11.00 a 11 1
14.50 a 14 5
18.00 a 18 4
21.50 a 22 5
25.00 a 25 9
2.00 b 2 4
6.25 b 8 1
10.50 b 9 5
14.75 b 15 4
19.00 b 19 3
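Since the question invites other approaches, here is an alternative sketch (not from the original answer) using pd.merge_asof as the "lookup": it joins each equally spaced target to the nearest val1 within its category. It assumes val1 is sorted within each category, as stated, and casts val1 to float because merge_asof requires matching key dtypes.
import numpy as np
import pandas as pd

# Five equally spaced targets per category.
targets = (
    df.groupby('Category')['val1']
      .apply(lambda s: pd.Series(np.linspace(s.min(), s.max(), 5)))
      .rename('target')
      .reset_index(level=0)
)

# Nearest-key join of each target onto the original rows.
result = pd.merge_asof(
    targets.sort_values('target'),
    df.assign(val1=df['val1'].astype('float64')).sort_values('val1'),
    left_on='target', right_on='val1',
    by='Category', direction='nearest',
)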

Subtracting min value from previous value in pandas DataFrame

I want to subtract the minimum value of a column in a DataFrame from the value just above it. In R I would do this:
df <- data.frame(a=1:5, b=c(5,6,7,4,9))
df
a b
1 1 5
2 2 6
3 3 7
4 4 4
5 5 9
df$b[which.min(df$b)-1] - df$b[which.min(df$b)]
[1] 3
How can I do the same thing in pandas? More generally, how can I extract the row number in a pandas DataFrame where a certain condition is met?
You can use idxmin to find the index label of the minimum value (the first one if there are ties; older pandas spelled this argmin), then do the subtraction based on that location:
index = df.b.idxmin()
df.b[index-1] - df.b[index]
# 3
In case the index is not consecutive numbers:
i_index = df.b.values.argmin()
df.b.iat[i_index-1] - df.b.iat[i_index]
# 3
Or less efficiently:
-df.b.diff()[df.b.idxmin()]
# 3.0
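For the more general part of the question, the positional row numbers where a condition is met can be extracted with numpy's flatnonzero (a small sketch, not from the original answer):
import numpy as np

# Positional row numbers where the condition holds, independent of the
# index labels; here, the rows attaining the minimum of column b.
rows = np.flatnonzero(df.b == df.b.min())   # array([3])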
