Pandas - conditional row average - python

I have a dataframe:
x = pd.DataFrame({'1':[1,2,3,2,5,6,7,8,9], '2':[2,5,6,8,10,np.nan,6,np.nan,np.nan],
'3':[10,10,10,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
I am trying to generate an average of a row, but only over values greater than 5. For instance, if a row had values of 3, 6, 10, the average would be 8 ((6+10)/2); the 3 would be ignored as it is below 5.
The equivalent in excel would be =AVERAGEIF(B2:DX2,">=5")

You can keep only the values greater than 5 (masking the rest as NaN) and then take the row-wise mean:
x.where(x>5).mean(1)
Or:
x.mask(x<=5).mean(1)
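For reference, a minimal self-contained sketch of the first variant on the sample frame from the question (it produces the same result as the apply-based answer below):
import numpy as np
import pandas as pd

x = pd.DataFrame({'1': [1, 2, 3, 2, 5, 6, 7, 8, 9],
                  '2': [2, 5, 6, 8, 10, np.nan, 6, np.nan, np.nan],
                  '3': [10, 10, 10, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})

# values <= 5 become NaN, and mean(axis=1) skips NaN by default
x.where(x > 5).mean(axis=1)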

You can create a small custom function which filters out values smaller than or equal to a certain threshold, and apply it to each row of your dataframe:
def average_if(s, value=5):
    s = s.loc[s > value]
    return s.mean()

x.apply(average_if, axis=1)
0 10.0
1 10.0
2 8.0
3 8.0
4 10.0
5 6.0
6 6.5
7 8.0
8 9.0
dtype: float64

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill as the rule is dynamic: take the value from the previous row and divide it by the number of consecutive NaN + 1. For example, rows 3 and 4 should both be replaced with 12 (24/2), and rows 6, 7 and 8 with 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = df["Column 1"].notna() & (
    df["Column 1"].shift(-1).isna() | df["Column 1"].shift().isna()
)
out = df.groupby(m.cumsum()).transform(
    lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out)
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for rows whose own value is not NaN but whose next or previous value is NaN. Each such row is the first row of one of the groups we want.
So the m in the code above looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all Falses>], because those are the groups I want to take the average of. For that, use cumsum.
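For reference, m.cumsum() on this data looks like the following (the label increases by one at each True, so every True starts a new group):
0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5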
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now for each group you can get the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, assign the mean after filling NaN with 0; otherwise keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? There are several method options, one of which would probably fit your needs.
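A minimal sketch of that suggestion, assuming the question's dataframe with its "Column 1" column; note that plain linear interpolation fills each gap between its neighbouring values, which is not the same as the divide-by-count rule described in the question:
# hypothetical one-liner: interpolate the gaps instead of applying the custom rule
df["Column 1"] = df["Column 1"].interpolate(method="linear")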
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue

df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0

Way to produce a table in pandas given a formula and zeros

Let's say I have the following df:
Letter Number
a 0
b 0
c 0
d 1
e 2
f 3
I want to apply the following formula to the df
for i in range(1, len(df)):
    x = df.loc[i, 'Number'] / df.loc[i-1, 'Number'] + df.loc[i, 'Number']
    df.loc[i, 'Number'] = x
Note: the column 'Number' only has zeros in the first few rows; after that, there are no more zeros.
How would I apply the formula to the df without slicing the zeros off?
You can get the previous row's number by using shift(). Then you can compute the value using the formula you defined. For that we can use df.apply()
Here's how we can do it.
import pandas as pd
df = pd.DataFrame({'Letter':list('abcdef'),'Number':[0,0,0,1,2,3]})
print (df)
# capture the previous row's value
df['Prev'] = df.Number.shift()
# Prev is NaN for the first row, so the division there yields NaN;
# if Prev is 0, skip the division and use 0 instead, then add the current row's value
df['New'] = df.apply(lambda x: x.Number + ((x.Number / x.Prev) if (pd.isnull(x.Prev) or x.Prev != 0) else 0), axis=1)
print (df)
The output of this will be (Prev is the previous row; New is the computed result):
Letter Number Prev New
0 a 0 NaN NaN
1 b 0 0.0 0.0
2 c 0 0.0 0.0
3 d 1 0.0 1.0
4 e 2 1.0 4.0
5 f 3 2.0 4.5
If you want the first row to have a value of 0, we can modify the .shift() step a bit with fillna(0). That will make the first row's value 0. You can drop the Prev column after the computation.
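A minimal sketch of that tweak (filling the shifted NaN with 0 sends the first row into the do-not-divide branch, so its New value becomes 0; the Prev helper column is dropped afterwards):
df['Prev'] = df.Number.shift().fillna(0)
df['New'] = df.apply(lambda x: x.Number + ((x.Number / x.Prev) if x.Prev != 0 else 0), axis=1)
df = df.drop(columns='Prev')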
This snippet divides, using pd.Series.div, the Series Number by the shifted values of Number and then adds Number using pd.Series.add
>>> df.Number.div(df.Number.shift()).add(df.Number)
0 NaN
1 NaN
2 NaN
3 inf
4 4.0
5 4.5
Name: Number, dtype: float64
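Note the inf in row 3, which comes from dividing by a previous value of 0. If those rows should instead keep just the current value (as in the first answer), one possible tweak, assuming numpy is imported as np, is to treat 0 divisors as missing and fill the failed divisions with 0 before adding (this also turns the first row's NaN into 0):
df.Number.div(df.Number.shift().replace(0, np.nan)).fillna(0).add(df.Number)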

pandas: how to get if column is greater than x select the max of two columns otherwise select mean?

I have a df that looks like this and want to add an adj_mean column that selects the max if one of the two columns (Avg or rolling_mean) is 0, and otherwise takes the average of the two columns.
ID Avg rolling_mean adj_mean (goal to have this column)
0 5 0 5
1 6 6.3 6.15
2 5 8 6.5
3 4 0 4
I was able to get the max value of the columns using this code
df["adj_mean"]=df[["Avg", "rolling_mean"]].max(axis=1)
but not sure how to add the avg if both values are greater than zero.
Many thanks!
One approach can be to treat 0 as NaN and then simply calculate the mean
df['adj_mean'] = df.replace({0: np.nan})[["Avg", "rolling_mean"]].mean(axis=1)
Out[1]:
rolling_mean Avg adj_mean
0 0.0 5 5.00
1 6.3 6 6.15
2 8.0 5 6.50
3 0.0 4 4.00
By default, df.mean() skips null values. Per the docs:
skipna : bool, default True
Exclude NA/null values when computing the result.
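For completeness, a self-contained sketch of that approach on the question's data (column names as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Avg': [5, 6, 5, 4], 'rolling_mean': [0, 6.3, 8, 0]})

# treat 0 as missing; mean(axis=1) then skips it and averages whatever is left
df['adj_mean'] = df.replace({0: np.nan})[['Avg', 'rolling_mean']].mean(axis=1)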

Pandas interpolate keep original index

I am doing some pandas interpolation in a series in which the index is not continuous. So it can be something like this:
    Value Customer_id
0       5           A
1     NaN           A
10      9           A
11     10           B
12    NaN           B
13     30           B
I'm interpolating taking the Customer_id into account (in this case it makes no difference, but my dataframe has NaNs at the starting or ending point of a customer).
So I'm doing
series = series.groupby('Customer_id').apply(lambda group: group.interpolate(method=interpolation_method))
where interpolation_method is 'cubic' or 'index' (I'm testing both, for different purposes).
How can I do the interpolation and keep the original index, either in a column or in the index itself if possible, so that I can later join with other data frames?
You can define your own interpolation function using np.polyfit. Let's say you have this dataframe, where customer A begins with NaN:
Value Customer_id
0 NaN A
1 5.0 A
10 9.0 A
11 10.0 B
12 NaN B
13 30.0 B
Fill the missing values with a custom interpolation:
def interpolate(group):
    x = group.dropna()
    params = np.polyfit(x.index, x['Value'], deg=1)
    predicted = np.polyval(params, group.index)
    s = pd.Series(predicted, index=group.index)
    return group['Value'].combine_first(s)
df.groupby('Customer_id').apply(interpolate).to_frame().reset_index(level=0)
Result:
Customer_id Value
0 A 4.555556
1 A 5.000000
10 A 9.000000
11 B 10.000000
12 B 20.000000
13 B 30.000000
This assumes that there is a minimum of 2 valid Value entries per customer.

Delete rows group by one column and all rows have nan value in an other column in panda / python

I want to group by one column (id) and check whether all the values of another column, day, are NaN; in that case, delete all the corresponding rows.
I was trying something like this:
output = entry.iloc[entry.groupby['id'].day.isnull()]
but it's not working...
entry = pd.DataFrame([ [1,],[1,],[1,],[1,],[2,3],[2,],[2,4]],columns=['id','day'])
output = pd.DataFrame([[2,3],[2,],[2,4]],columns=['id','day'])
groupby, transform, and count (keep the rows whose id group has a non-zero count of day values):
entry.loc[entry.groupby('id')['day'].transform('count').ne(0)]
Out[154]:
id day
4 2 3.0
5 2 NaN
6 2 4.0
You can directly use groupby with filter on the day column (the result keeps the original index, so it can be used to select the full rows from entry):
entry.groupby('id').day.filter(lambda g: g.notnull().sum() != 0)
4 3.0
5 NaN
6 4.0
Using groupby followed by filter to keep the id groups whose sum of day is greater than 0:
entry.groupby('id').filter(lambda group: group.day.sum()>0)
Result:
id day
4 2 3.0
5 2 NaN
6 2 4.0
