I want to calculate the max value over the past 3 rows on a rolling basis, ignoring NaNs if I see them. I assumed that skipna would do that, but it doesn't. How can I ignore NaNs, and also what is skipna actually supposed to do?
In this code
import pandas as pd
df = pd.DataFrame({'sales': [25, 20, 14]})
df['max'] = df['sales'].rolling(3).max(skipna=True)
print(df)
The last column is
   sales   max
0     25   NaN
1     20   NaN
2     14  25.0
But I want it to be
   sales   max
0     25  25.0
1     20  25.0
2     14  25.0
skipna= has the default value of True, so adding it explicitly in your code has no effect. If you were to set it to False, you could get NaN as the max whenever there are NaNs in the original sales column. There is a nice explanation of why that would happen here.
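For reference, this is what skipna controls on a plain Series.max(), where the parameter is documented (a minimal sketch with made-up numbers):
import numpy as np
import pandas as pd
s = pd.Series([25, np.nan, 14])
print(s.max())              # 25.0 -- skipna=True (the default) ignores NaN
print(s.max(skipna=False))  # nan  -- a single NaN poisons the result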
In your example, you are getting those NaNs in the first two rows because the .rolling(3) call tells pandas to return NaN whenever the window holds fewer than 3 values. You can set the second parameter (min_periods) of .rolling() to require as little as one value:
df['max'] = df['sales'].rolling(3, min_periods=1).max()
df
#    sales   max
# 0     25  25.0
# 1     20  25.0
# 2     14  25.0
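min_periods also covers the case the question actually asks about, NaNs inside the data: rolling aggregations count only non-NaN observations, so a NaN inside the window is simply skipped as long as min_periods is satisfied. A small sketch with a NaN added to the sales data:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'sales': [25, np.nan, 14]})
df2['max'] = df2['sales'].rolling(3, min_periods=1).max()
print(df2)
#    sales   max
# 0   25.0  25.0
# 1    NaN  25.0
# 2   14.0  25.0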
You can also use Series.bfill on the result of your original command:
df['max'] = df['sales'].rolling(3).max().bfill()
Output:
   sales   max
0     25  25.0
1     20  25.0
2     14  25.0
I am stuck on a problem which looks simple but for which I cannot find a proper solution.
Consider a given Pandas dataframe df, composed of multiple columns A1, A2, etc., and let Ai be one of its columns, filled for example as follows:
Ai
25
30
30
NaN
12
15
15
NaN
I would like to delete all the rows in df whose Ai values lie between a NaN and the next change in value, so that my output (for column Ai) would be:
Ai
25
NaN
12
NaN
Any idea on how to do so would be very much appreciated. Thank you very much in advance.
update
Similar to the original solution below, but with a filter per group to keep the early duplicates:
m = df['Ai'].isna()
df.loc[(m | m.shift(fill_value=True))
       .groupby(df['Ai'].ne(df['Ai'].shift()).cumsum())
       .filter(lambda d: d.sum() > 0)
       .index]
Output:
    Ai
0  25.0
1  25.0
2  25.0
5   NaN
6  30.0
7  30.0
9   NaN
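The grouping key here is the usual run-labelling idiom: comparing each value with its predecessor and taking the cumulative sum gives every run of consecutive equal values its own id, and the filter then keeps only the runs containing at least one masked row. A minimal sketch on the question's data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Ai': [25, 30, 30, np.nan, 12, 15, 15, np.nan]})
runs = df['Ai'].ne(df['Ai'].shift()).cumsum()
print(runs.tolist())
# [1, 2, 2, 3, 4, 5, 5, 6] -- one id per run of equal values
# (NaN != NaN, so every NaN starts a run of its own)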
original answer
This is equivalent to selecting the NaN rows plus the row immediately below each NaN (the fill_value=True also keeps the first row). You can use a mask:
m = df['Ai'].isna()
df[m|m.shift(fill_value=True)]
Output:
     Ai
0  25.0
3   NaN
4  12.0
7   NaN
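To make the mask explicit, here is the same computation spelled out step by step on the question's data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Ai': [25, 30, 30, np.nan, 12, 15, 15, np.nan]})
m = df['Ai'].isna()
# m:                        F F F T F F F T   (where the NaNs are)
# m.shift(fill_value=True): T F F F T F F F   (row below each NaN; row 0 kept)
# m | m.shift(...):         T F F T T F F T   -> rows 0, 3, 4, 7
print(df[m | m.shift(fill_value=True)])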
I share a part of my big dataframe here to ask my question. In the Age column there are two missing values, in the first two rows. The way I intend to fill them is based on the following steps:
Calculate the mean age for each group. (Assume the mean value of Age in group A is X.)
Iterate through the Age column to detect the null values (which belong to the first two rows).
Return the Group value of each null Age value (which is 'A').
Fill those null Age values with the mean age of their corresponding group (the first two rows belong to A, so fill their null Age values with X).
I know how to do step 1 (I can use data.groupby('Group')['Age'].mean()), but I don't know how to proceed through step 4.
Thanks.
Use:
df['Age'] = (df['Age'].fillna(df.groupby('Group')['Age'].transform('mean'))
                      .astype(int))
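Note that the astype(int) cast assumes every group has at least one non-null Age; a group that is entirely NaN would stay NaN and make the cast fail. Since the question's dataframe isn't shown in full, here is a minimal sketch with made-up data showing how transform('mean') aligns the fill values:
import numpy as np
import pandas as pd
data = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                     'Age':   [np.nan, np.nan, 30.0, 20.0, 40.0]})
# transform('mean') broadcasts each group's mean back onto the original
# index, so fillna can align row by row
data['Age'] = data['Age'].fillna(data.groupby('Group')['Age'].transform('mean'))
print(data)
#   Group   Age
# 0     A  30.0
# 1     A  30.0
# 2     A  30.0
# 3     B  20.0
# 4     B  40.0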
I'm guessing you're looking for something like this:
import numpy as np
df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
Assuming your data looks like this (I didn't copy the whole dataframe):
  Name   Age
0    a   NaN
1    a   NaN
2    b  15.0
3    d  50.0
4    d  45.0
5    a   8.0
6    a   7.0
7    a   8.0
you would run:
df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
and get:
  Name        Age
0    a   7.666667   ---> the mean of group 'a'
1    a   7.666667
2    b  15.000000
3    d  50.000000
4    d  45.000000
5    a   8.000000
6    a   7.000000
7    a   8.000000
I am doing some pandas interpolation on a series whose index is not continuous. So it can be something like this:
    Value Customer_id
0       5           A
1     NaN           A
10      9           A
11     10           B
12    NaN           B
13     30           B
I'm interpolating taking the Customer_id into account (in this case it makes no difference, but my dataframe has NaNs at the starting or ending point of some customers).
So I'm doing
series = series.groupby('Customer_id').apply(lambda group: group.interpolate(method=interpolation_method))
Where interpolation_method is 'cubic' or 'index' (I'm testing both, for different purposes).
How can I do the interpolation and keep the original index, either in a column or in the index itself, so that I can later join with other dataframes?
You can define your own interpolation function using np.polyfit. Let's say you have this dataframe, where customer A begins with NaN:
    Value Customer_id
0     NaN           A
1     5.0           A
10    9.0           A
11   10.0           B
12    NaN           B
13   30.0           B
Fill the missing values with a custom interpolation:
import numpy as np
import pandas as pd

def interpolate(group):
    # fit a straight line through the rows that have a value,
    # using the index as the x coordinate
    x = group.dropna()
    params = np.polyfit(x.index, x['Value'], deg=1)
    predicted = np.polyval(params, group.index)
    s = pd.Series(predicted, index=group.index)
    # keep the original values and fill only the gaps
    return group['Value'].combine_first(s)
df.groupby('Customer_id').apply(interpolate).to_frame().reset_index(level=0)
Result:
   Customer_id      Value
0            A   4.555556
1            A   5.000000
10           A   9.000000
11           B  10.000000
12           B  20.000000
13           B  30.000000
This assumes that there is a minimum of 2 valid Values per customer.
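A built-in alternative (a sketch, not part of the original answer): pandas' index-aware interpolation also preserves the original index, but unlike the polyfit approach it only fills NaNs between valid points and does not extrapolate, so customer A's leading NaN stays NaN:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Value': [np.nan, 5.0, 9.0, 10.0, np.nan, 30.0],
                   'Customer_id': ['A', 'A', 'A', 'B', 'B', 'B']},
                  index=[0, 1, 10, 11, 12, 13])
# index-aware interpolation per customer; the original index is preserved
df['Value'] = (df.groupby('Customer_id')['Value']
                 .transform(lambda s: s.interpolate(method='index')))
print(df)
#     Value Customer_id
# 0     NaN           A   <- leading NaN is not filled (no extrapolation)
# 1     5.0           A
# 10    9.0           A
# 11   10.0           B
# 12   20.0           B
# 13   30.0           B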
So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols = ['reviews_per_month', 'review_scores_rating', 'review_scores_accuracy',
        'review_scores_cleanliness', 'review_scores_checkin',
        'review_scores_communication', 'review_scores_location',
        'review_scores_value']
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it runs without returning any error, it does not fill the NaN values: when I check whether there are still NaNs, the amount hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem here is that fillna tries to align indexes, and the index of the series airbnb.groupby('host_id')[i].mean() consists of the host_id values, not the original row index of airbnb, so the fillna does not work as you expect. Several options are possible to do the job. One way is to use transform after the groupby, which aligns each group's mean back to the original index values; then the fillna works as expected:
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even use this method without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
with an example:
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({'host_id': [1, 1, 1, 2, 2, 2],
                       'reviews_per_month': [4, 5, np.nan, 9, 3, 5],
                       'review_scores_rating': [3, np.nan, np.nan, np.nan, 7, 8]})
print(airbnb)
   host_id  review_scores_rating  reviews_per_month
0        1                   3.0                4.0
1        1                   NaN                5.0
2        1                   NaN                NaN
3        2                   NaN                9.0
4        2                   7.0                3.0
5        2                   8.0                5.0
and you get:
cols = ['reviews_per_month', 'review_scores_rating']  # would work with all your columns
print(airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
   host_id  review_scores_rating  reviews_per_month
0        1                   3.0                4.0
1        1                   3.0                5.0
2        1                   3.0                4.5
3        2                   7.5                9.0
4        2                   7.0                3.0
5        2                   8.0                5.0
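To see concretely why the original loop failed, look at the index of the per-group mean (a sketch on the example data, reviews_per_month only):
import numpy as np
import pandas as pd
airbnb = pd.DataFrame({'host_id': [1, 1, 1, 2, 2, 2],
                       'reviews_per_month': [4, 5, np.nan, 9, 3, 5]})
# the result is indexed by host_id, not by the original row labels 0..5,
# so airbnb[i].fillna(...) aligns against host_id values and does not
# fill the rows you expect
print(airbnb.groupby('host_id')['reviews_per_month'].mean())
# host_id
# 1    4.500000
# 2    5.666667
# Name: reviews_per_month, dtype: float64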
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan
df1.dropna(thresh=1, axis=1)
It seems that no NaN value has been deleted:
    0     1     2
0   0   NaN   NaN
1   3   NaN   NaN
2   6   NaN   8.0
3   9   NaN  11.0
4  12  13.0  14.0
If I run
df1.dropna(thresh=2, axis=1)
why does it give the following?
    0     2
0   0   NaN
1   3   NaN
2   6   8.0
3   9  11.0
4  12  14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be deleted?
thresh=N requires that a column has at least N non-NaNs to survive. In the first example, both columns have at least one non-NaN, so both survive. In the second example, only the last column has at least two non-NaNs, so it survives, but the previous column is dropped.
Try setting thresh to 4 to get a better sense of what's happening.
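Following that suggestion, here is what thresh=4 gives (a quick sketch on the same df1): column 0 has five non-NaN values, column 1 has one, and column 2 has three, so only column 0 meets the threshold.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan
print(df1.dropna(thresh=4, axis=1))
#     0
# 0   0
# 1   3
# 2   6
# 3   9
# 4  12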
The thresh parameter sets the minimum number of non-NaN values a row needs in order not to be dropped (or a column, when axis=1 as here).
This searches along each column and checks whether the column has at least 1 non-NaN value:
df1.dropna(thresh=1, axis=1)
Column 1 has only one non-NaN value, namely 13, but thresh=2 requires at least 2 non-NaN values, so that column fails the test and is dropped:
df1.dropna(thresh=2, axis=1)