I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['time'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Neither of these gives exactly what I want. If someone could guide me on this it would be much appreciated!
You can fill the NaNs in value from a new Series built with shift + expanding + mean. The first value of group 1 is not replaced, because no previous non-NaN values exist for it:
# the dates are day-first (e.g. 24/5/2019), so parse them accordingly
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# group_keys=False keeps the original index so that fillna can align on it
s = df.groupby('category', group_keys=False)['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print(df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-06-01
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-09-01
8 1 1.5 2019-09-12
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
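To see why shift + expanding + mean gives the "previous mean", here is a small self-contained sketch that rebuilds the question's data and keeps the intermediate series. The Date column is left out because it does not affect the result, and the names prev_mean and filled are only illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question, already in date order (Date column omitted).
df = pd.DataFrame({
    "category": [0, 1, 1, 2, 1, 2, 2, 7, 1, 2],
    "value": [1, np.nan, 1, 2, 2, np.nan, 3, 3, np.nan, np.nan],
})

# shift() moves each group's values down one row, so a row never sees itself;
# expanding().mean() then averages everything seen so far, skipping NaN.
prev_mean = df.groupby("category")["value"].transform(
    lambda x: x.shift().expanding().mean()
)

# Only the NaN rows take the running mean; existing values are kept.
filled = df["value"].fillna(prev_mean)
print(pd.DataFrame({"category": df["category"], "prev_mean": prev_mean, "filled": filled}))
```

Row 1 stays NaN because category 1 has no earlier value at that point.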
Related
I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non nan values to nan. So let's take a simple example:
df = pd.DataFrame({'id':[1,1,1,2,2,],
'value':[1,2,np.nan, 9,8]})
df
Out[1]:
id value
0 1 1.0
1 1 2.0
2 1 NaN
3 2 9.0
4 2 8.0
The desired result for n=1 would look like the following:
Out[53]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Use with groupby().cumcount():
N=1
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()
sizes = groups['value'].transform('size')
df['value'] = df['value'].where(enum < sizes - N)
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
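You can take a reversed cumsum of the notna flags within each group. For every row this counts how many non-NaN values remain from that row to the end of its group, so keeping only rows where the count is greater than 1 drops the last one (N = 1):
df['value'] = df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum() > 1)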
df
Out[86]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
.loc[df['value'].notna()]
.groupby('id')
.cumcount(ascending=False)
.lt(N)
)
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
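The reversed-cumcount idea can be wrapped in a small helper for any N. This is only a sketch: the function name blank_last_n and its parameters are made up for illustration, and it returns a copy rather than modifying the input.

```python
import numpy as np
import pandas as pd

def blank_last_n(df, group_col, value_col, n):
    """Return a copy of df with the last n non-NaN values per group set to NaN."""
    valid = df[value_col].notna()
    # Rank the non-NaN rows of each group counting from the end (0 = last).
    rank_from_end = df.loc[valid].groupby(group_col).cumcount(ascending=False)
    to_blank = rank_from_end.index[rank_from_end < n]
    out = df.copy()
    out.loc[to_blank, value_col] = np.nan
    return out

df = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                   "value": [1, 2, np.nan, 9, 8]})
print(blank_last_n(df, "id", "value", 1))
```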
Let's say we want to compute the variable D in the dataframe below from the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes,
the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value in the first row after the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
.dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Makes the following possible
>>> df['D'] = (df.groupby('A')
.apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
.reset_index(drop=True))
You can always drop these new columns later.
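A possible variation, not taken from the answers above: shifting B within each group of A puts a missing value at the start of every group directly, so no separate masking step is needed. The sketch below uses only the first few rows of the question's data and assumes the times parse with pd.to_timedelta, as in the first answer.

```python
import pandas as pd

# First rows of each category from the question's table.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": ["5:43:00", "6:19:00", "6:53:00", "6:07:00", "6:42:00"],
    "C": ["5:24:00", "5:47:00", "6:23:00", "5:40:00", "6:11:00"],
})

b = pd.to_timedelta(df["B"])
c = pd.to_timedelta(df["C"])

# Previous row's B, but only within the same category of A.
prev_b = b.groupby(df["A"]).shift()

# Difference in minutes; the first row of every category has no previous B.
df["D"] = (c - prev_b).dt.total_seconds() / 60
print(df)
```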
I have a data frame with some values by year and type. I want to replace all NaN values in each year with the mean of values in that year with a specific type. I would like to do this in the most elegant way possible. I'm dealing with a lot of data so less computation would be good as well.
Example:
df =pd.DataFrame({'year':[1,1,1,2,2,2],
'type':[1,1,2,1,1,2],
'val':[np.nan,5,10,100,200,np.nan]})
I want ALL nan's regardless of their type to be replaced with their respective year mean of all type 1.
In this example, the first row NaN should be replaced with 5 and the last row should be replaced with 150.
My attempt below only fills in the values that are missing for type 1, not type 2:
df['val'] = df['val'].fillna(df.query('type==1').groupby('year')['val'].transform('mean'))
You want map:
# calculate mean val of type 1 by year
s = df[df['type'].eq(1)].groupby('year')['val'].mean()
# replace `year` by the above mean, and fill in the Nan
df['val'] = df['val'].fillna(df['year'].map(s))
Output:
year type val
0 1 1 5.0
1 1 1 5.0
2 1 2 10.0
3 2 1 100.0
4 2 1 200.0
5 2 2 150.0
Using fillna and matching indexes
df['val'] = (df.set_index('year').val
.fillna(df.query('type == 1').groupby(['year']).val.mean())
.values)
year type val
0 1 1 5.0
1 1 1 5.0
2 1 2 10.0
3 2 1 100.0
4 2 1 200.0
5 2 2 150.0
mask and transform
df.fillna({'val': df.val.mask(df.type.ne(1)).groupby(df.year).transform('mean')})
year type val
0 1 1 5.0
1 1 1 5.0
2 1 2 10.0
3 2 1 100.0
4 2 1 200.0
5 2 2 150.0
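For reference, here is the map approach as a complete, runnable sketch on the question's data. The variable name type1_mean is only illustrative; printing it shows the year-to-mean lookup that map applies to every row, whatever its type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1, 1, 1, 2, 2, 2],
                   "type": [1, 1, 2, 1, 1, 2],
                   "val": [np.nan, 5, 10, 100, 200, np.nan]})

# One mean per year, computed from type 1 rows only (NaN is skipped).
type1_mean = df[df["type"].eq(1)].groupby("year")["val"].mean()
print(type1_mean)

# map looks up every row's year in that Series, so rows of any type
# receive the type 1 mean of their year; fillna uses it only where val is NaN.
df["val"] = df["val"].fillna(df["year"].map(type1_mean))
print(df)
```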
I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I want to use the keys in columns one and two: within each group of rows that share the same keys, if column three is not entirely NaN, the missing entries should be filled with the existing value from a row with the same keys.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that the group with keys 1 and 3 stays empty because it has no existing value to fill from.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have also tried forward fill, which gives me a rather strange result: it seems to forward fill column two instead. This is the code I am using for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
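If there is only one non-NaN value per group, use ffill (forward fill) followed by bfill (backward fill) within each group, which needs apply with a lambda (group_keys=False keeps the original index so the result can be assigned back):
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.ffill().bfill()))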
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and you need to replace NaN with a single constant, e.g. the group mean:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
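The group-mean fill can also be written without a lambda. This is a sketch of an alternative, assuming the built-in 'mean' aggregation passed to transform is acceptable; the name group_mean is only illustrative. A group that is entirely NaN has a NaN mean, so it stays unfilled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, 40, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# transform('mean') broadcasts each group's mean back onto its rows,
# so the result has the same index as df and aligns in fillna.
group_mean = df.groupby(["one", "two"])["three"].transform("mean")
df["three"] = df["three"].fillna(group_mean)
print(df)
```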
You can sort the data by the column with missing values (sort_values puts NaN last, so each group's existing value comes first), then group and forward fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
I want to fill in data between each row in a dataframe with an average of current and next row (where columns are numeric)
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To do this I appended a copy of the starting dataframe to itself to create the intermediate dataframe shown above:
df = pd.concat([df_0, df_0])  # DataFrame.append was removed in pandas 2.0
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value_2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(numeric_val):
#average every odd row
if int(row.name) % 2 != 0:
#take average of value and value_shift for each value
#but this way I need to create 3 separate functions
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method using DataFrame.reindex and DataFrame.interpolate?
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex, in half steps reindex(np.arange(len(df.index) * 2) / 2)
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values. The default is linear interpolation, which here is the mean of the rows before and after.
Finally, use .reset_index(drop=True) to fix your index.
Should give
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
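Here is the reindex + interpolate approach as a runnable sketch on the starting data. The last column is named value_2 here as an assumption, to match the question's code rather than its table header. Note that the final inserted row has no following row to average with, so interpolate repeats the last known values there.

```python
import numpy as np
import pandas as pd

df_0 = pd.DataFrame({"time": [0, 2],
                     "value": [0, 1],
                     "value_1": [4, 6],
                     "value_2": [3, 6]})

# Labels 0, 0.5, 1, 1.5: the whole numbers match existing rows,
# the halves become new all-NaN rows in between.
half_steps = np.arange(len(df_0.index) * 2) / 2
expanded = df_0.reindex(half_steps)

# Linear interpolation fills every column at once, whatever their number.
df_1 = expanded.interpolate().reset_index(drop=True)
print(df_1)
```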