(pandas) Fill NaN based on groupby and column condition - python

Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the NaN with a specific value in a second column, based on a condition in a third column?
For example:
>>> df = pd.DataFrame({'date': ['01/10/2017', '02/09/2017', '02/10/2016', '01/10/2017', '01/11/2017', '02/10/2016'],
...                    'a': [1, 1, 1, 2, 2, 2], 'b': [4, np.nan, 6, 5, np.nan, 7]})
>>> df
   a    b        date
0  1  4.0  01/10/2017
1  1  NaN  02/09/2017
2  1  6.0  02/10/2016
3  2  5.0  01/10/2017
4  2  NaN  01/11/2017
5  2  7.0  02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
   a    b        date
0  1  4.0  01/10/2017
1  1  6.0  02/09/2017
2  1  6.0  02/10/2016
3  2  5.0  01/10/2017
4  2  5.0  01/11/2017
5  2  7.0  02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
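For concreteness, a minimal sketch of what such a closest_date() might look like (purely hypothetical; any implementation that maps each group's dates to the nearest other date in that group, aligned to the group's index, fits the solutions below):
import numpy as np
import pandas as pd

def closest_date(dates):
    # pairwise absolute gaps between every pair of dates in the group
    d = pd.to_datetime(dates).values
    diff = np.abs(d[:, None] - d[None, :])
    # mask the diagonal with a gap larger than any real one,
    # so a row never matches its own date
    np.fill_diagonal(diff, np.timedelta64(10**4, 'D'))
    return pd.Series(d[diff.argmin(axis=1)], index=dates.index)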

This should work:
df['closest_date_by_a'] = df.groupby('a', group_keys=False)['date'].apply(closest_date)
# ffill and bfill must both run per group, or values can leak across groups
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].transform(lambda s: s.ffill().bfill())
Given a function (closest_date()), you need to apply that function by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest date column (closest_date_by_a) and perform your filling.

Ensure that your date column actually contains dates.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'date': ['01/10/2017', '02/09/2017', '02/10/2016', '01/10/2017', '01/11/2017', '02/10/2016'],
     'a': [1, 1, 1, 2, 2, 2], 'b': [4, np.nan, 6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
   a    b       date
0  1  4.0 2017-01-10
1  1  NaN 2017-02-09
2  1  6.0 2016-02-10
3  2  5.0 2017-01-10
4  2  NaN 2017-01-11
5  2  7.0 2016-02-10
Use reindex with method='nearest' after dropna():
def fill_with_nearest(df):
    # index by date so reindex can measure the distance between dates
    s = df.set_index('date').b
    # drop the NaNs, then reindex back onto the full date index,
    # taking the value at the nearest available date for each gap
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = df.index
    return s

df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
   a    b       date
0  1  4.0 2017-01-10
1  1  4.0 2017-02-09
2  1  6.0 2016-02-10
3  2  5.0 2017-01-10
4  2  5.0 2017-01-11
5  2  7.0 2016-02-10
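The core trick in isolation (a small sketch): drop the NaNs, then reindex the surviving values back onto the full date index, letting method='nearest' pick the closest available date for each gap. Note that reindex with a fill method needs a monotonic source index, hence the sort_index here:
import pandas as pd

s = pd.Series([4.0, None, 6.0],
              index=pd.to_datetime(['2017-01-10', '2017-02-09', '2016-02-10']))
filled = s.dropna().sort_index().reindex(s.index, method='nearest')
print(filled)
# 2017-01-10    4.0
# 2017-02-09    4.0   <- nearest non-null date is 2017-01-10
# 2016-02-10    6.0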

Related

Fill NaNs in Df using groupby and rolling mean

I have a dataframe that looks like this
import numpy as np
import pandas as pd

d = {'date': ['1999-01-01', '1999-01-02', '1999-01-03', '1999-01-04', '1999-01-05', '1999-01-06'],
     'ID': [1, 1, 1, 1, 1, 1], 'Value': [1, 2, 3, np.nan, 5, 6]}
df = pd.DataFrame(data=d)
         date  ID  Value
0  1999-01-01   1      1
1  1999-01-02   1      2
2  1999-01-03   1      3
3  1999-01-04   1    NaN
4  1999-01-05   1      5
5  1999-01-06   1      6
I would like to fill in the NaNs using a rolling mean (e.g. window of 2) and extend that to a df with multiple IDs and dates. I tried something like the following, but it takes a very long time and fails with the error "cannot join with no overlapping index names":
df.groupby(['date','ID']).fillna(df.rolling(2, min_periods=1).mean().shift())
or
df.groupby(['date','ID']).fillna(df.groupby(['date','ID']).rolling(2, min_periods=1).mean().shift())
IIUC, here is one way to do it. (If you add your expected output, that will help validate this solution.)
df2 = df.fillna(0).groupby('ID')['Value'].rolling(2).mean().reset_index()
df.update(df2, overwrite=False)   # overwrite=False: only the NaN cells in df are updated
df
         date  ID  Value
0  1999-01-01   1    1.0
1  1999-01-02   1    2.0
2  1999-01-03   1    3.0
3  1999-01-04   1    1.5
4  1999-01-05   1    5.0
5  1999-01-06   1    6.0
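Note that the fillna(0) above makes the window treat the missing value as 0 (1.5 = mean(3, 0)). If the intent is to fill only the NaNs from the rolling mean of the previous values per ID, a sketch of that variant:
import numpy as np
import pandas as pd

d = {'date': ['1999-01-01', '1999-01-02', '1999-01-03', '1999-01-04', '1999-01-05', '1999-01-06'],
     'ID': [1, 1, 1, 1, 1, 1], 'Value': [1, 2, 3, np.nan, 5, 6]}
df = pd.DataFrame(data=d)

# per-ID rolling mean of the *previous* window, used only where Value is NaN
roll = df.groupby('ID')['Value'].transform(lambda s: s.rolling(2, min_periods=1).mean().shift())
df['Value'] = df['Value'].fillna(roll)
print(df)   # the NaN on 1999-01-04 becomes 2.5, the mean of 2 and 3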

Python - How to clean time series data

I have a df which looks like this:
df = pd.DataFrame({'Date': ['2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29'],
                   'Sep': [1, 10, 5, 'NaN', 'NaN', 'NaN', 'NaN'],
                   'Dec': [2, 8, 4, 7, 9, 1, 5]})
I'm trying to create a new column called 'First_Contract':
'First_Contract' needs to take the 'Sep' values up to the third-last value before the 'Sep' column reaches NaN.
The subsequent values need to be filled with the 'Dec' column values.
Desired output:
df2 = pd.DataFrame({'Date': ['2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29'],
                    'Sep': [1, 10, 5, 'NaN', 'NaN', 'NaN', 'NaN'],
                    'Dec': [2, 8, 4, 7, 9, 1, 5],
                    'First_Contract': [1, 8, 4, 7, 9, 1, 5]})
How do I go about achieving this?
Let us do it step by step:
df.Sep.replace({'NaN': np.nan}, inplace=True)   # turn the 'NaN' strings into real NaN
df['FC'] = df['Dec']                            # start from Dec everywhere
ids = df.Sep.last_valid_index() - 2             # the third-last valid Sep row
df.loc[ids, 'FC'] = df.Sep[ids]                 # take Sep only at that row
df
Out[126]:
         Date   Sep  Dec  First_Contract   FC
0  2019-09-23   1.0    2               1  1.0
1  2019-09-24  10.0    8               8  8.0
2  2019-09-25   5.0    4               4  4.0
3  2019-09-26   NaN    7               7  7.0
4  2019-09-27   NaN    9               9  9.0
5  2019-09-28   NaN    1               1  1.0
6  2019-09-29   NaN    5               5  5.0
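last_valid_index() is doing the real work here; in isolation (a small sketch):
import numpy as np
import pandas as pd

s = pd.Series([1, 10, 5, np.nan, np.nan])
print(s.last_valid_index())   # 2, the label of the last non-NaN entry
# so last_valid_index() - 2 points at the third-last valid row, index 0 here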
You can use pd.concat and last_valid_index to create your column:
df['First_contract'] = pd.concat((
    df.Sep.iloc[:df.Sep.last_valid_index() - 1],
    df.Dec.iloc[df.Sep.last_valid_index() - 1:]
)).astype(int)
Complete code (I replaced strings 'NaN' with np.nan in Sep column; it is not needed if they are already NaN):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': ['2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29'],
                   'Sep': [1, 10, 5, 'NaN', 'NaN', 'NaN', 'NaN'],
                   'Dec': [2, 8, 4, 7, 9, 1, 5]})
df.Sep.replace({'NaN': np.nan}, inplace=True)
df['First_contract'] = pd.concat((
    df.Sep.iloc[:df.Sep.last_valid_index() - 1],
    df.Dec.iloc[df.Sep.last_valid_index() - 1:]
)).astype(int)
Output:
         Date   Sep  Dec  First_contract
0  2019-09-23   1.0    2               1
1  2019-09-24  10.0    8               8
2  2019-09-25   5.0    4               4
3  2019-09-26   NaN    7               7
4  2019-09-27   NaN    9               9
5  2019-09-28   NaN    1               1
6  2019-09-29   NaN    5               5
You can use numpy to fill in Sep where the index is 3 behind the first null index, and fill the rest with Dec:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': ['2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29'],
                   'Sep': [1, 10, 5, np.nan, np.nan, np.nan, np.nan],
                   'Dec': [2, 8, 4, 7, 9, 1, 5]})
df['First_Contract'] = np.where(df.index == df.Sep.isnull().idxmax() - 3, df.Sep, df.Dec)
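A quick check of the result (np.where promotes to float here because Sep is float):
print(df['First_Contract'].tolist())
# [1.0, 8.0, 4.0, 7.0, 9.0, 1.0, 5.0]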

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category  value       Date
       0      1  24/5/2019
       1    NaN  24/5/2019
       1      1  26/5/2019
       2      2   1/6/2019
       1      2  23/7/2019
       2    NaN  18/8/2019
       2      3  20/8/2019
       7      3   1/9/2019
       1    NaN  12/9/2019
       2    NaN  13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
(source)
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.mean()))
(Source 2)
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace the values with a new Series built from shift + expanding + mean. The first value of group 1 is not replaced, because no previous values exist:
df['Date'] = pd.to_datetime(df['Date'])
# running mean of all previous values within each category
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
   category  value       Date
0         0    1.0 2019-05-24
1         1    NaN 2019-05-24
2         1    1.0 2019-05-26
3         2    2.0 2019-01-06
4         1    2.0 2019-07-23
5         2    2.0 2019-08-18
6         2    3.0 2019-08-20
7         7    3.0 2019-01-09
8         1    1.5 2019-12-09
9         2    2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value'] = df['value'].fillna(
    df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
   category  value       Date
0         0    1.0  24/5/2019
1         1    NaN  24/5/2019
2         1    1.0  26/5/2019
3         2    2.0   1/6/2019
4         1    2.0  23/7/2019
5         2    2.0  18/8/2019
6         2    3.0  20/8/2019
7         7    3.0   1/9/2019
8         1    1.5  12/9/2019
9         2    2.5  13/9/2019
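What shift().expanding().mean() computes, shown in isolation on a tiny series (a sketch): the running mean of all previous values, which is why each NaN receives its group's prior mean.
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
print(s.shift().expanding().mean())
# 0    NaN   (no previous values)
# 1    1.0   (mean of [1])
# 2    1.5   (mean of [1, 2])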

Move values in rows to a new column in pandas

I have a DataFrame with an Ids column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id to a new column in the row, as shown below.
I guess there is an opposite function to "melt" that allows this, but I'm not getting how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id": [1, 1, 1, 2, 2, 3, 3, 4, 5], "value": [12, 13, 1, 22, 21, 23, 53, 64, 9]}
d2 = {"id": [1, 2, 3, 4, 5],
      "value1": [12, 22, 23, 64, 9],
      "value2": [13, 21, 53, "", ""],
      "value3": [1, "", "", "", ""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
   id  value0  value1  value2
0   1    12.0    13.0     1.0
1   2    22.0    21.0     NaN
2   3    23.0    53.0     NaN
3   4    64.0     NaN     NaN
4   5     9.0     NaN     NaN
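The cumcount() call supplies the within-group position that becomes the new column number; in isolation (a small sketch):
import pandas as pd

tmp = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 4, 5]})
print(tmp.groupby('id').cumcount().tolist())
# [0, 1, 2, 0, 1, 0, 1, 0, 0]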
Missing values can be replaced with fillna, but that mixes numeric and string data, so some functions may fail:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
   id  value0 value1 value2
0   1    12.0     13      1
1   2    22.0     21
2   3    23.0     53
3   4    64.0
4   5     9.0
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
   id   0     1    2
0   1  12  13.0  1.0
1   2  22  21.0  NaN
2   3  23  53.0  NaN
3   4  64   NaN  NaN
4   5   9   NaN  NaN
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
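For what it's worth, the same reshape can also be written with pivot; a sketch equivalent to the unstack answer above (k is just a scratch column name):
import pandas as pd

d = {"id": [1, 1, 1, 2, 2, 3, 3, 4, 5], "value": [12, 13, 1, 22, 21, 23, 53, 64, 9]}
df = pd.DataFrame(d)
out = (df.assign(k=df.groupby('id').cumcount())
         .pivot(index='id', columns='k', values='value')
         .add_prefix('value')
         .reset_index())
print(out)   # id, value0, value1, value2, as in the first answer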

How to create a DataFrame from a dictionary of Series

I have a dictionary of Pandas Series objects that I want to turn into a DataFrame. The key for each series should be the column heading. The individual series overlap, but each label is unique.
I thought I should be able to just do
df = pd.DataFrame(data)
But I keep getting the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I get the same error if I try to turn each series into a frame, and use pd.concat(data, axis=1).
Which doesn't make sense if you take the column label into account. What am I doing wrong, and how do I fix it?
I believe you need reset_index with parameter drop=True for each Series in a dict comprehension, because of duplicates in the index:
import pandas as pd

s = pd.Series([1, 4, 5, 2, 0], index=[1, 2, 2, 3, 5])
s1 = pd.Series([5, 7, 8, 1], index=[1, 2, 3, 4])
data = {'a': s, 'b': s1}

print (s.reset_index(drop=True))
0    1
1    4
2    5
3    2
4    0
dtype: int64
df = pd.concat({k: v.reset_index(drop=True) for k, v in data.items()}, axis=1)
print (df)
   a    b
0  1  5.0
1  4  7.0
2  5  8.0
3  2  1.0
4  0  NaN
If you need to drop rows with a duplicated index, use boolean indexing with duplicated:
print (s[~s.index.duplicated()])
1    1
2    4
3    2
5    0
dtype: int64

df = pd.concat({k: v[~v.index.duplicated()] for k, v in data.items()}, axis=1)
print (df)
     a    b
1  1.0  5.0
2  4.0  7.0
3  2.0  8.0
4  NaN  1.0
5  0.0  NaN
Another solution:
print (s.groupby(level=0).mean())
1    1.0
2    4.5
3    2.0
5    0.0
dtype: float64

df = pd.concat({k: v.groupby(level=0).mean() for k, v in data.items()}, axis=1)
print (df)
     a    b
1  1.0  5.0
2  4.5  7.0
3  2.0  8.0
4  NaN  1.0
5  0.0  NaN
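And once every Series has a unique index, the original pd.DataFrame(data) call from the question works directly (a sketch):
import pandas as pd

s = pd.Series([1, 4, 5, 2, 0], index=[1, 2, 2, 3, 5])
s1 = pd.Series([5, 7, 8, 1], index=[1, 2, 3, 4])
data = {'a': s[~s.index.duplicated()], 'b': s1}
print(pd.DataFrame(data))   # no InvalidIndexError once the labels are unique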
