I have a df which looks like this:
df = pd.DataFrame({'Date':['2019-09-23','2019-09-24','2019-09-25','2019-09-26','2019-09-27','2019-09-28','2019-09-29'],
'Sep':[1,10,5,'NaN','NaN','NaN','NaN'],
'Dec':[2,8,4,7,9,1,5]})
I'm trying to create a new column called 'First_Contract':
'First_Contract' needs to take the third-to-last value of the 'Sep' column before 'Sep' reaches NaN.
The subsequent values need to be filled with the 'Dec' column values.
Desired output:
df2= pd.DataFrame({'Date':['2019-09-23','2019-09-24','2019-09-25','2019-09-26','2019-09-27','2019-09-28','2019-09-29'],
'Sep':[1,10,5,'NaN','NaN','NaN','NaN'],
'Dec':[2,8,4,7,9,1,5],
'First_Contract':[1,8,4,7,9,1,5]})
How do I go about achieving this?
Let us do it step by step: copy Dec into the new column, then overwrite the single cell that must come from Sep (two positions before Sep's last valid index).
import numpy as np

df['Sep'] = df['Sep'].replace('NaN', np.nan)
df['FC'] = df['Dec']
ids = df.Sep.last_valid_index() - 2
df.loc[ids, 'FC'] = df.Sep[ids]
df
Out[126]:
        Date   Sep  Dec   FC
0 2019-09-23   1.0    2  1.0
1 2019-09-24  10.0    8  8.0
2 2019-09-25   5.0    4  4.0
3 2019-09-26   NaN    7  7.0
4 2019-09-27   NaN    9  9.0
5 2019-09-28   NaN    1  1.0
6 2019-09-29   NaN    5  5.0
You can use pd.concat and last_valid_index to create your column:
df['First_contract'] = pd.concat((
    df.Sep.iloc[:df.Sep.last_valid_index() - 1],
    df.Dec.iloc[df.Sep.last_valid_index() - 1:]
)).astype(int)
Complete code (I replaced the strings 'NaN' with np.nan in the Sep column; this step is not needed if the values are already NaN):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date':['2019-09-23','2019-09-24','2019-09-25','2019-09-26','2019-09-27','2019-09-28','2019-09-29'],
'Sep':[1,10,5, 'NaN','NaN','NaN','NaN'],
'Dec':[2,8,4,7,9,1,5]})
df['Sep'] = df['Sep'].replace('NaN', np.nan)
df['First_contract'] = pd.concat((
    df.Sep.iloc[:df.Sep.last_valid_index() - 1],
    df.Dec.iloc[df.Sep.last_valid_index() - 1:]
)).astype(int)
Output:
Date Sep Dec First_contract
0 2019-09-23 1.0 2 1
1 2019-09-24 10.0 8 8
2 2019-09-25 5.0 4 4
3 2019-09-26 NaN 7 7
4 2019-09-27 NaN 9 9
5 2019-09-28 NaN 1 1
6 2019-09-29 NaN 5 5
You can use numpy to fill in Sep where the index is 3 behind the first null index, and fill the rest with Dec. (df.Sep.isnull().idxmax() returns the first index at which Sep is NaN, assuming at least one NaN exists, so subtracting 3 lands on the row holding Sep's third-to-last valid value.)
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date':['2019-09-23','2019-09-24','2019-09-25','2019-09-26','2019-09-27','2019-09-28','2019-09-29'],
'Sep':[1,10,5,np.nan,np.nan,np.nan,np.nan],
'Dec':[2,8,4,7,9,1,5]})
df['First_Contract'] = np.where(df.index==df.Sep.isnull().idxmax()-3, df.Sep, df.Dec)
I am dealing with pandas DataFrames like this:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 NaN
5 2 NaN
6 1 300
7 1 NaN
I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 20
5 2 200
6 1 300
7 1 300
Is there some slick way to do this without manually looping over rows?
You could perform a groupby/forward-fill operation on each group:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)
yields
id x
0 1 10.0
1 1 20.0
2 2 100.0
3 2 200.0
4 1 20.0
5 2 200.0
6 1 300.0
7 1 300.0
df
id val
0 1 23.0
1 1 NaN
2 1 NaN
3 2 NaN
4 2 34.0
5 2 NaN
6 3 2.0
7 3 NaN
8 3 NaN
df.sort_values(['id','val']).groupby('id').ffill()
id val
0 1 23.0
1 1 23.0
2 1 23.0
4 2 34.0
3 2 34.0
5 2 34.0
6 3 2.0
7 3 2.0
8 3 2.0
Use sort_values, groupby, and ffill: sorting by ['id', 'val'] pushes each group's NaNs to the end (NaN sorts last), so every NaN row has a valid value above it within its group, and groups whose first value (or first several values) is NaN still get filled.
Solution for multi-key problem:
In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.
import os
import pandas as pd
#sort to make indexing faster
df.sort_values(by=['date','region','type'], inplace=True)
#collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))
#record column names
df_cols = df.columns
#delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass
# steps:
# 1) grab rows with a particular region and type
# 2) use forwardfill to fill nulls
# 3) use backwardfill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df = group_df.ffill()
        group_df = group_df.bfill()
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)
Checking the result:
#load in the ffill_df
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = ['date'] + list(df_cols)  # the saved date index comes back as the first column
ffill_df.index = ffill_df.date
ffill_df.drop('date', axis=1, inplace=True)
ffill_df.head()
#compare new and old dataframe
print(df.shape)
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
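For what it's worth, the same fill can usually be done in memory, without the CSV round-trip; a minimal sketch, assuming the same hypothetical frame with date as the index and region/type columns:
filled = (df.sort_values(by=['date', 'region', 'type'])
            .groupby(['region', 'type'], group_keys=False)
            .apply(lambda g: g.ffill().bfill()))
Here groupby does the per-(region, type) partitioning that the nested loops do by hand, and group_keys=False keeps the original index instead of prepending the group keys.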
I have a dataframe in pandas that looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name' : pd.Series(['Alex', 'John', 'Christopher', 'Dwayne']),
'value' : pd.Series([1., 2., 3., 4.]),
'new_value' : pd.Series([np.NaN, 4, 5, 10])})
df
Out[1]:
name value new_value
0 Alex 1.0 NaN
1 John 2.0 4.0
2 Christopher 3.0 5.0
3 Dwayne 4.0 10.0
Now I want to update the value column, if the new value is not NaN. I did my SO searching and I found this answer: Efficient way to update column value for subset of rows on Pandas DataFrame?, which led me to the following (correct) construction:
df.loc[~df.new_value.isnull(), 'value'] = df.new_value
My question is: how does this work? Why does the loc filter on the left-hand side also apply to the right-hand side of the assignment?
~df.new_value.isnull() inverts the boolean mask produced by isnull() (True becomes False and False becomes True).
That mask carries the DataFrame's index, so df.loc[mask, 'value'] selects just those row labels in the 'value' column, and the assignment aligns the right-hand Series by index: each selected row receives df.new_value at the same label, which is why the right-hand side behaves as if it were filtered too.
df = pd.DataFrame({"ali":[1,2,3,4,5,6,7],"mali":[4,2,6,4,5,6,10]})
df["kouki"] = df.loc[~(df.ali>3),"mali"]
# output
ali mali kouki
0 1 4 4.0
1 2 2 2.0
2 3 6 6.0
3 4 4 NaN
4 5 5 NaN
5 6 6 NaN
6 7 10 NaN
df = pd.DataFrame({"ali":[1,2,3,4,5,6,7],"mali":[4,2,6,4,5,6,10]})
df["kouki"] = df.loc[(df.ali>3),"mali"]
# output
ali mali kouki
0 1 4 NaN
1 2 2 NaN
2 3 6 NaN
3 4 4 4.0
4 5 5 5.0
5 6 6 6.0
6 7 10 10.0
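Applying the same alignment to the question's own frame, a minimal sketch: the mask selects rows 1-3, so only those labels receive values from new_value.
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Alex', 'John', 'Christopher', 'Dwayne'],
                   'value': [1., 2., 3., 4.],
                   'new_value': [np.nan, 4, 5, 10]})
df.loc[~df.new_value.isnull(), 'value'] = df.new_value
print(df)
#           name  value  new_value
# 0         Alex    1.0        NaN
# 1         John    4.0        4.0
# 2  Christopher    5.0        5.0
# 3       Dwayne   10.0       10.0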
I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace 'value' with a new Series built from shift + expanding + mean; the NaN in a group's first row is not replaced, because there are no previous values to average:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-06-01
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-09-01
8 1 1.5 2019-09-12
9 2 2.5 2019-09-13
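To see where the filled numbers come from, trace one group by hand; a minimal sketch using only category 2's values in date order:
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 3.0, np.nan])  # category 2's values, sorted by date
print(s.shift())                           # [NaN, 2.0, NaN, 3.0]
print(s.shift().expanding().mean())        # [NaN, 2.0, 2.0, 2.5]
shift pushes each value down one row, so every row only sees strictly earlier values, and expanding().mean() averages everything seen so far, which is why the last NaN of category 2 is filled with 2.5.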
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
I have a csv that I import as a dataframe with pandas. The columns are like:
Step1:A Step1:B Step1:C Step1:D Step2:A Step2:B Step2:D Step3:B Step3:D Step3:E
0 1 2 3 4 5 6 7 8 9
Where the step and parameter are separated by ':'. I want to reshape the dataframe to look like this:
Step1 Step2 Step3
A 0 4 nan
B 1 5 7
C 2 nan nan
D 3 6 8
E nan nan 9
Now, suppose I want to maintain the sequential order of the columns, so that I have this case:
Step2:A Step2:B Step2:C Step2:D Step1:A Step1:B Step1:D AStep3:B AStep3:D AStep3:E
0 1 2 3 4 5 6 7 8 9
I want the reshaped dataframe to preserve that column order:
Step2 Step1 AStep3
A 0 4 nan
B 1 5 7
C 2 nan nan
D 3 6 8
E nan nan 9
Try read_csv with delim_whitespace:
df = pd.read_csv('file.csv', delim_whitespace=True)
df.columns = df.columns.str.split(':', expand=True)
df.stack().reset_index(level=0, drop=True)
output:
Step1 Step2 Step3
A 0.0 4.0 NaN
B 1.0 5.0 7.0
C 2.0 NaN NaN
D 3.0 6.0 8.0
E NaN NaN 9.0
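Note that delim_whitespace has been deprecated in recent pandas releases; a whitespace regex passed to sep is the equivalent, assuming the same whitespace-separated file:
df = pd.read_csv('file.csv', sep=r'\s+')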
Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the NaN with a specific value in a second column, based on a condition in a third column?
For example:
>>> df=pd.DataFrame({'date':['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'], 'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
>>> df
a b date
0 1 4.0 01/10/2017
1 1 NaN 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 NaN 01/11/2017
5 2 7.0 02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
a b date
0 1 4.0 01/10/2017
1 1 6.0 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 5.0 01/11/2017
5 2 7.0 02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
This should work:
df['closest_date_by_a'] = df.groupby('a')['date'].apply(closest_date)
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].ffill().bfill()
Given a function (closest_date()), you need to apply that function by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest date column (closest_date_by_a) and perform your filling.
Ensure that your date column actually contains dates.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'date': ['01/10/2017', '02/09/2017', '02/10/2016', '01/10/2017', '01/11/2017', '02/10/2016'],
     'a': [1, 1, 1, 2, 2, 2], 'b': [4, np.nan, 6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 NaN 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 NaN 2017-01-11
5 2 7.0 2016-02-10
Use reindex with method='nearest' after dropna():
def fill_with_nearest(df):
    s = df.set_index('date').b
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = df.index
    return s
df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 4.0 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 5.0 2017-01-11
5 2 7.0 2016-02-10
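One caveat: reindex with a fill method needs a monotonic index to look up against, and a group's dates may not arrive sorted. A defensive variant of the same sketch, sorting each group by date before the nearest lookup:
def fill_with_nearest_sorted(g):
    g = g.sort_values('date')                          # monotonic dates for the nearest lookup
    s = g.set_index('date').b
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = g.index                                  # map filled values back to the original row labels
    return s
It is used exactly like fill_with_nearest above.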