pandas fill missing row with dates and value from previous data - python

Let's say I have the following data.
import pandas as pd

dates = ['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols = ['ABC','ABC','ABC','DEF','DEF','DEF']
v = [1,3,5,7,9,10]
df = pd.DataFrame({'date': dates, 'g': symbols, 'v': v})
date g v
0 2020-12-01 ABC 1
1 2020-12-04 ABC 3
2 2020-12-05 ABC 5
3 2020-12-01 DEF 7
4 2020-12-04 DEF 9
5 2020-12-05 DEF 10
I'd like to fill in the missing dates with the previous value, grouped by the field 'g'.
For example, I want the following entries added in the above example:
2020-12-02 ABC 1
2020-12-03 ABC 1
2020-12-02 DEF 7
2020-12-03 DEF 7
how can I do this?

The answer is borrowed mostly from the answer linked below; the only change is filling with a sentinel value, then replacing that sentinel with nulls before the forward fill.
Original Answer Here
import numpy as np
import pandas as pd

dates = ['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols = ['ABC','ABC','ABC','DEF','DEF','DEF']
v = [1,3,5,7,9,10]
df = pd.DataFrame({'date': dates, 'g': symbols, 'v': v})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(
    ['date', 'g']
).unstack(
    fill_value=-999
).asfreq(
    'D', fill_value=-999
).stack().sort_index(level=1).reset_index()
df = df.replace(-999, np.nan).ffill()

Related

Add Missing Dates to Time Series ID's in Pandas

I have the following data frame:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[44551,'2020-10-01',1],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
How can I add the months (in which QTD is zero) to each ID? (Ideally I would like the columns BALANCE and CC to keep the previous value, for each ID, on the added rows, but this is not strictly necessary, as I am more interested in the QTD and VAL columns.)
I thought about maybe resampling the data by month for each ID in a separate data frame and then merging that data frame with the one above. Is this a good implementation? Is there a better way to achieve this result?
Should end up similar to this:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[66991,'2020-08-01',0],
[66991,'2020-09-01',0],
[66991,'2020-10-01',0],
[44551,'2020-10-01',1],
[44551,'2020-11-05',0],
[66991,'2020-11-01',0],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
You can generate a range of dates by ID using pd.date_range, then create a pd.MultiIndex so you can do a reindex:
df["DATE"] = pd.to_datetime(df["DATE"])
s = pd.MultiIndex.from_tuples([(i, x) for i, j in df.groupby("ID")
                               for x in pd.date_range(min(j["DATE"]), max(j["DATE"]), freq="MS")],
                              names=["ID", "DATE"])
df = df.set_index(["ID", "DATE"])
print(df.reindex(df.index.union(s), fill_value=0)
        .reset_index()
        .groupby(["ID", pd.Grouper(key="DATE", freq="M")], as_index=False)
        .apply(lambda i: i[i["QTD"].ne(0) | (len(i) == 1)])
        .droplevel(0))
ID DATE QTD
0 44551 2020-10-01 1
1 44551 2020-11-01 0
3 44551 2020-12-05 5
4 66991 2020-06-01 2
5 66991 2020-06-02 1
7 66991 2020-07-03 1
8 66991 2020-08-01 0
9 66991 2020-09-01 0
10 66991 2020-10-01 0
11 66991 2020-11-01 0
12 66991 2020-12-01 1
13 66991 2020-12-05 7
15 66991 2021-01-08 3
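If the reindex/MultiIndex machinery feels heavy, a more explicit variant (my own sketch, not the answer above) builds the missing month rows per ID and concatenates them onto the original frame:

```python
import pandas as pd

df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-07-03', 1],
                   [66991, '2020-10-05', 7],
                   [44551, '2020-10-01', 1],
                   [44551, '2020-12-05', 5]],
                  columns=['ID', 'DATE', 'QTD'])
df['DATE'] = pd.to_datetime(df['DATE'])

frames = [df]
for i, g in df.groupby('ID'):
    # candidate month starts spanning this ID's own date range
    months = pd.date_range(g['DATE'].min(), g['DATE'].max(), freq='MS')
    # keep only months with no existing activity for this ID
    missing = months[~months.to_period('M').isin(set(g['DATE'].dt.to_period('M')))]
    frames.append(pd.DataFrame({'ID': i, 'DATE': missing, 'QTD': 0}))

out = (pd.concat(frames, ignore_index=True)
         .sort_values(['ID', 'DATE'])
         .reset_index(drop=True))
```

Each ID gains zero-QTD rows only for the calendar months it skipped entirely, which matches the "months in which QTD is zero" requirement.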

Elegant way to shift multiple date columns - Pandas

I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit: SO), I am looking for a more elegant approach; you can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add with axis=0, along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
Or it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
Or we can directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
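As a minimal check of how `.add(..., axis=0)` broadcasts the offset Series row-wise across several date columns at once (a toy sketch with two columns):

```python
import pandas as pd

df = pd.DataFrame({
    'offset': ['-1 days', '2 days'],
    'date_1': pd.to_datetime(['2020-01-10', '2020-01-10']),
    'dis_date': pd.to_datetime(['2020-02-10', '2020-02-10']),
})
df['offset_to_shift'] = pd.to_timedelta(df['offset'])

cols = ['date_1', 'dis_date']
# axis=0 aligns the Series with the rows, shifting every column in one call
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
```

Row 0 moves both dates back one day and row 1 moves both forward two days, without any per-column repetition.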

How to create multiple additional columns from dataframe and add to the same dataframe

Trying to parse 'date' column into 'month', 'day', 'hour' and 'minute' and then add them as separate columns to the same dataframe:
import pandas as pd
d = {'date':[pd.Timestamp('2019-03-01 00:05:01'),
pd.Timestamp('2019-04-02 07:11:00'),
pd.Timestamp('2019-05-03 10:25:00')],
'foo': ['abc','def','jhk']
}
df1 = pd.DataFrame(d)
date foo
0 2019-03-01 00:05:01 abc
1 2019-04-02 07:11:00 def
2 2019-05-03 10:25:00 jhk
After extracting 'times':
times = df1['date'].apply(lambda date: (date.month, date.day, date.hour, date.minute))
I try to add them to the dataframe as separate columns:
df1['month'], df1['day'], df1['hour'], df1['minute'] = times
which results in an error:
ValueError Traceback (most recent call last)
<ipython-input-21-171174d71b13> in <module>
----> 1 df1['month'], df1['day'], df1['hour'], df1['minute'] = times
ValueError: not enough values to unpack (expected 4, got 3)
How to add 'times' as separate columns?
Looks like you want
df1['month'], df1['day'], df1['hour'], df1['minute'] = (df1.date.dt.month, df1.date.dt.day,
df1.date.dt.hour, df1.date.dt.minute)
print(df1)
date foo month day hour minute
0 2019-03-01 00:05:01 abc 3 1 0 5
1 2019-04-02 07:11:00 def 4 2 7 11
2 2019-05-03 10:25:00 jhk 5 3 10 25
Alternatively, use DataFrame.assign:
df1.assign(month=df1["date"].dt.month, day=df1["date"].dt.day, hour=df1["date"].dt.hour, minutes=df1["date"].dt.minute)
Output:
date foo month day hour minutes
0 2019-03-01 00:05:01 abc 3 1 0 5
1 2019-04-02 07:11:00 def 4 2 7 11
2 2019-05-03 10:25:00 jhk 5 3 10 25
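For completeness, the original apply-based attempt can also be made to work: `times` is a Series of one tuple per row, so unpacking it directly yields the 3 row tuples rather than 4 columns, hence "expected 4, got 3". Transposing with `zip(*...)` fixes that (a sketch of that fix):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-03-01 00:05:01',
                            '2019-04-02 07:11:00',
                            '2019-05-03 10:25:00']),
    'foo': ['abc', 'def', 'jhk'],
})

times = df1['date'].apply(lambda d: (d.month, d.day, d.hour, d.minute))
# zip(*...) turns three 4-tuples (one per row) into four 3-tuples (one per column)
df1['month'], df1['day'], df1['hour'], df1['minute'] = zip(*times)
```

That said, the `dt` accessor versions above are the idiomatic (and faster) route.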

How can I add parts of a column to a new pandas data frame?

So I have a pandas data frame of length 90, which isn't important.
Let's say I have:
df
A date
1 2012-01-01
4 2012-02-01
5 2012-03-01
7 2012-04-01
8 2012-05-01
9 2012-06-01
2 2012-07-01
1 2012-08-01
3 2012-09-01
2 2012-10-01
5 2012-11-01
9 2012-12-01
0 2013-01-01
6 2013-02-01
and I have created a new data frame
df_copy=df.copy()
index = range(0,3)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2020-01-01' , freq='MS')-pd.offsets.MonthBegin(1)
which should create a data frame like this
     A       date
0  NaN 2019-10-01
1  NaN 2019-11-01
2  NaN 2019-12-01
So I use the following code to get the values of A in my new data frame
df1['A'] = df1['A'].iloc[9:12]
And I want the outcome to be this
A date
2 2019-10-01
5 2019-11-01
9 2019-12-01
So I want the last 3 rows of the new data frame to be assigned the values at iloc positions 9:12 of the original data frame; the indexes are different in the two data frames, and so are the dates. Is there a way to do this? Because
df1['A'] = df1['A'].iloc[9:12]
doesn't seem to work
As far as I know, you can solve this by generating several new data frames:
df_copy = df.copy()
index = range(0, 1)
df1 = pd.DataFrame(index=index, columns=range(len(df_copy.columns)))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01', '2019-11-01', freq='MS') - pd.offsets.MonthBegin(1)
df1['A'] = df_copy['A'].iloc[9]
Then appending each one to your original data frame and repeating is a bit overwhelming, but it seems like the only solution I could come up with.
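A simpler route (my own sketch of what I believe the question is after, with the slice positions taken from the example) is to assign the raw values of the positional slice; `.values` strips the original index, so the assignment lands purely by position instead of being defeated by index alignment:

```python
import pandas as pd

# original frame from the question
df = pd.DataFrame({'A': [1, 4, 5, 7, 8, 9, 2, 1, 3, 2, 5, 9, 0, 6],
                   'date': pd.date_range('2012-01-01', periods=14, freq='MS')})

# new 3-row frame with its own dates
df1 = pd.DataFrame({'date': pd.date_range('2019-10-01', '2019-12-01', freq='MS')})
# .values drops the index of the slice, so the three values land positionally
df1['A'] = df['A'].iloc[9:12].values
```

This produces exactly the desired 2, 5, 9 column, with no intermediate one-row frames or appends.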

How to sum over all values between 2 repeating values in pandas?

I have a pandas dataframe with columns 'Date' and 'Skew' (a float). I want to average the skew values between every Tuesday and then store them in a list or dataframe. I tried using a lambda as given in this question: Pandas, groupby and summing over specific months. But that only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. Can you show how to do the same?
Here's an example with random data
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')
df.groupby((df.Date - min_date).dt.days // 7)['Skew'].mean()
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: compute each row's week number relative to the Tuesday of the first week, then group by that number and take each group's (i.e. each week's) mean.
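A shorter alternative with modern pandas (my sketch; it assumes "between Tuesdays" means Tuesday through the following Monday) anchors a weekly resample so each bin starts on a Tuesday:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Date': pd.date_range('2013-01-01', periods=100),
                   'Skew': 10 + rng.standard_normal(100)})

# 'W-MON' bins close on Monday, so each bin covers Tuesday..Monday
weekly = df.resample('W-MON', on='Date')['Skew'].mean()
```

Each entry of `weekly` is the mean skew for one Tuesday-to-Monday window, labeled by the Monday that closes it, with no manual dayofweek arithmetic.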
