Let's say I have the following data.
dates=['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols=['ABC','ABC','ABC','DEF','DEF','DEF']
v=[1,3,5,7,9,10]
df= pd.DataFrame({'date':dates, 'g':symbols, 'v':v})
date g v
0 2020-12-01 ABC 1
1 2020-12-04 ABC 3
2 2020-12-05 ABC 5
3 2020-12-01 DEF 7
4 2020-12-04 DEF 9
5 2020-12-05 DEF 10
I'd like to fill the missing dates with the previous value (grouped by field 'g').
For example, I want the following entries added in the above example:
2020-12-02 ABC 1
2020-12-03 ABC 1
2020-12-02 DEF 7
2020-12-03 DEF 7
how can I do this?
The answer is borrowed mostly from the following answer, except that the missing entries are first filled with a sentinel value (-999), which is then replaced with nulls before the forward fill.
Original Answer Here
import numpy as np
import pandas as pd

dates=['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols=['ABC','ABC','ABC','DEF','DEF','DEF']
v=[1,3,5,7,9,10]
df = pd.DataFrame({'date':dates, 'g':symbols, 'v':v})
df['date'] = pd.to_datetime(df['date'])
# Pivot symbols into columns, extend to daily frequency with a -999 sentinel,
# then stack back to long form.
df = df.set_index(
    ['date', 'g']
).unstack(
    fill_value=-999
).asfreq(
    'D', fill_value=-999
).stack().sort_index(level=1).reset_index()
# Turn the sentinel into NaN and forward-fill; assign the result back.
df = df.replace(-999, np.nan).ffill()
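An alternative that avoids the sentinel entirely (a sketch against the same sample data): upsample each group to daily frequency and forward-fill.

```python
import pandas as pd

dates = ['2020-12-01', '2020-12-04', '2020-12-05',
         '2020-12-01', '2020-12-04', '2020-12-05']
symbols = ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF']
v = [1, 3, 5, 7, 9, 10]
df = pd.DataFrame({'date': pd.to_datetime(dates), 'g': symbols, 'v': v})

# For each symbol, reindex its dates to daily frequency (new rows are NaN),
# then forward-fill the gaps with the previous value.
out = (df.set_index('date')
         .groupby('g')['v']
         .apply(lambda s: s.asfreq('D').ffill())
         .reset_index())
```

This produces one row per symbol per day between each symbol's first and last date, with the intermediate days carrying the previous value forward.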
Related
I have the following data frame:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[44551,'2020-10-01',1],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
How can I add the months (in which QTD is zero) to each ID? (Ideally I would like the columns BALANCE and CC to keep the previous value, for each ID, on the added rows, but this is not strictly necessary as I am more interested in the QTD and VAL columns.)
I thought about maybe resampling the data by month for each ID on a data frame and then merge that data frame to the one above. Is this a good implementation? Is there a better way to achieve this result?
Should end up similar to this:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[66991,'2020-08-01',0],
[66991,'2020-09-01',0],
[66991,'2020-10-01',0],
[44551,'2020-10-01',1],
[44551,'2020-11-05',0],
[66991,'2020-11-01',0],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
You can generate a range of dates by ID using pd.date_range, then create a pd.MultiIndex so you can do a reindex:
df["DATE"] = pd.to_datetime(df["DATE"])

s = pd.MultiIndex.from_tuples([(i, x) for i, j in df.groupby("ID")
                               for x in pd.date_range(min(j["DATE"]), max(j["DATE"]), freq="MS")],
                              names=["ID", "DATE"])

df = df.set_index(["ID", "DATE"])

# Reindex on the union of the real rows and the generated month starts,
# then drop generated rows for months that already contain real data.
print(df.reindex(df.index.union(s), fill_value=0)
        .reset_index()
        .groupby(["ID", pd.Grouper(key="DATE", freq="M")], as_index=False)
        .apply(lambda i: i[i["QTD"].ne(0) | (len(i) == 1)])
        .droplevel(0))
ID DATE QTD
0 44551 2020-10-01 1
1 44551 2020-11-01 0
3 44551 2020-12-05 5
4 66991 2020-06-01 2
5 66991 2020-06-02 1
7 66991 2020-07-03 1
8 66991 2020-08-01 0
9 66991 2020-09-01 0
10 66991 2020-10-01 0
11 66991 2020-11-01 0
12 66991 2020-12-01 1
13 66991 2020-12-05 7
15 66991 2021-01-08 3
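If it is acceptable to collapse multiple transactions within one month into a single monthly total (which differs slightly from the output above, where individual rows are kept), a shorter sketch is to resample each ID to month-start frequency and sum:

```python
import pandas as pd

df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-06-02', 1],
                   [66991, '2020-07-03', 1],
                   [44551, '2020-10-01', 1],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-12-05', 5],
                   [66991, '2020-12-01', 1],
                   [66991, '2021-01-08', 3]], columns=['ID', 'DATE', 'QTD'])
df['DATE'] = pd.to_datetime(df['DATE'])

# Resampling to month-start frequency inserts the missing months with sum 0.
monthly = (df.set_index('DATE')
             .groupby('ID')['QTD']
             .resample('MS')
             .sum()
             .reset_index())
```

Here months with no activity appear with QTD 0, and months with several rows (e.g. two December entries for 66991) are summed into one row.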
I have a dataframe as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit - SO), I am looking for a more elegant approach. As you can see, I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add with axis=0, along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
Or use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
Or directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
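If the date columns should not be listed by hand, one possibility (a sketch; it assumes the three columns have already been converted with pd.to_datetime as in the question) is to pick them up by dtype:

```python
import pandas as pd

df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
for c in ['date_1', 'dis_date', 'vis_date']:
    df[c] = pd.to_datetime(df[c])
df['offset_to_shift'] = pd.to_timedelta(df['offset'])

# select_dtypes finds the datetime columns; the timedelta column is excluded.
cols = df.select_dtypes(include='datetime').columns
df[['shifted_' + c for c in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
```

This way new date columns are shifted automatically without updating the cols list.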
Trying to parse the 'date' column into 'month', 'day', 'hour' and 'minute' and then add them as separate columns to the same dataframe:
import pandas as pd
d = {'date':[pd.Timestamp('2019-03-01 00:05:01'),
pd.Timestamp('2019-04-02 07:11:00'),
pd.Timestamp('2019-05-03 10:25:00')],
'foo': ['abc','def','jhk']
}
df1 = pd.DataFrame(d)
date foo
0 2019-03-01 00:05:01 abc
1 2019-04-02 07:11:00 def
2 2019-05-03 10:25:00 jhk
After extracting 'times':
times = df1['date'].apply(lambda date: (date.month, date.day, date.hour, date.minute))
I try to add them to the dataframe as separate columns:
df1['month'], df1['day'], df1['hour'], df1['minute'] = times
Which results in error:
ValueError Traceback (most recent call last)
<ipython-input-21-171174d71b13> in <module>
----> 1 df1['month'], df1['day'], df1['hour'], df1['minute'] = times
ValueError: not enough values to unpack (expected 4, got 3)
How to add 'times' as separate columns?
Looks like you want
df1['month'], df1['day'], df1['hour'], df1['minute'] = (df1.date.dt.month, df1.date.dt.day,
df1.date.dt.hour, df1.date.dt.minute)
print(df1)
date foo month day hour minute
0 2019-03-01 00:05:01 abc 3 1 0 5
1 2019-04-02 07:11:00 def 4 2 7 11
2 2019-05-03 10:25:00 jhk 5 3 10 25
Alternatively, use DataFrame.assign:
df1.assign(month=df1["date"].dt.month, day=df1["date"].dt.day, hour=df1["date"].dt.hour, minutes=df1["date"].dt.minute)
Output:
date foo month day hour minutes
0 2019-03-01 00:05:01 abc 3 1 0 5
1 2019-04-02 07:11:00 def 4 2 7 11
2 2019-05-03 10:25:00 jhk 5 3 10 25
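For completeness, the original tuple-based attempt can also be made to work: the unpacking failed because iterating times yields its 3 rows (one tuple per row), not 4 field-wise sequences, so the tuples need to be transposed with zip first.

```python
import pandas as pd

df1 = pd.DataFrame({'date': [pd.Timestamp('2019-03-01 00:05:01'),
                             pd.Timestamp('2019-04-02 07:11:00'),
                             pd.Timestamp('2019-05-03 10:25:00')],
                    'foo': ['abc', 'def', 'jhk']})

times = df1['date'].apply(lambda date: (date.month, date.day, date.hour, date.minute))

# zip(*times) transposes the 3 row tuples into 4 field-wise tuples,
# one per target column, so the 4-way unpacking now succeeds.
df1['month'], df1['day'], df1['hour'], df1['minute'] = zip(*times)
```

That said, the dt-accessor versions above avoid building the intermediate tuples altogether.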
So I have a pandas data frame of length 90, which isn't important.
Let's say I have:
df
A date
1 2012-01-01
4 2012-02-01
5 2012-03-01
7 2012-04-01
8 2012-05-01
9 2012-06-01
2 2012-07-01
1 2012-08-01
3 2012-09-01
2 2012-10-01
5 2012-11-01
9 2012-12-01
0 2013-01-01
6 2013-02-01
and I have created a new data frame
df_copy=df.copy()
index = range(0,3)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2020-01-01' , freq='MS')-pd.offsets.MonthBegin(1)
which should create a data frame like this
A date
na 2019-10-01
na 2019-11-01
na 2019-12-01
So I use the following code to get the values of A in my new data frame
df1['A'] = df1['A'].iloc[9:12]
And I want the outcome to be this
A date
2 2019-10-01
5 2019-11-01
9 2019-12-01
so I want the last 3 values to be assigned the values at iloc positions 9:12 from the original data frame; the indexes differ between the two data frames and so do the dates. Is there a way to do this? Because
df1['A'] = df1['A'].iloc[9:12]
doesn't seem to work
According to my knowledge, you can solve this by generating several new data frames:
df_copy=df.copy()
index = range(0,1)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2019-11-01' , freq='MS')-pd.offsets.MonthBegin(1)
df1['A'] = df1['A'].iloc[9]
Then appending to your original data frame and repeating is a bit overwhelming, but it seems like the only solution I could come up with.
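A simpler alternative (a sketch against the sample frame above): the assignment fails because pandas aligns on index labels, so positions 9:12 of the original df['A'] never line up with df1's index 0-2. Stripping the index with .to_numpy() sidesteps the alignment:

```python
import pandas as pd

# Original frame from the question.
df = pd.DataFrame({'A': [1, 4, 5, 7, 8, 9, 2, 1, 3, 2, 5, 9, 0, 6],
                   'date': pd.date_range('2012-01-01', periods=14, freq='MS')})

# New 3-row frame with the target dates and an empty A column.
df1 = pd.DataFrame({'A': pd.NA,
                    'date': pd.date_range('2019-10-01', '2019-12-01', freq='MS')})

# .to_numpy() drops the index, so the three values are assigned by position.
df1['A'] = df['A'].iloc[9:12].to_numpy()
```

This yields A = 2, 5, 9 for the three new dates, as in the desired output.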
I have a pandas dataframe with columns 'Date' and 'Skew' (a float). I want to average the values of the skew between every Tuesday and then store the result in a list or dataframe. I tried using a lambda as given in this question, Pandas, groupby and summing over specific months, but it only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. How can I do this?
Here's an example with random data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})

# Roll min_date back to the Tuesday of its week (dayofweek: Mon=0, Tue=1, ...).
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')

# Group by the number of whole weeks elapsed since that Tuesday.
df.groupby((df.Date - min_date).dt.days // 7)['Skew'].mean()
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: find the Tuesday of the first week, compute each row's week number relative to that Tuesday, group by that week number, and take each group's mean.
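On recent pandas versions the same Tuesday-to-Tuesday grouping can be written more directly with pd.Grouper (a sketch; note that the 'W-MON' frequency labels each week by its closing Monday, i.e. weeks that start on Tuesday):

```python
import pandas as pd

# Deterministic values instead of random ones, so the result is checkable.
# 2013-01-01 is a Tuesday, so the first week runs Tue 01-01 .. Mon 01-07.
df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': range(100)})

# One mean per Tuesday-to-Monday week, indexed by the week's closing Monday.
weekly = df.groupby(pd.Grouper(key='Date', freq='W-MON'))['Skew'].mean()
```

This replaces the manual dayofweek arithmetic with a built-in weekly anchor, and keeps the actual week-ending dates as the index instead of bare week numbers.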