I have the following data frame:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[44551,'2020-10-01',1],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
How can I add the months (in which QTD is zero) to each ID? (Ideally I would like the columns BALANCE and CC to keep the previous value, for each ID, on the added rows, but this is not strictly necessary, as I am more interested in the QTD and VAL columns.)
I thought about resampling the data by month for each ID in a separate data frame and then merging that data frame into the one above. Is this a good implementation? Is there a better way to achieve this result?
Should end up similar to this:
df = pd.DataFrame([[66991,'2020-06-01',2],
[66991,'2020-06-02',1],
[66991,'2020-07-03',1],
[66991,'2020-08-01',0],
[66991,'2020-09-01',0],
[66991,'2020-10-01',0],
[44551,'2020-10-01',1],
[44551,'2020-11-05',0],
[66991,'2020-11-01',0],
[66991,'2020-12-05',7],
[44551,'2020-12-05',5],
[66991,'2020-12-01',1],
[66991,'2021-01-08',3]],columns=['ID','DATE','QTD'])
You can generate a range of dates per ID using pd.date_range, then build a pd.MultiIndex so you can reindex:
df["DATE"] = pd.to_datetime(df["DATE"])  # needed for date_range and pd.Grouper
s = pd.MultiIndex.from_tuples([(i, x) for i, j in df.groupby("ID")
                               for x in pd.date_range(min(j["DATE"]), max(j["DATE"]), freq="MS")],
                              names=["ID", "DATE"])
df = df.set_index(["ID", "DATE"])
print(df.reindex(df.index.union(s), fill_value=0)  # .union instead of the deprecated `|`
      .reset_index()
      .groupby(["ID", pd.Grouper(key="DATE", freq="M")], as_index=False)
      .apply(lambda i: i[i["QTD"].ne(0) | (len(i) == 1)])
      .droplevel(0))
ID DATE QTD
0 44551 2020-10-01 1
1 44551 2020-11-01 0
3 44551 2020-12-05 5
4 66991 2020-06-01 2
5 66991 2020-06-02 1
7 66991 2020-07-03 1
8 66991 2020-08-01 0
9 66991 2020-09-01 0
10 66991 2020-10-01 0
11 66991 2020-11-01 0
12 66991 2020-12-01 1
13 66991 2020-12-05 7
15 66991 2021-01-08 3
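If the MultiIndex machinery feels heavy, the same idea can be sketched more explicitly: build candidate zero-rows for every month in each ID's span, drop the months that already have activity, and concatenate. A minimal, self-contained sketch on an abbreviated version of the sample data:

```python
import pandas as pd

df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-07-03', 1],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-10-01', 1],
                   [44551, '2020-12-05', 5]], columns=['ID', 'DATE', 'QTD'])
df['DATE'] = pd.to_datetime(df['DATE'])

# One candidate zero-row per month start between each ID's first and last date.
spans = df.groupby('ID')['DATE'].agg(['min', 'max'])
filler = pd.concat(
    [pd.DataFrame({'ID': i,
                   'DATE': pd.date_range(row['min'], row['max'], freq='MS'),
                   'QTD': 0})
     for i, row in spans.iterrows()],
    ignore_index=True)

# Keep only the months with no real transaction for that ID.
seen = set(zip(df['ID'], df['DATE'].dt.to_period('M')))
filler = filler[[(i, d.to_period('M')) not in seen
                 for i, d in zip(filler['ID'], filler['DATE'])]]

out = (pd.concat([df, filler])
         .sort_values(['ID', 'DATE'])
         .reset_index(drop=True))
```

This trades the single reindex for an explicit loop over IDs, which is easier to step through but less vectorized.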
I need to build a new column by comparing each transaction date with the previous one, where the previous date must satisfy a special rule: I need to find repeat purchases within the past 3 months. I have no idea how to do this. Here is an example and my expected output.
transaction.csv:
code,transaction_datetime
1,2021-12-01
1,2022-01-24
1,2022-05-29
2,2021-11-20
2,2022-04-12
2,2022-06-02
3,2021-04-23
3,2022-04-22
expected output:
code,transaction_datetime,repeat_purchase_P3M
1,2021-12-01,no
1,2022-01-24,2021-12-01
1,2022-05-29,no
2,2021-11-20,no
2,2022-04-12,no
2,2022-06-02,2022-04-12
3,2021-04-23,no
3,2022-04-22,no
df = pd.read_csv('transaction.csv')
df.transaction_datetime = pd.to_datetime(df.transaction_datetime)
grouped = df.groupby('code')['transaction_datetime']
df['repeat_purchase_P3M'] = grouped.shift().dt.date.where(grouped.diff().dt.days < 90, 'no')
df
code transaction_datetime repeat_purchase_P3M
0 1 2021-12-01 no
1 1 2022-01-24 2021-12-01
2 1 2022-05-29 no
3 2 2021-11-20 no
4 2 2022-04-12 no
5 2 2022-06-02 2022-04-12
6 3 2021-04-23 no
7 3 2022-04-22 no
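The key idea is that within each code, shift() gives the previous purchase date and diff() the gap since it; where() keeps the shifted date only when that gap is under 90 days (a rough stand-in for "3 months"). A minimal reproduction on a subset of the data:

```python
import pandas as pd

df = pd.DataFrame({'code': [1, 1, 1, 2],
                   'transaction_datetime': pd.to_datetime(
                       ['2021-12-01', '2022-01-24', '2022-05-29', '2021-11-20'])})

grouped = df.groupby('code')['transaction_datetime']
# shift() -> previous date per code; diff() -> days since it.
# The first row of each group has diff() == NaT, and NaT < 90 is False -> 'no'.
df['repeat_purchase_P3M'] = (grouped.shift().dt.date
                             .where(grouped.diff().dt.days < 90, 'no'))
```

Note the resulting column mixes date objects and the string 'no', which is fine for display but awkward for further computation.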
Let's say I have the following data.
dates=['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols=['ABC','ABC','ABC','DEF','DEF','DEF']
v=[1,3,5,7,9,10]
df= pd.DataFrame({'date':dates, 'g':symbols, 'v':v})
date g v
0 2020-12-01 ABC 1
1 2020-12-04 ABC 3
2 2020-12-05 ABC 5
3 2020-12-01 DEF 7
4 2020-12-04 DEF 9
5 2020-12-05 DEF 10
I'd like to fill the missing dates with the previous value (grouped by field 'g').
For example, I want the following entries added in the above example:
2020-12-02 ABC 1
2020-12-03 ABC 1
2020-12-02 DEF 7
2020-12-03 DEF 7
how can I do this?
The answer is borrowed mostly from the following answer, except that it fills the gaps with a sentinel value (-999) and later replaces that sentinel with nulls so the forward fill can take over.
Original Answer Here
import numpy as np

dates = ['2020-12-01','2020-12-04','2020-12-05', '2020-12-01','2020-12-04','2020-12-05']
symbols = ['ABC','ABC','ABC','DEF','DEF','DEF']
v = [1,3,5,7,9,10]
df = pd.DataFrame({'date': dates, 'g': symbols, 'v': v})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(
    ['date', 'g']
).unstack(
    fill_value=-999
).asfreq(
    'D', fill_value=-999
).stack().sort_index(level=1).reset_index()
df = df.replace(-999, np.nan).ffill()
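The sentinel can be avoided entirely, assuming each (date, g) pair is unique: reindex each group to daily frequency, which introduces NaN for the missing days, and forward-fill directly. A sketch:

```python
import pandas as pd

dates = ['2020-12-01', '2020-12-04', '2020-12-05'] * 2
df = pd.DataFrame({'date': pd.to_datetime(dates),
                   'g': ['ABC'] * 3 + ['DEF'] * 3,
                   'v': [1, 3, 5, 7, 9, 10]})

# Per group: asfreq('D') inserts NaN rows for the missing days,
# which ffill() then fills with the previous value.
out = (df.set_index('date')
         .groupby('g')['v']
         .apply(lambda s: s.asfreq('D').ffill())
         .reset_index())
```

This sidesteps the risk of -999 colliding with a real value in the data.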
I have data that looks like this
date ticker x y
0 2018-01-31 ABC 1 5
1 2019-01-31 ABC 2 6
2 2018-01-31 XYZ 3 7
3 2019-01-31 XYZ 4 8
So it is a panel of yearly observations. I want to upsample to a monthly frequency and forward fill the new observations. So ABC would look like
date ticker x y
0 2018-01-31 ABC 1 5
1 2018-02-28 ABC 1 5
...
22 2019-11-30 ABC 2 6
23 2019-12-31 ABC 2 6
Notice that I want to fill through the last year, not just up until the last date.
Right now I am doing something like
newidx = df.groupby('ticker')['date'].apply(lambda x:
pd.Series(pd.date_range(x.min(),x.max()+YearEnd(1),freq='M'))).reset_index()
newidx.drop('level_1',axis=1,inplace=True)
df = pd.merge(newidx,df,on=['date','ticker'],how='left')
This is obviously a terrible way to do this. It's really slow, but it works. What is the proper way to handle this?
Your approach might be slow because it needs a groupby followed by a merge. Let's try another option with reindex, so you only need the groupby:
from pandas.tseries.offsets import YearEnd

(df.set_index('date')
.groupby('ticker')
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),x.index.max()+YearEnd(1),freq='M'),
method='ffill'))
.reset_index('ticker', drop=True)
.reset_index()
)
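As a self-contained check of the reindex pattern (with the YearEnd import spelled out and a two-row sample standing in for the real data):

```python
import pandas as pd
from pandas.tseries.offsets import YearEnd

df = pd.DataFrame({'date': pd.to_datetime(['2018-01-31', '2019-01-31']),
                   'ticker': ['ABC', 'ABC'],
                   'x': [1, 2], 'y': [5, 6]})

# Month-end range from each ticker's first date through the end of its
# last year; method='ffill' carries the yearly values onto the new rows.
out = (df.set_index('date')
         .groupby('ticker')
         .apply(lambda g: g.reindex(
             pd.date_range(g.index.min(), g.index.max() + YearEnd(1), freq='M'),
             method='ffill'))
         .reset_index('ticker', drop=True)
         .reset_index())
```

This yields 24 monthly rows for ABC, filled through 2019-12-31 as requested, not just to the last observed date.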
I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit: SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to look like the one shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
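The core trick in all three variants is broadcasting: add with axis=0 aligns the per-row Timedelta against every selected column at once. A small self-contained check on a two-row subset:

```python
import pandas as pd

df = pd.DataFrame({'person_id': [11, 21],
                   'offset': ['-131 days', '20 days'],
                   'date_1': pd.to_datetime(['2017-05-29', '2013-01-01']),
                   'dis_date': pd.to_datetime(['2017-05-29', '2015-01-01'])})
df['offset_to_shift'] = pd.to_timedelta(df['offset'])

cols = ['date_1', 'dis_date']
# axis=0: broadcast each row's timedelta down every date column.
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
```

Any number of date columns can be shifted this way by extending `cols`, with no per-column assignment needed.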
I have a dataframe like this; how can I sort it?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I am familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split the column first?
First convert the column to datetimes, then get the positions of the sorted values with Series.argsort, which is used to reorder the rows via DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print(df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print(pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
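Put together, the argsort pattern sorts the original string column without ever materializing a datetime column:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Oct20', 'Nov19', 'Jan19', 'Sep20', 'Dec20']})

# Parse '%b%y' ('Oct20' -> 2020-10-01), take the positions that would
# sort those dates, and reorder the untouched string column with iloc.
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
```

Add `.reset_index(drop=True)` at the end if the old index positions (2, 1, 3, 0, 4) are not wanted.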