Market Basket Analysis - python

I have the following pandas dataset of transactions, regarding a retail shop:
print(df)
product Date Assistant_name
product_1 2017-01-02 11:45:00 John
product_2 2017-01-02 11:45:00 John
product_3 2017-01-02 11:55:00 Mark
...
I would like to create the following dataset, for Market Basket Analysis:
product Date Assistant_name Invoice_number
product_1 2017-01-02 11:45:00 John 1
product_2 2017-01-02 11:45:00 John 1
product_3 2017-01-02 11:55:00 Mark 2
...
Briefly, if transactions share the same Assistant_name and Date, I assume they belong to the same invoice; each new (Date, Assistant_name) combination generates a new invoice number.

Simplest is to factorize the joined columns:
df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
print (df)
product Date Assistant_name Invoice
0 product_1 2017-01-02 11:45:00 John 1
1 product_2 2017-01-02 11:45:00 John 1
2 product_3 2017-01-02 11:55:00 Mark 2
If performance is important use pd.lib.fast_zip (note that pd.lib was deprecated and removed in later pandas versions; for modern pandas see the ngroup sketch after the timings):
df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
Timings:
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [178]: %%timeit
...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
...:
9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: %%timeit
...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
...:
11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [180]: %%timeit
...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
...:
6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
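A modern alternative that avoids both the string concatenation and private APIs is groupby(...).ngroup(), which numbers each unique (Date, Assistant_name) pair directly. A minimal sketch, assuming the same df as above:
df['Invoice'] = df.groupby(['Date', 'Assistant_name'], sort=False).ngroup() + 1
With sort=False the invoices are numbered in order of first appearance, matching the factorize output.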

Using pandas categories is one way:
df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
# product Date Assistant_name Invoice
# product_1 2017-01-02 11:45:00 John 1
# product_2 2017-01-02 11:45:00 John 1
# product_3 2017-01-02 11:55:00 Mark 2
The benefit of this method is that you can easily retrieve a dictionary of the mappings:
dict(enumerate(df['Invoice'].astype('category').cat.categories, 1))
# {1: (Timestamp('2017-01-02 11:45:00'), 'John'), 2: (Timestamp('2017-01-02 11:55:00'), 'Mark')}


Localize time zone based on column in pandas

I am trying to set timezone to a datetime column, based on another column containing the time zone.
Example data:
DATETIME VALUE TIME_ZONE
0 2021-05-01 00:00:00 1.00 Europe/Athens
1 2021-05-01 00:00:00 2.13 Europe/London
2 2021-05-01 00:00:00 5.13 Europe/London
3 2021-05-01 01:00:00 4.25 Europe/Dublin
4 2021-05-01 01:00:00 4.25 Europe/Paris
I am trying to assign a time zone to the DATETIME column, but using the tz_localize method, I cannot avoid using an apply call, which will be very slow on my large dataset. Is there some way to do this without using apply?
What I have now (which is slow):
df['DATETIME_WITH_TZ'] = df.apply(lambda row: row['DATETIME'].tz_localize(row['TIME_ZONE']), axis=1)
I'm not sure, but a list comprehension seems to be about 17x faster than apply in your case:
df["DATETIME_WITH_TZ"] = [dt.tz_localize(tz)
for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]
Another variant, with tz_convert:
df["DATETIME_WITH_TZ"] = [dt.tz_localize("UTC").tz_convert(tz)
                          for dt, tz in zip(df["DATETIME"], df["TIME_ZONE"])]
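Note that the two variants are not equivalent: the first interprets DATETIME as local wall time in each zone, while the second interprets it as UTC and converts (the output shown below corresponds to the second variant). A quick check:
pd.Timestamp('2021-05-01').tz_localize('Europe/Athens')                          # 2021-05-01 00:00:00+03:00
pd.Timestamp('2021-05-01').tz_localize('UTC').tz_convert('Europe/Athens')        # 2021-05-01 03:00:00+03:00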
Timing:
#%%timeit #listcomp1
47.4 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #listcomp2
25.7 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #apply
457 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
print(df)
DATETIME VALUE TIME_ZONE DATETIME_WITH_TZ
0 2021-05-01 00:00:00 1.00 Europe/Athens 2021-05-01 03:00:00+03:00
1 2021-05-01 00:00:00 2.13 Europe/London 2021-05-01 01:00:00+01:00
2 2021-05-01 00:00:00 5.13 Europe/London 2021-05-01 01:00:00+01:00
3 2021-05-01 01:00:00 4.25 Europe/Dublin 2021-05-01 02:00:00+01:00
4 2021-05-01 01:00:00 4.25 Europe/Paris 2021-05-01 03:00:00+02:00
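If the list comprehension is still too slow, a hedged alternative is to localize one timezone group at a time, so tz_localize runs once per unique zone rather than once per row. A minimal sketch, assuming the same DATETIME and TIME_ZONE columns; the result is an object-dtype column because it mixes zones:
out = pd.Series(index=df.index, dtype='object')
for tz, sub in df.groupby('TIME_ZONE'):
    out.loc[sub.index] = sub['DATETIME'].dt.tz_localize(tz)
df['DATETIME_WITH_TZ'] = out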

Subtracting a rolling window mean based on value from one column based on another without loops in Pandas

I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function because the window is not a fixed size in terms of dataframe indices. What I'm trying to do is this:
I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS. I want to add a third column that subtracts, from each WINDS value, the mean of all WINDS values within 12 hours of the time in the UT column. Currently, I do it like this:
rolsub = []
for i in df['UT']:
    df1 = df[(df['UT'] > (i - 12)) & (df['UT'] < (i + 12))]
    df2 = df[df['UT'] == i]
    rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
df['WIND_SUB'] = rolsub
This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!
If I understood correctly, you could create a fake DatetimeIndex to use for rolling.
Example data:
import pandas as pd
df = pd.DataFrame({'UT': [0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                   'WINDS': [1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
print(df)
UT WINDS
0 0.5 1
1 1.0 1
2 2.0 10
3 8.0 1
4 9.0 1
5 12.0 1
6 13.0 5
7 14.0 5
8 15.0 5
9 24.0 5
10 60.0 5
11 61.0 1
12 63.0 1
13 100.0 10
Code:
# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')
df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
print(df)
Which gives:
UT WINDS WINDS_SUB
dt
2022-05-11 00:30:00 0.5 1 -1.500000
2022-05-11 01:00:00 1.0 1 -1.500000
2022-05-11 02:00:00 2.0 10 7.142857
2022-05-11 08:00:00 8.0 1 -2.333333
2022-05-11 09:00:00 9.0 1 -2.333333
2022-05-11 12:00:00 12.0 1 -2.333333
2022-05-11 13:00:00 13.0 5 0.875000
2022-05-11 14:00:00 14.0 5 1.714286
2022-05-11 15:00:00 15.0 5 1.714286
2022-05-12 00:00:00 24.0 5 0.000000
2022-05-13 12:00:00 60.0 5 2.666667
2022-05-13 13:00:00 61.0 1 -1.333333
2022-05-13 15:00:00 63.0 1 -1.333333
2022-05-15 04:00:00 100.0 10 0.000000
The result on this small test set matches the output of your code: closed='neither' reproduces the strict > and < bounds, and the centered 24h window corresponds to ±12 hours around each point. This assumes UT represents hours from some fixed start time, which seems to be the case judging by your solution.
Runtime:
I tested it on the following df with 30,000 rows:
import numpy as np
df = pd.DataFrame({'UT': range(30000),
                   'WINDS': np.full(30000, 1)})
def loop(df):
    rolsub = []
    for i in df['UT']:
        df1 = df[(df['UT'] > (i - 12)) & (df['UT'] < (i + 12))]
        df2 = df[df['UT'] == i]
        rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
    df['WIND_SUB'] = rolsub
def vector(df):
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')
    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
    return df
# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)
# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)
So it's roughly 6,000 times faster.
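If you'd rather stay with plain numeric hours and skip the fake index entirely, a cumulative-sum plus searchsorted sketch reproduces the same strict-inequality windows. This assumes df['UT'] is sorted ascending with unique values (the loop version implicitly assumes uniqueness too, since float() needs a single row per UT):
import numpy as np
ut = df['UT'].to_numpy()
w = df['WINDS'].to_numpy(dtype=float)
csum = np.concatenate(([0.0], np.cumsum(w)))
lo = np.searchsorted(ut, ut - 12, side='right')  # first index with UT > t - 12
hi = np.searchsorted(ut, ut + 12, side='left')   # first index with UT >= t + 12
df['WIND_SUB'] = w - (csum[hi] - csum[lo]) / (hi - lo)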

How to extract the last value in the last timestamp of a day?

My dataframe has multiple values per day. I want to extract the value at the last timestamp of each day.
Date_Timestamp Values
2010-01-01 11:00:00 2.5
2010-01-01 15:00:00 7.1
2010-01-01 23:59:00 11.1
2010-02-01 08:00:00 12.5
2010-02-01 17:00:00 37.1
2010-02-01 23:53:00 71.1
output:
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
df['Date_Timestamp']=pd.to_datetime(df['Date_Timestamp'])
df.groupby(df['Date_Timestamp'].dt.date)['Values'].apply(lambda x: x.tail(1))
Use pandas.core.groupby.GroupBy.last
This is a vectorized method that is incredibly fast compared to .apply. Note that .last() takes the last entry of each group in row order, so it assumes the rows are already sorted by timestamp within each day (see the idxmax sketch at the end of this answer if they are not).
# given dataframe df with Date_Timestamp as a datetime
dfg = df.groupby(df.Date_Timestamp.dt.date).last().reset_index(drop=True)
# display(dfg)
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
timeit test
import pandas as pd
import numpy as np
from datetime import datetime
# test data with 2M rows
np.random.seed(365)
rows = 2000000
df = pd.DataFrame({'datetime': pd.bdate_range(datetime(2020, 1, 1), freq='h', periods=rows).tolist(),
                   'values': np.random.rand(rows) * 1000})
# display(df.head())
datetime values
2020-01-01 00:00:00 941.455743
2020-01-01 01:00:00 641.602705
2020-01-01 02:00:00 684.610467
2020-01-01 03:00:00 588.562066
2020-01-01 04:00:00 543.887219
%%timeit
df.groupby(df.datetime.dt.date).last().reset_index(drop=True)
[out]:
100k: 39.8 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
200k: 80.7 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
400k: 164 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2M: 791 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, with apply, is terrible
# I let it run for 1.5 hours and it didn't finish
# I reran the test for this is 100k and 200k
%%timeit
df.groupby(df.datetime.dt.date)['values'].apply(lambda x: x.tail(1))
[out]:
100k: 2.42 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
200k: 8.77 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
400k: 38.2 s # I only did %%time instead of %%timeit - it takes too long
800k: 2min 54s
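If the rows within a day are not guaranteed to be in timestamp order, a hedged order-independent variant selects the row with the maximum timestamp per day via idxmax (using the column names from the question):
df['Date_Timestamp'] = pd.to_datetime(df['Date_Timestamp'])
idx = df.groupby(df['Date_Timestamp'].dt.date)['Date_Timestamp'].idxmax()
out = df.loc[idx].reset_index(drop=True)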

Speeding up iterator operation in python

[pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
Is there any way to speed up this operation?
Basically, for each date range I am creating rows for all the dates in between.
Use DataFrame.itertuples:
L = [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
Or zip of both columns:
L = [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
If you want to join them together:
s = pd.concat(L, ignore_index=True)
Performance for 100 rows:
import numpy as np
np.random.seed(123)
def random_dates(start, end, n=100):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end)})
print (df)
In [155]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
33.5 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
30.3 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [157]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
25.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
24.3 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for 1000 rows:
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end, n=1000)})
In [159]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
333 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [160]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
314 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
243 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [162]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
246 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Instead of creating a pd.Series on each iteration, do:
[pd.date_range(row[1].START_DATE, row[1].END_DATE)
 for row in df[['START_DATE', 'END_DATE']].iterrows()]
And create a dataframe from the result. Here's an example (using pd.Timestamp, since pd.datetime was removed in newer pandas):
df = pd.DataFrame([
    {'start_date': pd.Timestamp(2019, 1, 1), 'end_date': pd.Timestamp(2019, 1, 10)},
    {'start_date': pd.Timestamp(2019, 1, 2), 'end_date': pd.Timestamp(2019, 1, 8)},
    {'start_date': pd.Timestamp(2019, 1, 6), 'end_date': pd.Timestamp(2019, 1, 14)}
])
dr = [pd.date_range(df.loc[i,'start_date'], df.loc[i,'end_date']) for i,_ in df.iterrows()]
pd.DataFrame(dr)
0 1 2 3 4 5 \
0 2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06
1 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06 2019-01-07
2 2019-01-06 2019-01-07 2019-01-08 2019-01-09 2019-01-10 2019-01-11
6 7 8 9
0 2019-01-07 2019-01-08 2019-01-09 2019-01-10
1 2019-01-08 NaT NaT NaT
2 2019-01-12 2019-01-13 2019-01-14 NaT
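If a long format (one row per date) is more useful than the wide NaT-padded frame above, a hedged sketch with DataFrame.explode (pandas >= 0.25; the ignore_index argument needs >= 1.1), using the same df:
df['date'] = [pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])]
long_df = df.explode('date', ignore_index=True)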

Extracting the first day of month of a datetime type column in pandas

I have the following dataframe:
user_id purchase_date
1 2015-01-23 14:05:21
2 2015-02-05 05:07:30
3 2015-02-18 17:08:51
4 2015-03-21 17:07:30
5 2015-03-11 18:32:56
6 2015-03-03 11:02:30
and purchase_date is a datetime64[ns] column. I need to add a new column, df['month'], that contains the first day of the month of the purchase date:
df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01
I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") in SQL. I have tried the following code:
df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))
It almost works, but it keeps the time component: 2015-01-01 14:05:21.
Simplest and fastest is to convert to a NumPy array with to_numpy and then cast:
df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Another solution uses floor with pd.offsets.MonthBegin(1), adding pd.offsets.MonthEnd(0) for correct output when the date is already the first day of the month:
df['month'] = (df['purchase_date'].dt.floor('d') +
pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
.dt.floor('d'))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Last solution is create month period by to_period:
df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01
1 2 2015-02-05 05:07:30 2015-02
2 3 2015-02-18 17:08:51 2015-02
3 4 2015-03-21 17:07:30 2015-03
4 5 2015-03-11 18:32:56 2015-03
5 6 2015-03-03 11:02:30 2015-03
... and then back to datetimes with to_timestamp, though it is a bit slower:
df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Since there are many solutions, here are timings (in pandas 1.2.3):
rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})
print (df.head())
In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit df['purchase_date'].dt.to_period('M')
9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can use date offset in conjunction with Series.dt.normalize:
In [60]: df['month'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)
In [61]: df
Out[61]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Or a much nicer solution from @BradSolomon:
In [95]: df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)
In [96]: df
Out[96]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
How about this easy solution?
As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.
df['date'] = df['purchase_date'].apply(lambda x: x.strftime('%Y-%m-01'))
print(df)
user_id purchase_date date
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Because we used strftime, now the date column is in object (string) type:
print(df.dtypes)
user_id int64
purchase_date datetime64[ns]
date object
dtype: object
Now if you want it to be in datetime64[ns], just use pd.to_datetime():
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
user_id int64
purchase_date datetime64[ns]
date datetime64[ns]
dtype: object
Most of the proposed solutions don't work for the first day of the month.
The following solution works for any day of the month:
df['month'] = df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
[EDIT]
Another, more readable, solution is:
from pandas.tseries.offsets import MonthBegin
df['month'] = df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
Be aware not to use:
df['month'] = df['purchase_date'].map(MonthBegin(normalize=True).rollback)
because that gives incorrect results for the first day due to a bug: https://github.com/pandas-dev/pandas/issues/32616
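A minimal sketch of the difference, assuming an affected pandas version:
from pandas.tseries.offsets import MonthBegin
ts = pd.Timestamp('2015-02-01 05:07:30')   # first of month, with a time part
MonthBegin(normalize=True).rollback(ts)    # Timestamp('2015-01-01 00:00:00') - wrong month
MonthBegin().rollback(ts.normalize())      # Timestamp('2015-02-01 00:00:00') - correct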
Try this:
df['month']=pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')
Out[187]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
To extract the first day of every month, you could write a little helper function that will also work if the provided date is already the first of month. The function looks like this:
def first_of_month(date):
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
You can apply this function to a pd.Series:
df['month'] = df['purchase_date'].apply(first_of_month)
With that you will get the month column as a Timestamp. If you need a specific format, you might convert it with the strftime() method.
df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')
For me df['purchase_date'] - pd.offsets.MonthBegin(1) didn't work (it fails for the first day of the month), so I'm subtracting the days of the month like this:
df['purchase_date'] - pd.to_timedelta(df['purchase_date'].dt.day - 1, unit='d')
@Eyal: This is what I did to get the first day of the month using pd.offsets.MonthBegin, handling the scenario where the day is already the first day of the month.
import datetime
from_date= pd.to_datetime('2018-12-01')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
from_date = pd.to_datetime('2018-12-05')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
Just adding my 2 cents, for the sake of completeness:
1 - transform purchase_date to date, instead of datetime. This will remove hour, minute, second, etc...
df['purchase_date'] = df['purchase_date'].dt.date
2 - apply the datetime replace, to use day 1 instead of the original:
df['purchase_date_begin'] = df['purchase_date'].apply(lambda x: x.replace(day=1))
This replace method comes from the standard datetime library:
from datetime import date
today = date.today()
month_start = today.replace(day=1)
and you can replace day, month, year, etc...
Or try this pandas chain, where purchase_date is the datetime column of your dataframe:
df['month_start'] = (pd.to_datetime(df.purchase_date)
                     .dt.to_period('M')
                     .dt.to_timestamp())
