Create new rows in a dataframe by range of dates - python

I need to expand a dataframe into daily rows, so that each day between a record's start date and end date becomes its own row in the new dataframe.
Input Dataframe:
A   B   Start                End
A1  B1  2021-05-15 00:00:00  2021-05-17 00:00:00
A1  B2  2021-05-30 00:00:00  2021-06-02 00:00:00
A2  B3  2021-05-10 00:00:00  2021-05-12 00:00:00
A2  B4  2021-06-02 00:00:00  2021-06-04 00:00:00
Expected Output:
A   B   Start                End
A1  B1  2021-05-15 00:00:00  2021-05-16 00:00:00
A1  B1  2021-05-16 00:00:00  2021-05-17 00:00:00
A1  B2  2021-05-30 00:00:00  2021-05-31 00:00:00
A1  B2  2021-05-31 00:00:00  2021-06-01 00:00:00
A1  B2  2021-06-01 00:00:00  2021-06-02 00:00:00
A2  B3  2021-05-10 00:00:00  2021-05-11 00:00:00
A2  B3  2021-05-11 00:00:00  2021-05-12 00:00:00
A2  B4  2021-06-02 00:00:00  2021-06-03 00:00:00
A2  B4  2021-06-03 00:00:00  2021-06-04 00:00:00

Use:
#convert columns to datetimes
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
#subtract values and convert to days
s = df["End"].sub(df["Start"]).dt.days
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas, add 1 day for End column
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Start'] = df["Start"].add(add)
df['End'] = df["Start"] + pd.Timedelta(1, 'd')
#default index
df = df.reset_index(drop=True)
print (df)
A B Start End
0 A1 B1 2021-05-15 2021-05-16
1 A1 B1 2021-05-16 2021-05-17
2 A1 B2 2021-05-30 2021-05-31
3 A1 B2 2021-05-31 2021-06-01
4 A1 B2 2021-06-01 2021-06-02
5 A2 B3 2021-05-10 2021-05-11
6 A2 B3 2021-05-11 2021-05-12
7 A2 B4 2021-06-02 2021-06-03
8 A2 B4 2021-06-03 2021-06-04
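One edge case worth noting (not covered above): rows where Start equals End give s = 0, so the repeat drops them entirely. A minimal sketch, assuming you want such rows kept as single one-day intervals, is to clip the day counts before repeating:
#assumption: zero-length intervals should survive as one-day rows
s = df["End"].sub(df["Start"]).dt.days.clip(lower=1)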
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [136]: %timeit jez(df)
16.9 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [137]: %timeit andreas(df)
888 ms ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#800 rows
df = pd.concat([df] * 200, ignore_index=True)
In [139]: %timeit jez(df)
6.25 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [140]: %timeit andreas(df)
170 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def andreas(df):
    df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
    return df.explode('d_range')

def jez(df):
    df["Start"] = pd.to_datetime(df["Start"])
    df["End"] = pd.to_datetime(df["End"])
    #subtract values and convert to days
    s = df["End"].sub(df["Start"]).dt.days
    #repeat index
    df = df.loc[df.index.repeat(s)].copy()
    #add days by timedeltas, add 1 day for End column
    add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
    df['Start'] = df["Start"].add(add)
    df['End'] = df["Start"] + pd.Timedelta(1, 'd')
    #default index
    return df.reset_index(drop=True)

You can create a list of dates and explode it:
df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
df = df.explode('d_range')
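Note that pd.date_range is inclusive of the End date, so this produces one more row per group than the expected output above. A hedged sketch to turn the exploded dates into consecutive Start/End pairs (reusing the d_range column from the snippet above):
#pair each date with the next date from the same original row
df['Start'] = df['d_range']
df['End'] = df.groupby(level=0)['d_range'].shift(-1)
df = df.dropna(subset=['End']).drop(columns='d_range').reset_index(drop=True)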

Related

Localize time zone based on column in pandas

I am trying to set timezone to a datetime column, based on another column containing the time zone.
Example data:
DATETIME VALUE TIME_ZONE
0 2021-05-01 00:00:00 1.00 Europe/Athens
1 2021-05-01 00:00:00 2.13 Europe/London
2 2021-05-01 00:00:00 5.13 Europe/London
3 2021-05-01 01:00:00 4.25 Europe/Dublin
4 2021-05-01 01:00:00 4.25 Europe/Paris
I am trying to assign a time zone to the DATETIME column, but using the tz_localize method, I cannot avoid using an apply call, which will be very slow on my large dataset. Is there some way to do this without using apply?
What I have now (which is slow):
df['DATETIME_WITH_TZ'] = df.apply(lambda row: row['DATETIME'].tz_localize(row['TIME_ZONE']), axis=1)
I'm not sure, but a list comprehension seems to be about 17x faster than apply in your case:
df["DATETIME_WITH_TZ"] = [dt.tz_localize(tz)
for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]
Another variant, with tz_convert :
df["DATETIME_WITH_TZ"] = [dt.tz_localize("UTC").tz_convert(tz)
for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]
Timing:
#%%timeit #listcomp1
47.4 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #listcomp2
25.7 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #apply
457 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
print(df)
DATETIME VALUE TIME_ZONE DATETIME_WITH_TZ
0 2021-05-01 00:00:00 1.00 Europe/Athens 2021-05-01 03:00:00+03:00
1 2021-05-01 00:00:00 2.13 Europe/London 2021-05-01 01:00:00+01:00
2 2021-05-01 00:00:00 5.13 Europe/London 2021-05-01 01:00:00+01:00
3 2021-05-01 01:00:00 4.25 Europe/Dublin 2021-05-01 02:00:00+01:00
4 2021-05-01 01:00:00 4.25 Europe/Paris 2021-05-01 03:00:00+02:00
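Another option, not from the original answers: when there are only a few distinct zones, you can localize once per zone via groupby instead of once per row. A minimal sketch, assuming the column names above; the result column is object dtype because the offsets differ between rows:
import pandas as pd

df = pd.DataFrame({
    "DATETIME": pd.to_datetime(["2021-05-01 00:00:00", "2021-05-01 00:00:00"]),
    "TIME_ZONE": ["Europe/Athens", "Europe/London"],
})
#localize each zone's rows in one vectorized call, then realign by index
parts = [g.dt.tz_localize(tz) for tz, g in df.groupby("TIME_ZONE")["DATETIME"]]
df["DATETIME_WITH_TZ"] = pd.concat(parts)
print(df)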

Subtracting a rolling window mean based on value from one column based on another without loops in Pandas

I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) because the window is not a fixed size in terms of dataframe indices. What I'm trying to do is this:
I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS, and I want to add a third column that subtracts from each WINDS value the mean of all WINDS values within 12 hours of the time in the UT column. Currently, I do it like this:
rolsub = []
for i in df['UT']:
    df1 = df[(df['UT'] > (i-12)) & (df['UT'] < (i+12))]
    df2 = df[df['UT'] == i]
    rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
df['WIND_SUB'] = rolsub
This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!
If I understood correctly, you could create a fake DatetimeIndex to use for rolling.
Example data:
import pandas as pd
df = pd.DataFrame({'UT': [0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                   'WINDS': [1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
print(df)
UT WINDS
0 0.5 1
1 1.0 1
2 2.0 10
3 8.0 1
4 9.0 1
5 12.0 1
6 13.0 5
7 14.0 5
8 15.0 5
9 24.0 5
10 60.0 5
11 61.0 1
12 63.0 1
13 100.0 10
Code:
# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')
df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
print(df)
Which gives:
UT WINDS WINDS_SUB
dt
2022-05-11 00:30:00 0.5 1 -1.500000
2022-05-11 01:00:00 1.0 1 -1.500000
2022-05-11 02:00:00 2.0 10 7.142857
2022-05-11 08:00:00 8.0 1 -2.333333
2022-05-11 09:00:00 9.0 1 -2.333333
2022-05-11 12:00:00 12.0 1 -2.333333
2022-05-11 13:00:00 13.0 5 0.875000
2022-05-11 14:00:00 14.0 5 1.714286
2022-05-11 15:00:00 15.0 5 1.714286
2022-05-12 00:00:00 24.0 5 0.000000
2022-05-13 12:00:00 60.0 5 2.666667
2022-05-13 13:00:00 61.0 1 -1.333333
2022-05-13 15:00:00 63.0 1 -1.333333
2022-05-15 04:00:00 100.0 10 0.000000
The result on this small test set matches the output of your code. This assumes UT is representing hours from a certain start timepoint, which seems to be the case by looking at your solution.
Runtime:
I tested it on the following df with 30,000 rows:
import numpy as np
df = pd.DataFrame({'UT': range(30000),
                   'WINDS': np.full(30000, 1)})
def loop(df):
    rolsub = []
    for i in df['UT']:
        df1 = df[(df['UT'] > (i-12)) & (df['UT'] < (i+12))]
        df2 = df[df['UT'] == i]
        rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
    df['WIND_SUB'] = rolsub

def vector(df):
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')
    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
    return df
# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)
# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)
So it's roughly 6,000 times faster (10.1 s vs 1.69 ms).
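If you'd rather not build a fake DatetimeIndex at all, a cumulative-sum plus searchsorted sketch reproduces the open-interval window of the loop version (my assumption: df is sorted by UT ascending):
import numpy as np
import pandas as pd

def searchsorted_sub(df):
    ut = df['UT'].to_numpy()
    w = df['WINDS'].to_numpy(dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(w)))
    #open interval (i-12, i+12), matching the strict > and < of the loop
    left = np.searchsorted(ut, ut - 12, side='right')
    right = np.searchsorted(ut, ut + 12, side='left')
    means = (csum[right] - csum[left]) / (right - left)
    return df.assign(WIND_SUB=w - means)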

Average 2 consecutive rows in Panda dataframes

I have a dataset that looks as follows:
userid time val1 val2 val3 val4
1 2010-6-1 0:15 12 16 17 11
1 2010-6-1 0:30 11.5 14 15.2 10
1 2010-6-1 0:45 12 14 15 10
1 2010-6-1 1:00 8 11 13 0
.................................
.................................
2 2010-6-1 0:15 14 16 17 11
2 2010-6-1 0:30 11 14 15.2 10
2 2010-6-1 0:45 11 14 15 10
2 2010-6-1 1:00 9 11 13 0
.................................
.................................
3 ...................................
.................................
.................................
I want to get the average of every two rows. Expected results would be
userid time val1 val2 val3 val4
1 2010-6-1 0:30 11.75 15 16.1 10.5
1 2010-6-1 1:00 10 12.5 14 5
..............................
..............................
2 2010-6-1 0:30 12.5 15 16.1 10.5
2 2010-6-1 1:00 10 12.5 14 5
.................................
.................................
3 ...................................
.................................
.................................
At the moment my approach is:
data = pd.read_csv("sample_dataset.csv")
i = 0
while i < len(data) - 1:
    x = data.iloc[i:i+2].mean()
    x['time'] = data.iloc[i+1]['time']
    data.iloc[i] = x
    i += 2
#keep only the even rows, which now hold the averages
data = data.iloc[::2].reset_index(drop=True)
But this is very inefficient. Can someone point me to a better approach to get the intended result? My dataset has more than 1,000,000 rows.
I am using resample:
df.set_index('time').resample('30Min',closed = 'right',label ='right').mean()
Out[293]:
val1 val2 val3 val4
time
2010-06-01 00:30:00 11.75 15.0 16.1 10.5
2010-06-01 01:00:00 10.00 12.5 14.0 5.0
Method 2
df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
Out[308]:
time val1 val2 val3 val4
0 2010-06-01 00:30:00 11.75 15.0 16.1 10.5
1 2010-06-01 01:00:00 10.00 12.5 14.0 5.0
Updated solution, also grouping by userid:
df.groupby([df.userid,np.arange(len(df))//2]).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean()).reset_index(drop=True)
This solution stays in pandas, and is far more performant than the groupby-agg solution:
>>> df = pd.DataFrame({"a": range(10),
...                    "b": range(0, 20, 2),
...                    "c": pd.date_range('2018-01-01', periods=10, freq='H')})
>>> df
a b c
0 0 0 2018-01-01 00:00:00
1 1 2 2018-01-01 01:00:00
2 2 4 2018-01-01 02:00:00
3 3 6 2018-01-01 03:00:00
4 4 8 2018-01-01 04:00:00
5 5 10 2018-01-01 05:00:00
6 6 12 2018-01-01 06:00:00
7 7 14 2018-01-01 07:00:00
8 8 16 2018-01-01 08:00:00
9 9 18 2018-01-01 09:00:00
>>> pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2,
...            df.iloc[::2, 2]], axis=1)
a b c
0 0.5 1.0 2018-01-01 00:00:00
2 2.5 5.0 2018-01-01 02:00:00
4 4.5 9.0 2018-01-01 04:00:00
6 6.5 13.0 2018-01-01 06:00:00
8 8.5 17.0 2018-01-01 08:00:00
Performance:
In [41]: n = 100000
In [42]: df = pd.DataFrame({"a":range(n), "b":range(0, n*2, 2), "c":pd.date_range('2018-01-01', periods= n, freq='S')})
In [44]: df.shape
Out[44]: (100000, 3)
In [45]: %timeit pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2, df.iloc[::2, 2]], axis=1)
2.21 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [46]: %timeit df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
7.9 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tried both of the answers above and both work, but Noah's was the fastest in my experience, so I marked it as the solution.
Here is my version of Noah's answer, with some explanation and edits to fit my dataset.
To use Noah's answer, the time column should be first or last (I may be wrong), so I moved the time column to the end:
col = data.columns.tolist()
col[1], col[10] = col[10], col[1]  # swap the time column to the end
data2 = data[col]
Then I did the concatenation. Here ::2 selects every second row, :10 selects columns 0 through 9, and then I append the time column, which is at index 10:
x = pd.concat([(data2.iloc[::2, :10] + data2.iloc[1::2, :10].values) / 2, data2.iloc[::2, 10]], axis=1)
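A variant of the same idea that avoids reordering columns, using select_dtypes to pick the numeric columns (my sketch, not from the original answers; assumes an even number of rows and a 'time' column):
num = data.select_dtypes('number')
avg = pd.DataFrame((num.iloc[::2].to_numpy() + num.iloc[1::2].to_numpy()) / 2,
                   columns=num.columns)
#take the later timestamp of each pair, as in the expected output
avg.insert(0, 'time', data['time'].iloc[1::2].to_numpy())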

Market Basket Analysis

I have the following pandas dataset of transactions, regarding a retail shop:
print(df)
product Date Assistant_name
product_1 2017-01-02 11:45:00 John
product_2 2017-01-02 11:45:00 John
product_3 2017-01-02 11:55:00 Mark
...
I would like to create the following dataset, for Market Basket Analysis:
product Date Assistant_name Invoice_number
product_1 2017-01-02 11:45:00 John 1
product_2 2017-01-02 11:45:00 John 1
product_3 2017-01-02 11:55:00 Mark 2
...
Briefly, transactions sharing the same Assistant_name and Date belong to the same invoice; each new name/date combination generates a new invoice number.
Simplest is factorize on the joined columns:
df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
print (df)
product Date Assistant_name Invoice
0 product_1 2017-01-02 11:45:00 John 1
1 product_2 2017-01-02 11:45:00 John 1
2 product_3 2017-01-02 11:55:00 Mark 2
If performance is important, use pd.lib.fast_zip (note: in newer pandas this helper is no longer public API; it lives at pandas._libs.lib.fast_zip):
df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
Timings:
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [178]: %%timeit
...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
...:
9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: %%timeit
...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
...:
11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [180]: %%timeit
...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
...:
6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using pandas categories is one way:
df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
# product Date Assistant_name Invoice
# product_1 2017-01-02 11:45:00 John 1
# product_2 2017-01-02 11:45:00 John 1
# product_3 2017-01-02 11:55:00 Mark 2
The benefit of this method is you can easily retrieve a dictionary of mappings:
dict(enumerate(df['Invoice'].astype('category').cat.categories, 1))
# {1: ('11:45:00', 'John'), 2: ('11:55:00', 'Mark')}
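In newer pandas you can also get the same labels directly with groupby(...).ngroup(); a one-line sketch assuming the columns above (sort=False numbers invoices in order of first appearance):
df['Invoice'] = df.groupby(['Date', 'Assistant_name'], sort=False).ngroup() + 1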

Extracting the first day of month of a datetime type column in pandas

I have the following dataframe:
user_id purchase_date
1 2015-01-23 14:05:21
2 2015-02-05 05:07:30
3 2015-02-18 17:08:51
4 2015-03-21 17:07:30
5 2015-03-11 18:32:56
6 2015-03-03 11:02:30
and purchase_date is a datetime64[ns] column. I need to add a new column df['month'] that contains the first day of the month of the purchase date:
df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01
I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") in SQL. I have tried the following code:
df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))
It almost works, but it keeps the time component: 2015-01-01 14:05:21.
Simplest and fastest is to convert to a numpy array with to_numpy and then cast:
df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Another solution uses floor and pd.offsets.MonthBegin(1), adding pd.offsets.MonthEnd(0) first for correct output when the date is already the first day of the month:
df['month'] = (df['purchase_date'].dt.floor('d') +
               pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
               .dt.floor('d'))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
The last solution is to create a month period with to_period:
df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01
1 2 2015-02-05 05:07:30 2015-02
2 3 2015-02-18 17:08:51 2015-02
3 4 2015-03-21 17:07:30 2015-03
4 5 2015-03-11 18:32:56 2015-03
5 6 2015-03-03 11:02:30 2015-03
... and then back to datetimes with to_timestamp, but it is a bit slower:
df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
There are many solutions, so here are timings (in pandas 1.2.3):
rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})
print (df.head())
In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit df['purchase_date'].dt.to_period('M')
9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can use date offset in conjunction with Series.dt.normalize:
In [60]: df['month'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)
In [61]: df
Out[61]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Or a much nicer solution from @BradSolomon:
In [95]: df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)
In [96]: df
Out[96]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
How about this easy solution?
As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.
df['date'] = df['purchase_date'].apply(lambda x: x.strftime('%Y-%m-01'))
print(df)
user_id purchase_date date
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Because we used strftime, now the date column is in object (string) type:
print(df.dtypes)
user_id int64
purchase_date datetime64[ns]
date object
dtype: object
Now if you want it to be in datetime64[ns], just use pd.to_datetime():
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
user_id int64
purchase_date datetime64[ns]
date datetime64[ns]
dtype: object
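The two steps can also be combined into a single expression (a minor variant of this answer, not from the original):
#format to the first of month, then parse straight back to datetime64[ns]
df['date'] = pd.to_datetime(df['purchase_date'].dt.strftime('%Y-%m-01'))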
Most of the proposed solutions don't work for the first day of the month.
The following solution works for any day of the month:
df['month'] = df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
[EDIT]
Another, more readable, solution is:
from pandas.tseries.offsets import MonthBegin
df['month'] = df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
Be aware not to use:
df['month'] = df['purchase_date'].map(MonthBegin(normalize=True).rollback)
because that gives incorrect results for the first day due to a bug: https://github.com/pandas-dev/pandas/issues/32616
Try this:
df['month'] = pd.to_datetime(df.purchase_date.astype(str).str[0:7] + '-01')
Out[187]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
To extract the first day of every month, you could write a little helper function that will also work if the provided date is already the first of month. The function looks like this:
def first_of_month(date):
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
You can apply this function to a pd.Series:
df['month'] = df['purchase_date'].apply(first_of_month)
With that you will get the month column as a Timestamp. If you need a specific format, you might convert it with the strftime() method.
df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')
For me df['purchase_date'] - pd.offsets.MonthBegin(1) didn't work (it fails for the first day of the month), so I subtract the days of the month instead:
df['purchase_date'] - pd.to_timedelta(df['purchase_date'].dt.day - 1, unit='d')
@Eyal: This is what I did to get the first day of the month using pd.offsets.MonthBegin, handling the scenario where the day is already the first day of the month.
import pandas as pd

from_date = pd.to_datetime('2018-12-01')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
from_date = pd.to_datetime('2018-12-05')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
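A vectorized version of the same is_month_start guard for a whole column (my sketch, not part of the original reply):
s = pd.to_datetime(pd.Series(['2018-12-01', '2018-12-05']))
#keep dates that are already a month start; roll the rest back to the 1st
first = s.where(s.dt.is_month_start, s - pd.offsets.MonthBegin(1)).dt.normalize()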
Just adding my 2 cents, for the sake of completeness:
1 - transform purchase_date to date, instead of datetime. This will remove hour, minute, second, etc...
df['purchase_date'] = df['purchase_date'].dt.date
2 - apply the datetime replace, to use day 1 instead of the original:
df['purchase_date_begin'] = df['purchase_date'].apply(lambda x: x.replace(day=1))
This replace method is available on the datetime library:
from datetime import date
today = date.today()
month_start = today.replace(day=1)
and you can replace day, month, year, etc...
Try this with pandas, where 'purchase_date' is the datetime column of your dataframe (here named sched_slim):
date['month_start'] = (pd.to_datetime(sched_slim.purchase_date)
                       .dt.to_period('M')
                       .dt.to_timestamp())
