pandas resample - issue with epoch and frequency - python

I'd like to get a time series with a fixed set of dates in the index. I thought that resample with freq and origin='epoch' would do the trick, but it seems I'm using this method the wrong way. Here's an example showing that origin='epoch' does not seem to work.
import pandas as pd
dates = pd.date_range('2022-01-01', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
pd.Series(vals,index = dates)
.resample(freq,
origin="epoch",
convention='end')
.sum()
.to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-03 00:00:00 |   3 |
| 2022-01-17 00:00:00 | 133 |
| 2022-01-31 00:00:00 | 329 |
| 2022-02-14 00:00:00 |  31 |
If I change the first date in the series to anything after 2022-01-03, I get a different result.
dates = pd.date_range('2022-01-04', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
pd.Series(vals,index = dates)
.resample(freq,
origin="epoch",
convention='end')
.sum()
.to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-10 00:00:00 |  21 |
| 2022-01-24 00:00:00 | 189 |
| 2022-02-07 00:00:00 | 196 |
I'd expect that with freq='2W-MON' and origin='epoch', both examples would end up with the same dates (so both should have either 2022-01-10 or 2022-01-03).
Is there an elegant way of forcing pandas to actually use origin='epoch'?
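One possible explanation, offered as a hedged sketch rather than a confirmed answer: origin appears to be applied only to fixed-width frequencies (such as 14 days), while anchored offsets such as '2W-MON' seem to derive their bin edges from the first timestamp in the data. Separately, convention is documented as applying to PeriodIndex only, so it should have no effect here. Assuming that is the cause, a fixed 14-day frequency with an explicit Monday as origin keeps the bin labels on one grid regardless of where the series starts. The origin date 2022-01-03 and the helper name biweekly_sum are illustrative choices, not part of the question.

```python
import pandas as pd

# Hypothetical anchor: any Monday works; it fixes the grid of bin edges.
ORIGIN = pd.Timestamp("2022-01-03")


def biweekly_sum(start, end="2022-02-01"):
    """Sum daily values into 14-day bins whose edges lie on the ORIGIN grid."""
    dates = pd.date_range(start, end, freq="1D")
    series = pd.Series(range(len(dates)), index=dates)
    return series.resample(
        pd.Timedelta(days=14),  # fixed-width frequency instead of '2W-MON'
        origin=ORIGIN,
        closed="right",         # each bin includes its right edge
        label="right",          # label each bin by its right edge
    ).sum()


first = biweekly_sum("2022-01-01")
second = biweekly_sum("2022-01-04")

# Both results are labelled on the same Monday grid.
print(first.index.tolist())
print(second.index.tolist())
```

With this setup, the series starting on 2022-01-04 simply lacks the earliest bin; the remaining labels coincide with those of the longer series.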

Related

group datetime column by 5 minutes increment only for time of day (ignoring date) and count

I have a dataframe with a timestamp column (of type datetime) and some other columns whose content doesn't matter. I'm trying to group by 5-minute intervals and count the rows, ignoring the date and only considering the time of day.
One can generate an example dataframe using this code:
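import numpy as np
import pandas as pd

def get_random_dates_df(
    n=10000,
    start=pd.to_datetime('2015-01-01'),
    period_duration_days=5,
    seed=None
):
    # Seed NumPy's RNG; fall back to 0 when no seed is given (from piR's answer)
    np.random.seed(0 if seed is None else seed)
    # End of the sampling window (kept for reference, not used below)
    end = start + pd.Timedelta(period_duration_days, 'd')
    n_seconds = int(period_duration_days * 3600 * 24)
    # Uniformly random offsets, in seconds, added to the start date
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()

df = get_random_dates_df()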
It would look like this:
   index                     timestamp
0      0 2015-01-03 17:51:27.433696604
1      1 2015-01-04 13:49:21.806272885
2      2 2015-01-04 00:19:53.778462950
3      3 2015-01-03 17:23:09.535054659
4      4 2015-01-03 02:50:18.873314407
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
# Select the count column explicitly so the timestamp column is not summed
res_df = gpd_df.groupby("time_of_day")[["count"]].sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
You could floor the timestamps to 5 minutes and group by the time-of-day portion:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
What about flooring the datetimes to 5min, extracting the time only and using value_counts:
out = (df['timestamp']
.dt.floor('5min')
.dt.time.value_counts(sort=False)
.sort_index()
)
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64
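As a further variation on the answers above, here is a hedged sketch that represents the time of day as a Timedelta rather than as datetime.time objects, so every step stays vectorized. It assumes timezone-naive timestamps; the sample frame is generated locally for illustration and is not the asker's data.

```python
import numpy as np
import pandas as pd

# Illustrative data: 10,000 random timestamps over five days (not the asker's frame).
rng = np.random.default_rng(0)
seconds = rng.integers(0, 5 * 24 * 3600, size=10_000)
df = pd.DataFrame(
    {"timestamp": pd.Timestamp("2015-01-01") + pd.to_timedelta(seconds, unit="s")}
)

# Time elapsed since midnight of the same day, as a Timedelta.
time_of_day = df["timestamp"] - df["timestamp"].dt.normalize()

# Floor to 5-minute buckets and count rows per bucket, ignoring the date.
counts = time_of_day.dt.floor("5min").value_counts().sort_index()
print(counts.head())
```

The result is indexed by Timedelta values from 0 days 00:00:00 up to at most 0 days 23:55:00, which sort and plot naturally.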

Elegant way to shift multiple date columns - Pandas

I have a dataframe as shown below:
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject by that subject's offset.
Though my code works (credit: SO), I am looking for a more elegant approach. As you can see, I am repeating almost the same line three times.
# The strings already carry their unit ("days"), so no unit argument is passed
df['offset_to_shift'] = pd.to_timedelta(df['offset'])
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect the output to contain a shifted version of each date column.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], 0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], 0)
Result:
# print(df)
   person_id     offset     date_1   dis_date   vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0         11  -131 days 2017-05-29 2017-05-29 2018-05-29       -131 days     2017-01-18       2017-01-18       2018-01-18
1         11    29 days 1997-01-21 1999-01-24 1994-01-27         29 days     1997-02-19       1999-02-22       1994-02-25
2         11   142 days 1989-07-27 1999-07-22 2011-07-29        142 days     1989-12-16       1999-12-11       2011-12-18
3         21    20 days 2013-01-01 2015-01-01 2018-01-01         20 days     2013-01-21       2015-01-21       2018-01-21
4         21  -200 days 2016-12-31 1991-12-31 2014-12-31       -200 days     2016-06-14       1991-06-14       2014-06-14
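The second argument to add in the snippets above is the axis. A small sketch of why it matters, using a two-row subset of the data with ISO-formatted dates (an illustrative simplification): by default, arithmetic between a DataFrame and a Series matches the Series index against the column labels, whereas axis=0 matches it against the rows, which is what a per-row offset needs.

```python
import pandas as pd

# Two-row subset of the question's data, with ISO dates for unambiguous parsing.
df = pd.DataFrame({
    "offset": ["-131 days", "29 days"],
    "date_1": pd.to_datetime(["2017-05-29", "1997-01-21"]),
    "dis_date": pd.to_datetime(["2017-05-29", "1999-01-24"]),
})
offset = pd.to_timedelta(df["offset"])
cols = ["date_1", "dis_date"]

# axis=0 aligns the Series with the rows, so row i of every column gets offset[i].
shifted = df[cols].add(offset, axis=0).add_prefix("shifted_")
print(shifted)
```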

How to use Pandas to get date_range from some timestamp?

I need to split a year into enumerated 20-minute chunks and then, for timestamps randomly distributed over the year, find the sequence number of the chunk each one falls into for further processing.
I tried to use pandas for this, but I can't find a way to index timestamp in date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice for doing this in Python? I can imagine finding the range "by hand" by converting the date_range to integers and comparing, but maybe there is a more elegant pandas/Python-style way to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last interval start that is not later than event_ts:
run = []  # interval starts that are <= event_ts
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
#2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
So event_ts belongs to the chunk at row 222 of df.
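The loop above can likely be avoided. As a hedged sketch using the question's original one-year range and event timestamp: because the chunks are evenly spaced, the sequence number follows from integer division, and a lookup on the sorted index gives the same answer. Variable names other than those in the question are illustrative.

```python
import pandas as pd

date_start = pd.Timestamp("2018-01-01")
# 365 days of 20-minute chunks plus the closing edge, as in the question.
index = pd.date_range(start=date_start, periods=365 * 72 + 1, freq="20min")
event_ts = pd.Timestamp("2018-10-14 02:17:43")

# Option 1: pure arithmetic, no lookup table needed.
chunk_by_division = (event_ts - date_start) // pd.Timedelta("20min")

# Option 2: position of the last index entry that is <= event_ts.
chunk_by_lookup = index.get_indexer([event_ts], method="ffill")[0]

print(chunk_by_division, chunk_by_lookup, index[chunk_by_lookup])
```

The arithmetic version only works for a regular grid; the lookup version also handles irregular but sorted chunk boundaries.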

How do I store the first use date in a dictionary using a for loop

I have a dataset of user IDs and all the times they used a particular pass. I need to find out how many days have passed since each of them first used the pass. I was thinking of running through the dataset, storing each user's first use in a dictionary, and subtracting it from today's date. I can't seem to get it to work.
Userid Start use Day
1712 2019-01-04 Friday
1712 2019-01-05 Saturday
9050 2019-01-04 Friday
9050 2019-01-04 Friday
9050 2019-01-06 Sunday
9409 2019-01-05 Saturday
9683 2019-05-20 Monday
8800 2019-05-17 Friday
8800 2019-05-17 Friday
This is part of the dataset. The date format is Y-m-d.
usedict={}
keys = df.user_id
values = df.start_date
for i in keys:
    if (usedict[i] == keys):
        continue
    else:
        usedict[i] = values[i]
prints(usedict)
user_id use_count days_used Ave Daily Trips register_date days_since_reg
12 42 23 1.826087 NaT NaT
17 28 13 2.153846 NaT NaT
114 54 24 2.250000 2019-02-04 107 days
169 31 17 1.823529 NaT NaT
1414 49 20 2.450000 NaT NaT
1712 76 34 2.235294 NaT NaT
2388 24 12 2.000000 NaT NaT
6150 10 5 2.000000 2019-02-05 106 days
You can achieve what you want with the following. I have used only 2 user ids from the example given by you, but the same will apply to all.
import pandas as pd
import datetime
df = pd.DataFrame([{'Userid':'1712','use_date':'2019-01-04'},
{'Userid':'1712','use_date':'2019-01-05'},
{'Userid':'9050','use_date':'2019-01-04'},
{'Userid':'9050','use_date':'2019-01-04'},
{'Userid':'9050','use_date':'2019-01-06'}])
df.use_date = pd.to_datetime(df.use_date).dt.date
group_df = df.sort_values(by='use_date').groupby('Userid', as_index=False).agg({'use_date':'first'}).rename(columns={'use_date':'first_use_date'})
group_df['diff_from_today'] = datetime.datetime.today().date() - group_df.first_use_date
The output is:
print(group_df)
Userid first_use_date diff_from_today
0 1712 2019-01-04 139 days
1 9050 2019-01-04 139 days
Check sort_values and groupby for more details.
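For the dictionary the question asks for, a compact variant is sketched below. It assumes the column names user_id and start_date from the question's own loop, and it pins "today" to a fixed date (2019-05-23, chosen to match the 139 days shown above) so the output is reproducible; in practice pd.Timestamp.today().normalize() could be used instead.

```python
import pandas as pd

# A few rows from the question; column names follow the asker's loop.
df = pd.DataFrame({
    "user_id": [1712, 1712, 9050, 9050, 9050, 9409],
    "start_date": pd.to_datetime([
        "2019-01-04", "2019-01-05", "2019-01-04",
        "2019-01-04", "2019-01-06", "2019-01-05",
    ]),
})

# Assumption: a fixed reference date keeps the example deterministic.
today = pd.Timestamp("2019-05-23")

# Earliest use per user, then whole days elapsed since that date.
first_use = df.groupby("user_id")["start_date"].min()
usedict = (today - first_use).dt.days.to_dict()
print(usedict)
```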
I am only looking at two columns, but you could find the minimum date for each id with groupby and then use apply to get the difference (I have computed the difference in days):
import pandas as pd
import datetime
user_id = [1712, 1712, 9050, 9050, 9050, 9409, 9683, 8800, 8800]
start = ['2019-01-04', '2019-01-05', '2019-01-04', '2019-01-04', '2019-01-06', '2019-01-05', '2019-05-20', '2019-05-17', '2019-05-17']
df = pd.DataFrame(list(zip(user_id, start)), columns = ['UserId', 'Start'])
df['Start']= pd.to_datetime(df['Start'])
# pd.np is no longer available in recent pandas; the 'min' aggregation name does the same job
df = df.groupby('UserId')['Start'].agg(['min'])
now = datetime.datetime.now()
df['days'] = df['min'].apply(lambda x: (now - x).days)
a_dict = pd.Series(df.days.values,index = df.index).to_dict()
print(a_dict)
References:
to_dict() method taken from @jeff
Output:

Add months to a date in Pandas

I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
Suppose you have a dataframe of the following format, where you have to add an integer number of months to a date column.
Start_Date    Months_to_add
2014-06-01    23
2014-06-01    4
2000-10-01    10
2016-07-01    3
2017-12-01    90
2019-01-01    2
In such a scenario, Zero's code or mattblack's code won't be useful. You have to apply a lambda function over the rows, where the function takes 2 arguments:
- A date to which the months need to be added
- A month value in integer format
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this, you can use the following snippet to add months to the Start_Date column. It uses progress_apply, which tqdm adds to pandas objects once tqdm.pandas() has been called. See this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code, starting from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date    Months_to_add    End_Date
2014-06-01    23               2016-05-01
2014-06-01    4                2014-10-01
2000-10-01    10               2001-08-01
2016-07-01    3                2016-10-01
2017-12-01    90               2025-06-01
2019-01-01    2                2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe the simplest and fastest way to solve this is to convert the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using the sample data created by @Aruparna Maity (with some of the days changed):
Start_Date    Months_to_add
2014-06-01    23
2014-06-20    4
2000-10-01    10
2016-07-05    3
2017-12-15    90
2019-01-01    2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, repeat the process with daily periods and add back the original day of the month:
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
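One edge case is worth checking with the day-preserving step above: when the start day does not exist in the target month. The sketch below, on two illustrative rows (the 2017-01-31 row is an added assumption, not from the original data), compares the period-based approach with a row-wise pd.DateOffset, which clamps to the last valid day of the month instead of spilling into the next one.

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2014-06-20", "2017-01-31"]),
    "Months_to_add": [4, 1],
})

# Period-based approach: shift whole months, then re-add the day of the month.
month_start = (
    df["Start_Date"].dt.to_period("M") + df["Months_to_add"]
).dt.to_timestamp()
df["period_based"] = (
    month_start.dt.to_period("D") + df["Start_Date"].dt.day - 1
).dt.to_timestamp()

# Row-wise DateOffset: clamps to the last valid day of the target month.
df["offset_based"] = [
    start + pd.DateOffset(months=int(months))
    for start, months in zip(df["Start_Date"], df["Months_to_add"])
]
print(df)
```

For 2017-01-31 plus one month, the period-based result lands on 2017-03-03, while DateOffset yields 2017-02-28; which of the two is wanted depends on the use case.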
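Another way is to use NumPy's timedelta64. Note that 'M' here is an average month length (about 30.44 days), not a calendar month, which is why the results below carry a time component: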
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]
