Create Bi-weekly Variable Based On A Custom Start-date - python

I've done some light searching using "get biweekly variable in python" but haven't been able to find many useful posts, so I thought I'd post my question here.
I have a dataframe with tens of thousands of records covering an entire fiscal year. Each record has a datetime variable CHECKIN_DATE_TIME. I would like to create a bi-weekly variable beginning with the date June 30, 2019.
ID CHECKIN_DATE_TIME
1 2019-06-30 13:36:00
2 2019-06-30 14:26:00
3 2019-06-30 20:10:00
4 2019-06-30 21:27:00
....
51 2019-07-10 13:36:00
52 2019-07-10 10:26:00
53 2019-07-10 10:10:00
54 2019-07-10 23:27:00
....
I would like the new dataframe to look like this, where 6/30/2019 - 7/13/2019 would be week 1, 7/14/2019 - 7/27/2019 would be week 2, and so on until the end date of 6/28/2020. There will thus be 26 values in the Week variable, each representing a two-week time frame.
EDIT: I would also like the last day of each two-week range assigned to the week number, as in the Date column below.
ID CHECKIN_DATE_TIME Week Date
1 2019-06-30 13:36:00 1 7/13/2019
2 2019-06-30 14:26:00 1 7/13/2019
3 2019-06-30 20:10:00 1 7/13/2019
4 2019-06-30 21:27:00 1 7/13/2019
....
51 2019-07-20 13:36:00 2 7/27/2019
52 2019-07-20 10:26:00 2 7/27/2019
53 2019-07-20 10:10:00 2 7/27/2019
54 2019-07-20 23:27:00 2 7/27/2019
....

You can do so by determining the number of days between the check-in date and 2019-06-30 and then doing a floor division by 14.
df['CHECKIN_DATE_TIME'] = pd.to_datetime(df.CHECKIN_DATE_TIME)
start = pd.Timestamp(2019, 6, 30)  # pd.datetime is deprecated; use pd.Timestamp
days_elapsed = (df.CHECKIN_DATE_TIME - start).dt.days
df['week'] = days_elapsed // 14 + 1
# last day of the window: push each date forward by the days remaining in its 14-day block
df['last_week_day'] = (df.CHECKIN_DATE_TIME + pd.to_timedelta(13 - days_elapsed % 14, 'D')).dt.date
# note I've created my own test set.
ID CHECKIN_DATE_TIME week last_week_day
0 1 2019-06-30 13:36:00 1 2019-07-13
1 2 2019-07-10 10:36:00 1 2019-07-13
2 3 2019-07-12 02:36:00 1 2019-07-13
3 4 2019-07-18 18:36:00 2 2019-07-27
4 5 2019-07-30 11:36:00 3 2019-08-10
5 6 2019-08-01 20:36:00 3 2019-08-10
Edit: added last_week_day as requested in the comments. This is done by adding the required number of remaining days to the CHECKIN_DATE_TIME column, computed with the modulo operator %.
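For comparison, a minimal alternative sketch using pd.cut with a fixed 14-day grid (not the method above; it assumes the fiscal year starts 2019-06-30 as stated in the question, and that CHECKIN_DATE_TIME is already datetime):
import pandas as pd
# 27 edges give 26 bi-weekly, left-closed bins starting 2019-06-30 (sketch)
edges = pd.date_range('2019-06-30', periods=27, freq='14D')
df['week'] = pd.cut(df['CHECKIN_DATE_TIME'], bins=edges, right=False, labels=range(1, 27))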

Pandas' date_range function offers an easy and efficient way to generate a list of dates at weekly, bi-weekly, or monthly frequency:
import pandas as pd
from datetime import date,datetime,timedelta
date_rng = pd.date_range(start=date.today() - timedelta(weeks=53), end=date.today(), freq="2W-SAT")
for i in date_rng:
    print(i)
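Here "2W-SAT" anchors the two-week step on Saturdays. To line the grid up with the start date from the question instead (2019-06-30 was a Sunday), a small sketch:
# Bi-weekly Sundays spanning the fiscal year from the question (sketch)
fiscal_rng = pd.date_range(start="2019-06-30", end="2020-06-28", freq="2W-SUN")
print(fiscal_rng)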

Related

December/January Seasonal Mean

I am attempting to calculate the seasonal means for the winter months of DJF and DJ. I first tried to use Xarray's .groupby function:
ds.groupby('time.month').mean('time')
Then I realized that instead of grouping the previous year's December with the subsequent Jan./Feb., it was grouping all three months from the same year. I was then able to figure out how to solve for the DJF season by resampling and creating a function to select out the proper 3-month period:
def is_djf(month):
    return (month == 12)

ds.resample(time='QS-MAR').mean('time')
ds.sel(time=is_djf(ds['time.month']))
Unfortunately, I am still unsure how to solve for the Dec./Jan. season, since the resampling method I used offsets quarterly. Thank you for any and all help!
Use resample with QS-DEC.
Suppose this dataframe:
time val
0 2020-12-31 1
1 2021-01-31 1
2 2021-02-28 1
3 2021-03-31 2
4 2021-04-30 2
5 2021-05-31 2
6 2021-06-30 3
7 2021-07-31 3
8 2021-08-31 3
9 2021-09-30 4
10 2021-10-31 4
11 2021-11-30 4
12 2021-12-31 5
13 2022-01-31 5
14 2022-02-28 5
>>> df.set_index('time').resample('QS-DEC').mean()
val
time
2020-12-01 1.0
2021-03-01 2.0
2021-06-01 3.0
2021-09-01 4.0
2021-12-01 5.0
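The two-month Dec./Jan. season has no ready-made resample frequency, but a minimal sketch (assuming the same time and val columns as above) is to keep only the December and January rows and group each December with the following January:
# Label each December with the following year so it groups with the next January (sketch)
dj = df[df['time'].dt.month.isin([12, 1])].copy()
dj['season_year'] = dj['time'].dt.year + (dj['time'].dt.month == 12)
print(dj.groupby('season_year')['val'].mean())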

How do I classify or regroup a dataset based on time variation in Python

I need to assign a number to values within each hourly interval. How can I add a new column where each cell is grouped by hour? For instance, all transactions within 00:00:00 to 00:59:59 would be filled with 1, transactions within 01:00:00 to 01:59:59 with 2, and so on until 23:00:00 to 23:59:59, filled with 24.
Time_duration = df['period']
print (Time_duration)
0 23:59:56
1 23:59:56
2 23:59:55
3 23:59:53
4 23:59:52
...
74187 00:00:18
74188 00:00:09
74189 00:00:08
74190 00:00:03
74191 00:00:02
# This is the result I desire: each period labeled with its hourly group (1-24).
0 23:59:56 24
1 23:59:56 24
2 23:59:55 24
3 23:59:53 24
4 23:59:52 24
...
74187 00:00:18 1
74188 00:00:09 1
74189 00:00:08 1
74190 00:00:03 1
74191 00:00:02 1
df.sort_values(by=["period"])
timeStamp_list = (pd.to_datetime(list(df['period'])))
df['Hour'] =timeStamp_list.hour
try this code, this works for me.
You can use regular expressions and str.extract:
import pandas as pd

pattern = r'^(\d{1,2}):'  # capture the digits of the hour
df['hour'] = df['period'].str.extract(pattern, expand=False).astype('int') + 1  # cast to int so that you can add 1
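If the period values are durations rather than clock times, a minimal alternative sketch with pd.to_timedelta (assuming HH:MM:SS strings under one day):
# Parse as timedeltas and take the hour component, shifted to 1-24 (sketch)
df['hour'] = pd.to_timedelta(df['period']).dt.components.hours + 1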

Calculate mean based on time elapsed in Pandas

I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean across all IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT", Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get the group key 'Time Elapsed', then just group by it to get the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
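To match the numeric Time Elapsed values in the desired output, the timedelta key can be converted to minutes before grouping; a small sketch:
# Convert the timedelta group key to elapsed minutes (sketch)
df['Time Elapsed'] = df['Time Elapsed'].dt.total_seconds() / 60
print(df.groupby('Time Elapsed').Value.mean())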
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First, get the year, month, and day for each DateTime, since they all vary in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post). The counter is computed within (1) each year, (2) then each month, and then (3) each day. Since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes rather than a sequence of increasing integers).
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
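Note that grouping by (year, month, day) works here only because each ID's readings happen to fall on a distinct date. A sketch that counts within each ID directly, and starts at 0 like the desired output:
# Count rows within each ID and scale to 5-minute steps (sketch)
df['Time Elapsed'] = df.groupby('ID').cumcount() * 5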

Changing datetime column to integer number without loop

I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to have a column that contains the minutes, so in this case the result would be:
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14410
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without loop?
Yes, there is a special .dt accessor for timedelta series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
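For example, to get whole minutes as just described:
# Integer minutes via .astype(int)
df['minutes'] = (df['datetime'].dt.total_seconds() / 60).astype(int)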
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
If your datetime column is already of timedelta dtype, you can omit the explicit conversion and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')
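The same division also works with NumPy's unit constant, again assuming a timedelta dtype; a small equivalent sketch:
import numpy as np
# One minute as a NumPy timedelta; dividing by it yields float minutes (sketch)
df['minutes'] = df.datetime / np.timedelta64(1, 'm')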

Query same time value every day in Pandas timeseries

I would like to get the 07h00 value every day, from a multiday DataFrame that has 24 hours of minute data in it each day.
import numpy as np
import pandas as pd
aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods = 10000, freq = "1min")
aframe.head()
Out[174]:
0 1
2015-09-01 00:00:00 0 0
2015-09-01 00:01:00 1 2
2015-09-01 00:02:00 2 4
2015-09-01 00:03:00 3 6
2015-09-01 00:04:00 4 8
aframe.tail()
Out[175]:
0 1
2015-09-07 22:35:00 9995 19990
2015-09-07 22:36:00 9996 19992
2015-09-07 22:37:00 9997 19994
2015-09-07 22:38:00 9998 19996
2015-09-07 22:39:00 9999 19998
In this 10 000 row DataFrame spanning 7 days, how would I get the 7am value each day as efficiently as possible? Assume I might have to do this for very large tick databases so I value speed and low memory usage highly.
I know I can index with strings such as:
aframe.ix["2015-09-02 07:00:00"]
Out[176]:
0 1860
1 3720
Name: 2015-09-02 07:00:00, dtype: int64
But what I need is basically a wildcard style query for example
aframe.ix["* 07:00:00"]
You can use indexer_at_time:
>>> locs = aframe.index.indexer_at_time('7:00:00')
>>> aframe.iloc[locs]
0 1
2015-09-01 07:00:00 420 840
2015-09-02 07:00:00 1860 3720
2015-09-03 07:00:00 3300 6600
2015-09-04 07:00:00 4740 9480
2015-09-05 07:00:00 6180 12360
2015-09-06 07:00:00 7620 15240
2015-09-07 07:00:00 9060 18120
There's also indexer_between_time if you need select all indices that lie between two particular time of day.
Both of these methods return the integer locations of the desired values; the corresponding rows of the Series or DataFrame can be fetched with iloc, as shown above.
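For convenience, at_time performs the same time-of-day lookup and returns the matching rows directly:
# Select rows at a given time of day (returns rows, not integer positions)
aframe.at_time('07:00')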
