I have a dataframe that looks like this:
Date Name Provider Task StartDateTime LastDateTime
2020-01-01 00:00:00 Bob PEM ED A 7a-4p 2020-01-01 07:00:00 2020-01-01 16:00:00
2020-01-02 00:00:00 Tom PEM ED C 10p-2a 2020-01-02 22:00:00 2020-01-03 02:00:00
I would like to list the hours between each person's StartDateTime and LastDateTime (both datetime64[ns]) and then create an updated dataframe to reflect those lists. So for example, the updated dataframe would look like this:
Name Date Hour
Bob 2020-01-01 7
Bob 2020-01-01 8
Bob 2020-01-01 9
...
Tom 2020-01-02 22
Tom 2020-01-02 23
Tom 2020-01-03 0
Tom 2020-01-03 1
...
I honestly do not have a solid idea where to start. I've found some articles that may provide a foundation, but I'm not sure how to adapt the code below to my problem, since I want the counts based on the row and hour values.
from datetime import date, timedelta

def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

start_dt = date(2015, 12, 20)
end_dt = date(2016, 1, 11)
for dt in daterange(start_dt, end_dt):
    print(dt.strftime("%Y-%m-%d"))
https://www.w3resource.com/python-exercises/date-time-exercise/python-date-time-exercise-50.php
Create the range of datetimes, then use explode:
df['Date'] = [pd.date_range(x, y, freq='H') for x, y in zip(df.StartDateTime, df.LastDateTime)]
s = df[['Date', 'Name']].explode('Date').reset_index(drop=True)
s['Hour'] = s.Date.dt.hour
s['Date'] = s.Date.dt.date
s.head()
Date Name Hour
0 2020-01-01 Bob 7
1 2020-01-01 Bob 8
2 2020-01-01 Bob 9
3 2020-01-01 Bob 10
4 2020-01-01 Bob 11
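Note that pd.date_range includes both endpoints by default, so Bob's 7a-4p shift yields hours 7 through 16 inclusive. If you want to drop the final hour instead, a minimal sketch (assuming pandas >= 1.4, where the inclusive parameter replaced the older closed):
df['Date'] = [pd.date_range(x, y, freq='H', inclusive='left') for x, y in zip(df.StartDateTime, df.LastDateTime)]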
Related
I have the following data frame where the column Hr shows hours of the day as int64. I'm trying to convert that into a time format, so that hour 1 would show up as '01:00'. I then want to add this to the Date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. the Hr column) is not really a datetime format to begin with. Google searches so far have been unproductive.
While I am still in the dark concerning the exact format of your Date column, I will assume the Date column is a string object and the Hr column is int64. To create the TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis=1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
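For larger frames, a vectorized alternative avoids the row-wise apply; a minimal sketch using the same Date and Hr columns:
df['TimeStamp'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Hr'], unit='h')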
I have a df in format:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
...
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
And I want to count the number of entries in each month and place the result in a new df, i.e. the number of 'start' entries for Jan, Feb, etc., to give me:
Month Entries
2020-01 3
...
2020-04 2
I am currently trying something like this, but it's not what I need:
df.index = pd.to_datetime(df['start'],format='%Y-%m-%d')
df.groupby(pd.Grouper(freq='M'))
df['start'].value_counts()
Use GroupBy.count with the Series.dt accessor:
In [1282]: df
Out[1282]:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
# Do this only when your `start` and `end` columns are object dtype. If they are already datetime, skip the two statements below.
In [1284]: df.start = pd.to_datetime(df.start)
In [1285]: df.end = pd.to_datetime(df.end)
In [1296]: df1 = df.groupby([df.start.dt.year, df.start.dt.month]).count().rename_axis(['year', 'month'])['start'].reset_index(name='Entries')
In [1297]: df1
Out[1297]:
year month Entries
0 2020 1 3
1 2020 4 2
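If you prefer a single Month column like in your desired output, a sketch using Series.dt.to_period (assuming start is already datetime):
df1 = df.groupby(df.start.dt.to_period('M')).size().reset_index(name='Entries')
df1 = df1.rename(columns={'start': 'Month'})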
I have a variable:
start_dt = 201901, which is basically Jan 2019.
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where, for month 0, the date is start_dt minus 1 month, and each subsequent month increments the date by 1 month.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 from the month column, add the start date converted to a month period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, convert the column to month offsets (subtracting 1) and add the start datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
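A third option is to build the dates directly with pd.date_range; a sketch assuming the month column is just the consecutive integers 0..n-1 as above:
df['date'] = pd.date_range(start_dt - pd.offsets.MonthBegin(1), periods=len(df), freq='MS')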
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt + relativedelta(months=+n)
                            for n in range(-1, number_of_rows - 1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, in each iteration of the list comprehension, the initial datetime is incremented by the corresponding number of months (starting at -1).
Example of what the df looks like:
customer order_datetime
a 01-03-2017 12:00:00 PM
b 01-04-2017 12:00:00 PM
c 01-07-2017 12:00:00 PM
a 01-08-2017 12:00:00 PM
b 01-09-2017 12:00:00 PM
a 01-11-2017 12:00:00 PM
There are two things I want to achieve, but I'm still in the learning process, so I'd really appreciate any help to guide me in the right direction:
Create a list of "time between orders" from which I can find the min, mean, and max
Find out whether the "time between orders" gets faster or slower, i.e. the time between order_3 and order_2 vs the time between order_2 and order_1
This example should set you in the right direction for your assignment.
First I'm creating a DataFrame similar to the one you show in the question:
import pandas as pd
import numpy as np
import datetime as dt
orders = pd.DataFrame({
    'client': np.random.randint(65, 70, size=15),
    'date': np.random.randint(0, 30, size=15)})
orders.client = orders.client.apply(chr)
orders.date = pd.to_datetime(orders.date, unit='D', origin=dt.date(2017, 1, 1))
# Sorting here is not necessary, just for visualization
orders.sort_values(['client', 'date'], inplace=True)
orders.reset_index(inplace=True, drop=True)
orders.head()
>>>>
client date
0 A 2017-01-27
1 A 2017-01-29
2 A 2017-01-30
3 B 2017-01-03
4 B 2017-01-13
The key to the solution is the line orders.groupby('client').date.diff() (if the dates were not already sorted within each group, you would first sort them, e.g. with orders.groupby('client').date.apply(pd.Series.sort_values), before diffing).
First we use groupby to group the orders using client as the key, then we select the date column only, and finally we use diff to compute the difference of each record with the preceding one (this is why the dates in each group must be sorted).
The rest of the code is just to visualize the result, i.e. renaming the Series you obtain and concatenating it with the initial DataFrame.
diff_df = pd.concat([
orders,
orders.groupby('client').date.diff().rename('diff')], axis=1)
diff_df.head(10)
>>>>
client date diff
0 A 2017-01-27 NaT
1 A 2017-01-29 2 days
2 A 2017-01-30 1 days
3 B 2017-01-03 NaT
4 B 2017-01-13 10 days
5 B 2017-01-18 5 days
6 B 2017-01-24 6 days
7 C 2017-01-01 NaT
8 C 2017-01-02 1 days
9 C 2017-01-03 1 days
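For your second point (whether the time between orders gets faster or slower), a minimal sketch is to diff the gaps again within each group; negative values mean orders are arriving faster than before:
diff_df['trend'] = diff_df.groupby('client')['diff'].diff()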
Once you have the time differences you can compute all kinds of in-group metrics you need.
First you can try pd.Series.describe:
diff_df.groupby('client')['diff'].describe()
>>>>
count mean std min \
client
A 1 5 days 00:00:00 NaT 5 days 00:00:00
B 1 12 days 00:00:00 NaT 12 days 00:00:00
C 3 4 days 00:00:00 1 days 17:34:09.189773 2 days 00:00:00
D 1 4 days 00:00:00 NaT 4 days 00:00:00
E 4 5 days 00:00:00 3 days 03:53:40.789838 2 days 00:00:00
25% 50% 75% max
client
A 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
B 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00
C 3 days 12:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
D 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00
E 2 days 18:00:00 4 days 12:00:00 6 days 18:00:00 9 days 00:00:00
If that is not enough you can define your own aggregations.
You will need a list of functions if you work on a single Series:
metrics = [pd.Series.count, pd.Series.nunique, pd.Series.min, pd.Series.max, pd.Series.mean]
diff_df.groupby('client')['diff'].aggregate(metrics)
>>>>
count nunique min max mean
client
A 1 1 5 days 5 days 5 days
B 1 1 12 days 12 days 12 days
C 3 2 2 days 5 days 4 days
D 1 1 4 days 4 days 4 days
E 4 4 2 days 9 days 5 days
Or a dictionary of {column -> function, column -> function_list} if you work on the whole DataFrame:
metrics = {
'date': [pd.Series.count, pd.Series.nunique],
'diff': [pd.Series.min, pd.Series.max, pd.Series.mean],
}
diff_df.groupby('client').aggregate(metrics)
>>>>
diff date
min max mean count nunique
client
A 5 days 5 days 5 days 2 2
B 12 days 12 days 12 days 2 2
C 2 days 5 days 4 days 4 4
D 4 days 4 days 4 days 2 2
E 2 days 9 days 5 days 5 5
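With pandas >= 0.25 you can also use named aggregation, which gives flat, readable output columns (the names below are arbitrary choices):
diff_df.groupby('client').agg(
    n_orders=('date', 'count'),
    min_gap=('diff', 'min'),
    max_gap=('diff', 'max'),
    mean_gap=('diff', 'mean'))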
I have been using datetimes from the datetime library in Python, and making them timezone aware with pytz. I have then been using them as dates in Pandas DataFrames and trying to use Pandas's apply function and the ".day", ".hour", ".minute" etc. methods of the datetimes to create columns with just the day, hour, or minute. Surprisingly, it gives the UTC values. Is there a way to return the local day, hour, or minute? Simply adding the offset is not good enough, because the offset to UTC changes with daylight savings time.
Many thanks!
Here is an example of what I am talking about:
import pandas as pd
import datetime as dt
import pytz
# Simply return the hour of a date
def get_hour(dt1):
    return dt1.hour
# Create a date column to segment by month
# Create the date list
PST = pytz.timezone('US/Pacific')
start = PST.localize(dt.datetime(2016, 1, 1))
actuals_dates = [start + dt.timedelta(hours=x) for x in range(8760)]
# Outside of this context, you can get the hour
print()
print('Hour at the start date:')
print(get_hour(start))
print()
# Add it to a pandas DataFrame as a column
shapes = pd.DataFrame()
shapes['actuals dates'] = actuals_dates
# create a column for the hour
shapes['actuals hour'] = shapes['actuals dates'].apply(get_hour)
# Print the first 24 hours
print(shapes.head(24))
Will return:
Hour at the start date:
0
actuals dates actuals hour
0 2016-01-01 00:00:00-08:00 8
1 2016-01-01 01:00:00-08:00 9
2 2016-01-01 02:00:00-08:00 10
3 2016-01-01 03:00:00-08:00 11
4 2016-01-01 04:00:00-08:00 12
5 2016-01-01 05:00:00-08:00 13
6 2016-01-01 06:00:00-08:00 14
7 2016-01-01 07:00:00-08:00 15
8 2016-01-01 08:00:00-08:00 16
9 2016-01-01 09:00:00-08:00 17
10 2016-01-01 10:00:00-08:00 18
11 2016-01-01 11:00:00-08:00 19
12 2016-01-01 12:00:00-08:00 20
13 2016-01-01 13:00:00-08:00 21
14 2016-01-01 14:00:00-08:00 22
15 2016-01-01 15:00:00-08:00 23
16 2016-01-01 16:00:00-08:00 0
17 2016-01-01 17:00:00-08:00 1
18 2016-01-01 18:00:00-08:00 2
19 2016-01-01 19:00:00-08:00 3
20 2016-01-01 20:00:00-08:00 4
21 2016-01-01 21:00:00-08:00 5
22 2016-01-01 22:00:00-08:00 6
23 2016-01-01 23:00:00-08:00 7
Using a list comprehension seems to do the trick:
shapes['hour'] = [ts.hour for ts in shapes['actuals dates']]
shapes.head()
actuals dates actuals hour hour
0 2016-01-01 00:00:00-08:00 8 0
1 2016-01-01 01:00:00-08:00 9 1
2 2016-01-01 02:00:00-08:00 10 2
3 2016-01-01 03:00:00-08:00 11 3
4 2016-01-01 04:00:00-08:00 12 4
Per the reminder from @Jeff, you can also use the dt accessor functions, e.g.:
>>> shapes['actuals dates'].dt.hour.head()
0 0
1 1
2 2
3 3
4 4
Name: actuals dates, dtype: int64
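If a column has been converted to UTC somewhere along the way, you can recover the local wall-clock fields with .dt.tz_convert, which handles daylight saving transitions correctly; a minimal sketch assuming a tz-aware column:
utc_dates = shapes['actuals dates'].dt.tz_convert('UTC')       # simulate a UTC column
local_hours = utc_dates.dt.tz_convert('US/Pacific').dt.hour    # local hours, DST-aware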