I have a dataframe that provides two integer columns with the Year and Week of the year:
import pandas as pd
import numpy as np
L1 = [43,44,51,2,5,12]
L2 = [2016,2016,2016,2017,2017,2017]
df = pd.DataFrame({"Week":L1,"Year":L2})
df
Out[72]:
Week Year
0 43 2016
1 44 2016
2 51 2016
3 2 2017
4 5 2017
5 12 2017
I need to create a datetime-object from these two numbers.
I tried this, but it throws an error:
df["DT"] = df.apply(lambda x: np.datetime64(x.Year,'Y') + np.timedelta64(x.Week,'W'),axis=1)
Then I tried this. It runs but gives the wrong result, ignoring the week completely:
df["S"] = df.Week.astype(str)+'-'+df.Year.astype(str)
df["DT"] = df["S"].apply(lambda x: pd.to_datetime(x,format='%W-%Y'))
df
Out[74]:
Week Year S DT
0 43 2016 43-2016 2016-01-01
1 44 2016 44-2016 2016-01-01
2 51 2016 51-2016 2016-01-01
3 2 2017 2-2017 2017-01-01
4 5 2017 5-2017 2017-01-01
5 12 2017 12-2017 2017-01-01
I'm really getting lost between Python's datetime, Numpy's datetime64, and pandas Timestamp, can you tell me how it's done correctly?
I'm using Python 3, if that is relevant in any way.
EDIT:
Starting with Python 3.8 the problem is easily solved with a newly introduced method on datetime.date objects: https://docs.python.org/3/library/datetime.html#datetime.date.fromisocalendar
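A minimal sketch of that approach, assuming the Week column holds ISO 8601 week numbers and that the Monday of each week is the date you want (weekday 1 in the call below is an illustrative choice, not something the question specifies):

```python
import datetime
import pandas as pd

df = pd.DataFrame({"Week": [43, 44, 51, 2, 5, 12],
                   "Year": [2016, 2016, 2016, 2017, 2017, 2017]})

# date.fromisocalendar(year, week, weekday) requires Python 3.8+.
# Assumption: weekday=1 (Monday) is the day we want for each week.
df["DT"] = pd.to_datetime([
    datetime.date.fromisocalendar(int(year), int(week), 1)
    for year, week in zip(df["Year"], df["Week"])
])
print(df)
```

The list comprehension runs at Python speed, so for very large frames a vectorized pd.to_datetime call with an ISO format string is likely to be faster.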
Try this:
In [19]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.Week.mul(7).astype(str) + ' days')
Out[19]:
0 2016-10-28
1 2016-11-04
2 2016-12-23
3 2017-01-15
4 2017-02-05
5 2017-03-26
dtype: datetime64[ns]
You mentioned that you initially have timestamps in seconds. In that case it's much easier to parse the dates directly from the UNIX epoch timestamp:
df['Date'] = pd.to_datetime(df['UNIX_Time'], unit='s')
Timing for 10M rows DF:
Setup:
In [26]: df = pd.DataFrame(pd.date_range('1970-01-01', freq='1T', periods=10**7), columns=['date'])
In [27]: df.shape
Out[27]: (10000000, 1)
In [28]: df['unix_ts'] = df['date'].astype(np.int64)//10**9
In [30]: df
Out[30]:
date unix_ts
0 1970-01-01 00:00:00 0
1 1970-01-01 00:01:00 60
2 1970-01-01 00:02:00 120
3 1970-01-01 00:03:00 180
4 1970-01-01 00:04:00 240
5 1970-01-01 00:05:00 300
6 1970-01-01 00:06:00 360
7 1970-01-01 00:07:00 420
8 1970-01-01 00:08:00 480
9 1970-01-01 00:09:00 540
... ... ...
9999990 1989-01-05 10:30:00 599999400
9999991 1989-01-05 10:31:00 599999460
9999992 1989-01-05 10:32:00 599999520
9999993 1989-01-05 10:33:00 599999580
9999994 1989-01-05 10:34:00 599999640
9999995 1989-01-05 10:35:00 599999700
9999996 1989-01-05 10:36:00 599999760
9999997 1989-01-05 10:37:00 599999820
9999998 1989-01-05 10:38:00 599999880
9999999 1989-01-05 10:39:00 599999940
[10000000 rows x 2 columns]
Check:
In [31]: pd.to_datetime(df.unix_ts, unit='s')
Out[31]:
0 1970-01-01 00:00:00
1 1970-01-01 00:01:00
2 1970-01-01 00:02:00
3 1970-01-01 00:03:00
4 1970-01-01 00:04:00
5 1970-01-01 00:05:00
6 1970-01-01 00:06:00
7 1970-01-01 00:07:00
8 1970-01-01 00:08:00
9 1970-01-01 00:09:00
...
9999990 1989-01-05 10:30:00
9999991 1989-01-05 10:31:00
9999992 1989-01-05 10:32:00
9999993 1989-01-05 10:33:00
9999994 1989-01-05 10:34:00
9999995 1989-01-05 10:35:00
9999996 1989-01-05 10:36:00
9999997 1989-01-05 10:37:00
9999998 1989-01-05 10:38:00
9999999 1989-01-05 10:39:00
Name: unix_ts, Length: 10000000, dtype: datetime64[ns]
Timing:
In [32]: %timeit pd.to_datetime(df.unix_ts, unit='s')
10 loops, best of 3: 156 ms per loop
Conclusion: I think 156 milliseconds for converting 10,000,000 rows is not that slow.
As @Gianmario Spacagna mentioned, for years such as 2018 and later use %V with %G (together with the ISO weekday %u):
L1 = [43,44,51,2,5,12,52,53,1,2,5,52]
L2 = [2016,2016,2016,2017,2017,2017,2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['new'] = pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-1') ,format='%V%G-%u')
print (df)
Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
There is something fishy going on with weeks starting from 2019. The ISO-8601 standard assigns the 31st December 2018 to the week 1 of year 2019. The other approaches based on:
pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-2') ,format='%W%Y-%w')
will give shifted results starting from 2019.
In order to be compliant with the ISO-8601 standard you would have to do the following:
import pandas as pd
import datetime
L1 = [52,53,1,2,5,52]
L2 = [2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['ISO'] = df['Year'].astype(str) + '-W' + df['Week'].astype(str) + '-1'
df['DT'] = df['ISO'].map(lambda x: datetime.datetime.strptime(x, "%G-W%V-%u"))
print(df)
It prints:
Week Year ISO DT
0 52 2018 2018-W52-1 2018-12-24
1 53 2018 2018-W53-1 2018-12-31
2 1 2019 2019-W1-1 2018-12-31
3 2 2019 2019-W2-1 2019-01-07
4 5 2019 2019-W5-1 2019-01-28
5 52 2019 2019-W52-1 2019-12-23
The week 53 of 2018 is ignored and mapped to the week 1 of 2019.
Please verify yourself on https://www.epochconverter.com/weeks/2019.
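To make the shift concrete, here is a small sketch using only the standard library. It parses "week 2 of 2019, Monday" once with %W and once with the ISO directives; the sample strings are made up for illustration.

```python
from datetime import datetime

# %W counts weeks from the first Monday of the calendar year;
# days before that Monday belong to week 0. %w uses 1 for Monday.
by_w = datetime.strptime("2019-2-1", "%Y-%W-%w")

# %V is the ISO 8601 week number. It has to be combined with the
# ISO year %G and a weekday directive such as %u (1 = Monday).
by_iso = datetime.strptime("2019-2-1", "%G-%V-%u")

print(by_w.date())    # Monday of week 2 under the %W convention
print(by_iso.date())  # Monday of ISO week 2

# For 2016 the two conventions happen to agree.
same_w = datetime.strptime("2016-43-1", "%Y-%W-%w")
same_iso = datetime.strptime("2016-43-1", "%G-%V-%u")
```

In 2019 the first Monday is 7 January, so %W treats that day as the start of week 1, while ISO 8601 already counts the week beginning 31 December 2018 as week 1.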
If you want to follow the ISO week date system, this is how it is defined:
Weeks start with Monday. Each week's year is the Gregorian year in
which the Thursday falls. The first week of the year, hence, always
contains 4 January. ISO week year numbering therefore slightly
deviates from the Gregorian for some days close to 1 January.
The following sample code generates a sequence of 60 dates, starting from Sunday 18 December 2016, and adds the appropriate columns.
It adds:
A "Date"
Week Day of the "Date"
Finds the Week Starting Monday of that "Date"
Finds the Year of the Week Starting Monday of that "Date"
Adds a Week Number (ISO)
Gets the Starting Monday Date, from Year and Week Number
Sample Code Below:
# Generate Some Dates
dft1 = pd.DataFrame(pd.date_range('2016-12-18', freq='D', periods=60))
dft1.columns = ['e_FullDate']
dft1['e_FullDateWeekDay'] = dft1.e_FullDate.dt.day_name().str.slice(0,3)
#Add a Week Start Date (Monday)
dft1['e_week_start'] = dft1['e_FullDate'] - pd.to_timedelta(dft1['e_FullDate'].dt.weekday,
unit='D')
dft1['e_week_startWeekDay'] = dft1.e_week_start.dt.day_name().str.slice(0,3)
#Add a Week Start Year
dft1['e_week_start_yr'] = dft1.e_week_start.dt.year
#Add the ISO Week Number of the Week Start Monday (Series.dt.week was removed in pandas 2.0)
dft1['e_week_no'] = dft1['e_week_start'].dt.isocalendar().week
#Add a Week Start generate from Week Number and Year
dft1['e_week_start_from_week_no'] = pd.to_datetime(dft1.e_week_no.astype(str)+
dft1.e_week_start_yr.astype(str).add('-1') ,format='%W%Y-%w')
dft1['e_week_start_from_week_noWeekDay'] = dft1.e_week_start_from_week_no.dt.day_name().str.slice(0,3)
with pd.option_context('display.max_rows', 999, 'display.max_columns', 0, 'display.max_colwidth', 9999):
display(dft1)
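As a cross-check on the round trip, the following sketch pairs the ISO week number with the ISO year (rather than the calendar year of the Monday) and rebuilds the week start with the ISO directives. The column names iso_year, iso_week and rebuilt are made up for this example, and it assumes a pandas version that provides Series.dt.isocalendar().

```python
import pandas as pd

dates = pd.DataFrame(
    {"e_FullDate": pd.date_range("2016-12-18", freq="D", periods=60)}
)

# Monday that starts the week of each date (weekday: Monday=0 ... Sunday=6).
dates["e_week_start"] = dates["e_FullDate"] - pd.to_timedelta(
    dates["e_FullDate"].dt.weekday, unit="D"
)

# ISO year and ISO week number belong together; near 1 January the
# ISO year can differ from the calendar year.
iso = dates["e_FullDate"].dt.isocalendar()
dates["iso_year"] = iso["year"]
dates["iso_week"] = iso["week"]

# Rebuild the Monday from "<ISO year>-W<ISO week>-1".
iso_strings = (
    dates["iso_year"].astype(str)
    + "-W"
    + dates["iso_week"].astype(str).str.zfill(2)
    + "-1"
)
dates["rebuilt"] = pd.to_datetime(iso_strings, format="%G-W%V-%u")

print(dates.head(20))
```

If the rebuilt column matches e_week_start on every row, the year and week pair is consistent with ISO 8601.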
I was trying to find the difference between a series of dates and a single date. For example, the series runs
from May 1 to June 1:
date = pd.DataFrame()
In [0]: date['test'] = pd.date_range("2021-05-01", "2021-06-01", freq = "D")
Out[0]: date
test
0 2021-05-01 00:00:00
1 2021-05-02 00:00:00
2 2021-05-03 00:00:00
3 2021-05-04 00:00:00
4 2021-05-05 00:00:00
5 2021-05-06 00:00:00
6 2021-05-07 00:00:00
7 2021-05-08 00:00:00
8 2021-05-09 00:00:00
9 2021-05-10 00:00:00
In[1]
date['test'] = date['test'].dt.date
Out[1]:
test
0 2021-05-01
1 2021-05-02
2 2021-05-03
3 2021-05-04
4 2021-05-05
5 2021-05-06
6 2021-05-07
7 2021-05-08
8 2021-05-09
9 2021-05-10
In[2]:date['base'] = dt.strptime("2021-05-01",'%Y-%m-%d')
Out[2]:
0 2021-05-01 00:00:00
1 2021-05-01 00:00:00
2 2021-05-01 00:00:00
3 2021-05-01 00:00:00
4 2021-05-01 00:00:00
5 2021-05-01 00:00:00
6 2021-05-01 00:00:00
7 2021-05-01 00:00:00
8 2021-05-01 00:00:00
9 2021-05-01 00:00:00
In[3]:date['base'] = date['base'].dt.date
Out[3]:
base
0 2021-05-01
1 2021-05-01
2 2021-05-01
3 2021-05-01
4 2021-05-01
5 2021-05-01
6 2021-05-01
7 2021-05-01
8 2021-05-01
9 2021-05-01
In[4]:date['test']-date['base']
Out[4]:
diff
0 0 days 00:00:00.000000000
1 1 days 00:00:00.000000000
2 2 days 00:00:00.000000000
3 3 days 00:00:00.000000000
4 4 days 00:00:00.000000000
5 5 days 00:00:00.000000000
6 6 days 00:00:00.000000000
7 7 days 00:00:00.000000000
8 8 days 00:00:00.000000000
9 9 days 00:00:00.000000000
10 10 days 00:00:00.000000000
This is the only thing I could get. I don't want anything other than the plain numbers 1-10, because I need them for further numerical calculations, but I can't get rid of the rest. Also, how can I construct a time series that shows just the date, without the hours, minutes and seconds after it? I don't want to call .dt.date manually on all of them, and it sometimes messes things up.
You don't need to create a column base for this, simply do:
>>> (date['test'] - pd.to_datetime("2021-05-01", format='%Y-%m-%d')).dt.days
0 0
1 1
2 2
3 3
4 4
...
27 27
28 28
29 29
30 30
31 31
Name: test, dtype: int64
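If the column has already been turned into plain datetime.date objects with .dt.date, as in the question, it has dtype object. Here is a sketch of one way to get integer day offsets from that state; it assumes the same date range as the question, and diff_days is a made-up column name.

```python
import pandas as pd

date = pd.DataFrame(
    {"test": pd.date_range("2021-05-01", "2021-06-01", freq="D")}
)

# Reproduce the question's state: datetime.date objects, dtype 'object'.
date["test"] = date["test"].dt.date

# Convert back to a datetime64 column, subtract the base date,
# and keep only the whole-day component as integers.
base = pd.Timestamp("2021-05-01")
date["diff_days"] = (pd.to_datetime(date["test"]) - base).dt.days

print(date.head())
```

Keeping the column as datetime64 instead of calling .dt.date usually avoids the problem: pandas typically prints only the date part when every value in the column is at midnight.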
You can first convert the timestamps to epoch seconds (by default pandas stores datetime64[ns] values internally as integer nanoseconds since the UNIX epoch)
Using pandas datetime to unix timestamp seconds
import pandas as pd
# start df with date column
df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
# create a column of epoch seconds (whole seconds since 1970-01-01)
df["ts"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
>>> df
date ts
0 2021-05-01 1619827200
1 2021-05-02 1619913600
2 2021-05-03 1620000000
3 2021-05-04 1620086400
...
31 2021-06-01 1622505600
This will allow you to do integer math before converting back
>>> df["days"] = (df["ts"] - min(df["ts"])) // (60*60*24) # 1 day in seconds
>>> df
date ts days
0 2021-05-01 1619827200 0
1 2021-05-02 1619913600 1
2 2021-05-03 1620000000 2
3 2021-05-04 1620086400 3
...
31 2021-06-01 1622505600 31
Alternatively, with a naive day-based series, you can use the index as the day offset (as that's how the DataFrame was generated)!
>>> import pandas as pd
>>> df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
>>> df["days"] = df.index
>>> df
date days
0 2021-05-01 0
1 2021-05-02 1
2 2021-05-03 2
3 2021-05-04 3
...
31 2021-06-01 31
I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go. First I create a sample dataset and copy the date column to the column name you use, so you can just copy the code. pandas has extensive time-series functionality, so you don't actually need to import datetime; the pandas time-series documentation has a lot more information about it.
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['month'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year month weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 2018 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 2018 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 2018 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 2018 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 2018 1 0 1 4
... ... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 2018 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 2018 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 2018 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 2018 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 2018 4 6 1 0
EDIT: Since this is not working (as stated in the comments on this answer), I believe your data is not in the format pandas expects. Try this before applying anything:
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')
If this format is recognized properly (note %Y for the four-digit year, and swap %d and %m if your dates are month-first), then you should have no trouble using dt.year, dt.month, dt.hour, dt.day and dt.weekday.
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
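# Pass the format explicitly so that day-first strings are never read as month-first.
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')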
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019
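Putting the pieces together for strings shaped like the ones in the question, here is a sketch that parses with an explicit format and appends all five fields at once. The two sample rows are made up, and the month-first order is an assumption; the question's sample (1/1/2018) does not show which order applies.

```python
import pandas as pd

# Two made-up rows in the same shape as the question's pickup column.
df = pd.DataFrame(
    {"lpep_pickup_datetime": ["1/1/2018 0:18:50", "1/2/2018 13:05:00"]}
)

# Assumption: month comes first. Use '%d/%m/%Y %H:%M:%S' for day-first data.
pickup = pd.to_datetime(df["lpep_pickup_datetime"], format="%m/%d/%Y %H:%M:%S")

df = df.assign(
    year=pickup.dt.year,
    month=pickup.dt.month,
    day=pickup.dt.day,
    weekday=pickup.dt.weekday,  # Monday=0 ... Sunday=6
    hour=pickup.dt.hour,
)
print(df)
```

Using assign keeps the original string column untouched, which can help when the parsed values need to be checked against the raw input.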
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3) and so on for Time Elapsed=15,20,25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get each ID's first DateTime, subtract it to build the 'Time Elapsed' key, then group by that key and take the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
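If you want the elapsed time as plain minutes (0.0, 5.0, 10.0) as in the desired output, rather than as Timedelta values, one possible variation is sketched below. It rebuilds the input data from the question and assumes the rows of each ID are already sorted by time, so that 'first' is the earliest timestamp.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "DateTime": pd.to_datetime([
        "2018-01-01 15:00:00", "2018-01-01 15:05:00", "2018-01-01 15:10:00",
        "2018-02-02 13:15:00", "2018-02-02 13:20:00", "2018-02-02 13:25:00",
        "2019-03-03 05:05:00", "2019-03-03 05:10:00", "2019-03-03 05:15:00",
    ]),
    "Value": [100, 150, 125, 105, 110, 90, 180, 190, 185],
})

# First timestamp of each ID, broadcast back to every row of that ID.
start = df.groupby("ID")["DateTime"].transform("first")

# Elapsed time since the ID's first row, expressed in minutes as floats.
df["Time Elapsed"] = (df["DateTime"] - start).dt.total_seconds() / 60

means = df.groupby("Time Elapsed")["Value"].mean()
print(means.round(1))
```

If the rows might be unsorted, transform('min') would be a safer choice than transform('first').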
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post)
the counter is computed within (1) each year, (2) then each month and then (3) each day
since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
# cumcount() numbers the rows 0, 1, 2, ... within each day, so the first row has 0 minutes elapsed
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount()
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 0
1 1 2018-01-01 15:05:00 150 1 1 2018 5
1 1 2018-01-01 15:10:00 125 1 1 2018 10
2 2 2018-02-02 13:15:00 105 2 2 2018 0
2 2 2018-02-02 13:20:00 110 2 2 2018 5
2 2 2018-02-02 13:25:00 90 2 2 2018 10
3 3 2019-03-03 05:05:00 180 3 3 2019 0
3 3 2019-03-03 05:10:00 190 3 3 2019 5
3 3 2019-03-03 05:15:00 185 3 3 2019 10
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
0 128.333333
5 150.000000
10 133.333333
Name: Value, dtype: float64
I'm having difficulty figuring out a way to count the occurrences of holidays between datetime ranges in a dataframe. The holidays are in a list while the datetime ranges are in the dataframe as shown below: (note that this is a subset of a very large data set)
df = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35']})
df['Date'] = pd.to_datetime(df['Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
Holidays = [date(2018,12,24),date(2018,12,25),date(2019,1,1),date(2019,1,21),date(2019,2,18),date(2019,3,8),date(2019,5,27)]
I've been able to find a way that determine whether or not a Holiday is within the datetime ranges, but not get an actual count.
Is there a way to alter the code below to gather the count rather than boolean values?
This is what I've tried so far:
df['Holidays'] = [any([(z>=x)&(z<=y) for z in Holidays]) for x , y in zip(df['Date'].dt.date,df['End Date'].dt.date)]
The result I'm looking for is as follows:
result = pd.DataFrame({'Date': ['2018-12-19 18:47','2019-01-01 06:11','2019-01-12 10:05','2019-02-17 14:22','2019-03-08 16:17','2019-03-25 17:35','2019-02-14 17:35'],
'End Date': ['2018-12-28 18:47','2019-01-05 06:11','2019-01-16 10:05','2019-02-19 14:22','2019-03-12 16:17','2019-03-26 17:35','2019-05-27 17:35'],
'Holidays': [2,1,0,1,1,0,3]})
We can make a function that checks this condition and then apply it row-wise.
def fn(series):
return sum([series.iloc[0] <= h <= series.iloc[1] for h in Holidays])
df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47:00 2018-12-28 18:47:00 2
1 2019-01-01 06:11:00 2019-01-05 06:11:00 0
2 2019-01-12 10:05:00 2019-01-16 10:05:00 0
3 2019-02-17 14:22:00 2019-02-19 14:22:00 1
4 2019-03-08 16:17:00 2019-03-12 16:17:00 0
5 2019-03-25 17:35:00 2019-03-26 17:35:00 0
6 2019-02-14 17:35:00 2019-05-27 17:35:00 3
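Your desired output does not match this result because the entries in the Holidays list carry no time of day, so each one is compared as midnight. A range that starts later on the holiday itself (for example 2019-01-01 06:11) therefore excludes it. To get the output that you posted, we have to round the range boundaries down to the day.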
def fn(series):
return sum([series.iloc[0].floor('d') <= h <= series.iloc[1].floor('d') for h in Holidays])
df.assign(Holidays=df.apply(fn, axis=1))
Date End Date Holidays
0 2018-12-19 18:47 2018-12-28 18:47 2
1 2019-01-01 06:11 2019-01-05 06:11 1
2 2019-01-12 10:05 2019-01-16 10:05 0
3 2019-02-17 14:22 2019-02-19 14:22 1
4 2019-03-08 16:17 2019-03-12 16:17 1
5 2019-03-25 17:35 2019-03-26 17:35 0
6 2019-02-14 17:35 2019-05-27 17:35 3
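Recent pandas versions may refuse to compare a Timestamp with a plain datetime.date, so here is a sketch of the same day-level count with the holidays converted to a DatetimeIndex first. The helper name count_holidays is made up for this example; the data is the subset from the question.

```python
from datetime import date
import pandas as pd

df = pd.DataFrame({
    "Date": ["2018-12-19 18:47", "2019-01-01 06:11", "2019-01-12 10:05",
             "2019-02-17 14:22", "2019-03-08 16:17", "2019-03-25 17:35",
             "2019-02-14 17:35"],
    "End Date": ["2018-12-28 18:47", "2019-01-05 06:11", "2019-01-16 10:05",
                 "2019-02-19 14:22", "2019-03-12 16:17", "2019-03-26 17:35",
                 "2019-05-27 17:35"],
})
df["Date"] = pd.to_datetime(df["Date"])
df["End Date"] = pd.to_datetime(df["End Date"])

# Holidays as midnight timestamps so they compare cleanly with Timestamps.
holidays = pd.to_datetime([
    date(2018, 12, 24), date(2018, 12, 25), date(2019, 1, 1),
    date(2019, 1, 21), date(2019, 2, 18), date(2019, 3, 8),
    date(2019, 5, 27),
])

def count_holidays(row):
    # Drop the time of day so a holiday on the start or end day counts.
    start = row["Date"].normalize()
    end = row["End Date"].normalize()
    return int(((holidays >= start) & (holidays <= end)).sum())

df["Holidays"] = df.apply(count_holidays, axis=1)
print(df)
```

Each row still scans the full holiday list, which is fine for a short list; for a very large dataset a sorted-array lookup would probably scale better.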
I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to add a column that contains the total minutes, so in this case the result would be:
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14410
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without loop?
Yes, there is a special dt accessor for date/time series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
If your datetime column is already of dtype timedelta, you can omit the explicit casting and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')
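For whole minutes as integers, a possible variation is floor division by a one-minute Timedelta. The sketch below rebuilds the sample data from the question and assumes the column parses cleanly with pd.to_timedelta.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 5, 7, 1, 2],
    "datetime": pd.to_timedelta([
        "13 days 21:50:00", "0 days 02:05:00", "10 days 00:10:00",
        "2 days 01:20:00", "3 days 11:50:00", "1 days 02:30:00",
    ]),
})

# Floor division by a Timedelta returns integers (whole minutes),
# while true division (/) returns floats.
df["minutes"] = df["datetime"] // pd.Timedelta(minutes=1)
print(df)
```

Floor division discards any leftover seconds, so use true division instead if fractional minutes matter.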