We have a pandas DataFrame with start_date and end_date columns (the input format is given below). I need to check whether a given input time range overlaps the interval between start_date and end_date.
For example, if the time range is 09:30-10:30, the output should be the first row (student1), and if the time range is 16:00-17:30, the output should be the second row (student2). How can I achieve this?
Input Dataframe:
name start_date end_date
0 student1 2020-08-30 09:00:00 2020-08-30 10:00:00
1 student2 2020-08-30 15:00:00 2020-08-30 18:00:00
2 student3 2020-08-30 11:00:00 2020-08-30 12:30:00
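Assuming your date columns have a datetime dtype, the following builds a boolean mask of the rows whose time-of-day interval overlaps the requested range: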
from datetime import datetime
start_time = datetime(2020, 1, 1, 9, 30)
end_time = datetime(2020, 1, 1, 10, 30)
df['start_date'].dt.time.le(end_time.time())&df['end_date'].dt.time.ge(start_time.time())
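As a minimal sketch of how that mask can be used to select rows, assuming the sample data from the question (the helper name rows_overlapping is made up for illustration and is not part of pandas):

```python
from datetime import time

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["student1", "student2", "student3"],
    "start_date": pd.to_datetime(
        ["2020-08-30 09:00:00", "2020-08-30 15:00:00", "2020-08-30 11:00:00"]
    ),
    "end_date": pd.to_datetime(
        ["2020-08-30 10:00:00", "2020-08-30 18:00:00", "2020-08-30 12:30:00"]
    ),
})


def rows_overlapping(frame, start, end):
    """Hypothetical helper: rows whose time-of-day interval overlaps [start, end]."""
    mask = frame["start_date"].dt.time.le(end) & frame["end_date"].dt.time.ge(start)
    return frame[mask]


morning = rows_overlapping(df, time(9, 30), time(10, 30))
afternoon = rows_overlapping(df, time(16, 0), time(17, 30))
print(morning["name"].tolist())
print(afternoon["name"].tolist())
```

Because only the time of day is compared, this sketch ignores the date part and would not handle intervals that cross midnight.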
I have a table with a DateTime Column as shown below. The time Interval is in hours
ID TimeInterval Temperature
1 00:00:00 27
2 01:00:00 26
3 02:00:00 24
4 03:00:00 24
5 04:00:00 25
I tried to use the time interval for plotting. However, I got an error
float() argument must be a string or a number, not 'datetime.time'
So, I want to extract the first two digits (the hour) of the TimeInterval column and put them into a new column.
Any ideas how to extract that?
You can use strftime with %H for hours
import pandas as pd
# Create test data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'TimeInterval': ['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00'], 'Temperature': [27, 26, 24, 24, 25]})
# Parse the strings as datetimes using the %H:%M:%S format
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# Create a new column
df['new'] = df['TimeInterval'].dt.strftime('%H')
output:
ID TimeInterval Temperature new
1 1900-01-01 00:00:00 27 00
2 1900-01-01 01:00:00 26 01
3 1900-01-01 02:00:00 24 02
4 1900-01-01 03:00:00 24 03
5 1900-01-01 04:00:00 25 04
If the "TimeInterval" column is a string ...
you can select the first 2 characters from it and parse it later on into an integer:
df["new"] = df["TimeInterval"].str[:2].astype(int)
But that would be an ugly solution.
Here is a better solution
# test data
df = pd.DataFrame([{'TimeInterval':'00:00:00'},
{'TimeInterval':'01:00:00'}])
# cast it to a datetime object
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# select the hours
df['hours'] = df['TimeInterval'].dt.hour
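The error message in the question suggests the column may actually hold datetime.time objects rather than strings, in which case the .str and pd.to_datetime approaches above would not apply directly. A small sketch under that assumption (the sample frame is constructed here for illustration):

```python
import datetime

import pandas as pd

# Assumption: TimeInterval holds datetime.time objects, as the error message hints
df = pd.DataFrame({
    "TimeInterval": [datetime.time(h) for h in range(5)],
    "Temperature": [27, 26, 24, 24, 25],
})

# Each element is a datetime.time, so its .hour attribute is already an integer
df["hour"] = df["TimeInterval"].map(lambda t: t.hour)
print(df)
```

The resulting integer column can then be passed to a plotting call in place of the time objects.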
I have a column that came from Excel and is supposed to contain durations (in hours), for example 02:00:00.
It works well if all these durations are less than 24:00, but if one is longer than that, it appears in pandas as 1900-01-03 08:00:00 (a datetime).
As a result, the column's datatype is dtype('O').
df = pd.DataFrame({'duration':[datetime.time(2, 0), datetime.time(2, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0), datetime.time(1, 0),
datetime.time(1, 0)]})
# Output
duration
0 02:00:00
1 02:00:00
2 1900-01-03 08:00:00
3 1900-01-03 08:00:00
4 1900-01-03 08:00:00
5 1900-01-03 08:00:00
6 1900-01-03 08:00:00
7 1900-01-03 08:00:00
8 01:00:00
9 01:00:00
But if I try to convert to either time or datetime I always get an error.
TypeError: <class 'datetime.time'> is not convertible to datetime
As it stands, unless I fix this, all durations greater than 24:00 are lost.
Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
Before you start hacking the excel reader engine, it might be easier to tackle the issue in Excel. Here's a small sample file;
You can download it here.
duration is auto-formatted by Excel, duration_text is what you get if you set the column format to 'text' before you enter the values, duration_to_text is what you get if you change the format to text after Excel auto-formatted the values (first column).
Now you have everything you need after import with pandas:
df = pd.read_excel('path_to_file')
df
duration duration_text duration_to_text
0 12:30:00 12:30:00 0.520833
1 1900-01-01 00:30:00 24:30:00 1.020833
# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'], errors='coerce')
0 0 days 12:30:00
1 1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]
# or
pd.to_timedelta(df['duration_to_text'], unit='d', errors='coerce')
0 0 days 12:29:59.999971200 # note the precision issue ;-)
1 1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
Another viable option could be to save the Excel file as a csv and import that to a pandas DataFrame. The sample xlsx used above would then look like this for example.
If you have no other option than to re-process in pandas, an option could be to treat datetime.time objects and datetime.datetime objects specifically, e.g.
import datetime
# where you have datetime (incorrect from excel)
m = [isinstance(i, datetime.datetime) for i in df.duration]
# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')
# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
df['timedelta']
0 0 days 12:30:00
1 1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
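The same idea can also be written as a per-cell conversion function. This is only a sketch: to_duration and EXCEL_DAY_ZERO are hypothetical names, the sample column is constructed here, and the 1899-12-31 reference date is taken from the snippet above.

```python
import datetime

import pandas as pd

# Reference date used in the snippet above for Excel-style durations
EXCEL_DAY_ZERO = datetime.datetime(1899, 12, 31)


def to_duration(value):
    """Hypothetical helper: convert one cell to a Timedelta."""
    if isinstance(value, datetime.datetime):
        # Durations of 24 h or more arrive as datetimes counted from the reference date
        return pd.Timedelta(value - EXCEL_DAY_ZERO)
    if isinstance(value, datetime.time):
        # Durations under 24 h arrive as plain time objects
        return pd.Timedelta(
            hours=value.hour, minutes=value.minute, seconds=value.second
        )
    return pd.NaT


df = pd.DataFrame({"duration": [
    datetime.time(2, 0),
    datetime.datetime(1900, 1, 3, 8, 0),
    datetime.time(1, 0),
]})
df["timedelta"] = df["duration"].map(to_duration)
print(df)
```

Handling each type explicitly avoids the round trip through strings, at the cost of a Python-level loop over the column.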
IIUC, use pd.to_timedelta:
Set up an MRE:
df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)
# Output
duration
0 43:24:57
1 22:12:52
2 -
3 78:41:33
df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)
# Output
duration
0 1 days 19:24:57
1 0 days 22:12:52
2 NaT
3 3 days 06:41:33
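Since the question describes the values as durations in hours, it may help to turn the timedelta column into a plain number of hours. A short sketch, reusing the MRE data above (the hours column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"duration": ["43:24:57", "22:12:52", "-", "78:41:33"]})
df["duration"] = pd.to_timedelta(df["duration"], errors="coerce")

# total_seconds() keeps the day component, unlike the .dt.seconds attribute
df["hours"] = df["duration"].dt.total_seconds() / 3600
print(df)
```

Rows that could not be parsed stay missing (NaN) in the numeric column.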
Update
@MrFuppes' Excel file is exactly what I have in my column 'duration'
Try:
df['duration'] = np.where(df['duration'].apply(len) == 8,
'1899-12-31 ' + df['duration'], df['duration'])
df['duration'] = pd.to_datetime(df['duration'], errors='coerce') \
- pd.Timestamp('1899-12-31')
print(df)
# Output (with a slightly modified example of @MrFuppes)
duration
0 0 days 12:30:00
1 1 days 00:30:00
2 NaT
I have a data frame with a time column holding minutes from 0-1339, meaning the 1440 minutes of a day. I want to add a datetime column representing the day 2021-3-21, including hh and mm, in a format like 1980-03-01 11:00. I tried the following code:
from datetime import datetime, timedelta
date = datetime.date(2021, 3, 21)
days = date - datetime.date(1900, 1, 1)
df['datetime'] = pd.to_datetime(df['time'],format='%H:%M:%S:%f') + pd.to_timedelta(days, unit='d')
But I get an error like: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'
Is there another way to solve this problem, or a way to fix this code?
>>df
time
0
1
2
3
..
1339
I want to convert these minutes to a particular format like 1980-03-01 11:00, where I use the date 2021-3-21 and convert the minutes to the hh:mm part. The dataframe should look like this:
>df
datetime time
2021-3-21 00:00 0
2021-3-21 00:01 1
2021-3-21 00:02 2
...
How can I format my data in this way?
Let's try pd.to_timedelta instead to turn the minutes in time into durations, then add them to a Timestamp:
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
df.head():
time datetime
0 0 2021-03-21 00:00:00
1 1 2021-03-21 00:01:00
2 2 2021-03-21 00:02:00
3 3 2021-03-21 00:03:00
4 4 2021-03-21 00:04:00
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': np.arange(0, 1440)})
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
print(df)
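The question asks for a display format without seconds. As a sketch, the datetime column can be rendered as text with .dt.strftime (the datetime_str column name is an assumption, and the result is a string column, not a datetime one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": np.arange(0, 1440)})
df["datetime"] = pd.Timestamp("2021-03-21") + pd.to_timedelta(df["time"], unit="m")

# Format as 'YYYY-MM-DD HH:MM'; keep the datetime column for any further arithmetic
df["datetime_str"] = df["datetime"].dt.strftime("%Y-%m-%d %H:%M")
print(df.head())
```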
Is there a simple way to obtain the hour of the year from a datetime?
dt = datetime(2019, 1, 3, 00, 00, 00) # 03/01/2019 00:00
dt_hour = dt.hour_of_year() # should be something like that
Expected output: dt_hour = 48
It would be nice as well to obtain minutes_of_year and seconds_of_year
One way of implementing this yourself is this:
import datetime

def hour_of_year(dt):
    beginning_of_year = datetime.datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    # total_seconds() returns a float, so convert the floored result to int
    return int((dt - beginning_of_year).total_seconds() // 3600)
This first creates a new datetime object representing the beginning of the year. We then compute the time since the beginning of the year in seconds, divide by 3600 and take the integer part to get the full hours that have passed since the beginning of the year.
Note that using the days attribute of the timedelta object will only return the number of full days since the beginning of the year.
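To illustrate the difference between the two approaches, here is a small sketch with a timestamp that does not fall on midnight (the example value is chosen for illustration):

```python
import datetime


def hour_of_year(dt):
    """Full hours elapsed since the start of dt's year."""
    start = datetime.datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    return int((dt - start).total_seconds() // 3600)


dt = datetime.datetime(2019, 1, 3, 5, 30)
delta = dt - datetime.datetime(2019, 1, 1)

whole_days_only = delta.days * 24  # ignores the 5.5 hours past midnight
full_hours = hour_of_year(dt)      # includes them

print(whole_days_only, full_hours)  # 48 53
```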
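You can use timedelta (the days attribute counts only whole days, so this is exact only when the timestamp falls on midnight):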
import datetime
dt = datetime.datetime(2019, 1, 3, 00, 00, 00)
dt2 = datetime.datetime(2019, 1, 1, 00, 00, 00)
print((dt-dt2).days*24)
output:
48
All three functions, reusing their code.
import datetime

def minutes_of_year(dt):
    return seconds_of_year(dt) // 60

def hours_of_year(dt):
    return minutes_of_year(dt) // 60

def seconds_of_year(dt):
    dt0 = datetime.datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    delta = dt - dt0
    return int(delta.total_seconds())
Edited to take possible time zone info into account.
Or: subclass datetime, for easier reuse in later projects:
import datetime

class MyDateTime(datetime.datetime):
    def __new__(cls, *args, **kwargs):
        return datetime.datetime.__new__(cls, *args, **kwargs)

    def minutes_of_year(self):
        return self.seconds_of_year() // 60

    def hours_of_year(self):
        return self.minutes_of_year() // 60

    def seconds_of_year(self):
        dt0 = datetime.datetime(self.year, 1, 1, tzinfo=self.tzinfo)
        delta = self - dt0
        return int(delta.total_seconds())
# create and use like a normal datetime object
dt = MyDateTime.now()
# properties and functions of datetime still available, of course.
print(dt.day)
# ... and new methods:
print(dt.hours_of_year())
You can write a custom function
from datetime import datetime

def get_time_of_year(dt, unit='hours_of_year'):
    # Time elapsed since midnight on 1 January of dt's year
    initial_date = datetime(dt.year, 1, 1, tzinfo=dt.tzinfo)
    duration = dt - initial_date
    total_seconds = int(duration.total_seconds())
    if unit == 'days_of_year':
        return duration.days
    if unit == 'hours_of_year':
        return total_seconds // 3600
    if unit == 'minutes_of_year':
        return total_seconds // 60
    if unit == 'seconds_of_year':
        return total_seconds
    raise ValueError(f'unknown unit: {unit}')
Test the function:
get_time_of_year(dt, 'hours_of_year')
#>>48
I have the dataframe DF that has the column 'Timestamp' with type datetime64[ns].
The column timestamp looks like this:
DF['Timestamp']:
0 2022-01-01 00:00:00
1 2022-01-01 01:00:00
2 2022-01-01 02:00:00
3 2022-01-01 03:00:00
4 2022-01-01 04:00:00
...
8755 2022-12-31 19:00:00
8756 2022-12-31 20:00:00
8757 2022-12-31 21:00:00
8758 2022-12-31 22:00:00
8759 2022-12-31 23:00:00
Name: Timestamp, Length: 8760, dtype: datetime64[ns]
I extract 'Hour of Year' in this way:
DF['Year'] = DF['Timestamp'].astype('M8[Y]')
DF['DayOfYear'] = (DF['Timestamp'] - DF['Year']).astype('timedelta64[D]')
DF['Hour'] = DF['Timestamp'].dt.hour + 1
DF['HourOfYear'] = DF['DayOfYear'] * 24 + DF['Hour']
First it extracts the year from the Timestamp.
Next it creates a time delta from beginning of the year to that Timestamp based on days (in other words, day of year).
Then it extracts the hour from the timestamp.
Finally it calculates the hour of the year with that formula.
And it looks like this in the end:
DF:
Timestamp ... HourOfYear
0 2022-01-01 00:00:00 ... 1.0
1 2022-01-01 01:00:00 ... 2.0
2 2022-01-01 02:00:00 ... 3.0
3 2022-01-01 03:00:00 ... 4.0
4 2022-01-01 04:00:00 ... 5.0
...
8755 2022-12-31 19:00:00 ... 8756.0
8756 2022-12-31 20:00:00 ... 8757.0
8757 2022-12-31 21:00:00 ... 8758.0
8758 2022-12-31 22:00:00 ... 8759.0
8759 2022-12-31 23:00:00 ... 8760.0
[8760 rows x 6 columns]
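For comparison, here is a sketch that computes the same 1-based hour of year using only the .dt.dayofyear and .dt.hour accessors, which avoids casting to year or day units. The sample series is built here for illustration and the variable names are assumptions:

```python
import numpy as np
import pandas as pd

# Hourly timestamps covering all of 2022 (8760 rows), mimicking DF['Timestamp']
timestamps = pd.Series(
    pd.Timestamp("2022-01-01") + pd.to_timedelta(np.arange(8760), unit="h"),
    name="Timestamp",
)

# dayofyear starts at 1, so subtract 1 before converting days to hours;
# the trailing + 1 keeps the 1-based numbering used above
hour_of_year = (timestamps.dt.dayofyear - 1) * 24 + timestamps.dt.hour + 1
print(hour_of_year.iloc[0], hour_of_year.iloc[-1])
```

This version yields integers rather than floats.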
I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
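As a self-contained sketch of the same approach, with one extra end-of-month row added here to show how pd.DateOffset handles a target month that is shorter than the starting day:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2016-10-11", "2016-11-01", "2016-11-30"])}
)
plus_month_period = 3

# DateOffset adds calendar months and keeps the datetime64 dtype
df["future_date"] = df["date"] + pd.DateOffset(months=plus_month_period)
print(df)
```

The last row lands on 2017-02-28, because February 2017 has no 30th day and the offset falls back to the last valid day of the month.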
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, Zero's code or mattblack's code won't help, since the number of months varies by row. You have to apply a lambda function over the rows, where the function takes 2 arguments:
A date to which months need to be added
A month value in integer format
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this, you can use the following code snippet to add months to the Start_Date column. It uses progress_apply, which tqdm registers on pandas objects once tqdm.pandas() has been called (plain apply works the same way, without a progress bar). Refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
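If tqdm or dateutil is not available, the per-row addition can also be sketched with pandas alone. This is an alternative to the code above rather than a reproduction of it, and it swaps relativedelta for pd.DateOffset:

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime([
        "2014-06-01", "2014-06-01", "2000-10-01",
        "2016-07-01", "2017-12-01", "2019-01-01",
    ]),
    "Months_to_add": [23, 4, 10, 3, 90, 2],
})

# Pair each start date with its own month count; int() guards against numpy integers
df["End_Date"] = [
    start + pd.DateOffset(months=int(months))
    for start, months in zip(df["Start_Date"], df["Months_to_add"])
]
print(df)
```

This is still a Python-level loop, so it trades the progress bar for fewer dependencies rather than for speed.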
I believe that the simplest and most efficient (fastest) way to solve this is to convert the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using the sample data created by @Aruparna Maity
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, just repeat the process, but changing the periods to days
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
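One caveat with adding back the day of the month: when the target month is shorter than the starting day, the result spills into the following month. The sketch below contrasts this with pd.DateOffset, which falls back to the last valid day; the two sample rows and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2014-06-20", "2017-01-31"]),
    "Months_to_add": [4, 1],
})

# Period-based approach: first day of the target month, then add (day - 1) days
month_start = (
    df["Start_Date"].dt.to_period("M") + df["Months_to_add"]
).dt.to_timestamp()
period_based = month_start + pd.to_timedelta(df["Start_Date"].dt.day - 1, unit="D")

# DateOffset-based approach, applied row by row
offset_based = pd.Series([
    start + pd.DateOffset(months=int(months))
    for start, months in zip(df["Start_Date"], df["Months_to_add"])
])

print(period_based.dt.strftime("%Y-%m-%d").tolist())
print(offset_based.dt.strftime("%Y-%m-%d").tolist())
```

For 2017-01-31 plus one month, the period-based version gives 2017-03-03 while the offset-based version gives 2017-02-28; which behaviour is wanted depends on the use case.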
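Another way uses numpy timedelta64. Note that numpy treats the 'M' unit as an average month length rather than a calendar month, which is why the results below are shifted by several hours: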
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]
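The 07:27:18 offset in that output can be reproduced with the standard library alone, assuming the month unit is one twelfth of an average 365.2425-day year (the constant names below are made up for this sketch):

```python
import datetime

# 365.2425 days * 86400 seconds per day = 31,556,952 seconds per average year
SECONDS_PER_AVERAGE_YEAR = 31_556_952
AVERAGE_MONTH = datetime.timedelta(seconds=SECONDS_PER_AVERAGE_YEAR // 12)

start = datetime.datetime(2016, 10, 11)
shifted = start + 3 * AVERAGE_MONTH
print(AVERAGE_MONTH)  # 30 days, 10:29:06
print(shifted)        # 2017-01-10 07:27:18
```

Three average months come to 91 days, 7 hours, 27 minutes and 18 seconds, which matches the first row of the output above and shows why this approach is not suitable when calendar-aligned dates are needed.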