I have the following dataframe and was trying to create a new column of boolean values generated from my datetime index: a value of 1 if the hour is >= 08:00:00 and <= 21:00:00, and 0 if the hour is outside that range.
Timestamp Bath_County_Gen Wing_Gen Boolean
2020-09-23 00:00:00 -390.0 2954.0 0
2020-09-23 00:15:00 -363.33 3007.75 0
2020-09-23 00:30:00 -250.0 3049.0 0
2020-09-23 00:45:00 -220.0 3143.5 0
2020-09-23 01:00:00 -206.67 3193.33 0
2020-09-23 01:15:00 -185.0 3195.25 0
I tried the following but had no luck, and I wasn't sure how else to dynamically generate the boolean column values.
df['boolean'] = np.where(df.between_time('08:00:00', '21:00:00'), 1,0)
Thanks for the help!
After ensuring your "Timestamp" column is in datetime format, you can extract the hour of the day from it and perform the following operation:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])  # ensure it's datetime
df['is_between_8_and_21'] = df['Timestamp'].dt.hour.between(8, 21, inclusive='both')  # extract the hour and check whether it's between 8 and 21 (inclusive=True was deprecated in pandas 1.3)
Now df will look like this:
Timestamp Bath_County_Gen Wing_Gen is_between_8_and_21
2020-10-23 00:00:00 -390.00 2954.00 False
2020-10-23 00:15:00 -363.33 3007.75 False
2020-10-23 00:30:00 -250.00 3049.00 False
Note that 21:05 will be truncated to hour 21, so it will still be included when both endpoints are inclusive.
EDIT
As you mention, your "Timestamp" is actually a DatetimeIndex. In this case, as you suggested, you can operate directly on the dataframe:
df.between_time('08:00', '21:00', inclusive='both')  # include_start/include_end were removed in pandas 2.0
From the Pandas documentation on .between_time(), if you specify start_time and end_time as strings, they must be in a format like "08:25" or "21:51". If you want more fine-grained control down to the second, you can pass datetime.time objects instead, for example:
import datetime
start_time = datetime.time(8, 0, 0)
end_time = datetime.time(21, 0, 0)
df.between_time(start_time, end_time, inclusive='left')  # 'left' ensures 21 o'clock exactly is excluded
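Since between_time returns the matching rows rather than a boolean mask, one way to get the question's 0/1 column is to flag the returned index. A minimal sketch, assuming the Timestamp is the DatetimeIndex as in the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"Bath_County_Gen": [-390.0, -363.33], "Wing_Gen": [2954.0, 3007.75]},
    index=pd.to_datetime(["2020-09-23 00:00:00", "2020-09-23 09:15:00"]),
)

# between_time returns the rows whose time-of-day matches;
# use their index labels to set the flag
df["Boolean"] = 0
df.loc[df.between_time("08:00", "21:00").index, "Boolean"] = 1
```

By default both endpoints are inclusive, matching the question's requirement.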
Make sure that the Timestamp column has datetime format:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
Afterwards you can access the hour via the hour attribute. Full code (I added two rows for testing):
df = pd.DataFrame({
    "Timestamp": ["2020-09-23 00:00:00", "2020-09-23 00:15:00", "2020-09-23 00:30:00",
                  "2020-09-23 00:45:00", "2020-09-23 01:00:00", "2020-09-23 01:15:00",
                  "2020-09-23 15:15:00", "2020-09-23 23:15:00"]
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
def is_between_8_and_21(ts):
    # chained comparison; avoid shadowing the datetime module name
    return 1 if 8 <= ts.hour <= 21 else 0

df["Boolean"] = df["Timestamp"].apply(is_between_8_and_21)
df
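If an integer 0/1 column is preferred over booleans, the hour-based check can also be condensed to a single vectorized line. A sketch, assuming a datetime "Timestamp" column:

```python
import pandas as pd

df = pd.DataFrame({"Timestamp": pd.to_datetime(["2020-09-23 00:15:00",
                                                "2020-09-23 09:15:00"])})

# between() is inclusive on both ends by default;
# astype(int) maps True/False to 1/0
df["Boolean"] = df["Timestamp"].dt.hour.between(8, 21).astype(int)
```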
Related
I have a large data set from which I'm trying to produce a time series using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates were entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not all the same. I have the code to randomise a date, but I cannot find a way to replace the data with the generated dates.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
    random_date("2011-01-01", "2022-04-17", random.random())
    for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year and month and select only the rows whose day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n, g in grouped:
    last_day = calendar.monthrange(n[0], n[1])[1]  # get last day for this month and year
    g['New_Date'] = g['date'].apply(
        lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
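If every placeholder date is a first-of-month value, the grouping loop can also be avoided with a vectorized sketch along the same lines (assuming, as above, that day 1 itself should not be drawn; days_in_month supplies each row's month length):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2015-01-01"]),
                   "num": [10035, 5397]})

last_day = df["date"].dt.days_in_month          # month length per row
rand_day = np.random.randint(2, last_day + 1)   # high is exclusive, hence +1
# the dates are day 1, so adding (day - 1) days lands on the drawn day
df["New_Date"] = df["date"] + pd.to_timedelta(rand_day - 1, unit="D")
```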
I have a column that came from Excel that is supposed to contain durations in hours, for example 02:00:00.
It works well if all the durations are less than 24:00, but if one is more than that, it appears in pandas as 1900-01-03 08:00:00 (i.e. a datetime),
and as a result the column's dtype is dtype('O').
import datetime
import pandas as pd

df = pd.DataFrame({'duration': [datetime.time(2, 0), datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.time(1, 0),
                                datetime.time(1, 0)]})
# Output
duration
0 02:00:00
1 02:00:00
2 1900-01-03 08:00:00
3 1900-01-03 08:00:00
4 1900-01-03 08:00:00
5 1900-01-03 08:00:00
6 1900-01-03 08:00:00
7 1900-01-03 08:00:00
8 01:00:00
9 01:00:00
But if I try to convert to either time or datetime I always get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
If I don't fix this, all the durations greater than 24:00 are lost.
Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
Before you start hacking the Excel reader engine, it might be easier to tackle the issue in Excel itself. Here's a small sample file:
duration is auto-formatted by Excel; duration_text is what you get if you set the column format to 'text' before entering the values; duration_to_text is what you get if you change the format to text after Excel has auto-formatted the values (first column).
Now you have everything you need after import with pandas:
df = pd.read_excel('path_to_file')
df
duration duration_text duration_to_text
0 12:30:00 12:30:00 0.520833
1 1900-01-01 00:30:00 24:30:00 1.020833
# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'], errors='coerce')
0 0 days 12:30:00
1 1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]
# or
pd.to_timedelta(df['duration_to_text'], unit='d', errors='coerce')
0 0 days 12:29:59.999971200 # note the precision issue ;-)
1 1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
Another viable option could be to save the Excel file as a CSV and import that into a pandas DataFrame.
If you have no other option than to re-process in pandas, an option could be to treat datetime.time objects and datetime.datetime objects specifically, e.g.
import datetime
# where you have datetime (incorrect from excel)
m = [isinstance(i, datetime.datetime) for i in df.duration]
# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')
# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
df['timedelta']
0 0 days 12:30:00
1 1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
IIUC, use pd.to_timedelta:
Setup a MRE:
df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)
# Output
duration
0 43:24:57
1 22:12:52
2 -
3 78:41:33
df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)
# Output
duration
0 1 days 19:24:57
1 0 days 22:12:52
2 NaT
3 3 days 06:41:33
Update
@MrFuppes' Excel file is exactly what I have in my column 'duration'.
Try:
df['duration'] = np.where(df['duration'].apply(len) == 8,
                          '1899-12-31 ' + df['duration'], df['duration'])
df['duration'] = pd.to_datetime(df['duration'], errors='coerce') \
                 - pd.Timestamp('1899-12-31')
print(df)
# Output (with a slightly modified example of #MrFuppes)
duration
0 0 days 12:30:00
1 1 days 00:30:00
2 NaT
I have a column with times (not timestamps) and would like to know the timedelta to 00:30:00. However, I can only find methods for timestamps.
df['Time'] = ['22:30:00', '23:30:00', '00:15:00']
The intended result should look something like this:
df['Output'] = ['02:00:00', '01:00:00', '00:15:00']
This code converts the Time values from str to datetime (the date is automatically set to 1900-01-01), then calculates the timedelta against a standardTime of 1900-01-02 00:30:00.
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame()
df['Time'] = ['22:30:00', '23:30:00', '00:15:00']
standardTime = datetime(1900, 1, 2, 0, 30, 0)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df['Output'] = df['Time'].apply(lambda x: standardTime-x).astype(str).str[7:] # without astype(str).str[7:], the Output value include a day such as "0 days 01:00:00"
print(df)
# Time Output
#0 1900-01-01 22:30:00 02:00:00
#1 1900-01-01 23:30:00 01:00:00
#2 1900-01-01 00:15:00 00:15:00
One might want to use datetime.time as the data structure, but time objects cannot be subtracted, so you can't conveniently get a timedelta from them.
On the other hand, datetime.datetime objects can be subtracted, so if you're always interested in positive deltas, you could construct a datetime object from your time representation using 1970-01-01 as date, and compare that to 1970-01-02T00:30.
For instance, if your times are stored as strings (as per your snippet):
import datetime as dt
def timedelta_to_0_30(time_string: str) -> dt.timedelta:
    time_string_as_datetime = dt.datetime.fromisoformat(f"1970-01-01T{time_string}")
    return dt.datetime(1970, 1, 2, 0, 30) - time_string_as_datetime
my_time_string = "22:30:00"
timedelta_to_0_30(my_time_string) # 2:00:00
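An alternative sketch that stays entirely in timedelta space: parse the strings with pd.to_timedelta, subtract from 24:30 (i.e. 00:30 on the next day), and take the result modulo 24 hours so that times already before 00:30 also come out right:

```python
import pandas as pd

df = pd.DataFrame({"Time": ["22:30:00", "23:30:00", "00:15:00"]})

t = pd.to_timedelta(df["Time"])
# distance to 00:30 the next day, wrapped back into a single day
df["Output"] = (pd.Timedelta("1 days 00:30:00") - t) % pd.Timedelta("1 days")
```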
I need to split a year into enumerated 20-minute chunks and then find the sequence number of the corresponding time-range chunk for randomly distributed timestamps within the year, for further processing.
I tried to use pandas for this, but I can't find a way to index timestamp in date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice for doing this in Python? I can imagine finding the range "by hand" by converting the date_range to integers and comparing, but maybe there are more elegant pandas/Python ways to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last value where event_ts belongs:
run = []
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
# 2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
event_ts belongs to index df[222]
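For completeness, pandas can also locate the containing bin without a Python loop. A sketch using Index.get_indexer with method="ffill", which returns the position of the last bin edge at or before the event (df.index.asof would give the label instead):

```python
import pandas as pd

# a short 20-minute grid, same construction as in the question
index = pd.date_range("2018-01-01", periods=10, freq="20min")
df = pd.DataFrame(range(len(index)), index=index, columns=["A"])

event_ts = pd.to_datetime("2018-01-01 02:17:43")

# position of the last grid point <= event_ts
pos = df.index.get_indexer([event_ts], method="ffill")[0]
```

Here 02:17:43 falls into the chunk starting at 02:00:00, so pos is the row number of that chunk.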
I am looking to add three columns to my current dataframe (utc_date, apac_date, and hour).
I successfully obtain two of the three columns; however, hour should correspond to apac_date (17), but it is returning the hour for utc_date (9).
Any help would be greatly appreciated!
This is the starting dataframe:
import pandas as pd
from datetime import datetime
from tzlocal import get_localzone
from pytz import timezone

raw_data = {
    'id': ['123456'],
    'start_date': [datetime(2017, 9, 21, 5, 30, 0)]}
df = pd.DataFrame(raw_data, columns=['id', 'start_date'])
df
Result:
id start_date
123456 2017-09-21 05:30:00
Next, I convert the timezones for utc and apac based on the users current region.
local_tz = get_localzone()
df['utc_date'] = df['start_date'].apply(lambda x: x.tz_localize(local_tz).astimezone(timezone('utc')))
df['apac_date'] = df['utc_date'].apply(lambda x: x.tz_localize('utc').astimezone(timezone('Asia/Hong_Kong')))
df
Result:
id start_date utc_date apac_date
123456 2017-09-21 05:30:00 2017-09-21 09:30:00+00:00 2017-09-21 17:30:00+08:00
Next, I retrieve the hour for the apac_date (it is giving me utc hour instead):
df['hour'] = df['apac_date'].apply(lambda x: int(x.strftime('%H')))
df
Result:
id start_date utc_date apac_date hour
123456 2017-09-21 05:30:00 2017-09-21 09:30:00+00:00 2017-09-21 17:30:00+08:00 9
Can you try using:
df['apac_date'] = df['utc_date'].apply(lambda x: x.tz_convert('Asia/Hong_Kong'))
I got errors with your code above because tz_localize() was called on timestamps that had already been localized.
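A minimal sketch of the distinction: tz_localize attaches a zone to a naive timestamp, while tz_convert changes the wall-clock representation of an already-aware one. The values mirror the question's example, so the Hong Kong hour comes out as 17:

```python
import pandas as pd

# an already-localized UTC column, as in the question
df = pd.DataFrame({
    "utc_date": pd.to_datetime(["2017-09-21 09:30:00"]).tz_localize("UTC")
})

# tz_convert, not tz_localize, for timestamps that already carry a zone
df["apac_date"] = df["utc_date"].dt.tz_convert("Asia/Hong_Kong")
df["hour"] = df["apac_date"].dt.hour
```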