I need to prepare data with time intervals for machine learning so that the timestamps are equally spaced. For example, with 3-hour spacing I would like the following timestamps: 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, and so on. For example:
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
print(df)
Output:
Start End Val
0 2022-07-01 11:30:00 2022-07-01 18:30:00 a
1 2022-07-01 22:30:00 2022-07-02 03:30:00 b
I tried to generate the timestamps:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
df['Datetime'] = df['Datetime'].dt.round('H')
print(df[['Datetime', 'Val']])
Output:
Datetime Val
0 2022-07-01 12:00:00 a
0 2022-07-01 14:00:00 a
0 2022-07-01 18:00:00 a
1 2022-07-01 22:00:00 b
1 2022-07-02 02:00:00 b
As you can see, those timestamps are not equally spaced. My expected result:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
We can use the function merge_asof:
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
date_min, date_max = df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max() + pd.Timedelta('1D')
time_range = pd.date_range(date_min, date_max, freq='3H').to_series(name='Datetime')
df = pd.merge_asof(time_range, df, tolerance=pd.Timedelta('3H'))
df.truncate(df['Val'].first_valid_index(), df['Val'].last_valid_index())
Output:
Datetime Val
4 2022-07-01 12:00:00 a
5 2022-07-01 15:00:00 a
6 2022-07-01 18:00:00 a
7 2022-07-01 21:00:00 NaN
8 2022-07-02 00:00:00 b
9 2022-07-02 03:00:00 b
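As a minimal illustration of how the tolerance argument produces the NaN gap (toy frames, not the ones above): merge_asof matches each left timestamp to the most recent right row, but only if it lies within the tolerance.

```python
import pandas as pd

# regular 3-hour grid (left side of the as-of merge)
left = pd.DataFrame({'t': pd.date_range('2022-07-01 00:00', periods=4, freq='3H')})
# irregular observations (right side); both sides must be sorted on the key
right = pd.DataFrame({'t': pd.to_datetime(['2022-07-01 00:30', '2022-07-01 08:00']),
                      'Val': ['x', 'y']})
out = pd.merge_asof(left, right, on='t', tolerance=pd.Timedelta('3H'))
# grid points with no observation in the preceding 3 hours get NaN
print(out)
```

With the default direction='backward', 03:00 picks up the 00:30 row (2.5 h away) and 09:00 picks up the 08:00 row, while 00:00 and 06:00 stay NaN.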
Annotated code
# Find min and max date of interval
s, e = df['Start'].min(), df['End'].max()
# Create a date range with freq=3H
# Create an output dataframe by assigning the date range to a datetime column
df_out = pd.DataFrame({'datetime': pd.date_range(s.ceil('H'), e, freq='3H')})
# Create interval index from start and end date
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
# Set the index of df to interval index and select Val column to create mapping series
# Then use this mapping series to substitute values in output dataframe
df_out['Val'] = df_out['datetime'].map(df.set_index(idx)['Val'])
Result
datetime Val
0 2022-07-01 12:00:00 a
1 2022-07-01 15:00:00 a
2 2022-07-01 18:00:00 a
3 2022-07-01 21:00:00 NaN
4 2022-07-02 00:00:00 b
5 2022-07-02 03:00:00 b
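A property worth noting about the interval mapping above: timestamps that fall outside every interval simply map to NaN, which is exactly what produces the gap row. Rebuilding the mapping on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Start': pd.to_datetime(['2022-07-01 11:30', '2022-07-01 22:30']),
                   'End': pd.to_datetime(['2022-07-01 18:30', '2022-07-02 03:30']),
                   'Val': ['a', 'b']})
idx = pd.IntervalIndex.from_arrays(df['Start'], df['End'], closed='both')
mapper = df.set_index(idx)['Val']
# one probe inside the first interval, one in the gap between the intervals
probe = pd.Series(pd.to_datetime(['2022-07-01 12:00', '2022-07-01 21:00']))
res = probe.map(mapper)
print(res)
```

12:00 lies in [11:30, 18:30] and maps to 'a'; 21:00 lies in neither interval and maps to NaN.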
For this problem I really like to use pd.DataFrame.reindex. In particular, you can specify method='nearest' and tolerance='90m' to ensure you leave gaps where you need them.
You can create your regularly spaced time series using pd.date_range, building the start and end arguments with the .floor('3H') and .ceil('3H') methods, respectively.
import pandas as pd
df = pd.DataFrame({'Start': ['2022-07-01 11:30', '2022-07-01 22:30'], 'End': ['2022-07-01 18:30', '2022-07-02 3:30'], 'Val': ['a', 'b']})
for col in ['Start', 'End']:
    df[col] = df[col].apply(pd.to_datetime)
df['Datetime'] = df.apply(lambda x: pd.date_range(x['Start'], x['End'], freq='3H'), axis=1)
df = df.explode('Datetime').drop(['Start', 'End'], axis=1)
result = pd.DataFrame()
for name, group in df.groupby('Val'):
    group = group.set_index('Datetime')
    group.index = group.index.ceil('1H')
    idx = pd.date_range(group.index.min().floor('3H'), group.index.max().ceil('3H'), freq='3H')
    group = group.reindex(idx, tolerance='90m', method='nearest')
    result = pd.concat([result, group])
result = result.sort_index()
which returns:
Val
2022-07-01 12:00:00 a
2022-07-01 15:00:00 a
2022-07-01 18:00:00 a
2022-07-01 21:00:00
2022-07-02 00:00:00 b
2022-07-02 03:00:00 b
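The reindex tolerance behavior can be checked in isolation: target labels farther than the tolerance from every existing label come back as NaN (a toy sketch, not the question's frame):

```python
import pandas as pd

s = pd.Series(['a', 'b'], index=pd.to_datetime(['2022-07-01 12:00', '2022-07-01 18:00']))
idx = pd.date_range('2022-07-01 12:00', '2022-07-01 18:00', freq='3H')  # 12:00, 15:00, 18:00
out = s.reindex(idx, method='nearest', tolerance='90m')
# 15:00 is 3 hours from both existing labels, beyond the 90-minute tolerance -> NaN
print(out)
```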
Another method would be to simply add the 3-hour timedelta to the start time in a loop.
import pandas as pd
from datetime import datetime, timedelta

# taking start time as current time just for example
start = datetime.now()
# taking end time as current time + 15 hours just for example
end = datetime.now() + timedelta(hours=15)
times = []
while end > start:
    start = start + timedelta(hours=3)
    times.append(start)
df = pd.DataFrame(columns=['Times'])
df['Times'] = times
Output
Times
0 2022-07-15 01:28:56.912013
1 2022-07-15 04:28:56.912013
2 2022-07-15 07:28:56.912013
3 2022-07-15 10:28:56.912013
4 2022-07-15 13:28:56.912013
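For reference, the same regular grid can be built without an explicit loop using pd.date_range (a sketch with a fixed start time so the output is reproducible; note that date_range includes the start itself, while the loop above begins at start + 3 hours):

```python
import pandas as pd

start = pd.Timestamp('2022-07-15 01:00')
end = start + pd.Timedelta(hours=15)
# closed grid from start to end, every 3 hours
df = pd.DataFrame({'Times': pd.date_range(start, end, freq='3H')})
print(df)
```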
Related
I need to check if an entry is within a person's shift:
The data looks like this:
timestamp = pd.DataFrame({
'Timestamp': ['01/02/2022 16:08:56','01/02/2022 16:23:31','01/02/2022 16:41:35','02/02/2022 16:57:41','02/02/2022 17:38:22','02/02/2022 17:50:56'],
'Person': ['A','B','A','B','B','A']
})
shift = pd.DataFrame({
'Date': ['01/02/2022','02/02/2022','01/02/2022','02/02/2022'],
'in':['13:00:00','13:00:00','14:00:00','14:00:00'],
'out': ['21:00:00','21:00:00','22:00:00','22:00:00'],
'Person': ['A','A','B','B']
})
For this kind of merge, an efficient method is to use merge_asof:
timestamp['Timestamp'] = pd.to_datetime(timestamp['Timestamp'])
(pd.merge_asof(timestamp.sort_values(by='Timestamp'),
shift.assign(Timestamp=pd.to_datetime(shift['Date']+' '+shift['in']),
ts_out=pd.to_datetime(shift['Date']+' '+shift['out']),
).sort_values(by='Timestamp')
[['Person', 'Timestamp', 'ts_out']],
on='Timestamp', by='Person'
)
.assign(in_shift=lambda d: d['ts_out'].ge(d['Timestamp']))
.drop(columns=['ts_out'])
)
output:
Timestamp Person in_shift
0 2022-01-02 16:08:56 A True
1 2022-01-02 16:23:31 B True
2 2022-01-02 16:41:35 A True
3 2022-02-02 16:57:41 B True
4 2022-02-02 17:38:22 B True
5 2022-02-02 17:50:56 A True
I assume that there is only one shift per person per day.
First I split the day and time from the timestamp dataframe. Then merge this with the shift dataframe on columns Person and Date. Then we only need to check whether the time from timestamp is between in and out.
timestamp[['Date', 'Time']] = timestamp.Timestamp.str.split(' ', n=1, expand=True)
df_merge = timestamp.merge(shift, on=['Date', 'Person'])
df_merge['Timestamp_in_shift'] = (df_merge.Time <= df_merge.out) & (df_merge.Time >= df_merge['in'])
df_merge.drop(columns=['Date', 'Time'])
Output:
Timestamp Person in out Timestamp_in_shift
0 01/02/2022 16:08:56 A 13:00:00 21:00:00 True
1 01/02/2022 16:41:35 A 13:00:00 21:00:00 True
2 01/02/2022 16:23:31 B 14:00:00 22:00:00 True
3 02/02/2022 16:57:41 B 14:00:00 22:00:00 True
4 02/02/2022 17:38:22 B 14:00:00 22:00:00 True
5 02/02/2022 17:50:56 A 13:00:00 21:00:00 True
I have a dataframe with minute frequency covering the time windows 09:30:00-11:30:00 and 13:00:00-15:00:00 (generated by generate_data() below).
I want to resample it by elapsed hours, but each period of time should be closed on both ends.
My approach is the resample() function:
Add a little time to the left.
Resample to 30T.
Add a small time to 10:00:00 and 11:00:00.
Separate morning and afternoon data.
The data in the morning is resampled to 30T again, and the data in the afternoon is resampled to H.
Combine morning and afternoon data.
Can my resample() function be simplified?
My code is as follows:
import pandas as pd
import numpy as np
import datetime
def generate_data():
    datetime1 = pd.date_range('2021-12-01 09:30:00', '2021-12-01 15:00:00', freq='t')
    df = pd.DataFrame(datetime1, columns=['datetime'])
    df['time'] = df.datetime.dt.strftime('%H:%M:%S')
    df = df[(df.time <= '11:30:00') | (df.time >= '13:00:00')].reset_index(drop=True)
    np.random.seed(2021)
    df['close'] = np.random.random(len(df)) * 100
    return df
def resample(df):
    df.loc[df['time'].isin(['09:30:00', '13:00:00']), 'datetime'] = df.loc[
        df['time'].isin(['09:30:00', '13:00:00']), 'datetime'].apply(lambda x: x + datetime.timedelta(seconds=0.001))
    df = df.set_index('datetime')
    df = df.resample('30T', closed='right', label='right').last()
    df = df.dropna(subset=['close'])
    df = df.reset_index()
    df.loc[:, 'time'] = df.datetime.dt.strftime('%H:%M:%S')
    df.loc[df['time'].isin(['10:00:00', '11:00:00']), 'datetime'] = df.loc[
        df['time'].isin(['10:00:00', '11:00:00']), 'datetime'].apply(lambda x: x + datetime.timedelta(seconds=0.001))
    df = df.set_index('datetime')
    df1 = df[df.time < '12:00:00']
    df2 = df[df.time > '12:00:00']
    df1 = df1.resample('30T', closed='right', label='right').last()
    df2 = df2.resample('H', closed='right', label='right').last()
    df = pd.concat([df1, df2])
    df = df.dropna(subset=['close'])
    df = df.reset_index()
    return df
def main():
    df = generate_data()
    print('\nBefore resample:')
    print(df)
    df = resample(df)
    print('\nAfter resample:')
    print(df)

if __name__ == '__main__':
    main()
Before resample:
datetime time close
0 2021-12-01 09:30:00 09:30:00 60.597828
1 2021-12-01 09:31:00 09:31:00 73.336936
2 2021-12-01 09:32:00 09:32:00 13.894716
3 2021-12-01 09:33:00 09:33:00 31.267308
4 2021-12-01 09:34:00 09:34:00 99.724328
.. ... ... ...
237 2021-12-01 14:56:00 14:56:00 98.774245
238 2021-12-01 14:57:00 14:57:00 67.903063
239 2021-12-01 14:58:00 14:58:00 40.640360
240 2021-12-01 14:59:00 14:59:00 50.995722
241 2021-12-01 15:00:00 15:00:00 88.935107
[242 rows x 3 columns]
After resample:
datetime time close
0 2021-12-01 10:30:00 10:30:00 6.383650
1 2021-12-01 11:30:00 11:30:00 26.667989
2 2021-12-01 14:00:00 14:00:00 19.255257
3 2021-12-01 15:00:00 15:00:00 88.935107
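The closed='right', label='right' behavior used throughout resample() can be checked on a toy series: each hourly bin covers (label - 1H, label] and .last() takes the last value inside it.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3],
              index=pd.date_range('2021-12-01 09:30:00', periods=4, freq='30T'))
out = s.resample('H', closed='right', label='right').last()
# the (09:00, 10:00] bin holds 09:30 and 10:00; the (10:00, 11:00] bin holds 10:30 and 11:00
print(out)
```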
Let's say I have a dataframe like this
df1:
datetime1 datetime2
0 2021-05-09 19:52:14 2021-05-09 20:52:14
1 2021-05-09 19:52:14 2021-05-09 21:52:14
2 NaN NaN
3 2021-05-09 16:30:14 NaN
4 NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14
I want to compare the timestamps in datetime1 and datetime2 and create a new column with the difference between them.
In some cases both datetime1 and datetime2 are missing, or datetime1 has a value but datetime2 does not. Is there a way to get NaN in the "Difference" column when both timestamps are missing, and, when only datetime1 is present, to compute the difference against datetime.now() and put that in another column?
Desirable df output:
datetime1 datetime2 Difference in H:m:s Compared with datetime.now()
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaN NaN NaN NaN
3 2021-05-09 16:30:14 NaN NaN e.g(04:00:00)
4 NaN NaN NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN
I tried a solution from #AndrejKesely, but it fails if there is no timestamp in datetime1 and datetime2:
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
# if datetime1/datetime2 aren't already datetime, apply `.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])
df["Difference in H:m:s"] = df.apply(
lambda x: strfdelta(
x["datetime2"] - x["datetime1"],
"{hours:02d}:{minutes:02d}:{seconds:02d}",
),
axis=1,
)
print(df)
Select only the rows matching each condition using boolean indexing (masks), and let pandas fill the missing values with NaN:
def strfdelta(td: pd.Timedelta):
    seconds = td.total_seconds()
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"
bm1 = df["datetime1"].notna() & df["datetime2"].notna()
bm2 = df["datetime1"].notna() & df["datetime2"].isna()
df["Difference in H:m:s"] = (df.loc[bm1, "datetime2"] - df.loc[bm1, "datetime1"]).apply(strfdelta)
df["Compared with datetime.now()"] = (datetime.now() - df.loc[bm2, "datetime1"]).apply(strfdelta)
>>> df
datetime1 datetime2 Diff... Comp...
0 2021-05-09 19:52:14 2021-05-09 20:52:14 01:00:00 NaN
1 2021-05-09 19:52:14 2021-05-09 21:52:14 02:00:00 NaN
2 NaT NaT NaN NaN
3 2021-05-09 16:30:14 NaT NaN 103:09:19
4 NaT NaT NaN NaN
5 2021-05-09 12:30:14 2021-05-09 14:30:14 02:00:00 NaN
You could start by replacing all NaN values in the datetime2 column with the current datetime. That makes it easier to compare datetime1 to now when datetime2 is missing.
You can do it with:
df["datetime2"] = df["datetime2"].fillna(value=pd.to_datetime('today'))
Then you have only two conditions remaining:
If the datetime1 column is empty, the result is NaN.
Otherwise, the result is the difference between the datetime1 and datetime2 columns (as there is no NaN remaining in the datetime2 column).
You can perform this with:
import numpy as np
df["Difference in H:m:s"] = np.where(
df["datetime1"].isnull(),
pd.NA,
df["datetime2"] - df["datetime1"]
)
You can finally format your Difference in H:m:s column with the function you provided (guarding against the NaN rows):
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

df["Difference in H:m:s"] = df.apply(
    lambda x: strfdelta(
        x["Difference in H:m:s"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ) if pd.notna(x["Difference in H:m:s"]) else pd.NA,
    axis=1,
)
The complete code is:
import pandas as pd
import numpy as np

# if datetime1/datetime2 aren't already datetime, apply `.to_datetime()`:
df["datetime1"] = pd.to_datetime(df["datetime1"])
df["datetime2"] = pd.to_datetime(df["datetime2"])
df["datetime2"] = df["datetime2"].fillna(value=pd.to_datetime('today'))
df["Difference in H:m:s"] = np.where(
    df["datetime1"].isnull(),
    pd.NA,
    df["datetime2"] - df["datetime1"]
)

def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)

df["Difference in H:m:s"] = df.apply(
    lambda x: strfdelta(
        x["Difference in H:m:s"],
        "{hours:02d}:{minutes:02d}:{seconds:02d}",
    ) if pd.notna(x["Difference in H:m:s"]) else pd.NA,
    axis=1,
)
Still trying to learn Pandas. Let's assume a dataframe includes the start and end of an event for an event type, and the event's Val. Here is an example:
>>> df = pd.DataFrame({ 'start': ["11:00","13:00", "14:00"], 'end': ["12:00","14:00", "15:00"], 'event_type':[1,2,3], 'Val':['a','b','c']})
>>> df['start'] = pd.to_datetime(df['start'])
>>> df['end'] = pd.to_datetime(df['end'])
>>> df
start end event_type Val
0 2021-03-05 11:00:00 2021-03-05 12:00:00 1 a
1 2021-03-05 13:00:00 2021-03-05 14:00:00 2 b
2 2021-03-05 14:00:00 2021-03-05 15:00:00 3 c
What is the best way, for example, to find the corresponding value in the Val column for an event of event_type 1 that starts at 11:10 and ends at 11:30? For this example, since the start and end times fall within the first row of the df, it should return a.
Try pd.IntervalIndex.from_arrays
df.index = pd.IntervalIndex.from_arrays(left = df.start, right = df.end)
Output like below:
df.loc['11:30']
Out[73]:
start 2021-03-05 11:00:00
end 2021-03-05 12:00:00
event_type 1
Val a
Name: (2021-03-05 11:00:00, 2021-03-05 12:00:00], dtype: object
df.loc['11:30','Val']
Out[75]: 'a'
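Since the query event has both a start (11:10) and an end (11:30), a slightly more explicit check is to require that both endpoints fall in the same interval. IntervalIndex.contains returns a boolean mask per endpoint (a sketch of the question's data with explicit dates):

```python
import pandas as pd

df = pd.DataFrame({'start': pd.to_datetime(['2021-03-05 11:00', '2021-03-05 13:00', '2021-03-05 14:00']),
                   'end': pd.to_datetime(['2021-03-05 12:00', '2021-03-05 14:00', '2021-03-05 15:00']),
                   'event_type': [1, 2, 3],
                   'Val': ['a', 'b', 'c']})
idx = pd.IntervalIndex.from_arrays(df['start'], df['end'], closed='both')
start, end = pd.Timestamp('2021-03-05 11:10'), pd.Timestamp('2021-03-05 11:30')
hit = idx.contains(start) & idx.contains(end)  # True only where both endpoints are inside
print(df.loc[hit, 'Val'])
```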
I have 2 datasets:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0516
2018-01-01 00:01:00 1.0516
2018-01-01 00:02:00 1.0516
2018-01-01 00:03:00 1.0516
2018-01-01 00:04:00 1.0516
....
# df2 - daily based dataset
date_from date_to
2018-01-01 2018-01-01 02:21:00
2018-01-02 2018-01-02 01:43:00
2018-01-03 NA
2018-01-04 2018-01-04 03:11:00
2018-01-05 2018-01-05 00:19:00
For each pair of date_from and date_to in df2, I want to grab the minimum (low) value of Open in df1 and put it in a new column in df2 called min_value.
df1 is a minute based sorted dataset.
For the NA in date_to in df2, we can skip those row entirely and move to the next row.
What did I do?
Firstly, I tried to find values between the two dates. After that I used this code:
df2['min_value'] = df1[df1['date'].dt.hour.between(df2['date_from'], df2['date_to'])].min()
I wanted to search between two dates, but I am not sure if that is how to do it. However, it does not work. Could you please help identify what I should do?
Does this work for you?
df1 = pd.DataFrame({'date':['2018-01-01 00:00:00', '2018-01-01 00:01:00', '2018-01-01 00:02:00', '2018-01-01 00:03:00','2018-01-01 00:04:00'],
'Open':[1.0516, 1.0516, 1.0516, 1.0516, 1.0516]})
df2 = pd.DataFrame({'date_from':['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04','2018-01-05'],
'date_to':['2018-01-01 02:21:00', '2018-01-02 01:43:00', np.nan,
'2018-01-04 03:11:00', '2018-01-05 00:19:00']})
## converting to datetime
df1['date'] = pd.to_datetime(df1['date'])
df1.set_index('date', inplace=True)
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])
def func(val):
    minimum_val = np.nan
    minimum_date = np.nan
    # skip rows where either bound is missing
    if val['date_from'] is not pd.NaT and val['date_to'] is not pd.NaT:
        minimum_val = df1[val['date_from']:val['date_to']]['Open'].min()
        if pd.notna(minimum_val):
            minimum_date = df1[val['date_from']:val['date_to']].reset_index().head(1)['date'].values[0]
    return pd.DataFrame({'date_from': [val['date_from']], 'date_to': [val['date_to']],
                         'Open': [minimum_val], 'min_date': [minimum_date]})
df3=pd.concat(list(df2.apply(func, axis=1)))
The following code snippet is more readable.
import pandas as pd
def get_minimum_value(row, df):
    temp = df[(df['date'] >= row['date_from']) & (df['date'] <= row['date_to'])]
    return temp['value'].min()
df1 = pd.read_csv("test.csv")
df2 = pd.read_csv("test2.csv")
df1['date'] = pd.to_datetime(df1['date'])
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])
df2['value'] = df2.apply(func=get_minimum_value, df=df1, axis=1)
Here df2.apply() sends each row as the first argument to the get_minimum_value function. Applying this to your given data, the result is:
date_from date_to value
0 2018-01-01 2018-01-01 02:21:00 1.0512
1 2018-01-02 2018-01-02 01:43:00 NaN
2 2018-01-03 NaT NaN
3 2018-01-04 2018-01-04 03:11:00 NaN
4 2018-01-05 2018-01-05 00:19:00 NaN