Merging two different datetime columns to form a sequence - python

I have two data sets: one with weekly datetimes and the other with hourly datetimes.
My data sets look like this:
df1
Week_date w_values
21-04-2019 20:00:00 10
28-04-2019 20:00:00 20
05-05-2019 20:00:00 30
df2
hour_date h_values
19-04-2019 08:00:00 a
21-04-2019 07:00:00 b
21-04-2019 20:00:00 c
22-04-2019 06:00:00 d
23-04-2019 05:00:00 e
28-04-2019 19:00:00 f
28-04-2019 20:00:00 g
28-04-2019 21:00:00 h
29-04-2019 20:00:00 i
05-05-2019 20:00:00 j
06-05-2019 23:00:00 k
I tried merging but failed to get the desired output.
The output data set should look like this:
week_date w_values hour_date h_values
21-04-2019 20:00:00 10 21-04-2019 20:00:00 c
21-04-2019 20:00:00 10 22-04-2019 06:00:00 d
21-04-2019 20:00:00 10 23-04-2019 05:00:00 e
21-04-2019 20:00:00 10 28-04-2019 19:00:00 f
28-04-2019 20:00:00 20 28-04-2019 20:00:00 g
28-04-2019 20:00:00 20 28-04-2019 21:00:00 h
28-04-2019 20:00:00 20 29-04-2019 20:00:00 i
05-05-2019 20:00:00 30 05-05-2019 20:00:00 j
05-05-2019 20:00:00 30 06-05-2019 23:00:00 k
The weekly date should change only when the week date equals the hour date; otherwise each hourly row keeps the previous week date.

Use the merge_asof function. From the pandas documentation: "This merge is similar to a left-join except that we match on nearest key rather than equal keys."
df_week['Week_date'] = pd.to_datetime(df_week['Week_date'])
df_hour['hour_date'] = pd.to_datetime(df_hour['hour_date'])
df_week_sort = df_week.sort_values(by='Week_date')
df_hour_sort = df_hour.sort_values(by='hour_date')
df_week_sort.rename(columns={'Week_date': 'Merge_date'}, inplace=True)
df_hour_sort.rename(columns={'hour_date': 'Merge_date'}, inplace=True)
df_merged = pd.merge_asof(df_hour_sort, df_week_sort, on='Merge_date')
Make sure that the two frames are sorted by the date stamp.
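As a variation, merge_asof also accepts left_on/right_on, so the rename step isn't strictly needed. A sketch on toy frames mirroring a subset of the question's data:

```python
import pandas as pd

# Toy frames with a subset of the question's rows (illustrative only)
df_week = pd.DataFrame({
    "Week_date": pd.to_datetime(
        ["2019-04-21 20:00", "2019-04-28 20:00", "2019-05-05 20:00"]),
    "w_values": [10, 20, 30],
})
df_hour = pd.DataFrame({
    "hour_date": pd.to_datetime(
        ["2019-04-19 08:00", "2019-04-21 20:00",
         "2019-04-22 06:00", "2019-04-28 20:00"]),
    "h_values": ["a", "c", "d", "g"],
})

# direction="backward" (the default) takes the most recent weekly stamp
# at or before each hourly stamp; rows before the first weekly stamp
# get NaT/NaN and can be dropped
merged = pd.merge_asof(
    df_hour.sort_values("hour_date"),
    df_week.sort_values("Week_date"),
    left_on="hour_date", right_on="Week_date",
    direction="backward",
).dropna(subset=["Week_date"])
print(merged)
```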

The following should do (provided Week_date and hour_date are datetimes):
(df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
    .ffill()
    .dropna())
The way it works
Make sure both dfs are sorted
>>> df1 = df1.sort_values('Week_date')
>>> df2 = df2.sort_values('hour_date')
Do the merge
>>> df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d NaT NaN
4 2019-04-23 05:00:00 e NaT NaN
5 2019-04-28 19:00:00 f NaT NaN
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h NaT NaN
8 2019-04-29 20:00:00 i NaT NaN
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k NaT NaN
Forward fill the gaps
>>> df3 = df3.ffill()
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0
Remove the remaining NaNs
>>> df3 = df3.dropna()
>>> df3
hour_date h_values Week_date w_values
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-06-05 23:00:00 k 2019-05-05 20:00:00 30.0

Related

Add hours to year-month-day data in pandas data frame

I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the current DateStamp column is missing the hours. How can I add 00:00:00 to 2017-01-01 at index 0 (so it becomes 2017-01-01 00:00:00), then 01:00:00 to 2017-01-01 at index 1 (so it becomes 2017-01-01 01:00:00), and so on, so that every day has hours 0 through 23? Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a within-day counter, convert it to hours with to_timedelta, and add it to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='h')
print(df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
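A self-contained run of the same idea on a toy two-day frame (the data here is illustrative, not the question's full year):

```python
import pandas as pd

# Two days, three hourly readings each, but the hours are missing
df = pd.DataFrame({
    "DateStamp": ["2017-01-01"] * 3 + ["2017-01-02"] * 3,
    "DK1": [20.96, 20.90, 18.13, 25.56, 11.02, 7.32],
})
df["DateStamp"] = pd.to_datetime(df["DateStamp"])

# cumcount numbers the rows within each day 0, 1, 2, ...;
# converting that counter to a timedelta of hours shifts each row
df["DateStamp"] += pd.to_timedelta(df.groupby("DateStamp").cumcount(), unit="h")
print(df["DateStamp"].tolist())
```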

Grouping multiple datetime dataframe rows into a single one

I have a dataframe with some datetime intervals, which I am trying to squash into single events.
I have the start and end time, alongside the respective duration, for every event.
What I have
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
What I'm trying to squash it into
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
4 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
I looked for grouping and merging options in pandas but I didn't manage to do what I want.
Use GroupBy.agg with Series.dt.date:
new_df = (df.groupby(df['end_time'].dt.date, as_index=False)
            .agg({'start_time': 'first',
                  'end_time': 'last',
                  'duration': 'sum',
                  'id': 'first'}))
print(new_df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 05:00:00 2020-01-04 21:00:00 150 E
Note that grouping by date alone also merges the stand-alone event I into the second row, so it does not fully reproduce the desired output.
You can use DataFrame.shift to compare end times with the shifted start times and set any identical pairs to null:
df['flag'] = df['start_time'].shift(-1)
df.loc[df['end_time'] == df['flag'], 'flag'] = pd.NaT
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A NaT
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B NaT
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C NaT
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E NaT
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F NaT
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G NaT
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I NaT
Then use DataFrame.bfill to backfill those nulls with the next start time that breaks your interval condition. You'll need to fill the final row's value manually (any unique label works; here the previous row's end time is used).
df['flag'] = df['flag'].bfill().fillna(df['end_time'].iloc[-2])
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A 2020-01-04 05:00:00
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B 2020-01-04 05:00:00
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C 2020-01-04 05:00:00
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E 2020-01-04 20:30:00
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F 2020-01-04 20:30:00
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G 2020-01-04 20:30:00
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I 2020-01-04 07:00:00
Now do as ansev suggested:
df = (df.groupby('flag')
        .agg({'start_time': 'first', 'end_time': 'last',
              'duration': 'sum', 'id': 'first'})
        .reset_index(drop=True))
print(df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
2 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
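A common alternative pattern, sketched here on the question's toy data, is to build group ids directly: flag a new block wherever a row's start_time differs from the previous row's end_time, then cumulative-sum the flags and aggregate per block:

```python
import pandas as pd

df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:30", "2020-01-01 01:00",
        "2020-01-01 01:30", "2020-01-04 05:00", "2020-01-04 05:30",
        "2020-01-04 06:00", "2020-01-04 06:30", "2020-01-04 20:30"]),
    "end_time": pd.to_datetime([
        "2020-01-01 00:30", "2020-01-01 01:00", "2020-01-01 01:30",
        "2020-01-01 02:00", "2020-01-04 05:30", "2020-01-04 06:00",
        "2020-01-04 06:30", "2020-01-04 07:00", "2020-01-04 21:00"]),
    "duration": [30] * 9,
    "id": list("ABCDEFGHI"),
})

# A new block starts wherever this row does not begin where the previous ended
new_block = df["start_time"].ne(df["end_time"].shift())
squashed = (df.groupby(new_block.cumsum(), as_index=False)
              .agg({"start_time": "first", "end_time": "last",
                    "duration": "sum", "id": "first"}))
print(squashed)
```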

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items all fall within a row (i.e. between a start and end time) of the dataframe. Is there an easy way to locate the rows whose timeframe contains an alarm time? (sorry for the poor wording!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Herein a flag would go against line (well, index) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
You named your columns start_date and end_Date, but in your for loop you used start_time and end_time.
try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag'] == 'Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
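When there are many alarms, a loop-free variant (a sketch, not from either answer above) is to map each alarm to its containing interval positionally with Index.get_indexer on an IntervalIndex:

```python
import pandas as pd

alarms = pd.to_datetime(["2019-07-18 14:56:21", "2019-07-19 15:05:15"])

# Toy runs table: contiguous 3-hour windows (illustrative, not the
# question's exact 33-period ranges)
runs = pd.DataFrame(
    {"start_date": pd.date_range("2019-07-18", periods=16, freq="3h")})
runs["end_Date"] = runs["start_date"] + pd.Timedelta(hours=3)

# With non-overlapping intervals, get_indexer returns, for each alarm,
# the positional row whose [start, end) window contains it
intervals = pd.IntervalIndex.from_arrays(runs["start_date"], runs["end_Date"],
                                         closed="left")
rows = intervals.get_indexer(alarms)
runs.loc[rows, "Flag"] = "Alarm"
print(rows)
```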

conditional merging with datetime index

df1 looks like this-
week_date Values
21-04-2019 00:00:00 10
28-04-2019 00:00:00 20
df2 looks like this-
hourly_date hour_val
21-04-2019 00:00:00 a
21-04-2019 01:00:00 b
21-04-2019 02:00:00 c
21-04-2019 03:00:00 d
28-04-2019 00:00:00 e
resultant dataset should look like this
week_date Values hourly_date hour_val
21-04-2019 00:00:00 10 21-04-2019 00:00:00 a
21-04-2019 00:00:00 10 21-04-2019 01:00:00 b
21-04-2019 00:00:00 10 21-04-2019 02:00:00 c
21-04-2019 00:00:00 10 21-04-2019 03:00:00 d
28-04-2019 00:00:00 20 28-04-2019 00:00:00 e
I have hundreds of weekly rows and thousands of hourly rows.
I tried merging but am not getting the desired output.
merge=pd.merge(df1,df2, how='outer', left_index=True, right_index=True)
You can merge on year and week in this case, try something like:
import pandas as pd
df1 = pd.DataFrame(
    {
        "week_date": ["21-04-2019 00:00:00", "28-04-2019 00:00:00"],
        "Values": [10, 20]
    }
)
df2 = pd.DataFrame(
    {
        "hourly_date": [
            "21-04-2019 00:00:00",
            "21-04-2019 01:00:00",
            "21-04-2019 02:00:00",
            "21-04-2019 03:00:00",
            "28-04-2019 00:00:00"
        ],
        "hour_val": ["a", "b", "c", "d", "e"]
    }
)
df1.week_date = pd.to_datetime(df1.week_date)
df1 = df1.set_index("week_date", drop=False)
df2.hourly_date = pd.to_datetime(df2.hourly_date)
df2 = df2.set_index("hourly_date", drop=False)
# Series.dt.week was removed in pandas 2.0; use isocalendar().week instead
pd.merge(df1, df2,
         left_on=[df1.week_date.dt.isocalendar().week, df1.week_date.dt.year],
         right_on=[df2.hourly_date.dt.isocalendar().week, df2.hourly_date.dt.year]
         )[["week_date", "Values", "hourly_date", "hour_val"]].set_index("week_date")
This outputs:
Values hourly_date hour_val
week_date
2019-04-21 10 2019-04-21 00:00:00 a
2019-04-21 10 2019-04-21 01:00:00 b
2019-04-21 10 2019-04-21 02:00:00 c
2019-04-21 10 2019-04-21 03:00:00 d
2019-04-28 20 2019-04-28 00:00:00 e
I am not getting the desired result.
My original data sets look like this:
data-1:
week_date value
2019-04-19 20:00:00 10
2019-04-26 20:00:00 20
data-2:
hourly_date hour_val
2019-04-26 01:00:00 a
2019-04-26 02:00:00 b
2019-04-26 03:00:00 c
2019-04-26 20:00:00 d
2019-04-26 21:00:00 e
and the desired output should be-
Values hourly_date hour_val
week_date
2019-04-19 20:00:00 10 2019-04-26 01:00:00 a
2019-04-19 20:00:00 10 2019-04-26 02:00:00 b
2019-04-19 20:00:00 10 2019-04-26 03:00:00 c
2019-04-26 20:00:00 20 2019-04-26 20:00:00 d
2019-04-26 20:00:00 20 2019-04-26 21:00:00 e
Meaning the weekly datetime changes only when it equals the hourly datetime; otherwise week_date carries the previous weekly value.
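That "carry the previous weekly value" rule is exactly what an as-of merge does; a minimal sketch with the follow-up's data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "week_date": pd.to_datetime(
        ["2019-04-19 20:00:00", "2019-04-26 20:00:00"]),
    "Values": [10, 20],
})
df2 = pd.DataFrame({
    "hourly_date": pd.to_datetime([
        "2019-04-26 01:00:00", "2019-04-26 02:00:00", "2019-04-26 03:00:00",
        "2019-04-26 20:00:00", "2019-04-26 21:00:00"]),
    "hour_val": list("abcde"),
})

# For each hourly row, take the latest weekly row at or before it
out = pd.merge_asof(df2.sort_values("hourly_date"),
                    df1.sort_values("week_date"),
                    left_on="hourly_date", right_on="week_date")
print(out[["week_date", "Values", "hourly_date", "hour_val"]])
```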

Extend dataframe by adding start and end date and fill it with timestamps and NaN

I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it by time in ascending order, add a start and an end date at the top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
and finally I would like to extend the dataset to cover all hours from start to end in one hour steps, filling the dataframe with missing timestamps containing 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data': temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp', inplace=True)
# ts1, ts2 are the start and end bounds, e.g. '2012-06-01' and '2012-06-02'
df3 = pd.DataFrame({'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp', inplace=True)
print(df3)
merged = pd.concat([df3, df2])  # DataFrame.append was removed in pandas 2.0
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data till the first valid value?
Help is much appreciated. Thanks a lot in advance
First create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index=pd.date_range('2012-06-01', '2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index=True, right_index=True, how='left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
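As for why asfreq stopped at 10:00: asfreq builds its new range from the first and last entries of the index, and since merged was not sorted, its last entry was likely 2012-06-01 10:00:00. An alternative to the merge, sketched below on a toy version of the question's frame, is to sort and reindex onto the full hourly range:

```python
import pandas as pd

# Toy version of the question's frame, unsorted on purpose
df = pd.DataFrame(
    {"data": [9.0, 8.0, 9.0, 9.0]},
    index=pd.to_datetime(["2012-06-01 17:00", "2012-06-01 20:00",
                          "2012-06-01 13:00", "2012-06-01 10:00"]),
)

# reindex aligns on the full hourly range; hours missing from the
# original index come back as NaN rows
full = pd.date_range("2012-06-01", "2012-06-02", freq="h")
out = df.sort_index().reindex(full)
print(out)
```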
