Pandas dataframe with hourly data: Calculating sums for specific times - python

There is a dataframe with hourly data, e.g.:
DATE TIME Amount
2022-11-07 21:00:00 10
2022-11-07 22:00:00 11
2022-11-08 07:00:00 10
2022-11-08 08:00:00 13
2022-11-08 09:00:00 12
2022-11-08 10:00:00 11
2022-11-08 11:00:00 13
2022-11-08 12:00:00 12
2022-11-08 13:00:00 10
2022-11-08 14:00:00 9
...
I would like to add a new column sum_morning where I calculate the sum of "Amount" for the morning hours only (07:00 - 12:00):
DATE TIME Amount sum_morning
2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 12 71
2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 9 NaN
...
There can be gaps in the dataframe (e.g. from 22:00 to 07:00), so shift probably won't work here.
I thought about:
- creating a new dataframe where I filter all time slices from 07:00 - 12:00 for all dates,
- doing a group by and calculating the sum for each day,
- and then merging this back to the original df (a sketch of this idea follows below).
But maybe there is a more effective solution?
I really enjoy working with Python / pandas, but hourly data still makes my head spin.
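For reference, the three-step approach described above works fine; here is a minimal sketch of it (rebuilding the sample data, with DATE and TIME assumed to be strings as shown):
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2022-11-07'] * 2 + ['2022-11-08'] * 8,
    'TIME': ['21:00:00', '22:00:00', '07:00:00', '08:00:00', '09:00:00',
             '10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00'],
    'Amount': [10, 11, 10, 13, 12, 11, 13, 12, 10, 9],
})

# 1. filter all time slices from 07:00 to 12:00 (string comparison works
#    here because the times are zero-padded HH:MM:SS strings)
morning = df[df['TIME'].between('07:00:00', '12:00:00')]

# 2. group by day and calculate the sum
sums = morning.groupby('DATE')['Amount'].sum().rename('sum_morning')

# 3. merge this back to the original df
df = df.merge(sums, on='DATE', how='left')
Note that this places the day's sum on every row of that day rather than only on the 12:00 row as in the desired output; the answers below show how to restrict it to the last morning timestamp.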

First set a DatetimeIndex in order to use DataFrame.between_time, then group by DATE and aggregate by sum. Finally, take the last datetime per day in order to match the index of the original DataFrame:
df.index = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])

s = (df.between_time('7:00', '12:00')
       .reset_index()
       .groupby('DATE')
       .agg({'Amount': 'sum', 'index': 'last'})
       .set_index('index')['Amount'])

df['sum_morning'] = s
print(df)
DATE TIME Amount sum_morning
2022-11-07 21:00:00 2022-11-07 21:00:00 10 NaN
2022-11-07 22:00:00 2022-11-07 22:00:00 11 NaN
2022-11-08 07:00:00 2022-11-08 07:00:00 10 NaN
2022-11-08 08:00:00 2022-11-08 08:00:00 13 NaN
2022-11-08 09:00:00 2022-11-08 09:00:00 12 NaN
2022-11-08 10:00:00 2022-11-08 10:00:00 11 NaN
2022-11-08 11:00:00 2022-11-08 11:00:00 13 NaN
2022-11-08 12:00:00 2022-11-08 12:00:00 12 71.0
2022-11-08 13:00:00 2022-11-08 13:00:00 10 NaN
2022-11-08 14:00:00 2022-11-08 14:00:00 9 NaN
Lastly, if you need to remove the DatetimeIndex, you can use:
df = df.reset_index(drop=True)

You can use:
# get values between 7 and 12h
m = pd.to_timedelta(df['TIME']).between('7h', '12h')

# find last True per day
idx = m & m.groupby(df['DATE']).shift(-1).ne(True)

# assign the sum of the 7-12h values on the last True per day
df.loc[idx, 'sum_morning'] = df['Amount'].where(m).groupby(df['DATE']).transform('sum')
Output:
DATE TIME Amount sum_morning
0 2022-11-07 21:00:00 10 NaN
1 2022-11-07 22:00:00 11 NaN
2 2022-11-08 07:00:00 10 NaN
3 2022-11-08 08:00:00 13 NaN
4 2022-11-08 09:00:00 12 NaN
5 2022-11-08 10:00:00 11 NaN
6 2022-11-08 11:00:00 13 NaN
7 2022-11-08 12:00:00 12 71.0
8 2022-11-08 13:00:00 10 NaN
9 2022-11-08 14:00:00 9 NaN
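The idx line is the subtle part: m.groupby(df['DATE']).shift(-1).ne(True) is True wherever the next row within the same day is not another True, so ANDing it with m keeps only the last True of each consecutive run within a day. A tiny illustration on toy data (not the frame above):
import pandas as pd

m = pd.Series([False, True, True, True, False])
day = pd.Series(['d1'] * 5)

last_true = m & m.groupby(day).shift(-1).ne(True)
print(last_true.tolist())  # [False, False, False, True, False]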

Related

Find Yesterday's High Price by Merging Two DF's on Datetime and Date Columns

I'm trying to merge two DataFrames: one has a datetime column, and the other has just a date column. My application for this is to find yesterday's high price using an OHLC dataset. I've attached some starter code below, but I'll describe what I'm looking for.
Given this intraday dataset:
time current_intraday_high
0 2022-02-11 09:00:00 1
1 2022-02-11 10:00:00 2
2 2022-02-11 11:00:00 3
3 2022-02-11 12:00:00 4
4 2022-02-11 13:00:00 5
5 2022-02-14 09:00:00 6
6 2022-02-14 10:00:00 7
7 2022-02-14 11:00:00 8
8 2022-02-14 12:00:00 9
9 2022-02-14 13:00:00 10
10 2022-02-15 09:00:00 11
11 2022-02-15 10:00:00 12
12 2022-02-15 11:00:00 13
13 2022-02-15 12:00:00 14
14 2022-02-15 13:00:00 15
15 2022-02-16 09:00:00 16
16 2022-02-16 10:00:00 17
17 2022-02-16 11:00:00 18
18 2022-02-16 12:00:00 19
19 2022-02-16 13:00:00 20
...and this daily dataframe:
time daily_high
0 2022-02-11 5
1 2022-02-14 10
2 2022-02-15 15
3 2022-02-16 20
...how can I merge them together, and have each row of the intraday dataframe contain the previous (business) day's high price, like so:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
(Note the NaN's at the top because we don't have any data for Feb 10, 2022 from the intraday dataset, and see how each row contains the intraday data, plus the PREVIOUS day's max "high" price.)
Minimal reproducible example code below:
import pandas as pd

###################################################
# CREATE MOCK INTRADAY DATAFRAME
###################################################
intraday_date_time = [
    "2022-02-11 09:00:00",
    "2022-02-11 10:00:00",
    "2022-02-11 11:00:00",
    "2022-02-11 12:00:00",
    "2022-02-11 13:00:00",
    "2022-02-14 09:00:00",
    "2022-02-14 10:00:00",
    "2022-02-14 11:00:00",
    "2022-02-14 12:00:00",
    "2022-02-14 13:00:00",
    "2022-02-15 09:00:00",
    "2022-02-15 10:00:00",
    "2022-02-15 11:00:00",
    "2022-02-15 12:00:00",
    "2022-02-15 13:00:00",
    "2022-02-16 09:00:00",
    "2022-02-16 10:00:00",
    "2022-02-16 11:00:00",
    "2022-02-16 12:00:00",
    "2022-02-16 13:00:00",
]
intraday_date_time = pd.to_datetime(intraday_date_time)

intraday_df = pd.DataFrame(
    {
        "time": intraday_date_time,
        "current_intraday_high": [x for x in range(1, 21)],
    },
)
print(intraday_df)
# intraday_df.to_csv('intradayTEST.csv', index=True)

###################################################
# AGGREGATE/UPSAMPLE TO DAILY DATAFRAME
###################################################
# Aggregate to business days using intraday_df
agg_dict = {'current_intraday_high': 'max'}
daily_df = intraday_df.set_index('time').resample('B').agg(agg_dict).reset_index()
daily_df.rename(columns={"current_intraday_high": "daily_high"}, inplace=True)
print(daily_df)
# daily_df.to_csv('dailyTEST.csv', index=True)

###################################################
# MERGE THE TWO DATAFRAMES
###################################################
# Need to merge the daily dataset to the intraday dataset, such that
# any row of the newly merged/joined/concat'd dataset will have:
#   1. The current intraday datetime in the 'time' column
#   2. The current 'intraday_high' value
#   3. The PREVIOUS DAY's 'daily_high' value
# This doesn't work, as the daily_df just gets appended to the bottom
# of the intraday_df due to the datetimes/dates merging:
merged_df = pd.merge(intraday_df, daily_df, how='outer', on='time')
print(merged_df)
pd.merge_asof allows you to easily do a merge like this.
yesterdays_high = (intraday_df.resample('B', on='time')['current_intraday_high'].max()
                              .shift()
                              .rename('yesterdays_high')
                              .reset_index())

merged_df = pd.merge_asof(intraday_df, yesterdays_high)
print(merged_df)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
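A note on what merge_asof does here: with the default direction='backward', each intraday timestamp is matched to the most recent yesterdays_high row at or before it, which is exactly the previous-business-day lookup once the daily maxima have been shifted. The join key is picked up from the shared time column; spelling it out is arguably clearer:
merged_df = pd.merge_asof(intraday_df, yesterdays_high, on='time', direction='backward')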
Given your already existing code, you can map the shifted values:
intraday_df['yesterdays_high'] = (intraday_df['time']
                                  .dt.date
                                  .map(daily_df['daily_high']
                                       .set_axis(daily_df['time'].shift(-1)))
                                  )
If you don't have all days and really want to map the real previous business day:
intraday_df['yesterdays_high'] = (intraday_df['time']
                                  .dt.date
                                  .map(daily_df['daily_high']
                                       .set_axis(daily_df['time'].add(pd.offsets.BusinessDay())))
                                  )
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
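The BusinessDay offset is what lets the second variant skip weekends, e.g.:
import pandas as pd

# Friday 2022-02-11 + 1 business day lands on Monday 2022-02-14
print(pd.Timestamp('2022-02-11') + pd.offsets.BusinessDay())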
We can use .dt.date as an index to join the two frames together on the same days. As for the previous day's high price, we can apply shift to daily_df:
intra_date = intraday_df['time'].dt.date
daily_date = daily_df['time'].dt.date

answer = intraday_df.set_index(intra_date).join(
    daily_df.set_index(daily_date)['daily_high'].shift()
).reset_index(drop=True)

How to fill nan values from a specific date range in a python time series?

I'm working with a time series recording the prices of a fish in the markets of a Brazilian city from 2013 to 2021. The original dataset has three columns: one with the cheapest price found, another with the most expensive one, and finally one with the average price found on the day the data was collected. I've made three subsets, one per column, indexed by date. Doing some exploratory analysis, I found that some specific months in 2013 and 2014 contain NaN values.
dfmin.loc['2013-4-1':'2013-7-31']
min
date
2013-04-01 12:00:00 16.0
2013-04-02 12:00:00 16.0
2013-05-22 12:00:00 NaN
2013-05-23 12:00:00 NaN
2013-05-24 12:00:00 NaN
2013-05-27 12:00:00 NaN
2013-05-28 12:00:00 NaN
2013-05-29 12:00:00 NaN
2013-05-30 12:00:00 NaN
2013-05-31 12:00:00 NaN
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.8
2013-06-20 12:00:00 15.8
2013-06-21 12:00:00 15.8
I want to fill these NaN values in month 05 with the average of the values from month 04 and month 06. How can I do that?
IIUC, you can use simple indexing:
# if needed, convert to datetime
# df.index = pd.to_datetime(df.index)
df.loc[df.index.month == 5, 'min'] = df.loc[df.index.month.isin([4, 6]), 'min'].mean()
or, if there are non-NaN values in the 5th month that you want to keep:
mask = df.index.month == 5
df.loc[mask, 'min'] = (df.loc[mask, 'min']
                       .fillna(df.loc[df.index.month.isin([4, 6]), 'min'].mean())
                       )
Output:
min
date
2013-04-01 12:00:00 16.00
2013-04-02 12:00:00 16.00
2013-05-22 12:00:00 15.88
2013-05-23 12:00:00 15.88
2013-05-24 12:00:00 15.88
2013-05-27 12:00:00 15.88
2013-05-28 12:00:00 15.88
2013-05-29 12:00:00 15.88
2013-05-30 12:00:00 15.88
2013-05-31 12:00:00 15.88
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.80
2013-06-20 12:00:00 15.80
2013-06-21 12:00:00 15.80

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column consecutive_hour such that if the value at a particular timestamp is less than 1000, it gets 3 (hours) in consecutive_hour, and consecutive such occurrences accumulate to 6, 9, etc., as above.
Lastly, I want to summarize the table, counting the occurrences of each consecutive-hours value across days, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours': [3, 6, 9, 12],
                           'number_of_day': [2, 0, 2, 0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame, and more. I've spent several days on this but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
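To see how the cumsum-with-reset trick works, here is the lambda unrolled on a toy boolean series (not the data above):
import pandas as pd

b = pd.Series([True, True, False, True, True, True])

total = b.cumsum()                            # 1 2 2 3 4 5
baseline = total.where(~b).ffill().fillna(0)  # 0 0 2 2 2 2 (total frozen at each False)
print(total.sub(baseline).astype(int).tolist())  # [1, 2, 0, 1, 2, 3]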
Summary table:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
               .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
               .rename("number_of_day") \
               .rename_axis("consecutive_hour") \
               .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
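The filter inside the .loc above selects the peak of each run of consecutive hours: rows where consecutive_hour is greater than the value that follows it (treating the end of the series as 0). A small illustration:
import pandas as pd

h = pd.Series([3, 0, 3, 6, 9, 0, 3])
peaks = (h - h.shift(-1).fillna(0)) > 0
print(h[peaks].tolist())  # [3, 9, 3]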

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (eg.
Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows whose timeframe contains a given alarm time? (sorry for the poor wording there!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())

# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
Here a flag would go against lines (well, indices) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
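If you only need the positions of the matching rows rather than a flag column, the same IntervalIndex can be queried directly (a sketch, reusing the intervals object built above; this assumes the intervals don't overlap):
pos = intervals.get_indexer(alarms)
print(pos)  # [ 4 13 21] for the sample data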
You were calling your columns start_date and end_Date, but in your for loop you used start_time and end_time.
Try this:
import pandas as pd

df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())

# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'

print(df[df['Flag'] == 'Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object

Extend dataframe by adding start and end date and fill it with timestamps and NaN

I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it by time, add a start and end date at the top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
and finally I would like to extend the dataset to cover all hours from start to end in one hour steps, filling the dataframe with missing timestamps containing 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data': temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp', inplace=True)

df3 = pd.DataFrame({'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp', inplace=True)
print(df3)

merged = df3.append(df2)
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data till the first valid value?
Help is much appreciated. Thanks a lot in advance
asfreq builds its new index from the first to the last entry of the existing index; since the appended frame wasn't sorted, the last entry was 2012-06-01 10:00:00, which is why the result stops there (a sketch of this fix follows the output below). Alternatively, first create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index=pd.date_range('2012-06-01', '2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index=True, right_index=True, how='left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
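As an aside, the original asfreq attempt can also be rescued: since asfreq spans from the first to the last index label, sorting the appended frame first is enough (a sketch, reusing merged from the question):
merged = merged.sort_index().asfreq('H')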
