Pandas: Find first occurrence - on daily basis in a timeseries - python

I'm struggling with this, so any input is appreciated. I want to iterate over the values in a dataframe column and return the first instance of a value on each day. Groupby looked like a good option for this, but df.groupby(grouper).first() with the grouper set to daily gives the following output.
In [95]:
df.groupby(grouper).first()
Out[95]:
test_1
2014-03-04 1.0
2014-03-05 1.0
This only gives the day on which the value was seen in test_1, without the time. What I need is for first() to reset on a daily basis (see desired output below).
I want to preserve the time this value was seen in the following format:
This is the input dataframe:
test_1
2014-03-04 09:00:00 NaN
2014-03-04 10:00:00 NaN
2014-03-04 11:00:00 NaN
2014-03-04 12:00:00 NaN
2014-03-04 13:00:00 NaN
2014-03-04 14:00:00 1.0
2014-03-04 15:00:00 NaN
2014-03-04 16:00:00 1.0
2014-03-05 09:00:00 1.0
This is the desired output:
test_1 test_output
2014-03-04 09:00:00 NaN NaN
2014-03-04 10:00:00 NaN NaN
2014-03-04 11:00:00 NaN NaN
2014-03-04 12:00:00 NaN NaN
2014-03-04 13:00:00 NaN NaN
2014-03-04 14:00:00 1.0 1.0
2014-03-04 15:00:00 NaN NaN
2014-03-04 16:00:00 1.0 NaN
2014-03-05 09:00:00 1.0 NaN
I just want to mark the time when an event first occurs in a new column named test_output.
Admins: please note that this question differs from the one marked as a duplicate, because it requires a first occurrence that resets every day.

Try this, using this data:
rng = pd.DataFrame( {'test_1': [None, None,None, None, 1,1, 1 , None, None, None,1 , None, None, None,]}, index = pd.date_range('4/2/2014', periods=14, freq='BH'))
rng
test_1
2014-04-02 09:00:00 NaN
2014-04-02 10:00:00 NaN
2014-04-02 11:00:00 NaN
2014-04-02 12:00:00 NaN
2014-04-02 13:00:00 1.0
2014-04-02 14:00:00 1.0
2014-04-02 15:00:00 1.0
2014-04-02 16:00:00 NaN
2014-04-03 09:00:00 NaN
2014-04-03 10:00:00 NaN
2014-04-03 11:00:00 1.0
2014-04-03 12:00:00 NaN
2014-04-03 13:00:00 NaN
2014-04-03 14:00:00 NaN
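Then mark the first occurrence per day (pd.Grouper replaces pd.TimeGrouper, which has been removed from pandas):
# idxmin returns the first label holding each day's minimum, i.e. the first 1.0 of the day
rng['test_output'] = rng['test_1'].loc[rng.groupby(pd.Grouper(freq='D'))['test_1'].idxmin()]
The output is this: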
test_1 test_output
2014-04-02 09:00:00 NaN NaN
2014-04-02 10:00:00 NaN NaN
2014-04-02 11:00:00 NaN NaN
2014-04-02 12:00:00 NaN NaN
2014-04-02 13:00:00 1.0 1.0
2014-04-02 14:00:00 1.0 NaN
2014-04-02 15:00:00 1.0 NaN
2014-04-02 16:00:00 NaN NaN
2014-04-03 09:00:00 NaN NaN
2014-04-03 10:00:00 NaN NaN
2014-04-03 11:00:00 1.0 1.0
2014-04-03 12:00:00 NaN NaN
2014-04-03 13:00:00 NaN NaN
2014-04-03 14:00:00 NaN NaN
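As a hedged alternative, the same marking can be done without idxmin, which may not cope well with days that contain no values at all. This is a minimal sketch, assuming a DatetimeIndex; the helper name mark_first_per_day is hypothetical and not part of pandas:

```python
import numpy as np
import pandas as pd


def mark_first_per_day(s):
    """Keep only the first non-NaN value of each calendar day; NaN elsewhere."""
    valid = s.notna()
    # Running count of non-NaN values, restarted at every calendar day.
    seen = valid.astype(int).groupby(s.index.normalize()).cumsum()
    return s.where(valid & (seen == 1))


# Two days of hourly data, mirroring the example above.
hour = pd.Timedelta(hours=1)
index = pd.date_range("2014-04-02 09:00", periods=8, freq=hour).append(
    pd.date_range("2014-04-03 09:00", periods=6, freq=hour)
)
values = [np.nan] * 4 + [1, 1, 1] + [np.nan] * 3 + [1] + [np.nan] * 3
rng = pd.DataFrame({"test_1": values}, index=index)
rng["test_output"] = mark_first_per_day(rng["test_1"])
```

Days without any value simply stay NaN in test_output, because the running count never reaches 1 for them.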

Related

How to fill nan values from a specific date range in a python time series?

I'm working with a time series of recorded prices for a fish in the markets of a Brazilian city from 2013 to 2021. The original dataset has three columns: one with the cheapest price found, another with the most expensive price, and a third with the average price found on the day the data was collected. I made three subsets, one per column, each indexed by date. While doing some exploratory analysis, I found that some months in 2013 and 2014 contain NaN values.
dfmin.loc['2013-4-1':'2013-7-31']
min
date
2013-04-01 12:00:00 16.0
2013-04-02 12:00:00 16.0
2013-05-22 12:00:00 NaN
2013-05-23 12:00:00 NaN
2013-05-24 12:00:00 NaN
2013-05-27 12:00:00 NaN
2013-05-28 12:00:00 NaN
2013-05-29 12:00:00 NaN
2013-05-30 12:00:00 NaN
2013-05-31 12:00:00 NaN
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.8
2013-06-20 12:00:00 15.8
2013-06-21 12:00:00 15.8
I want to fill the NaN values in month 05 with the mean of the prices from month 04 and month 06. How can I do that?
IIUC, you can use simple indexing:
# if needed, convert to datetime
#df.index = pd.to_datetime(df.index)
df.loc[df.index.month==5, 'min'] = df.loc[df.index.month.isin([4,6]), 'min'].mean()
Or, if month 5 also contains non-NaN values that should be kept:
mask = df.index.month==5
df.loc[mask, 'min'] = (df.loc[mask, 'min']
.fillna(df.loc[df.index.month.isin([4,6]), 'min'].mean())
)
output:
min
date
2013-04-01 12:00:00 16.00
2013-04-02 12:00:00 16.00
2013-05-22 12:00:00 15.88
2013-05-23 12:00:00 15.88
2013-05-24 12:00:00 15.88
2013-05-27 12:00:00 15.88
2013-05-28 12:00:00 15.88
2013-05-29 12:00:00 15.88
2013-05-30 12:00:00 15.88
2013-05-31 12:00:00 15.88
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.80
2013-06-20 12:00:00 15.80
2013-06-21 12:00:00 15.80
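Note that df.index.month == 5 matches May of every year in the index. As a hedged sketch, assuming a DatetimeIndex and that only one year should be touched, the masks can be restricted by year as well. The helper name and its parameters are illustrative, not part of pandas:

```python
import numpy as np
import pandas as pd


def fill_month_from_neighbours(df, column, year, month, neighbours):
    """Fill NaNs of one month with the mean of neighbouring months in the same year."""
    idx = df.index
    target = (idx.year == year) & (idx.month == month)
    source = (idx.year == year) & idx.month.isin(neighbours)
    fill_value = df.loc[source, column].mean()  # mean() skips NaNs
    out = df.copy()
    out.loc[target, column] = out.loc[target, column].fillna(fill_value)
    return out


# Tiny made-up frame: one gap in May 2013 and one in May 2014.
dfmin = pd.DataFrame(
    {"min": [16.0, np.nan, np.nan, 15.8, np.nan]},
    index=pd.to_datetime(
        ["2013-04-01", "2013-05-22", "2013-05-23", "2013-06-19", "2014-05-01"]
    ),
)
filled = fill_month_from_neighbours(dfmin, "min", 2013, 5, [4, 6])
```

Only the May 2013 rows are filled here; the May 2014 row is left as NaN.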

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When using fillna, you probably want a fill method, such as the previous/next value or the column mean. For example:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill() # forward fill as an example; fillna(method='ffill') is deprecated in recent pandas
nulls_after_fill = df[nulls_index]
Take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to tell pandas how you want to patch. It may be obvious to you that the "patch" dataframe's values should be used where the dates and times line up, but it isn't obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd
raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0), time(0,0,1)], temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
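This also suggests why the fillna call in the question left gaps: fillna with a DataFrame aligns on index labels, and the rows that stayed NaN have labels (48057 and up) that do not exist in rain_df (0 to 48047). Rows whose labels do exist are filled from whichever row shares that label, which is not necessarily the same date and time. A small sketch with made-up numbers, assuming the date and time columns identify a row:

```python
import numpy as np
import pandas as pd

# Made-up values; only the index labels and the date/time keys matter here.
raw = pd.DataFrame(
    {
        "date": ["2020-06-25", "2020-06-25"],
        "time": ["09:00:00", "10:00:00"],
        "rain_gauge_value": [np.nan, 0.2],
    },
    index=[48057, 48058],
)
patch = pd.DataFrame(
    {
        "date": ["2020-06-25", "2020-06-25"],
        "time": ["09:00:00", "10:00:00"],
        "rain_gauge_value": [0.5, 0.7],
    },
    index=[0, 1],
)

# Index labels 48057/48058 are absent from patch, so nothing is filled.
unfilled = raw.fillna(patch)

# Aligning on (date, time) instead lets fillna find the matching rows.
keys = ["date", "time"]
filled = raw.set_index(keys).fillna(patch.set_index(keys)).reset_index()
```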

How can I replace specific values in a time-series dataframe in pandas?

I have the dataframe below (date/time is a MultiIndex) and I want to replace the column values between 00:00:00 and 07:00:00 with a numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or, if the second level holds datetime.time objects:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, use numpy.tile to repeat it once for each unique value in the first level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution that generates the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr =np.tile(np.arange(1, 10),len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
The problem was finally found: my solution works with a one-column DataFrame, but when working with a Series you need to remove the trailing ,: from the indexer:
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
---^^^
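As a hedged alternative to slicers, the same per-day assignment can be written with a boolean mask built from the time level. This sketch assumes the rows are ordered by date and then time, and that every date has the same number of rows inside the window. assign_window is a hypothetical helper, and the values are shortened for illustration:

```python
import datetime

import numpy as np
import pandas as pd


def assign_window(s, start, end, values):
    """Write `values` into the [start, end] time window of every date."""
    times = s.index.get_level_values("time")
    mask = np.array([start <= t <= end for t in times])
    n_dates = s.index.get_level_values("date").nunique()
    out = s.copy()
    # Repeat the per-day values once for each date in the first level.
    out.loc[mask] = np.tile(values, n_dates)
    return out


dates = [datetime.date(2018, 1, 26), datetime.date(2018, 1, 27)]
times = [datetime.time(h, m) for h in range(2) for m in (0, 15, 30, 45)]
index = pd.MultiIndex.from_product([dates, times], names=["date", "time"])
temp = pd.Series(np.nan, index=index, name="temp")

result = assign_window(
    temp, datetime.time(0, 0), datetime.time(0, 30), [21.6, 21.5, 21.4]
)
```

The mask makes the length requirement explicit: the tiled array must have exactly as many elements as there are True entries in the mask.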

How do I merge two columns in a dataframe based on a datetime index?

I have a big dataset in pandas with a datetime index.
The dataframe starts at 2010-04-09 and runs to the present. When I created this dataset, a few columns only had data starting from 2011-06-01; the values before that date were NaN. I have now obtained the data for a few of those columns between 2010-04-09 and 2011-06-01. That data is in a different dataframe with the same datetime index.
Now I want to fill the old columns in the original dataset with the values from the new one, but I can't seem to do it.
My original dataframe looks like this:
>>> data.head()
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN NaN NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
The dataframe I want to add looks like this:
>>> df.loc['20100409']
Qnet Rain Windspeed Winddirection
2010-04-09 10:00:00 326.3 0.0 2.4 288
2010-04-09 11:00:00 331.8 0.0 3.6 308
2010-04-09 12:00:00 212.7 0.0 3.8 349
2010-04-09 13:00:00 246.6 0.0 4.1 354
2010-04-09 14:00:00 422.7 0.0 4.5 343
2010-04-09 15:00:00 210.9 0.0 4.6 356
2010-04-09 16:00:00 120.6 0.0 4.5 3
2010-04-09 17:00:00 83.3 0.0 4.5 4
2010-04-09 18:00:00 -23.8 0.0 3.3 7
2010-04-09 19:00:00 -54.0 0.0 3.0 15
2010-04-09 20:00:00 -44.3 0.0 2.7 3
2010-04-09 21:00:00 -41.9 0.0 2.6 3
2010-04-09 22:00:00 -42.1 0.0 2.2 1
2010-04-09 23:00:00 -47.4 0.0 2.2 2
So I want to add the values of df['Qnet'] to data['Qnet'], etc.
I tried a lot of things with merge and join, but nothing seems to really work. There is no overlapping data in the frames. The 'df' dataframe stops at 2011-05-31, and the 'data' dataframe has NaN values until that date in the columns I want to change. The original columns in data do have values from 2011-06-01 onwards, and I want to keep those!
I do know how to merge the two datasets, but then I get a Qnet_x and a Qnet_y column.
So the question is: how do I combine/merge two columns, whether they come from two datasets or from the same one?
I hope the question is clear.
Thanks in advance for the help.
UPDATE2:
This version should also work with duplicates in the index:
data = data.join(df['Qnet'], rsuffix='_new')
data['Qnet'] = data['Qnet'].combine_first(data['Qnet_new'])
data.drop(['Qnet_new'], axis=1, inplace=True)
UPDATE:
# .ix has been removed from pandas; .loc does the same label-based assignment
data.loc[pd.isnull(data.Qnet), 'Qnet'] = df['Qnet']
In [114]: data.loc[data.index[-1], 'Qnet'] = 9999
In [115]: data
Out[115]:
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN 9999.0 NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
In [116]: data.loc[pd.isnull(data.Qnet), 'Qnet'] = df['Qnet']
In [117]: data
Out[117]:
bc_conc stability wind_speed Qnet Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN 326.3 NaN
2010-04-09 11:00:00 663.500000 NaN NaN 331.8 NaN
2010-04-09 12:00:00 524.661667 NaN NaN 212.7 NaN
2010-04-09 13:00:00 228.706667 NaN NaN 246.6 NaN
2010-04-09 14:00:00 279.721667 NaN NaN 9999.0 NaN
wind_direction Rain seizoen clouds
2010-04-09 10:00:00 NaN NaN lente 1
2010-04-09 11:00:00 NaN NaN lente 6
2010-04-09 12:00:00 NaN NaN lente 8
2010-04-09 13:00:00 NaN NaN lente 4
2010-04-09 14:00:00 NaN NaN lente 7
OLD answer:
you can do it this way:
In [97]: data.drop(['Qnet'], axis=1).join(df['Qnet'])
Out[97]:
bc_conc stability wind_speed Visibility \
2010-04-09 10:00:00 609.542000 NaN NaN NaN
2010-04-09 11:00:00 663.500000 NaN NaN NaN
2010-04-09 12:00:00 524.661667 NaN NaN NaN
2010-04-09 13:00:00 228.706667 NaN NaN NaN
2010-04-09 14:00:00 279.721667 NaN NaN NaN
wind_direction Rain seizoen clouds Qnet
2010-04-09 10:00:00 NaN NaN lente 1 326.3
2010-04-09 11:00:00 NaN NaN lente 6 331.8
2010-04-09 12:00:00 NaN NaN lente 8 212.7
2010-04-09 13:00:00 NaN NaN lente 4 246.6
2010-04-09 14:00:00 NaN NaN lente 7 422.7
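A compact variation on the combine_first idea, as a sketch: since combine_first aligns on the index and only fills NaNs, it can be applied per column without the intermediate join. The column mapping is an assumption based on the column names shown in the question, fill_from is a hypothetical helper, and the frames are cut down to a few rows:

```python
import numpy as np
import pandas as pd


def fill_from(data, other, column_map):
    """Fill NaNs in `data` from `other`, matching rows on the index."""
    out = data.copy()
    for target, source in column_map.items():
        # Existing values in `data` win; only NaNs are taken from `other`.
        out[target] = out[target].combine_first(other[source])
    return out


hour = pd.Timedelta(hours=1)
data = pd.DataFrame(
    {"Qnet": [np.nan, np.nan, 9999.0], "wind_speed": [np.nan, np.nan, np.nan]},
    index=pd.date_range("2010-04-09 10:00", periods=3, freq=hour),
)
df = pd.DataFrame(
    {"Qnet": [326.3, 331.8, 212.7, 246.6], "Windspeed": [2.4, 3.6, 3.8, 4.1]},
    index=pd.date_range("2010-04-09 10:00", periods=4, freq=hour),
)
merged = fill_from(data, df, {"Qnet": "Qnet", "wind_speed": "Windspeed"})
```

Values already present in data (such as the 9999.0 above) are kept, and rows that exist only in df are not added to the result.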

Extend dataframe by adding start and end date and fill it with timestamps and NaN

I have got the following data:
data
timestamp
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
and would like to sort it by time in ascending order and add a start and end date at the top and bottom of the data, so that it looks like this:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-01 10:00:00 9
2012-06-01 13:00:00 9
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-02 00:00:00 NaN
Finally, I would like to extend the dataset to cover all hours from start to end in one-hour steps, adding the missing timestamps with 'None'/'NaN' as data.
So far I have the following code:
df2 = pd.DataFrame({'data':temperature, 'timestamp': pd.DatetimeIndex(timestamp)}, dtype=float)
df2.set_index('timestamp',inplace=True)
df3 = pd.DataFrame({ 'timestamp': pd.Series([ts1, ts2]), 'data': [None, None]})
df3.set_index('timestamp',inplace=True)
print(df3)
merged = pd.concat([df3, df2]) # DataFrame.append was removed in pandas 2.0
print(merged)
with the following print outs:
df3:
data
timestamp
2012-06-01 00:00:00 None
2012-06-02 00:00:00 None
merged:
data
timestamp
2012-06-01 00:00:00 NaN
2012-06-02 00:00:00 NaN
2012-06-01 17:00:00 9
2012-06-01 20:00:00 8
2012-06-01 13:00:00 9
2012-06-01 10:00:00 9
I have tried:
merged = merged.asfreq('H')
but this returned an unsatisfying result:
data
2012-06-01 00:00:00 NaN
2012-06-01 01:00:00 NaN
2012-06-01 02:00:00 NaN
2012-06-01 03:00:00 NaN
2012-06-01 04:00:00 NaN
2012-06-01 05:00:00 NaN
2012-06-01 06:00:00 NaN
2012-06-01 07:00:00 NaN
2012-06-01 08:00:00 NaN
2012-06-01 09:00:00 NaN
2012-06-01 10:00:00 9
Where is the rest of the dataframe? Why does it only contain data up to the first valid value?
Help is much appreciated. Thanks a lot in advance.
First create an empty dataframe with the timestamp index that you want and then do a left merge with your original dataset:
df2 = pd.DataFrame(index = pd.date_range('2012-06-01','2012-06-02', freq='H'))
df3 = pd.merge(df2, df, left_index = True, right_index = True, how = 'left')
df3
Out[103]:
timestamp value
2012-06-01 00:00:00 NaN NaN
2012-06-01 01:00:00 NaN NaN
2012-06-01 02:00:00 NaN NaN
2012-06-01 03:00:00 NaN NaN
2012-06-01 04:00:00 NaN NaN
2012-06-01 05:00:00 NaN NaN
2012-06-01 06:00:00 NaN NaN
2012-06-01 07:00:00 NaN NaN
2012-06-01 08:00:00 NaN NaN
2012-06-01 09:00:00 NaN NaN
2012-06-01 10:00:00 2012-06-01 10:00:00 9
2012-06-01 11:00:00 NaN NaN
2012-06-01 12:00:00 NaN NaN
2012-06-01 13:00:00 2012-06-01 13:00:00 9
2012-06-01 14:00:00 NaN NaN
2012-06-01 15:00:00 NaN NaN
2012-06-01 16:00:00 NaN NaN
2012-06-01 17:00:00 2012-06-01 17:00:00 9
2012-06-01 18:00:00 NaN NaN
2012-06-01 19:00:00 NaN NaN
2012-06-01 20:00:00 2012-06-01 20:00:00 8
2012-06-01 21:00:00 NaN NaN
2012-06-01 22:00:00 NaN NaN
2012-06-01 23:00:00 NaN NaN
2012-06-02 00:00:00 NaN NaN
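An equivalent sketch with reindex, which avoids the merge: sort the index, then reindex onto the full hourly range. The truncated asfreq result in the question is consistent with the frequency conversion having been applied to an unsorted index, though that is an inference rather than something the question confirms. The sample values are taken from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"data": [9.0, 8.0, 9.0, 9.0]},
    index=pd.DatetimeIndex(
        [
            "2012-06-01 17:00",
            "2012-06-01 20:00",
            "2012-06-01 13:00",
            "2012-06-01 10:00",
        ],
        name="timestamp",
    ),
)

# Every hour from the start date to the end date, both inclusive.
full_index = pd.date_range(
    "2012-06-01", "2012-06-02", freq=pd.Timedelta(hours=1), name="timestamp"
)

# Rows missing from df become NaN; existing rows keep their values.
extended = df.sort_index().reindex(full_index)
```

Because the start and end timestamps are part of full_index, there is no need to append placeholder rows for them first.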
