Can I use the date index to create dummies in pandas?

I have been searching for a way to create dummies using the date index in pandas, but could not find anything yet.
I have a df that is indexed by date:
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
...
I know I could make date a regular column using
df.reset_index(level=0, inplace=True)
and then use something like this to create dummies,
df['main_hours'] = np.where((df['date'] >= '2010-01-02 03:00:00') & (df['date'] <= '2010-01-02 05:00:00'), 1, 0)
However, I would like to create dummy variables from the date index on the fly, without turning the date into a column. Is there a way to do that in pandas?
Any suggestion would be appreciated.

IIUC:
df['main_hours'] = \
    np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'),
             1,
             0)
or:
In [8]: df['main_hours'] = \
        ((df.index >= '2010-01-02 03:00:00') &
         (df.index <= '2010-01-02 05:00:00')).astype(int)
In [9]: df
Out[9]:
dew temp main_hours
date
2010-01-02 00:00:00 129.0 -16 0
2010-01-02 01:00:00 148.0 -15 0
2010-01-02 02:00:00 159.0 -11 0
2010-01-02 03:00:00 181.0 -7 1
2010-01-02 04:00:00 138.0 -7 1
Timing for a 50,000-row DataFrame:
In [19]: df = pd.concat([df.reset_index()] * 10**4, ignore_index=True).set_index('date')
In [20]: pd.options.display.max_rows = 10
In [21]: df
Out[21]:
dew temp
date
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
... ... ...
2010-01-02 00:00:00 129.0 -16
2010-01-02 01:00:00 148.0 -15
2010-01-02 02:00:00 159.0 -11
2010-01-02 03:00:00 181.0 -7
2010-01-02 04:00:00 138.0 -7
[50000 rows x 2 columns]
In [22]: %timeit ((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00')).astype(int)
1.58 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: %timeit np.where((df.index >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'), 1, 0)
1.52 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: df.shape
Out[24]: (50000, 2)

Or using between:
pd.Series(df.index).between('2010-01-02 03:00:00', '2010-01-02 05:00:00', inclusive=True).astype(int)
Out[1567]:
0 0
1 0
2 0
3 1
4 1
Name: date, dtype: int32
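One caveat worth adding (my note, not part of the original answer): pd.Series(df.index) carries a fresh 0..n-1 RangeIndex rather than the original date index, so assigning the result straight back to df would misalign and produce NaNs. A minimal sketch of a safe assignment, dropping the index with .values (between is inclusive on both ends by default; newer pandas spells the keyword inclusive='both' rather than the boolean form above):
# drop the RangeIndex so the values line up positionally with df's date index
df['main_hours'] = pd.Series(df.index).between(
    '2010-01-02 03:00:00', '2010-01-02 05:00:00').astype(int).values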

df = df.assign(main_hours=0)
df.loc[df.between_time(start_time='3:00', end_time='5:00').index, 'main_hours'] = 1
>>> df
dew temp main_hours
2010-01-02 00:00:00 129 -16 0
2010-01-02 01:00:00 148 -15 0
2010-01-02 02:00:00 159 -11 0
2010-01-02 03:00:00 181 -7 1
2010-01-02 04:00:00 138 -7 1
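A compact variant of the same idea (a sketch of my own, assuming only the time of day matters and the index is a DatetimeIndex): DatetimeIndex.indexer_between_time returns the integer positions of the rows in the time window, which can be turned into a 0/1 column directly.
import numpy as np

# assumption: df.index is a DatetimeIndex and only the time of day matters
main_hours = np.zeros(len(df), dtype=int)
main_hours[df.index.indexer_between_time('03:00', '05:00')] = 1
df['main_hours'] = main_hours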

Related

How to number timestamps that fall within a particular duration of time in a dataframe

If we divide the time of day from 00:00:00 to 23:59:00 into 15-minute blocks, we will have 96 blocks, which we can number from 0 to 95.
I want to add a "timeblock" column to the dataframe, where I can number each row with the timeblock number that its timestamp sits in, as shown below.
tagdatetime tagvalue timeblock
2020-01-01 00:00:00 47.874423 0
2020-01-01 00:01:00 14.913561 0
2020-01-01 00:02:00 56.368034 0
2020-01-01 00:03:00 16.555687 0
2020-01-01 00:04:00 42.138176 0
... ... ...
2020-01-01 00:13:00 47.874423 0
2020-01-01 00:14:00 14.913561 0
2020-01-01 00:15:00 56.368034 0
2020-01-01 00:16:00 16.555687 1
2020-01-01 00:17:00 42.138176 1
... ... ...
2020-01-01 23:55:00 18.550685 95
2020-01-01 23:56:00 51.219147 95
2020-01-01 23:57:00 15.098951 95
2020-01-01 23:58:00 37.863191 95
2020-01-01 23:59:00 51.380950 95
There may be a better way to do it, but the approach below works.
import pandas as pd
import numpy as np
tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
tvalue = np.random.randint(1,50, (1440,))
df = pd.DataFrame({'tagdatetime':tindex, 'tagvalue':tvalue})
min15 = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='15min')
tblock = np.arange(96)
df2 = pd.DataFrame({'min15':min15, 'timeblock':tblock})
df3 = pd.merge(df, df2, left_on='tagdatetime', right_on='min15', how='outer')
df3.ffill(axis=0, inplace=True)
df3 = df3.drop('min15', axis=1)
df3.iloc[10:20,]
tagdatetime tagvalue timeblock
10 2020-01-01 00:10:00 20 0.0
11 2020-01-01 00:11:00 25 0.0
12 2020-01-01 00:12:00 42 0.0
13 2020-01-01 00:13:00 45 0.0
14 2020-01-01 00:14:00 11 0.0
15 2020-01-01 00:15:00 15 1.0
16 2020-01-01 00:16:00 38 1.0
17 2020-01-01 00:17:00 23 1.0
18 2020-01-01 00:18:00 5 1.0
19 2020-01-01 00:19:00 32 1.0
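A simpler arithmetic alternative (my own sketch, not from the original answer, assuming tagdatetime is a datetime64 column): since the blocks are fixed 15-minute intervals, the block number can be computed directly from the minute of the day, with no merge or forward-fill needed.
# minute-of-day // 15 maps 00:00-00:14 to block 0, ..., 23:45-23:59 to block 95
df['timeblock'] = (df['tagdatetime'].dt.hour * 60 + df['tagdatetime'].dt.minute) // 15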

Average 2 consecutive rows in pandas dataframes

I have a dataset which looks as follows:
userid time val1 val2 val3 val4
1 2010-6-1 0:15 12 16 17 11
1 2010-6-1 0:30 11.5 14 15.2 10
1 2010-6-1 0:45 12 14 15 10
1 2010-6-1 1:00 8 11 13 0
.................................
.................................
2 2010-6-1 0:15 14 16 17 11
2 2010-6-1 0:30 11 14 15.2 10
2 2010-6-1 0:45 11 14 15 10
2 2010-6-1 1:00 9 11 13 0
.................................
.................................
3 ...................................
.................................
.................................
I want to get the average of every two rows. The expected result would be:
userid time val1 val2 val3 val4
1 2010-6-1 0:30 11.75 15 16.1 10.5
1 2010-6-1 1:00 10 12.5 14 5
..............................
..............................
2 2010-6-1 0:30 12.5 15 16.1 10.5
2 2010-6-1 1:00 10 12.5 14 5
.................................
.................................
3 ...................................
.................................
.................................
At the moment my approach is:
data = pd.read_csv("sample_dataset.csv")
i = 0
while i < len(data) - 1:
    x = data.iloc[i:i+2].mean()
    x['time'] = data.iloc[i+1]['time']
    data.iloc[i] = x
    i += 2
data = data.iloc[::2]  # keep only the rows that now hold the pair averages
But this is very inefficient. Can someone point me to a better approach to get the intended result? The dataset has more than 1,000,000 rows.
Using resample:
df.set_index('time').resample('30Min',closed = 'right',label ='right').mean()
Out[293]:
val1 val2 val3 val4
time
2010-06-01 00:30:00 11.75 15.0 16.1 10.5
2010-06-01 01:00:00 10.00 12.5 14.0 5.0
Method 2
df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
Out[308]:
time val1 val2 val3 val4
0 2010-06-01 00:30:00 11.75 15.0 16.1 10.5
1 2010-06-01 01:00:00 10.00 12.5 14.0 5.0
Updated solution, grouping by userid as well:
df.groupby([df.userid,np.arange(len(df))//2]).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean()).reset_index(drop=True)
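For completeness (my addition, not part of the original answer), the resample approach can also be combined with the userid grouping directly via groupby().resample(); a sketch, assuming the column names from the question:
# per-user 30-minute resampling, labelled by the right edge of each bin
(df.set_index('time')
   .groupby('userid')
   .resample('30Min', closed='right', label='right')
   .mean())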
This solution stays in pandas, and is far more performant than the groupby-agg solution:
>>> df = pd.DataFrame({"a": range(10),
...                    "b": range(0, 20, 2),
...                    "c": pd.date_range('2018-01-01', periods=10, freq='H')})
>>> df
a b c
0 0 0 2018-01-01 00:00:00
1 1 2 2018-01-01 01:00:00
2 2 4 2018-01-01 02:00:00
3 3 6 2018-01-01 03:00:00
4 4 8 2018-01-01 04:00:00
5 5 10 2018-01-01 05:00:00
6 6 12 2018-01-01 06:00:00
7 7 14 2018-01-01 07:00:00
8 8 16 2018-01-01 08:00:00
9 9 18 2018-01-01 09:00:00
>>> pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2,
...            df.iloc[::2, 2]], axis=1)
a b c
0 0.5 1.0 2018-01-01 00:00:00
2 2.5 5.0 2018-01-01 02:00:00
4 4.5 9.0 2018-01-01 04:00:00
6 6.5 13.0 2018-01-01 06:00:00
8 8.5 17.0 2018-01-01 08:00:00
Performance:
In [41]: n = 100000
In [42]: df = pd.DataFrame({"a":range(n), "b":range(0, n*2, 2), "c":pd.date_range('2018-01-01', periods= n, freq='S')})
In [44]: df.shape
Out[44]: (100000, 3)
In [45]: %timeit pd.concat([(df.iloc[::2, :2] + df.iloc[1::2, :2].values) / 2, df.iloc[::2, 2]], axis=1)
2.21 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [46]: %timeit df.groupby(np.arange(len(df))//2).agg(lambda x : x.iloc[-1] if x.dtype=='datetime64[ns]' else x.mean())
7.9 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tried both of the mentioned answers. Both worked, but Noah's answer was the fastest in my experience, so I marked it as the solution.
Here is my version of Noah's answer, with some explanation and edits to map it to my dataset.
In order to use Noah's answer, the time column should be first or last (I may be wrong). Therefore, I moved the time column to the end:
col = data.columns.tolist()
tmp = col[10]
col[10] = col[1]
col[1] = tmp
data2 = data[col]
Then I did the concatenation. Here ::2 selects every second row, and :10 selects columns 0 to 9. Then I append the time column, which is at index 10:
x = pd.concat([(data2.iloc[::2, :10] + data2.iloc[1::2, :10].values) / 2, data2.iloc[::2, 10]], axis=1)
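As a side note (my own sketch, assuming the datetime column is literally named 'time'), the column swap above can also be written without index juggling, by reordering the column list:
# move the 'time' column (hypothetical name) to the end
data2 = data[[c for c in data.columns if c != 'time'] + ['time']]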

How to convert datetime series to actual duration in hours?

I have a dataframe like this:
index = ['2018-02-17 00:30:00', '2018-02-17 07:00:00',
         '2018-02-17 13:00:00', '2018-02-17 19:00:00',
         '2018-02-18 00:00:00', '2018-02-18 07:00:00',
         '2018-02-18 10:30:00', '2018-02-18 13:00:00']
df = pd.DataFrame({'col': list(range(len(index)))})
df.index = pd.to_datetime(index)
col
2018-02-17 00:30:00 0
2018-02-17 07:00:00 1
2018-02-17 13:00:00 2
2018-02-17 19:00:00 3
2018-02-18 00:00:00 4
2018-02-18 07:00:00 5
2018-02-18 10:30:00 6
2018-02-18 13:00:00 7
and would like to add a column that reflects the actual duration in hours, so my desired outcome looks like this:
col time_range
2018-02-17 00:30:00 0 0.0
2018-02-17 07:00:00 1 6.5
2018-02-17 13:00:00 2 12.5
2018-02-17 19:00:00 3 18.5
2018-02-18 00:00:00 4 23.5
2018-02-18 07:00:00 5 30.5
2018-02-18 10:30:00 6 34.0
2018-02-18 13:00:00 7 36.5
I currently do this as follows:
df['time_range'] = [(ti - df.index[0]).delta / (10 ** 9 * 60 * 60) for ti in df.index]
Is there a smarter (i.e. vectorized/built-in) way of doing this?
Use:
df['new'] = (df.index - df.index[0]).total_seconds() / 3600
Or:
df['new'] = (df.index - df.index[0]) / np.timedelta64(1, 'h')
print (df)
col new
2018-02-17 00:30:00 0 0.0
2018-02-17 07:00:00 1 6.5
2018-02-17 13:00:00 2 12.5
2018-02-17 19:00:00 3 18.5
2018-02-18 00:00:00 4 23.5
2018-02-18 07:00:00 5 30.5
2018-02-18 10:30:00 6 34.0
2018-02-18 13:00:00 7 36.5
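If the datetimes were a regular column rather than the index (my addition, using a hypothetical 'date' column just for illustration), the same idea works through the .dt accessor:
# 'date' is a hypothetical datetime column, not from the original example
df['new'] = (df['date'] - df['date'].iloc[0]).dt.total_seconds() / 3600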

Optimize code to find the median of values of past 4 to 6 days for each row in a DataFrame

Given a dataframe of timestamped data, I would like to compute, for each row, the median of a certain variable over the past 4-6 days.
The median over the past 1-3 days can be computed with pandas.DataFrame.rolling, but I couldn't find how to use rolling to compute the median over the past 4-6 days.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and there may be more than one data point per day.
timestamp var
0 2011-01-01 00:00:00 1.624345
1 2011-01-01 06:00:00 -0.611756
2 2011-01-01 12:00:00 -0.528172
3 2011-01-01 18:00:00 -1.072969
4 2011-01-02 00:00:00 0.865408
5 2011-01-02 06:00:00 -2.301539
6 2011-01-02 12:00:00 1.744812
7 2011-01-02 18:00:00 -0.761207
8 2011-01-03 00:00:00 0.319039
9 2011-01-03 06:00:00 -0.249370
10 2011-01-03 12:00:00 1.462108
Desired output:
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN # no data in past 4-6 days
1 2011-01-01 06:00:00 -0.611756 NaN # no data in past 4-6 days
2 2011-01-01 12:00:00 -0.528172 NaN # no data in past 4-6 days
3 2011-01-01 18:00:00 -1.072969 NaN # no data in past 4-6 days
4 2011-01-02 00:00:00 0.865408 NaN # no data in past 4-6 days
5 2011-01-02 06:00:00 -2.301539 NaN # no data in past 4-6 days
6 2011-01-02 12:00:00 1.744812 NaN # no data in past 4-6 days
7 2011-01-02 18:00:00 -0.761207 NaN # no data in past 4-6 days
8 2011-01-03 00:00:00 0.319039 NaN # no data in past 4-6 days
9 2011-01-03 06:00:00 -0.249370 NaN # no data in past 4-6 days
10 2011-01-03 12:00:00 1.462108 NaN # no data in past 4-6 days
11 2011-01-03 18:00:00 -2.060141 NaN # no data in past 4-6 days
12 2011-01-04 00:00:00 -0.322417 NaN # no data in past 4-6 days
13 2011-01-04 06:00:00 -0.384054 NaN # no data in past 4-6 days
14 2011-01-04 12:00:00 1.133769 NaN # no data in past 4-6 days
15 2011-01-04 18:00:00 -1.099891 NaN # no data in past 4-6 days
16 2011-01-05 00:00:00 -0.172428 NaN # only 4 data in past 4-6 days
17 2011-01-05 06:00:00 -0.877858 -0.528172
18 2011-01-05 12:00:00 0.042214 -0.569964
19 2011-01-05 18:00:00 0.582815 -0.528172
20 2011-01-06 00:00:00 -1.100619 -0.569964
21 2011-01-06 06:00:00 1.144724 -0.528172
22 2011-01-06 12:00:00 0.901591 -0.388771
23 2011-01-06 18:00:00 0.502494 -0.249370
My current code:
def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in xrange(len(df)):
        pastVar2 = df[var].loc[(df['timestamp'] - df['timestamp'].loc[i] < datetime.timedelta(days=-window)) &
                               (df['timestamp'] - df['timestamp'].loc[i] >= datetime.timedelta(days=-window*2))]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window+1, window*2, var)] = np.median(pastVar2.values)
    return df
Current speed:
In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
I edited the post so that I can clearly show my expected output, which requires at least 5 data points. I've set the random seed so that everyone should be able to get the same input and reproduce the same output. As far as I know, a simple rolling-and-shift does not work when there are multiple data points in the same day.
Here we go:
df.set_index('timestamp', inplace = True)
df['var'] =df['var'].rolling('3D', min_periods = 3).median().shift(freq = pd.Timedelta('4d')).shift(-1)
df['var']
Out[55]:
timestamp
2011-01-01 00:00:00 NaN
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 NaN
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 NaN
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 NaN
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 NaN
2011-01-05 06:00:00 -0.528172
2011-01-05 12:00:00 -0.569964
2011-01-05 18:00:00 -0.528172
2011-01-06 00:00:00 -0.569964
2011-01-06 06:00:00 -0.528172
2011-01-06 12:00:00 -0.569964
2011-01-06 18:00:00 -0.528172
2011-01-07 00:00:00 -0.388771
2011-01-07 06:00:00 -0.249370
2011-01-07 12:00:00 -0.388771
The way this is set up, each row of an irregular time series has a different window width, which requires an iterative approach like the one you have started. But if we make the time series the index:
# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))
In this case I chose an interval of every 12 hrs, but it could be whatever is available, even irregular. Using a modified function with a window for the median, along with an offset (here, a positive Delta looks backwards), gives you the flexibility you wanted:
def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
This runs substantially faster:
%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
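To match the 4-6 day window from the question (my reading of the function above, so treat the exact arguments as an assumption), the call would presumably look like:
# rows strictly older than 3 days but within 6 days of each timestamp
GetMedian(df, var='var', window='3D', Delta='3D')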
The min_periods should be 2 instead of 5, because you should not count the window size itself (5 - 3 = 2).
import pandas as pd
import numpy as np
import datetime
np.random.seed(1) # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df
 
%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop
My first try didn't use shift(), but then I saw Noobie's answer.
I made the following one with shift(), which is much faster than the previous one.
def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
 
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop
The second one is around 4 times as fast as the first one.
These two functions create the same result, which looks like this:
df2 = test()
df2
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN
1 2011-01-02 00:00:00 -0.611756 NaN
2 2011-01-03 00:00:00 -0.528172 NaN
3 2011-01-04 00:00:00 -1.072969 NaN
4 2011-01-05 00:00:00 0.865408 0.506294
5 2011-01-06 00:00:00 -2.301539 -0.528172
6 2011-01-07 00:00:00 1.744812 -0.611756
... ... ... ...
93 2011-04-04 00:00:00 -0.638730 1.129484
94 2011-04-05 00:00:00 0.423494 1.129484
95 2011-04-06 00:00:00 0.077340 0.185156
96 2011-04-07 00:00:00 -0.343854 -0.375285
97 2011-04-08 00:00:00 0.043597 -0.375285
98 2011-04-09 00:00:00 -0.620001 0.077340
99 2011-04-10 00:00:00 0.698032 0.077340

Pandas: complex condition on datetime

I have a dataframe with a datetime type column and a float type column.
date value
0 2010-01-01 01:23:00 21.2
1 2010-01-02 01:33:00 63.4
2 2010-01-03 06:02:00 80.6
3 2010-01-04 06:05:00 50.1
4 2010-01-05 06:20:00 346.5
5 2010-01-06 07:44:00 111.8
6 2010-01-07 08:00:00 113.1
7 2010-01-08 08:22:00 10.6
8 2010-01-09 09:00:00 287.2
9 2010-01-10 09:14:00 1652.6
I want to create a new column that records the mean of the values from the hour before each row's time.
[UPDATE] Example:
If the current row is 4 2010-01-05 06:20:00 346.5, I need to calculate (50.1 + 80.6) / 2, i.e. take the values in the range 2010-01-05 05:20:00 ~ 2010-01-05 06:20:00 and compute their mean.
date value before_1hr_mean
4 2010-01-05 06:20:00 346.5 65.35
I use iterrows() to solve this, as in the following code, but this method is really slow and iterrows() is generally not recommended in pandas.
[UPDATE]
df['before_1hr_mean'] = np.nan
for index, row in df.iterrows():
    df.loc[index, 'before_1hr_mean'] = df[(df['date'] < row['date']) &
                                          (df['date'] >= row['date'] - pd.Timedelta(hours=1))]['value'].mean()
Is there a better way to deal with this situation?
I took the liberty of changing your data to make it all the same day. It's the only way I could make sense of your question.
df.join(
    df.set_index('date').value.rolling('H').mean().rename('before_1hr_mean'),
    on='date'
)
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 21.200000
1 2010-01-01 01:33:00 63.4 42.300000
2 2010-01-01 06:02:00 80.6 80.600000
3 2010-01-01 06:05:00 50.1 65.350000
4 2010-01-01 06:20:00 346.5 159.066667
5 2010-01-01 07:44:00 111.8 111.800000
6 2010-01-01 08:00:00 113.1 112.450000
7 2010-01-01 08:22:00 10.6 78.500000
8 2010-01-01 09:00:00 287.2 148.900000
9 2010-01-01 09:14:00 1652.6 650.133333
If you want to exclude the current row, you have to track the sum and count of the rolling hour and back out what the average is after adjusting for the current value.
s = df.set_index('date')
sagg = s.rolling('H').agg(['sum', 'count']).value.rename(columns=str.title)
agged = df.join(sagg, on='date')
agged
date value Sum Count
0 2010-01-01 01:23:00 21.2 21.2 1.0
1 2010-01-01 01:33:00 63.4 84.6 2.0
2 2010-01-01 06:02:00 80.6 80.6 1.0
3 2010-01-01 06:05:00 50.1 130.7 2.0
4 2010-01-01 06:20:00 346.5 477.2 3.0
5 2010-01-01 07:44:00 111.8 111.8 1.0
6 2010-01-01 08:00:00 113.1 224.9 2.0
7 2010-01-01 08:22:00 10.6 235.5 3.0
8 2010-01-01 09:00:00 287.2 297.8 2.0
9 2010-01-01 09:14:00 1652.6 1950.4 3.0
Then do some math and assign a new column
df.assign(before_1hr_mean=agged.eval('(Sum - value) / (Count - 1)'))
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 NaN
1 2010-01-01 01:33:00 63.4 21.20
2 2010-01-01 06:02:00 80.6 NaN
3 2010-01-01 06:05:00 50.1 80.60
4 2010-01-01 06:20:00 346.5 65.35
5 2010-01-01 07:44:00 111.8 NaN
6 2010-01-01 08:00:00 113.1 111.80
7 2010-01-01 08:22:00 10.6 112.45
8 2010-01-01 09:00:00 287.2 10.60
9 2010-01-01 09:14:00 1652.6 148.90
Notice that you get nulls when there isn't an hour's worth of prior data to calculate over.
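As a closing note (my addition, not from the original answers): pandas also lets a time-based rolling window exclude its right edge via closed='left', which matches the asker's strict "before the current row" condition directly and avoids the sum/count bookkeeping; a sketch:
# window is [t - 1h, t): includes values an hour back, excludes the current row
df['before_1hr_mean'] = (df.set_index('date')['value']
                           .rolling('H', closed='left')
                           .mean()
                           .values)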
