Pandas - Sum of first X hours of datetime index - python

I have a dataframe with a datetime index and 100 columns.
I want to have a new dataframe with the same datetime index and columns, but the values would contain the sum of the first 10 hours of each day.
So if I had an original dataframe like this:
A B C
---------------------------------
2018-01-01 00:00:00 2 5 -10
2018-01-01 01:00:00 6 5 7
2018-01-01 02:00:00 7 5 9
2018-01-01 03:00:00 9 5 6
2018-01-01 04:00:00 10 5 2
2018-01-01 05:00:00 7 5 -1
2018-01-01 06:00:00 1 5 -1
2018-01-01 07:00:00 -4 5 10
2018-01-01 08:00:00 9 5 10
2018-01-01 09:00:00 21 5 -10
2018-01-01 10:00:00 2 5 -1
2018-01-01 11:00:00 8 5 -1
2018-01-01 12:00:00 8 5 10
2018-01-01 13:00:00 8 5 9
2018-01-01 14:00:00 7 5 -10
2018-01-01 15:00:00 7 5 5
2018-01-01 16:00:00 7 5 -10
2018-01-01 17:00:00 4 5 7
2018-01-01 18:00:00 5 5 8
2018-01-01 19:00:00 2 5 8
2018-01-01 20:00:00 2 5 4
2018-01-01 21:00:00 8 5 3
2018-01-01 22:00:00 1 5 3
2018-01-01 23:00:00 1 5 1
2018-01-02 00:00:00 2 5 2
2018-01-02 01:00:00 3 5 8
2018-01-02 02:00:00 4 5 6
2018-01-02 03:00:00 5 5 6
2018-01-02 04:00:00 1 5 7
2018-01-02 05:00:00 7 5 7
2018-01-02 06:00:00 5 5 1
2018-01-02 07:00:00 2 5 2
2018-01-02 08:00:00 4 5 3
2018-01-02 09:00:00 6 5 4
2018-01-02 10:00:00 9 5 4
2018-01-02 11:00:00 11 5 5
2018-01-02 12:00:00 2 5 8
2018-01-02 13:00:00 2 5 0
2018-01-02 14:00:00 4 5 5
2018-01-02 15:00:00 5 5 4
2018-01-02 16:00:00 7 5 4
2018-01-02 17:00:00 -1 5 7
2018-01-02 18:00:00 1 5 7
2018-01-02 19:00:00 1 5 7
2018-01-02 20:00:00 5 5 7
2018-01-02 21:00:00 2 5 7
2018-01-02 22:00:00 2 5 7
2018-01-02 23:00:00 8 5 7
So for all rows with date 2018-01-01:
The value for column A would be 68 (2+6+7+9+10+7+1-4+9+21)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 22 (-10+7+9+6+2-1-1+10+10-10)
So for all rows with date 2018-01-02:
The value for column A would be 39 (2+3+4+5+1+7+5+2+4+6)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 46 (2+8+6+6+7+7+1+2+3+4)
The outcome would be:
A B C
---------------------------------
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
I figured I'd group by date first and perform a sum and then merge the results based on the date. Is there a better/faster way to do this?
Thanks.
EDIT: I worked on this answer in the meantime:
df = df.between_time('0:00', '9:00').groupby(pd.Grouper(freq='D')).sum()
df = df.resample('1H').ffill()

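You need to group by df.index.date and use transform with a lambda function that sums the first 10 rows of each group (these are the first 10 hours, provided every day starts at 00:00 and has no missing hours):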
df.loc[:,['A','B','C']] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
Or, if the columns of the dataframe are in the same order as the columns of the grouped result:
df.loc[:,:] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
print(df)
A B C
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
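
Note that x[:10] takes the first 10 rows of each day, which matches the first 10 hours only when every day is complete. A minimal sketch of an alternative that selects by clock hour instead and avoids the Python-level lambda follows. The seeded random data and the names first10, daily and out are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data: two full days of hourly values (assumed, not the question's frame).
idx = pd.date_range("2018-01-01", periods=48, freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(-10, 22, size=(48, 3)), index=idx, columns=list("ABC"))

# Keep only the rows whose clock hour is 00:00 through 09:00.
first10 = df[df.index.hour < 10]

# One row of sums per calendar day (the index is midnight of each day).
daily = first10.groupby(first10.index.normalize()).sum()

# Broadcast each day's sums back onto every original timestamp.
out = daily.reindex(df.index.normalize())
out.index = df.index
```

Here out has the same index and columns as df, and every row of a given day holds that day's 00:00 to 09:00 sums. Whether this is faster than transform on a 100-column frame would need to be measured on the real data.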

Related

Join dataframe value series to another on datetime range from 1st dataframe

I am trying to retrieve values from a dataframe (df2) that fall within a datetime range that is specified in another dataframe (df1).
import datetime as dt
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'date': pd.date_range("20180101", periods=5)}, index=list('ABCDE'))
df1['t-2'] = df1['date'] + dt.timedelta(days=-2)
df1['t+2'] = df1['date'] + dt.timedelta(days=2)
df1:
date t-2 t+2
A 2018-01-01 2017-12-30 2018-01-03
B 2018-01-02 2017-12-31 2018-01-04
C 2018-01-03 2018-01-01 2018-01-05
D 2018-01-04 2018-01-02 2018-01-06
E 2018-01-05 2018-01-03 2018-01-07
df2 = pd.DataFrame(np.random.randint(0,100,size=(5, 50)),
index=list('ABCDE'), columns = pd.date_range("20171201", periods=50))
df2:
2017-12-01 2017-12-02 2017-12-03 ... 2018-01-17 2018-01-18 2018-01-19
A 58 61 45 ... 72 77 68
B 88 94 68 ... 93 68 24
C 97 47 21 ... 22 48 89
D 44 8 62 ... 57 29 29
E 0 21 26 ... 65 46 36
I would like to take the 5 cells from df2 that fall between the dates in df1's t-2 and t+2 columns and append these to df1 so that it looks like the following:
date t-2 t+2 t-2d t-1d t t+1d t+2d
A 2018-01-01 2017-12-30 2018-01-03 21 28 25 7 6
B 2018-01-02 2017-12-31 2018-01-04 28 25 7 6 45
C 2018-01-03 2018-01-01 2018-01-05 25 7 6 45 74
D 2018-01-04 2018-01-02 2018-01-06 7 6 45 74 23
E 2018-01-05 2018-01-03 2018-01-07 6 45 74 23 57
So far I have tried any number of combinations of the following
pd.merge(df1, df2.loc[:, df1['t-2']:df1['t+2']], how='inner', left_index=True, right_index=True)
but I receive a TypeError. Any help is greatly appreciated.
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [A 2017-12-30
B 2017-12-31
C 2018-01-01
D 2018-01-02
E 2018-01-03
Name: t-2, dtype: datetime64[ns]] of type Series
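
The TypeError arises because a .loc slice needs scalar bounds, while df1['t-2'] and df1['t+2'] are whole Series. One possible way around it is to look the cells up row by row. This sketch assumes that df1 and df2 share the same row labels and that every date in each window exists as a column of df2. The seeded random data and the names offsets, labels, window and result are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"date": pd.date_range("20180101", periods=5)}, index=list("ABCDE"))
df1["t-2"] = df1["date"] - pd.Timedelta(days=2)
df1["t+2"] = df1["date"] + pd.Timedelta(days=2)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df2 = pd.DataFrame(rng.integers(0, 100, size=(5, 50)),
                   index=list("ABCDE"),
                   columns=pd.date_range("20171201", periods=50))

offsets = [-2, -1, 0, 1, 2]
labels = ["t-2d", "t-1d", "t", "t+1d", "t+2d"]

# For each offset k, pick the cell at (row label, date + k days) from df2.
window = pd.DataFrame(
    {
        label: [df2.at[row, date + pd.Timedelta(days=k)]
                for row, date in df1["date"].items()]
        for k, label in zip(offsets, labels)
    },
    index=df1.index,
)

# Append the five looked-up columns to df1.
result = df1.join(window)
```

The scalar lookups with .at keep the logic easy to follow; for large frames a positional lookup on the underlying array would likely be quicker.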

Highlighting a region on my plot? pandas/matplotlib

I have four columns of data imported using pandas:
DayOfYear Time Field Distance
1 09:00:00 50 100
1 10:00:00 51 110
1 11:00:00 52 130
2 09:00:00 54 170
2 10:00:00 55 200
2 11:00:00 56 220
3 09:00:00 58 250
3 10:00:00 59 280
3 11:00:00 60 300
4 09:00:00 61 320
4 10:00:00 63 350
4 11:00:00 65 400
5 09:00:00 66 420
5 10:00:00 68 450
5 11:00:00 70 500
6 09:00:00 72 520
6 10:00:00 74 560
6 11:00:00 75 600
7 09:00:00 77 630
7 10:00:00 79 670
7 11:00:00 80 700
...
So far I have needed to plot Field against Distance for whichever range of days I need, which I have done using
startday = 1
endday = 6
plt.plot(rawdata[rawdata['DayOfYear'].between(startday, endday)].set_index('Distance')['Field'])
Now, on the same plot, I would like to highlight a region for a specific time range. For example, I'd like to highlight, along the distance axis, day 3 between 8AM and 10AM.

getting rows that belong to hour range in pandas datetimeindex

Minimal reproducible code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index("date", inplace=True)
Now I have a dataframe like
data
date
2018-01-01 00:00:00 47
2018-01-01 01:00:00 97
2018-01-01 02:00:00 98
2018-01-01 03:00:00 36
Since I've made its index a DatetimeIndex, I can do things like df.loc["2018-01-01"] to get only the rows within January 1st of 2018.
I cannot find any resource that explains how to select certain hours.
I want to get the hours from 6am to 12pm for all days, leading to the expected output
data
date
2018-01-01 06:00:00 47
2018-01-01 07:00:00 97
2018-01-01 08:00:00 98
.
.
.
2018-01-02 06:00:00 36
2018-01-02 07:00:00 47
2018-01-02 08:00:00 97
.
.
.
2018-01-03 06:00:00 98
2018-01-03 07:00:00 36
2018-01-03 08:00:00 47
.
.
. and so on
You can simply use between_time:
print (df.between_time("06:00","12:00"))
#
data
date
2018-01-01 06:00:00 51
2018-01-01 07:00:00 61
2018-01-01 08:00:00 37
2018-01-01 09:00:00 77
2018-01-01 10:00:00 7
2018-01-01 11:00:00 59
2018-01-01 12:00:00 69
2018-01-02 06:00:00 85
2018-01-02 07:00:00 70
2018-01-02 08:00:00 72
2018-01-02 09:00:00 55
2018-01-02 10:00:00 27
2018-01-02 11:00:00 32
2018-01-02 12:00:00 8
...
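
As the output shows, between_time keeps both endpoints by default, so the 12:00 row is included. A small sketch of the same selection written as a mask on the index's hour, plus a half-open variant that stops before noon, is given here. The sequential data and the names by_time, by_hour and before_noon are illustrative assumptions.

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start="2018-01-01", end="2018-01-08", freq="h")
df = pd.DataFrame({"data": np.arange(len(date_rng))}, index=date_rng)

# between_time keeps both endpoints by default: 06:00 through 12:00 inclusive.
by_time = df.between_time("06:00", "12:00")

# The same rows selected with a boolean mask on the index's hour attribute.
hours = df.index.hour
by_hour = df[(hours >= 6) & (hours <= 12)]

# Half-open variant: 06:00 up to, but not including, 12:00.
before_noon = df[(hours >= 6) & (hours < 12)]
```

The mask form is handy when the condition is more than a single time window, for example when it also depends on the weekday.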

Slicing window on pandas dataframe

I have a pandas dataframe with time-series data in 1-min intervals. Is there a pythonic way to slice my data into 15-minute chunks like this?
a = pd.DataFrame(index=pd.date_range('2017-01-01 00:04', '2017-01-01 01:04', freq='1min'))
a['data'] = np.arange(61)
for i in range(0, len(a), 15):
    print(a[i:i+15])
Is there any built in function for this in pandas?
IIUC, use groupby with pd.Grouper and freq='15min':
for _, g in a.groupby(pd.Grouper(freq='15min')):
    print(g)
Can also do
groups = a.groupby(pd.Grouper(freq='15min'))
list(groups)
Outputs
data
2017-01-01 00:04:00 0
2017-01-01 00:05:00 1
2017-01-01 00:06:00 2
2017-01-01 00:07:00 3
2017-01-01 00:08:00 4
2017-01-01 00:09:00 5
2017-01-01 00:10:00 6
2017-01-01 00:11:00 7
2017-01-01 00:12:00 8
2017-01-01 00:13:00 9
2017-01-01 00:14:00 10
data
2017-01-01 00:15:00 11
2017-01-01 00:16:00 12
2017-01-01 00:17:00 13
2017-01-01 00:18:00 14
2017-01-01 00:19:00 15
2017-01-01 00:20:00 16
2017-01-01 00:21:00 17
2017-01-01 00:22:00 18
2017-01-01 00:23:00 19
2017-01-01 00:24:00 20
2017-01-01 00:25:00 21
2017-01-01 00:26:00 22
2017-01-01 00:27:00 23
2017-01-01 00:28:00 24
2017-01-01 00:29:00 25
data
2017-01-01 00:30:00 26
2017-01-01 00:31:00 27
2017-01-01 00:32:00 28
2017-01-01 00:33:00 29
2017-01-01 00:34:00 30
2017-01-01 00:35:00 31
2017-01-01 00:36:00 32
2017-01-01 00:37:00 33
2017-01-01 00:38:00 34
2017-01-01 00:39:00 35
2017-01-01 00:40:00 36
2017-01-01 00:41:00 37
2017-01-01 00:42:00 38
2017-01-01 00:43:00 39
2017-01-01 00:44:00 40
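
Note that the first group in the output above has only 11 rows: by default the 15-minute bins are aligned to the clock (00:00, 00:15, ...), not to the first timestamp. If the goal is chunks that start at the first row, as in the question's loop, one option is the origin argument of pd.Grouper, which to my knowledge is available from pandas 1.1 onward. A sketch comparing the two, with clock_sizes and start_sizes as illustrative names:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    {"data": np.arange(61)},
    index=pd.date_range("2017-01-01 00:04", "2017-01-01 01:04", freq="1min"),
)

# Default: bins aligned to the clock, so the first and last groups are short.
clock_sizes = a.groupby(pd.Grouper(freq="15min")).size().tolist()

# origin="start": bins anchored at the first timestamp (00:04).
start_sizes = a.groupby(pd.Grouper(freq="15min", origin="start")).size().tolist()

print(clock_sizes)  # [11, 15, 15, 15, 5]
print(start_sizes)  # [15, 15, 15, 15, 1]
```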

How do I group hourly data by day and count only values greater than a set amount in Pandas?

I am new to Pandas but have been working with python for a few years now.
I have a large data set of hourly data with multiple columns. I need to group the data by day then count how many times the value is above 85 for each day for each column.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help would be greatly appreciated.
(D_RH>85).sum()
The above code gets me close, but I also need a daily breakdown, not just the counts per column.
One way would be to make date a DatetimeIndex and then groupby the result of the comparison to 85. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
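
For reference, a self-contained sketch of the same idea using resample('D') instead of grouping on df.index.date, which keeps a DatetimeIndex on the result. It rebuilds only the KMRY and KSNS columns from the sample above; the variable names idx and counts are illustrative.

```python
import pandas as pd

# Hourly sample from 2014-06-06 13:00 to 2014-06-07 05:00 (two of the four columns).
idx = pd.date_range("2014-06-06 13:00", periods=17, freq="h")
df = pd.DataFrame(
    {
        "KMRY": [56, 58, 63, 67, 67, 72, 77, 84, 81, 84, 83, 84, 86, 86, 89, 86, 86],
        "KSNS": [63, 61, 63, 65, 67, 75, 79, 81, 86, 84, 86, 86, 89, 89, 89, 86, 89],
    },
    index=idx,
)

# Compare first (True/False per cell), then count the True values per calendar day.
counts = (df > 85).resample("D").sum()
print(counts)
```

Unlike grouping on df.index.date, resample also emits a row (with zeros) for any calendar day inside the range that has no data at all.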
