I have four columns of data imported using pandas:
DayOfYear Time Field Distance
1 09:00:00 50 100
1 10:00:00 51 110
1 11:00:00 52 130
2 09:00:00 54 170
2 10:00:00 55 200
2 11:00:00 56 220
3 09:00:00 58 250
3 10:00:00 59 280
3 11:00:00 60 300
4 09:00:00 61 320
4 10:00:00 63 350
4 11:00:00 65 400
5 09:00:00 66 420
5 10:00:00 68 450
5 11:00:00 70 500
6 09:00:00 72 520
6 10:00:00 74 560
6 11:00:00 75 600
7 09:00:00 77 630
7 10:00:00 79 670
7 11:00:00 80 700
...
So far I have needed to plot Field against Distance for whichever range of days I need, which I have done using:
startday = 1
endday= 6
plt.plot(rawdata[rawdata['DayOfYear'].between(startday,endday)].set_index('Distance')['Field'])
Now, on the same plot, I would like to highlight a region for a specific time range. For example, I'd like to highlight, along the distance axis, day 3 between 8 AM and 10 AM.
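One possible approach, sketched under assumptions rather than taken from an answer in the thread: select the rows for that day and time window, read off the smallest and largest Distance, and shade that interval with matplotlib's axvspan. The small rawdata frame below is a hypothetical subset of the table above, and the Time column is assumed to hold zero-padded HH:MM:SS strings, so plain string comparison orders them correctly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical subset of the data shown in the question
rawdata = pd.DataFrame({
    "DayOfYear": [2, 2, 2, 3, 3, 3, 4, 4, 4],
    "Time": ["09:00:00", "10:00:00", "11:00:00"] * 3,
    "Field": [54, 55, 56, 58, 59, 60, 61, 63, 65],
    "Distance": [170, 200, 220, 250, 280, 300, 320, 350, 400],
})

startday, endday = 2, 4
subset = rawdata[rawdata["DayOfYear"].between(startday, endday)]

fig, ax = plt.subplots()
ax.plot(subset["Distance"], subset["Field"])

# Rows for day 3 between 08:00 and 10:00 (both ends inclusive)
mask = (rawdata["DayOfYear"] == 3) & rawdata["Time"].between("08:00:00", "10:00:00")
span = rawdata.loc[mask, "Distance"]

# Shade the matching stretch of the distance axis, if any rows matched
if not span.empty:
    ax.axvspan(span.min(), span.max(), color="yellow", alpha=0.3)

fig.savefig("field_vs_distance.png")
```

If the window contains only one sample, the shaded span collapses to zero width, so a small margin around the value may be needed in that case.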
I am working with log data structured as follows (here is a pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well for dates with daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs will be batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by the id column and by a helper Series built with Series.cumsum. The cumulative sum has to run from the end backwards, hence the reversed indexing with Series.iloc:
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
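Note that row 7 above gets 0 even though no error follows it. As a variation (a sketch, not part of the answer itself), the reversed cumulative sum can be computed per id and the rows with no later error masked out. The frame below is a hypothetical reduction of the mock data to the two columns that matter.

```python
import pandas as pd

# Hypothetical reduction of the mock data: only id and has_err are needed
df = pd.DataFrame({
    "id": [123] * 8 + [232] * 7,
    "has_err": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
})

# Reverse cumulative sum of errors within each id:
# rows that share the same next error get the same label
g = df.iloc[::-1].groupby("id")["has_err"].cumsum().sort_index()

# Count down to the next error inside each (id, label) block
days = df.groupby(["id", g]).cumcount(ascending=False)

# A label of 0 means no error occurs at or after this row: leave it missing
df["days_to_err"] = days.where(g > 0)
print(df)
```

The masked rows become NaN, which turns the column into floats; dropping them with dropna would give the frame the asker described.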
EDIT: To compute the cumulative sum of the differences between dates, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.days.cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
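A different way to get the same minutes, shown here only as a sketch and not as the method used above: keep the timestamp on error rows, back-fill it within each id so every row knows the time of its next error, and subtract. This does not depend on a fixed 15-minute spacing. The small frame is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample with two machines
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2020-01-01 12:00", "2020-01-01 12:15", "2020-01-01 12:30",
        "2020-01-01 12:45", "2020-01-01 17:00", "2020-01-01 17:15",
        "2020-01-01 17:30",
    ]),
    "has_err": [0, 0, 1, 0, 0, 1, 0],
})

# Timestamp on error rows, NaT everywhere else
err_time = df["date"].where(df["has_err"].eq(1))

# Propagate each error time backwards to the rows preceding it, per machine
next_err = err_time.groupby(df["id"]).bfill()

# Difference to the next error in minutes; NaN when no later error exists
df["mins_to_err"] = (next_err - df["date"]).dt.total_seconds().div(60)
print(df)
```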
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
I have a dataframe with a datetime index and 100 columns.
I want to have a new dataframe with the same datetime index and columns, but the values would contain the sum of the first 10 hours of each day.
So if I had an original dataframe like this:
A B C
---------------------------------
2018-01-01 00:00:00 2 5 -10
2018-01-01 01:00:00 6 5 7
2018-01-01 02:00:00 7 5 9
2018-01-01 03:00:00 9 5 6
2018-01-01 04:00:00 10 5 2
2018-01-01 05:00:00 7 5 -1
2018-01-01 06:00:00 1 5 -1
2018-01-01 07:00:00 -4 5 10
2018-01-01 08:00:00 9 5 10
2018-01-01 09:00:00 21 5 -10
2018-01-01 10:00:00 2 5 -1
2018-01-01 11:00:00 8 5 -1
2018-01-01 12:00:00 8 5 10
2018-01-01 13:00:00 8 5 9
2018-01-01 14:00:00 7 5 -10
2018-01-01 15:00:00 7 5 5
2018-01-01 16:00:00 7 5 -10
2018-01-01 17:00:00 4 5 7
2018-01-01 18:00:00 5 5 8
2018-01-01 19:00:00 2 5 8
2018-01-01 20:00:00 2 5 4
2018-01-01 21:00:00 8 5 3
2018-01-01 22:00:00 1 5 3
2018-01-01 23:00:00 1 5 1
2018-01-02 00:00:00 2 5 2
2018-01-02 01:00:00 3 5 8
2018-01-02 02:00:00 4 5 6
2018-01-02 03:00:00 5 5 6
2018-01-02 04:00:00 1 5 7
2018-01-02 05:00:00 7 5 7
2018-01-02 06:00:00 5 5 1
2018-01-02 07:00:00 2 5 2
2018-01-02 08:00:00 4 5 3
2018-01-02 09:00:00 6 5 4
2018-01-02 10:00:00 9 5 4
2018-01-02 11:00:00 11 5 5
2018-01-02 12:00:00 2 5 8
2018-01-02 13:00:00 2 5 0
2018-01-02 14:00:00 4 5 5
2018-01-02 15:00:00 5 5 4
2018-01-02 16:00:00 7 5 4
2018-01-02 17:00:00 -1 5 7
2018-01-02 18:00:00 1 5 7
2018-01-02 19:00:00 1 5 7
2018-01-02 20:00:00 5 5 7
2018-01-02 21:00:00 2 5 7
2018-01-02 22:00:00 2 5 7
2018-01-02 23:00:00 8 5 7
So for all rows with date 2018-01-01:
The value for column A would be 68 (2+6+7+9+10+7+1-4+9+21)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 22 (-10+7+9+6+2-1-1+10+10-10)
So for all rows with date 2018-01-02:
The value for column A would be 39 (2+3+4+5+1+7+5+2+4+6)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 46 (2+8+6+6+7+7+1+2+3+4)
The outcome would be:
A B C
---------------------------------
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
I figured I'd group by date first and perform a sum and then merge the results based on the date. Is there a better/faster way to do this?
Thanks.
EDIT: I worked out this answer in the meantime:
df= df.between_time('0:00','9:00').groupby(pd.Grouper(freq='D')).sum()
df= df.resample('1H').ffill()
You need to group by df.index.date and use transform with a lambda function to compute the sum of the first 10 values of each day:
df.loc[:,['A','B','C']] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
Or, if the grouped result has the same column order as the original dataframe:
df.loc[:,:] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
print(df)
A B C
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
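The transform above takes the first 10 rows of each day, which equals the first 10 hours only when no hourly row is missing. A sketch of a variant that selects by clock hour instead, then broadcasts the daily sums back to every row, is below. The data is a hypothetical 48-hour frame, not the one from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering two full days
idx = pd.date_range("2018-01-01", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"A": np.arange(48), "B": 5}, index=idx)

# Keep only the rows stamped 00:00 through 09:00
early = df[df.index.hour < 10]

# One row of sums per calendar day, indexed by that day's midnight
daily = early.groupby(early.index.normalize()).sum()

# Broadcast each daily sum back to all rows of the same day
out = daily.reindex(df.index.normalize())
out.index = df.index
print(out)
```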
I have a data frame which contains dates and values. I have to compute the sum of the values for each month,
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
The problem is that in my data set each month starts on the 21st and ends on the 20th. Is there any way to tell pandas to group months from the 21st day to the 20th day?
Assume the starting and ending dates of my data frame are:
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
So far I have tried:
import datetime

starting_date=df['Date'].min()
ending_date=df['Date'].max()
month_wise_sum=[]
while(starting_date<=ending_date):
    # jump into the next calendar month, then snap to its 20th day
    temp=starting_date+datetime.timedelta(days=31)
    e_y=temp.year
    e_m=temp.month
    e_d=20
    temp= datetime.datetime(e_y,e_m,e_d)
    # sum the values from the 21st up to and including the 20th
    month_wise_sum.append(df[df['Date'].between(starting_date,temp)]['Value'].sum())
    # the next window starts on the 21st
    starting_date=temp+datetime.timedelta(days=1)
print(month_wise_sum)
My code above does the job, but I am still looking for a more Pythonic way to achieve it.
My biggest problem is slicing the data frame month-wise,
for example,
2015-11-21 to 2015-12-20
Is there any pythonic way to achieve this?
Thanks in Advance.
For example, consider this as my dataframe. It contains dates from date_range(datetime.datetime(2017,1,21),datetime.datetime(2017,10,20))
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
so that I can calculate the sum of the Value series for each month:
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope I have explained it thoroughly.
To test my solution I generated some random data. The frequency is daily, but it should work for any frequency.
import numpy as np
import pandas as pd
index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here you see that I passed an array of datetimes as the index. Indexing with dates gives you a lot of added functionality in pandas. With your data you should do the following (if the Date column already contains only datetime values):
df = df.set_index('Date')
Then I would artificially realign your data by subtracting 20 days from the index:
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample the data to a monthly index, summing all data in the same month:
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month (for me, something like this):
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
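The same 20-day shift can also be used as a grouping key without touching the index, which keeps the original dates intact. The following is a minimal sketch under assumptions: the Date and Value column names come from the question, while the constant values are hypothetical and chosen so that each sum equals the number of days in the window.

```python
import pandas as pd

# Hypothetical data: one row per day, every value equal to 1.0
dates = pd.date_range("2017-01-21", "2017-04-20", freq="D")
df = pd.DataFrame({"Date": dates, "Value": 1.0})

# Shifting back 20 days maps the 21st..20th window onto one calendar month:
# the 21st lands on the 1st, the 20th lands on the last day of the prior month
key = (df["Date"] - pd.Timedelta(days=20)).dt.to_period("M")

sums = df.groupby(key)["Value"].sum()
print(sums)
```

Each label in the result is the calendar month in which the window starts, so 2017-01 stands for 2017-01-21 through 2017-02-20.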
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np
start_date = "2015-11-21"
# As @ALollz mentioned, the month containing the original end_date='2017-11-20' was missing.
# pd.date_range() only generates dates inside the specified range (between start= and end=),
# so '2017-11-30' (generated with freq='M') exceeds the original end='2017-11-20' and is cut off.
# The same can happen at the start (with freq="MS"), where the first month might be cut off.
# An easy fix is to extend end_date to a date in the next month, to use
# the last day of its own month '2017-11-30', or to replace end= with periods=25
end_date = "2017-12-20"
# create a testing dataframe
df = pd.DataFrame({ "date": pd.date_range(start_date, periods=710, freq='D'), "value": np.random.randn(710)})
# set up bins to include all dates to create expected date ranges
bins = [ d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M") ]
# group and summary using the ranges from the above bins
df.groupby(pd.cut(df.date, bins)).sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()
I have two large .csv files that I'm handling with Python, pandas and numpy. Here is a sample from the more granular data set (A); the timestamps are in 15-minute intervals:
Timestamp,Real Energy Into the Load
2016-06-01T11:00:00, 2
2016-06-01T10:45:00, 1
2016-06-01T10:30:00, 5
2016-06-01T10:15:00, 3
2016-06-01T10:00:00, 3
2016-06-01T09:45:00, 6
2016-06-01T09:30:00, 2
...
and here is a sample from the less granular data set (B), with timestamps approximately an hour apart, but there is a lot of variance between timestamps.
TimeEDT, TemperatureF, Dew PointF
2016-06-01T10:33:00,82.0,66.0
2016-06-01T09:34:00,79.0,64.9
2016-06-01T09:20:00,75.9,64.9
...
I want to combine them such that the combined data frame will have the same number of rows as dataframe B by grouping averages from data frame A's rows. The final rows corresponding to this would be:
TimeEDT, TemperatureF, Dew PointF, Real Energy Into The Load
2016-06-01T10:33:00,82.0,66.0, 1.5 # average of (1, 2)
2016-06-01T09:34:00,79.0,64.9, 4.25 # average of (6, 3, 3, 5)
2016-06-01T09:20:00,75.9,64.9, 2 # average of (2,)
...
I think this is called a horizontal union in SQL.
Things I've already tried:
I took dataset B (dfB) and used dfB['TimeEDT'].apply to "floor" each date to its 15-minute interval. From there, I can use the groupby function to sum the rows together to at least have a one-to-one correspondence between the rows, but then I still need to join the data frames horizontally. I would like to have a more direct way of doing this. Ideally the argument to groupby could be some user-defined comparison.
Maybe you can do something like the code below. I haven't checked whether it works if there are no values during an hour, but this is the idea.
In[1]: import pandas as pd
In[2]: import numpy as np
In[3]: df1 = pd.DataFrame({"TemperatureF": np.random.randint(60, 91, 20), "DewPointF": np.random.randint(60, 91, 20)}, index = pd.date_range("2016-06-01 09:00:00", periods=20, freq="15min"))
In[4]: df2 = pd.DataFrame({"TemperatureF": np.random.randint(60, 91, 5), "DewPointF": np.random.randint(60, 91, 5), "RealEnergy": np.random.uniform(1.0, 5.0, 5)}, index = pd.date_range("2016-06-01 09:30:00", periods=5, freq="H"))
In[5]: df1
Out[5]:
DewPointF TemperatureF
2016-06-01 09:00:00 66 71
2016-06-01 09:15:00 84 68
2016-06-01 09:30:00 68 74
2016-06-01 09:45:00 66 85
2016-06-01 10:00:00 70 72
2016-06-01 10:15:00 63 78
2016-06-01 10:30:00 82 83
2016-06-01 10:45:00 67 79
2016-06-01 11:00:00 63 76
2016-06-01 11:15:00 72 80
2016-06-01 11:30:00 82 61
2016-06-01 11:45:00 60 81
2016-06-01 12:00:00 77 76
2016-06-01 12:15:00 78 60
2016-06-01 12:30:00 75 60
2016-06-01 12:45:00 83 67
2016-06-01 13:00:00 84 81
2016-06-01 13:15:00 66 66
2016-06-01 13:30:00 80 84
2016-06-01 13:45:00 87 69
In[6]: df2
Out[6]:
DewPointF RealEnergy TemperatureF
2016-06-01 09:30:00 84 2.479343 88
2016-06-01 10:30:00 64 1.428840 67
2016-06-01 11:30:00 88 3.214666 83
2016-06-01 12:30:00 72 4.280898 71
2016-06-01 13:30:00 62 3.376502 78
In[7]: df2.merge(df1.groupby(df1.index.hour)[["DewPointF", "TemperatureF"]].mean(), on=df2.index.hour)
Out[7]:
key_0 DewPointF_x RealEnergy TemperatureF_x DewPointF_y TemperatureF_y
0 9 84 2.479343 88 71.00 74.50
1 10 64 1.428840 67 70.50 78.00
2 11 88 3.214666 83 69.25 74.50
3 12 72 4.280898 71 78.25 65.75
4 13 62 3.376502 78 79.25 75.00
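Grouping by clock hour does not follow the irregular timestamps in B. As an alternative sketch (a different technique from the merge above), pd.merge_asof can tag every row of A with the most recent B timestamp at or before it, after which the energy values are averaged per tag. The frames below reproduce the samples from the question; the names a, b, tagged and combined are only illustrative.

```python
import pandas as pd

# Sample rows of data set A (15-minute energy readings)
a = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2016-06-01T11:00:00", "2016-06-01T10:45:00", "2016-06-01T10:30:00",
        "2016-06-01T10:15:00", "2016-06-01T10:00:00", "2016-06-01T09:45:00",
        "2016-06-01T09:30:00",
    ]),
    "Real Energy Into the Load": [2, 1, 5, 3, 3, 6, 2],
})

# Sample rows of data set B (irregular weather observations)
b = pd.DataFrame({
    "TimeEDT": pd.to_datetime([
        "2016-06-01T10:33:00", "2016-06-01T09:34:00", "2016-06-01T09:20:00",
    ]),
    "TemperatureF": [82.0, 79.0, 75.9],
    "Dew PointF": [66.0, 64.9, 64.9],
})

# merge_asof needs both sides sorted by their time key
tagged = pd.merge_asof(
    a.sort_values("Timestamp"),
    b[["TimeEDT"]].sort_values("TimeEDT"),
    left_on="Timestamp",
    right_on="TimeEDT",
    direction="backward",  # latest B timestamp not after the A timestamp
)

# Average the energy readings that fall under each B timestamp
means = tagged.groupby("TimeEDT")["Real Energy Into the Load"].mean()

# Attach the averages to B, keeping B's row count and order
combined = b.join(means, on="TimeEDT")
print(combined)
```

On the sample rows this reproduces the averages listed in the question. A rows earlier than the first B timestamp get no tag and drop out of the averages.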
I have the following dataframe df:
timestamp objectId result
0 2015-11-24 09:00:00 Stress 3
1 2015-11-24 09:00:00 Productivity 0
2 2015-11-24 09:00:00 Abilities 4
3 2015-11-24 09:00:00 Challenge 0
4 2015-11-24 10:00:00 Productivity 87
5 2015-11-24 10:00:00 Abilities 84
6 2015-11-24 10:00:00 Challenge 58
7 2015-11-24 10:00:00 Stress 25
8 2015-11-24 11:00:00 Productivity 93
9 2015-11-24 11:00:00 Abilities 93
10 2015-11-24 11:00:00 Challenge 93
11 2015-11-24 11:00:00 Stress 19
12 2015-11-24 12:00:00 Challenge 90
13 2015-11-24 12:00:00 Abilities 96
14 2015-11-24 12:00:00 Stress 94
15 2015-11-24 12:00:00 Productivity 88
16 2015-11-24 13:00:00 Productivity 12
17 2015-11-24 13:00:00 Challenge 17
18 2015-11-24 13:00:00 Abilities 89
19 2015-11-24 13:00:00 Stress 13
I would like to achieve a barchart like the following
Instead of a, b, c, d there would be the labels from the objectId column; the y-axis should correspond to the result column, and the x-axis should show the values of the timestamp column as groups.
I tried several things but nothing worked. This was the closest, but the plot() method doesn't take any customisation via parameters (e.g. kind='bar' doesn't work).
groups = df.groupby('objectId')
sgb = groups['result']
sgb.plot()
Any other idea?
import seaborn as sns
In [36]:
df.timestamp = df.timestamp.factorize()[0]
In [39]:
df.objectId = df.objectId.map({'Stress' : 'a' , 'Productivity' : 'b' , 'Abilities' : 'c' , 'Challenge' : 'd'})
In [41]:
df
Out[41]:
timestamp objectId result
0 0 a 3
1 0 b 0
2 0 c 4
3 0 d 0
4 1 b 87
5 1 c 84
6 1 d 58
7 1 a 25
8 2 b 93
9 2 c 93
10 2 d 93
11 2 a 19
12 3 d 90
13 3 c 96
14 3 a 94
15 3 b 88
16 4 b 12
17 4 d 17
18 4 c 89
19 4 a 13
In [40]:
sns.barplot(x = 'timestamp' , y = 'result' , hue = 'objectId' , data = df );
The answer of @NaderHisham is a very easy solution!
But just as a reference, if for some reason you cannot use seaborn, this is a pure pandas/matplotlib solution:
You need to reshape your data so that the different objectIds become the columns:
In [20]: df.set_index(['timestamp', 'objectId'])['result'].unstack()
Out[20]:
objectId Abilities Challenge Productivity Stress
timestamp
09:00:00 4 0 0 3
10:00:00 84 58 87 25
11:00:00 93 93 93 19
12:00:00 96 90 88 94
13:00:00 89 17 12 13
If you make a bar plot of this, you get the desired result:
In [24]: df.set_index(['timestamp', 'objectId'])['result'].unstack().plot(kind='bar')
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0xc44a5c0>
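The same reshape can also be written with DataFrame.pivot. The sketch below uses a hypothetical four-row subset of the data and shortens the timestamps to HH:MM so that they read well as bar-group labels.

```python
import pandas as pd

# Hypothetical subset of the dataframe from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-11-24 09:00:00"] * 2 + ["2015-11-24 10:00:00"] * 2
    ),
    "objectId": ["Stress", "Productivity", "Productivity", "Stress"],
    "result": [3, 0, 87, 25],
})

# One row per timestamp, one column per objectId
wide = df.pivot(index="timestamp", columns="objectId", values="result")

# Shorter labels for the x-axis
wide.index = wide.index.strftime("%H:%M")
print(wide)
```

Calling wide.plot(kind='bar') on this frame then draws one group of bars per timestamp, as in the unstack version. pivot raises an error if a (timestamp, objectId) pair occurs more than once, so duplicated pairs would need aggregating first.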