I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour, is there a faster way?
some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what i've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
there are 2 steps to achieve this:
convert Actual to date time:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (unless you wish to have mean or whatever just change it). One note: hour+1 so it will start from 1 and not 0 (remove it if you wish 0 to be midnight).
desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
Related
I have a time series dataset that can be created with the following code.
idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = pd.DataFrame(ts,columns=["date"])
dft["data"] = ""
dft["data"][0:5]= "a"
dft["data"][5:15]= "b"
dft["data"][15:20]= "c"
dft["data"][20:30]= "d"
dft["data"][30:40]= "a"
dft["data"][40:70]= "c"
dft["data"][70:85]= "b"
dft["data"][85:len(dft)]= "c"
In the data column, the unique values are a,b,c,d. These values are repeating in a sequence in different time windows. I want to capture the first and last value of that time window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
edit. combining with original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
In the example dataframe below, how can I convert t_relative into hours? For example, the relative time in the first row would be 49 hours.
tstart tend t_relative
0 2131-05-16 23:00:00 2131-05-19 00:00:00 2 days 01:00:00
1 2131-05-16 23:00:00 2131-05-19 00:15:00 2 days 01:15:00
2 2131-05-16 23:00:00 2131-05-19 00:45:00 2 days 01:45:00
3 2131-05-16 23:00:00 2131-05-19 01:00:00 2 days 02:00:00
4 2131-05-16 23:00:00 2131-05-19 01:15:00 2 days 02:15:00
t_relative was calculated with the operation, df['t_relative'] = df['tend']-df['tstart'].
You can divide Timedelta:
df['t_relative']/pd.Timedelta('1H')
Output:
0 49.00
1 49.25
2 49.75
3 50.00
4 50.25
Name: t_relative, dtype: float64
Minimal reproducible code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index("date", inplace=True)
now I have dataframe like
data
date
2018-01-01 00:00:00 47
2018-01-01 01:00:00 97
2018-01-01 02:00:00 98
2018-01-01 03:00:00 36
Since I've made its index datetimeindex I can do things like df["2018-01-01"] to get only index within January 1st of 2018.
I cannot find any resource that explains way to certain hours.
I want to get hours from 6am ~ 12pm for all days, leading to expected output
data
date
2018-01-01 06:00:00 47
2018-01-01 07:00:00 97
2018-01-01 08:00:00 98
.
.
.
2018-01-02 06:00:00 36
2018-01-02 07:00:00 47
2018-01-02 08:00:00 97
.
.
.
2018-01-03 06:00:00 98
2018-01-03 07:00:00 36
2018-01-03 08:00:00 47
.
.
. and so on
You can simply use between_time:
print (df.between_time("06:00","12:00"))
#
data
date
2018-01-01 06:00:00 51
2018-01-01 07:00:00 61
2018-01-01 08:00:00 37
2018-01-01 09:00:00 77
2018-01-01 10:00:00 7
2018-01-01 11:00:00 59
2018-01-01 12:00:00 69
2018-01-02 06:00:00 85
2018-01-02 07:00:00 70
2018-01-02 08:00:00 72
2018-01-02 09:00:00 55
2018-01-02 10:00:00 27
2018-01-02 11:00:00 32
2018-01-02 12:00:00 8
...
This question already has answers here:
How to remove relative shift in matplotlib axis
(2 answers)
Closed 4 years ago.
I have this data that I am trying to plot using pandas:
df['SGP_sum'].iloc[0:16] =
Date
2018-01-01 00:00:00 99.998765
2018-01-01 01:00:00 99.993401
2018-01-01 02:00:00 100.005571
2018-01-01 03:00:00 100.027737
2018-01-01 04:00:00 100.022474
2018-01-01 05:00:00 100.039800
2018-01-01 06:00:00 100.043310
2018-01-01 07:00:00 100.045207
2018-01-01 08:00:00 100.045201
2018-01-01 09:00:00 100.043810
2018-01-01 10:00:00 100.042977
2018-01-01 11:00:00 100.054589
2018-01-01 12:00:00 100.052009
2018-01-01 13:00:00 100.040163
2018-01-01 14:00:00 100.009129
2018-01-01 15:00:00 99.975595
Name: SGP_sum, dtype: float64
But when I plot it, this is what I see: ( I get negative values in the display)
df['SGP_sum'].iloc[0:16].plot()
I believe that it is treating your values less than 100 as negative values due to the scale of your y axis being +1e2.
Try plotting without this scaling on the axis.
i have this information; where "opid" is categorical
datetime id nut opid user amount
2018-01-01 07:01:00 1531 3hrnd 1 mherrera 1
2018-01-01 07:05:00 9510 sd45f 1 svasqu 1
2018-01-01 07:06:00 8125 5s8fr 15 urubi 1
2018-01-01 07:08:15 6324 sd5d6 1 jgonza 1
2018-01-01 07:12:01 0198 tgfg5 1 julmaf 1
2018-01-01 07:13:50 6589 mbkg4 15 jdjiep 1
2018-01-01 07:16:10 9501 wurf4 15 polga 1
the result i'm looking for is something like this
datetime opid amount
2018-01-01 07:00:00 1 3
2018-01-01 07:00:00 15 1
2018-01-01 07:10:00 1 1
2018-01-01 07:10:00 15 2
so... basically i need to know how many of each "opid" are done every 10 min
P.D "amount" is always 1, "opid" is from 1 - 15
Using grouper:
df.set_index('datetime').groupby(['opid', pd.Grouper(freq='10min')]).amount.sum()
opid datetime
1 2018-01-01 07:00:00 3
2018-01-01 07:10:00 1
15 2018-01-01 07:00:00 1
2018-01-01 07:10:00 2
Name: amount, dtype: int64