pandas groupby time series by 10 min and also keep some columns - python

I have this data, where "opid" is categorical:
datetime id nut opid user amount
2018-01-01 07:01:00 1531 3hrnd 1 mherrera 1
2018-01-01 07:05:00 9510 sd45f 1 svasqu 1
2018-01-01 07:06:00 8125 5s8fr 15 urubi 1
2018-01-01 07:08:15 6324 sd5d6 1 jgonza 1
2018-01-01 07:12:01 0198 tgfg5 1 julmaf 1
2018-01-01 07:13:50 6589 mbkg4 15 jdjiep 1
2018-01-01 07:16:10 9501 wurf4 15 polga 1
The result I'm looking for is something like this:
datetime opid amount
2018-01-01 07:00:00 1 3
2018-01-01 07:00:00 15 1
2018-01-01 07:10:00 1 1
2018-01-01 07:10:00 15 2
So basically I need to know how many of each "opid" are done every 10 minutes.
P.S. "amount" is always 1, and "opid" ranges from 1 to 15.

Using grouper:
df.set_index('datetime').groupby(['opid', pd.Grouper(freq='10min')]).amount.sum()
opid datetime
1 2018-01-01 07:00:00 3
2018-01-01 07:10:00 1
15 2018-01-01 07:00:00 1
2018-01-01 07:10:00 2
Name: amount, dtype: int64
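To get a flat table shaped like the one requested in the question, the grouped result can be turned back into columns with reset_index. The following is a minimal sketch that rebuilds the question's rows by hand; it assumes "datetime" is already a datetime column and, for simplicity, stores "opid" as plain integers rather than as a categorical.

```python
import pandas as pd

# Rows copied from the question (only the columns that matter here).
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-01 07:01:00", "2018-01-01 07:05:00", "2018-01-01 07:06:00",
        "2018-01-01 07:08:15", "2018-01-01 07:12:01", "2018-01-01 07:13:50",
        "2018-01-01 07:16:10",
    ]),
    "opid": [1, 1, 15, 1, 1, 15, 15],   # assumption: plain ints, not categorical
    "amount": [1, 1, 1, 1, 1, 1, 1],
})

# Group by 10-minute bins of the "datetime" column and by opid,
# then flatten the MultiIndex back into ordinary columns.
result = (
    df.groupby([pd.Grouper(key="datetime", freq="10min"), "opid"])["amount"]
      .sum()
      .reset_index()
)
print(result)
```

Passing key="datetime" to pd.Grouper avoids the set_index step; listing the Grouper first orders the rows by time bin and then by opid.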

Related

Find the date range of grouped data in a dataframe

I have a dataset with 15-minute observations for different stations over 20 years. I want to know the time range for which each station has data.
|station_id | start_time | end_time | observation |
| 2 |2000-01-02 01:00:00|2000-01-02 01:15:00| 50 |
| 2 |2000-01-02 01:15:00|2000-01-02 01:30:00| 15 |
| 2 |2000-02-02 01:30:00|2000-01-02 01:45:00| 3 |
| 3 |2000-01-02 05:00:00|2000-01-02 05:15:00| 10 |
| 3 |2000-01-02 05:15:00|2000-01-02 05:30:00| 2 |
| 3 |2000-02-03 01:00:00|2000-01-02 01:15:00| 15 |
| 3 |2000-02-04 01:00:00|2000-01-02 01:15:00| 20 |
An example of what I want to have:
|station_id | start | end | years |days
| 2 |2000-01-02 01:00:00|2000-01-02 01:45:00| 1 | 1
| 3 |2000-01-02 05:00:00|2000-01-02 01:15:00| 1 | 1
Try using groupby, diff, abs, agg and assign:
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)
x = df.groupby('station_id').agg({'start_time': 'first', 'end_time': 'last'})
temp = x.diff(axis=1).abs()['end_time']
x = x.assign(years=temp.dt.days // 365, days=temp.dt.days % 365).reset_index()
print(x)
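Note that 'first' and 'last' depend on the row order within each station. If the rows are not guaranteed to be sorted, one possible variant is to take the minimum start and the maximum end instead. The sketch below uses only the first two rows of each station from the question; the output column names start, end and span are illustrative choices, not part of the original answer.

```python
import pandas as pd

# First two rows per station from the question's table.
df = pd.DataFrame({
    "station_id": [2, 2, 3, 3],
    "start_time": ["2000-01-02 01:00:00", "2000-01-02 01:15:00",
                   "2000-01-02 05:00:00", "2000-01-02 05:15:00"],
    "end_time": ["2000-01-02 01:15:00", "2000-01-02 01:30:00",
                 "2000-01-02 05:15:00", "2000-01-02 05:30:00"],
})
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(pd.to_datetime)

# Earliest start and latest end per station, independent of row order.
ranges = df.groupby("station_id").agg(
    start=("start_time", "min"),
    end=("end_time", "max"),
)
# Length of the covered range as a Timedelta.
ranges["span"] = ranges["end"] - ranges["start"]
ranges = ranges.reset_index()
print(ranges)
```

From span, whole days are available as ranges["span"].dt.days, which can then be split into years and days as in the answer above.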

Create regular time series from irregular interval with python

I wonder whether it is possible to convert an irregular time series to a regular one without interpolating values from the other column, like this:
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So far I have only tried resample from pandas, but it only partially solved my problem.
Thanks in advance
Use DataFrame.reindex with date_range:
#if necessary
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01','2018-12-31'), fill_value=0)
print (df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]
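A likely reason resample only partially helps is that it fills gaps between the first and last observed dates but does not extend to the start or end of the year. A small sketch on a shortened date range (1 to 10 January, chosen here purely for illustration) shows the difference:

```python
import pandas as pd

# Three irregular observations taken from the question.
df = pd.DataFrame(
    {"count": [1, 4, 15]},
    index=pd.to_datetime(["2018-01-05", "2018-01-07", "2018-01-08"]),
)

# reindex: every day in the requested range, missing days filled with 0.
full_range = pd.date_range("2018-01-01", "2018-01-10", freq="D")
regular = df.reindex(full_range, fill_value=0)

# resample: daily bins only from the first to the last observation.
resampled = df.resample("D").sum()

print(len(regular), len(resampled))
```

Here regular spans all ten days, while resampled starts at 2018-01-05 and ends at 2018-01-08.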

How to convert a DatetimeIndexResamplerGroupby object to a Data Frame?

I want to resample a data frame that has time series data at 30-second intervals to 1-second intervals. For this I used:
test_data=test_data.groupby('entity_id').resample('S', fill_method='ffill')
The output is:
<pandas.core.resample.DatetimeIndexResamplerGroupby object at 0x1a1f64f588>
How can I convert this object to a data frame?
I have tried:
test_data = pd.DataFrame(test_data)
after running the last command, but it returns a data frame that has the index and a list of all the other elements of that row.
Use ffill method:
test_data = pd.DataFrame({
    'entity_id': ['a','a','a','a','b','b','b','c','d'],
    'data': range(9)},
    index=pd.date_range('2018-01-01', periods=9, freq='3S'))
print (test_data)
entity_id data
2018-01-01 00:00:00 a 0
2018-01-01 00:00:03 a 1
2018-01-01 00:00:06 a 2
2018-01-01 00:00:09 a 3
2018-01-01 00:00:12 b 4
2018-01-01 00:00:15 b 5
2018-01-01 00:00:18 b 6
2018-01-01 00:00:21 c 7
2018-01-01 00:00:24 d 8
test_data=test_data.groupby('entity_id')['data'].resample('S').ffill()
print (test_data)
entity_id
a 2018-01-01 00:00:00 0
2018-01-01 00:00:01 0
2018-01-01 00:00:02 0
2018-01-01 00:00:03 1
2018-01-01 00:00:04 1
2018-01-01 00:00:05 1
2018-01-01 00:00:06 2
2018-01-01 00:00:07 2
2018-01-01 00:00:08 2
2018-01-01 00:00:09 3
b 2018-01-01 00:00:12 4
2018-01-01 00:00:13 4
2018-01-01 00:00:14 4
2018-01-01 00:00:15 5
2018-01-01 00:00:16 5
2018-01-01 00:00:17 5
2018-01-01 00:00:18 6
c 2018-01-01 00:00:21 7
d 2018-01-01 00:00:24 8
Name: data, dtype: int64
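The result above is a Series with a two-level index. To get an ordinary DataFrame, as the question asks, one option is to call reset_index on it. The sketch below uses a smaller hand-made frame; the index name "timestamp" is an assumption added here, and the lowercase 's' alias is used for seconds.

```python
import pandas as pd

# Small example frame: 3-second spacing, two entities.
idx = pd.date_range("2018-01-01", periods=5, freq="3s", name="timestamp")
test_data = pd.DataFrame(
    {"entity_id": ["a", "a", "b", "b", "b"], "data": range(5)},
    index=idx,
)

# Upsample each entity to 1-second steps and forward-fill the gaps.
upsampled = test_data.groupby("entity_id")["data"].resample("s").ffill()

# Move both index levels (entity and time) back into columns.
result = upsampled.reset_index()
print(result)
```

Each entity is resampled only between its own first and last timestamps, so no rows are created outside the range in which that entity was observed.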

How to either change the date or get rid of it after using pd.to_datetime()?

I have a df that looks as follows:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 02:15:00 20.5 1 1
and their data types are:
Datum object
Dates object
Time object
Menge float64
day int64
month int64
dtype: object
I wanted to calculate a few things like the hourly average, daily average, monthly average and for that, I had to convert the types of the Dates and Time column. For that, I did:
data_nan_dropped['Dates'] = pd.to_datetime(data_nan_dropped.Dates)
data_nan_dropped.Time = pd.to_datetime(data_nan_dropped.Time, format='%H:%M:%S')
which converted my df to:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 1900-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 1900-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 1900-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 1900-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 1900-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 1900-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 1900-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 1900-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 1900-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 1900-01-01 02:15:00 20.5 1 1
Now, in the Time column, the time is converted and has the form of 1900-01-01. I don't want that.
If possible, I would like one of the following:
The Time column be converted to datetime64[ns] without the date being displayed
or
The date that is in the Datum column be displayed there instead of 1900-01-01.
How can I achieve this?
Expected output:
Datum Dates Time Menge day month
1/1/2018 0:00 2018-01-01 00:00:00 2018-01-01 00:00:00 19.5 1 1
1/1/2018 0:15 2018-01-01 00:00:00 2018-01-01 00:15:00 19.0 1 1
1/1/2018 0:30 2018-01-01 00:00:00 2018-01-01 00:30:00 19.5 1 1
1/1/2018 0:45 2018-01-01 00:00:00 2018-01-01 00:45:00 19.5 1 1
1/1/2018 1:00 2018-01-01 00:00:00 2018-01-01 01:00:00 21.0 1 1
1/1/2018 1:15 2018-01-01 00:00:00 2018-01-01 01:15:00 19.5 1 1
1/1/2018 1:30 2018-01-01 00:00:00 2018-01-01 01:30:00 20.0 1 1
1/1/2018 1:45 2018-01-01 00:00:00 2018-01-01 01:45:00 23.0 1 1
1/1/2018 2:00 2018-01-01 00:00:00 2018-01-01 02:00:00 20.5 1 1
1/1/2018 2:15 2018-01-01 00:00:00 2018-01-01 02:15:00 20.5 1 1
If I understand you correctly by looking at your expected output, we can use the Datum column to create the right Time column:
df['Dates'] = pd.to_datetime(df['Dates'])
df['Time'] = pd.to_datetime(df['Datum'], format='%d/%m/%Y %H:%M')
Datum Dates Time Menge day month
0 1/1/2018 0:00 2018-01-01 2018-01-01 00:00:00 19.5 1 1
1 1/1/2018 0:15 2018-01-01 2018-01-01 00:15:00 19.0 1 1
2 1/1/2018 0:30 2018-01-01 2018-01-01 00:30:00 19.5 1 1
3 1/1/2018 0:45 2018-01-01 2018-01-01 00:45:00 19.5 1 1
4 1/1/2018 1:00 2018-01-01 2018-01-01 01:00:00 21.0 1 1
5 1/1/2018 1:15 2018-01-01 2018-01-01 01:15:00 19.5 1 1
6 1/1/2018 1:30 2018-01-01 2018-01-01 01:30:00 20.0 1 1
7 1/1/2018 1:45 2018-01-01 2018-01-01 01:45:00 23.0 1 1
8 1/1/2018 2:00 2018-01-01 2018-01-01 02:00:00 20.5 1 1
9 1/1/2018 2:15 2018-01-01 2018-01-01 02:15:00 20.5 1 1
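If the Datum column were not available, another possible route is to add the time of day to the date as a timedelta. This is a sketch under the assumption that Time holds 'HH:MM:SS' strings (or datetime.time objects, which astype(str) turns into such strings); the three rows are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Dates": ["2018-01-01", "2018-01-01", "2018-01-01"],
    "Time": ["00:00:00", "00:15:00", "01:30:00"],
    "Menge": [19.5, 19.0, 20.0],
})

# Parse the date, then add the time of day as an offset from midnight.
df["Dates"] = pd.to_datetime(df["Dates"])
df["Time"] = df["Dates"] + pd.to_timedelta(df["Time"].astype(str))

# With a real datetime column, hourly averages are a plain groupby.
hourly_mean = df.groupby(df["Time"].dt.hour)["Menge"].mean()
print(df)
print(hourly_mean)
```

The Time column then carries the date from Dates instead of 1900-01-01, which matches the second option listed in the question.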

How do I group a time series by hour of day?

I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour. Is there a faster way?
Some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this:
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (if you want the mean or another aggregate, just change it). One note: hour+1 makes the hours start from 1 rather than 0 (remove it if you want 0 to be midnight).
Result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
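The answer above sums per date and hour. For the boxplots asked about in the question, the grouping would instead be by hour of day alone, regardless of date. The following sketch assumes the data was read as three columns (Date, Actual, Consumption), as in the answer, and uses the first six rows of the sample; the hour column is a helper name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2018-01-01"] * 6,
    "Actual": ["00:00:00", "00:15:00", "00:30:00",
               "00:45:00", "01:00:00", "01:15:00"],
    "Consumption": [47.05, 46, 44, 45, 43.5, 43.5],
})

# Combine date and time into one timestamp and extract the hour of day.
timestamp = pd.to_datetime(df["Date"] + " " + df["Actual"])
df["hour"] = timestamp.dt.hour

# Per-hour summary statistics: the quartiles a boxplot would display.
per_hour = df.groupby("hour")["Consumption"].describe()
print(per_hour)
```

If matplotlib is installed, df.boxplot(column="Consumption", by="hour") should draw one box per hour on shared axes, without selecting each hour manually.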
