I have a dataset of 15-minute observations for different stations spanning 20 years. I want to know the time range over which each station has data.
|station_id | start_time | end_time | observation
| 2 | 2000-01-02 01:00:00 | 2000-01-02 01:15:00 | 50
| 2 | 2000-01-02 01:15:00 | 2000-01-02 01:30:00 | 15
| 2 | 2000-02-02 01:30:00 | 2000-01-02 01:45:00 | 3
| 3 | 2000-01-02 05:00:00 | 2000-01-02 05:15:00 | 10
| 3 | 2000-01-02 05:15:00 | 2000-01-02 05:30:00 | 2
| 3 | 2000-02-03 01:00:00 | 2000-01-02 01:15:00 | 15
| 3 | 2000-02-04 01:00:00 | 2000-01-02 01:15:00 | 20
An example of the output I want:
|station_id | start | end | years | days
| 2 | 2000-01-02 01:00:00 | 2000-01-02 01:45:00 | 1 | 1
| 3 | 2000-01-02 05:00:00 | 2000-01-02 01:15:00 | 1 | 1
Try using groupby, diff, abs, agg and assign:
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)
x = df.groupby('station_id').agg({'start_time': 'first', 'end_time': 'last'})
temp = x.diff(axis=1).abs()['end_time']
x = x.assign(years=temp.dt.days // 365, days=temp.dt.days % 365).reset_index()
print(x)
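If the rows within a station are not guaranteed to be in chronological order (the sample above has some out-of-order dates), 'min'/'max' may be safer than 'first'/'last'. A small self-contained sketch with made-up rows:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    'station_id': [2, 2, 3, 3],
    'start_time': ['2000-01-02 01:00:00', '2000-01-02 01:15:00',
                   '2000-01-02 05:00:00', '2000-01-02 05:15:00'],
    'end_time': ['2000-01-02 01:15:00', '2000-01-02 01:30:00',
                 '2000-01-02 05:15:00', '2000-01-02 05:30:00'],
})
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)

# min/max do not depend on row order within each station.
x = df.groupby('station_id').agg(start=('start_time', 'min'),
                                 end=('end_time', 'max'))
span = x['end'] - x['start']
x = x.assign(years=span.dt.days // 365,
             days=span.dt.days % 365).reset_index()
print(x)
```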
I have a dataframe as follows:
Datetime Value
--------------------------------------------
2000-01-01 15:00:00 10
2000-01-01 16:00:00 12
2000-01-01 17:00:00 14
2000-01-01 18:00:00 16
2000-01-02 15:00:00 13
2000-01-02 16:00:00 18
2000-01-02 17:00:00 16
2000-01-02 18:00:00 15
--------------------------------------------
I want to get a column containing, for each day, the difference of each value from the value at a specific time that day (let's say 16:00:00), as follows:
Datetime Value NewColumn
--------------------------------------------
2000-01-01 15:00:00 10 -
2000-01-01 16:00:00 12 0
2000-01-01 17:00:00 14 2
2000-01-01 18:00:00 16 4
2000-01-02 15:00:00 13 -
2000-01-02 16:00:00 18 0
2000-01-02 17:00:00 16 -2
2000-01-02 18:00:00 15 -3
--------------------------------------------
I have tried the following code:
df['NewColumn'] = df.groupby('Datetime')['Value'].apply(lambda x: x - df.loc[(df['Datetime'].dt.time == dt.time(hour=16)), 'Value'])
but it raises:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
How should I write my code instead?
IIUC, this is what you need.
df['Datetime'] = pd.to_datetime(df['Datetime'])
df['NewColumn'] = (df.groupby(pd.Grouper(freq='D', key='Datetime'))['Value']
                     .apply(lambda x: x - df.loc[x.loc[df['Datetime'].dt.hour == 16].index[0], 'Value']))
df.loc[df['Datetime'].dt.hour < 16, 'NewColumn'] = '-'
print(df)
Output
Datetime Value NewColumn
0 2000-01-01 15:00:00 10 -
1 2000-01-01 16:00:00 12 0
2 2000-01-01 17:00:00 14 2
3 2000-01-01 18:00:00 16 4
4 2000-01-02 15:00:00 13 -
5 2000-01-02 16:00:00 18 0
6 2000-01-02 17:00:00 16 -2
7 2000-01-02 18:00:00 15 -3
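A possible variant, assuming the same sample frame: build the per-day 16:00 reference with a groupby transform instead of indexing back into df. This keeps the new column numeric (NaN before 16:00 instead of the string '-'):

```python
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2000-01-01 15:00', '2000-01-01 16:00',
                                '2000-01-01 17:00', '2000-01-01 18:00',
                                '2000-01-02 15:00', '2000-01-02 16:00',
                                '2000-01-02 17:00', '2000-01-02 18:00']),
    'Value': [10, 12, 14, 16, 13, 18, 16, 15],
})

# For each day, broadcast that day's 16:00 value across all rows, then subtract.
ref = (df['Value'].where(df['Datetime'].dt.hour == 16)
         .groupby(df['Datetime'].dt.date).transform('first'))
df['NewColumn'] = df['Value'] - ref
# Rows before 16:00 have no meaningful difference; leave them as NaN.
df.loc[df['Datetime'].dt.hour < 16, 'NewColumn'] = float('nan')
print(df)
```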
I have this data, where "opid" is categorical:
datetime id nut opid user amount
2018-01-01 07:01:00 1531 3hrnd 1 mherrera 1
2018-01-01 07:05:00 9510 sd45f 1 svasqu 1
2018-01-01 07:06:00 8125 5s8fr 15 urubi 1
2018-01-01 07:08:15 6324 sd5d6 1 jgonza 1
2018-01-01 07:12:01 0198 tgfg5 1 julmaf 1
2018-01-01 07:13:50 6589 mbkg4 15 jdjiep 1
2018-01-01 07:16:10 9501 wurf4 15 polga 1
The result I'm looking for is something like this:
datetime opid amount
2018-01-01 07:00:00 1 3
2018-01-01 07:00:00 15 1
2018-01-01 07:10:00 1 1
2018-01-01 07:10:00 15 2
So basically I need to know how many of each "opid" occur every 10 minutes.
P.S.: "amount" is always 1, and "opid" ranges from 1 to 15.
Using pd.Grouper:
df.set_index('datetime').groupby(['opid', pd.Grouper(freq='10min')]).amount.sum()
opid datetime
1 2018-01-01 07:00:00 3
2018-01-01 07:10:00 1
15 2018-01-01 07:00:00 1
2018-01-01 07:10:00 2
Name: amount, dtype: int64
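To get the flat shape shown in the desired output (one row per 10-minute bin and opid), add a reset_index at the end. A self-contained sketch on the sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-01-01 07:01:00', '2018-01-01 07:05:00',
                                '2018-01-01 07:06:00', '2018-01-01 07:08:15',
                                '2018-01-01 07:12:01', '2018-01-01 07:13:50',
                                '2018-01-01 07:16:10']),
    'opid': [1, 1, 15, 1, 1, 15, 15],
    'amount': [1] * 7,
})

# Sum "amount" per opid within each 10-minute bin, then flatten.
out = (df.set_index('datetime')
         .groupby(['opid', pd.Grouper(freq='10min')])
         .amount.sum()
         .reset_index()
         .sort_values(['datetime', 'opid'])
         .reset_index(drop=True))
print(out)
```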
I was wondering if there's a better approach to combining two DataFrames than what I did below.
import numpy as np
import pandas as pd

# create a random data set
N = 50
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=N, freq='H'),
                   'value': np.random.random(N)})

index = pd.DatetimeIndex(df['date'])
peak_time = df.iloc[index.indexer_between_time('7:00', '9:00')]
lunch_time = df.iloc[index.indexer_between_time('12:00', '14:00')]
comb_data = pd.concat([peak_time, lunch_time], ignore_index=True)
Is there a way to combine the two time ranges with a logical operator when using between_time?
I also need this to make a new column in df called 'isPeak', which should be 1 when the time is in either range (7:00 to 9:00 or 12:00 to 14:00) and 0 otherwise.
np.union1d works for me:
import numpy as np
idx = np.union1d(index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00'))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
Alternative with numpy.r_:
idx = np.r_[index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00')]
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
A pure pandas solution with Index.union, converting the arrays to an Index first:
idx = (pd.Index(index.indexer_between_time('7:00','9:00'))
.union(pd.Index(index.indexer_between_time('12:00','14:00'))))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
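None of the variants above build the 'isPeak' column the question also asks for. One way is to reuse the np.union1d positions to flag the matching rows (a sketch on the same random setup):

```python
import numpy as np
import pandas as pd

N = 50
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=N, freq='h'),
                   'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])

# Positions falling in either window (both endpoints inclusive by default).
idx = np.union1d(index.indexer_between_time('7:00', '9:00'),
                 index.indexer_between_time('12:00', '14:00'))

# 1 where the timestamp is in a window, 0 elsewhere.
df['isPeak'] = 0
df.loc[df.index[idx], 'isPeak'] = 1
print(df.head(15))
```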
Let's say I have a series of instantaneous temperature measurements (i.e. they capture the temperature at an exact moment in time).
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
Out[130]:
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
I want to get an average of daily temperature. The problem is that I want to include 00:00:00 from both the current day and the next day in the average for the current day. For example, I want to average 2000-01-01 00:00:00 through 2000-01-02 00:00:00 inclusive. The pandas resample function will not include 2000-01-02 00:00:00 in the bin because it falls on a different day.
I would imagine this situation comes up often when dealing with instantaneous measurements that need to be resampled. What's the solution?
setup
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
solution
Five consecutive 6-hourly observations span exactly 24 hours inclusive, so a rolling mean with a window of 5 computes the desired average; resampling daily and taking the first value of each day then keeps only the means whose window ends at midnight:
series.rolling(5).mean().resample('D').first()
2000-01-01 NaN
2000-01-02 2.0
2000-01-03 6.0
Freq: D, dtype: float64
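Each value above is labelled by the day on which its 24-hour window ends (so 2.0, the average over 2000-01-01, appears under 2000-01-02). If you would rather label each mean by the day it covers, shifting the index back one day is one option (a sketch):

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='6h')
series = pd.Series(range(9), index=index)

# Five 6-hourly points span 24 hours inclusive; the rolling mean is
# labelled at the window's right edge (the next midnight).
daily = series.rolling(5).mean().resample('D').first()

# Relabel each mean by the day it covers instead of the day it ends on.
daily = daily.shift(-1, freq='D').dropna()
print(daily)
```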
I want to resample the following pandas Series
import pandas as pd

index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])
print(series)
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series contains only every second entry, i.e.
2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()
print(resampled_series)
I get
1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not present in the original series? How can I get my desired result?
resample() is not the right tool here: it aggregates into a regular grid of time bins covering the whole span, rather than selecting existing rows. Try this:
series[series.index.minute % 2 == 0]
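A self-contained version of that selection (a sketch): boolean indexing only keeps rows that already exist, which is exactly why it avoids the NaN-filled 2-minute grid that resample produced:

```python
import pandas as pd

index_1 = pd.date_range('1/1/2000', periods=4, freq='min')
index_2 = pd.date_range('1/2/2000', periods=3, freq='min')
series = pd.concat([pd.Series(range(4), index=index_1),
                    pd.Series(range(3), index=index_2)])

# Keep only the timestamps whose minute is even; no new bins are created.
result = series[series.index.minute % 2 == 0]
print(result)
```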