I have a dataset of 15-minute observations for different stations spanning 20 years. I want to know the time range over which each station has data.
|station_id | start_time | end_time | observation
| 2 | 2000-01-02 01:00:00 | 2000-01-02 01:15:00 | 50
| 2 | 2000-01-02 01:15:00 | 2000-01-02 01:30:00 | 15
| 2 | 2000-02-02 01:30:00 | 2000-01-02 01:45:00 | 3
| 3 | 2000-01-02 05:00:00 | 2000-01-02 05:15:00 | 10
| 3 | 2000-01-02 05:15:00 | 2000-01-02 05:30:00 | 2
| 3 | 2000-02-03 01:00:00 | 2000-01-02 01:15:00 | 15
| 3 | 2000-02-04 01:00:00 | 2000-01-02 01:15:00 | 20
An example of the output I want:
|station_id | start | end | years | days
| 2 | 2000-01-02 01:00:00 | 2000-01-02 01:45:00 | 1 | 1
| 3 | 2000-01-02 05:00:00 | 2000-01-02 01:15:00 | 1 | 1
Try using groupby, diff, abs, agg and assign:
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)
x = df.groupby('station_id').agg({'start_time': 'first', 'end_time': 'last'})
temp = x.diff(axis=1).abs()['end_time']
x = x.assign(years=temp.dt.days // 365, days=temp.dt.days % 365).reset_index()
print(x)
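If the rows within a station are not guaranteed to be in chronological order (the sample above has some out-of-order dates), 'min'/'max' may be safer than 'first'/'last'. A small self-contained sketch with made-up rows:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    'station_id': [2, 2, 3, 3],
    'start_time': ['2000-01-02 01:00:00', '2000-01-02 01:15:00',
                   '2000-01-02 05:00:00', '2000-01-02 05:15:00'],
    'end_time': ['2000-01-02 01:15:00', '2000-01-02 01:30:00',
                 '2000-01-02 05:15:00', '2000-01-02 05:30:00'],
})
df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime)

# min/max do not depend on row order within each station.
x = df.groupby('station_id').agg(start=('start_time', 'min'),
                                 end=('end_time', 'max'))
span = x['end'] - x['start']
x = x.assign(years=span.dt.days // 365,
             days=span.dt.days % 365).reset_index()
print(x)
```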
I have a dataframe as follows:
Datetime Value
--------------------------------------------
2000-01-01 15:00:00 10
2000-01-01 16:00:00 12
2000-01-01 17:00:00 14
2000-01-01 18:00:00 16
2000-01-02 15:00:00 13
2000-01-02 16:00:00 18
2000-01-02 17:00:00 16
2000-01-02 18:00:00 15
--------------------------------------------
I want to get a column containing, for each day, the difference of each value from the value at a specific time that day (let's say 16:00:00), as follows:
Datetime Value NewColumn
--------------------------------------------
2000-01-01 15:00:00 10 -
2000-01-01 16:00:00 12 0
2000-01-01 17:00:00 14 2
2000-01-01 18:00:00 16 4
2000-01-02 15:00:00 13 -
2000-01-02 16:00:00 18 0
2000-01-02 17:00:00 16 -2
2000-01-02 18:00:00 15 -3
--------------------------------------------
I have tried the following code:
df['NewColumn'] = df.groupby('Datetime')['Value'].apply(lambda x: x - df.loc[(df['Datetime'].dt.time == dt.time(hour=16)), 'Value'])
but it raises:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
How should I write my code instead?
IIUC, this is what you need.
df['Datetime'] = pd.to_datetime(df['Datetime'])
df['NewColumn'] = (df.groupby(pd.Grouper(freq='D', key='Datetime'))['Value']
                     .apply(lambda x: x - df.loc[x.loc[df['Datetime'].dt.hour == 16].index[0], 'Value']))
df.loc[df['Datetime'].dt.hour < 16, 'NewColumn'] = '-'
print(df)
Output
Datetime Value NewColumn
0 2000-01-01 15:00:00 10 -
1 2000-01-01 16:00:00 12 0
2 2000-01-01 17:00:00 14 2
3 2000-01-01 18:00:00 16 4
4 2000-01-02 15:00:00 13 -
5 2000-01-02 16:00:00 18 0
6 2000-01-02 17:00:00 16 -2
7 2000-01-02 18:00:00 15 -3
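A possible variant, assuming the same sample frame: build the per-day 16:00 reference with a groupby transform instead of indexing back into df. This keeps the new column numeric (NaN before 16:00 instead of the string '-'):

```python
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2000-01-01 15:00', '2000-01-01 16:00',
                                '2000-01-01 17:00', '2000-01-01 18:00',
                                '2000-01-02 15:00', '2000-01-02 16:00',
                                '2000-01-02 17:00', '2000-01-02 18:00']),
    'Value': [10, 12, 14, 16, 13, 18, 16, 15],
})

# For each day, broadcast that day's 16:00 value across all rows, then subtract.
ref = (df['Value'].where(df['Datetime'].dt.hour == 16)
         .groupby(df['Datetime'].dt.date).transform('first'))
df['NewColumn'] = df['Value'] - ref
# Rows before 16:00 have no meaningful difference; leave them as NaN.
df.loc[df['Datetime'].dt.hour < 16, 'NewColumn'] = float('nan')
print(df)
```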
I have this data, where "opid" is categorical:
datetime id nut opid user amount
2018-01-01 07:01:00 1531 3hrnd 1 mherrera 1
2018-01-01 07:05:00 9510 sd45f 1 svasqu 1
2018-01-01 07:06:00 8125 5s8fr 15 urubi 1
2018-01-01 07:08:15 6324 sd5d6 1 jgonza 1
2018-01-01 07:12:01 0198 tgfg5 1 julmaf 1
2018-01-01 07:13:50 6589 mbkg4 15 jdjiep 1
2018-01-01 07:16:10 9501 wurf4 15 polga 1
The result I'm looking for is something like this:
datetime opid amount
2018-01-01 07:00:00 1 3
2018-01-01 07:00:00 15 1
2018-01-01 07:10:00 1 1
2018-01-01 07:10:00 15 2
So basically I need to know how many of each "opid" occur every 10 minutes.
P.S.: "amount" is always 1, and "opid" ranges from 1 to 15.
Using pd.Grouper:
df.set_index('datetime').groupby(['opid', pd.Grouper(freq='10min')]).amount.sum()
opid datetime
1 2018-01-01 07:00:00 3
2018-01-01 07:10:00 1
15 2018-01-01 07:00:00 1
2018-01-01 07:10:00 2
Name: amount, dtype: int64
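To get the flat shape shown in the desired output (one row per 10-minute bin and opid), add a reset_index at the end. A self-contained sketch on the sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-01-01 07:01:00', '2018-01-01 07:05:00',
                                '2018-01-01 07:06:00', '2018-01-01 07:08:15',
                                '2018-01-01 07:12:01', '2018-01-01 07:13:50',
                                '2018-01-01 07:16:10']),
    'opid': [1, 1, 15, 1, 1, 15, 15],
    'amount': [1] * 7,
})

# Sum "amount" per opid within each 10-minute bin, then flatten.
out = (df.set_index('datetime')
         .groupby(['opid', pd.Grouper(freq='10min')])
         .amount.sum()
         .reset_index()
         .sort_values(['datetime', 'opid'])
         .reset_index(drop=True))
print(out)
```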
I was wondering if there's a better approach to combining two DataFrames than what I did below.
import numpy as np
import pandas as pd

# create a random data set
N = 50
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=N, freq='H'),
                   'value': np.random.random(N)})

index = pd.DatetimeIndex(df['date'])
peak_time = df.iloc[index.indexer_between_time('7:00', '9:00')]
lunch_time = df.iloc[index.indexer_between_time('12:00', '14:00')]
comb_data = pd.concat([peak_time, lunch_time], ignore_index=True)
Is there a way to combine the two time ranges with a logical operator when using between_time?
I also need this to make a new column in df called 'isPeak', which should be 1 when the time is in either range (7:00 to 9:00 or 12:00 to 14:00) and 0 otherwise.
np.union1d works for me:
import numpy as np
idx = np.union1d(index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00'))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
Alternative with numpy.r_:
idx = np.r_[index.indexer_between_time('7:00','9:00'),
index.indexer_between_time('12:00','14:00')]
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
A pure pandas solution with Index.union, converting the arrays to an Index first:
idx = (pd.Index(index.indexer_between_time('7:00','9:00'))
.union(pd.Index(index.indexer_between_time('12:00','14:00'))))
comb_data = df.iloc[idx]
print (comb_data)
date value
7 2000-01-01 07:00:00 0.760627
8 2000-01-01 08:00:00 0.236474
9 2000-01-01 09:00:00 0.626146
12 2000-01-01 12:00:00 0.625335
13 2000-01-01 13:00:00 0.793105
14 2000-01-01 14:00:00 0.706873
31 2000-01-02 07:00:00 0.113688
32 2000-01-02 08:00:00 0.035565
33 2000-01-02 09:00:00 0.230603
36 2000-01-02 12:00:00 0.423155
37 2000-01-02 13:00:00 0.947584
38 2000-01-02 14:00:00 0.226181
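None of the variants above build the 'isPeak' column the question also asks for. One way is to reuse the np.union1d positions to flag the matching rows (a sketch on the same random setup):

```python
import numpy as np
import pandas as pd

N = 50
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=N, freq='h'),
                   'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])

# Positions falling in either window (both endpoints inclusive by default).
idx = np.union1d(index.indexer_between_time('7:00', '9:00'),
                 index.indexer_between_time('12:00', '14:00'))

# 1 where the timestamp is in a window, 0 elsewhere.
df['isPeak'] = 0
df.loc[df.index[idx], 'isPeak'] = 1
print(df.head(15))
```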
Let's say I have a series of instantaneous temperature measurements (i.e. they capture the temperature at an exact moment in time).
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
Out[130]:
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
I want to get an average of daily temperature. The problem is that I want to include 00:00:00 from both the current day and the next day in the average for the current day. For example, I want to average 2000-01-01 00:00:00 through 2000-01-02 00:00:00 inclusive. The pandas resample function will not include 2000-01-02 00:00:00 in the bin because it falls on a different day.
I would imagine this situation comes up often when dealing with instantaneous measurements that need to be resampled. What's the solution?
setup
index = pd.date_range('1/1/2000', periods=9, freq='6H')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00 0
2000-01-01 06:00:00 1
2000-01-01 12:00:00 2
2000-01-01 18:00:00 3
2000-01-02 00:00:00 4
2000-01-02 06:00:00 5
2000-01-02 12:00:00 6
2000-01-02 18:00:00 7
2000-01-03 00:00:00 8
Freq: 6H, dtype: int64
solution
Five consecutive 6-hourly observations span exactly 24 hours inclusive, so a rolling mean with a window of 5 computes the desired average; resampling daily and taking the first value of each day then keeps only the means whose window ends at midnight:
series.rolling(5).mean().resample('D').first()
2000-01-01 NaN
2000-01-02 2.0
2000-01-03 6.0
Freq: D, dtype: float64
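Each value above is labelled by the day on which its 24-hour window ends (so 2.0, the average over 2000-01-01, appears under 2000-01-02). If you would rather label each mean by the day it covers, shifting the index back one day is one option (a sketch):

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='6h')
series = pd.Series(range(9), index=index)

# Five 6-hourly points span 24 hours inclusive; the rolling mean is
# labelled at the window's right edge (the next midnight).
daily = series.rolling(5).mean().resample('D').first()

# Relabel each mean by the day it covers instead of the day it ends on.
daily = daily.shift(-1, freq='D').dropna()
print(daily)
```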
I want to resample the following pandas Series
import pandas as pd

index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])
print(series)
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series contains only every second entry, i.e.
2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()
print(resampled_series)
I get
1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not present in the original series? How can I get my desired result?
resample() is not the right tool here: it aggregates into a regular grid of time bins covering the whole span, rather than selecting existing rows. Try this:
series[series.index.minute % 2 == 0]
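A self-contained version of that selection (a sketch): boolean indexing only keeps rows that already exist, which is exactly why it avoids the NaN-filled 2-minute grid that resample produced:

```python
import pandas as pd

index_1 = pd.date_range('1/1/2000', periods=4, freq='min')
index_2 = pd.date_range('1/2/2000', periods=3, freq='min')
series = pd.concat([pd.Series(range(4), index=index_1),
                    pd.Series(range(3), index=index_2)])

# Keep only the timestamps whose minute is even; no new bins are created.
result = series[series.index.minute % 2 == 0]
print(result)
```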