I often deal with pandas DataFrames with DateTimeIndexes, where I want to - for example - select only the parts where the hour of the index = 6. The only way I currently know how to do this is with reindexing:
df.reindex(pd.date_range(*df.index.to_series().agg([min, max]).apply(lambda ts: ts.replace(hour=6)), freq="24H"))
But this is quite unreadable and complex, which gets even worse when there is a MultiIndex with multiple DateTimeIndex levels. I know of methods that use .reset_index() and then either df.where or df.loc with conditional statements, but is there a simpler way to do this with regular IndexSlicing? I tried it as follows
df.loc[df.index.min().replace(hour=6)::pd.Timedelta(24, unit="H")]
but this gives a TypeError:
TypeError: '>=' not supported between instances of 'Timedelta' and 'int'
If your index is a DatetimeIndex, you can use:
>>> df[df.index.hour == 6]
val
2022-03-01 06:00:00 7
2022-03-02 06:00:00 31
2022-03-03 06:00:00 55
2022-03-04 06:00:00 79
2022-03-05 06:00:00 103
2022-03-06 06:00:00 127
2022-03-07 06:00:00 151
2022-03-08 06:00:00 175
2022-03-09 06:00:00 199
2022-03-10 06:00:00 223
2022-03-11 06:00:00 247
2022-03-12 06:00:00 271
2022-03-13 06:00:00 295
2022-03-14 06:00:00 319
2022-03-15 06:00:00 343
2022-03-16 06:00:00 367
2022-03-17 06:00:00 391
2022-03-18 06:00:00 415
2022-03-19 06:00:00 439
2022-03-20 06:00:00 463
2022-03-21 06:00:00 487
Setup:
dti = pd.date_range('2022-3-1', '2022-3-22', freq='1H')
df = pd.DataFrame({'val': range(1, len(dti)+1)}, index=dti)
Related
I made predictions with an Arima model that predict the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts with 00:00 01-01-2021 and increases with one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
output:
pd.Series(pd.date_range(x,y,freq='H'))
Out[153]:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
Trying to extract a max value from a Pandas Dataframe with a daytime as index, I'm using .last('1W').
My data goes from the first day of month (2020-09-01 00:00:00). It seems to work properly until I reach today (monday 07/09/2020). At first I supposed that .last() takes the last days of week from starting value (sunday I guess) instead of the last 7 days (as I assumed) but, what confuses me is that if I extend the hours, the resulting dataframe shifts the first sample too...
I'm try to simulate this with:
import pandas as pd
i = pd.date_range('2020-09-01', periods=24*6+5, freq='1H')
values = range(0, 24*6+5 )
df = pd.DataFrame({'A': values}, index=i)
print(df)
print(df.last('1W'))
With output:
A
2020-09-01 00:00:00 0
2020-09-01 01:00:00 1
2020-09-01 02:00:00 2
2020-09-01 03:00:00 3
2020-09-01 04:00:00 4
... ...
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
[149 rows x 1 columns]
A
2020-09-06 05:00:00 125
2020-09-06 06:00:00 126
2020-09-06 07:00:00 127
2020-09-06 08:00:00 128
2020-09-06 09:00:00 129
2020-09-06 10:00:00 130
2020-09-06 11:00:00 131
2020-09-06 12:00:00 132
2020-09-06 13:00:00 133
2020-09-06 14:00:00 134
2020-09-06 15:00:00 135
2020-09-06 16:00:00 136
2020-09-06 17:00:00 137
2020-09-06 18:00:00 138
2020-09-06 19:00:00 139
2020-09-06 20:00:00 140
2020-09-06 21:00:00 141
2020-09-06 22:00:00 142
2020-09-06 23:00:00 143
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
Process finished with exit code 0
The first value in df is 0 at 2020-09-01 00:00:00
But,
When I try to apply last('1W'), the selection goes from 2020-09-06 05:00:00, to the last value, instead of the last 7 days... as I assumed, nor from 2020-09-06 00:00:00 if the operator works from sunday to sunday.
If you're looking for an offset of 7 days, why not use the Day offset, rather than the Week?
"1W" offset isn't the same as "7D" because "1W" starting on a Monday in a two-week dataset where the last row is Tuesday will have only 2 days. "2W" will include previous week (Monday-Sunday) + (Monday-Tuesday).
You can see the effects of changing the start day of the week by calling the offset class directly, like so:
week_offset = pd.tseries.offsets.Week(n=1, weekday=0) # week starting Monday
day_offset = pd.tseries.offsets.Day(n=7) # or simply "7D"
df.last(day_offset)
I have some time series data as:
import pandas as pd
index = pd.date_range('06/01/2014',periods=24*30,freq='H')
df1 = pd.DataFrame(range(len(index)),index=index)
Now I want to subset data of below dates
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
I tried following statement but it is not working
sub_data = df1.loc[df1.index.isin(pd.to_datetime(selec_dates))]
Where am I doing wrong? Is there any other approach to subset selected days data?
You need compare dates and for test membership use numpy.in1d:
sub_data = df1.loc[np.in1d(df1.index.date, pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
...
If want use isin, is necessary create Series with same index:
sub_data = df1.loc[pd.Series(df1.index.date, index=df1.index)
.isin(pd.to_datetime(selec_dates).date)]
print (sub_data)
a
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
...
I'm sorry and misunderstood your question
df1[pd.Series(df1.index.date, index=df1.index).isin(pd.to_datetime(selec_dates).date)]
Should perform what was needed
original answer
Please check the pandas documentation on selection
You can easily do
sub_data = df1.loc[pd.to_datetime(selec_dates)]
You can use .query() method:
In [202]: df1.query('#index.normalize() in #selec_dates')
Out[202]:
0
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
... ...
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
Edit: I have been made aware this only works if you are working with a daterange in the same month and year as in your query. For a more general (and better answer) see #jezrael solution.
You can use np.in1d and .day on your index if you wanted to do it as you tried:
selec_dates = ['2014-06-10','2014-06-15','2014-06-20']
df1.loc[np.in1d(df1.index.day, (pd.to_datetime(selec_dates).day))]
This gives you as you require:
2014-06-10 00:00:00 216
2014-06-10 01:00:00 217
2014-06-10 02:00:00 218
2014-06-10 03:00:00 219
2014-06-10 04:00:00 220
2014-06-10 05:00:00 221
2014-06-10 06:00:00 222
2014-06-10 07:00:00 223
2014-06-10 08:00:00 224
2014-06-10 09:00:00 225
2014-06-10 10:00:00 226
2014-06-10 11:00:00 227
2014-06-10 12:00:00 228
2014-06-10 13:00:00 229
2014-06-10 14:00:00 230
2014-06-10 15:00:00 231
2014-06-10 16:00:00 232
2014-06-10 17:00:00 233
2014-06-10 18:00:00 234
2014-06-10 19:00:00 235
2014-06-10 20:00:00 236
2014-06-10 21:00:00 237
2014-06-10 22:00:00 238
2014-06-10 23:00:00 239
2014-06-15 00:00:00 336
2014-06-15 01:00:00 337
2014-06-15 02:00:00 338
2014-06-15 03:00:00 339
2014-06-15 04:00:00 340
2014-06-15 05:00:00 341
...
2014-06-15 18:00:00 354
2014-06-15 19:00:00 355
2014-06-15 20:00:00 356
2014-06-15 21:00:00 357
2014-06-15 22:00:00 358
2014-06-15 23:00:00 359
2014-06-20 00:00:00 456
2014-06-20 01:00:00 457
2014-06-20 02:00:00 458
2014-06-20 03:00:00 459
2014-06-20 04:00:00 460
2014-06-20 05:00:00 461
2014-06-20 06:00:00 462
2014-06-20 07:00:00 463
2014-06-20 08:00:00 464
2014-06-20 09:00:00 465
2014-06-20 10:00:00 466
2014-06-20 11:00:00 467
2014-06-20 12:00:00 468
2014-06-20 13:00:00 469
2014-06-20 14:00:00 470
2014-06-20 15:00:00 471
2014-06-20 16:00:00 472
2014-06-20 17:00:00 473
2014-06-20 18:00:00 474
2014-06-20 19:00:00 475
2014-06-20 20:00:00 476
2014-06-20 21:00:00 477
2014-06-20 22:00:00 478
2014-06-20 23:00:00 479
[72 rows x 1 columns]
I used these Sources for this answer:
- Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps
- In Python-Pandas, How can I subset a dataframe by specific datetime index values?
- return pandas DF column with the number of days elapsed between index and today's date
- Get weekday/day-of-week for Datetime column of DataFrame
- https://stackoverflow.com/a/36893416/2254228
Use the string repr of the date, leaving out the time periods in the day.
pd.concat([df1['2014-06-10'] , df1['2014-06-15'], df1['2014-06-20']])
I have a DataFrame with data similar to the following
import pandas as pd; import numpy as np; import datetime; from datetime import timedelta;
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
aggregates.append(
(
start_time,
df.between_time(start_time.time(),
(start_time + timedelta(minutes=5)).time(),
include_end=False).value.mean()
)
)
start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to '5 min' intervals and then group on the time attribute and aggregate the mean
I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up interval values for each user_id in the dataframe and I tried to calculate it in many ways as:
1)df["duration"]= df.groupby('user_id')['interval'].apply (lambda x: x.sum())
2)df ["duration"]= df.groupby('user_id').aggregate (np.sum)
3)df ["duration"]= df.groupby('user_id').agg (np.sum)
but none of them work and the value of the duration will be NaT after running the codes.
UPDATE: you can use transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...