I would like to get the 07h00 value every day, from a multiday DataFrame that has 24 hours of minute data in it each day.
import numpy as np
import pandas as pd
aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods = 10000, freq = "1min")
aframe.head()
Out[174]:
0 1
2015-09-01 00:00:00 0 0
2015-09-01 00:01:00 1 2
2015-09-01 00:02:00 2 4
2015-09-01 00:03:00 3 6
2015-09-01 00:04:00 4 8
aframe.tail()
Out[175]:
0 1
2015-09-07 22:35:00 9995 19990
2015-09-07 22:36:00 9996 19992
2015-09-07 22:37:00 9997 19994
2015-09-07 22:38:00 9998 19996
2015-09-07 22:39:00 9999 19998
In this 10 000 row DataFrame spanning 7 days, how would I get the 7am value each day as efficiently as possible? Assume I might have to do this for very large tick databases so I value speed and low memory usage highly.
I know I can index with strings such as:
aframe.ix["2015-09-02 07:00:00"]
Out[176]:
0 1860
1 3720
Name: 2015-09-02 07:00:00, dtype: int64
But what I need is basically a wildcard-style query, for example:
aframe.ix["* 07:00:00"]
You can use indexer_at_time:
>>> locs = aframe.index.indexer_at_time('7:00:00')
>>> aframe.iloc[locs]
0 1
2015-09-01 07:00:00 420 840
2015-09-02 07:00:00 1860 3720
2015-09-03 07:00:00 3300 6600
2015-09-04 07:00:00 4740 9480
2015-09-05 07:00:00 6180 12360
2015-09-06 07:00:00 7620 15240
2015-09-07 07:00:00 9060 18120
There's also indexer_between_time if you need to select all indices that lie between two particular times of day.
Both of these methods return the integer locations of the desired values; the corresponding rows of the Series or DataFrame can be fetched with iloc, as shown above.
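As a rough sketch (the 07:00–07:05 window below is only an illustration, not from the question), the between-time variant and the equivalent DataFrame convenience methods look like this:
# integer locations of rows whose time of day falls between 07:00 and 07:05 (inclusive)
locs = aframe.index.indexer_between_time('7:00', '7:05')
window = aframe.iloc[locs]
# DataFrames with a DatetimeIndex also expose these directly
at_seven = aframe.at_time('7:00')               # same rows as the indexer_at_time example
between = aframe.between_time('7:00', '7:05')   # same rows as `window` above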
Related
I currently have a list of tuples that look like this:
time_constraints = [
('001', '01/01/2020 10:00 AM', '01/01/2020 11:00 AM'),
('001', '01/03/2020 05:00 AM', '01/03/2020 06:00 AM'),
...
('999', '01/07/2020 07:00 AM', '01/07/2020 08:00 AM')
]
where:
each tuple contains an id, lower_bound, and upper_bound
none of the time frames overlap for a given id
len(time_constraints) can be on the order of 10^4 to 10^5.
My goal is to quickly and efficiently filter a relatively large (millions of rows) Pandas dataframe (df) to include only the rows that match on the id column and fall between the specified lower_bound and upper_bound times (inclusive).
My current plan is to do this:
import pandas as pd
output = []
for i, lower, upper in time_constraints:
    indices = list(df.loc[(df['id'] == i) & (df['timestamp'] >= lower) & (df['timestamp'] <= upper), ].index)
    output.extend(indices)

output_df = df.loc[df.index.isin(output), ].copy()
However, using a for-loop isn't ideal. I was wondering if there was a better solution (ideally vectorized) using Pandas or NumPy arrays that would be faster.
Edited:
Here's some sample rows of df:
id    timestamp
1     01/01/2020 9:56 AM
1     01/01/2020 10:32 AM
1     01/01/2020 10:36 AM
2     01/01/2020 9:42 AM
2     01/01/2020 9:57 AM
2     01/01/2020 10:02 AM
I already answered a similar case. To test, I used 100,000 constraints (tc) and 5,000,000 records (df).
Is this what you expect?
>>> df
id timestamp
0 565 2020-08-16 05:40:55
1 477 2020-04-05 22:21:40
2 299 2020-02-22 04:54:34
3 108 2020-08-17 23:54:02
4 041 2020-09-10 10:01:31
... ... ...
4999995 892 2020-12-27 16:16:35
4999996 373 2020-08-29 05:44:34
4999997 659 2020-05-23 20:48:15
4999998 858 2020-09-08 22:58:20
4999999 710 2020-04-10 08:03:14
[5000000 rows x 2 columns]
>>> tc
id lower_bound upper_bound
0 000 2020-01-01 00:00:00 2020-01-04 14:00:00
1 000 2020-01-04 15:00:00 2020-01-08 05:00:00
2 000 2020-01-08 06:00:00 2020-01-11 20:00:00
3 000 2020-01-11 21:00:00 2020-01-15 11:00:00
4 000 2020-01-15 12:00:00 2020-01-19 02:00:00
... ... ... ...
99995 999 2020-12-10 09:00:00 2020-12-13 23:00:00
99996 999 2020-12-14 00:00:00 2020-12-17 14:00:00
99997 999 2020-12-17 15:00:00 2020-12-21 05:00:00
99998 999 2020-12-21 06:00:00 2020-12-24 20:00:00
99999 999 2020-12-24 21:00:00 2020-12-28 11:00:00
[100000 rows x 3 columns]
# from tqdm import tqdm
from itertools import chain
# df = pd.DataFrame(data, columns=['id', 'timestamp'])
tc = pd.DataFrame(time_constraints, columns=['id', 'lower_bound', 'upper_bound'])
g1 = df.groupby('id')
g2 = tc.groupby('id')
indexes = []
# for id_ in tqdm(tc['id'].unique()):
for id_ in tc['id'].unique():
    df1 = g1.get_group(id_)
    df2 = g2.get_group(id_)
    ii = pd.IntervalIndex.from_tuples(list(zip(df2['lower_bound'],
                                               df2['upper_bound'])),
                                      closed='both')
    indexes.append(pd.cut(df1['timestamp'], bins=ii).dropna().index)

out = df.loc[chain.from_iterable(indexes)]
Performance:
100%|█████████████████████████████████████████████████| 1000/1000 [00:17<00:00, 58.40it/s]
Output result:
>>> out
id timestamp
1326 000 2020-11-10 05:51:00
1685 000 2020-10-07 03:12:48
2151 000 2020-05-08 11:11:18
2246 000 2020-07-06 07:36:57
3995 000 2020-02-02 04:39:11
... ... ...
4996406 999 2020-02-19 15:27:06
4996684 999 2020-02-05 11:13:56
4997408 999 2020-07-09 09:31:31
4997896 999 2020-04-10 03:26:13
4999674 999 2020-04-21 22:57:04
[4942976 rows x 2 columns] # 57024 records filtered
You can use boolean indexing, like this:
output_df = df[pd.Series(list(zip(df['id'],
                                  df['lower_bound'],
                                  df['upper_bound']))).isin(time_constraints)]
The zip function creates a tuple from each row of the three columns, which is then compared against your list of tuples. pd.Series is used to build a Boolean series for the indexing.
I want the time without the date in Pandas.
I want to keep the time as dtype datetime64[ns] and not as an object so that I can determine periods between times.
The closest I have gotten is as follows, but it gives back the date in a new column, not the time as dtype datetime that I need.
df_pres_mf['time'] = pd.to_datetime(df_pres_mf['time'], format ='%H:%M', errors = 'coerce') # returns date (1900-01-01) and actual time as a dtype datetime64[ns] format
df_pres_mf['just_time'] = df_pres_mf['time'].dt.date
df_pres_mf['normalised_time'] = df_pres_mf['time'].dt.normalize()
df_pres_mf.head()
Returns the date as 1900-01-01 and not the time that is needed.
Edit: Data
time
1900-01-01 11:16:00
1900-01-01 15:20:00
1900-01-01 09:55:00
1900-01-01 12:01:00
You could do it like Vishnudev suggested but then you would have dtype: object (or even strings, after using dt.strftime), which you said you didn't want.
What you are looking for doesn't exist, but the closest thing that I can get you is converting to timedeltas, which won't seem like a solution at first but is actually very useful.
Convert it like this:
# sample df
df
>>
time
0 2021-02-07 09:22:00
1 2021-05-10 19:45:00
2 2021-01-14 06:53:00
3 2021-05-27 13:42:00
4 2021-01-18 17:28:00
df["timed"] = df.time - df.time.dt.normalize()
df
>>
time timed
0 2021-02-07 09:22:00 0 days 09:22:00 # this is just the time difference
1 2021-05-10 19:45:00 0 days 19:45:00 # since midnight, which is essentially the
2 2021-01-14 06:53:00 0 days 06:53:00 # same thing as regular time, except
3 2021-05-27 13:42:00 0 days 13:42:00 # that you can go over 24 hours
4 2021-01-18 17:28:00 0 days 17:28:00
this allows you to calculate periods between times like this:
# subtract the previous time from the current one
df["difference"] = df.timed - df.timed.shift()
df
Out[48]:
time timed difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 # <-- this is because the last
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 # time was later than the current
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 # (see below)
To get rid of negative differences, take the absolute value:
df["abs_difference"] = df.difference.abs()
df
>>
time timed difference abs_difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 0 days 12:52:00 ### <<--
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 0 days 06:49:00
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 0 days 03:46:00
Use the proper format string for your date format and convert to datetime:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
Then format it to the preferred output format:
df['time'].dt.strftime('%H:%M')
Output
0 11:16
1 15:20
2 09:55
3 12:01
Name: time, dtype: object
I have time series data for air pollution with several missing gaps, like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of single NAs or runs of several consecutive NAs, and here are some helpful statistics computed in R:
Length of time series: 87648
Number of Missing Values: 746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NAs): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1), the first value of any longer gap gets interpolated as well, not just the single-NA gaps (see the small illustration below).
So I guess a better way to interpolate only the gaps with a single missing value is to get the gap id first.
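A tiny, made-up illustration of that limit=1 behaviour (not from my data): the first NA of a two-wide gap is still filled, which is exactly the problem:
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, 4])
s.interpolate(limit=1)   # -> 1.0, 2.0, NaN, 4.0: the first NA of the 2-wide gap was filled too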
To do so, I labelled the rows belonging to each gap with the following code:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the original index was lost.
Are there better solutions? Or how can I interpolate case by case?
Can anyone help me?
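For what it's worth, a minimal sketch of one way to recover the first missing timestamp and the gap size from the cumsum labels above (column name PM2.5 and a DatetimeIndex assumed, as in the question):
s = df['PM2.5']
mask = s.isna()
gap_id = s.notna().cumsum()[mask]      # same label for every consecutive NA in one gap

first_missing = gap_id.index.to_series().groupby(gap_id.values).first()   # timestamp of the first NA per gap
gap_size = gap_id.groupby(gap_id.values).size()                           # number of consecutive NAs per gap

gaps = pd.DataFrame({'gap size': gap_size.values}, index=first_missing.values)
Single-NA gaps could then be handled case by case by restricting attention to the rows where gaps['gap size'] == 1.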
I need to assign a number to each value based on its hour. How can I add a new column in which each cell is grouped hourly? For instance, all transactions from 00:00:00 to 00:59:59 should be filled with 1, transactions from 01:00:00 to 01:59:59 with 2, and so on up to 23:00:00 to 23:59:59 with 24.
Time_duration = df['period']
print (Time_duration)
0 23:59:56
1 23:59:56
2 23:59:55
3 23:59:53
4 23:59:52
...
74187 00:00:18
74188 00:00:09
74189 00:00:08
74190 00:00:03
74191 00:00:02
# This is the result I desire: each row is labelled with its hourly group, from 1 (00:00:00 to 00:59:59) up to 24 (23:00:00 to 23:59:59).
0 23:59:56 24
1 23:59:56 24
2 23:59:55 24
3 23:59:53 24
4 23:59:52 24
...
74187 00:00:18 1
74188 00:00:09 1
74189 00:00:08 1
74190 00:00:03 1
74191 00:00:02 1
df = df.sort_values(by=["period"])  # sort by time (optional for this task)
timeStamp_list = pd.to_datetime(list(df['period']))
df['Hour'] = timeStamp_list.hour + 1  # +1 so that 00:xx maps to 1 and 23:xx maps to 24
Try this code, it works for me.
You can use regular expressions and str.extract:
import pandas as pd

pattern = r'^(\d{1,2}):'  # capture the digits of the hour
df['hour'] = df['period'].str.extract(pattern).astype('int') + 1  # cast to int so that you can add 1
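As a hedged alternative sketch (assuming period holds HH:MM:SS strings, as in the printout above), the hour can also be taken from a timedelta without regular expressions:
import pandas as pd

# parse the durations and take the hour component, shifted so 00:xx -> 1 and 23:xx -> 24
df['hour'] = pd.to_timedelta(df['period']).dt.components.hours + 1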
Can I obtain the covered timespans of groups resulting from groupby operations without using my own lambda function?
Currently I have the below solution but I am wondering if the pandas API not already has this built-in somehow?
To describe what I'm doing in the data prep part: My task is to find out when and especially for how long the boolean flag is True. I found that ndimage.label-ing is an efficient way to deal with non-contiguous data blocks. But I am open to any other cool suggestions!
import pandas as pd
from numpy.random import randn
from scipy.ndimage import label
# data preparation
idx = pd.date_range(start='now', periods=100, freq='min')
df = pd.DataFrame(randn(100), index=idx, columns=['data'])
df['mybool'] = df.data > 0
df['label'] = label(df.mybool)[0]
# my actual question:
df.groupby('label').apply(lambda x:x.index[-1] - x.index[0])
Basically, I subtract the last timestamp from the first for each group.
This results in:
label
0 01:37:00
1 00:00:00
2 00:01:00
3 00:01:00
4 00:01:00
5 00:02:00
6 00:00:00
7 00:10:00
8 00:00:00
9 00:01:00
10 00:02:00
11 00:00:00
12 00:01:00
13 00:04:00
14 00:02:00
15 00:01:00
16 00:00:00
17 00:00:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:01:00
22 00:02:00
23 00:00:00
24 00:00:00
dtype: timedelta64[ns]
To reiterate my question: Does the pandas API offer a trick that does the same without applying a lambda function, or maybe even without grouping first?
Try it like this:
In [11]: df
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2013-10-25 00:45:49 to 2013-10-25 02:24:49
Freq: T
Data columns (total 3 columns):
data 100 non-null values
mybool 100 non-null values
label 100 non-null values
dtypes: bool(1), float64(1), int32(1)
In [12]: df['date'] = df.index
In [14]: g = df.groupby('label')['date']
In [15]: g.last()-g.first()
Out[15]:
label
0 01:39:00
1 00:03:00
2 00:00:00
3 00:04:00
4 00:02:00
5 00:00:00
6 00:01:00
7 00:02:00
8 00:08:00
9 00:00:00
10 00:00:00
11 00:06:00
12 00:07:00
13 00:00:00
14 00:00:00
15 00:04:00
16 00:00:00
17 00:01:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:00:00
22 00:00:00
Name: date, dtype: timedelta64[ns]
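For what it's worth, a minimal sketch of the same idea without the helper 'date' column, grouping the index itself (max/min coincide with last/first here because the index is sorted):
# group the DatetimeIndex directly by the label column
s = df.index.to_series()
g = s.groupby(df['label'].values)
spans = g.max() - g.min()   # covered timespan per label, dtype timedelta64[ns]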