Slice data by ID and datetime index - python

I have a dataframe, x_train, with three variables, a datetime index that takes a reading every 5 minutes, and an ID column:
x_train
Time ID var_1 var_2 var_3
2020-01-01 00:00:00 1 9.3 4.2 2.4
2020-01-02 00:00:05 1 3.5 4.5 7.6
2020-01-01 00:00:00 2 2.1 7.6 4.5
2020-01-02 00:00:05 2 3.9 7.5 7.0
and a second dataframe, y_train, with labels for the mode each ID is in:
y_train
Time ID mode label
2020-01-01 00:00:00 1 1 B
2020-01-02 00:00:05 1 1 B
2020-01-01 00:00:00 2 0 A
2020-01-02 00:00:05 2 0 A
I want to slice the data by ID and time with a step size of 1 day (288 rows), since this data is time-series dependent. So far I've managed to split the data by ID using groupby; however, I'm not sure how to apply the time slicing.
Here's what I've tried:
FEATURE_COLUMNS = X_train.columns.to_list()
sequences = []
for Id, group in X_train.groupby("ID"):
    sequence_features = group[FEATURE_COLUMNS]
    label = y_train[y_train.ID == Id].iloc[0].label
    sequences.append((sequence_features, label))
This gives me a slice for each ID, but not the time slices:
( ID var_1 var_2 var_3
Time
2016-01-09 01:55:00 2 0.402679 0.588398 0.560771
2016-03-22 11:40:00 2 0.382457 0.507188 0.450901
2016-02-29 09:40:00 2 0.344540 0.652963 0.607460
2016-01-06 01:00:00 2 0.384479 0.825977 0.499619
2016-01-19 18:10:00 2 0.437563 0.631526 0.479827
... ... ... ... ...
2016-01-10 23:30:00 2 0.366026 0.829760 0.636387
2016-01-22 18:25:00 2 0.976997 0.350567 0.674448
2016-01-28 06:30:00 2 0.975986 0.719546 0.727988
2016-02-27 04:15:00 2 0.451972 0.674149 0.470185
2016-03-10 19:15:00 2 0.354146 0.423203 0.487947
[17673 rows x 4 columns],
'b')
I feel I need to add a line that tells the loop to only look at 288 rows per ID at a time, but I'm not sure how to execute it.
Edit: my sliced output also reorders the datetime index in a strange way; is there a way to fix this?
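A minimal sketch of one way to do this (assuming each ID has complete days of 5-minute readings; names follow the question's code): sort each ID's rows by the datetime index, then cut them into fixed 288-row windows.
sequences = []
for Id, group in X_train.groupby("ID"):
    group = group.sort_index()  # keep the datetime index in chronological order
    label = y_train.loc[y_train.ID == Id, "label"].iloc[0]
    for start in range(0, len(group), 288):
        window = group.iloc[start:start + 288]  # one day of 5-minute readings
        if len(window) == 288:                  # skip a trailing partial day
            sequences.append((window[FEATURE_COLUMNS], label))
Sorting inside the loop should also address the out-of-order datetime index mentioned in the edit.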

Related

pandas consecutive Boolean event rollup time series

Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, but if there is no 0 between events, it's the same event! Hopefully that makes sense with what I was describing above with the <---- Count as same event! markers.
If I do:
df = df.resample('H').sum()
This will just resample and count all events, regardless of the consecutive-event behaviour I was trying to highlight with the <---- Count as same event! markers.
Thanks for any tips!!
Check if the current row ("2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not equal 1. This excludes consecutive 1s from the sum.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
                    closed='left', freq='T')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
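Note that x.shift() inside resample().apply() only looks within each hourly bin, so an event that runs across an hour boundary is counted once in each hour it touches. If that matters, a small variation (a sketch, not part of the original answer) is to flag event starts on the full series first and only then resample:
starts = df.eq(1) & df.shift().ne(1)   # True only on the first row of each run of 1s
hourly = starts.resample('H').sum()    # each event is counted once, in the hour it starts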

replace values greater than 0 in a range of time in pandas dataframe

I have a large csv file in which I want to replace values with zero in a particular range of time. For example, between 20:00:00 and 05:00:00 I want to replace all values greater than zero with 0. How do I do it?
dff = pd.read_csv('108e.csv', header=None)  # reading the data set
data = dff.copy()
df = pd.DataFrame(data)
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
for i in df.set_index('timeStamp').between_time('20:00:00', '05:00:00')['luminosity']:
    if i > 0:
        df[['luminosity']] = df[["luminosity"]].replace({i: 0})
You can use the select function from NumPy.
import numpy as np
# The 20:00-05:00 window wraps past midnight, so the two bounds are combined with OR.
night = (df['timeStamp'].dt.time >= pd.to_datetime('20:00:00').time()) | (df['timeStamp'].dt.time <= pd.to_datetime('05:00:00').time())
df['luminosity'] = np.select([night & (df['luminosity'] > 0)], [0], default=df['luminosity'])
See the official NumPy documentation for numpy.select for more examples and details.
Assume that your DataFrame contains:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 11
2 2020-01-02 22:00:00 12
3 2020-01-03 02:00:00 13
4 2020-01-03 05:00:00 14
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 17
8 2020-01-03 22:10:00 18
9 2020-01-04 02:10:00 19
10 2020-01-04 05:00:00 20
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
To only retrieve rows in the time range of interest you could run:
df.set_index('timeStamp').between_time('20:00' , '05:00')
But if you attempted to modify these data, e.g.
df = df.set_index('timeStamp')
df.between_time('20:00' , '05:00')['luminosity'] = 0
you would get SettingWithCopyWarning. The reason is that this function
returns a view of the original data.
To circumvent this limitation, you can use indexer_between_time on the index of a DataFrame, which returns a NumPy array of the locations of rows meeting your time-range criterion.
To update the underlying data, setting the index only to get the row positions, you can run:
df.iloc[df.set_index('timeStamp').index
          .indexer_between_time('20:00', '05:00'), 1] = 0
Note that to keep the code short, I passed the int location of the column
of interest.
Access by iloc should be quite fast.
When you print the df again, the result is:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 0
2 2020-01-02 22:00:00 0
3 2020-01-03 02:00:00 0
4 2020-01-03 05:00:00 0
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 0
8 2020-01-03 22:10:00 0
9 2020-01-04 02:10:00 0
10 2020-01-04 05:00:00 0
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
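A small variation (my own assumption, not part of the original answer): the column position can be looked up by name, so the snippet does not depend on luminosity being column 1:
col = df.columns.get_loc('luminosity')
rows = df.set_index('timeStamp').index.indexer_between_time('20:00', '05:00')
df.iloc[rows, col] = 0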

How can I get different statistics for a rolling datetime range up to a current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column defined as the last N months. The window of time to look at, however, is filtered by the value in column_a.
Code example using a for loop, which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of rows 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
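One vectorised direction worth trying (a sketch under the assumption that a 180-day window stands in for "last N months"; column names are taken from the question) is a time-based rolling mean per group, with the window closed on the left so the current row is excluded:
import pandas as pd

# Index by the datetime column so a time-offset window can be used, then roll per group.
df = df.sort_values('datetime_column').set_index('datetime_column')
df['rolling_mean_b'] = (
    df.groupby('column_a')['column_b']
      .transform(lambda s: s.rolling('180D', closed='left').mean())
)
This avoids the per-row Python loop, which is what makes the ~2M-row case slow.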

How to get the time difference for specific rows included in one column of data using python

Here I have a dataset with time and three inputs. Here I calculate the time difference using pandas.
code is :
data['Time_different'] = pd.to_timedelta(data['time'].astype(str)).diff(-1).dt.total_seconds().div(60)
This reads the time difference for every row. But I want to write code that finds the time difference only for the specific rows that have X3 values.
I tried to write the code using a for loop, but it's not working properly. Can we write the code without using a for loop?
As you can see in my image I have three inputs, X1, X2, X3. When I used that code it showed the time difference across X1, X2, X3.
What I want is to get the time difference only for the X3 inputs that have values.
time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20
Here I want to skip the times where X3 is 0 and read only the time difference between the rows where X3 has a value.
time x3
7:00:00 2 (has a value)
9:00:00 50
So the time difference is 2 hrs.
Then second:
9:00:00 50
19:00:00 20
Then the time difference is 10 hrs.
Likewise, I want to write the code for my whole column. Can anyone help me solve this?
When I tried writing the code, I got time differences with negative values.
You can try to:
Find the rows where X3 is different from 0
Compute the difference in hours using shift
Update the dataframe using join
Full example:
data = """time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20"""
# Build dataframe from example
from io import StringIO
import numpy as np
import pandas as pd

df = pd.read_csv(StringIO(data), sep=r'\s{1,}')
df['X1'] = np.random.randint(0, 10, len(df))  # Add random values for "X1" column
df['X2'] = np.random.randint(0, 10, len(df))  # Add random values for "X2" column
# Convert the time column to datetime objects
df.time = pd.to_datetime(df.time, format="%H:%M:%S")
print(df)
# time X3 X1 X2
# 0 1900-01-01 06:00:00 0 5 4
# 1 1900-01-01 07:00:00 2 7 1
# 2 1900-01-01 08:00:00 0 2 8
# 3 1900-01-01 09:00:00 50 1 0
# 4 1900-01-01 10:00:00 0 3 9
# 5 1900-01-01 11:00:00 0 8 4
# 6 1900-01-01 12:00:00 0 0 2
# 7 1900-01-01 13:45:00 0 5 0
# 8 1900-01-01 15:00:00 0 5 7
# 9 1900-01-01 16:00:00 0 0 8
# 10 1900-01-01 17:00:00 0 6 7
# 11 1900-01-01 18:00:00 0 1 5
# 12 1900-01-01 19:00:00 20 4 7
# Compute difference
sub_df = df[df.X3 != 0]
out_values = (sub_df.time.dt.hour - sub_df.shift().time.dt.hour) \
    .to_frame() \
    .fillna(sub_df.time.dt.hour.iloc[0]) \
    .rename(columns={'time': 'out'})  # Rename column
print(out_values)
# out
# 1 7.0
# 3 2.0
# 12 10.0
df = df.join(out_values) # Add out values
print(df)
# time X3 X1 X2 out
# 0 1900-01-01 06:00:00 0 2 9 NaN
# 1 1900-01-01 07:00:00 2 7 4 7.0
# 2 1900-01-01 08:00:00 0 6 6 NaN
# 3 1900-01-01 09:00:00 50 9 1 2.0
# 4 1900-01-01 10:00:00 0 2 9 NaN
# 5 1900-01-01 11:00:00 0 5 3 NaN
# 6 1900-01-01 12:00:00 0 6 4 NaN
# 7 1900-01-01 13:45:00 0 9 3 NaN
# 8 1900-01-01 15:00:00 0 3 0 NaN
# 9 1900-01-01 16:00:00 0 1 8 NaN
# 10 1900-01-01 17:00:00 0 7 5 NaN
# 11 1900-01-01 18:00:00 0 6 7 NaN
# 12 1900-01-01 19:00:00 20 1 5 10.0
Here I use .fillna(sub_df.time.dt.hour.iloc[0]) to replace the first value with its own hour (since there is no previous non-zero row to subtract from it). You can define your own rule for the value in fillna().
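A variation that also works for timestamps that are not on whole hours (a sketch using the same df as above, not part of the original answer) is to take the timedelta difference on the filtered rows directly:
# Difference in hours between consecutive non-zero X3 readings; aligns back on the index.
sub_df = df[df.X3 != 0]
df['out'] = sub_df.time.diff().dt.total_seconds().div(3600)
Here the first non-zero row gets NaN instead of its own hour, and gaps that are not whole hours come out as fractional values rather than an hour-only subtraction.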

Pandas - groupby continuous datetime periods

I have a pandas dataframe that looks like this:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
I would like to groupby on KEY and sum on VALUE but only on continuous periods of time. For instance in the above example I would like to get:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-01 5.0
2 B 2017-01-01 2017-02-10 6.0
There are two groups for A since there is a gap in the time periods.
I would like to avoid for loops since the dataframe has tens of millions of rows.
Create a helper Series by comparing the shifted START column per group, and use it for the groupby:
s = df.loc[df.groupby('KEY')['START'].shift(-1) == df['END'], 'END']
s = s.combine_first(df['START'])
print (s)
0 2017-01-01
1 2017-01-23
2 2017-01-23
3 2017-02-02
4 2017-02-02
Name: END, dtype: datetime64[ns]
df = df.groupby(['KEY', s], as_index=False).agg({'START':'first','END':'last','VALUE':'sum'})
print (df)
KEY VALUE START END
0 A 2.1 2017-01-01 2017-01-16
1 A 5.0 2017-01-28 2017-03-01
2 B 6.0 2017-01-01 2017-02-10
The answer from jezrael works like a charm if there are only two consecutive rows to aggregate. In the new example, it would not aggregate the last three rows for KEY = A.
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 B 2017-01-01 2017-01-23 4.3
2 B 2017-01-23 2017-02-10 1.7
3 A 2017-01-28 2017-02-02 4.2
4 A 2017-02-02 2017-03-01 0.8
5 A 2017-03-01 2017-03-23 1.0
The following solution (slight modification of jezrael's solution) enables to aggregate all rows that should be aggregated:
df = df.sort_values(by='START')
idx = df.groupby('KEY')['START'].shift(-1) != df['END']
df['DATE'] = df.loc[idx, 'START']
df['DATE'] = df.groupby('KEY').DATE.fillna(method='backfill')
df = (df.groupby(['KEY', 'DATE'], as_index=False)
        .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
        .drop(['DATE'], axis=1))
Which gives:
KEY START END VALUE
0 A 2017-01-01 2017-01-16 2.1
1 A 2017-01-28 2017-03-23 6.0
2 B 2017-01-01 2017-02-10 6.0
Thanks @jezrael for the elegant approach!
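For reference, the same grouping can also be written with a cumulative sum of the "new block" flag, which handles runs of any length (a sketch, assuming START/END are datetimes and the frame is sorted as above; the block column name is mine):
df = df.sort_values(['KEY', 'START'])
# A new block starts whenever START does not continue the previous row's END within the same KEY.
df['block'] = (df['START'].ne(df.groupby('KEY')['END'].shift())
                          .groupby(df['KEY']).cumsum())
out = (df.groupby(['KEY', 'block'], as_index=False)
         .agg({'START': 'first', 'END': 'last', 'VALUE': 'sum'})
         .drop(columns='block'))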
