pandas consecutive Boolean event rollup time series - python

Here's some made up time series data on 1 minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)  # seed NumPy's generator; random.seed() does not affect np.random.rand
rows,cols = 8760,3
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up the number of events (True or 1) per hour, where consecutive 1s with no 0 between them count as a single event. Hopefully the <---- Count as same event! markers above make clear what I mean.
If I do:
df = df.resample('H').sum()
This will just resample and count every 1, right? It ignores the consecutive-run rule I was trying to highlight with the <---- Count as same event! markers.
Thanks for any tips!!

Check whether the current row (for example "2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not. Only the first row of each run of consecutive 1s passes this test, so each run adds 1 to the sum instead of its full length.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
                    inclusive='left', freq='T')  # pandas < 1.4: use closed='left'
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
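Because the lambda above runs separately on each hourly group, a run of 1s that crosses an hour boundary is counted once in each hour it touches. If such a run should count only in the hour where it starts, one option is to flag event starts on the whole frame first and resample afterwards. A minimal sketch on hypothetical toy data (the names toy, starts and events_per_hour are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical toy data: one 0/1 column at 1-minute resolution over two hours.
idx = pd.date_range('2019-01-01', periods=120, freq='min')
toy = pd.DataFrame({'c1': 0}, index=idx)
toy.iloc[1:3, 0] = 1    # 00:01-00:02 -> one event
toy.iloc[5, 0] = 1      # 00:05       -> a second event
toy.iloc[58:62, 0] = 1  # 00:58-01:01 -> one event crossing the hour boundary

# A row starts an event when it is 1 and the previous row is not 1.
# fill_value=0 treats the row before the first sample as "no event".
starts = toy.eq(1) & toy.shift(fill_value=0).ne(1)

# Count event starts per hour.
events_per_hour = starts.resample('h').sum()
print(events_per_hour)
```

Here the first hour has 3 event starts and the second hour has 0, whereas the per-group lambda would report 1 for the second hour.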

Related

How to number timestamps that comes under particular duration of time in dataframe

If we divide a day, from 00:00:00 to 23:59:00, into 15-minute blocks, we get 96 blocks, which we can number from 0 to 95.
I want to add a "timeblock" column to the dataframe that numbers each row with the block its timestamp falls in, as shown below.
tagdatetime tagvalue timeblock
2020-01-01 00:00:00 47.874423 0
2020-01-01 00:01:00 14.913561 0
2020-01-01 00:02:00 56.368034 0
2020-01-01 00:03:00 16.555687 0
2020-01-01 00:04:00 42.138176 0
... ... ...
2020-01-01 00:13:00 47.874423 0
2020-01-01 00:14:00 14.913561 0
2020-01-01 00:15:00 56.368034 0
2020-01-01 00:16:00 16.555687 1
2020-01-01 00:17:00 42.138176 1
... ... ...
2020-01-01 23:55:00 18.550685 95
2020-01-01 23:56:00 51.219147 95
2020-01-01 23:57:00 15.098951 95
2020-01-01 23:58:00 37.863191 95
2020-01-01 23:59:00 51.380950 95
There is probably a better way, but the approach below works: build a table of the 96 block start times, merge it onto the data, and forward-fill the block number.
import pandas as pd
import numpy as np
tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
tvalue = np.random.randint(1,50, (1440,))
df = pd.DataFrame({'tagdatetime':tindex, 'tagvalue':tvalue})
min15 = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='15min')
tblock = np.arange(96)
df2 = pd.DataFrame({'min15':min15, 'timeblock':tblock})
df3 = pd.merge(df, df2, left_on='tagdatetime', right_on='min15', how='outer')
df3.ffill(axis=0, inplace=True)
df3 = df3.drop('min15', axis=1)
df3.iloc[10:20,]
tagdatetime tagvalue timeblock
10 2020-01-01 00:10:00 20 0.0
11 2020-01-01 00:11:00 25 0.0
12 2020-01-01 00:12:00 42 0.0
13 2020-01-01 00:13:00 45 0.0
14 2020-01-01 00:14:00 11 0.0
15 2020-01-01 00:15:00 15 1.0
16 2020-01-01 00:16:00 38 1.0
17 2020-01-01 00:17:00 23 1.0
18 2020-01-01 00:18:00 5 1.0
19 2020-01-01 00:19:00 32 1.0
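The block number can also be computed directly from the timestamp, without the merge and forward-fill. A minimal sketch, assuming the same tagdatetime column as above; the block size of 15 minutes is taken from the question and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

tindex = pd.date_range('2020-01-01 00:00:00', '2020-01-01 23:59:00', freq='min')
df = pd.DataFrame({'tagdatetime': tindex,
                   'tagvalue': np.random.randint(1, 50, len(tindex))})

# Minutes elapsed since midnight, then integer-divide by the block length.
minutes_of_day = df['tagdatetime'].dt.hour * 60 + df['tagdatetime'].dt.minute
df['timeblock'] = minutes_of_day // 15

print(df.iloc[10:20])
```

This keeps timeblock as an integer column and depends only on the time of day, so it also works when the data spans several days.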

How to get the time difference between specific rows in one column using python

Here I have a dataset with a time column and three inputs, and I calculate the time difference using pandas.
The code is:
data['Time_different'] = pd.to_timedelta(data['time'].astype(str)).diff(-1).dt.total_seconds().div(60)
This computes the time difference between every pair of consecutive rows. But I want the time difference only between the rows where X3 has a value.
I tried to write it with a for loop, but it is not working properly. Can this be done without a for loop?
As you can see in my image, I have three inputs: X1, X2 and X3. The code above gives the time difference across all rows, regardless of X1, X2 and X3.
What I want is the time difference between the rows where X3 has a nonzero value.
time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20
Here I want to skip the rows where X3 is 0 and compute the time difference only between rows where X3 has a value. First:
time X3
7:00:00 2 (has a value)
9:00:00 50
So the time difference is 2 hrs.
Then second:
9:00:00 50
19:00:00 20
Then the time difference is 10 hrs.
Likewise, I want to write the code for my whole column. Can anyone help me solve this?
When I run my code, I get negative values for the time difference.
You can try to:
1. Find the rows where X3 is different from 0
2. Compute the difference in hours using shift
3. Update the dataframe using join
Full example:
from io import StringIO
import numpy as np
import pandas as pd

data = """time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20"""
# Build dataframe from example
df = pd.read_csv(StringIO(data), sep=r'\s+')  # columns are separated by whitespace
df['X1'] = np.random.randint(0,10,len(df)) # Add random values for "X1" column
df['X2'] = np.random.randint(0,10,len(df)) # Add random values for "X2" column
# Convert the time column to datetime object
df.time = pd.to_datetime(df.time, format="%H:%M:%S")
print(df)
# time X3 X1 X2
# 0 1900-01-01 06:00:00 0 5 4
# 1 1900-01-01 07:00:00 2 7 1
# 2 1900-01-01 08:00:00 0 2 8
# 3 1900-01-01 09:00:00 50 1 0
# 4 1900-01-01 10:00:00 0 3 9
# 5 1900-01-01 11:00:00 0 8 4
# 6 1900-01-01 12:00:00 0 0 2
# 7 1900-01-01 13:45:00 0 5 0
# 8 1900-01-01 15:00:00 0 5 7
# 9 1900-01-01 16:00:00 0 0 8
# 10 1900-01-01 17:00:00 0 6 7
# 11 1900-01-01 18:00:00 0 1 5
# 12 1900-01-01 19:00:00 20 4 7
# Compute difference
sub_df = df[df.X3 != 0]
out_values = (sub_df.time.dt.hour - sub_df.shift().time.dt.hour) \
    .to_frame() \
    .fillna(sub_df.time.dt.hour.iloc[0]) \
    .rename(columns={'time': 'out'}) # Rename column
print(out_values)
# out
# 1 7.0
# 3 2.0
# 12 10.0
df = df.join(out_values) # Add out values
print(df)
# time X3 X1 X2 out
# 0 1900-01-01 06:00:00 0 2 9 NaN
# 1 1900-01-01 07:00:00 2 7 4 7.0
# 2 1900-01-01 08:00:00 0 6 6 NaN
# 3 1900-01-01 09:00:00 50 9 1 2.0
# 4 1900-01-01 10:00:00 0 2 9 NaN
# 5 1900-01-01 11:00:00 0 5 3 NaN
# 6 1900-01-01 12:00:00 0 6 4 NaN
# 7 1900-01-01 13:45:00 0 9 3 NaN
# 8 1900-01-01 15:00:00 0 3 0 NaN
# 9 1900-01-01 16:00:00 0 1 8 NaN
# 10 1900-01-01 17:00:00 0 7 5 NaN
# 11 1900-01-01 18:00:00 0 6 7 NaN
# 12 1900-01-01 19:00:00 20 1 5 10.0
Here I use .fillna(sub_df.time.dt.hour.iloc[0]) to replace the first value, which has no previous row to subtract, with its own hour (as if subtracting 0). You can define your own rule for the value passed to fillna().
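The example above subtracts whole hours with .dt.hour, which drops the minutes (a row at 13:45 would count as hour 13). A minimal sketch that keeps the minutes by working with timedeltas, assuming the times parse with pd.to_timedelta as in the question; it uses a shortened copy of the sample data and a hypothetical column name hours_since_prev:

```python
import pandas as pd

# Shortened copy of the sample data from the question.
df = pd.DataFrame({
    'time': pd.to_timedelta(['6:00:00', '7:00:00', '8:00:00',
                             '9:00:00', '13:45:00', '19:00:00']),
    'X3': [0, 2, 0, 50, 0, 20],
})

# Keep only the times of rows where X3 is nonzero.
nonzero_times = df.loc[df['X3'] != 0, 'time']

# Difference to the previous nonzero row, in hours. Assigning back aligns
# on the index, so rows where X3 is 0 get NaN.
df['hours_since_prev'] = nonzero_times.diff().dt.total_seconds().div(3600)

print(df)
```

The first nonzero row has no predecessor and stays NaN here; a fillna() rule like the one above could be applied if a different value is wanted.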

Is there a way to create relational pandas dataframes?

I am struggling to get my pandas df into the format I require due to incorrectly populating a bit masked dataframe.
I have a number of data frames:
plot_d1_sw1 - this is a read from a .csv
timestamp switchID deviceID count
0 2019-05-01 07:00:00 1 GTEC122277 1
1 2019-05-01 08:00:00 1 GTEC122277 1
3 2019-05-01 10:00:00 1 GTEC122277 3
d1_sw1 - this is the last 12 hours and a conditional as to whether the data appears in filt
timestamp num
0 2019-05-01 12:00:00 False
1 2019-05-01 11:00:00 False
2 2019-05-01 10:00:00 True
3 2019-05-01 09:00:00 False
4 2019-05-01 08:00:00 True
5 2019-05-01 07:00:00 True
6 2019-05-01 06:00:00 False
7 2019-05-01 05:00:00 False
8 2019-05-01 04:00:00 False
9 2019-05-01 03:00:00 False
10 2019-05-01 02:00:00 False
11 2019-05-01 01:00:00 False
I have tried masking this and pulling the count column through into the True values using the following:
mask_d1_sw1 = d1_sw1.num == False
d1_sw1.loc[mask_d1_sw1, column_name] = 0
i=0
for row in plot_d1_sw1.itertuples():
    mask_d1_sw1 = d1_sw1.num == True
    d1_sw1.loc[mask_d1_sw1, column_name] = plot_d1_sw1['count'].values[i]
    print(d1_sw1)
    i = i + 1
this gives me:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 3
5 2019-05-01 07:00:00 3
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
... I know that this is because I am looping through the count column of plot_d1_sw1 but I cannot for the life of me work out how to logically fill this to get the outcome:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 1
5 2019-05-01 07:00:00 1
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
How can I achieve this outcome?
One way is to merge on the timestamp and then multiply the boolean values with count:
df = d1_sw1.merge(plot_d1_sw1, how='left', on='timestamp')
df['num'] = df.num.mul(df['count'].fillna(0)).astype(int)
df[['timestamp', 'num']]
Which gives:
timestamp num
0 2019-05-01-12:00:00 0
1 2019-05-01-11:00:00 0
2 2019-05-01-10:00:00 3
3 2019-05-01-09:00:00 0
4 2019-05-01-08:00:00 1
5 2019-05-01-07:00:00 1
6 2019-05-01-06:00:00 0
7 2019-05-01-05:00:00 0
8 2019-05-01-04:00:00 0
9 2019-05-01-03:00:00 0
10 2019-05-01-02:00:00 0
11 2019-05-01-01:00:00 0
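A lookup with Series.map gives the same result without a merge. A minimal sketch on data rebuilt from the question, assuming each timestamp appears at most once in the counts table and that num is True exactly when the timestamp is present there; the names counts_df, hours_df and counts are illustrative:

```python
import pandas as pd

# Counts per hour, as in plot_d1_sw1 (only the columns needed here).
counts_df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-05-01 07:00:00',
                                 '2019-05-01 08:00:00',
                                 '2019-05-01 10:00:00']),
    'count': [1, 1, 3],
})

# The last 12 hours, newest first, as in d1_sw1.
hours_df = pd.DataFrame({
    'timestamp': pd.date_range('2019-05-01 01:00:00', periods=12, freq='h')[::-1]
})

# Look up each hour in the counts table; hours with no entry become 0.
counts = counts_df.set_index('timestamp')['count']
hours_df['num'] = hours_df['timestamp'].map(counts).fillna(0).astype(int)

print(hours_df)
```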

Conditional selection before certain time of day - Pandas dataframe

I have the above dataframe (snippet) and want to create a new dataframe that keeps only the rows timestamped before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
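If you need to exclude 15:00:00, add the parameter include_end=False (pandas 1.4 and later replace it with inclusive='left'):
df = df.set_index('Date').between_time('00:00:00', '15:00:00', include_end=False)
# pandas >= 1.4: between_time('00:00:00', '15:00:00', inclusive='left')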
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
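Comparing hours works when the cutoff falls on a whole hour. For a cutoff such as 15:30, one option is to compare the time-of-day part directly. A minimal sketch, assuming a datetime column named Date as in the first answer; cutoff and before_cutoff are illustrative names:

```python
from datetime import time

import pandas as pd

rng = pd.date_range(pd.to_datetime('2015-02-24 11:00'), periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})

# Keep rows whose time of day is strictly before the cutoff.
cutoff = time(15, 0)
before_cutoff = df[df['Date'].dt.time < cutoff]

print(before_cutoff)
```

This leaves Date as a regular column, so no set_index is needed.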

How to merge two dataframes based on the closest (or most recent) timestamp

Suppose I have a dataframe df1, with columns 'A' and 'B'. A is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.
Suppose I also have a dataframe df2 with columns 'C' and 'D'. C is also a unixtime column and D is a column containing some other values.
I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely don't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.
pd.merge does not support this, and I find myself converting away from dataframes using to_dict(), and using some iteration to solve this. Is there a way in pandas to solve this?
numpy.searchsorted() finds the appropriate index positions to merge on (see docs). Hopefully the example below gets you closer to what you're looking for:
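from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd

start = datetime(2015, 12, 1)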
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
A B C D
0 2015-12-01 00:01:00 1 NaT NaN
1 2015-12-01 00:02:00 1 2015-12-01 00:02:00 2
2 2015-12-01 00:02:00 1 NaT NaN
3 2015-12-01 00:12:00 1 2015-12-01 00:05:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
4 2015-12-01 00:16:00 1 2015-12-01 00:14:00 2
5 2015-12-01 00:28:00 1 2015-12-01 00:22:00 2
6 2015-12-01 00:30:00 1 NaT NaN
7 2015-12-01 00:39:00 1 2015-12-01 00:31:00 2
7 2015-12-01 00:39:00 1 2015-12-01 00:39:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:40:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:46:00 2
8 2015-12-01 00:55:00 1 2015-12-01 00:54:00 2
9 2015-12-01 00:57:00 1 NaT NaN
Building on @Stephan's answer and @JohnE's comment, something similar can be done with pandas.merge_asof for pandas>=0.19.0:
>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
A B
0 2015-12-01 00:00:00 0
1 2015-12-01 00:30:00 1
2 2015-12-01 01:00:00 2
3 2015-12-01 01:30:00 3
4 2015-12-01 02:00:00 4
5 2015-12-01 02:30:00 5
6 2015-12-01 03:00:00 6
7 2015-12-01 03:30:00 7
8 2015-12-01 04:00:00 8
9 2015-12-01 04:30:00 9
>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
C D
0 2015-12-01 00:00:00 10
1 2015-12-01 01:00:00 11
2 2015-12-01 02:00:00 12
3 2015-12-01 03:00:00 13
4 2015-12-01 04:00:00 14
5 2015-12-01 05:00:00 15
6 2015-12-01 06:00:00 16
7 2015-12-01 07:00:00 17
8 2015-12-01 08:00:00 18
9 2015-12-01 09:00:00 19
>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')
A B C D
0 2015-12-01 00:00:00 0 2015-12-01 00:00:00 10
1 2015-12-01 00:30:00 1 2015-12-01 00:00:00 10
2 2015-12-01 01:00:00 2 2015-12-01 01:00:00 11
3 2015-12-01 01:30:00 3 2015-12-01 01:00:00 11
4 2015-12-01 02:00:00 4 2015-12-01 02:00:00 12
5 2015-12-01 02:30:00 5 2015-12-01 02:00:00 12
6 2015-12-01 03:00:00 6 2015-12-01 03:00:00 13
7 2015-12-01 03:30:00 7 2015-12-01 03:00:00 13
8 2015-12-01 04:00:00 8 2015-12-01 04:00:00 14
9 2015-12-01 04:30:00 9 2015-12-01 04:00:00 14
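merge_asof matches backward by default, which is the "closest entry before" behaviour the question asks for, and its tolerance parameter limits how far back a match may be. A minimal sketch with integer unixtime-style keys; the values and the tolerance of 10 are made up for illustration, and both frames must be sorted by their key columns:

```python
import pandas as pd

# Hypothetical integer timestamps (e.g. unixtime seconds), already sorted.
df1 = pd.DataFrame({'A': [10, 25, 40], 'B': ['x', 'y', 'z']})
df2 = pd.DataFrame({'C': [5, 20, 26], 'D': [1.0, 2.0, 3.0]})

# For each A, take the last row of df2 with C <= A, but only if it is
# at most 10 units older; otherwise the right-hand columns are NaN.
merged = pd.merge_asof(df1, df2, left_on='A', right_on='C',
                       direction='backward', tolerance=10)

print(merged)
```

Here A=40 finds C=26 as the most recent entry, but the gap of 14 exceeds the tolerance, so that row gets no match. Passing direction='nearest' instead would match on the closest timestamp in either direction.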
