I have a Pandas dataframe in the following format:
id name timestamp time_diff <=30min
1 movie3 2009-05-04 18:00:00+00:00 NaN False
1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True
1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False
2 movie7 2009-05-04 09:30:00+00:00 NaN False
2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False
3 movie1 2009-05-04 17:45:00+00:00 NaN False
3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True
3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True
3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True
4 movie1 2009-05-05 12:45:00+00:00 NaN False
5 movie7 2009-05-04 11:00:00+00:00 NaN False
5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True
The data shows the movies watched on a video streaming platform. id is the user id, name is the name of the movie, and timestamp is the time at which the movie started. <=30min indicates whether the user started the movie within 30 minutes of the previous movie's start time.
A movie session consists of one or more movies played by a single user, where each movie started within 30 minutes of the previous movie's start time (basically, a session is defined by consecutive rows in which df['<=30min'] == True).
The length of a session is defined as the timestamp of the last consecutive row with df['<=30min'] == True minus the timestamp of the first True row of the session.
How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?
As a first step, I have tried something like this:
df.groupby((df['<=30min'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()
But it doesn't work (the cumsum does not reset when df['<=30min'] == False), and it looks very slow.
Also, I think it would make my life harder when I have to select the 3 longest sessions, as I could get multiple values for the same session that might all end up among the longest 3.
Not sure I understood you correctly. If I did, then this may work.
Coerce timestamp to datetime:
df['timestamp'] = pd.to_datetime(df['timestamp'])
Filter to the rows where '<=30min' is True, which indicate a consecutive watch, group by id while calculating the difference between the maximum and minimum timestamp, and join the result back to the main df:
df.join(df[df['<=30min']==True].groupby('id')['timestamp'].transform(lambda x:x.max()-x.min()).to_frame().rename(columns={'timestamp':'Max'}))
id name timestamp time_diff <=30min Max
0 1 movie3 2009-05-04 18:00:00+00:00 NaN False NaT
1 1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True 00:00:00
2 1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False NaT
3 2 movie7 2009-05-04 09:30:00+00:00 NaN False NaT
4 2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False NaT
5 3 movie1 2009-05-04 17:45:00+00:00 NaN False NaT
6 3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True 00:45:00
7 3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True 00:45:00
8 3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True 00:45:00
9 4 movie1 2009-05-05 12:45:00+00:00 NaN False NaT
10 5 movie7 2009-05-04 11:00:00+00:00 NaN False NaT
11 5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True 00:00:00
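If the goal is specifically per-session lengths (rather than per-id), a possible sketch that also picks the 3 longest sessions is shown below. It assumes the column names above, with 'timestamp' already datetime and '<=30min' boolean:
# a new session starts at every row where '<=30min' is False (each user's first row included)
df['session'] = (~df['<=30min']).cumsum()
# session length = last start time minus first start time among the True rows of each session
true_rows = df[df['<=30min']]
lengths = true_rows.groupby(['id', 'session'])['timestamp'].agg(lambda s: s.max() - s.min())
# the 3 longest sessions and the movies played during them
top3 = lengths.nlargest(3)
movies = df.set_index(['id', 'session']).loc[top3.index, 'name']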
I have a dataframe with a list of events, a column for an indicator for a criterion, and a column for a timestamp.
For each event, if the indicator is true, I want to see if the event lasted more than one period, and for how long.
In terms of an expected output, I have provided an example below.
For the duration column: A is True for only one time period, so it is coded as 1. Then A is False for the next period, so that is coded as 0. Then A is True for 2 time periods, so the duration is 2; the next entry is coded as 0 since I am only interested in the first entry, and so on.
id target time duration
0 A True 2023-01-22 11:00:00 1
3 A False 2023-01-22 11:05:00 0
6 A True 2023-01-22 11:10:00 2
9 A True 2023-01-22 11:15:00 0
12 A False 2023-01-22 11:20:00 0
But I have no idea how to do this.
A sample dataframe is included below
import pandas as pd
time_test = pd.DataFrame({'id':[
'A','B','C','A','B','C',
'A','B','C','A','B','C',
'A','B','C','A','B','C'],
'target':[
'True','True','True','False','True','True',
'True','False','True','True','True','True',
'False','True','False','True','False','True'],
'time':[
'11:00','11:00','11:00','11:05','11:05','11:05',
'11:10','11:10','11:10','11:15','11:15','11:15',
'11:20','11:20','11:20','11:25','11:25','11:25']})
time_test = time_test.sort_values(['id', 'time'])
time_test['time'] = pd.to_datetime(time_test['time'])
time_test
EDIT: I need to provide some clarification about the expected output
Let's take group B as an example. An event occurs for B at 11:00, indicated by the "True" under target. At 11:05 the event is still occurring, so the duration should be 2 for the row 1 B True 2023-01-22 11:00:00. I am not interested in the row following, so that can be coded as 0. So in a sense, 0 would represent both "already accounted for" and the absence of an event.
At 11:10 that event is not occurring so the summation would re-set.
At 11:15 another event is occurring, and at 11:20 that event is still going, so the value for the first row should be 2.
In the end, the values for B should be 2,0,0,2,0,0.
I can see why this method would be confusing, but I hope my explanation makes sense. My data is in 5-minute chunks, so I figured I could just count the number of chunks to see how long an event lasted, instead of using a start and end time to calculate the elapsed time (but maybe that would be easier?).
Annotated code
# Convert the target column to boolean
mask = time_test['target'].eq('True')
# Create subgroups to identify blocks of consecutive True's
time_test['subgrps'] = (~mask).cumsum()[mask]
# Group the target mask by id and subgrps
g = mask.groupby([time_test['id'], time_test['subgrps']])
# Create a boolean mask to identify dupes per id and subgrps
dupes = time_test.duplicated(subset=['id', 'subgrps'])
# Sum the True value per group and mask the duplicates
time_test['duration'] = g.transform('sum').mask(dupes).fillna(0)
Result
id target time subgrps duration
0 A True 2023-01-22 11:00:00 0.0 1.0
3 A False 2023-01-22 11:05:00 NaN 0.0
6 A True 2023-01-22 11:10:00 1.0 2.0
9 A True 2023-01-22 11:15:00 1.0 0.0
12 A False 2023-01-22 11:20:00 NaN 0.0
15 A True 2023-01-22 11:25:00 2.0 1.0
1 B True 2023-01-22 11:00:00 2.0 2.0
4 B True 2023-01-22 11:05:00 2.0 0.0
7 B False 2023-01-22 11:10:00 NaN 0.0
10 B True 2023-01-22 11:15:00 3.0 2.0
13 B True 2023-01-22 11:20:00 3.0 0.0
16 B False 2023-01-22 11:25:00 NaN 0.0
2 C True 2023-01-22 11:00:00 4.0 4.0
5 C True 2023-01-22 11:05:00 4.0 0.0
8 C True 2023-01-22 11:10:00 4.0 0.0
11 C True 2023-01-22 11:15:00 4.0 0.0
14 C False 2023-01-22 11:20:00 NaN 0.0
17 C True 2023-01-22 11:25:00 5.0 1.0
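If actual minutes are preferred over block counts (the question mentions the data comes in 5-minute chunks), the counts can simply be scaled; a small sketch, with optional cleanup of the helper column:
# convert block counts to minutes, assuming regular 5-minute sampling as stated in the question
time_test['duration_minutes'] = time_test['duration'] * 5
# optionally drop the helper column and cast the counts back to int
time_test = time_test.drop(columns='subgrps')
time_test['duration'] = time_test['duration'].astype(int)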
I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column in the above data frame that shows the time difference between each row's time_stamp value and the time_stamp of the nearest row above it where indicator is not NaN.
Below is how the output should look (time_diff is a timedelta value, but I'll just show subtraction by indices to better illustrate; for example, ( 2 - 1 ) = df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 1 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 1 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last non-NaN entry, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# create an intermediate column to track the last timestamp corresponding to the non-NaN `indicator` value
df['tracking'] = pd.NaT
df.loc[df['indicator'].notna(), 'tracking'] = df['time_stamp']
df['tracking'] = df['tracking'].ffill()
# use that to subtract the value from the `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
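The intermediate column isn't strictly needed; the same result can be obtained in one step (equivalent to the code above):
# keep the timestamp only where indicator is non-NaN, forward-fill it, and subtract
df['time_diff'] = df['time_stamp'] - df['time_stamp'].where(df['indicator'].notna()).ffill()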
What I am trying to do here is run a loop over the rows and, if a condition is satisfied, set a value accordingly. Here, at index 7 it should have shown "Sell", but my 2nd if condition is not working. What am I doing wrong?
for i in range(n_steps, (len(extended_stock_data_new) - 1)):
    if (extended_stock_data_new["Close"][i] <= extended_stock_data_new["Prediction"][i+1]):
        extended_stock_data_new.loc[[i], "Decision"] = "Buy"
        if (extended_stock_data_new["Low"][i+1] < extended_stock_data_new["Prediction"][i+1] <= extended_stock_data_new["High"][i+1]):
            extended_stock_data_new.loc[[i+1], "Decision"] = "Sell"
    else:
        extended_stock_data_new.loc[[i], "Decision"] = "--"
extended_stock_data_new.head(50)
output:
0 2020-01-25 08:00:00 3295.26 3298.26 3291.30 3291.75 NaN NaN
1 2020-01-27 10:00:00 3267.88 3269.01 3253.26 3259.76 NaN NaN
2 2020-01-27 11:00:00 3259.51 3269.51 3258.26 3269.51 NaN NaN
3 2020-01-27 12:00:00 3269.76 3269.76 3265.26 3267.26 NaN NaN
4 2020-01-27 13:00:00 3267.13 3267.26 3258.76 3260.26 NaN NaN
5 2020-01-27 14:00:00 3260.51 3266.76 3260.51 3265.26 NaN NaN
6 2020-01-27 15:00:00 3265.38 3266.01 3262.76 3263.01 3264.800049 Buy
7 2020-01-27 16:00:00 3263.26 3264.26 3260.01 3260.26 3263.800049 Buy
8 2020-01-27 17:00:00 3260.51 3263.13 3259.26 3261.51 3260.699951 Buy
9 2020-01-27 18:00:00 3261.26 3264.01 3259.51 3261.76 3261.600098 Buy
10 2020-01-27 19:00:00 3262.26 3267.26 3257.76 3262.76 3262.100098 Buy
11 2020-01-27 20:00:00 3262.51 3263.01 3250.26 3254.01 3263.300049 Buy
12 2020-01-27 21:00:00 3253.76 3253.76 3240.26 3240.26 3254.800049 Buy
So what's happening is that you are overwriting yourself:
when i == 6:
you assign "Buy" to row i
you assign "Sell" to row i + 1
when i == 7:
you assign "Buy" to row i, overwriting your previous answer.
If you don't want to overwrite yourself, you need to add a check to your first condition to see if a "Decision" value already exists.
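For example, a minimal sketch of that check (assuming a default integer index and the column names from the question):
for i in range(n_steps, len(extended_stock_data_new) - 1):
    # skip rows that were already marked "Sell" on the previous iteration
    if extended_stock_data_new.loc[i, "Decision"] == "Sell":
        continue
    if extended_stock_data_new["Close"][i] <= extended_stock_data_new["Prediction"][i + 1]:
        extended_stock_data_new.loc[i, "Decision"] = "Buy"
        if extended_stock_data_new["Low"][i + 1] < extended_stock_data_new["Prediction"][i + 1] <= extended_stock_data_new["High"][i + 1]:
            extended_stock_data_new.loc[i + 1, "Decision"] = "Sell"
    else:
        extended_stock_data_new.loc[i, "Decision"] = "--"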
I have a dataframe with a datetime index and two columns. I have to find the maximum stretch of null values on a particular date for column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'Flag' that carries a value of 1 for every zero imputation in the other two columns and 0 otherwise. In the example below, on January 1st the maximum stretch of null values is 3 rows, so I have to replace those with zero. Similarly, I have to repeat the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question. Here is the link: Python - Find maximum null values in stretch and replacing with 0
First, for each column, create groups of consecutive NaN values, each labelled with a unique number:
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
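# scale Y's group labels so they differ from X's, keeping stretches in the two columns separate when stacked below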
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get the group with the maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a mask of the matching rows to set them to 0, and for the Flag column cast the mask to integer (True/False to 1/0):
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#alternative that also works if there are no NaNs in the filtered rows (prevents IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
I have a datetime-indexed DataFrame with 65 columns (only 9 shown for clarity), one per sensor, and x rows, one per observation (for the sample data I limited it to 700 rows to illustrate the issue I am having).
demo csv:
https://pastebin.com/mpSgJF94
swp_data = pd.read_csv(FILE_NAME, index_col=0, header=0, parse_dates=True, infer_datetime_format=True)
swp_data = swp_data.sort_index()
For each column, I need to find the point where the value reaches 95% of the column's max value, and then, from the beginning of the DataFrame up to that 95% point, find where the difference between consecutive time steps is greater than a given value (0.2 in this case).
something similar to what would work in R (not actual code but an illustration)
for (i in 1 : 95% point){
difference[i] <- s[i] - s[(i-1)]
}
breaking <-which(difference > 0.2)[1]
This would take the 95% point as the end index of a loop, look at the differences between time steps, and return the first index where the difference > 0.2.
In pandas I have calculated the following:
95% value
s95 = (swp_data.max() + (swp_data.max() * .05))
A1-24, -20.6260635,
A1-18, -17.863923,
A1-12, -11.605629,
A2-24, -16.755144,
A2-18, -17.6815275,
A2-12, -16.369584,
A3-24, -15.5030295,
95% time
s95_time = (swp_data >= (swp_data.max() + (swp_data.max() * .05))).idxmax()
A1-24, 10/2/2011 1:30,
A1-18, 10/3/2011 6:20,
A1-12, 10/2/2011 17:20,
A2-24, 10/3/2011 6:10,
A2-18, 10/3/2011 1:30,
A2-12, 10/2/2011 17:10,
A3-24, 10/2/2011 1:30,
Thus far, I have the max value, and the 95% value, as well as a series of timestamps where each column reached its 95% point.
I have tried to mask the DataFrame (trying to replicate R's which) by creating a boolean DataFrame of values <= the 95% point, and have tried df.where using values >= 95%. Neither mask nor where has given me what I need, as some of the sensors can already be above 95% of max when I started recording (mask returns NaN for these values), while where returns these values but not the values below the 95% threshold.
The output I am looking for would be something along the lines of
A1-24, A1-18, A1-12, A2-24, A2-18, A2-12, A3-24, A3-18, A3-12
BREAKING hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm
where hh:mm equals the time from the start of the data file to the breaking value.
So far, what I have found on SE and Google has me confused about whether I can subset the columns of the dataframe by different values, and I am having trouble figuring out what the operation I am trying to do is called.
Edit (re: @Prateek's comment):
What I am trying to do is find a way to somewhat automate this process, so that using the 95% position I can have the breaking point returned. I have ~200 csv files to process, and would like as much of the filtering as possible to be done using the 95% and breaking positions.
A possible solution from what I understand.
Note that I renamed swp_data to df in the example, and the solution is tested on the csv sample file provided in your question.
Find duration from the start up to when value reaches 95% of column's max
Finding the first timepoint where each column reaches 95% of the max is done as you described:
idx = (df >= df.max(axis=0) * 1.05).idxmax()
>>> idx
Out[]:
A1-24 2011-10-02 01:30:00
A1-18 2011-10-03 06:20:00
A1-12 2011-10-02 17:20:00
A2-24 2011-10-03 06:10:00
A2-18 2011-10-03 01:30:00
A2-12 2011-10-02 17:10:00
A3-24 2011-10-02 01:30:00
dtype: datetime64[ns]
Note that using df.max() * 1.05 avoids computing the max twice, compared to s95 = (swp_data.max() + (swp_data.max() * .05)); otherwise it's the same.
The duration from the start of the dataframe is then obtained by subtracting the first timestamp:
>>> idx - df.index[0]
Out[]:
A1-24 0 days 00:00:00
A1-18 1 days 04:50:00
A1-12 0 days 15:50:00
A2-24 1 days 04:40:00
A2-18 1 days 00:00:00
A2-12 0 days 15:40:00
A3-24 0 days 00:00:00
dtype: timedelta64[ns]
This is for each column the time spent from the start of the record to the s95 point.
Time is 0 if the first recorded value is already above this point.
Mask the dataframe to cover this period
mask = pd.concat([pd.Series(df.index)] * df.columns.size, axis=1) < idx.values.T
df_masked = df.where(mask.values)
>>> df_masked.dropna(how='all')
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 01:30:00 NaN -18.63589 -16.90389 -17.26780 -19.20653 -19.59666 NaN
2011-10-02 01:40:00 NaN -18.64686 -16.93100 -17.26832 -19.22702 -19.62036 NaN
2011-10-02 01:50:00 NaN -18.65098 -16.92761 -17.26132 -19.22705 -19.61355 NaN
2011-10-02 02:00:00 NaN -18.64307 -16.94764 -17.27702 -19.22746 -19.63462 NaN
2011-10-02 02:10:00 NaN -18.66338 -16.94900 -17.27325 -19.25358 -19.62761 NaN
2011-10-02 02:20:00 NaN -18.66217 -16.95625 -17.27386 -19.25455 -19.64009 NaN
2011-10-02 02:30:00 NaN -18.66015 -16.96130 -17.27040 -19.25898 -19.64241 NaN
2011-10-02 02:40:00 NaN -18.66883 -16.96980 -17.27580 -19.27054 -19.65454 NaN
2011-10-02 02:50:00 NaN -18.68635 -16.97897 -17.27488 -19.28492 -19.65808 NaN
2011-10-02 03:00:00 NaN -18.68009 -16.99057 -17.28346 -19.28928 -19.67182 NaN
2011-10-02 03:10:00 NaN -18.68450 -17.00258 -17.28196 -19.32272 -19.68135 NaN
2011-10-02 03:20:00 NaN -18.68777 -17.01009 -17.29675 -19.30864 -19.68747 NaN
2011-10-02 03:30:00 NaN -18.70067 -17.01706 -17.29178 -19.32034 -19.69742 NaN
2011-10-02 03:40:00 NaN -18.70095 -17.03559 -17.29352 -19.32741 -19.70945 NaN
2011-10-02 03:50:00 NaN -18.70636 -17.03651 -17.28925 -19.33549 -19.71560 NaN
2011-10-02 04:00:00 NaN -18.70937 -17.03548 -17.28996 -19.33433 -19.71211 NaN
2011-10-02 04:10:00 NaN -18.70599 -17.04444 -17.29223 -19.33740 -19.72227 NaN
2011-10-02 04:20:00 NaN -18.71292 -17.05510 -17.29449 -19.35154 -19.72779 NaN
2011-10-02 04:30:00 NaN -18.72158 -17.06376 -17.28770 -19.35647 -19.73064 NaN
2011-10-02 04:40:00 NaN -18.72185 -17.06910 -17.30018 -19.36785 -19.74481 NaN
2011-10-02 04:50:00 NaN -18.72048 -17.06599 -17.29004 -19.37320 -19.73424 NaN
2011-10-02 05:00:00 NaN -18.73083 -17.07618 -17.29528 -19.37319 -19.75045 NaN
2011-10-02 05:10:00 NaN -18.72215 -17.08587 -17.29650 -19.38400 -19.75713 NaN
2011-10-02 05:20:00 NaN -18.73206 -17.10233 -17.29767 -19.39254 -19.76838 NaN
2011-10-02 05:30:00 NaN -18.73719 -17.09621 -17.29842 -19.39363 -19.76258 NaN
2011-10-02 05:40:00 NaN -18.73839 -17.10910 -17.29237 -19.40390 -19.76864 NaN
2011-10-02 05:50:00 NaN -18.74257 -17.12091 -17.29398 -19.40846 -19.78042 NaN
2011-10-02 06:00:00 NaN -18.74327 -17.12995 -17.29097 -19.41153 -19.77897 NaN
2011-10-02 06:10:00 NaN -18.74326 -17.04482 -17.28397 -19.40928 -19.77430 NaN
2011-10-02 06:20:00 NaN -18.73100 -16.86221 -17.28575 -19.40956 -19.78396 NaN
... ... ... ... ... ... ... ...
2011-10-03 01:20:00 NaN -18.16448 NaN -16.99797 -17.95030 NaN NaN
2011-10-03 01:30:00 NaN -18.15606 NaN -16.98879 NaN NaN NaN
2011-10-03 01:40:00 NaN -18.12795 NaN -16.97951 NaN NaN NaN
2011-10-03 01:50:00 NaN -18.12974 NaN -16.97937 NaN NaN NaN
2011-10-03 02:00:00 NaN -18.11848 NaN -16.96770 NaN NaN NaN
2011-10-03 02:10:00 NaN -18.11879 NaN -16.95256 NaN NaN NaN
2011-10-03 02:20:00 NaN -18.08212 NaN -16.95461 NaN NaN NaN
2011-10-03 02:30:00 NaN -18.09060 NaN -16.94141 NaN NaN NaN
2011-10-03 02:40:00 NaN -18.07000 NaN -16.93006 NaN NaN NaN
2011-10-03 02:50:00 NaN -18.07461 NaN -16.91700 NaN NaN NaN
2011-10-03 03:00:00 NaN -18.06039 NaN -16.91466 NaN NaN NaN
2011-10-03 03:10:00 NaN -18.04229 NaN -16.89537 NaN NaN NaN
2011-10-03 03:20:00 NaN -18.03514 NaN -16.89753 NaN NaN NaN
2011-10-03 03:30:00 NaN -18.03014 NaN -16.88813 NaN NaN NaN
2011-10-03 03:40:00 NaN -18.00851 NaN -16.88086 NaN NaN NaN
2011-10-03 03:50:00 NaN -18.01028 NaN -16.87721 NaN NaN NaN
2011-10-03 04:00:00 NaN -18.00227 NaN -16.86687 NaN NaN NaN
2011-10-03 04:10:00 NaN -17.98804 NaN -16.85424 NaN NaN NaN
2011-10-03 04:20:00 NaN -17.96740 NaN -16.84466 NaN NaN NaN
2011-10-03 04:30:00 NaN -17.96451 NaN -16.84205 NaN NaN NaN
2011-10-03 04:40:00 NaN -17.95414 NaN -16.82609 NaN NaN NaN
2011-10-03 04:50:00 NaN -17.93661 NaN -16.81903 NaN NaN NaN
2011-10-03 05:00:00 NaN -17.92905 NaN -16.80737 NaN NaN NaN
2011-10-03 05:10:00 NaN -17.92743 NaN -16.80582 NaN NaN NaN
2011-10-03 05:20:00 NaN -17.91504 NaN -16.78991 NaN NaN NaN
2011-10-03 05:30:00 NaN -17.89965 NaN -16.78469 NaN NaN NaN
2011-10-03 05:40:00 NaN -17.89945 NaN -16.77288 NaN NaN NaN
2011-10-03 05:50:00 NaN -17.88822 NaN -16.76610 NaN NaN NaN
2011-10-03 06:00:00 NaN -17.87259 NaN -16.75742 NaN NaN NaN
2011-10-03 06:10:00 NaN -17.87308 NaN NaN NaN NaN NaN
[173 rows x 7 columns]
To achieve this you have to compute a bool mask for each column:
create a dataframe with the DateTimeIndex values repeated over the same number of columns as df: pd.concat([pd.Series(df.index)] * df.columns.size, axis=1).
Here df.index must be turned into a pd.Series for concatenation, then repeated to match the number of columns df.columns.size.
create the mask itself with < idx.values.T, where values gets idx as a numpy.array and T transposes it in order to compare column-wise with the dataframe.
mask the dataframe with df.where(mask.values), where using values gets the mask as a numpy.array. This is needed as the mask does not have the same labels as df.
optionally, only keep the rows where at least one value is not NaN using .dropna(how='all')
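As an aside, the same mask can also be built with plain numpy broadcasting, which some may find simpler (a sketch under the same assumptions as above):
# compare every index timestamp (shape n_rows x 1) with every column's s95 time (shape n_columns)
mask = df.index.values[:, None] < idx.values
df_masked = df.where(mask)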
Filter masked data on the difference between each time point
If I understand correctly, this is the point where you want to filter your data on a difference > 0.2 between consecutive time points, and for the selected period only.
It remains a bit unclear to me so do not hesitate to discuss in the comments if I misunderstood.
This can be done with:
df[df_masked.diff(1) > 0.2]
But unfortunately for the provided dataset there is no value matching these conditions.
>>> df[df_masked.diff(1) > 0.2].any()
Out[]:
A1-24 False
A1-18 False
A1-12 False
A2-24 False
A2-18 False
A2-12 False
A3-24 False
dtype: bool
Edit: visualize results as a bool dataframe (comments follow-up)
Visualizing the results as a boolean dataframe with index and columns is done very simply with df_masked.diff(1) > 0.2.
However there will likely be a lot of unnecessary rows containing only False, so you can filter it this way:
df_diff = df_masked.diff(1) > 0.1 # Lowering the threshold a bit to get some values
>>> df_diff[df_diff.any(axis=1)]
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 06:20:00 False False True False False False False
2011-10-02 06:30:00 False False True False False False False
2011-10-02 06:40:00 False False True False False False False
2011-10-02 06:50:00 False False True False False False False
2011-10-02 07:00:00 False False True False False False False
2011-10-02 07:10:00 False False True False False False False
2011-10-02 07:20:00 False False True False False False False
2011-10-02 07:30:00 False False True False False False False
2011-10-02 07:40:00 False False True False False False False
2011-10-02 07:50:00 False False True False False False False
2011-10-02 08:00:00 False False True False False False False
2011-10-02 08:10:00 False False True False False False False
2011-10-02 08:20:00 False False True False False False False
2011-10-02 08:30:00 False False True False False False False
2011-10-02 08:40:00 False False True False False False False
2011-10-02 08:50:00 False False True False False False False
2011-10-02 09:00:00 False False True False False False False
2011-10-02 09:10:00 False False True False False False False
2011-10-02 09:20:00 False False True False False False False
2011-10-02 09:30:00 False False True False False False False
2011-10-02 12:20:00 False False False False False True False
2011-10-02 12:50:00 False False False False True True False
2011-10-02 13:10:00 False False False False False True False