Compare current row value to previous row values - python

I have login history data from User A for a day. My requirement is that at any point in time User A can have only one valid login. As the samples below show, the user may have logged in successfully multiple times while the first session was still active. Any login that happened during a still-valid session needs to be flagged as a duplicate.
Example 1:
In the first sample below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login from 00:55:14 to 01:00:02 (index 1).
Similarly, comparing indices 2 and 3, the record at index 3 is a duplicate login per the requirement.
start_time end_time
0 00:12:38 01:00:02
1 00:55:14 01:00:02
2 01:00:02 01:32:40
3 01:00:02 01:08:40
4 01:41:22 03:56:23
5 18:58:26 19:16:49
6 20:12:37 20:52:49
7 20:55:16 22:02:50
8 22:21:24 22:48:50
9 23:11:30 00:00:00
Expected output:
start_time end_time isDup
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
These duplicate records need to have isDup set to 1.
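For anyone reproducing this, a minimal snippet to build the first sample; the column names follow the tables above, and the times are assumed to be plain HH:MM:SS strings:
import pandas as pd

# Example 1 data as shown above
df = pd.DataFrame({
    'start_time': ['00:12:38', '00:55:14', '01:00:02', '01:00:02', '01:41:22',
                   '18:58:26', '20:12:37', '20:55:16', '22:21:24', '23:11:30'],
    'end_time':   ['01:00:02', '01:00:02', '01:32:40', '01:08:40', '03:56:23',
                   '19:16:49', '20:52:49', '22:02:50', '22:48:50', '00:00:00'],
})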
Example 2:
Another data sample is below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions that also need to be flagged.
start_time end_time
0 13:32:54 13:32:55
1 13:36:10 13:50:16
2 13:37:54 13:38:14
3 13:46:38 13:46:45
4 13:48:59 13:49:05
5 13:50:16 13:50:20
6 14:03:39 14:03:49
7 15:36:20 15:36:20
8 15:46:47 15:46:47
Expected output:
start_time end_time isDup
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
What's an efficient way to compare the start time of the current record with those of the previous records?

Use duplicated() on both columns and cast the result to int:
df['isDup'] = (df['start_time'].duplicated(False) | df['end_time'].duplicated(False)).astype(int)
Or did you need:
df['isDup'] = df['start_time'].between(df['start_time'].shift(), df['end_time'].shift()).astype(int)
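Note that the second one-liner only compares each row with the row immediately before it, and with the default inclusive bounds it would also flag index 2 of Example 1, whose start merely equals the previous row's end. A hedged variant using a half-open interval (pandas >= 1.3 for inclusive='left'); it reproduces Example 1 but still misses indices 3 and 4 of Example 2, where the containing session is more than one row back:
# hedged sketch: one-row-back check over [start, end) intervals
t = df[['start_time', 'end_time']].astype(str).apply(pd.to_timedelta)
df['isDup'] = t['start_time'].between(t['start_time'].shift(),
                                      t['end_time'].shift(),
                                      inclusive='left').astype(int)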

Map the time-like values in the start_time and end_time columns to pandas Timedelta objects, and move the 00:00:00 (midnight) values in end_time forward by one day minus one second (to 23:59:59) so they sort at the end of the day.
import numpy as np
import pandas as pd

c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)
Then, for each pair of start_time and end_time in the dataframe df, mark the duplicate intervals using numpy broadcasting:
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(1) & ~df[c].duplicated(keep=False)).view('i1')
# example 1
start_time end_time isDupe
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
# example 2
start_time end_time isDupe
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
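If the n x n mask gets too large for bigger logs, here is a hedged sketch of an O(n log n) alternative under the same containment rule: sort by start time (ties broken by longer session first) and flag any row whose end does not exceed the running maximum of earlier ends. Unlike the mask above, it flags the later of two identical rows rather than exempting them.
# hedged sketch: running-maximum containment check on the question's columns
d = df[['start_time', 'end_time']].astype(str).apply(pd.to_timedelta)
d.loc[d['end_time'] == pd.Timedelta(0), 'end_time'] += pd.Timedelta(days=1)  # midnight end = next day
d = d.sort_values(['start_time', 'end_time'], ascending=[True, False])
contained = d['end_time'] <= d['end_time'].shift().cummax()  # some earlier session covers this one
df['isDup'] = contained.astype(int)  # realigns to the original row order via the index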

Here's my solution to the question above. However, if there is a more efficient way, I'd be happy to accept it. Thanks!
def getDuplicate(data):
    data = data.copy()  # avoid mutating a slice of the original frame
    # compare every row's interval against the start of the newest row
    data['check_time'] = data.iloc[-1]['start_time']
    data['isDup'] = data.apply(
        lambda x: 1 if (x['start_time'] <= x['check_time']) & (x['check_time'] < x['end_time'])
        else 0,
        axis=1)
    return data['isDup'].sum()

limit = 1
df_copy = df.copy()
df['isDup'] = 0
for i, row in df.iterrows():
    data = df_copy.iloc[:limit]   # rows up to and including the current one
    isDup = getDuplicate(data)
    limit = limit + 1
    # the newest row always matches itself, so a count above 1 means
    # an earlier session contains the current start time
    if isDup > 1:
        df.at[i, 'isDup'] = 1
    else:
        df.at[i, 'isDup'] = 0

Related

Check certain conditions looking back x hours (pandas)

I have some data like this:
import pandas as pd
dates = ["12/25/2021 07:47:01", "12/25/2021 08:02:32", "12/25/2021 13:57:40", "12/25/2021 14:17:11", "12/25/2021 17:23:01", "12/25/2021 23:48:55", "12/26/2021 08:22:32", "12/26/2021 11:11:11", "12/26/2021 14:53:40", "12/26/2021 16:07:07", "12/26/2021 23:56:07"]
is_manual = [0,0,0,0,1,1,0,0,0,0,1]
is_problem = [0,0,0,0,1,1,0,0,0,1,1]
df = pd.DataFrame({'dates': dates,
                   'manual_entry': is_manual,
                   'problem_entry': is_problem})
dates manual_entry problem_entry
0 12/25/2021 07:47:01 0 0
1 12/25/2021 08:02:32 0 0
2 12/25/2021 13:57:40 0 0
3 12/25/2021 14:17:11 0 0
4 12/25/2021 17:23:01 1 1
5 12/25/2021 23:48:55 1 1
6 12/26/2021 08:22:32 0 0
7 12/26/2021 11:11:11 0 0
8 12/26/2021 14:53:40 0 0
9 12/26/2021 16:07:07 0 1
10 12/26/2021 23:56:07 1 1
What I would like to do is take every row where problem_entry == 1 and check whether every row in the 24 hours prior to that row has manual_entry == 0.
While I know you can create a rolling lookback window over a fixed number of rows, the rows here are not evenly spaced in time, so I'm wondering how to look back 24 hours and determine whether the criterion above is met.
Thanks in advance
EDIT: Expected output:
dates manual_entry problem_entry
4 12/25/2021 17:23:01 1 1
10 12/26/2021 23:56:07 1 1
Try the following. Extract 'manual_entry' into a separate variable holding its sums over a sliding one-day window. If the windowed sum for the current row equals 1, the current row's own entry is the only manual one during the preceding day. Then filter the dataframe to the rows where 'problem_entry' is 1 and the windowed sum is 1.
df['dates'] = pd.to_datetime(df['dates'])
a = (df.rolling("86400s", on='dates', min_periods=1).sum()).loc[:, 'manual_entry']
print(df.loc[(df['problem_entry'] == 1) & (a == 1)])
Output:
dates manual_entry problem_entry
4 2021-12-25 17:23:01 1 1
10 2021-12-26 23:56:07 1 1
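A side note on the window spec: '86400s' is exactly 24 hours, so presumably the same window can be written with the offset alias '24h':
# hedged equivalent using the '24h' alias (dates already converted to datetime)
manual_24h = df.rolling('24h', on='dates', min_periods=1)['manual_entry'].sum()
print(df[(df['problem_entry'] == 1) & (manual_24h == 1)])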

Extract a window of rows following a set of one's in a pandas dataframe

I have a pandas dataframe that looks like the following:
Day val
Day1 0
Day2 0
Day3 0
Day4 0
Day5 1
Day6 1
Day7 1
Day8 1
Day9 0
Day10 0
Day11 0
Day12 1
Day13 1
Day14 1
Day15 1
Day16 0
Day17 0
Day18 0
Day19 0
Day20 0
Day21 1
Day22 0
Day23 1
Day24 1
Day25 1
I am looking to extract at most 2 rows where val = 0, but only those whose preceding rows are a run of 1's.
For example:
There is a run of ones from Day5 to Day8 (an event). I would need to look at the at most two rows after the end of the event; here those are Day9 and Day10.
Similarly, Day21 is a single-day event, and I need to look at only Day22, since it is the single zero that follows the event.
For the table data above, the output would be the following:
Day val
Day9 0
Day10 0
Day16 0
Day17 0
Day22 0
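For reference, a hedged reconstruction of the sample frame, useful for testing the answers below:
import pandas as pd

# the Day/val sample as listed above
df = pd.DataFrame({
    'Day': [f'Day{i}' for i in range(1, 26)],
    'val': [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,
            0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})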
We can simplify the condition for each row to:
The val value should be 0
The previous day or the day before that should have a val of 1
In code:
cond = (df['val'].shift(1) == 1) | (df['val'].shift(2) == 1)
df.loc[(df['val'] == 0) & cond]
Result:
Day val
8 Day9 0
9 Day10 0
15 Day16 0
16 Day17 0
21 Day22 0
Note: If more than 2 days should be considered this can easily be added to the condition cond. In this case, cond can be constructed with a list comprehension and np.any(), for example:
n = 2
cond = np.any([df['val'].shift(s) == 1 for s in range(1, n+1)], axis=0)
df.loc[(df['val'] == 0) & cond]
You can compute a mask from the rolling max per group, where a new group starts at each 1->0 transition, and combine it with a second mask selecting the rows whose value is 0:
N = 2
o2z = df['val'].diff().eq(-1)
m1 = o2z.groupby(o2z.cumsum()).rolling(N, min_periods=1).max().astype(bool).values
m2 = df['val'].eq(0)
df[m1&m2]
Output:
Day val
8 Day9 0
9 Day10 0
15 Day16 0
16 Day17 0
21 Day22 0
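To trace the second approach, a small illustrative check: o2z is True exactly at each 1->0 transition, and its cumsum opens a new group there, so the per-group rolling max then marks at most the first N rows after each event.
import pandas as pd

# illustrative only: group ids advance at every 1 -> 0 transition
demo = pd.Series([0, 1, 1, 0, 0, 0, 1, 0], name='val')
o2z = demo.diff().eq(-1)
print(o2z.cumsum().tolist())  # [0, 0, 0, 1, 1, 1, 1, 2]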

np where with two conditions and met first

I am trying to create a target variable based on 2 conditions. I have X values that are binary and X2 values that are also binary. My condition is: whenever X changes from 1 to 0, we put a 1 in Y, but only if that is followed by a change from 0 to 1 in X2. If it is instead followed by a change from 0 to 1 in X, then we don't apply the change in the first place. I attached a picture from Excel.
I also did the following to account for the change in X
df['X-prev'] = df['X'].shift(1)
df['Change-X'] = np.where(df['X-prev'] + df['X'] == 1, 1, 0)
# this is the data frame
X=[1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2=[0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df=pd.DataFrame()
df['X']=X
df['X2']=X2
However, this is not enough, as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.
Keep the rows that match your transitions (X=1 followed by X=0, and X2=1 preceded by X2=0), then merge all selected rows into one series where a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
But this series can contain consecutive starts or ends, so you need to filter again to keep only (0, 1) cycles. After that, reindex the new series with your original dataframe index and back-fill with 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)
sr1 = pd.Series(0, df.index[x1])
sr2 = pd.Series(1, df.index[x2])
sr = pd.concat([sr2, sr1]).sort_index()
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
    .reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
1 1 0 0
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0
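To make the filtering step concrete, here is what the merged event stream looks like for this sample, recomputed from sr1 and sr2 above (shown for illustration):
# 0 = cycle start (X: 1 -> 0), 1 = cycle end (X2: 0 -> 1)
print(pd.concat([sr2, sr1]).sort_index().to_dict())
# {1: 0, 4: 0, 9: 1, 16: 1, 21: 0, 27: 1}
# the lt/gt filter then keeps only the valid (0, 1) pairs: indices 4, 9, 21, 27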

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the cumulative sum of consecutive non-zero values in column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
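For testing the answers below, a minimal frame with just the Fn column is enough, since that is the only column they touch:
import pandas as pd

# minimal frame; Datetime and Data are omitted as the answers don't use them
df = pd.DataFrame({'Fn': [0, 1, 0, 1, 1, 0, 0, 1, 1, 1]})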
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and resetting whenever the value is 0:
df['Fn2'] = df['Fn'].astype(bool)
cs = df['Fn2'].cumsum()
df['Sum'] = cs - cs.where(~df['Fn2']).ffill().fillna(0).astype(int)
df
You can store the Fn column in a list, then build a new list by iterating over it: while the current value is non-zero, extend the running total; when you hit a zero, reset it. Finally, make a dataframe from the list and concat it column-wise to the existing dataframe.
fn = df['Fn'].tolist()
sum_list = [fn[0]]                             # the first value starts the running total
for i in range(1, len(fn)):
    if fn[i] > 0:
        sum_list.append(sum_list[-1] + fn[i])  # extend the current streak
    else:
        sum_list.append(0)                     # a zero resets the streak
df_sum = pd.DataFrame({'Sum': sum_list})
df = pd.concat([df, df_sum], axis=1)
Hope this will help you. The idea is simply a running total that resets at each zero.
try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Checking for subset in a column?

I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over, let's say, 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index Price Flag
1 10 0
2 11 0
3 12 0
4 12 0
5 12 1
6 11 0
7 13 0
Any help is appreciated!
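For reference, a minimal frame reconstructed from the table above to test the answers against:
import pandas as pd

firm = pd.DataFrame({'Price': [10, 11, 12, 12, 12, 11, 13]},
                    index=pd.Index(range(1, 8), name='Index'))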
You can first check whether Series.diff() equals 0 and take the cumsum, testing whether it reaches n-1 (here 2). Also check that the current row equals the previous one; where both conditions hold, assign a flag of 1, else 0.
n=3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
                firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT: to make it a generalized function for any consecutive n, use this:
def fun(df, col, n):
    c = df[col].diff().eq(0)
    return (c | c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())
firm['flag_2'] = fun(firm,'Price',2).astype(int)
firm['flag_3'] = fun(firm,'Price',3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
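A hedged alternative framing of the same n-day staleness flag, not from the answer above: label each run of equal consecutive prices with a run id, then flag every row that is the n-th or later member of its run.
# run-length trick: diff().ne(0).cumsum() assigns one id per streak of equal prices
n = 3
run_id = firm['Price'].diff().ne(0).cumsum()
firm['flag_alt'] = (run_id.groupby(run_id).cumcount() + 1 >= n).astype(int)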
