Check certain conditions looking back x hours (pandas) - python

I have some data like this:
import pandas as pd
dates = ["12/25/2021 07:47:01", "12/25/2021 08:02:32", "12/25/2021 13:57:40", "12/25/2021 14:17:11", "12/25/2021 17:23:01", "12/25/2021 23:48:55", "12/26/2021 08:22:32", "12/26/2021 11:11:11", "12/26/2021 14:53:40", "12/26/2021 16:07:07", "12/26/2021 23:56:07"]
is_manual = [0,0,0,0,1,1,0,0,0,0,1]
is_problem = [0,0,0,0,1,1,0,0,0,1,1]
df = pd.DataFrame({'dates':dates,
'manual_entry': is_manual,
'problem_entry': is_problem})
dates manual_entry problem_entry
0 12/25/2021 07:47:01 0 0
1 12/25/2021 08:02:32 0 0
2 12/25/2021 13:57:40 0 0
3 12/25/2021 14:17:11 0 0
4 12/25/2021 17:23:01 1 1
5 12/25/2021 23:48:55 1 1
6 12/26/2021 08:22:32 0 0
7 12/26/2021 11:11:11 0 0
8 12/26/2021 14:53:40 0 0
9 12/26/2021 16:07:07 0 1
10 12/26/2021 23:56:07 1 1
What I would like to do is take every row where problem_entry == 1 and check whether every row in the 24 hours prior to that row has manual_entry == 0.
While I know you can create a rolling lookback window over a fixed number of rows, the rows here are not spaced at regular intervals, so I am wondering how to look back 24 hours and determine whether the criterion above is met.
Thanks in advance
EDIT: Expected output:
dates manual_entry problem_entry
4 12/25/2021 17:23:01 1 1
10 12/26/2021 23:56:07 1 1

Try the following. It converts 'dates' to datetime, then sums 'manual_entry' over a sliding 24-hour window (86400 seconds) that ends at, and includes, each row. If that sum is 1 on a row whose own 'manual_entry' is 1, there were no other manual entries in the preceding 24 hours. Finally, the dataframe is filtered to the rows where 'problem_entry' is 1 and the window sum is 1.
df['dates'] = pd.to_datetime(df['dates'])
a = (df.rolling("86400s", on='dates', min_periods=1).sum()).loc[:, 'manual_entry']
print(df.loc[(df['problem_entry'] == 1) & (a == 1)])
Output:
dates manual_entry problem_entry
4 2021-12-25 17:23:01 1 1
10 2021-12-26 23:56:07 1 1
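Note that the filter above only checks that the window sum equals 1, so a problem row that is not itself manual but has exactly one manual entry in the previous 24 hours would also pass. A minimal sketch of a stricter variant, assuming the rows are sorted by 'dates' and that "prior" excludes the current row. The names window_sum, prior_manual and result are illustrative and not part of the original answer:

```python
import pandas as pd

dates = ["12/25/2021 07:47:01", "12/25/2021 08:02:32", "12/25/2021 13:57:40",
         "12/25/2021 14:17:11", "12/25/2021 17:23:01", "12/25/2021 23:48:55",
         "12/26/2021 08:22:32", "12/26/2021 11:11:11", "12/26/2021 14:53:40",
         "12/26/2021 16:07:07", "12/26/2021 23:56:07"]
df = pd.DataFrame({"dates": pd.to_datetime(dates),
                   "manual_entry": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
                   "problem_entry": [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]})

# Manual entries in the 24 hours up to and including each row
window_sum = df.rolling("86400s", on="dates")["manual_entry"].sum()

# Remove the row's own contribution so only strictly earlier rows are counted
prior_manual = window_sum - df["manual_entry"]

result = df[(df["problem_entry"] == 1) & (prior_manual == 0)]
print(result)
```

On the sample data this keeps rows 4 and 10, the same rows as the expected output in the question.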

Related

Sample from dataframe with conditions

I have a large dataset and I want to sample from it conditionally. What I need is a new dataframe with almost the same number of rows for each value (0 and 1) of a boolean column.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to add the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size for every group, you can simply use df.groupby('target').sample(n=4).
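If the per-class sizes differ, they can also be kept in a lookup table instead of being hard-coded in a lambda. A minimal sketch, using 4/6 in place of 4000/6000; the sizes dictionary and the name parts are hypothetical and not part of the answers above:

```python
import pandas as pd

df = pd.DataFrame({"target": [0, 1, 1, 1] * 4})  # 4 zeros, 12 ones

# Hypothetical cap on the number of rows to keep for each target value
sizes = {0: 4, 1: 6}

# Sample each group separately, never asking for more rows than it has
parts = [
    group.sample(n=min(len(group), sizes[value]), random_state=0)
    for value, group in df.groupby("target")
]
new_df = pd.concat(parts)
print(new_df["target"].value_counts())
```

The random_state argument is only there to make the sketch reproducible; drop it for a fresh sample each run.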

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the running sum of consecutive non-zero values in column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and resetting it whenever the value is 0?
df['Fn2'] = df['Fn'].ne(0)
running = df['Fn2'].cumsum()
# subtract the running total as of the most recent zero (0 if there has been no zero yet)
df['Fn2'] = running - running.where(~df['Fn2']).ffill().fillna(0).astype(int)
df
You can pull the Fn column out into a list, then build a second list by iterating over it: if the current value is greater than zero, add it to the previous running total; otherwise reset the total to zero. Afterwards, turn the list into a dataframe and concatenate it column-wise with the existing dataframe.
fn = df['Fn'].tolist()
sum_list = [fn[0]]  # start with the first value
for i in range(1, len(fn)):
    if fn[i] > 0:
        # extend the current run
        sum_list.append(sum_list[-1] + fn[i])
    else:
        # a zero resets the running total
        sum_list.append(0)
dfsum = pd.DataFrame({'Sum': sum_list}, index=df.index)
df = pd.concat([df, dfsum], axis=1)
Hope this helps. The idea is to keep a running total that resets at every zero.
Try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Checking for subset in a column?

I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over, let's say, 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index Price Flag
1 10 0
2 11 0
3 12 0
4 12 0
5 12 1
6 11 0
7 13 0
Any help is appreciated!
If you are open to a different approach, you can first check where series.diff() equals 0 and take the cumsum of that mask, looking for a cumulative count of n-1 (here 2). Also check that the current row equals the previous row. When both conditions hold, assign a flag of 1, otherwise 0.
n=3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT, to make it a generalized function with consecutive n, use this:
def fun(df,col,n):
    c = df[col].diff().eq(0)
    return (c|c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())
firm['flag_2'] = fun(firm,'Price',2).astype(int)
firm['flag_3'] = fun(firm,'Price',3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
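Both snippets above rely on a cumulative sum over the whole column, so the count keeps growing after the first run of repeated prices and later runs may not be flagged as intended. A minimal sketch of a windowed alternative, assuming "stale" means the last n prices (including the current one) are all identical; the helper name flag_stale is hypothetical:

```python
import pandas as pd

def flag_stale(prices: pd.Series, n: int) -> pd.Series:
    """Return 1 where the last n prices are all equal, else 0."""
    window = prices.rolling(n)
    # max == min only when every price in the window is the same;
    # incomplete windows yield NaN on both sides, which compares as False
    return window.max().eq(window.min()).astype(int)

firm = pd.DataFrame({"Price": [10, 11, 12, 12, 12, 11, 13]},
                    index=pd.RangeIndex(1, 8, name="Index"))
firm["Flag"] = flag_stale(firm["Price"], 3)
print(firm)
```

Each window is evaluated independently, so a second run of unchanged prices further down the column is treated the same way as the first.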

Compare current row value to previous row values

I have login history data for User A for one day. My requirement is that User A can have only one valid login at any point in time. As in the samples below, the user may have logged in successfully multiple times while the first session was still active. Any login that happened during a valid session needs to be flagged as a duplicate.
Example 1:
In the first sample data below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login from the user at 00:55:14 to 01:00:02 (index 1).
Similarly, if we compare index 2 and 3, we can see that the record at index 3 is a duplicate login as per the requirement.
start_time end_time
0 00:12:38 01:00:02
1 00:55:14 01:00:02
2 01:00:02 01:32:40
3 01:00:02 01:08:40
4 01:41:22 03:56:23
5 18:58:26 19:16:49
6 20:12:37 20:52:49
7 20:55:16 22:02:50
8 22:21:24 22:48:50
9 23:11:30 00:00:00
Expected output:
start_time end_time isDup
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
For these duplicate records, the isDup column needs to be set to 1.
Example 2:
Another sample of data is shown below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions that need to be flagged.
start_time end_time
0 13:32:54 13:32:55
1 13:36:10 13:50:16
2 13:37:54 13:38:14
3 13:46:38 13:46:45
4 13:48:59 13:49:05
5 13:50:16 13:50:20
6 14:03:39 14:03:49
7 15:36:20 15:36:20
8 15:46:47 15:46:47
Expected output:
start_time end_time isDup
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
What's an efficient way to compare the start time of the current record with previous records?
Use duplicated() and cast the result to int with astype:
df['isDup']=(df['start_time'].duplicated(keep=False)|df['end_time'].duplicated(keep=False)).astype(int)
Or did you need each start time checked against the previous row's session?
df['isDup']=(df['start_time'].between(df['start_time'].shift(),df['end_time'].shift())).astype(int)
Map the time-like values in the start_time and end_time columns to pandas Timedelta objects, and replace any 00:00:00 value in the end_time column with 23:59:59 by adding one day minus one second.
c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)
Then for each pair of start_time and end_time in the dataframe df mark the corresponding duplicate intervals using numpy broadcasting:
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(axis=1) & ~df[c].duplicated(keep=False)).astype('int8')
# example 1
start_time end_time isDupe
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
# example 2
start_time end_time isDupe
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
Here's my solution to the above question. However, if there is a more efficient way, I would be happy to accept it. Thanks!
def getDuplicate(data):
    data['check_time'] = data.iloc[-1]['start_time']
    data['isDup'] = data.apply(lambda x: 1
                               if (x['start_time'] <= x['check_time']) & (x['check_time'] < x['end_time'])
                               else 0
                               , axis = 1)
    return data['isDup'].sum()
limit = 1
df_copy = df.copy()
df['isDup'] = 0
for i, row in df.iterrows():
    data = df_copy.iloc[:limit]
    isDup = getDuplicate(data)
    limit = limit + 1
    if isDup > 1:
        df.at[i, 'isDup'] = 1
    else:
        df.at[i, 'isDup'] = 0
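For larger inputs, one vectorized possibility is to compare each start time with the latest end time seen in any earlier row. A minimal sketch, assuming the rows are already sorted by start_time, that the times are HH:MM:SS values within a single day, and that a flagged login's end time still counts toward the active session; flag_duplicates is a hypothetical helper name:

```python
import pandas as pd

def flag_duplicates(df: pd.DataFrame) -> pd.Series:
    """Flag logins that start while an earlier session is still active."""
    start = pd.to_timedelta(df["start_time"].astype(str)).dt.total_seconds()
    end = pd.to_timedelta(df["end_time"].astype(str)).dt.total_seconds()
    # Latest end time among all earlier rows (NaN for the first row,
    # which makes the comparison below False)
    latest_prev_end = end.cummax().shift()
    return (start < latest_prev_end).astype(int)

# Example 1 from the question
df = pd.DataFrame({
    "start_time": ["00:12:38", "00:55:14", "01:00:02", "01:00:02", "01:41:22",
                   "18:58:26", "20:12:37", "20:55:16", "22:21:24", "23:11:30"],
    "end_time": ["01:00:02", "01:00:02", "01:32:40", "01:08:40", "03:56:23",
                 "19:16:49", "20:52:49", "22:02:50", "22:48:50", "00:00:00"],
})
df["isDup"] = flag_duplicates(df)
print(df)
```

This makes a single pass over the column instead of re-scanning all earlier rows for every record. An end time of 00:00:00 that means "midnight of the next day" is not adjusted here; it is harmless in the sample because it occurs on the last row.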

Complete pandas dataframe with zero values for large datasets

I have a dataframe that looks like this:
>> df
index week day hour count
5 10 2 10 70
5 10 3 11 80
7 10 2 18 15
7 10 2 19 12
where week is the week of the year, day is day of the week (0-6), and hour is hour of the day (0-23). However, since I plan to convert this to a 3D array (week x day x hour) later, I have to include hours where there are no items in the count column. Example:
>> target_df
index week day hour count
5 10 0 0 0
5 10 0 1 0
...
5 10 2 10 70
5 10 2 11 0
...
7 10 0 0 0
...
...
and so on. What I do is to generate a dummy dataframe containing all index-week-day-hour combinations possible (basically target_df without the count column):
>> dummy_df
index week day hour
5 10 0 0
5 10 0 1
...
5 10 2 10
5 10 2 11
...
7 10 0 0
...
...
and then using
target_df = pd.merge(df, dummy_df, on=['index','week','day','hour'], how='outer').fillna(0)
This works fine for small datasets, but I'm working with a lot of rows. With the case I'm working on now, I get 82M rows for dummy_df and target_df, and it's painfully slow.
EDIT: The slowest part is actually constructing dummy_df!!! I can generate the individual lists but combining them into a pandas dataframe is the slowest part.
num_weeks = len(week_list)
num_idxs = len(df['index'].unique())
print('creating dummies')
_dummy_idxs = list(itertools.chain.from_iterable(
itertools.repeat(x, 24*7*num_weeks) for x in df['index'].unique()))
print('\t_dummy_idxs')
_dummy_weeks = list(itertools.chain.from_iterable(
itertools.repeat(x, 24*7) for x in week_list)) * num_idxs
print('\t_dummy_weeks')
_dummy_days = list(itertools.chain.from_iterable(
itertools.repeat(x, 24) for x in range(0,7))) * num_weeks * num_idxs
print('\t_dummy_days')
_dummy_hours = list(range(0,24)) * 7 * num_weeks * num_idxs
print('\t_dummy_hours')
print('Creating dummy_hour_df with {0} rows...'.format(len(_dummy_hours)))
# the part below takes the longest time
dummy_hour_df = pd.DataFrame({'index': _dummy_idxs, 'week': _dummy_weeks, 'day': _dummy_days, 'hour': _dummy_hours})
print('dummy_hour_df completed')
Is there a faster way to do this?
As an alternative, you can use itertools.product for the creation of dummy_df as a product of lists:
import itertools
index = range(100)
weeks = range(53)
days = range(7)
hours = range(24)
dummy_df = pd.DataFrame(list(itertools.product(index, weeks, days, hours)), columns=['index','week','day','hour'])
dummy_df.head()
index week day hour
0 0 0 0 0
1 0 0 0 1
2 0 0 0 2
3 0 0 0 3
4 0 0 0 4
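Another option that avoids building and merging a separate dummy dataframe is to reindex the counts against the full set of combinations. A minimal sketch using the four sample rows from the question, assuming each (index, week, day, hour) combination appears at most once in df; the name full_index is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "index": [5, 5, 7, 7],
    "week": [10, 10, 10, 10],
    "day": [2, 3, 2, 2],
    "hour": [10, 11, 18, 19],
    "count": [70, 80, 15, 12],
})
keys = ["index", "week", "day", "hour"]

# Every index/week/day/hour combination, built without Python-level lists
full_index = pd.MultiIndex.from_product(
    [sorted(df["index"].unique()), sorted(df["week"].unique()), range(7), range(24)],
    names=keys,
)

# Align the existing counts to the full index; missing combinations become 0
target_df = (
    df.set_index(keys)["count"]
      .reindex(full_index, fill_value=0)
      .reset_index()
)
print(target_df.head())
```

Because the resulting rows are ordered by index, week, day and hour, the count column can afterwards be reshaped with to_numpy().reshape(...) into one week x day x hour block per index value.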
