I have the following dataframe in pandas:
code tank date time no_operation_flag
123 1 01-01-2019 00:00:00 1
123 1 01-01-2019 00:30:00 1
123 1 01-01-2019 01:00:00 0
123 1 01-01-2019 01:30:00 1
123 1 01-01-2019 02:00:00 1
123 1 01-01-2019 02:30:00 1
123 1 01-01-2019 03:00:00 1
123 1 01-01-2019 03:30:00 1
123 1 01-01-2019 04:00:00 1
123 1 01-01-2019 05:00:00 1
123 1 01-01-2019 14:00:00 1
123 1 01-01-2019 14:30:00 1
123 1 01-01-2019 15:00:00 1
123 1 01-01-2019 15:30:00 1
123 1 01-01-2019 16:00:00 1
123 1 01-01-2019 16:30:00 1
123 2 02-01-2019 00:00:00 1
123 2 02-01-2019 00:30:00 0
123 2 02-01-2019 01:00:00 0
123 2 02-01-2019 01:30:00 0
123 2 02-01-2019 02:00:00 1
123 2 02-01-2019 02:30:00 1
123 2 02-01-2019 03:00:00 1
123 2 03-01-2019 03:30:00 1
123 2 03-01-2019 04:00:00 1
123 1 03-01-2019 14:00:00 1
123 2 03-01-2019 15:00:00 1
123 2 03-01-2019 00:30:00 1
123 2 04-01-2019 11:00:00 1
123 2 04-01-2019 11:30:00 0
123 2 04-01-2019 12:00:00 1
123 2 04-01-2019 13:30:00 1
123 2 05-01-2019 03:00:00 1
123 2 05-01-2019 03:30:00 1
123 2 05-01-2019 04:00:00 1
What I want to do is flag runs of more than 5 consecutive 1's in no_operation_flag at tank level and day level, where the times must also be consecutive (the data is at a half-hour granularity). The dataframe is already sorted by tank, date and time.
My desired dataframe would be:
code tank date time no_operation_flag final_flag
123 1 01-01-2019 00:00:00 1 0
123 1 01-01-2019 00:30:00 1 0
123 1 01-01-2019 01:00:00 0 0
123 1 01-01-2019 01:30:00 1 1
123 1 01-01-2019 02:00:00 1 1
123 1 01-01-2019 02:30:00 1 1
123 1 01-01-2019 03:00:00 1 1
123 1 01-01-2019 03:30:00 1 1
123 1 01-01-2019 04:00:00 1 1
123 1 01-01-2019 05:00:00 1 0
123 1 01-01-2019 14:00:00 1 1
123 1 01-01-2019 14:30:00 1 1
123 1 01-01-2019 15:00:00 1 1
123 1 01-01-2019 15:30:00 1 1
123 1 01-01-2019 16:00:00 1 1
123 1 01-01-2019 16:30:00 1 1
123 2 02-01-2019 00:00:00 1 0
123 2 02-01-2019 00:30:00 0 0
123 2 02-01-2019 01:00:00 0 0
123 2 02-01-2019 01:30:00 0 0
123 2 02-01-2019 02:00:00 1 0
123 2 02-01-2019 02:30:00 1 0
123 2 02-01-2019 03:00:00 1 0
123 2 03-01-2019 03:30:00 1 0
123 2 03-01-2019 04:00:00 1 0
123 1 03-01-2019 14:00:00 1 0
123 2 03-01-2019 15:00:00 1 0
123 2 03-01-2019 00:30:00 1 0
123 2 04-01-2019 11:00:00 1 0
123 2 04-01-2019 11:30:00 0 0
123 2 04-01-2019 12:00:00 1 0
123 2 04-01-2019 13:30:00 1 0
123 2 05-01-2019 03:00:00 1 0
123 2 05-01-2019 03:30:00 1 0
123 2 05-01-2019 04:00:00 1 0
How can I do this in pandas?
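(For anyone who wants to experiment with the answers below, here is a minimal sketch that builds a toy slice of this frame; the dates are taken to be dd-mm-yyyy strings as shown above.)
import pandas as pd

df = pd.DataFrame({
    'code': [123] * 8,
    'tank': [1] * 8,
    'date': ['01-01-2019'] * 8,
    'time': ['00:00:00', '00:30:00', '01:00:00', '01:30:00',
             '02:00:00', '02:30:00', '03:00:00', '03:30:00'],
    'no_operation_flag': [1, 1, 0, 1, 1, 1, 1, 1],
})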
You can use a solution like the one below, filtering only for consecutive datetimes per group: build a helper DataFrame with all missing datetimes added, then merge it back to attach the new column:
df['datetimes'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str))

# upsample each code/tank/date group to a full 30-minute grid
df1 = (df.set_index('datetimes')
         .groupby(['code','tank', 'date'])['no_operation_flag']
         .resample('30T')
         .first()
         .reset_index())

# build consecutive-value groups per code/tank/date, then keep runs of 1 longer than 5
shifted1 = df1.groupby(['code','tank', 'date'])['no_operation_flag'].shift()
g1 = df1['no_operation_flag'].ne(shifted1).cumsum()
mask1 = g1.map(g1.value_counts()).gt(5) & df1['no_operation_flag'].eq(1)
df1['final_flag'] = mask1.astype(int)
#print (df1.head(40))

# merge the flag back onto the original rows only
df = df.merge(df1[['code','tank','datetimes','final_flag']]).drop('datetimes', axis=1)
print (df)
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
34 123 2 05-01-2019 04:00:00 1 0
Use:
df['final_flag'] = (df.groupby([df['no_operation_flag'].ne(1).cumsum(),
                                'tank',
                                'date',
                                pd.to_datetime(df['time'].astype(str))
                                  .diff()
                                  .ne(pd.Timedelta(minutes=30))
                                  .cumsum(),
                                'no_operation_flag'])['no_operation_flag']
                       .transform('size')
                       .gt(5)
                       .astype('uint8'))
print(df)
Output
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
34 123 2 05-01-2019 04:00:00 1 0
There might be a way to do it in one go, but the two-step approach is simpler:
first you select the tanks one by one, and then you look for the sequence of five 1's.
This other question already covers searching for a pattern in a column.
If you want to go the other way, you might take a look at rolling: you can either sum the 1's or use an "all values are True" condition to find the sequence of n elements (a rough sketch of that idea follows below).
You could also just mask a column, but that would give you only the values inside the mask. That solves the other problem, "which tanks were non-operative at a given time".
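To make the rolling suggestion concrete, here is a rough sketch of that idea for the question's "more than 5" rule (a window of 6). The run_id / hit / spread names are introduced just for this example, and it assumes the dd-mm-yyyy date format shown above:
import pandas as pd

ts = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str),
                    dayfirst=True)

# start a new run whenever the tank or the date changes, or the 30-minute step is broken
new_run = (df['tank'].ne(df['tank'].shift())
           | df['date'].ne(df['date'].shift())
           | ts.diff().ne(pd.Timedelta(minutes=30)))
run_id = new_run.cumsum()

# rolling sum inside each run: a window of 6 that sums to 6 means 6 consecutive 1's
hit = (df.groupby(run_id)['no_operation_flag']
         .transform(lambda s: s.rolling(6).sum().eq(6))
         .astype(int))

# a hit only marks the last row of its window, so spread it back over the window
spread = (hit[::-1].groupby(run_id[::-1])
                   .transform(lambda s: s.rolling(6, min_periods=1).max()))
df['final_flag'] = spread.astype(int)   # assignment realigns on the index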
This is a very primitive and somewhat dirty way, but it is easy to understand, I think.
1. Loop over the rows and check whether the time 4 rows ahead is exactly 2 hours later.
2. (if 1 is true) Check that all five corresponding values of df['no_operation_flag'] are 1.
3. (if 2 is true) Set the five corresponding values of df['final_flag'] to 1.
import datetime

import pandas as pd

# make the column, filled with zeros
df['final_flag'] = 0

for i in range(len(df) - 4):   # start at 0 so the first row can also open a window
    j = i + 4
    ts1 = pd.to_datetime(df['date'].iloc[i] + ' ' + df['time'].iloc[i])
    ts2 = pd.to_datetime(df['date'].iloc[j] + ' ' + df['time'].iloc[j])
    # is the timedelta exactly 2 hours, i.e. five consecutive half-hour slots?
    if ts2 - ts1 == datetime.timedelta(hours=2, minutes=0):
        # are all five no_operation_flag values equal to 1?
        if (df['no_operation_flag'].iloc[i:j + 1] == 1).all():
            # positional assignment avoids the chained-indexing warning
            df.iloc[i:j + 1, df.columns.get_loc('final_flag')] = 1
Related
I am working with structured log data shaped like the following (here is a pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well when dealing with dates at a daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs will be batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by the id column and a helper Series built with Series.cumsum computed from the back, hence the reversed indexing with Series.iloc:
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
EDIT: To get the cumulative sum of the date differences instead, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.days.cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
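If the same reversed-group trick should yield minutes instead of a row count (as in the edited question), one possible adaptation is the sketch below. It is not part of the original answer and assumes the date column has already been parsed to datetimes:
df2 = df[::-1]
grp = df2.groupby(['id', df2['has_err'].eq(1).cumsum()])['date']
# minutes from each row to the error row that closes its group
df['mins_to_err'] = (grp.transform('first') - df2['date']).dt.total_seconds().div(60)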
I have the following dataframe in pandas:
code tank date time no_operation_flag
123 1 01-01-2019 00:00:00 1
123 1 01-01-2019 00:30:00 1
123 1 01-01-2019 01:00:00 0
123 1 01-01-2019 01:30:00 1
123 1 01-01-2019 02:00:00 1
123 1 01-01-2019 02:30:00 1
123 1 01-01-2019 03:00:00 1
123 1 01-01-2019 03:30:00 1
123 1 01-01-2019 04:00:00 1
123 2 01-01-2019 00:00:00 1
123 2 01-01-2019 00:30:00 1
123 2 01-01-2019 01:00:00 1
123 2 01-01-2019 01:30:00 0
123 2 01-01-2019 02:00:00 1
123 2 01-01-2019 02:30:00 1
123 2 01-01-2019 03:00:00 1
123 2 01-01-2019 03:30:00 1
123 2 01-01-2019 04:00:00 1
What I want to do is flag runs of more than 3 consecutive 1's in no_operation_flag at tank level. The dataframe is already sorted by tank, date and time. My desired output is:
code tank date time no_operation_flag final_flag
123 1 01-01-2019 00:00:00 1 0
123 1 01-01-2019 00:30:00 1 0
123 1 01-01-2019 01:00:00 0 0
123 1 01-01-2019 01:30:00 1 1
123 1 01-01-2019 02:00:00 1 1
123 1 01-01-2019 02:30:00 1 1
123 1 01-01-2019 03:00:00 1 1
123 1 01-01-2019 03:30:00 1 1
123 1 01-01-2019 04:00:00 1 1
123 2 01-01-2019 00:00:00 1 0
123 2 01-01-2019 00:30:00 1 0
123 2 01-01-2019 01:00:00 1 0
123 2 01-01-2019 01:30:00 0 0
123 2 01-01-2019 02:00:00 1 1
123 2 01-01-2019 02:30:00 1 1
123 2 01-01-2019 03:00:00 1 1
123 2 01-01-2019 03:30:00 1 1
123 2 01-01-2019 04:00:00 1 1
How do I do it in Python?
Create consecutive groups with DataFrameGroupBy.shift, a not-equal comparison and a cumulative sum, then get the group sizes with Series.map and Series.value_counts, compare them with Series.gt against 3 while also requiring the flag to be 1, and finally set the values with numpy.where:
import numpy as np

shifted = df.groupby(['code','tank'])['no_operation_flag'].shift()
g = df['no_operation_flag'].ne(shifted).cumsum()
mask = g.map(g.value_counts()).gt(3) & df['no_operation_flag'].eq(1)
df['no_operation_flag'] = np.where(mask, 1, 0)
Or:
df['no_operation_flag'] = mask.astype(int)
print (df)
code tank date time no_operation_flag
0 123 1 01-01-2019 00:00:00 0
1 123 1 01-01-2019 00:30:00 0
2 123 1 01-01-2019 01:00:00 0
3 123 1 01-01-2019 01:30:00 1
4 123 1 01-01-2019 02:00:00 1
5 123 1 01-01-2019 02:30:00 1
6 123 1 01-01-2019 03:00:00 1
7 123 1 01-01-2019 03:30:00 1
8 123 1 01-01-2019 04:00:00 1
9 123 2 01-01-2019 00:00:00 0
10 123 2 01-01-2019 00:30:00 0
11 123 2 01-01-2019 01:00:00 0
12 123 2 01-01-2019 01:30:00 0
13 123 2 01-01-2019 02:00:00 1
14 123 2 01-01-2019 02:30:00 1
15 123 2 01-01-2019 03:00:00 1
16 123 2 01-01-2019 03:30:00 1
17 123 2 01-01-2019 04:00:00 1
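If you would rather keep no_operation_flag intact and add a separate final_flag column, as in the desired output, the same mask can be written to a new column instead (a small variant of the code above, not a different method):
shifted = df.groupby(['code','tank'])['no_operation_flag'].shift()
g = df['no_operation_flag'].ne(shifted).cumsum()
mask = g.map(g.value_counts()).gt(3) & df['no_operation_flag'].eq(1)
# keep the original flag and store the result separately
df['final_flag'] = mask.astype(int)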
I am struggling to get my pandas df into the format I require because I am incorrectly populating a bit-masked dataframe.
I have a number of data frames:
plot_d1_sw1 - this is read from a .csv
timestamp switchID deviceID count
0 2019-05-01 07:00:00 1 GTEC122277 1
1 2019-05-01 08:00:00 1 GTEC122277 1
3 2019-05-01 10:00:00 1 GTEC122277 3
d1_sw1 - this is the last 12 hours, with a condition indicating whether the data appears in filt
timestamp num
0 2019-05-01 12:00:00 False
1 2019-05-01 11:00:00 False
2 2019-05-01 10:00:00 True
3 2019-05-01 09:00:00 False
4 2019-05-01 08:00:00 True
5 2019-05-01 07:00:00 True
6 2019-05-01 06:00:00 False
7 2019-05-01 05:00:00 False
8 2019-05-01 04:00:00 False
9 2019-05-01 03:00:00 False
10 2019-05-01 02:00:00 False
11 2019-05-01 01:00:00 False
I have tried masking this and pulling the count column through into any True values using the following:
mask_d1_sw1 = d1_sw1.num == False
d1_sw1.loc[mask_d1_sw1, column_name] = 0
i = 0
for row in plot_d1_sw1.itertuples():
    mask_d1_sw1 = d1_sw1.num == True
    d1_sw1.loc[mask_d1_sw1, column_name] = plot_d1_sw1['count'].values[i]
    print(d1_sw1)
    i = i + 1
this gives me:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 3
5 2019-05-01 07:00:00 3
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
... I know that this is because I am looping through the count column of plot_d1_sw1 but I cannot for the life of me work out how to logically fill this to get the outcome:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 1
5 2019-05-01 07:00:00 1
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
How can I achieve this outcome?
One way is to merge on the timestamp and then multiply the boolean values with count:
df = d1_sw1.merge(plot_d1_sw1, how='left', on='timestamp')
df['num'] = df.num.mul(df['count'].fillna(0)).astype(int)
df[['timestamp', 'num']]
Which gives:
timestamp num
0 2019-05-01-12:00:00 0
1 2019-05-01-11:00:00 0
2 2019-05-01-10:00:00 3
3 2019-05-01-09:00:00 0
4 2019-05-01-08:00:00 1
5 2019-05-01-07:00:00 1
6 2019-05-01-06:00:00 0
7 2019-05-01-05:00:00 0
8 2019-05-01-04:00:00 0
9 2019-05-01-03:00:00 0
10 2019-05-01-02:00:00 0
11 2019-05-01-01:00:00 0
I have a dataframe as below. Initially it has three columns ('date', 'time', 'flag'). I want to add one column based on the flag and date: once flag=1 occurs, the target is 1 for the rest of that day, otherwise the target is zero.
date time flag target
0 2017/4/10 10:00:00 0 0
1 2017/4/10 11:00:00 1 1
2 2017/4/10 12:00:00 0 1
3 2017/4/10 13:00:00 0 1
4 2017/4/10 14:00:00 0 1
5 2017/4/11 10:00:00 1 1
6 2017/4/11 11:00:00 0 1
7 2017/4/11 12:00:00 1 1
8 2017/4/11 13:00:00 1 1
9 2017/4/11 14:00:00 0 1
10 2017/4/12 10:00:00 0 0
11 2017/4/12 11:00:00 0 0
12 2017/4/12 12:00:00 0 0
13 2017/4/12 13:00:00 0 0
14 2017/4/12 14:00:00 0 0
15 2017/4/13 10:00:00 0 0
16 2017/4/13 11:00:00 1 1
17 2017/4/13 12:00:00 0 1
18 2017/4/13 13:00:00 1 1
19 2017/4/13 14:00:00 0 1
Use DataFrameGroupBy.cumsum for a cumulative sum of the flag values, compare it with 0, and last cast the mask to integer:
df['new'] = (df.groupby('date')['flag'].cumsum() > 0).astype(int)
print (df)
date time flag target new
0 2017/4/10 10:00:00 0 0 0
1 2017/4/10 11:00:00 1 1 1
2 2017/4/10 12:00:00 0 1 1
3 2017/4/10 13:00:00 0 1 1
4 2017/4/10 14:00:00 0 1 1
5 2017/4/11 10:00:00 1 1 1
6 2017/4/11 11:00:00 0 1 1
7 2017/4/11 12:00:00 1 1 1
8 2017/4/11 13:00:00 1 1 1
9 2017/4/11 14:00:00 0 1 1
10 2017/4/12 10:00:00 0 0 0
11 2017/4/12 11:00:00 0 0 0
12 2017/4/12 12:00:00 0 0 0
13 2017/4/12 13:00:00 0 0 0
14 2017/4/12 14:00:00 0 0 0
15 2017/4/13 10:00:00 0 0 0
16 2017/4/13 11:00:00 1 1 1
17 2017/4/13 12:00:00 0 1 1
18 2017/4/13 13:00:00 1 1 1
19 2017/4/13 14:00:00 0 1 1
Okay, I know that we've already found a solution here, but just to satisfy the nerd in me, here's an answer (not elegant, given how long it is) that avoids that nagging first-row flaw:
pd.merge(df, (df.groupby('date')['flag'].any().astype(int)).to_frame().T.transpose().reset_index(), left_on='date', right_on='date')
The approach remains the same as @jezrael's - the groupby function is key here. Instead of using cumsum, which leads to the first-row flaw, any() appears to fit really well into this solution. The only drawback is that it produces a series, which we then need to coerce back into a dataframe and transpose before joining the two together on the date key. A shorter variant of the same any() idea is sketched below.
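For comparison, the same any() idea can also be expressed with transform, which skips the merge and transpose round-trip entirely (a compact sketch, not the answer above):
# the per-day any() is broadcast back onto every row of that day
df['target'] = df.groupby('date')['flag'].transform('any').astype(int)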
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 are your original index: reset the index, set Date as the index, resample and forward-fill, then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the below for that:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
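Putting the two steps together, a sketch of the full chain (it assumes 'Date' is stored as dd/mm/YYYY strings and that ['ID1', 'ID2'] currently form the index, as stated above):
import pandas as pd

# parse the string dates first, then resample to a daily grid and forward-fill
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df = (df.reset_index()
        .set_index('Date')
        .resample('d')
        .ffill()
        .reset_index()
        .set_index(['ID1', 'ID2']))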