I am working with log data structured as follows (here is a Pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well when dealing with dates at a daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs will be batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by the id column and by a helper Series built with Series.cumsum computed from the back (hence the reversed indexing with Series.iloc):
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
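For tinkering without the pastebin file, here is a minimal, self-contained sketch of the same reversed-cumsum idea. The small frame below is made-up illustration data (an assumption, not the original logs), and rows are assumed to be sorted by id and date with one row per day. Note that rows after the last error of an id also get 0, just like index 7 above.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({
    "id":      [1, 1, 1, 1, 2, 2, 2],
    "has_err": [0, 0, 1, 0, 0, 1, 0],
})

# Cumulative sum taken from the bottom up: every row gets the number of
# errors at or after it, so rows leading up to the same error share a label
g = df["has_err"].iloc[::-1].cumsum().iloc[::-1]

# Count down within each (id, label) block; the error row itself gets 0
df["days_to_err"] = df.groupby(["id", g]).cumcount(ascending=False)
print(df)
```

Because the helper label is combined with id in the groupby, it does not matter that the cumulative sum itself runs across machines.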
EDIT: To compute the cumulative sum of the differences between dates, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.days.cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['mins_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
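As a hedged alternative sketch (not taken from the answers here): when the spacing between rows is irregular, one option is to look up the timestamp of the next error per id with a backward fill and subtract the row's own timestamp. The sample frame and the column name mins_to_err are illustrative assumptions. Rows with no later error for their id come out as NaN rather than 0.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2020-01-01 12:00:00", "2020-01-01 12:15:00",
        "2020-01-01 12:30:00", "2020-01-01 12:45:00",
        "2020-01-01 17:00:00", "2020-01-01 17:15:00",
    ]),
    "has_err": [0, 0, 1, 0, 0, 1],
})

# Keep the timestamp only on error rows, NaT everywhere else
err_time = df["date"].where(df["has_err"].eq(1))

# Within each id, pull the next error timestamp back onto earlier rows
next_err = err_time.groupby(df["id"]).bfill()

# Minutes until that next error; NaN when no later error exists for the id
df["mins_to_err"] = (next_err - df["date"]).dt.total_seconds().div(60)
print(df)
```

This version only needs the rows to be sorted by date within each id.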
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
How can I give a column the same number every 7 times in a dataframe?
In the last column, 'ww', I want to put the same value 1 for the dates from 1-21 to 1-27, the same value 2 for 1-28 to 2-3,
3 for the next 7 days, etc.
In short, I want a number that increases every 7 days, but I am not sure how to write the code.
date people ww
0 2020-01-21 0
1 2020-01-22 0
2 2020-01-23 0
3 2020-01-24 1
4 2020-01-25 0
... ... ...
616 2021-09-28 2289
617 2021-09-29 2883
618 2021-09-30 2564
619 2021-10-01 2484
620 2021-10-02 2247
Since you have daily data, you can do this with simple math:
df["ww"] = (df["date"]-df["date"].min()).dt.days//7+1
>>> df
date ww
0 2021-01-21 1
1 2021-01-22 1
2 2021-01-23 1
3 2021-01-24 1
4 2021-01-25 1
.. ... ..
250 2021-09-28 36
251 2021-09-29 36
252 2021-09-30 37
253 2021-10-01 37
254 2021-10-02 37
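A small self-contained sketch of that arithmetic, assuming one row per day and a datetime64 date column (the 15-day range below is made up for illustration):

```python
import pandas as pd

# Hypothetical daily data: 15 consecutive days starting 2020-01-21
df = pd.DataFrame({"date": pd.date_range("2020-01-21", periods=15, freq="D")})

# Days elapsed since the first date, integer-divided into 7-day blocks,
# shifted by one so that the numbering starts at 1
df["ww"] = (df["date"] - df["date"].min()).dt.days // 7 + 1
print(df)
```

If the date column is stored as strings, convert it with pd.to_datetime first, otherwise the subtraction will fail.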
I have the following dataframe in pandas
code tank date time no_operation_flag
123 1 01-01-2019 00:00:00 1
123 1 01-01-2019 00:30:00 1
123 1 01-01-2019 01:00:00 0
123 1 01-01-2019 01:30:00 1
123 1 01-01-2019 02:00:00 1
123 1 01-01-2019 02:30:00 1
123 1 01-01-2019 03:00:00 1
123 1 01-01-2019 03:30:00 1
123 1 01-01-2019 04:00:00 1
123 1 01-01-2019 05:00:00 1
123 1 01-01-2019 14:00:00 1
123 1 01-01-2019 14:30:00 1
123 1 01-01-2019 15:00:00 1
123 1 01-01-2019 15:30:00 1
123 1 01-01-2019 16:00:00 1
123 1 01-01-2019 16:30:00 1
123 2 02-01-2019 00:00:00 1
123 2 02-01-2019 00:30:00 0
123 2 02-01-2019 01:00:00 0
123 2 02-01-2019 01:30:00 0
123 2 02-01-2019 02:00:00 1
123 2 02-01-2019 02:30:00 1
123 2 02-01-2019 03:00:00 1
123 2 03-01-2019 03:30:00 1
123 2 03-01-2019 04:00:00 1
123 1 03-01-2019 14:00:00 1
123 2 03-01-2019 15:00:00 1
123 2 03-01-2019 00:30:00 1
123 2 04-01-2019 11:00:00 1
123 2 04-01-2019 11:30:00 0
123 2 04-01-2019 12:00:00 1
123 2 04-01-2019 13:30:00 1
123 2 05-01-2019 03:00:00 1
123 2 05-01-2019 03:30:00 1
123 2 05-01-2019 04:00:00 1
What I want to do is flag runs of more than 5 consecutive 1's in no_operation_flag, at tank level and day level, where the times must also be consecutive (rows are half an hour apart). The dataframe is already sorted by tank, date and time.
My desired dataframe would be
code tank date time no_operation_flag final_flag
123 1 01-01-2019 00:00:00 1 0
123 1 01-01-2019 00:30:00 1 0
123 1 01-01-2019 01:00:00 0 0
123 1 01-01-2019 01:30:00 1 1
123 1 01-01-2019 02:00:00 1 1
123 1 01-01-2019 02:30:00 1 1
123 1 01-01-2019 03:00:00 1 1
123 1 01-01-2019 03:30:00 1 1
123 1 01-01-2019 04:00:00 1 1
123 1 01-01-2019 05:00:00 1 0
123 1 01-01-2019 14:00:00 1 1
123 1 01-01-2019 14:30:00 1 1
123 1 01-01-2019 15:00:00 1 1
123 1 01-01-2019 15:30:00 1 1
123 1 01-01-2019 16:00:00 1 1
123 1 01-01-2019 16:30:00 1 1
123 2 02-01-2019 00:00:00 1 0
123 2 02-01-2019 00:30:00 0 0
123 2 02-01-2019 01:00:00 0 0
123 2 02-01-2019 01:30:00 0 0
123 2 02-01-2019 02:00:00 1 0
123 2 02-01-2019 02:30:00 1 0
123 2 02-01-2019 03:00:00 1 0
123 2 03-01-2019 03:30:00 1 0
123 2 03-01-2019 04:00:00 1 0
123 1 03-01-2019 14:00:00 1 0
123 2 03-01-2019 15:00:00 1 0
123 2 03-01-2019 00:30:00 1 0
123 2 04-01-2019 11:00:00 1 0
123 2 04-01-2019 11:30:00 0 0
123 2 04-01-2019 12:00:00 1 0
123 2 04-01-2019 13:30:00 1 0
123 2 05-01-2019 03:00:00 1 0
123 2 05-01-2019 03:30:00 1 0
123 2 05-01-2019 04:00:00 1 0
How can I do this in pandas?
You can use a solution like this: check for consecutive datetimes per group using a new helper DataFrame in which all missing datetimes have been added, then merge at the end to add the new column:
df['datetimes'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str))
df1 = (df.set_index('datetimes')
.groupby(['code','tank', 'date'])['no_operation_flag']
.resample('30min')
.first()
.reset_index())
shifted1 = df1.groupby(['code','tank', 'date'])['no_operation_flag'].shift()
g1 = df1['no_operation_flag'].ne(shifted1).cumsum()
mask1 = g1.map(g1.value_counts()).gt(5) & df1['no_operation_flag'].eq(1)
df1['final_flag'] = mask1.astype(int)
#print (df1.head(40))
df = df.merge(df1[['code','tank','datetimes','final_flag']]).drop('datetimes', axis=1)
print (df)
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
34 123 2 05-01-2019 04:00:00 1 0
Use:
df['final_flag'] = ( df.groupby([df['no_operation_flag'].ne(1).cumsum(),
'tank',
'date',
pd.to_datetime(df['time'].astype(str))
.diff()
.ne(pd.Timedelta(minutes = 30))
.cumsum(),
'no_operation_flag'])['no_operation_flag']
.transform('size')
.gt(5)
.astype('uint8') )
print(df)
Output
code tank date time no_operation_flag final_flag
0 123 1 01-01-2019 00:00:00 1 0
1 123 1 01-01-2019 00:30:00 1 0
2 123 1 01-01-2019 01:00:00 0 0
3 123 1 01-01-2019 01:30:00 1 1
4 123 1 01-01-2019 02:00:00 1 1
5 123 1 01-01-2019 02:30:00 1 1
6 123 1 01-01-2019 03:00:00 1 1
7 123 1 01-01-2019 03:30:00 1 1
8 123 1 01-01-2019 04:00:00 1 1
9 123 1 01-01-2019 05:00:00 1 0
10 123 1 01-01-2019 14:00:00 1 1
11 123 1 01-01-2019 14:30:00 1 1
12 123 1 01-01-2019 15:00:00 1 1
13 123 1 01-01-2019 15:30:00 1 1
14 123 1 01-01-2019 16:00:00 1 1
15 123 1 01-01-2019 16:30:00 1 1
16 123 2 02-01-2019 00:00:00 1 0
17 123 2 02-01-2019 00:30:00 0 0
18 123 2 02-01-2019 01:00:00 0 0
19 123 2 02-01-2019 01:30:00 0 0
20 123 2 02-01-2019 02:00:00 1 0
21 123 2 02-01-2019 02:30:00 1 0
22 123 2 02-01-2019 03:00:00 1 0
23 123 2 03-01-2019 03:30:00 1 0
24 123 2 03-01-2019 04:00:00 1 0
25 123 1 03-01-2019 14:00:00 1 0
26 123 2 03-01-2019 15:00:00 1 0
27 123 2 03-01-2019 00:30:00 1 0
28 123 2 04-01-2019 11:00:00 1 0
29 123 2 04-01-2019 11:30:00 0 0
30 123 2 04-01-2019 12:00:00 1 0
31 123 2 04-01-2019 13:30:00 1 0
32 123 2 05-01-2019 03:00:00 1 0
33 123 2 05-01-2019 03:30:00 1 0
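The run-length idea behind both answers can be sketched in a more explicit form. This is a minimal sketch, not either answerer's exact code: it assumes the date strings are day-month-year, uses a made-up ten-row sample, and introduces the helper names ts, new_run, run_id and run_len for illustration. A new run starts whenever the flag, tank or date changes, or the gap to the previous row is not exactly 30 minutes.

```python
import pandas as pd

# Hypothetical sample: one tank, one day, a zero at 01:00 and a gap after 04:00
df = pd.DataFrame({
    "tank": [1] * 10,
    "date": ["01-01-2019"] * 10,
    "time": ["00:00:00", "00:30:00", "01:00:00", "01:30:00", "02:00:00",
             "02:30:00", "03:00:00", "03:30:00", "04:00:00", "05:00:00"],
    "no_operation_flag": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
})

# Assumption: dates are written as day-month-year
ts = pd.to_datetime(df["date"] + " " + df["time"], format="%d-%m-%Y %H:%M:%S")

# A row opens a new run if anything changes or the 30-minute cadence breaks
new_run = (
    df["no_operation_flag"].ne(df["no_operation_flag"].shift())
    | df["tank"].ne(df["tank"].shift())
    | df["date"].ne(df["date"].shift())
    | ts.diff().ne(pd.Timedelta(minutes=30))
)
run_id = new_run.cumsum()

# Size of the run each row belongs to
run_len = df.groupby(run_id)["no_operation_flag"].transform("size")

# Flag only runs of 1s that are longer than 5 rows
df["final_flag"] = (run_len.gt(5) & df["no_operation_flag"].eq(1)).astype(int)
print(df)
```

Keeping the break conditions as separate boolean terms makes it easy to add or drop a grouping level.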
There might be a way to do it in one go, but a two-step approach is simpler:
first you select the tanks one by one, and then you look for the sequence of five 1s.
This other question already solves searching for a pattern in a column.
If you want to go the other way, you might take a look at rolling: you can either sum the 1s or use an "all values are True" condition to find a sequence of n elements.
You could also just mask a column, but that would give you only the values in the mask. This solves the other problem, "which tanks were non-operative at a given time".
This is a very primitive and somewhat dirty way, but easy to understand, I think.
1. Loop over the rows and check whether the time 4 rows later is exactly 2 hours ahead.
2. (if 1 is True) Check that all five corresponding values of df['no_operation_flag'] are 1.
3. (if 2 is True) Put 1 in the corresponding five values of df['final_flag'].
# make col with zero
df['final_flag'] = 0
flag_col = df.columns.get_loc('final_flag')
# start at 0 so that the first row is checked as well
for i in range(len(df) - 4):
    j = i + 4
    dt1 = df['date'].iloc[i] + ' ' + df['time'].iloc[i]
    ts1 = pd.to_datetime(dt1)
    dt2 = df['date'].iloc[j] + ' ' + df['time'].iloc[j]
    ts2 = pd.to_datetime(dt2)
    # timedelta is 2 hours?
    if ts2 - ts1 == pd.Timedelta(hours=2):
        # all of no_operation_flag == 1?
        if (df['no_operation_flag'].iloc[i:j+1] == 1).all():
            # positional assignment avoids chained indexing
            df.iloc[i:j+1, flag_col] = 1
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' value, which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 are your original index, you should reset the index, set Date as the index, resample, and then set the index back to [ID1, ID2]:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the line below for that, with the format adjusted to match your data (the sample shown in the question looks like '%Y-%m-%d'):
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
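For the expanding layout shown in the question (each DateGroup listing all dates up to and including itself), one possible approach is a self-merge within each ['ID1', 'ID2'] group followed by a filter. This is only a sketch under assumptions, not the answer's method: the five-row frame is made up, and the names keys and out are illustrative. The result grows roughly quadratically with group size, so memory use should be checked before trying it on around 20 million rows.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({
    "ID1": [1, 1, 1, 1, 1],
    "ID2": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(["2018-01-05", "2018-01-06", "2018-01-07",
                            "2018-02-01", "2018-02-02"]),
    "Values": [75, 83, 17, 85, 98],
})

# One row per (ID1, ID2, DateGroup): the dates that define each sub-dataframe
keys = df[["ID1", "ID2", "Date"]].rename(columns={"Date": "DateGroup"})

# Pair every DateGroup with every row of the same group...
out = keys.merge(df, on=["ID1", "ID2"])

# ...then keep only rows dated on or before the DateGroup
out = (out[out["Date"] <= out["DateGroup"]]
       .sort_values(["ID1", "ID2", "DateGroup", "Date"])
       .reset_index(drop=True))
print(out)
```

No Python-level loop over rows is involved, but the intermediate merge holds every pair of dates per group before the filter is applied.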
I have the following dataframe in Pandas. Score and Date_of_interest columns are to be calculated. Below it is already filled out to make the explanation of the problem easy.
First let's assume that Score and Date_of_interest columns are filled with NaN's only. Below are the steps to fill the values in them.
a) We are trying to get one date of interest per PC_id, based on the criteria described below; e.g. PC_id 200 has 1998-04-10 02:25:00, and so on.
b) To solve this problem we go through the rows of each PC_id and look for changes in Item_id; each change is worth a score of 1. Where the Item_id stays the same, as in the first and second rows (both 1), the score starts at 1 but does not change in the second row.
c) While moving on and calculating the score for the second row, we also check the Datetime difference: if the previous row is more than 24 hours old, it is dropped, the score is reset to 1, and the cursor moves to the third row.
d) When the Score reaches 2, we have reached the qualifying score, as in row 5 (index 4), and we copy the corresponding Datetime into the Date_of_interest column.
e) We start a new cycle for each new PC_id, as in row six.
Datetime Item_id PC_id Value Score Date_of_interest
0 1998-04-8 01:00:00 1 200 35 1 NaN
1 1998-04-8 02:00:00 1 200 92 1 NaN
2 1998-04-10 02:00:00 2 200 35 1 NaN
3 1998-04-10 02:15:00 2 200 92 1 NaN
4 1998-04-10 02:25:00 3 200 92 2 1998-04-10 02:25:00
5 1998-04-10 03:00:00 1 201 93 1 NaN
6 1998-04-12 03:30:00 3 201 94 1 NaN
7 1998-04-12 04:00:00 4 201 95 2 NaN
8 1998-04-12 04:00:00 4 201 26 2 1998-04-12 04:00:00
9 1998-04-12 04:30:00 2 201 98 3 NaN
10 1998-04-12 04:50:00 1 202 100 1 NaN
11 1998-04-15 05:00:00 4 202 100 1 NaN
12 1998-04-15 05:15:00 3 202 100 2 1998-04-15 05:15:00
13 1998-04-15 05:30:00 2 202 100 3 NaN
14 1998-04-15 06:00:00 3 202 100 NaN NaN
15 1998-04-15 06:00:00 3 202 222 NaN NaN
Final table should be as follows:
PC_id Date_of_interest
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
Thanks for helping.
Update : Code I am working on currently:
df_merged_unique = df_merged['PC_id'].unique()
score = 0
for i, row in df_merged.iterrows():
    for elem in df_merged_unique:
        first_date = row['Datetime']
        first_item = 0
        if row['PC_id'] == elem:
            if row['Score'] < 2:
                if row['Item_id'] != first_item:
                    if row['Datetime']-first_date <= pd.datetime.timedelta(days=1):
                        score += 1
                        row['Score'] = score
                        first_date = row['Datetime']
                    else:
                        pass
                else:
                    pass
            else:
                row['Date_of_interest'] = row['Datetime']
                break
        else:
            pass
Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas. Given the dataframe
In [111]: df2
Out[111]:
Datetime Item_id PC_id Value
0 1998-04-08 01:00:00 1 200 35
1 1998-04-08 02:00:00 1 200 92
2 1998-04-10 02:00:00 2 200 35
3 1998-04-10 02:15:00 2 200 92
4 1998-04-10 02:25:00 3 200 92
5 1998-04-10 03:00:00 1 201 93
6 1998-04-12 03:30:00 3 201 94
7 1998-04-12 04:00:00 4 201 95
8 1998-04-12 04:00:00 4 201 26
9 1998-04-12 04:30:00 2 201 98
10 1998-04-12 04:50:00 1 202 100
11 1998-04-15 05:00:00 4 202 100
12 1998-04-15 05:15:00 3 202 100
13 1998-04-15 05:30:00 2 202 100
14 1998-04-15 06:00:00 3 202 100
15 1998-04-15 06:00:00 3 202 222
you could first group by PC_id
In [112]: the_group = df2.groupby('PC_id')
and then apply the search using diff() to get the rows where Item_id and Datetime change appropriately
In [357]: (the_group['Item_id'].diff() != 0) & \
...: (the_group['Datetime'].diff() <= pd.Timedelta(days=1))
Out[357]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 True
10 False
11 False
12 True
13 True
14 True
15 False
dtype: bool
and then just take the first date (first match) in each group, if any
In [341]: df2[(the_group['Item_id'].diff() != 0) &
...: (the_group['Datetime'].diff() <= pd.Timedelta(days=1))]\
...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]:
PC_id Datetime
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
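Put together as a plain script, the steps above might look like the following sketch. It uses only the first nine rows of the sample (PC_id 200 and 201), and the names grouped, mask and result are illustrative. On the first row of each group both diffs are missing, so the Datetime comparison is False and that row is never selected.

```python
import pandas as pd

# First nine rows of the sample data (PC_id 200 and 201)
df2 = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "1998-04-08 01:00:00", "1998-04-08 02:00:00", "1998-04-10 02:00:00",
        "1998-04-10 02:15:00", "1998-04-10 02:25:00", "1998-04-10 03:00:00",
        "1998-04-12 03:30:00", "1998-04-12 04:00:00", "1998-04-12 04:00:00",
    ]),
    "Item_id": [1, 1, 2, 2, 3, 1, 3, 4, 4],
    "PC_id": [200] * 5 + [201] * 4,
})

grouped = df2.groupby("PC_id")

# Item changed compared with the previous row of the same PC_id,
# and that previous row is at most 24 hours older
mask = (grouped["Item_id"].diff().ne(0)
        & grouped["Datetime"].diff().le(pd.Timedelta(days=1)))

# First qualifying row per PC_id
result = (df2[mask]
          .groupby("PC_id", as_index=False)["Datetime"].first()
          .rename(columns={"Datetime": "Date_of_interest"}))
print(result)
```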
I have the following dataframe df:
timestamp objectId result
0 2015-11-24 09:00:00 Stress 3
1 2015-11-24 09:00:00 Productivity 0
2 2015-11-24 09:00:00 Abilities 4
3 2015-11-24 09:00:00 Challenge 0
4 2015-11-24 10:00:00 Productivity 87
5 2015-11-24 10:00:00 Abilities 84
6 2015-11-24 10:00:00 Challenge 58
7 2015-11-24 10:00:00 Stress 25
8 2015-11-24 11:00:00 Productivity 93
9 2015-11-24 11:00:00 Abilities 93
10 2015-11-24 11:00:00 Challenge 93
11 2015-11-24 11:00:00 Stress 19
12 2015-11-24 12:00:00 Challenge 90
13 2015-11-24 12:00:00 Abilities 96
14 2015-11-24 12:00:00 Stress 94
15 2015-11-24 12:00:00 Productivity 88
16 2015-11-24 13:00:00 Productivity 12
17 2015-11-24 13:00:00 Challenge 17
18 2015-11-24 13:00:00 Abilities 89
19 2015-11-24 13:00:00 Stress 13
I would like to achieve a barchart like the following
Where instead of a, b, c, d there would be the labels from the column objectId, the y-axis should correspond to the column result, and the x-axis should show the grouped values of the column timestamp.
I tried several things but nothing worked. This was the closest, but the plot() method doesn't take any customisation via parameters (e.g. kind='bar' doesn't work).
groups = df.groupby('objectId')
sgb = groups['result']
sgb.plot()
Any other idea?
import seaborn as sns
In [36]:
df.timestamp = df.timestamp.factorize()[0]
In [39]:
df.objectId = df.objectId.map({'Stress' : 'a' , 'Productivity' : 'b' , 'Abilities' : 'c' , 'Challenge' : 'd'})
In [41]:
df
Out[41]:
timestamp objectId result
0 0 a 3
1 0 b 0
2 0 c 4
3 0 d 0
4 1 b 87
5 1 c 84
6 1 d 58
7 1 a 25
8 2 b 93
9 2 c 93
10 2 d 93
11 2 a 19
12 3 d 90
13 3 c 96
14 3 a 94
15 3 b 88
16 4 b 12
17 4 d 17
18 4 c 89
19 4 a 13
In [40]:
sns.barplot(x = 'timestamp' , y = 'result' , hue = 'objectId' , data = df );
The answer of @NaderHisham is a very easy solution!
But just as a reference, if you for some reason cannot use seaborn, this is a pure pandas/matplotlib solution:
You need to reshape your data, so the different objectIds become the columns:
In [20]: df.set_index(['timestamp', 'objectId'])['result'].unstack()
Out[20]:
objectId Abilities Challenge Productivity Stress
timestamp
09:00:00 4 0 0 3
10:00:00 84 58 87 25
11:00:00 93 93 93 19
12:00:00 96 90 88 94
13:00:00 89 17 12 13
If you make a bar plot of this, you get the desired result:
In [24]: df.set_index(['timestamp', 'objectId'])['result'].unstack().plot(kind='bar')
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0xc44a5c0>
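As a self-contained sketch of the reshape-then-plot approach, assuming matplotlib is installed: the two-timestamp sample and the output file name grouped_bar.png are made up for illustration, and the Agg backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Hypothetical subset of the data: two timestamps, four objectIds each
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-11-24 09:00:00"] * 4
                                + ["2015-11-24 10:00:00"] * 4),
    "objectId": ["Stress", "Productivity", "Abilities", "Challenge",
                 "Productivity", "Abilities", "Challenge", "Stress"],
    "result": [3, 0, 4, 0, 87, 84, 58, 25],
})

# Long to wide: one row per timestamp, one column per objectId
wide = df.set_index(["timestamp", "objectId"])["result"].unstack()

# One group of bars per timestamp, one bar per objectId
ax = wide.plot(kind="bar")
ax.set_ylabel("result")
ax.figure.savefig("grouped_bar.png")  # hypothetical output file name
```

unstack requires each (timestamp, objectId) pair to be unique; if duplicates exist, pivot_table with an aggregation function is one way around that.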