I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output where every time we reach an 'Event', the 'Abnormal State Grouping' resets and counts from 0 again.
We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).
I've written the following code to get my AbnormalStateGroupings to increment into relevant groupings for my analysis which has worked fine. However, we want to 'reset' the count of our 'AbnormalStates' after each event (i.e. lightbulb failure):
dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState']!=dataframe['AbnormalState'].shift()).cumsum()
I have created an additional column which lets me know what 'event' we are at via:
dataframe['Event_Or_Not'].cumsum()  # the Event column is represented as a boolean, and .cumsum() gives the relevant groupings (i.e. 1st Event, 2nd Event, 3rd Event, etc.)
I've come close previously using the following:
eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift()) & eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] = dataframe.groupby(['Event', eventMask]).cumcount().add(1)
However, this hasn't given me the desired output that I'm after (as per below).
I think I'm close, however. Could anyone please advise what I could try next so that for each lightbulb failure, the abnormal state count resets and starts counting the number of abnormal states we go through before the next lightbulb failure?
State I want to get to with AbnormalStateGrouping
You would note that when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.
Current State of Dataframe
Please find an attached data source below:
https://filebin.net/ctjwk7p3gulmbgkn
I assume that your source DataFrame has only Date/Time (either string
or datetime), Event (string) and AbnormalState (int) columns.
To compute your grouping column, run:
# Group on the cumulative count of non-null Event values computed in reverse
# order ([::-1]), then restart the state counter within each group:
dataframe['AbnormalStateGrouping'] = dataframe.groupby(
    dataframe['Event'][::-1].notnull().cumsum()).AbnormalState\
    .apply(lambda grp: (grp != grp.shift()).cumsum())
The result, for your initial source data (attached as a picture), is:
Date/Time Event AbnormalState AbnormalStateGrouping
0 2018-01-01 01:00 NaN 0 1
1 2018-01-01 02:00 NaN 0 1
2 2018-01-01 03:00 NaN 1 2
3 2018-01-01 04:00 NaN 1 2
4 2018-01-01 05:00 NaN 0 3
5 2018-01-01 06:00 NaN 0 3
6 2018-01-01 07:00 NaN 0 3
7 2018-01-01 08:00 NaN 1 4
8 2018-01-01 09:00 NaN 1 4
9 2018-01-01 10:00 NaN 0 5
10 2018-01-01 11:00 NaN 0 5
11 2018-01-01 12:00 NaN 0 5
12 2018-01-01 13:00 NaN 1 6
13 2018-01-01 14:00 NaN 1 6
14 2018-01-01 15:00 NaN 0 7
15 2018-01-01 16:00 Event 0 7
16 2018-01-01 17:00 NaN 1 1
17 2018-01-01 18:00 NaN 1 1
18 2018-01-01 19:00 NaN 0 2
19 2018-01-01 20:00 NaN 0 2
Note the way of grouping:
dataframe['Event'][::-1].notnull().cumsum()
Due to [::-1], the cumulative sum is computed from the last row
to the first.
Thus:
rows with hours 01:00 through 16:00 are in group 1,
the remaining rows (hours 17:00 through 20:00) are in group 0.
Then the lambda function is applied to AbnormalState separately for each
group, so the cumulative state count restarts from 1 within each group
(i.e. after each Event).
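To see the grouping key on its own, you can run just the reversed expression; on the sample data above it should come out as follows (my quick check, not part of the original answer):
key = dataframe['Event'][::-1].notnull().cumsum()
print(key.sort_index())
# expected for the 20-row sample: rows 0-15 -> group 1, rows 16-19 -> group 0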
Edit following the comment as of 22:18:12Z
The reason why I compute the cumsum for grouping in reversed order
is that when you run it in normal order:
dataframe['Event'].notnull().cumsum()
then:
rows with index 0 through 14 (before the row with Event) have
this sum == 0,
the row with index 15 and the following rows have this sum == 1.
Try both versions yourself, with and without [::-1].
The result in normal order (without [::-1]) is that:
the Event row is in the same group as the following rows,
so the reset occurs already on this row.
To check the whole result, run my code without [::-1] and you will see
that the ending part of the result contains:
Date/Time Event AbnormalState AbnormalStateGrouping
14 2018-01-01 15:00:00 NaN 0 7
15 2018-01-01 16:00:00 Event 0 1
16 2018-01-01 17:00:00 NaN 1 2
17 2018-01-01 18:00:00 NaN 1 2
18 2018-01-01 19:00:00 NaN 0 3
19 2018-01-01 20:00:00 NaN 0 3
so the Event row has AbnormalStateGrouping == 1.
But you want this row to keep the AbnormalStateGrouping of the preceding
sequence of grouping states (7 in this case), with the reset occurring from
the next row on.
So the Event row should be in the same group as the preceding rows, which
is what my code produces.
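As a quick way to see the difference, you can compute both keys side by side; a small sketch reusing the same dataframe (the labels forward/reversed are just names I chose):
import pandas as pd
key_forward = dataframe['Event'].notnull().cumsum()           # Event row joins the FOLLOWING rows
key_reversed = dataframe['Event'][::-1].notnull().cumsum()    # Event row joins the PRECEDING rows
# concat aligns on the index, so the reversed key lines up with the original row order
print(pd.concat([dataframe['Event'],
                 key_forward.rename('forward'),
                 key_reversed.rename('reversed')], axis=1))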
Related
I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
# derive a year column from the YYYYMM values in the 'index' column
df['year'] = (df['index'].astype(int) / 100).astype(int)

def get_stavg(df, year, month):
    # rows from the five years before `year` with the same month
    # (note: query references local variables with an @ prefix)
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
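Note that DataFrame.query references local Python variables with an @ prefix, which is what get_stavg relies on. A minimal, self-contained illustration with made-up numbers:
import pandas as pd

tmp = pd.DataFrame({'year': [2015, 2016, 2019, 2020],
                    'month': ['01', '01', '01', '02'],
                    'st': [10.0, 20.0, 30.0, 40.0]})
year, month = 2020, '01'
# keeps rows whose year falls in the five years before `year` and whose month matches
print(tmp.query('@year - 5 <= year < @year and month == @month'))  # 2015, 2016 and 2019 rows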
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
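One caveat (my note, not part of the original answer): rolling(5) includes the current row, so st_mean for 202001 averages 2016 through 2020 rather than the previous five Januaries the question asks for. If you want to exclude the current year, a possible tweak is to shift within each month group (the column name st_mean_prev5 is just my label):
# transform keeps the result aligned with the original index;
# shift() drops the current year, so only the previous five same-month values are averaged
df["st_mean_prev5"] = df.groupby("month")["st"].transform(
    lambda s: s.rolling(5).mean().shift())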
I have a data set where I'm tracking user interactions. If the same user has two interactions in the same half-hour period, I want to count that as a single interaction, so I need to calculate the difference in time between records in a dataframe.
I'm doing this in Pandas. Assume the data is sorted by user_id, then datetime stamp. The calculation needs to reset when a new ID is encountered, so the previous ID needs to be stored to compare against the current ID. Here's the desired output:
user id datetime desired column: minute diff from prior timestamp
1 2020-03-27T12:29:00 NAN
1 2020-03-27T12:31:00 2
1 2020-03-27T14:03:00 92
1 2020-03-27T14:27:00 24
2 2020-03-27T11:29:00 NAN
2 2020-03-27T14:29:00 180
2 2020-03-27T14:54:00 25
2 2020-03-27T18:20:00 206
I've tried playing around with pandas.DataFrame.rolling, but I either severely misunderstand its usage (possible!) or it just doesn't have the functionality I'm looking for.
Thanks!
Group by the user column, take a pandas.Series.diff, then convert to total minutes.
df['datetime'] = pd.to_datetime(df['datetime'])
# per-user time difference from the previous row, expressed in minutes
df['output'] = df.groupby('user').datetime.diff().dt.total_seconds().div(60)
Output
user datetime output
0 1 2020-03-27 12:29:00 NaN
1 1 2020-03-27 12:31:00 2.0
2 1 2020-03-27 14:03:00 92.0
3 1 2020-03-27 14:27:00 24.0
4 2 2020-03-27 11:29:00 NaN
5 2 2020-03-27 14:29:00 180.0
6 2 2020-03-27 14:54:00 25.0
7 2 2020-03-27 18:20:00 206.0
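If you then want to apply the half-hour rule from the question, one possible follow-up (a sketch building on the output column above; the column name new_interaction is my own) is to treat any row with no previous record or a gap of 30+ minutes as the start of a new interaction and count those per user:
# NaN (first record for a user) or a gap of at least 30 minutes starts a new interaction
df['new_interaction'] = df['output'].isna() | (df['output'] >= 30)
print(df.groupby('user')['new_interaction'].sum())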
I'm trying to count the frequency of 2 event types by month, using 2 columns from my df. What I have done so far counts all events by their unique time, which is not efficient enough as there are too many results. I wish to create a graph with the results afterwards.
I've tried adapting my code based on the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but cannot seem to get the command working when I input freq='day' within the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170000 results in the structure of the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly to other values as well, for example hour/day/month/year. With your answers, please could you explain what is going on in your code, as I am new to pandas, still learning, and wish to understand the process. Thank you.
One possible solution is to convert the datetime column to month periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
np.random.seed(123)
df = pd.DataFrame({'Create Time': pd.date_range('2019-01-01', freq='10D', periods=10),
                   'Priority': np.random.choice([0, 1], size=10)})
print (df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
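Changing the frequency is just a matter of swapping the period code or the Grouper freq; for example, reusing the sample df above (period codes 'H', 'D', 'M', 'Y' give hour, day, month, year):
# per day and per year with to_period
print(df.groupby(['Priority', df['Create Time'].dt.to_period('D')]).Priority.count())
print(df.groupby(['Priority', df['Create Time'].dt.to_period('Y')]).Priority.count())
# the same per-day grouping with Grouper
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='D')]).Priority.count())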
I'm currently working with some data that I receive from an engineering plant; the data comes out (roughly) as the following:
df = pd.DataFrame({'ID': np.random.randint(1, 25, size=5),
                   'on/off': np.random.randint(0, 2, size=5),
                   'Time': pd.date_range(start='01/01/2019', periods=5, freq='5s')})
print(df)
ID on/off Time
0 17 0 2019-01-01 00:00:00
1 21 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 1 2019-01-01 00:00:15
4 12 0 2019-01-01 00:00:20
the 0 and 1 in the on/off column correspond to when a machine is on or off (0 = on, 1 = off)
currently, I use the following line of beautiful code to get the rolling difference between consecutive rows of the Time column:
df['Time Difference'] = df['Time'] - df['Time'].shift()
print(df)
ID on/off Time Time Difference
0 17 0 2019-01-01 00:00:00 NaT
1 21 0 2019-01-01 00:00:05 00:00:05
2 12 1 2019-01-01 00:00:10 00:00:05
3 12 1 2019-01-01 00:00:15 00:00:05
4 12 0 2019-01-01 00:00:20 00:00:05
Now, as this dataframe is quite large (each week I'll receive about 150k rows),
what would be the best way to sum the amount of time a machine is off (where df['on/off'] == 1) until the next 0 comes along? So in the above example, for the 1st of January 2019, the machine with ID 12 didn't run for 15 seconds until it resumed at 00:00:20.
Here's an approach that works for a simple example of one machine that varies between on and off during the course of one day. It works regardless of whether the machine is in on or off state in the first row.
df = pd.DataFrame({'ID': [12, 12, 12, 12, 12],
                   'on/off': [0, 0, 1, 0, 1],
                   'Time': ['2019-01-01 00:00:00', '2019-01-01 00:00:05',
                            '2019-01-01 00:00:10', '2019-01-01 00:00:15',
                            '2019-01-01 00:00:20']})
ID on/off Time
0 12 0 2019-01-01 00:00:00
1 12 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 0 2019-01-01 00:00:15
4 12 1 2019-01-01 00:00:20
First I made sure the Time column dtype is datetime64:
df['Time'] = pd.to_datetime(df['Time'])
Then I get the indices of all rows where the state changed (either from off to on, or from on to off):
s = df[df['on/off'].shift(1) != df['on/off']].index
df = df.loc[s]
Then I create a column called time shift, which shows the timestamp of the most recent row where power state changed:
df['time shift'] = df['Time'].shift(1)
At this point the dataframe looks like this:
ID on/off Time time shift
0 12 0 2019-01-01 00:00:00 NaT
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
3 12 0 2019-01-01 00:00:15 2019-01-01 00:00:10
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
Now, since we want to count the duration that the machine was off, I look at only the row indices where the state became on:
r = df[df['on/off'] == 1].index
df = df.loc[r]
At this point, the dataframe looks as it does below. Notice that the time shift column is displaying the point at which the machine most recently turned off, prior to the time being displayed in Time column, which is the timestamp when the machine turned back on. Finding the difference between these two columns will give us the length of each duration that the machine was off during the day:
ID on/off Time time shift
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
The following line calculates total off-time, by summing the durations of each period that the machine was in its off state:
(df['Time'] - df['time shift']).sum()
Which outputs:
Timedelta('0 days 00:00:15')
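The question's real data contains many machine IDs, so (my addition, a sketch rather than part of the original answer) the same steps can be wrapped in a function and applied per machine with groupby on the original frame, assuming it is sorted by ID and then Time:
def off_duration(grp):
    # keep only the rows where the state changed, exactly as above
    changes = grp[grp['on/off'].shift(1) != grp['on/off']].copy()
    changes['time shift'] = changes['Time'].shift(1)
    # rows where the state switched to 1, as in the single-machine example
    switched = changes[changes['on/off'] == 1]
    return (switched['Time'] - switched['time shift']).sum()

total_per_machine = df.groupby('ID').apply(off_duration)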
Some additional context on how the Pandas .shift() method works:
shift takes all the values in a column and moves them by a certain number of rows. .shift(1) moves the values down by one row, so each row sees the value that used to sit one row above it; .shift(-1) moves the values up by one row, so each row sees the value from the row below it. Put another way, .shift(1) lets you look at a column's value at the previous row index, and .shift(-1) lets you look at the value at the next row index, relative to a given row. It's a handy way to compare a column's values across different rows without resorting to for-loops.
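A tiny example of the two directions (illustrative values only):
import pandas as pd

s = pd.Series([10, 20, 30])
print(s.shift(1))   # NaN, 10.0, 20.0 -> each row sees the previous row's value
print(s.shift(-1))  # 20.0, 30.0, NaN -> each row sees the next row's value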
I have a dataframe indexed using a 12hr frequency datetime:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 0
2007-09-28 12:00:00 NaN NaN 0
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28' from which I wish to update all 'ls' values from 0 to 1.
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 1
2007-09-28 12:00:00 NaN NaN 1
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done using another column variable, i.e.:
data.loc[data.id == '1', 'ls'] = 1
yet this does not work with a datetime index.
Could you let me know what the method for datetime index is?
You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
then you can modify your df using:
# select the rows whose calendar date is in the list and set ls to 1
df.loc[pd.DatetimeIndex(df.index.date).isin(days), 'ls'] = 1
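A minimal end-to-end check of that line, on a small frame shaped like the one in the question (the index values here are my own example):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, None, None],
                   'mm': [0, 0, 15, None, None],
                   'ls': 0},
                  index=pd.date_range('2007-09-27', periods=5, freq='12H'))
days = ['2007-09-28']
df.loc[pd.DatetimeIndex(df.index.date).isin(days), 'ls'] = 1
print(df)
# ls becomes 1 for 2007-09-28 00:00 and 2007-09-28 12:00, and stays 0 elsewhere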