Pandas create a column iteratively - increasing after specific threshold - python

I have a simple table in which the datetime column is formatted correctly.
Datetime             Diff
2021-01-01 12:00:00     0
2021-01-01 12:02:00     2
2021-01-01 12:04:00     2
2021-01-01 12:10:00     6
2021-01-01 12:20:00    10
2021-01-01 12:22:00     2
I would like to add a label/batch name that increases whenever the difference exceeds a specific threshold/cutoff time. The output I am hoping to achieve (with a threshold of diff > 7) is:
Datetime             Diff  Batch
2021-01-01 12:00:00     0  A
2021-01-01 12:02:00     2  A
2021-01-01 12:04:00     2  A
2021-01-01 12:10:00     6  A
2021-01-01 12:20:00    10  B
2021-01-01 12:22:00     2  B
Batch doesn't need to be 'A','B','C' - probably easier to increase numerically.
I cannot find a solution online but I'm assuming there is a method to split the table on all values below the threshold, apply the batch label and concatenate again. However I cannot seem to get it working.
Any insight appreciated :)

Since True and False count as 1 and 0 when summed, you can take a cumulative sum of the boolean column produced by df.Diff > 7:
df['Batch'] = (df.Diff > 7).cumsum()
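For reference, a minimal runnable sketch on the sample data above; the Batch values come out as 0, 0, 0, 0, 1, 1, i.e. the numeric counterpart of A/B:
import pandas as pd

# Sample data reconstructed from the question (column names assumed as shown above).
df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-01-01 12:00:00', '2021-01-01 12:02:00',
                                '2021-01-01 12:04:00', '2021-01-01 12:10:00',
                                '2021-01-01 12:20:00', '2021-01-01 12:22:00']),
    'Diff': [0, 2, 2, 6, 10, 2],
})

# Each row whose gap exceeds the threshold adds 1 to the running batch number.
df['Batch'] = (df.Diff > 7).cumsum()
print(df['Batch'].tolist())  # [0, 0, 0, 0, 1, 1]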

You can use:
df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
                            .cumsum().add(65).apply(chr)
print(df)
# Output:
Datetime Diff Batch
0 2021-01-01 12:00:00 0 A
1 2021-01-01 12:02:00 2 A
2 2021-01-01 12:04:00 2 A
3 2021-01-01 12:10:00 6 A
4 2021-01-01 12:20:00 10 B
5 2021-01-01 12:22:00 2 B
Update
For a side question: apply(chr) goes through A-Z; what method would you use to get AA, AB, ... for batches beyond 26?
Try something like this:
# Adapted from openpyxl
def chrext(i):
    s = ''
    while i > 0:
        i, r = divmod(i, 26)
        i, r = (i, r) if r > 0 else (i-1, 26)
        s += chr(r-1+65)
    return s[::-1]

df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
                            .cumsum().add(1).apply(chrext)
For demonstration purposes, if you replace 1 with 27:
>>> df
Datetime Diff Batch
0 2021-01-01 12:00:00 0 AA
1 2021-01-01 12:02:00 2 AA
2 2021-01-01 12:04:00 2 AA
3 2021-01-01 12:10:00 6 AA
4 2021-01-01 12:20:00 10 AB
5 2021-01-01 12:22:00 2 AB

You can achieve this by creating a custom grouping key with the properties you want; after grouping the values, your batch is simply the group number. You don't have to call groupby with only an existing column - you can pass it any custom key, which is really powerful.
from datetime import timedelta
df['batch'] = df.groupby((df['Datetime'] - df['Datetime'].min()) // timedelta(minutes=7)).ngroup()
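A small sketch of what this produces on the sample data; note that it bins rows into fixed 7-minute windows measured from the first timestamp, which is not quite the same as starting a new batch only when a single gap exceeds the threshold:
from datetime import timedelta
import pandas as pd

# Sample timestamps reconstructed from the question.
df = pd.DataFrame({'Datetime': pd.to_datetime(
    ['2021-01-01 12:00:00', '2021-01-01 12:02:00', '2021-01-01 12:04:00',
     '2021-01-01 12:10:00', '2021-01-01 12:20:00', '2021-01-01 12:22:00'])})

# Group rows by which 7-minute window (counted from the first timestamp) they fall into.
df['batch'] = df.groupby(
    (df['Datetime'] - df['Datetime'].min()) // timedelta(minutes=7)
).ngroup()
print(df['batch'].tolist())  # [0, 0, 0, 1, 2, 3]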

Related

Replacing or sequencing in pandas dataframe column based on previous values and other column

I have a pandas df:
date day_of_week day
2021-01-01 3 1
2021-01-02 4 2
2021-01-03 5 0
2021-01-04 6 1
2021-01-05 7 2
2021-01-06 1 3
2021-01-07 2 0
2021-01-08 3 0
I would like to change the numbering in the 'day' column based on the 'day_of_week' values. For example, if the event starts before Thursday (day_of_week < 4), I want the non-zero 'day' values to be numbered from 20 (instead of 1) onward. If the event starts on Thursday or later but before Monday (day_of_week >= 4), I want the non-zero values numbered from 30 (instead of 1) onward.
The table should look like this:
date day_of_week day
2021-01-01 3 20
2021-01-02 4 21
2021-01-03 5 0
2021-01-04 6 30
2021-01-05 7 31
2021-01-06 1 32
2021-01-07 2 0
2021-01-08 3 0
I tried to use np.where to substitute values, but I don't know how to iterate through rows and insert values based on previous rows.
Please help!
We can use cumsum to create the groups, then select 20 or 30 based on the first day_of_week of each group via transform:
s = df.groupby(df['day'].eq(1).cumsum())['day_of_week'].transform('first')
df['day'] = df.day.where(df.day==0, df.day + np.where(s<4,19,29))
df
Out[16]:
date day_of_week day
0 2021-01-01 3 20
1 2021-01-02 4 21
2 2021-01-03 5 0
3 2021-01-04 6 30
4 2021-01-05 7 31
5 2021-01-06 1 32
6 2021-01-07 2 0
7 2021-01-08 3 0
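A self-contained sketch of this approach, assuming the sample frame from the question and numpy imported as np:
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=8),
    'day_of_week': [3, 4, 5, 6, 7, 1, 2, 3],
    'day': [1, 2, 0, 1, 2, 3, 0, 0],
})

# Rows with day == 1 start a new event; cumsum gives each event's rows the same group id,
# and transform('first') broadcasts the day_of_week on which that event started.
s = df.groupby(df['day'].eq(1).cumsum())['day_of_week'].transform('first')

# Keep zeros as-is; otherwise shift the numbering by 19 or 29 depending on whether the
# event started before Thursday (day_of_week < 4) or not.
df['day'] = df['day'].where(df['day'] == 0, df['day'] + np.where(s < 4, 19, 29))
print(df)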

Create New DataFrame, assigning a count for each instance in a time frame

Below is a script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'start_date': ['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05'],
                           'end_date':   ['2021-01-03','2021-01-04','2021-02-03','2021-03-04','2021-03-05']})
plan_dates
start_date end_date
0 2021-01-01 2021-01-03
1 2021-01-02 2021-01-04
2 2021-01-03 2021-02-03
3 2021-01-04 2021-03-04
4 2021-01-05 2021-03-05
I would like to create a new DataFrame which has 2 columns:
date
count of active plans (the count of cases where the date is within the start_date & end_date in each row of the plan_dates df)
INTENDED DF:
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
Any help would be greatly appreciated.
First convert both columns to datetimes and add one day to end_date. Then repeat the index with Index.repeat using the difference in days, and add counter values created by GroupBy.cumcount via to_timedelta. Finally count with Series.value_counts, with some data cleaning, and convert to a DataFrame:
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')

s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()

df1 = (df['start_date'].add(pd.to_timedelta(counter, unit='d'))
         .value_counts()
         .sort_index()
         .rename_axis('date')
         .reset_index(name='count_active_plans'))
print(df1)
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
.. ... ...
59 2021-03-01 2
60 2021-03-02 2
61 2021-03-03 2
62 2021-03-04 2
63 2021-03-05 1
[64 rows x 2 columns]
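An alternative, more compact sketch of the same idea (expand each plan into the dates it covers, then count per date), assuming the original plan_dates frame with string columns, i.e. before the one-day shift applied above:
import pandas as pd

plan_dates = pd.DataFrame({'start_date': ['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05'],
                           'end_date':   ['2021-01-03','2021-01-04','2021-02-03','2021-03-04','2021-03-05']})

# One list of covered dates per plan; pd.date_range includes the end date.
dates = plan_dates.apply(lambda r: list(pd.date_range(r['start_date'], r['end_date'])), axis=1)

df1 = (dates.explode()
            .value_counts()
            .sort_index()
            .rename_axis('date')
            .reset_index(name='count_active_plans'))
print(df1)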

Python Pandas pattern searching

Good evening,
I have a question about detecting a certain pattern. I don't know whether there is specific terminology for it.
I have a pandas dataframe like this:
0 1 ... 8
0 date price ... pattern
1 2021-01-01 31.18 ... 0
2 2021-01-02 20.32 ... 1
3 2021-01-03 10.32 ... 1
4 2021-01-04 21.32 ... -1
5 2021-01-05 44.32 ... 0
6 2021-01-06 45.32 ... -1
7 2021-01-07 41.32 ... 1
8 2021-01-08 78.32 ... -1
9 2021-01-09 44.32 ... 1
10 2021-01-10 123.32 ... 1
11 2021-01-11 25.32 ... -1
How can I detect the pattern where a -1 follows a 1, e.g. in an if statement?
For example:
Grab the price column at indexes 3 and 4, because the pattern column is 1 at index 3 and -1 at index 4, which matches my condition.
Next would be indexes 7 and 8, then indexes 10 and 11.
I have probably phrased my question rather vaguely, but I don't really know how else to describe it.
You can use the three following solutions; the first and second ones are more pandas-idiomatic:
First:
prices = df.where((df.pattern==-1)&(df.pattern.shift()==1)).dropna().price
Second:
df['pattern2'] = df.pattern.shift()
# Selecting just the prices that meet the condition
prices = df.loc[df.apply(lambda x: True if ((x['pattern'] == -1) & (x['pattern2'] == 1)) else False, axis=1), 'price']
Third:
prices = df.loc[(df.pattern - df.pattern.shift() == -2), 'price']
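A self-contained sketch (data reconstructed from the question's table, zero-based index) that applies the same condition via plain boolean indexing:
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=11),
    'price': [31.18, 20.32, 10.32, 21.32, 44.32, 45.32, 41.32, 78.32, 44.32, 123.32, 25.32],
    'pattern': [0, 1, 1, -1, 0, -1, 1, -1, 1, 1, -1],
})

# Rows where pattern is -1 and the previous row's pattern was 1.
prices = df.loc[(df.pattern == -1) & (df.pattern.shift() == 1), 'price']
print(prices)  # indexes 3, 7, 10 -> 21.32, 78.32, 25.32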
You can try shift and check for a match:
df['pattern_2'] = df['pattern'].shift(1)
df_new = df.iloc[[j for i in df.loc[(df['pattern'] == -1) & (df['pattern_2'] == 1), :].index for j in range(i-1, i+1)], :]
print(df_new)
date price pattern pattern_2
2 2021-01-03 10.32 1 1.0
3 2021-01-04 21.32 -1 1.0
6 2021-01-07 41.32 1 -1.0
7 2021-01-08 78.32 -1 1.0
9 2021-01-10 123.32 1 1.0
10 2021-01-11 25.32 -1 1.0
You can use Series.diff and Series.shift with boolean indexing.
m = df['pattern'].diff(-1).eq(2)
df[m|m.shift()]
date price pattern
3 2021-01-03 10.32 1
4 2021-01-04 21.32 -1
7 2021-01-07 41.32 1
8 2021-01-08 78.32 -1
10 2021-01-10 123.32 1
11 2021-01-11 25.32 -1
Details
df.pattern.diff(-1) calculates the difference between the i-th element and the (i+1)-th element, so when the i-th element is 1 and the (i+1)-th is -1 the output is 2 (i.e. 1 - (-1)).
.eq(2) marks True where the difference is 2.
m | m.shift() selects the i-th row as well as the (i+1)-th row.
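A tiny illustration of these intermediate steps on a hypothetical standalone pattern series:
import pandas as pd

s = pd.Series([0, 1, -1, 0, 1, 1, -1])
m = s.diff(-1).eq(2)              # True on rows where the next value is 2 lower, i.e. a 1 followed by -1
print(m.tolist())                 # [False, True, False, False, False, True, False]
print(s[m | m.shift()].tolist())  # [1, -1, 1, -1] -> both the 1 row and the following -1 row are kept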

Pandas GroupBy with CumulativeSum per Contiguous Groups

I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output, where every time we reach an 'Event' the 'Abnormal State Grouping' resets and counts from 0 again.
We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).
I've written the following code to get my AbnormalStateGroupings to increment into relevant groupings for my analysis which has worked fine. However, we want to 'reset' the count of our 'AbnormalStates' after each event (i.e. lightbulb failure):
dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState']!=dataframe['AbnormalState'].shift()).cumsum()
I have created an additional column which let's me know what 'event' we are at via:
dataframe['Event_Or_Not'].cumsum() #I have a boolean representation of the Event Column represented and we use .cumsum() to get the relevant groupings (i.e. 1st Event, 2nd Event, 3rd Event etc.)
I've come close previously using the following:
eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift())&eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] = dataframe.groupby(['Event', eventMask]).cumcount().add(1)
However, this hasn't given me the desired output that I'm after (as per below).
I think I'm close however - Could anyone please advise what I could try to do next so that for each lightbulb failure, the abnormal state count resets and starts counting the # of abnormal states we have gone through before the next lightbulb failure?
State I want to get to with AbnormalStateGrouping
You would note that when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.
Current State of Dataframe
Please find an attached data source below:
https://filebin.net/ctjwk7p3gulmbgkn
I assume that your source DataFrame has only Date/Time (either string
or datetime), Event (string) and AbnormalState (int) columns.
To compute your grouping column, run:
dataframe['AbnormalStateGrouping'] = dataframe.groupby(
    dataframe['Event'][::-1].notnull().cumsum()).AbnormalState\
    .apply(lambda grp: (grp != grp.shift()).cumsum())
The result, for your initial source data, included as a picture, is:
Date/Time Event AbnormalState AbnormalStateGrouping
0 2018-01-01 01:00 NaN 0 1
1 2018-01-01 02:00 NaN 0 1
2 2018-01-01 03:00 NaN 1 2
3 2018-01-01 04:00 NaN 1 2
4 2018-01-01 05:00 NaN 0 3
5 2018-01-01 06:00 NaN 0 3
6 2018-01-01 07:00 NaN 0 3
7 2018-01-01 08:00 NaN 1 4
8 2018-01-01 09:00 NaN 1 4
9 2018-01-01 10:00 NaN 0 5
10 2018-01-01 11:00 NaN 0 5
11 2018-01-01 12:00 NaN 0 5
12 2018-01-01 13:00 NaN 1 6
13 2018-01-01 14:00 NaN 1 6
14 2018-01-01 15:00 NaN 0 7
15 2018-01-01 16:00 Event 0 7
16 2018-01-01 17:00 NaN 1 1
17 2018-01-01 18:00 NaN 1 1
18 2018-01-01 19:00 NaN 0 2
19 2018-01-01 20:00 NaN 0 2
Note the way of grouping:
dataframe['Event'][::-1].notnull().cumsum()
Due to [::-1], the cumsum is computed from the last row to the first.
Thus:
rows with hours 01:00 through 16:00 are in group 1,
the remaining rows (hours 17:00 through 20:00) are in group 0.
Then a lambda function is applied to AbnormalState separately for each group, so each cumulative sum restarts from 1 within each group (after each Event).
Edit following the comment as of 22:18:12Z
The reason why I compute the cumsum for grouping in reversed order
is that when you run it in normal order:
dataframe['Event'].notnull().cumsum()
then:
rows with index 0 thru 14 (before the row with Event) have
this sum == 0,
row with index 15 and following rows have this sum == 1.
Try yourself both versions, without and with [::-1].
The result in normal order (without [::-1]) is that:
the Event row is in the same group as the following rows,
so the reset occurs already on that row.
To check the whole result, run my code without [::-1] and you will see
that the ending part of the result contains:
Date/Time Event AbnormalState AbnormalStateGrouping
14 2018-01-01 15:00:00 NaN 0 7
15 2018-01-01 16:00:00 Event 0 1
16 2018-01-01 17:00:00 NaN 1 2
17 2018-01-01 18:00:00 NaN 1 2
18 2018-01-01 19:00:00 NaN 0 3
19 2018-01-01 20:00:00 NaN 0 3
so that the Event row has AbnormalStateGrouping == 1.
But you want this row to continue the sequence of the previous grouping states (in this case 7), with the reset occurring from the next row on.
So the Event row should be in the same group as the preceding rows, which is what my code produces.
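A tiny standalone illustration of the difference, using a hypothetical 6-row Event column with the event on the 4th row:
import numpy as np
import pandas as pd

event = pd.Series([np.nan, np.nan, np.nan, 'Event', np.nan, np.nan])

# Normal order: the Event row lands in the same group as the rows AFTER it.
print(event.notnull().cumsum().tolist())                     # [0, 0, 0, 1, 1, 1]

# Reversed order: the Event row stays with the rows BEFORE it, so the reset starts on the next row.
print(event[::-1].notnull().cumsum().sort_index().tolist())  # [1, 1, 1, 1, 0, 0]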

How to count by time frequency using groupby - pandas

I'm trying to count the frequency of 2 events by month using 2 columns from my df. What I have done so far counts all events by their unique time, which is not efficient enough as there are too many results. I wish to create a graph from the results afterwards.
I've tried adapting my code based on the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but I cannot seem to get the command working when I pass freq='day' to the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170000 results in the structure of the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly for other values as well, for example hour/day/month/year. With the answers please could you explain what is going on in your code as I am new and learning pandas and wish to understand the process. Thank you.
One possible solution is to convert the datetime column to monthly periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
np.random.seed(123)
df = pd.DataFrame({'Create Time': pd.date_range('2019-01-01', freq='10D', periods=10),
                   'Priority': np.random.choice([0, 1], size=10)})
print(df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
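As a sketch of how the same pattern extends to other frequencies (standard pandas offset/period aliases; not exhaustive):
# Per day / per year with to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('D')]).Priority.count())
print(df.groupby(['Priority', df['Create Time'].dt.to_period('Y')]).Priority.count())

# Per hour / per day with Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='H')]).Priority.count())
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='D')]).Priority.count())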
