So I am trying to figure out how I can identify consecutive repeating values in a data frame column in Python, and then be able to set a number for how many consecutive repeating values I am looking for. I will explain further here.
I have the following data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 1
2015-03-11 09:00:00 1
2015-03-11 10:00:00 0
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 0
2015-03-11 15:00:00 0
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".
Now I have this data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 0
2015-03-11 09:00:00 0
2015-03-11 10:00:00 1
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 1
2015-03-11 15:00:00 1
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".
And of course, if the answer is "No", then I would want to return a "False"
How can this be done in Python? What is this process even called? And how can you set it up so that the number of consecutive values being looked for can be changed?
First, you can use .shift() to create a new column that holds the same values as your Value column, shifted down by one row:
df["Value_shif"] = df["Value"].shift()
output:
DateTime Value Value_shif
0 2015-03-11 06:00:00 1 NaN
1 2015-03-11 07:00:00 1 1.0
2 2015-03-11 08:00:00 0 1.0
3 2015-03-11 09:00:00 1 0.0
Then you can compare them and get True/False:
df["Value"] == df["Value_shif"]
output:
0 False
1 True
2 False
3 False
Then sum the number of repeating values:
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()
cumsum() will treat True as 1 and False as 0
output:
DateTime Value Value_shif count
0 2015-03-11 06:00:00 1 NaN 0
1 2015-03-11 07:00:00 1 1.0 1
2 2015-03-11 08:00:00 0 1.0 1
3 2015-03-11 09:00:00 1 0.0 1
If the sum is larger than 1, you have consecutive repeating values.
Once you have this info, you can filter the dataframe under specific conditions and check whether a specific value occurs at least a certain number of times:
def check(dataframe, value, number_of_times):
    """
    Check for condition
    """
    df = dataframe.copy()
    df = df[df['Value'] == value]
    if df["count"].max() >= number_of_times:
        return True
    else:
        return False
print(check(df, 1, 1))
True
print(check(df, 0, 3))
False
You'll need to check for specific boundary conditions to make sure everything works as intended. The problem with shift() is that it creates NaN as the first value and removes the last value from the column...
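If the leading NaN from shift() is a concern, one option is to pass fill_value so the first comparison becomes a clean False instead of NaN. A minimal sketch, assuming -1 (or any other sentinel) can never occur as a real value in the column:

df["Value_shif"] = df["Value"].shift(fill_value=-1)  # -1 is an assumed impossible sentinel
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()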
To detect consecutive runs in the series, we first detect the turning points by looking at the locations where the difference with the previous entry isn't 0. Then the cumulative sum of this marks the groups:
# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 5
But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:
>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs
0 NaN
1 NaN
2 2.0
3 2.0
4 NaN
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard those groups that are not of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get the size of each group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable sequence is found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True
>>> is_consec_found(df.Value, value=0, threshold=5)
False
>>> is_consec_found(df.Value, value=1, threshold=2)
True
>>> is_consec_found(df.Value, value=1, threshold=3)
False
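For reuse in a single expression, the same idea can also be condensed; a minimal sketch (the short function name is mine, behaviour assumed equivalent to is_consec_found above):

def is_consec_found_short(series, value=0, threshold=2):
    # runs of identical values share one group id
    groups = series.diff().ne(0).cumsum()
    # runs are homogeneous, so summing the boolean mask per group gives the
    # run length for runs of `value` and 0 for every other run
    return series.eq(value).groupby(groups).sum().ge(threshold).any()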
Related
I have a data frame that looks like this.
ID  Start       End
1   2020-12-13  2020-12-20
1   2020-12-26  2021-01-20
1   2020-02-20  2020-02-21
2   2020-12-13  2020-12-20
2   2021-01-11  2021-01-20
2   2021-02-15  2021-02-26
Using pandas, I am trying to group by ID and then subtract the start date of the current row from the end date of the previous row.
If the difference is greater than 5, it should return True.
I'm new to pandas, and I've been trying to figure this out all day.
Two assumptions:
By difference greater than 5, you mean 5 days
You mean the absolute difference
So I am starting with this dataframe to which I added the column 'above_5_days'.
df
ID start end above_5_days
0 1 2020-12-13 2020-12-20 None
1 1 2020-12-26 2021-01-20 None
2 1 2020-02-20 2020-02-21 None
3 2 2020-12-13 2020-12-20 None
4 2 2021-01-11 2021-01-20 None
5 2 2021-02-15 2021-02-26 None
This is the groupby object that will be used to apply the operation to each ID group:
id_grp = df.groupby("ID")
The following is the operation that will be applied to each subset:
def calc_diff(x):
    # this shifts the end times down by one row to align the current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # sets the new column to True/False depending on the condition
    # if you don't want the absolute difference, remove .abs()
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    return x
Now apply this to the whole groupby object and store the result in newdf:
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 False
1 1 2020-12-26 2021-01-20 True
2 1 2020-02-20 2020-02-21 True
3 2 2020-12-13 2020-12-20 False
4 2 2021-01-11 2021-01-20 True
5 2 2021-02-15 2021-02-26 True
I should point out that in this case the only False values come from the first row of each group: shifting the end column down puts a NaN there, and subtracting from NaN yields NaN, which compares as False. So those False values are really just the boolean version of None.
That is why I would personally change the function to:
def calc_diff(x):
    # this shifts the end times down by one row to align the current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # sets the new column to True/False depending on the condition
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    x.loc[to_subtract_from.isna(), "above_5_days"] = None
    return x
When rerunning this, you can see that the extra line right before the return statement will set the value in the new column to NaN if the shifted end times are NaN.
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 NaN
1 1 2020-12-26 2021-01-20 1.0
2 1 2020-02-20 2020-02-21 1.0
3 2 2020-12-13 2020-12-20 NaN
4 2 2021-01-11 2021-01-20 1.0
5 2 2021-02-15 2021-02-26 1.0
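For what it's worth, the same result can be obtained without apply by shifting within each group directly. A sketch under the same two assumptions (absolute difference, 5 days; start and end assumed to already be datetime columns):

prev_end = df.groupby("ID")["end"].shift()  # previous row's end within each ID
df["above_5_days"] = (prev_end - df["start"]).abs() > pd.Timedelta(days=5)
df.loc[prev_end.isna(), "above_5_days"] = None  # keep the first row of each group undefined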
I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output, where every time we reach an 'Event', the 'Abnormal State Grouping' resets and starts counting again.
We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).
I've written the following code to get my AbnormalStateGroupings to increment into relevant groupings for my analysis which has worked fine. However, we want to 'reset' the count of our 'AbnormalStates' after each event (i.e. lightbulb failure):
dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState']!=dataframe['AbnormalState'].shift()).cumsum()
I have created an additional column which lets me know which 'event' we are at via:
dataframe['Event_Or_Not'].cumsum() #I have a boolean representation of the Event Column represented and we use .cumsum() to get the relevant groupings (i.e. 1st Event, 2nd Event, 3rd Event etc.)
I've come close previously using the following:
eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift())&eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] =dataframe.groupby(['Event',eventMask]).cumcount().add(1)
However, this hasn't given me the desired output that I'm after (as per below).
I think I'm close however - Could anyone please advise what I could try to do next so that for each lightbulb failure, the abnormal state count resets and starts counting the # of abnormal states we have gone through before the next lightbulb failure?
State I want to get to with AbnormalStateGrouping
You would note that when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.
Current State of Dataframe
Please find an attached data source below:
https://filebin.net/ctjwk7p3gulmbgkn
I assume that your source DataFrame has only Date/Time (either string
or datetime), Event (string) and AbnormalState (int) columns.
To compute your grouping column, run:
dataframe['AbnormalStateGrouping'] = dataframe.groupby(
    dataframe['Event'][::-1].notnull().cumsum()).AbnormalState\
    .apply(lambda grp: (grp != grp.shift()).cumsum())
The result, for your initial source data, included as a picture, is:
Date/Time Event AbnormalState AbnormalStateGrouping
0 2018-01-01 01:00 NaN 0 1
1 2018-01-01 02:00 NaN 0 1
2 2018-01-01 03:00 NaN 1 2
3 2018-01-01 04:00 NaN 1 2
4 2018-01-01 05:00 NaN 0 3
5 2018-01-01 06:00 NaN 0 3
6 2018-01-01 07:00 NaN 0 3
7 2018-01-01 08:00 NaN 1 4
8 2018-01-01 09:00 NaN 1 4
9 2018-01-01 10:00 NaN 0 5
10 2018-01-01 11:00 NaN 0 5
11 2018-01-01 12:00 NaN 0 5
12 2018-01-01 13:00 NaN 1 6
13 2018-01-01 14:00 NaN 1 6
14 2018-01-01 15:00 NaN 0 7
15 2018-01-01 16:00 Event 0 7
16 2018-01-01 17:00 NaN 1 1
17 2018-01-01 18:00 NaN 1 1
18 2018-01-01 19:00 NaN 0 2
19 2018-01-01 20:00 NaN 0 2
Note the way of grouping:
dataframe['Event'][::-1].notnull().cumsum()
Due to [::-1], the cumsum is computed from the last row to the first.
Thus:
rows with hours 01:00 thru 16:00 are in group 1,
remaining rows (hour 17:00 thru 20:00) are in group 0.
Then a lambda function is applied to AbnormalState separately for each group, so each cumulative sum restarts from 1 within its own group (i.e. after each Event).
Edit following the comment as of 22:18:12Z
The reason why I compute the cumsum for grouping in reversed order
is that when you run it in normal order:
dataframe['Event'].notnull().cumsum()
then:
rows with index 0 thru 14 (before the row with Event) have
this sum == 0,
row with index 15 and following rows have this sum == 1.
Try yourself both versions, without and with [::-1].
The result in normal order (without [::-1]) is that:
the Event row is in the same group as the following rows, so the reset occurs already on this row.
To check the whole result, run my code without [::-1] and you will see
that the ending part of the result contains:
Date/Time Event AbnormalState AbnormalStateGrouping
14 2018-01-01 15:00:00 NaN 0 7
15 2018-01-01 16:00:00 Event 0 1
16 2018-01-01 17:00:00 NaN 1 2
17 2018-01-01 18:00:00 NaN 1 2
18 2018-01-01 19:00:00 NaN 0 3
19 2018-01-01 20:00:00 NaN 0 3
so the Event row has AbnormalStateGrouping == 1.
But you want this row to continue the sequence of the preceding grouping states (in this case 7), with the reset only occurring from the next row on.
So the Event row should be in the same group as the preceding rows, which is exactly what my code produces.
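A tiny self-contained sketch (a made-up five-row frame, column names as assumed above) makes the difference between the two grouping keys visible:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Event': [np.nan, np.nan, 'Event', np.nan, np.nan],
                    'AbnormalState': [0, 1, 0, 1, 1]})

forward = toy['Event'].notnull().cumsum()        # Event row opens a new group
reverse = toy['Event'][::-1].notnull().cumsum()  # Event row closes the current group

print(pd.concat({'forward': forward, 'reverse': reverse}, axis=1))
#    forward  reverse
# 0        0        1
# 1        0        1
# 2        1        1
# 3        1        0
# 4        1        0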
I have a data set (sample) like below
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 0
2019-05-04 0
2019-05-05 0
2019-05-06 0
2019-05-07 0
2019-05-08 1
2019-05-09 0
I want to transform it such that, if I encounter Value=1, then I take the 3 values from 2 days before and fill it as 1. Also set the current value to be 0.
In other words, the transformed data set should look like this
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 1
2019-05-04 1
2019-05-05 1
2019-05-06 0
2019-05-07 0
2019-05-08 0
2019-05-09 0
Do notice, that in the example above, 2019-05-08 was set to 0 after transformation, and 2019-05-03 to 2019-05-05 was set to 1 (last value set to 1 is 2 days before 2019-05-08 and 3 days preceding 2019-05-05 is also set to 1).
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I think I can do this via for loops, but was looking to see if any inbuilt functions can help me with this.
Thanks!
There may be more concise ways of solving this problem. However, I could only think of solving it using the index values (say i) where Value == 1, then grabbing the index values at the preceding locations (two dates before means i-3, and the two rows above that are i-4 and i-5) and assigning their Value to 1. Finally, set the Value back to 0 for the index location(s) originally found for Value == 1.
In [53]: df = pd.DataFrame({'Date': ['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04', '2019-05-05',
    ...:                             '2019-05-06', '2019-05-07', '2019-05-08', '2019-05-09'],
    ...:                    'Value': [0, 0, 0, 0, 0, 0, 0, 1, 0]})
In [54]: val_1_index = df.loc[df.Value == 1].index.tolist()
In [55]: val_1_index_decr = [(i-3, i-4, i-5) for i in val_1_index]
In [56]: df.loc[df['Value'].index.isin([i for i in val_1_index_decr[0]]), 'Value'] = 1
In [57]: df.loc[df['Value'].index.isin(val_1_index), 'Value'] = 0
In [58]: df
Out[58]:
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 1
3 2019-05-04 1
4 2019-05-05 1
5 2019-05-06 0
6 2019-05-07 0
7 2019-05-08 0
8 2019-05-09 0
A one line solution, assuming that df is your original dataframe:
df['Value'] = pd.Series([1 if 1 in df.iloc[i+3:i+6].values else 0 for i in df.index])
Here I work on index rather than dates, so I assume that you have one day per row and days are consecutive as shown in your example.
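Running the one-liner above on the sample frame, a quick check of the result (the list shown is the output the question asks for):

print(df['Value'].tolist())
# [0, 0, 1, 1, 1, 0, 0, 0, 0]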
To also fit this request:
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I can propose a two line solution:
validones = [True if df.iloc[i]['Value'] == 1 and df.iloc[i+1]['Value'] == 0 else False for i in df.index]
df['Value'] = pd.Series([1 if any(validones[i+3:i+6]) else 0 for i in range(len(validones))])
Basically, I first build a list of booleans marking the 1s in df['Value'] that are not followed by another 1, and then use this boolean list to perform the substitutions.
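A vectorized sketch of the same logic (the naming is mine; one row per day assumed, as above): mark the last 1 of each run, then switch on the rows sitting 3 to 5 positions above it and clear the original 1s:

# run_end is True on a 1 that is not followed by another 1 (the last 1 of a run)
run_end = df['Value'].eq(1) & df['Value'].shift(-1, fill_value=0).ne(1)
# a row becomes 1 when a run end sits 3, 4 or 5 rows below it;
# everything else, including the original 1s, becomes 0
df['Value'] = (run_end.shift(-3, fill_value=False)
               | run_end.shift(-4, fill_value=False)
               | run_end.shift(-5, fill_value=False)).astype(int)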
Not sure about the efficiency of this solution, because it builds three shifted copies of the column, but this also works:
df['shiftedValues'] = \
df['Value'].shift(-3, fill_value=0) + \
df['Value'].shift(-4, fill_value=0) + \
df['Value'].shift(-5, fill_value=0)
Note that the shift is done by row and not by day.
To shift by actual days I would first index by dates
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df['shiftedValues'] = \
df['Value'].shift(-3, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-4, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-5, freq='1D', fill_value=0).asof(df.index)
# Out:
# Value shiftedValues
# Date
# 2019-05-01 0 0.0
# 2019-05-02 0 0.0
# 2019-05-03 0 1.0
# 2019-05-04 0 1.0
# 2019-05-05 0 1.0
# 2019-05-06 0 0.0
# 2019-05-07 0 0.0
# 2019-05-08 1 0.0
# 2019-05-09 0 0.0
Now this works correctly for dates, for instance if df is (note the missing and repeated days)
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 0
3 2019-05-04 0
4 2019-05-05 0
5 2019-05-05 0
6 2019-05-07 0
7 2019-05-08 1
8 2019-05-09 0
then you get
Value shiftedValues
Date
2019-05-01 0 0.0
2019-05-02 0 0.0
2019-05-03 0 1.0
2019-05-04 0 1.0
2019-05-05 0 1.0
2019-05-05 0 1.0
2019-05-07 0 0.0
2019-05-08 1 0.0
2019-05-09 0 0.0
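If the exact 0/1 Value column from the question is wanted, a small assumed follow-up step is to clip the summed shifts back to a binary column, which also zeroes the original 1 rows:

# any positive shifted sum becomes 1; everything else, including the original 1 rows, becomes 0
df['Value'] = df['shiftedValues'].gt(0).astype(int)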
I have a Dataframe with sportsbetting data containing: match_id, team_id, goals_scored and a datetime column for the time the match started. I want to add a column to this dataframe that for each row shows the sum of the goals scored by each team for the previous n matches.
I made up some mock data (because I like football), but as Jacob H suggests, it's best to always supply a sample data frame with the question.
import pandas as pd
import numpy as np
np.random.seed(2)
d = {'match_id': np.arange(10)
,'team_id': ['City','City','City','Utd','Utd','Utd','Albion','Albion','Albion','Albion']
,'goals_scored': np.random.randint(0,5,10)
,'time_played': [0,1,2,0,1,2,0,1,2,3]}
df = pd.DataFrame(data=d)
#previous n matches
n=2
#some Saturday 3pm kickoffs.
rng = pd.date_range('2017-12-02 15:00:00','2017-12-25 15:00:00',freq='W')
# change the time_played integers to the datetimes
df['time_played'] = df['time_played'].map(lambda x: rng[x])
#be sure the sort order is correct
df = df.sort_values(['team_id','time_played'])
# a rolling sum() and then shift(1) to align value with row as per question
df['total_goals'] = df.groupby(['team_id'])['goals_scored'].apply(lambda x: x.rolling(n).sum())
df['total_goals'] = df.groupby(['team_id'])['total_goals'].shift(1)
which produces:
goals_scored match_id team_id time_played total_goals->(in previous n)
6 2 6 Albion 2017-12-03 15:00:00 NaN
7 1 7 Albion 2017-12-10 15:00:00 NaN
8 3 8 Albion 2017-12-17 15:00:00 3.0
9 2 9 Albion 2017-12-24 15:00:00 4.0
0 0 0 City 2017-12-03 15:00:00 NaN
1 0 1 City 2017-12-10 15:00:00 NaN
2 3 2 City 2017-12-17 15:00:00 0.0
3 2 3 Utd 2017-12-03 15:00:00 NaN
4 3 4 Utd 2017-12-10 15:00:00 NaN
5 0 5 Utd 2017-12-17 15:00:00 5.0
There's probably a more efficient way to do this with aggregation functions, but here's a solution where, for each entry, you're filtering your whole dataframe to isolate that team and date range, and then summing the goals.
df['goals_to_date'] = df.apply(lambda row: np.sum(df[(df['team_id'] == row['team_id'])\
&(df['datetime'] < row['datetime'])]['goals_scored']), axis = 1)
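The same filter-and-sum idea can be limited to the previous n matches rather than all of them; a sketch using the question's column names ('datetime', 'goals_scored'), which are assumptions here:

n = 2  # number of previous matches to sum over
df['goals_prev_n'] = df.apply(
    lambda row: df[(df['team_id'] == row['team_id']) & (df['datetime'] < row['datetime'])]
                  .sort_values('datetime')
                  .tail(n)['goals_scored']
                  .sum(),
    axis=1)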
I have a dataframe indexed using a 12hr frequency datetime:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 0
2007-09-28 12:00:00 NaN NaN 0
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28' from which I wish to update all 'ls' values from 0 to 1.
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 1
2007-09-28 12:00:00 NaN NaN 1
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done using another column variable, e.g.:
data.loc[data.id == 1, 'ls'] = 1
yet this does not work with a datetime index.
Could you let me know what the method is for a datetime index?
You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
then you can modify your df using:
df['ls'][pd.DatetimeIndex(df.index.date).isin(days)] = 1
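A close variant (a sketch) that avoids the chained-assignment pattern and keeps the index as a DatetimeIndex, so both the 00:00 and 12:00 rows of a listed day are matched:

days = ['2007-09-28']  # example list of days
df.loc[df.index.normalize().isin(pd.to_datetime(days)), 'ls'] = 1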