I have a data frame that looks like this:
ID  Start       End
1   2020-12-13  2020-12-20
1   2020-12-26  2021-01-20
1   2020-02-20  2020-02-21
2   2020-12-13  2020-12-20
2   2021-01-11  2021-01-20
2   2021-02-15  2021-02-26
Using pandas, I am trying to group by ID and then subtract the start date of the current row from the end date of the previous row.
If the difference is greater than 5, it should return True.
I'm new to pandas, and I've been trying to figure this out all day.
Two assumptions:
By difference greater than 5, you mean 5 days
You mean the absolute difference
So I am starting with this dataframe, to which I added the column 'above_5_days':
df
ID start end above_5_days
0 1 2020-12-13 2020-12-20 None
1 1 2020-12-26 2021-01-20 None
2 1 2020-02-20 2020-02-21 None
3 2 2020-12-13 2020-12-20 None
4 2 2021-01-11 2021-01-20 None
5 2 2021-02-15 2021-02-26 None
This will be the groupby object used to apply the operation to each ID group:
id_grp = df.groupby("ID")
The following is the operation that will be applied to each subset:
def calc_diff(x):
    # shift the end times down one row to align each current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    # subtract the current start date from the previous end date
    diff = to_subtract_from - x["start"]
    # set the new column to True/False depending on the condition;
    # if you don't want the absolute difference, remove .abs()
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    return x
Now apply this to every group and store the result in newdf:
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 False
1 1 2020-12-26 2021-01-20 True
2 1 2020-02-20 2020-02-21 True
3 2 2020-12-13 2020-12-20 False
4 2 2021-01-11 2021-01-20 True
5 2 2021-02-15 2021-02-26 True
I should point out that the first row of each group is always False here: shifting the end column down leaves a NaN in the first row of each group, subtracting from NaN yields NaT, and any comparison against NaT evaluates to False. So those False values are really just missing values in disguise.
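You can verify that coercion directly, as a minimal sketch:
import pandas as pd
print(pd.NaT > pd.to_timedelta(5, unit="D"))  # False: comparisons against NaT are always False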
That is why I would personally change the function to:
def calc_diff(x):
    # shift the end times down one row to align each current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    # subtract the current start date from the previous end date
    diff = to_subtract_from - x["start"]
    # set the new column to True/False depending on the condition
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    # restore missing values where there is no previous row
    x.loc[to_subtract_from.isna(), "above_5_days"] = None
    return x
When rerunning this, you can see that the extra line right before the return statement sets the new column to NaN wherever the shifted end time is NaN (mixing None into the boolean column upcasts it, which is why True shows up as 1.0 below):
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 NaN
1 1 2020-12-26 2021-01-20 1.0
2 1 2020-02-20 2020-02-21 1.0
3 2 2020-12-13 2020-12-20 NaN
4 2 2021-01-11 2021-01-20 1.0
5 2 2021-02-15 2021-02-26 1.0
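As a side note, the same result can be computed without groupby.apply, which tends to be faster on large frames. A minimal sketch, assuming df has the columns shown above with datetime dtypes:
prev_end = df.groupby("ID")["end"].shift(1)
diff = (prev_end - df["start"]).abs()
df["above_5_days"] = (diff > pd.to_timedelta(5, unit="D")).where(prev_end.notna())
The .where(prev_end.notna()) keeps the first row of each group as a missing value instead of a spurious False.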
So I am trying to figure out how to identify consecutive repeating values in a data frame column in Python, and how to set the number of consecutive repeating values I am looking for. I will explain further here.
I have the following data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 1
2015-03-11 09:00:00 1
2015-03-11 10:00:00 0
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 0
2015-03-11 15:00:00 0
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".
Now I have this data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 0
2015-03-11 09:00:00 0
2015-03-11 10:00:00 1
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 1
2015-03-11 15:00:00 1
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".
And of course, if the answer is "No", then I would want to return a "False"
How can this be done in python? What is this process even called? How can you set this so that you can change the number of consecutive values being looked for?
First, you can use .shift() to create a new column holding the values of your Value column shifted down one row:
df["Value_shif"] = df["Value"].shift()
output:
DateTime Value Value_shif
0 2015-03-11 06:00:00 1 NaN
1 2015-03-11 07:00:00 1 1.0
2 2015-03-11 08:00:00 0 1.0
3 2015-03-11 09:00:00 1 0.0
Then you can compare them and get True/False:
df["Value"] == df["Value_shif"]
output:
0 False
1 True
2 False
3 False
Then sum the number of repeating values:
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()
cumsum() will treat True as 1 and False as 0
output:
DateTime Value Value_shif count
0 2015-03-11 06:00:00 1 NaN 0
1 2015-03-11 07:00:00 1 1.0 1
2 2015-03-11 08:00:00 0 1.0 1
3 2015-03-11 09:00:00 1 0.0 1
If the sum is larger than 1, then you have consecutive repeating values.
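For instance, here is a quick sketch of how cumsum() counts booleans:
print(pd.Series([True, False, True, True]).cumsum().tolist())  # [1, 1, 2, 3]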
Once you have this info, you can filter the dataframe under specific conditions and check, for a specific value, whether the number of times it occurs is at least a certain amount.
def check(dataframe, value, number_of_times):
    """Check whether `value` repeats at least `number_of_times` times."""
    df = dataframe.copy()
    df = df[df['Value'] == value]
    return df["count"].max() >= number_of_times
print(check(df, 1, 1))
True
print(check(df, 0, 3))
False
You'll need to check for specific boundary conditions to make sure everything works as intended. The problem with shift() is that it creates NaN as the first value and removes the last value from the column...
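If you need to control that boundary explicitly, shift() accepts a fill_value; a small sketch, where -1 is just an arbitrary sentinel that never occurs in Value:
df["Value_shif"] = df["Value"].shift(fill_value=-1)  # first row gets -1 instead of NaN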
To detect consecutive runs in the series, we first detect the turning points by looking at the locations where the difference from the previous entry isn't 0. The cumulative sum of these markers then labels the groups:
# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 5
But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:
>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs
0 NaN
1 NaN
2 2.0
3 2.0
4 NaN
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard those groups that are not runs of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get the size of each run
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check the sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable sequence was found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True
>>> is_consec_found(df.Value, value=0, threshold=5)
False
>>> is_consec_found(df.Value, value=1, threshold=2)
True
>>> is_consec_found(df.Value, value=1, threshold=3)
False
Good evening,
I have a question about detecting a certain pattern. I don't know whether my question has specific terminology.
I have a pandas dataframe like this:
0 1 ... 8
0 date price ... pattern
1 2021-01-01 31.18 ... 0
2 2021-01-02 20.32 ... 1
3 2021-01-03 10.32 ... 1
4 2021-01-04 21.32 ... -1
5 2021-01-05 44.32 ... 0
6 2021-01-06 45.32 ... -1
7 2021-01-07 41.32 ... 1
8 2021-01-08 78.32 ... -1
9 2021-01-09 44.32 ... 1
10 2021-01-10 123.32 ... 1
11 2021-01-11 25.32 ... -1
How can I detect the pattern [-1 following a 1] with an if statement?
For example:
I would grab the price column at indexes 3 and 4, because the pattern column is 1 at index 3 and -1 at index 4, which matches my condition.
Next would be indexes 7 and 8, then indexes 10 and 11.
I have probably conveyed my question pretty vaguely, but I don't really know how to describe it better.
You can use the three following solutions; the first and second are the more idiomatic pandas:
First:
prices = df.where((df.pattern==-1)&(df.pattern.shift()==1)).dropna().price
Second:
df['pattern2'] = df.pattern.shift()
# select just the prices meeting the condition
prices = df.loc[df.apply(lambda x: (x['pattern'] == -1) & (x['pattern2'] == 1), axis=1), 'price']
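The row-wise apply above can also be replaced by a plain boolean mask, which is equivalent and avoids the lambda entirely:
prices = df.loc[(df['pattern'] == -1) & (df['pattern2'] == 1), 'price']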
Third:
prices = df.loc[(df.pattern - df.pattern.shift() == -2), 'price']
You can try shift and check for a match:
df['pattern_2'] = df['pattern'].shift(1)
match_idx = df.loc[(df['pattern'] == -1) & (df['pattern_2'] == 1), :].index
df_new = df.iloc[[j for i in match_idx for j in range(i - 1, i + 1)], :]
print(df_new)
date price pattern pattern_2
2 2021-01-03 10.32 1 1.0
3 2021-01-04 21.32 -1 1.0
6 2021-01-07 41.32 1 -1.0
7 2021-01-08 78.32 -1 1.0
9 2021-01-10 123.32 1 1.0
10 2021-01-11 25.32 -1 1.0
You can use Series.diff and Series.shift with boolean indexing.
m = df['pattern'].diff(-1).eq(2)
df[m|m.shift()]
date price pattern
3 2021-01-03 10.32 1
4 2021-01-04 21.32 -1
7 2021-01-07 41.32 1
8 2021-01-08 78.32 -1
10 2021-01-10 123.32 1
11 2021-01-11 25.32 -1
Details
df.pattern.diff(-1) calculates the difference between the i-th element and the (i+1)-th element. So when the i-th element is 1 and the (i+1)-th is -1, the output is 2 (that is, 1 - (-1)).
.eq(2) marks True wherever the difference is 2.
m | m.shift() takes the i-th row as well as the (i+1)-th row.
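A tiny sketch of that arithmetic on a toy series:
s = pd.Series([1, -1, 0])
print(s.diff(-1).tolist())  # [2.0, -1.0, nan]; the 2.0 is 1 - (-1), the transition we want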
I have a df that looks like this:
And I'm trying to turn it into this:
The following code gets me a list of lists that I can convert to a df; it includes the first 3 columns of the expected output, but I'm not sure how to get the number columns I need (note: I have way more than 3 number columns, but I'm using this as a simple illustration).
x = [['ID', 'Start', 'End', 'Number1', 'Number2', 'Number3']]
for i in range(len(df)):
    if not df.iloc[i-1]['DateSpellIndicator']:
        ID = df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    if not df.iloc[i]['DateSpellIndicator']:
        newrow = [ID, start, df.iloc[i]['Date'], ...]
        x.append(newrow)
Here's one way to do it by making use of pandas groupby.
Input Dataframe:
ID DATE NUM TORF
0 1 2020-01-01 40 True
1 1 2020-02-01 50 True
2 1 2020-03-01 60 False
3 1 2020-06-01 70 True
4 2 2020-07-01 20 True
5 2 2020-08-01 30 False
Output Dataframe:
END ID Number1 Number2 Number3 START
0 2020-08-01 2 20 30.0 NaN 2020-07-01
1 2020-06-01 1 70 NaN NaN 2020-06-01
2 2020-03-01 1 40 50.0 60.0 2020-01-01
Code:
new_df = pd.DataFrame()
# create groups based on ID
for index, row in df.groupby('ID'):
    # within each group, split at each occurrence of False
    dfnew = np.split(row, np.where(row.TORF == False)[0] + 1)
    for sub_df in dfnew:
        # within each subgroup
        if not sub_df.empty:
            dfmod = pd.DataFrame({'ID': sub_df['ID'].iloc[0],
                                  'START': sub_df['DATE'].iloc[0],
                                  'END': sub_df['DATE'].iloc[-1]}, index=[0])
            j = 0
            for nindex, srow in sub_df.iterrows():
                dfmod['Number{}'.format(j + 1)] = srow['NUM']
                j = j + 1
            # concatenate the existing and modified dataframes
            new_df = pd.concat([dfmod, new_df], axis=0)
new_df.reset_index(drop=True)
Some of the steps could be reduced to get the same output.
I used cumsum to get the first and last date, and list to gather the number columns the way you want. Please note the output has different column names than your example; I assume you can rename them the way you want.
df['new1'] = ~df['datespell']
df['new2'] = df['new1'].cumsum()-df['new1']
check = df.groupby(['id', 'new2']).agg({'date': {'start': 'first', 'end': 'last'}, 'number': {'cols': lambda x: list(x)}})
check.columns = check.columns.droplevel(0)
check.reset_index(inplace=True)
pd.concat([check,check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)
id new2 start end 0 1 2
0 1 0 2020-01-01 2020-03-01 40.0 50.0 60.0
1 1 1 2020-06-01 2020-06-01 70.0 NaN NaN
2 2 1 2020-07-01 2020-08-01 20.0 30.0 NaN
Here is the dataframe I used:
id date number datespell new1 new2
0 1 2020-01-01 40 True False 0
1 1 2020-02-01 50 True False 0
2 1 2020-03-01 60 False True 0
3 1 2020-06-01 70 True False 1
4 2 2020-07-01 20 True False 1
5 2 2020-08-01 30 False True 1
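One caveat: the dict-of-dicts renaming passed to .agg() above was deprecated in pandas 0.20 and removed in pandas 1.0. On a current pandas, a named-aggregation equivalent would look something like this sketch, which also makes the droplevel step unnecessary:
check = df.groupby(['id', 'new2']).agg(
    start=('date', 'first'),
    end=('date', 'last'),
    cols=('number', list),
).reset_index()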
I am new to Python and pandas.
I have the following dataframe. I would like to combine the start and end dates if they are on consecutive days.
data = {"Project":["A","A","A",'A',"B","B"], "Start":[dt.datetime(2020,1,1),dt.datetime(2020,1,16),dt.datetime(2020,1,31),dt.datetime(2020,7,1),dt.datetime(2020,1,31),dt.datetime(2020,2,16)],"End":[dt.datetime(2020,1,15),dt.datetime(2020,1,30),dt.datetime(2020,2,15),dt.datetime(2020,7,15),dt.datetime(2020,2,15),dt.datetime(2020,2,20)]}
df = pd.DataFrame(data)
Project Start End
0 A 2020-01-01 2020-01-15
1 A 2020-01-16 2020-01-30
2 A 2020-01-31 2020-02-15
3 A 2020-07-01 2020-07-15
4 B 2020-01-31 2020-02-15
5 B 2020-02-16 2020-02-20
And my expected result:
Project Start End
0 A 2020-01-01 2020-02-15
1 A 2020-07-01 2020-07-15
2 B 2020-01-31 2020-02-20
If the day after one row's end is another row's start, I would like to combine the two rows.
Is there any pandas function that can do this?
Thanks a lot!
Create a mask with groupby and shift, then assign the values directly and drop_duplicates:
mask = df.groupby("Project").apply(lambda d: (d["Start"].shift(-1)-d["End"]).dt.days<=1).reset_index(drop=True)
df.loc[mask, "End"]= df["End"].shift(-1)
print (df.drop_duplicates(subset=["Project","End"],keep="first"))
Project Start End
0 A 2020-01-01 2020-01-30
2 A 2020-05-01 2020-05-15
3 A 2020-07-01 2020-07-15
4 B 2020-02-01 2020-02-20
For merging longer chains of rows instead, one way is to expand each range into individual dates in long form via a list comprehension and pd.date_range, then build a mask grouped by cumsum, and finally take the min/max date of each group:
s = [(i[0], x) for i in df.to_numpy() for x in pd.date_range(*i[1:])]
new = pd.DataFrame(index=pd.MultiIndex.from_tuples(s, names=["Project", "Date"])).reset_index()
mask = new.groupby("Project")["Date"].diff().dt.days.gt(1).cumsum()
print (new.groupby(["Project", mask]).agg(["min", "max"]))
Date
min max
Project Date
A 0 2020-01-01 2020-02-15
1 2020-07-01 2020-07-15
B 1 2020-01-31 2020-02-20
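To reshape that grouped result into the requested Start/End columns, one more step along these lines should do it (a sketch, continuing from the new and mask variables above):
result = (new.groupby(["Project", mask])["Date"]
             .agg(Start="min", End="max")
             .reset_index()
             .drop(columns="Date"))  # drop the helper run counter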
I have a data set (sample) like below
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 0
2019-05-04 0
2019-05-05 0
2019-05-06 0
2019-05-07 0
2019-05-08 1
2019-05-09 0
I want to transform it such that, when I encounter Value=1, I fill with 1 the 3 values that precede the 2 days immediately before it, and set the current value to 0.
In other words, the transformed data set should look like this
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 1
2019-05-04 1
2019-05-05 1
2019-05-06 0
2019-05-07 0
2019-05-08 0
2019-05-09 0
Do notice that in the example above, 2019-05-08 was set to 0 after the transformation, while 2019-05-03 to 2019-05-05 were set to 1 (the last value set to 1 sits 2 clear days before 2019-05-08, and the 3 days ending at 2019-05-05 are all set to 1).
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I think I can do this via for loops, but was looking to see if any inbuilt functions can help me with this.
Thanks!
There could be more precise ways of solving this problem. However, I could only think of solving it using the index values (say i) where Value == 1, then grabbing the index values at the preceding locations (2 dates before means i-3, and the two values above that are i-4 and i-5) and assigning Value to 1 there. Finally, set Value back to 0 at the index location(s) originally found for Value == 1.
In [53]: df = pd.DataFrame({'Date': ['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04',
    ...:                             '2019-05-05', '2019-05-06', '2019-05-07', '2019-05-08',
    ...:                             '2019-05-09'],
    ...:                    'Value': [0, 0, 0, 0, 0, 0, 0, 1, 0]})
In [54]: val_1_index = df.loc[df.Value == 1].index.tolist()
In [55]: val_1_index_decr = [(i-3, i-4, i-5) for i in val_1_index]
In [56]: df.loc[df['Value'].index.isin([i for i in val_1_index_decr[0]]), 'Value'] = 1
In [57]: df.loc[df['Value'].index.isin(val_1_index), 'Value'] = 0
In [58]: df
Out[58]:
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 1
3 2019-05-04 1
4 2019-05-05 1
5 2019-05-06 0
6 2019-05-07 0
7 2019-05-08 0
8 2019-05-09 0
A one line solution, assuming that df is your original dataframe:
df['Value'] = pd.Series([1 if 1 in df.iloc[i+3:i+6].values else 0 for i in df.index])
Here I work on index rather than dates, so I assume that you have one day per row and days are consecutive as shown in your example.
To also satisfy this request:
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I can propose a two-line solution:
# guard i + 1 against running past the last row (an unguarded df.iloc[i+1] would raise an IndexError there)
validones = [df.iloc[i]['Value'] == 1 and (i + 1 >= len(df) or df.iloc[i + 1]['Value'] == 0) for i in df.index]
df['Value'] = pd.Series([1 if any(validones[i + 3:i + 6]) else 0 for i in range(len(validones))])
Basically, I first build a list of booleans that checks that a 1 in df['Value'] is not followed by another 1, then use this boolean list to perform the substitutions.
Not sure about the efficiency of this solution, since it needs three shifted copies of the column, but it also works:
df['shiftedValues'] = \
df['Value'].shift(-3, fill_value=0) + \
df['Value'].shift(-4, fill_value=0) + \
df['Value'].shift(-5, fill_value=0)
Note that the shift is done by row and not by day.
To shift by actual days, I would first index by dates:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df['shiftedValues'] = \
df['Value'].shift(-3, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-4, freq='1D', fill_value=0).asof(df.index) + \
df['Value'].shift(-5, freq='1D', fill_value=0).asof(df.index)
# Out:
# Value shiftedValues
# Date
# 2019-05-01 0 0.0
# 2019-05-02 0 0.0
# 2019-05-03 0 1.0
# 2019-05-04 0 1.0
# 2019-05-05 0 1.0
# 2019-05-06 0 0.0
# 2019-05-07 0 0.0
# 2019-05-08 1 0.0
# 2019-05-09 0 0.0
Now this works correctly for dates; for instance, if df is (note the missing and repeated days):
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 0
3 2019-05-04 0
4 2019-05-05 0
5 2019-05-05 0
6 2019-05-07 0
7 2019-05-08 1
8 2019-05-09 0
then you get
Value shiftedValues
Date
2019-05-01 0 0.0
2019-05-02 0 0.0
2019-05-03 0 1.0
2019-05-04 0 1.0
2019-05-05 0 1.0
2019-05-05 0 1.0
2019-05-07 0 0.0
2019-05-08 1 0.0
2019-05-09 0 0.0
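Either way, a final step can collapse shiftedValues back into the requested 0/1 Value column; a sketch, where the comparison against 0 also handles overlapping windows that summed to more than 1:
df['Value'] = (df['shiftedValues'] > 0).astype(int)
df = df.drop(columns='shiftedValues')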