Python Pandas pattern searching - python

Good evening,
I have a question about detecting a certain pattern. I don't know whether there is specific terminology for what I'm asking.
I have a pandas dataframe like this:
0 1 ... 8
0 date price ... pattern
1 2021-01-01 31.18 ... 0
2 2021-01-02 20.32 ... 1
3 2021-01-03 10.32 ... 1
4 2021-01-04 21.32 ... -1
5 2021-01-05 44.32 ... 0
6 2021-01-06 45.32 ... -1
7 2021-01-07 41.32 ... 1
8 2021-01-08 78.32 ... -1
9 2021-01-09 44.32 ... 1
10 2021-01-10 123.32 ... 1
11 2021-01-11 25.32 ... -1
How can I detect the pattern where a -1 immediately follows a 1, using an if-style condition?
For example:
Grab the price column at indices 3 and 4, because the pattern column is 1 at index 3 and -1 at index 4, which matches my condition.
The next matches would be indices 7 and 8, then 10 and 11.
I have probably conveyed my question rather vaguely, but I don't really know how else to describe it.

You can use the three following solutions; the first and second are more idiomatic pandas:
First:
prices = df.where((df.pattern==-1)&(df.pattern.shift()==1)).dropna().price
Second:
df['pattern2'] = df.pattern.shift()
# Selecting just prices of meeting condition
prices = df.loc[df.apply(lambda x: (x['pattern'] == -1) and (x['pattern2'] == 1), axis=1), 'price']
Third:
prices = df.loc[(df.pattern - df.pattern.shift() == -2), 'price']
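As a quick sanity check, the shift-and-compare idea behind these solutions can be run on a cut-down version of the example frame (column names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "price":   [31.18, 20.32, 10.32, 21.32, 44.32, 45.32],
    "pattern": [0, 1, 1, -1, 0, -1],
})

# A -1 that immediately follows a 1
mask = (df["pattern"] == -1) & (df["pattern"].shift() == 1)
prices = df.loc[mask, "price"]
print(prices)  # index 3 -> 21.32
```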

You can use shift and check for a match:
df['pattern_2'] = df['pattern'].shift(1)
idx = df.loc[(df['pattern'] == -1) & (df['pattern_2'] == 1)].index
df_new = df.iloc[[j for i in idx for j in range(i - 1, i + 1)]]
print(df_new)
date price pattern pattern_2
2 2021-01-03 10.32 1 1.0
3 2021-01-04 21.32 -1 1.0
6 2021-01-07 41.32 1 -1.0
7 2021-01-08 78.32 -1 1.0
9 2021-01-10 123.32 1 1.0
10 2021-01-11 25.32 -1 1.0

You can use Series.diff with Series.shift and boolean indexing.
m = df['pattern'].diff(-1).eq(2)
df[m|m.shift()]
date price pattern
3 2021-01-03 10.32 1
4 2021-01-04 21.32 -1
7 2021-01-07 41.32 1
8 2021-01-08 78.32 -1
10 2021-01-10 123.32 1
11 2021-01-11 25.32 -1
Details
df.pattern.diff(-1) calculates the difference between the i-th element and the (i+1)-th element. So when the i-th element is 1 and the (i+1)-th is -1, the output is 2 (i.e. 1 - (-1)).
_.eq(2) marks True where the difference is 2.
m | m.shift() selects the i-th row as well as the (i+1)-th row.
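A minimal runnable sketch of this diff approach on a cut-down version of the example frame:

```python
import pandas as pd

df = pd.DataFrame({"price": [31.18, 20.32, 10.32, 21.32, 44.32],
                   "pattern": [0, 1, 1, -1, 0]})

m = df["pattern"].diff(-1).eq(2)   # True on the row where a 1 is followed by a -1
out = df[m | m.shift()]            # keep that row and the -1 row after it
print(out)                         # rows 2 and 3
```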

Related

Replacing or sequencing in pandas dataframe column based on previous values and other column

I have a pandas df:
date day_of_week day
2021-01-01 3 1
2021-01-02 4 2
2021-01-03 5 0
2021-01-04 6 1
2021-01-05 7 2
2021-01-06 1 3
2021-01-07 2 0
2021-01-08 3 0
I would like to change the numbering in the 'day' column based on the 'day_of_week' column values. For example, if the event starts before Thursday (day_of_week < 4), I want the nonzero 'day' values to be numbered from 20 upward (instead of from 1). If the event starts on Thursday or later (day_of_week >= 4), I want the nonzero values to be numbered from 30 upward.
The table should look like this:
date day_of_week day
2021-01-01 3 20
2021-01-02 4 21
2021-01-03 5 0
2021-01-04 6 30
2021-01-05 7 31
2021-01-06 1 32
2021-01-07 2 0
2021-01-08 3 0
I tried to use np.where to substitute values, but I don't know how to iterate through rows and insert values based on previous rows.
Please help!
We can use cumsum to create the groups, then pick 20 or 30 based on the day_of_week of the first day of each group via transform:
import numpy as np

s = df.groupby(df['day'].eq(1).cumsum())['day_of_week'].transform('first')
df['day'] = df.day.where(df.day == 0, df.day + np.where(s < 4, 19, 29))
df
Out[16]:
date day_of_week day
0 2021-01-01 3 20
1 2021-01-02 4 21
2 2021-01-03 5 0
3 2021-01-04 6 30
4 2021-01-05 7 31
5 2021-01-06 1 32
6 2021-01-07 2 0
7 2021-01-08 3 0
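Put together as a self-contained sketch (data taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=8),
    "day_of_week": [3, 4, 5, 6, 7, 1, 2, 3],
    "day": [1, 2, 0, 1, 2, 3, 0, 0],
})

# Each day == 1 starts a new event; cumsum gives a group id per event
grp = df["day"].eq(1).cumsum()
# day_of_week on the first day of each event, broadcast over the group
s = df.groupby(grp)["day_of_week"].transform("first")
# Renumber nonzero days: +19 if the event starts before Thursday, else +29
df["day"] = df["day"].where(df["day"] == 0, df["day"] + np.where(s < 4, 19, 29))
print(df["day"].tolist())  # [20, 21, 0, 30, 31, 32, 0, 0]
```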

Pandas create a column iteratively - increasing after specific threshold

I have a simple table on which the datetime is formatted correctly.
Datetime              Diff
2021-01-01 12:00:00   0
2021-01-01 12:02:00   2
2021-01-01 12:04:00   2
2021-01-01 12:10:00   6
2021-01-01 12:20:00   10
2021-01-01 12:22:00   2
I would like to add a label/batch name which increases whenever the difference exceeds a specific threshold/cutoff time. The output I am hoping to achieve (with a threshold of Diff > 7) is:
Datetime              Diff   Batch
2021-01-01 12:00:00   0      A
2021-01-01 12:02:00   2      A
2021-01-01 12:04:00   2      A
2021-01-01 12:10:00   6      A
2021-01-01 12:20:00   10     B
2021-01-01 12:22:00   2      B
Batch doesn't need to be 'A','B','C' - probably easier to increase numerically.
I cannot find a solution online but I'm assuming there is a method to split the table on all values below the threshold, apply the batch label and concatenate again. However I cannot seem to get it working.
Any insight appreciated :)
Since True and False values represent 1 and 0 when summed, you can use this to create a cumulative sum on a boolean column made by df.Diff > 7:
df['Batch'] = (df.Diff > 7).cumsum()
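A quick runnable check of that idea, using numeric batch labels starting at 0 (data taken from the question's Diff column):

```python
import pandas as pd

df = pd.DataFrame({"Diff": [0, 2, 2, 6, 10, 2]})
# Each row where Diff > 7 bumps the running batch counter
df["Batch"] = (df["Diff"] > 7).cumsum()
print(df["Batch"].tolist())  # [0, 0, 0, 0, 1, 1]
```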
You can use:
df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
.cumsum().add(65).apply(chr)
print(df)
# Output:
Datetime Diff Batch
0 2021-01-01 12:00:00 0 A
1 2021-01-01 12:02:00 2 A
2 2021-01-01 12:04:00 2 A
3 2021-01-01 12:10:00 6 A
4 2021-01-01 12:20:00 10 B
5 2021-01-01 12:22:00 2 B
Update
For a side question: apply(chr) goes through A-Z; what method would you use to achieve AA, AB, ... for batches greater than 26?
Try something like this:
# Adapted from openpyxl
def chrext(i):
    s = ''
    while i > 0:
        i, r = divmod(i, 26)
        i, r = (i, r) if r > 0 else (i - 1, 26)
        s += chr(r - 1 + 65)
    return s[::-1]
df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
.cumsum().add(1).apply(chrext)
For demonstration purposes, if you replace 1 with 27:
>>> df
Datetime Diff Batch
0 2021-01-01 12:00:00 0 AA
1 2021-01-01 12:02:00 2 AA
2 2021-01-01 12:04:00 2 AA
3 2021-01-01 12:10:00 6 AA
4 2021-01-01 12:20:00 10 AB
5 2021-01-01 12:22:00 2 AB
You can achieve this by creating a custom grouping key that has the properties you want; after grouping, your batch is simply the group number. You don't have to pass groupby an existing column only; you can give it a custom key, which is really powerful.
from datetime import timedelta

# Bins rows into fixed 7-minute windows counted from the first timestamp
df['batch'] = df.groupby((df['Datetime'] - df['Datetime'].min()) // timedelta(minutes=7)).ngroup()

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd

fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])

holidays = df[df["StateHoliday"] != "0"].copy(deep=True)  # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))

look_back = 2  # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev day, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))

dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is split the rows into groups between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement to this would be to backfill the group numbers with the holiday code they precede, e.g. group 0.0 would become b_0, which would make it easier to understand which holiday each group belongs to, but I am not sure how to do that.
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
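The backfill improvement mentioned above can be sketched like this (my own sketch, not the answer's code; it assumes every non-holiday stretch precedes some holiday, so trailing rows after the last holiday would come out as NaN and need separate handling):

```python
import pandas as pd

# Toy frame mirroring the answer's structure: "0" rows between holiday codes
df = pd.DataFrame({"StateHoliday": ["0", "0", "b", "0", "0", "a"]})

is_holiday = df["StateHoliday"] != "0"
# Stretch number: 0 before the first holiday, 1 between the first and second, ...
group_no = is_holiday.cumsum()
# Code of the next holiday, propagated backwards over the stretch before it
next_code = df["StateHoliday"].where(is_holiday).bfill()

# Holiday rows keep their code; other rows get "<next holiday>_<stretch number>"
df["groups"] = df["StateHoliday"].where(is_holiday, next_code + "_" + group_no.astype(str))
print(df["groups"].tolist())  # ['b_0', 'b_0', 'b', 'a_1', 'a_1', 'a']
```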

How to calculate date difference between rows in pandas

I have a data frame that looks like this.
ID  Start       End
1   2020-12-13  2020-12-20
1   2020-12-26  2021-01-20
1   2020-02-20  2020-02-21
2   2020-12-13  2020-12-20
2   2021-01-11  2021-01-20
2   2021-02-15  2021-02-26
Using pandas, I am trying to group by ID and then subtract the current row's start date from the previous row's end date.
If the difference is greater than 5 then it should return True
I'm new to pandas, and I've been trying to figure this out all day.
Two assumptions:
By difference greater than 5, you mean 5 days
You mean the absolute difference
So I am starting with this dataframe to which I added the column 'above_5_days'.
df
ID start end above_5_days
0 1 2020-12-13 2020-12-20 None
1 1 2020-12-26 2021-01-20 None
2 1 2020-02-20 2020-02-21 None
3 2 2020-12-13 2020-12-20 None
4 2 2021-01-11 2021-01-20 None
5 2 2021-02-15 2021-02-26 None
this will be the groupby object that will be used to apply the operation on each ID-group
id_grp = df.groupby("ID")
the following is the operation that will be applied on each subset
def calc_diff(x):
    # shift the end times down one row to align the current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # sets the new column to True/False depending on the condition
    # if you don't want the absolute difference, remove .abs()
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    return x
Now apply this to the whole group and store it in a newdf
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 False
1 1 2020-12-26 2021-01-20 True
2 1 2020-02-20 2020-02-21 True
3 2 2020-12-13 2020-12-20 False
4 2 2021-01-11 2021-01-20 True
5 2 2021-02-15 2021-02-26 True
I should point out that, in this case, the only False values appear because shifting the end column down within each group leaves a NaN in the group's first row, and subtracting from that NaN yields a missing value, for which the comparison returns False. So those False values are really just the boolean version of None.
That is why, I would personally change the function to:
def calc_diff(x):
    # shift the end times down one row to align the current start with the previous end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # subtract the start date from the previous end
    # sets the new column to True/False depending on the condition
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    x.loc[to_subtract_from.isna(), "above_5_days"] = None
    return x
When rerunning this, you can see that the extra line right before the return statement will set the value in the new column to NaN if the shifted end times are NaN.
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 NaN
1 1 2020-12-26 2021-01-20 1.0
2 1 2020-02-20 2020-02-21 1.0
3 2 2020-12-13 2020-12-20 NaN
4 2 2021-01-11 2021-01-20 1.0
5 2 2021-02-15 2021-02-26 1.0
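As an aside, the same check can be done without apply by shifting within each group; this is a sketch of a vectorized variant, not the answer's code:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "start": pd.to_datetime(["2020-12-13", "2020-12-26", "2020-02-20",
                             "2020-12-13", "2021-01-11", "2021-02-15"]),
    "end": pd.to_datetime(["2020-12-20", "2021-01-20", "2020-02-21",
                           "2020-12-20", "2021-01-20", "2021-02-26"]),
})

# Previous row's end within each ID group; NaT in each group's first row
prev_end = df.groupby("ID")["end"].shift()
df["above_5_days"] = (prev_end - df["start"]).abs() > pd.Timedelta(days=5)
print(df["above_5_days"].tolist())  # [False, True, True, False, True, True]
```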

Replacing values in dataframe with 0s and 1s based on conditions

I would like to filter and replace: for values that are lower or higher than zero and not NaN, I would like to set them to one, and set all the others to zero.
mask = ((ts[x] > 0)
| (ts[x] < 0))
ts[mask]=1
ts[ts[x]==1]
I did this and it works, but I still have to handle the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
How about using apply? (Note that NaN != 0 evaluates to True, so NaN has to be handled explicitly:)
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if pd.notna(x) and x != 0 else 0)
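Since the question asks whether where would be better: a small sketch of the same mapping with numpy.where (the column is replaced by a toy Series here):

```python
import numpy as np
import pandas as pd

s = pd.Series([-60.0, 0.0, np.nan, 100.0, 0.0])
# NaN is treated as 0, so both map to 0; everything else maps to 1
result = np.where(s.fillna(0).ne(0), 1, 0)
print(result.tolist())  # [1, 0, 0, 1, 0]
```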
