Set a DataFrame column to True after the time a condition is first met each day - Python

I need to set the condition column to True starting from the time at which the price was 20 or higher, each day, like below.
I want to avoid the apply function because I have several million rows; I think apply would take too much time.

Use GroupBy.cummax per day and compare for greater or equal with Series.ge:
df['datetime'] = pd.to_datetime(df['datetime'])
df['condition'] = df.groupby([df['datetime'].dt.date])['price'].cummax().ge(20)
If you also need to test per compid:
df['condition'] = df.groupby(['compid', df['datetime'].dt.date])['price'].cummax().ge(20)
print (df)
compid datetime price condition
0 1 2020-11-06 00:00:00 10 False
1 1 2020-11-06 00:00:10 20 True
2 1 2020-11-06 00:00:20 5 True
3 1 2020-11-07 00:00:00 20 True
4 1 2020-11-07 00:00:10 5 True
5 1 2020-11-07 00:00:20 25 True
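For reference, here is a minimal runnable sketch of the cummax approach, with the sample frame reconstructed from the output above (column names and values taken from that output):

import pandas as pd

df = pd.DataFrame({
    'compid': [1, 1, 1, 1, 1, 1],
    'datetime': ['2020-11-06 00:00:00', '2020-11-06 00:00:10', '2020-11-06 00:00:20',
                 '2020-11-07 00:00:00', '2020-11-07 00:00:10', '2020-11-07 00:00:20'],
    'price': [10, 20, 5, 20, 5, 25],
})

df['datetime'] = pd.to_datetime(df['datetime'])
# cummax keeps the running daily maximum, so once the price hits 20
# the comparison stays True for the rest of that day
df['condition'] = df.groupby(df['datetime'].dt.date)['price'].cummax().ge(20)
print(df)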

You can use np.where with GroupBy.cumsum:
In [1306]: import numpy as np
In [1307]: df['condition'] = np.where(df.groupby(df.datetime.dt.date).price.cumsum().ge(20), 'TRUE', 'FALSE')
In [1308]: df
Out[1308]:
compid datetime price condition
0 1 2020-11-06 00:00:00 10 FALSE
1 1 2020-11-06 00:00:10 20 TRUE
2 1 2020-11-06 00:00:20 5 TRUE
3 1 2020-11-07 00:00:00 20 TRUE
4 1 2020-11-07 00:00:10 5 TRUE
5 1 2020-11-07 00:00:20 25 TRUE
Or, if you need bool values in the condition column, do this:
In [1309]: df['condition'] = np.where(df.groupby(df.datetime.dt.date).price.cumsum().ge(20), True, False)
In [1310]: df
Out[1310]:
compid datetime price condition
0 1 2020-11-06 00:00:00 10 False
1 1 2020-11-06 00:00:10 20 True
2 1 2020-11-06 00:00:20 5 True
3 1 2020-11-07 00:00:00 20 True
4 1 2020-11-07 00:00:10 5 True
5 1 2020-11-07 00:00:20 25 True
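Side note: since .ge(20) already returns a boolean Series, the np.where wrapper in the last example is optional; assigning the comparison directly gives the same bool column:

df['condition'] = df.groupby(df.datetime.dt.date).price.cumsum().ge(20)

Also be aware that cumsum tests the running total of prices rather than whether any single price reached 20, so it can diverge from the cummax approach (for example, prices of 10 and 15 sum to 25 without either reaching 20); the two happen to agree on this sample.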

Related

Time since first ever occurrence in Pandas

I have the following data frame in Pandas:
df = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2, 3, 1, 3, 3, 3, 2],
    'date': ['2021-04-28', '2022-05-21', '2011-03-01', '2021-11-28', '1992-12-01',
             '1999-10-28', '2022-01-12', '2019-02-28', '2001-03-28', '2022-01-01', '2009-05-28']
})
I want to produce a column time since first occur that is the time passed in days since each ID's first occurrence.
Here is what I did:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df.sort_values(by=['ID', 'date'], ascending = [True, False], inplace=True)
and I got the sorted data frame
ID date
6 1 2022-01-12
3 1 2021-11-28
0 1 2021-04-28
2 1 2011-03-01
1 2 2022-05-21
10 2 2009-05-28
4 2 1992-12-01
9 3 2022-01-01
7 3 2019-02-28
8 3 2001-03-28
5 3 1999-10-28
so the output should look like
ID date time since first occur
6 1 2022-01-12 3970
3 1 2021-11-28 3925
0 1 2021-04-28 3711
2 1 2011-03-01 0
1 2 2022-05-21 10763
10 2 2009-05-28 6022
4 2 1992-12-01 0
9 3 2022-01-01 8101
7 3 2019-02-28 7063
8 3 2001-03-28 517
5 3 1999-10-28 0
Thanks in advance for helping.
After sorting the dataframe, you can take the difference between date and the minimal date in each group:
df['time since first occur'] = (df['date'] - df.groupby('ID')['date'].transform('min')).dt.days
print(df)
ID date time since first occur
6 1 2022-01-12 3970
3 1 2021-11-28 3925
0 1 2021-04-28 3711
2 1 2011-03-01 0
1 2 2022-05-21 10763
10 2 2009-05-28 6022
4 2 1992-12-01 0
9 3 2022-01-01 8101
7 3 2019-02-28 7063
8 3 2001-03-28 517
5 3 1999-10-28 0
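Note that transform('min') broadcasts each group's earliest date back to every row of that group, so the subtraction works whether or not the frame is sorted. A self-contained sketch using the question's data (dayfirst is unnecessary here since the dates are ISO-formatted):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2, 3, 1, 3, 3, 3, 2],
    'date': ['2021-04-28', '2022-05-21', '2011-03-01', '2021-11-28', '1992-12-01',
             '1999-10-28', '2022-01-12', '2019-02-28', '2001-03-28', '2022-01-01', '2009-05-28']
})
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['ID', 'date'], ascending=[True, False], inplace=True)

# subtract each ID's first-ever date from every row of that ID
df['time since first occur'] = (df['date'] - df.groupby('ID')['date'].transform('min')).dt.days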

pandas generate a sequence of dates according to a pattern

I have this sequence of dates and I want to create a column with a flag according to a 3-2 pattern: 3 days in a row flagged, then 2 days not flagged, and so on.
import pandas as pd
date_pattern = pd.date_range(start='2020-01-01', end='2020-06-30')
date_pattern = pd.DataFrame({"my_date": date_pattern})
date_pattern
I want a 'flag' column having, for instance, 1 for the range 01 to 03 Jan, then 06 to 08 Jan, etc.
You can take the index values modulo 5 and compare for less than 3, so every fourth and fifth row comes out False:
date_pattern['flag'] = date_pattern.index % 5 < 3
# alternative for a non-default index
#date_pattern['flag'] = np.arange(len(date_pattern)) % 5 < 3
print(date_pattern.head(15))
my_date flag
0 2020-01-01 True
1 2020-01-02 True
2 2020-01-03 True
3 2020-01-04 False
4 2020-01-05 False
5 2020-01-06 True
6 2020-01-07 True
7 2020-01-08 True
8 2020-01-09 False
9 2020-01-10 False
10 2020-01-11 True
11 2020-01-12 True
12 2020-01-13 True
13 2020-01-14 False
14 2020-01-15 False
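If the row position can't be relied on, the same 3-on/2-off pattern can also be derived from the dates themselves. A sketch (this keys the pattern to calendar days elapsed since the first date rather than to row position):

import pandas as pd

date_pattern = pd.DataFrame({"my_date": pd.date_range(start='2020-01-01', end='2020-06-30')})
# days elapsed since the first date, modulo 5: offsets 0-2 flagged, 3-4 not
days_since_start = (date_pattern['my_date'] - date_pattern['my_date'].iloc[0]).dt.days
date_pattern['flag'] = (days_since_start % 5) < 3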

Days between this and next time a column value is True?

I am trying to do a date calculation counting days passing between events in a non-date column in pandas.
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({
    'date': ['01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
             '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
             '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'other_val': [0, 0, 0, 100, 0, 100, 0, 10, 10, 0, 0, 10],
    'booly': [True, False, False, True,
              True, False, False, True,
              True, True, True, True]
})
Now, I've been unable to figure out how to create a new column stating the number of days that pass between each True value in the 'booly' column, for each user. That is, for each row with True in the 'booly' column, how many days is it until the next True row, like so:
date user_id booly days_until_next_booly
01.01.2020 1 True 9
02.01.2020 1 False None
03.01.2020 1 False None
10.01.2020 1 True None
01.01.2020 2 True 51
04.02.2020 2 False None
20.02.2020 2 False None
21.02.2020 2 True None
01.02.2020 3 True 9
10.02.2020 3 True 10
20.02.2020 3 True 29
20.03.2020 3 True None
# sample data
df = pd.DataFrame({
    'date': ['01.01.2020', '02.01.2020', '03.01.2020', '10.01.2020',
             '01.01.2020', '04.02.2020', '20.02.2020', '21.02.2020',
             '01.02.2020', '10.02.2020', '20.02.2020', '20.03.2020'],
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'other_val': [0, 0, 0, 100, 0, 100, 0, 10, 10, 0, 0, 10],
    'booly': [True, False, False, True,
              True, False, False, True,
              True, True, True, True]
})
# convert data to date time format
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# use loc with groupby to calculate the difference between True values
df.loc[df['booly'] == True, 'days_until_next_booly'] = df.loc[df['booly'] == True].groupby('user_id')['date'].diff().shift(-1)
date user_id other_val booly days_until_next_booly
0 2020-01-01 1 0 True 9 days
1 2020-01-02 1 0 False NaT
2 2020-01-03 1 0 False NaT
3 2020-01-10 1 100 True NaT
4 2020-01-01 2 0 True 51 days
5 2020-02-04 2 100 False NaT
6 2020-02-20 2 0 False NaT
7 2020-02-21 2 10 True NaT
8 2020-02-01 3 10 True 9 days
9 2020-02-10 3 0 True 10 days
10 2020-02-20 3 0 True 29 days
11 2020-03-20 3 10 True NaT
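If integer day counts are preferred over the Timedelta values shown above (e.g. 9 days), the column can be converted afterwards; this is a small follow-up, not part of the original answer, and NaT entries become NaN:

df['days_until_next_booly'] = df['days_until_next_booly'].dt.days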
(
df
# first convert the date column to datetime format
.assign(date=lambda x: pd.to_datetime(x['date'], dayfirst=True))
# sort your dates
.sort_values('date')
# calculate the difference between subsequent dates
.assign(date_diff=lambda x: x['date'].diff(1).shift(-1))
# Groupby your booly column to calculate the cumulative days between True values
.assign(date_diff_cum=lambda x: x.groupby(x['booly'].cumsum())['date_diff'].transform('sum').where(x['booly'] == True))
)
Output (note: this variant sorts all rows by date without grouping on user_id, so the gaps run between consecutive True rows across users and differ from the per-user approach above):
date user_id other_val booly date_diff date_diff_cum
2020-01-01 2 0 True 1 days 9 days
2020-01-02 1 0 False 1 days NaT
2020-01-03 1 0 False 7 days NaT
2020-01-10 1 100 True 22 days 22 days
2020-02-01 1 0 True 0 days 0 days
2020-02-01 3 10 True 3 days 9 days
2020-02-04 2 10 False 6 days NaT
2020-02-10 3 0 True 10 days 10 days
2020-02-20 2 100 False 0 days NaT
2020-02-20 3 0 True 1 days 1 days
2020-02-21 2 0 True 28 days 28 days
2020-03-20 3 10 True NaT 0 days

flattening time series data from pandas df

I have a df that looks like this: [input image; see the Input Dataframe in the answer below]
And I'm trying to turn it into this: [desired output image; see the Output Dataframe below]
The following code gets me a list of lists that I can convert to a df, and it includes the first 3 columns of the expected output, but I'm not sure how to get the number columns I need (note: I have way more than 3 number columns, but I'm using this as a simple illustration).
x = [['ID', 'Start', 'End', 'Number1', 'Number2', 'Number3']]
for i in range(len(df)):
    if not df.iloc[i-1]['DateSpellIndicator']:
        ID = df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    if not df.iloc[i]['DateSpellIndicator']:
        newrow = [ID, start, df.iloc[i]['Date'], ...]
        x.append(newrow)
Here's one way to do it by making use of pandas groupby.
Input Dataframe:
ID DATE NUM TORF
0 1 2020-01-01 40 True
1 1 2020-02-01 50 True
2 1 2020-03-01 60 False
3 1 2020-06-01 70 True
4 2 2020-07-01 20 True
5 2 2020-08-01 30 False
Output Dataframe:
END ID Number1 Number2 Number3 START
0 2020-08-01 2 20 30.0 NaN 2020-07-01
1 2020-06-01 1 70 NaN NaN 2020-06-01
2 2020-03-01 1 40 50.0 60.0 2020-01-01
Code:
import numpy as np
import pandas as pd

new_df = pd.DataFrame()
# create groups based on ID
for index, row in df.groupby('ID'):
    # within each group, split at the occurrence of False
    dfnew = np.split(row, np.where(row.TORF == False)[0] + 1)
    for sub_df in dfnew:
        # within each subgroup
        if not sub_df.empty:
            dfmod = pd.DataFrame({'ID': sub_df['ID'].iloc[0],
                                  'START': sub_df['DATE'].iloc[0],
                                  'END': sub_df['DATE'].iloc[-1]}, index=[0])
            j = 0
            for nindex, srow in sub_df.iterrows():
                dfmod['Number{}'.format(j + 1)] = srow['NUM']
                j = j + 1
            # concatenate the existing and modified dataframes
            new_df = pd.concat([dfmod, new_df], axis=0)
new_df.reset_index(drop=True)
Some of the steps could be reduced to get the same output.
I used cumsum to get the first and last date, and list to get the columns the way you want. Please note the output has different column names than your example; I assume you can change them the way you want.
df['new1'] = ~df['datespell']
df['new2'] = df['new1'].cumsum() - df['new1']
check = df.groupby(['id', 'new2']).agg({'date': {'start': 'first', 'end': 'last'}, 'number': {'cols': lambda x: list(x)}})
check.columns = check.columns.droplevel(0)
check.reset_index(inplace=True)
pd.concat([check,check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)
id new2 start end 0 1 2
0 1 0 2020-01-01 2020-03-01 40.0 50.0 60.0
1 1 1 2020-06-01 2020-06-01 70.0 NaN NaN
2 2 1 2020-07-01 2020-08-01 20.0 30.0 NaN
Here is the dataframe I used.
id date number datespell new1 new2
0 1 2020-01-01 40 True False 0
1 1 2020-02-01 50 True False 0
2 1 2020-03-01 60 False True 0
3 1 2020-06-01 70 True False 1
4 2 2020-07-01 20 True False 1
5 2 2020-08-01 30 False True 1
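Note: the nested-dict form of .agg used above was removed in pandas 1.0. A sketch of the same aggregation with named aggregation (available since pandas 0.25), assuming the same lowercase column names as the frame above:

check = (df.groupby(['id', 'new2'])
           .agg(start=('date', 'first'),
                end=('date', 'last'),
                cols=('number', list))
           .reset_index())
# expand the collected lists into Number columns, as in the original answer
out = pd.concat([check.drop(columns='cols'), check['cols'].apply(pd.Series)], axis=1)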

pandas: how to check whether differences between column values are within a range in each group

I have the following df,
cluster_id date
1 2018-01-02
1 2018-02-01
1 2018-03-30
2 2018-04-01
2 2018-04-23
2 2018-05-18
3 2018-06-01
3 2018-07-30
3 2018-09-30
I'd like to create a boolean column recur_pmt, which is set to True if all differences between consecutive values of date in each cluster (df.groupby('cluster_id')) satisfy 20 < x < 40, and False otherwise. So the result is like:
cluster_id date recur_pmt
1 2018-01-02 False
1 2018-02-01 False
1 2018-03-30 False
2 2018-04-01 True
2 2018-04-23 True
2 2018-05-18 True
3 2018-06-01 False
3 2018-07-30 False
3 2018-09-30 False
I tried
df['recur_pmt'] = df.groupby('cluster_id')['date'].apply(
lambda x: (20 < x.diff().dropna().dt.days < 40).all())
but it did not work. I am also wondering whether transform can be used in this case as well.
Use transform with Series.between and parameter inclusive=False (the chained comparison 20 < s < 40 fails on a Series because its truth value is ambiguous):
df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
lambda x: (x.diff().dropna().dt.days.between(20, 40, inclusive=False)).all())
print (df)
cluster_id date recur_pmt
0 1 2018-01-02 False
1 1 2018-02-01 False
2 1 2018-03-30 False
3 2 2018-04-01 True
4 2 2018-04-23 True
5 2 2018-05-18 True
6 3 2018-06-01 False
7 3 2018-07-30 False
8 3 2018-09-30 False
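Note: in pandas 1.3+, the inclusive parameter of Series.between takes a string instead of a boolean, so the strict-inequality version of the check becomes:

df['recur_pmt'] = df.groupby('cluster_id')['date'].transform(
    lambda x: x.diff().dropna().dt.days.between(20, 40, inclusive='neither').all())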
