How to check a time-range in Pandas? - python

I have a dataframe like this:
time
2018-06-25 20:42:00
2016-06-26 23:51:00
2017-05-34 12:29:00
2016-03-11 10:14:00
Now I created a column like this
df['isEIDRange'] = 0
Let's say the Eid festival is on 15 June 2018.
I want to set isEIDRange to 1 if the date is between 10 June 2018 and 20 June 2018 (5 days before and 5 days after Eid).
How can I do it?
Something like?
df.loc[ (df.time > 15 June - 5 days) & (df.time < 15 June + 5 days), 'isEIDRange' ] = 1

Use Series.between to test the values, then cast the boolean mask to integers:
df['isEIDRange'] = df['time'].between('2018-06-10', '2018-06-20').astype(int)
If you want a dynamic solution:
df = pd.DataFrame({"time": pd.date_range("2018-06-08", "2018-06-22")})
#print (df)
date = '15 June 2018'
d = pd.to_datetime(date)
diff = pd.Timedelta(5, unit='d')
df['isEIDRange1'] = df['time'].between(d - diff, d + diff).astype(int)
df['isEIDRange2'] = df['time'].between(d - diff, d + diff, inclusive='neither').astype(int)  # pandas >= 1.3; older versions used inclusive=False
print (df)
time isEIDRange1 isEIDRange2
0 2018-06-08 0 0
1 2018-06-09 0 0
2 2018-06-10 1 0
3 2018-06-11 1 1
4 2018-06-12 1 1
5 2018-06-13 1 1
6 2018-06-14 1 1
7 2018-06-15 1 1
8 2018-06-16 1 1
9 2018-06-17 1 1
10 2018-06-18 1 1
11 2018-06-19 1 1
12 2018-06-20 1 0
13 2018-06-21 0 0
14 2018-06-22 0 0
Or set values by numpy.where:
df['isEIDRange'] = np.where(df['time'].between(d - diff, d + diff), 1, 0)
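The .loc form sketched in the question can also be written directly with Timestamp and Timedelta arithmetic. The following is a minimal sketch, not the answer's own code: the sample timestamps are made up for illustration, and it assumes the whole of the last day (20 June) should count. Because the timestamps carry a time of day, an upper bound of 2018-06-20 00:00 would miss values later on that day, so the sketch uses a half-open upper bound one day later.

```python
import pandas as pd

# Hypothetical sample data (not the asker's exact rows).
df = pd.DataFrame({"time": pd.to_datetime([
    "2018-06-25 20:42:00",
    "2018-06-12 23:51:00",
    "2017-05-30 12:29:00",
    "2018-06-20 10:14:00",
])})

eid = pd.Timestamp("2018-06-15")
window = pd.Timedelta(days=5)

# Start of the first day (inclusive) and start of the day after the last day (exclusive).
start = eid - window
end = eid + window + pd.Timedelta(days=1)

df["isEIDRange"] = 0
in_range = (df["time"] >= start) & (df["time"] < end)
df.loc[in_range, "isEIDRange"] = 1
print(df)
```

If the column only ever holds midnight timestamps, the plain between call above gives the same result.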

You can use loc or np.where:
import numpy as np
df['isEIDRange'] = np.where((df['time'] > '2018-06-10') & (df['time'] < '2018-06-20'), 1, df['isEIDRange'])
This means that where the time column is strictly between 2018-06-10 and 2018-06-20, isEIDRange is set to 1; otherwise it retains its original value (0).

You can use pandas date_range for this:
eid = pd.date_range("15/10/2019", "20/10/2019")
df = pd.DataFrame({"dates": pd.date_range("13/10/2019", "20/10/2019")})
df["eid"] = 0
df.loc[df["dates"].isin(eid), "eid"] = 1
and output:
dates eid
0 2019-10-13 0
1 2019-10-14 0
2 2019-10-15 1
3 2019-10-16 1
4 2019-10-17 1
5 2019-10-18 1
6 2019-10-19 1
7 2019-10-20 1
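One caveat with isin: it matches exact timestamps, and date_range produces midnight values by default. The sketch below uses made-up timestamps with a time of day to show the difference, and normalizes them to midnight with Series.dt.normalize before the membership test. The column names exact and by_day are illustrative only.

```python
import pandas as pd

eid = pd.date_range("2019-10-15", "2019-10-20")

# Hypothetical timestamps that include a time-of-day component.
df = pd.DataFrame({"dates": pd.to_datetime([
    "2019-10-14 09:00",
    "2019-10-15 00:00",
    "2019-10-16 18:30",
    "2019-10-21 07:15",
])})

# Exact match: only midnight values inside the range are found.
df["exact"] = df["dates"].isin(eid).astype(int)

# Match by calendar day: drop the time of day first.
df["by_day"] = df["dates"].dt.normalize().isin(eid).astype(int)
print(df)
```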

Related

Calculation of how many touch points the customer has had in the last 6 months

I have a problem. Starting from a given date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the previous 6 months. He had two touches, on 2022-05-25 and 2022-05-20. However, I don't know how to group by customer and count, for each row's date (count_from_date), how many touches that customer had before it. I also got a KeyError.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
"2022-06-02", "2021-03-01", "2021-02-01"]
}
df = pd.DataFrame(data=d)
print(df)
df_new = df.groupby(['customerId', 'fromDate'], as_index=False)['fromDate'].count()
df_new['count_from_date'] = df_new['fromDate']
df = df.merge(df_new['count_from_date'], how='inner', left_index=True, right_index=True)
(df.set_index(['fromDate']).sort_index().groupby('customerId').apply(lambda s: s['count_from_date'].rolling('180D').sum())- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
[OUT] KeyError: 'count_from_date'
What I want
customerId fromDate occur_last_6_months
0 1 2022-06-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 1 # 2022-05-20 = 1
2 1 2022-05-25 1 # 2022-05-20 = 1
3 1 2022-05-20 0 # No in the last 6 months
4 1 2021-09-05 0 # No in the last 6 months
5 2 2022-06-02 0 # No in the last 6 months
6 3 2021-03-01 1 # 2021-02-01 = 1
7 3 2021-02-01 0 # No in the last 6 months
If duplicated dates (like the second and third rows) may count each other, build a boolean mask by broadcasting and count the matches by summing the True values, subtracting 1 to exclude the row itself:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1) - 1
    return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 2
2 1 2022-05-25 2021-11-25 2
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
If you need to subtract the full count of each duplicated date instead of subtracting 1, use GroupBy.transform with size:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1)
    return x
df = df.groupby('customerId').apply(f)
s = df.groupby(['customerId', 'fromDate'])['customerId'].transform('size')
df['occur_last_6_months'] -= s
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
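A variant of the same broadcasting idea is to use a strict "earlier than" comparison, so that neither the row itself nor same-day duplicates are counted and no subtraction is needed afterwards. This is a minimal sketch under that assumption, not the answer's own code. It loops over the groups instead of using apply, and the names counts and in_window are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
        "2022-06-02", "2021-03-01", "2021-02-01",
    ]),
})

counts = pd.Series(0, index=df.index)
for _, group in df.groupby("customerId"):
    dates = group["fromDate"].to_numpy()
    # Start of each row's six-month look-back window.
    starts = (group["fromDate"] - pd.DateOffset(months=6)).to_numpy()
    # in_window[i, j] is True when touch j falls inside row i's window
    # and is strictly earlier than row i's own date.
    in_window = (dates[None, :] >= starts[:, None]) & (dates[None, :] < dates[:, None])
    counts.loc[group.index] = in_window.sum(axis=1)

df["occur_last_6_months"] = counts
print(df)
```

On the sample data this reproduces the desired column from the question. The pairwise mask grows quadratically with the number of rows per customer, so it suits customers with a modest number of touches.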

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
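To get the distinct "b-" style label mentioned in the question, the look-back days can be given the holiday code plus a suffix instead of the code itself. The sketch below is one possible way to do that, assuming a small in-memory sample instead of the CSV file and a hypothetical "-" suffix. Actual holiday rows keep their original code.

```python
import pandas as pd

# Small in-memory sample standing in for the CSV data.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-03-25", "2013-03-26", "2013-03-27", "2013-03-28", "2013-03-29",
    ]),
    "StateHoliday": ["0", "0", "0", "0", "b"],
})

look_back = 2  # how many days before a holiday to mark

# Map each look-back date to the holiday code with a "-" suffix.
holidays = df.loc[df["StateHoliday"] != "0", ["Date", "StateHoliday"]].drop_duplicates()
pre_holiday = {}
for date, code in zip(holidays["Date"], holidays["StateHoliday"]):
    for offset in range(1, look_back + 1):
        pre_holiday[date - pd.Timedelta(days=offset)] = code + "-"

marked = df["Date"].map(pre_holiday)

# Only overwrite ordinary days; holiday rows keep their own code.
overwrite = df["StateHoliday"].eq("0") & marked.notna()
df["StateHolidayNew"] = df["StateHoliday"].where(~overwrite, marked)
print(df)
```

With several stores per date the mapping still applies, because it is keyed on the date alone.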
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that fall between the letter codes representing the holidays, and then use groupby to find the sales for each group. An improvement would be to label the groups after the holiday that follows them, e.g., group 0.0 would become b_0, which would make it easier to see which holiday each group belongs to, but I am not sure how to do that.
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0

cumulative count and sum based on date frame

I have a DataFrame df, that, once sorted by date, looks like this:
User Date price
0 2 2020-01-30 50
1 1 2020-02-02 30
2 2 2020-02-28 50
3 2 2020-04-30 10
4 1 2020-12-28 10
5 1 2020-12-30 20
I want to compute, for each row:
the number of rows in the last month, and
the sum of price in the last month.
On the example above, the output that I'm looking for:
User Date price NumlastMonth Totallastmonth
0 2 2020-01-30 50 0 0
1 1 2020-02-02 30 0 0 # not 1, 50 ???
2 2 2020-02-28 50 1 50
3 2 2020-04-30 10 0 0
4 1 2020-12-28 10 0 0
5 1 2020-12-30 20 1 10 # not 0, 0 ???
I tried this, but the result covers all previous rows, not only the last month.
df['NumlastMonth'] = data.sort_values('Date')\
.groupby(['user']).amount.cumcount()
df['NumlastMonth'] = data.sort_values('Date')\
.groupby(['user']).amount.cumsum()
Taking the question literally (and acknowledging that the example doesn't quite match the description), we could do:
tally = df.groupby(pd.Grouper(key='Date', freq='M')).agg({'User': 'count', 'price': sum})
tally.index += pd.offsets.Day(1)
tally = tally.reindex(index=df.Date, method='ffill', fill_value=0)
On your input, that gives:
>>> tally
User price
Date
2020-01-30 0 0
2020-02-02 1 50
2020-02-28 1 50
2020-04-30 0 0
2020-12-28 0 0
2020-12-30 0 0
After that, it's easy to change the column names and concat:
df2 = pd.concat([
    df.set_index('Date'),
    tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'})
], axis=1)
# out:
User price NumlastMonth Totallastmonth
Date
2020-01-30 2 50 0 0
2020-02-02 1 30 1 50
2020-02-28 2 50 1 50
2020-04-30 2 10 0 0
2020-12-28 1 10 0 0
2020-12-30 1 20 0 0
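The expected output in the question looks like a per-user count over a trailing window rather than a calendar-month tally. Under that reading, and assuming "last month" means the 30 days before each row (an assumption, not something the question states), a self-join on User is one way to sketch it. The helper names pairs, in_window and stats are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": [2, 1, 2, 2, 1, 1],
    "Date": pd.to_datetime([
        "2020-01-30", "2020-02-02", "2020-02-28",
        "2020-04-30", "2020-12-28", "2020-12-30",
    ]),
    "price": [50, 30, 50, 10, 10, 20],
})

# Pair every row with every row of the same user; "index" keeps the original row label.
pairs = df.reset_index().merge(df, on="User", suffixes=("", "_prev"))

# Keep pairs where the other row is strictly earlier and at most 30 days back.
in_window = (
    (pairs["Date_prev"] < pairs["Date"])
    & (pairs["Date_prev"] >= pairs["Date"] - pd.Timedelta(days=30))
)
stats = pairs[in_window].groupby("index")["price_prev"].agg(["count", "sum"])

# Rows with no earlier purchase in the window get 0.
df["NumlastMonth"] = stats["count"].reindex(df.index, fill_value=0)
df["Totallastmonth"] = stats["sum"].reindex(df.index, fill_value=0)
print(df)
```

On the sample data this gives the values the question asks for. The self-join produces one row per pair within a user, so it is best suited to users with a moderate number of rows.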

Return duration for each id

I have a large list of events being tracked with a timestamp appended to each:
I currently have the following table:
ID Time_Stamp Event
1 2/20/2019 18:21 0
1 2/20/2019 19:46 0
1 2/21/2019 18:35 0
1 2/22/2019 11:39 1
1 2/22/2019 16:46 0
1 2/23/2019 7:40 0
2 6/5/2019 0:10 0
3 7/31/2019 10:18 0
3 8/23/2019 16:33 0
4 6/26/2019 20:49 0
What I want is the following [but not sure if it's possible]:
ID Time_Stamp Conversion Total_Duration_Days Conversion_Duration
1 2/20/2019 18:21 0 2.555 1.721
1 2/20/2019 19:46 0 2.555 1.721
1 2/21/2019 18:35 0 2.555 1.721
1 2/22/2019 11:39 1 2.555 1.721
1 2/22/2019 16:46 1 2.555 1.934
1 2/23/2019 7:40 0 2.555 1.934
2 6/5/2019 0:10 0 1.00 0.000
3 7/31/2019 10:18 0 23.260 0.000
3 8/23/2019 16:33 0 23.260 0.000
4 6/26/2019 20:49 0 1.00 0.000
For #1 Total Duration = Max Date - Min Date [2.555 Days]
For #2 Conversion Duration = Conversion Date - Min Date [1.721 Days] - following actions post the conversion can remain at the calculated duration
I have attempted the following:
df.reset_index(inplace=True)
df.groupby(['ID'])['Time_Stamp'].diff().fillna(0)
This kind of does what I want, but it's showing the difference between each event, not the min time stamp to the max time stamp
conv_test = df.reset_index()
min_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('min').to_frame('MinTime')
max_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('max').to_frame('MaxTime')
conv_test = conv_test.set_index('ID').merge(min_df, left_index=True, right_index=True)
conv_test = conv_test.merge(max_df, left_index=True, right_index=True)
conv_test['Duration'] = conv_test['MaxTime'] - conv_test['MinTime']
This gives me Total_Duration_Days, which is great [feel free to offer a more elegant solution].
Any ideas on how I can get Conversion_Duration?
You can use GroupBy.transform with min and max to get Series the same size as the original, so you can subtract them to get Total_Duration_Days. Then filter only the rows where Event is 1, create a Series with DataFrame.set_index and convert it to a dict, and use Series.map to build a new Series from which the minimal values per group can be subtracted:
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
min1 = df.groupby('ID')['Time_Stamp'].transform('min')
max1 = df.groupby('ID')['Time_Stamp'].transform('max')
df['Total_Duration_Days'] = max1.sub(min1).dt.total_seconds() / (3600 * 24)
d = df.loc[df['Event'] == 1].set_index('ID')['Time_Stamp'].to_dict()
new1 = df['ID'].map(d)
Because a group may contain multiple 1 values, an extra step is added only for those groups: test in a mask whether a group has more than one 1, build the Series new2 for those rows, and then use Series.combine_first with the mapped Series new1.
The reason is performance, because processing multiple 1 values is somewhat more complicated.
mask = df['Event'].eq(1).groupby(df['ID']).transform('sum').gt(1)
g = df[mask].groupby('ID')['Event'].cumsum().replace({0:np.nan})
new2 = (df[mask].groupby(['ID', g])['Time_Stamp']
                .transform('first')
                .groupby(df['ID'])
                .bfill())
df['Conversion_Duration'] = (new2.combine_first(new1)
                                 .sub(min1)
                                 .dt.total_seconds().fillna(0) / (3600 * 24))
print (df)
ID Time_Stamp Event Total_Duration_Days Conversion_Duration
0 1 2019-02-20 18:21:00 0 2.554861 1.720833
1 1 2019-02-20 19:46:00 0 2.554861 1.720833
2 1 2019-02-21 18:35:00 0 2.554861 1.720833
3 1 2019-02-22 11:39:00 1 2.554861 1.720833
4 1 2019-02-22 16:46:00 1 2.554861 1.934028
5 1 2019-02-23 07:40:00 0 2.554861 1.934028
6 2 2019-06-05 00:10:00 0 0.000000 0.000000
7 3 2019-07-31 10:18:00 0 23.260417 0.000000
8 3 2019-08-23 16:33:00 0 23.260417 0.000000
9 4 2019-06-26 20:49:00 0 0.000000 0.000000
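If only the first conversion per ID matters, the calculation can be reduced to three transform calls. The sketch below is a simplification of the approach above, not a replacement for it: it assumes a shortened sample with one conversion for ID 1, and it measures every row of an ID against that ID's earliest conversion.

```python
import pandas as pd

# Shortened, hypothetical sample based on the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2],
    "Time_Stamp": pd.to_datetime([
        "2019-02-20 18:21", "2019-02-21 18:35", "2019-02-22 11:39",
        "2019-02-23 07:40", "2019-06-05 00:10",
    ]),
    "Event": [0, 0, 1, 0, 0],
})

by_id = df.groupby("ID")["Time_Stamp"]
first_seen = by_id.transform("min")
last_seen = by_id.transform("max")
one_day = pd.Timedelta(days=1)

# Dividing a timedelta Series by one day yields fractional days.
df["Total_Duration_Days"] = (last_seen - first_seen) / one_day

# Earliest timestamp with Event == 1 per ID; NaT for IDs without a conversion.
first_conv = (
    df["Time_Stamp"].where(df["Event"].eq(1))
    .groupby(df["ID"])
    .transform("min")
)
df["Conversion_Duration"] = ((first_conv - first_seen) / one_day).fillna(0)
print(df)
```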

Subtracting Rows based on ID Column - Pandas

I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find out the no. of days the user gave as a gap, so I want a column for each row for each user and my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
         pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
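The same gap can be computed as "days since the previous watch, minus the previous row's Days_not_watch", which avoids building the intermediate start dates. The following is a self-contained sketch of that equivalent formulation on the question's input table. The variable name days_between is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3, 3],
    "Date_watched": pd.to_datetime([
        "2010-09-11", "2010-10-01", "2010-10-28",
        "2010-05-06", "2010-05-18",
        "2010-08-09", "2010-09-25",
    ]),
    "Days_not_watch": [5, 8, 1, 12, 5, 10, 5],
})

df = df.sort_values(["UserId", "Date_watched"])
grp = df.groupby("UserId")

# Whole days between consecutive watch dates of the same user (NaN for the first row).
days_between = grp["Date_watched"].diff().dt.days

# Subtract the previous row's Days_not_watch; the first row per user becomes 0.
df["Gap"] = (days_between - grp["Days_not_watch"].shift()).fillna(0).astype(int)
print(df)
```

The result matches the output shown above, including 19 and 37 for the last rows of users 1 and 3.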
