Identify number of days past the due date between subsequent rows - python

I have a table that is grouped by ID and sorted by transaction date, as shown below.
id transactions_date membership_expire_date
1 2016-11-16 2016-12-16
1 2016-12-15 2017-01-14
1 2017-01-15 2017-02-14
1 2017-02-16 2017-03-17
2 2015-01-31 2015-03-03
2 2015-02-28 2015-03-31
2 2015-04-05 2015-05-01
I want to calculate whether the users were late relative to the due date. For example, for user id 1, on the second row's transactions_date the user paid before the membership_expire_date stated on the 1st row (paying within, on, or one day after the membership_expire_date is considered punctual), therefore the days due = 0. However, for the last row of user id 2, the user paid on 2015-04-05, so 2015-04-05 - 2015-03-31 - 1 day (one day after membership_expire_date is fine) = 4 days due.
How should I compute this? I am stuck after sorting the rows this way.
transactions_train = transactions_train.sort_values(by=['id', 'transactions_date', 'membership_expire_date'], ascending=True)
The expected result is something like below.
id transactions_date membership_expire_date late_count
1 2016-11-16 2016-12-16 0
1 2016-12-15 2017-01-14 0
1 2017-01-15 2017-02-14 0
1 2017-02-16 2017-03-17 1
2 2015-01-31 2015-03-03 0
2 2015-02-28 2015-03-31 0
2 2015-04-05 2015-05-01 4

You need to look at shift, indeed.
def days_due(group):
    day = pd.Timedelta('1d')
    # days between this payment and the previous row's expiry date, minus the one-day grace period
    days_late = (group['transactions_date'] - group['membership_expire_date'].shift()) / day - 1
    days_late = days_late.where(days_late > 0)  # keep only positive (late) values
    return days_late.fillna(0).astype(int)      # first row of each id and punctual rows become 0

df['late_count'] = pd.concat(days_due(group) for idx, group in df.groupby('id'))
id transactions_date membership_expire_date late_count
0 1 2016-11-16 2016-12-16 0
1 1 2016-12-15 2017-01-14 0
2 1 2017-01-15 2017-02-14 0
3 1 2017-02-16 2017-03-17 1
4 2 2015-01-31 2015-03-03 0
5 2 2015-02-28 2015-03-31 0
6 2 2015-04-05 2015-05-01 4
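For reference, a minimal vectorized sketch of the same idea without a custom function (my own illustration, assuming the two date columns are already datetime64):

import pandas as pd

# previous expiry date within each id
prev_expire = df.groupby('id')['membership_expire_date'].shift()
# days paid after the previous expiry, minus the one-day grace period
late = (df['transactions_date'] - prev_expire).dt.days - 1
# negative values and the first row of each id (NaN) count as on time
df['late_count'] = late.where(late > 0, 0).astype(int)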

pandas: find the first friday following the first monday of the month

I am currently trying to figure this out with code like this, but have not quite been able to get it to work:
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
This is what the data looks like (I've added the day_of_week column for convenience). Basically, I am trying to find the first day_of_week=0 (Monday) in each month, then filter for the first day_of_week=4 (Friday) after that.
close day_of_week
date
2022-07-01 3825.330078 4
2022-07-05 3831.389893 1
2022-07-06 3845.080078 2
2022-07-07 3902.620117 3
2022-07-08 3899.379883 4
2022-07-11 3854.429932 0
2022-07-12 3818.800049 1
2022-07-13 3801.780029 2
2022-07-14 3790.379883 3
2022-07-15 3863.159912 4
...
2022-08-01 4118.629883 0
2022-08-02 4091.189941 1
2022-08-03 4155.169922 2
2022-08-04 4151.939941 3
2022-08-05 4145.189941 4
2022-08-08 4140.060059 0
2022-08-09 4122.470215 1
2022-08-10 4210.240234 2
2022-08-11 4207.270020 3
2022-08-12 4280.149902 4
...
2022-09-01 3966.850098 3
2022-09-02 3924.260010 4
2022-09-06 3908.189941 1
2022-09-07 3979.870117 2
2022-09-08 4006.179932 3
2022-09-09 4067.360107 4
2022-09-12 4110.410156 0
2022-09-13 3932.689941 1
2022-09-14 3946.010010 2
2022-09-15 3901.350098 3
2022-09-16 3873.330078 4
...
2022-10-03 3678.429932 0
2022-10-04 3790.929932 1
2022-10-05 3783.280029 2
2022-10-06 3744.520020 3
2022-10-07 3639.659912 4
2022-10-10 3612.389893 0
2022-10-11 3588.840088 1
2022-10-12 3577.030029 2
...
2022-11-01 3856.100098 1
2022-11-02 3759.689941 2
2022-11-03 3719.889893 3
2022-11-04 3770.550049 4
2022-11-07 3806.800049 0
2022-11-08 3828.110107 1
This should return:
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
EDIT:
While I don't expect this to work, I am curious as to why it returns no results. Is shifting not supported when filtering in this manner?
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
You can group by [year, month, day_of_week] and do a cumcount to assign to each row the number of times its day_of_week has appeared in that month.
Then grab the rows corresponding to the first Monday of the month using the filter day_of_week == 0 & cumcount == 0, and shift their index by 4 days to get the following Fridays. Finally, take an intersection to filter out shifted indexes that do not exist in the original frame (here the day_of_week column is abbreviated to dow).
wanted_indices = df.index[df.groupby([df.index.year, df.index.month, df['dow']]).cumcount().eq(0) & df['dow'].eq(0)].shift(4, 'D')
df.loc[wanted_indices.intersection(df.index)]
Result on your example dataframe
close dow
date
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
I feel like the simplest way to do this is to recognize that the Friday after the first Monday of each month can be no earlier than day 5 of the month (the first Monday falls on day 1 at the earliest). So first filter by that requirement and then filter by day_of_week.
df_filtered = df[df.index.day >= 5]  # a DatetimeIndex exposes .day directly (no .dt accessor)
first_fridays = df_filtered[df_filtered['day_of_week'] == 4]  # 4 == Friday in this encoding
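As a follow-up sketch (my own illustration, not part of the answer above), one way to keep only the first qualifying Friday per month, assuming df has a DatetimeIndex and a day_of_week column where 0 is Monday and 4 is Friday:

# first Friday on or after day 5 of each month
candidates = df[(df.index.day >= 5) & (df['day_of_week'] == 4)]
first_fridays = candidates.groupby([candidates.index.year, candidates.index.month]).head(1)
print(first_fridays)

Note that this keys off calendar day numbers, so a month whose first calendar Monday is missing from the data (e.g. a market holiday) may give a different row than the cumcount-based answer above.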

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
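For comparison, a hedged alternative sketch (my own illustration, not part of the answer above), assuming one store's rows are consecutive calendar days sorted by Date (or that you apply it per store via groupby): a limited backfill can label the days leading up to each holiday without building a lookup table.

import pandas as pd

look_back = 2
codes = df['StateHoliday'].where(df['StateHoliday'] != '0')       # holiday code or NaN
before = codes.bfill(limit=look_back).where(codes.isna())         # only the days just before a holiday
df['StateHolidayNew'] = df['StateHoliday'].mask(before.notna(), before.fillna('') + '-')  # e.g. 'b-' shortly before Easter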
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is form groups of the rows that fall between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to back-fill the group numbers with the holiday that follows them, e.g. groups=0.0 would become b_0, which would make it easier to understand the groups and which holiday they represent, but I am not sure how to do that.
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum(numeric_only=True)
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0

Pandas - Times series multiple slices of a dataframe groupby Id

What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with Items recorded at particular dates and times (Timestamp). The second dataframe, df_ref, holds the date-time range references for slicing df: a Start and an End for each subject Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range(s) given for each Id in df_ref (grouping by Id) and concatenate the sliced data into a new dataframe. Note that a subject can have more than one date-time range (in this example Id=3 has 2 ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element that I need.
My code:
from datetime import datetime

df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= df_ref['Start']) & (df['Timestamp'] <= df_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all the combinations for a duplicated Id. Then filter with between, using DataFrame.loc to apply the condition and select df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
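For very large frames where the full merge is costly, a hedged alternative sketch (my own illustration, assuming Timestamp, Start and End are already datetime64) is to slice once per reference row and concatenate:

import pandas as pd

pieces = [
    df[(df['Id'] == row.Id) & df['Timestamp'].between(row.Start, row.End)]
    for row in df_ref.itertuples(index=False)
]
df2 = pd.concat(pieces).reset_index(drop=True)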

split, groupby, combine in Pandas to find a difference in dates

I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column-bind the result back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
import pandas as pd
from pandas import DataFrame

dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id': [1,1,1,1,1,2,2,2,2,2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
    return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4

Pandas groupby date

I have a DataFrame with events. One or more events can occur on a given date (so the date can't be an index). The date range spans several years. I want to group by years and months and get a count of the Category values. Thanks.
In [12]: df = pd.read_excel('Pandas_Test.xls', 'sheet1')
In [13]: df
Out[13]:
EventRefNr DateOccurence Type Category
0 86596 2010-01-02 00:00:00 3 Small
1 86779 2010-01-09 00:00:00 13 Medium
2 86780 2010-02-10 00:00:00 6 Small
3 86781 2010-02-09 00:00:00 17 Small
4 86898 2010-02-10 00:00:00 6 Small
5 86898 2010-02-11 00:00:00 6 Small
6 86902 2010-02-17 00:00:00 9 Small
7 86908 2010-02-19 00:00:00 3 Medium
8 86908 2010-03-05 00:00:00 3 Medium
9 86909 2010-03-06 00:00:00 8 Small
10 86930 2010-03-12 00:00:00 29 Small
11 86934 2010-03-16 00:00:00 9 Small
12 86940 2010-04-08 00:00:00 9 High
13 86941 2010-04-09 00:00:00 17 Small
14 86946 2010-04-14 00:00:00 10 Small
15 86950 2011-01-19 00:00:00 12 Small
16 86956 2011-01-24 00:00:00 13 Small
17 86959 2011-01-27 00:00:00 17 Small
I tried:
df.groupby(df['DateOccurence'])
For the month and year break out I often add additional columns to the data frame that break out the dates into each piece:
df['year'] = [t.year for t in df.DateOccurence]
df['month'] = [t.month for t in df.DateOccurence]
df['day'] = [t.day for t in df.DateOccurence]
It adds space complexity (extra columns on the df) but is less time-complex (less processing in the groupby) than a datetime index; it's really up to you. A datetime index is the more pandas-idiomatic way to do things.
After breaking out by year, month, day you can do any groupby you need.
df.groupby(['year', 'month']).Category.apply(pd.value_counts)
To get months across multiple years:
df.groupby(['month']).Category.apply(pd.value_counts)
Or, with Andy Hayden's datetime index:
df.groupby(di.month).Category.apply(pd.value_counts)
You can simply pick which method fits your needs better.
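As an illustration only (assuming DateOccurence has been converted with pd.to_datetime), a minimal sketch of the year/month count the question asks for:

import pandas as pd

df['DateOccurence'] = pd.to_datetime(df['DateOccurence'])
year = df['DateOccurence'].dt.year.rename('year')
month = df['DateOccurence'].dt.month.rename('month')
counts = df.groupby([year, month])['Category'].value_counts().unstack(fill_value=0)
print(counts)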
You can apply value_counts to the SeriesGroupby (for the column):
In [11]: g = df.groupby('DateOccurence')
In [12]: g.Category.apply(pd.value_counts)
Out[12]:
DateOccurence
2010-01-02 Small 1
2010-01-09 Medium 1
2010-02-09 Small 1
2010-02-10 Small 2
2010-02-11 Small 1
2010-02-17 Small 1
2010-02-19 Medium 1
2010-03-05 Medium 1
2010-03-06 Small 1
2010-03-12 Small 1
2010-03-16 Small 1
2010-04-08 High 1
2010-04-09 Small 1
2010-04-14 Small 1
2011-01-19 Small 1
2011-01-24 Small 1
2011-01-27 Small 1
dtype: int64
I actually hoped this would return the following DataFrame, but you need to unstack it:
In [13]: g.Category.apply(pd.value_counts).unstack(-1).fillna(0)
Out[13]:
High Medium Small
DateOccurence
2010-01-02 0 0 1
2010-01-09 0 1 0
2010-02-09 0 0 1
2010-02-10 0 0 2
2010-02-11 0 0 1
2010-02-17 0 0 1
2010-02-19 0 1 0
2010-03-05 0 1 0
2010-03-06 0 0 1
2010-03-12 0 0 1
2010-03-16 0 0 1
2010-04-08 1 0 0
2010-04-09 0 0 1
2010-04-14 0 0 1
2011-01-19 0 0 1
2011-01-24 0 0 1
2011-01-27 0 0 1
If there were multiple different Categories with the same Date they would be on the same row...
