pandas: find the first Friday following the first Monday of the month - Python

Currently trying to figure this out with code like the following, but not quite able to get it yet:
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
This is what the data looks like (I've added the day_of_week column for convenience). Basically, I am trying to find the first day_of_week=0 (Monday) in each month, then filter for the first day_of_week=4 (Friday) after that.
close day_of_week
date
2022-07-01 3825.330078 4
2022-07-05 3831.389893 1
2022-07-06 3845.080078 2
2022-07-07 3902.620117 3
2022-07-08 3899.379883 4
2022-07-11 3854.429932 0
2022-07-12 3818.800049 1
2022-07-13 3801.780029 2
2022-07-14 3790.379883 3
2022-07-15 3863.159912 4
...
2022-08-01 4118.629883 0
2022-08-02 4091.189941 1
2022-08-03 4155.169922 2
2022-08-04 4151.939941 3
2022-08-05 4145.189941 4
2022-08-08 4140.060059 0
2022-08-09 4122.470215 1
2022-08-10 4210.240234 2
2022-08-11 4207.270020 3
2022-08-12 4280.149902 4
...
2022-09-01 3966.850098 3
2022-09-02 3924.260010 4
2022-09-06 3908.189941 1
2022-09-07 3979.870117 2
2022-09-08 4006.179932 3
2022-09-09 4067.360107 4
2022-09-12 4110.410156 0
2022-09-13 3932.689941 1
2022-09-14 3946.010010 2
2022-09-15 3901.350098 3
2022-09-16 3873.330078 4
...
2022-10-03 3678.429932 0
2022-10-04 3790.929932 1
2022-10-05 3783.280029 2
2022-10-06 3744.520020 3
2022-10-07 3639.659912 4
2022-10-10 3612.389893 0
2022-10-11 3588.840088 1
2022-10-12 3577.030029 2
...
2022-11-01 3856.100098 1
2022-11-02 3759.689941 2
2022-11-03 3719.889893 3
2022-11-04 3770.550049 4
2022-11-07 3806.800049 0
2022-11-08 3828.110107 1
This should return:
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
EDIT:
While I don't expect this to work, I'm curious why it returns no results. Is shifting not supported when filtering in this manner?
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]

You can group by [year, month, day_of_week] and use cumcount to assign to each row the number of times its day_of_week has appeared so far in that month.
Then grab the rows corresponding to the first Monday of the month with the filter day_of_week == 0 & cumcount == 0, and shift their index by 4 days to get the following Fridays. Finally, intersect with the original index to drop any shifted dates that do not exist in the frame. (Here dow is your day_of_week column under a shorter name.)
wanted_indices = df.index[
    df.groupby([df.index.year, df.index.month, df['dow']]).cumcount().eq(0)
    & df['dow'].eq(0)
].shift(4, 'D')
df.loc[wanted_indices.intersection(df.index)]
Result on your example dataframe
close dow
date
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
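As for the edit: shift() moves the column values, not the index, so df.shift(-4).index is exactly the same index as df.index. The first and third conditions can therefore never both be true on the same row, which is why that filter returns nothing. A quick check:
(df.shift(-4).index == df.index).all()  # True: shifting the data leaves the index untouched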

I feel like the simplest way to do this is to recognize that the Friday after the first Monday of each month can be no earlier than day 5 (the first Monday falls on day 1 at the earliest, and the Friday is four days later). So first filter by that requirement and then filter by day_of_week.
# .day works directly on a DatetimeIndex (no .dt accessor), and Friday is day_of_week == 4
df_filtered = df[df.index.day >= 5]
first_fridays = df_filtered[df_filtered['day_of_week'] == 4]
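That still keeps every later Friday in each month, so take only the first hit per month; a minimal sketch reusing the frame above:
first_fridays = first_fridays.groupby(
    [first_fridays.index.year, first_fridays.index.month]).head(1)
One caveat: this calendar shortcut assumes the first Monday is present in the data. In the September example it falls on a market holiday, so the groupby answer above returns 2022-09-16 while this rule picks 2022-09-09.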

Related

Python - Tell if there is a non consecutive date in pandas dataframe

I have a pandas dataframe with dates. I need to know whether each successive pair of dates, after the first date, is consecutive.
2 1988-01-01
3 2015-01-31
4 2015-02-01
5 2015-05-31
6 2015-06-01
7 2021-11-16
11 2021-11-17
12 2022-10-05
8 2022-10-06
9 2022-10-12
10 2022-10-13
# How to build this example dataframe
df=pd.DataFrame({'date':pd.to_datetime(['1988-01-01','2015-01-31','2015-02-01', '2015-05-31','2015-06-01', '2021-11-16', '2021-11-17', '2022-10-05', '2022-10-06', '2022-10-12', '2022-10-13'])})
Each pair should be consecutive. I have tried different sorting but everything I see relates to the entire series being consecutive. I need to compare each pair of dates after the first date.
cb_gap = cb_sorted.sort_values('dates').groupby('dates').diff() > pd.to_timedelta('1 day')
What I need to see is this...
2 1988-01-01 <- Ignore the start date
3 2015-01-31 <- these dates have no gap
4 2015-02-01
5 2015-05-31 <- these dates have no gap
6 2015-06-01
7 2021-11-16 <- these have a gap!!!!
11 2021-11-18
12 2022-10-05 <- these have no gap
8 2022-10-06
9 2022-10-12
One way is to use shift and compute differences.
pd.DataFrame({'date':df.date,'diff':df.date.shift(-1)-df.date})[1::2]
returns
date diff
1 2015-01-31 1 days
3 2015-05-31 1 days
5 2021-11-16 1 days
7 2022-10-05 1 days
9 2022-10-12 1 days
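If you want an explicit flag per pair instead of raw differences, compare against one day; a small sketch reusing the same frame:
pairs = pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]
pairs['gap'] = pairs['diff'] > pd.Timedelta('1 day')  # True where a pair is not consecutive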
It is also faster:
Method      Timeit
Naveed's    4.23 ms
This one    0.93 ms
Here is one way to do it. By the way, what is your expected output? This answer gets you the difference between the consecutive dates, skipping the first row, and populates a diff column.
# make date into datetime
df['date'] = pd.to_datetime(df['date'])
# build two intermediate DFs, taking alternate rows starting at row 1 and row 2,
# and concat them side by side; the ids live in the frame's index, so
# reset_index() exposes them and we rename that column to 'id'
df2 = pd.concat([df.iloc[1:].iloc[::2].reset_index().rename(columns={'index': 'id'})[['id', 'date']],
                 df.iloc[2:].iloc[::2].reset_index().rename(columns={'index': 'id'})[['id', 'date']]],
                axis=1)
# take the difference of the second date from the first one
df2['diff'] = df2.iloc[:, 3] - df2.iloc[:, 1]
df2
id date id date diff
0 3 2015-01-31 4 2015-02-01 1 days
1 5 2015-05-31 6 2015-06-01 1 days
2 7 2021-11-16 11 2021-11-17 1 days
3 12 2022-10-05 8 2022-10-06 1 days
4 9 2022-10-12 10 2022-10-13 1 days

How to calculate monthly changes in a time series using pandas dataframe

As I am new to Python, I am probably asking for something basic for most of you. I have a df where 'Date' is the index, another column gives the month of each Date, and one column holds the data.
Mnth TSData
Date
2012-01-05 1 192.6257
2012-01-12 1 194.2714
2012-01-19 1 192.0086
2012-01-26 1 186.9729
2012-02-02 2 183.7700
2012-02-09 2 178.2343
2012-02-16 2 172.3429
2012-02-23 2 171.7800
2012-03-01 3 169.6300
2012-03-08 3 168.7386
2012-03-15 3 167.1700
2012-03-22 3 165.9543
2012-03-29 3 165.0771
2012-04-05 4 164.6371
2012-04-12 4 164.6500
2012-04-19 4 166.9171
2012-04-26 4 166.4514
2012-05-03 5 166.3657
2012-05-10 5 168.2543
2012-05-17 5 176.8271
2012-05-24 5 179.1971
2012-05-31 5 183.7120
2012-06-07 6 195.1286
I wish to calculate monthly changes in the data set that I can later use in a boxplot. So from the table above the results I seek are:
Mnth Chng
1 -8.9 (183.77 - 192.66)
2 -14.14 (169.63 - 183.77)
3 -5 (164.63 - 169.63)
4 1.73 (166.36 - 164.63)
5 28.77 (195.13 - 166.36)
and so on...
Any suggestions?
Thanks :)
IIUC, starting from this as df:
Date Mnth TSData
0 2012-01-05 1 192.6257
1 2012-01-12 1 194.2714
2 2012-01-19 1 192.0086
3 2012-01-26 1 186.9729
4 2012-02-02 2 183.7700
...
20 2012-05-24 5 179.1971
21 2012-05-31 5 183.7120
22 2012-06-07 6 195.1286
you can use:
df.groupby('Mnth')['TSData'].first().diff().shift(-1)
# or
# -df.groupby('Mnth')['TSData'].first().diff(-1)
NB: the data must be sorted by date so that the desired date is used as the first item of each group (df.sort_values(by=['Mnth', 'Date'])).
output:
Mnth
1 -8.8557
2 -14.1400
3 -4.9929
4 1.7286
5 28.7629
6 NaN
Name: TSData, dtype: float64
First, make sure we have a datetime index:
df.index = pd.to_datetime(df.index)
Then it's simply a matter of using resample:
df['TSData'].resample('M').first().diff().shift(freq='-1M')
Output:
Date
2011-12-31 NaN
2012-01-31 -8.8557
2012-02-29 -14.1400
2012-03-31 -4.9929
2012-04-30 1.7286
2012-05-31 28.7629
Name: TSData, dtype: float64
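If you are on a newer pandas (2.2 and later, a version assumption), the month-end alias is 'ME' rather than 'M', which on those versions raises a deprecation warning. The equivalent call would be:
df['TSData'].resample('ME').first().diff().shift(-1, freq='ME')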

Pandas - Times series multiple slices of a dataframe groupby Id

What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with Items recorded at particular dates and times (Timestamp). The second dataframe, df_ref, holds the datetime ranges (Start and End) used to slice df for each Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the datetime range(s) given for each Id in df_ref and concatenate the sliced data into a new dataframe. Note that a subject can have more than one datetime range (in this example, Id=3 has two).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code, and modified it since it does not have the groupby element I need.
My code:
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= df_ref['Start']) & (df['Timestamp'] <= df_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all combinations for a duplicated Id. Then filter with between, using DataFrame.loc to apply the condition and select df.columns in one step. (Incidentally, the ValueError above arises because df['Timestamp'] and df_ref['Start'] are differently-indexed Series, which cannot be compared element-wise.)
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
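If merging a large df against df_ref is too memory-hungry, a row-by-row loop also works and stays close to the original attempt; the fix is to compare against the scalar Start/End of each df_ref row instead of whole Series. A sketch:
parts = []
for _, row in df_ref.iterrows():
    # row['Start'] and row['End'] are scalars here, so the comparison is well defined
    mask = (df['Id'] == row['Id']) & df['Timestamp'].between(row['Start'], row['End'])
    parts.append(df[mask])
df2 = pd.concat(parts, ignore_index=True)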

Pandas to_datetime function giving erratic output

My dataframe has a column 'Date' which is of type object, but I want to convert it to a pandas time series, so I am using the pd.to_datetime function. It converts the datatype but gives erratic output.
code:
x1['TS'] = pd.to_datetime(x1['Date'])
x1['Day'] = x1['TS'].dt.dayofweek
x1[['Date', 'TS', 'Day']].iloc[::1430,:]
Now look closely at the columns Date and TS in the output. They should be the same, but in some cases they differ.
output :
Date TS Day
0 01-12-2017 2017-01-12 3
1430 01-12-2017 2017-01-12 3
2860 02-12-2017 2017-02-12 6
4290 03-12-2017 2017-03-12 6
5720 04-12-2017 2017-04-12 2
7150 05-12-2017 2017-05-12 4
8580 07-12-2017 2017-07-12 2
10010 08-12-2017 2017-08-12 5
11440 09-12-2017 2017-09-12 1
12870 09-12-2017 2017-09-12 1
14300 10-12-2017 2017-10-12 3
15730 11-12-2017 2017-11-12 6
17160 12-12-2017 2017-12-12 1
18590 13-12-2017 2017-12-13 2
20020 14-12-2017 2017-12-14 3
21450 15-12-2017 2017-12-15 4
22880 16-12-2017 2017-12-16 5
24310 17-12-2017 2017-12-17 6
25740 18-12-2017 2017-12-18 0
27170 19-12-2017 2017-12-19 1
28600 20-12-2017 2017-12-20 2
30030 21-12-2017 2017-12-21 3
31460 22-12-2017 2017-12-22 4
32890 23-12-2017 2017-12-23 5
34320 24-12-2017 2017-12-24 6
35750 25-12-2017 2017-12-25 0
37180 26-12-2017 2017-12-26 1
38610 27-12-2017 2017-12-27 2
40040 28-12-2017 2017-12-28 3
41470 29-12-2017 2017-12-29 4
42900 30-12-2017 2017-12-30 5
44330 31-12-2017 2017-12-31 6
45760 01-01-2018 2018-01-01 0
47190 02-01-2018 2018-02-01 3
48620 03-01-2018 2018-03-01 3
50050 04-01-2018 2018-04-01 6
51480 05-01-2018 2018-05-01 1
52910 06-01-2018 2018-06-01 4
54340 07-01-2018 2018-07-01 6
55770 08-01-2018 2018-08-01 2
57200 09-01-2018 2018-09-01 5
58630 10-01-2018 2018-10-01 0
60060 11-01-2018 2018-11-01 3
61490 12-01-2018 2018-12-01 5
62920 13-01-2018 2018-01-13 5
64350 14-01-2018 2018-01-14 6
65780 15-01-2018 2018-01-15 0
67210 16-01-2018 2018-01-16 1
Oops! Looks like your dates start with the day first. You'll have to tell pandas to handle that accordingly: set the dayfirst flag to True when calling to_datetime.
x1['TS'] = pd.to_datetime(x1['Date'], dayfirst=True)
When you pass in a date without specifying the format, pandas tries to guess the format in a naive manner. It was assuming your day was actually the month, but when it saw month 13 it realized that couldn't be the month column and switched.
The following should also work, but I like #cᴏʟᴅsᴘᴇᴇᴅ's solution better because it is simpler to just raise the dayfirst flag. To fix this, provide the actual format to the to_datetime function.
The documentation gives the following example which you can modify to fit your situation:
pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
See details here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
Time format conventions (what %Y means and so on) are here: https://docs.python.org/3.2/library/time.html
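Adapted to the data above, and assuming the strings really are day-month-year, that would be something like:
x1['TS'] = pd.to_datetime(x1['Date'], format='%d-%m-%Y')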

Identify amount of due date between subsequent rows

I have a table, grouped by id and sorted by transaction date, as shown below.
id transactions_date membership_expire_date
1 2016-11-16 2016-12-16
1 2016-12-15 2017-01-14
1 2017-01-15 2017-02-14
1 2017-02-15 2017-03-17
2 2015-01-31 2015-03-03
2 2015-02-28 2015-03-31
2 2015-04-05 2015-05-01
I want to calculate whether the users were late on the due date. For example, for id 1, the payment on the second row's transactions_date was made before the membership_expire_date stated on the first row (paying within, on, or one day after the membership_expire_date is considered punctual), therefore the days due = 0. However, on the last row for id 2 the user paid on 2015-04-05, therefore 2015-04-05 - 2015-03-31 - 1 day (one day after membership_expire_date is fine) = 4 days due.
How should I compute this? I am stuck after sorting them this way.
transactions_train = transactions_train.sort_values(by=['id', 'transactions_date', 'membership_expire_date'], ascending=True)
The expected result is something like below.
id transactions_date membership_expire_date late_count
1 2016-11-16 2016-12-16 0
1 2016-12-15 2017-01-14 0
1 2017-01-15 2017-02-14 0
1 2017-02-16 2017-03-17 1
2 2015-01-31 2015-03-03 0
2 2015-02-28 2015-03-31 0
2 2015-04-05 2015-05-01 4
You need to look at shift, indeed.
def days_due(group):
    day = pd.Timedelta('1d')
    # days between each payment and the previous row's expiry, minus the one-day grace period
    days_late = (group['transactions_date'] - group['membership_expire_date'].shift()) / day - 1
    days_late = days_late.where(days_late > 0)  # keep only positive lateness
    return days_late.fillna(0).astype(int)      # first row and punctual rows become 0

df['late_count'] = pd.concat(days_due(group) for idx, group in df.groupby('id'))
id transactions_date membership_expire_date late_count
0 1 2016-11-16 2016-12-16 0
1 1 2016-12-15 2017-01-14 0
2 1 2017-01-15 2017-02-14 0
3 1 2017-02-16 2017-03-17 1
4 2 2015-01-31 2015-03-03 0
5 2 2015-02-28 2015-03-31 0
6 2 2015-04-05 2015-05-01 4
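The same logic can also be vectorized without the per-group function, using a grouped shift; a sketch, assuming both date columns are already datetime:
prev_expiry = df.groupby('id')['membership_expire_date'].shift()
late = (df['transactions_date'] - prev_expiry).dt.days - 1
df['late_count'] = late.where(late > 0, 0).astype(int)  # NaN (first row per id) and punctual rows become 0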
