How long are Pandas groupby objects remembered?

I have the following example Python 3.4 script. It does the following:
creates a dataframe,
converts the date variable to datetime64 format,
creates a groupby object based on two categorical variables,
produces a dataframe that contains a count of the number of items in each group,
merges the count dataframe back with the original dataframe to create a column containing the number of rows in each group,
creates a column containing the difference in dates between sequential rows.
Here is the script:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
This script produces the following output:
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 243 days 06:18:00
2 old 2015-06-04 12:34:00 female 2 3 NaT
3 old 2015-09-04 23:03:00 female 3 3 92 days 10:29:00
4 old 2015-04-21 12:59:00 female 6 3 -137 days +13:56:00
5 old 2015-12-04 01:00:00 male 4 6 NaT
6 old 2015-04-15 07:12:00 male 5 6 -233 days +06:12:00
7 old 2015-06-05 11:12:00 male 9 6 51 days 04:00:00
8 old 2015-05-19 19:22:00 male 12 6 -17 days +08:10:00
9 old 2015-04-06 12:57:00 male 15 6 -44 days +17:35:00
10 old 2015-06-15 03:23:00 male 17 6 69 days 14:26:00
11 young 2015-12-05 14:19:00 female 11 4 NaT
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 163 days 18:28:00
And this is exactly what I'd expect. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, the diff column comes out very differently:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
# ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT *****
# tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
And, this time the output is very different (and NOT what I wanted at all)
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 NaT
2 old 2015-06-04 12:34:00 female 2 3 92 days 10:29:00
3 old 2015-09-04 23:03:00 female 3 3 NaT
4 old 2015-04-21 12:59:00 female 6 3 -233 days +06:12:00
5 old 2015-12-04 01:00:00 male 4 6 -137 days +13:56:00
6 old 2015-04-15 07:12:00 male 5 6 NaT
7 old 2015-06-05 11:12:00 male 9 6 NaT
8 old 2015-05-19 19:22:00 male 12 6 51 days 04:00:00
9 old 2015-04-06 12:57:00 male 15 6 243 days 06:18:00
10 old 2015-06-15 03:23:00 male 17 6 NaT
11 young 2015-12-05 14:19:00 female 11 4 -17 days +08:10:00
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 -44 days +17:35:00
(In my real-life script the results seem to be a little erratic, sometimes it works and sometimes it doesn't. But in the above script, the different outputs seem to occur consistently.)
Why is it necessary to recreate the groupby object on what is, essentially, the same dataframe (albeit with an additional column added) immediately before using the .diff() function? This seems very dangerous to me.

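It is not the same dataframe: the merge returns a new dataframe with a new index, so the rows no longer sit under the same index labels. For example: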
tempDF.loc[1].id # before
10
tempDF.loc[1].id # after
2
So if you compute tempGroupby from the old tempDF, then change the index of tempDF, and then do this:
tempDF['diff'] = tempGroupby['date'].diff()
the index labels do not line up as you expect. The assignment aligns on index labels, so each row receives the difference computed for the row that had that label in the old tempDF.
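A minimal sketch of this effect on a toy dataframe (the names `df`, `stale`, `key` and `val` are made up for illustration and do not come from the script above). It assumes only that a groupby object keeps referring to the dataframe it was built from, and that column assignment aligns on index labels rather than on row position:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 10, 3, 30]})

# The groupby object is bound to the dataframe as it exists right now
stale = df.groupby("key")

# Sorting and resetting the index builds a NEW dataframe; 'stale' still
# refers to the old one, with the old index labels
df = df.sort_values(["key", "val"]).reset_index(drop=True)

# Result carries the OLD labels, so assignment puts values on the wrong rows
df["stale"] = stale["val"].diff()

# Rebuilding the groupby from the current dataframe gives matching labels
df["fresh"] = df.groupby("key")["val"].diff()

print(df)
```

In the original script the index changes because of the merge. Computing the count with `groupby(...)['id'].transform('count')` instead of a merge would leave the index untouched, which is one way to avoid the mismatch.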

Related

Pandas conditional outer join based on timedelta (merge_asof)

I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes I have gave an error ValueError: left keys must be sorted
As per this question the problem there was actually due to there being NaT values in the 'date' column for some rows. I dropped the rows with NaT values, and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd
df1['date'] = df1['tx_date']
df1['date'] = pd.to_datetime(df1['date'])
df1['date'] = df1['date'].dropna()
df1 = df1.sort_values('date')
df2['date'] = df2['rx_date']
df2['date'] = pd.to_datetime(df2['date'])
df2['date'] = df2['date'].dropna()
df2 = df2.sort_values('date')
df_merged = (pd.merge_asof(df1, df2, on='date', by='uid', tolerance=pd.Timedelta('14 days'))).sort_values('uid')
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so anywhere there's a row in df2 without a match on 'uid' and 'date' in df1 is lost (and it's not really clear from this simplified example, but I also need to add the rows back in where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
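One possible approach, sketched under assumptions rather than offered as a definitive answer: run merge_asof as a left join with direction='nearest', then append the right-hand rows that were never matched. The helper column `_rid` and the toy values below are hypothetical. Rows with NaT dates are not handled here; they would need to be set aside before the merge and concatenated back afterwards. Note also that a single right-hand row can be matched by more than one left-hand row.

```python
import pandas as pd

# Toy data (hypothetical values); only the columns needed for the join are kept
df1 = pd.DataFrame({
    "uid": [60, 60, 61],
    "tx_date": pd.to_datetime(["2004-01-11", "2016-12-24", "1994-05-05"]),
    "meas_1": [1.3, 2.4, 1.2],
})
df2 = pd.DataFrame({
    "uid": [60, 60, 60, 64],
    "rx_date": pd.to_datetime(["2004-01-14", "2017-01-05", "2017-03-31", "2002-04-18"]),
    "meas_2": ["A", "AB", "X", "B"],
})

# Tag each right-hand row so matched rows can be identified afterwards
df2 = df2.assign(_rid=range(len(df2)))

# merge_asof requires both key columns to be sorted and free of nulls
left = df1.sort_values("tx_date")
right = df2.sort_values("rx_date")

matched = pd.merge_asof(
    left, right,
    left_on="tx_date", right_on="rx_date",
    by="uid",
    tolerance=pd.Timedelta("14 days"),
    direction="nearest",
)

# Right-hand rows that no left-hand row picked up
leftover = right[~right["_rid"].isin(matched["_rid"].dropna())]

combined = (
    pd.concat([matched, leftover], ignore_index=True)
    .drop(columns="_rid")
    .sort_values(["uid", "tx_date", "rx_date"])
    .reset_index(drop=True)
)
print(combined)
```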

How to continue the week number when the year changes using pandas

Example: By using
df['Week_Number'] = df['Date'].dt.strftime('%U')
the week for 29/12/2019 is 52, and this week runs from 29/12/2019 to 04/01/2020.
But for 01/01/2020 the week comes out as 00.
I need the week for 01/01/2020 to be 52 as well, and the week from 05/01/2020 to 11/01/2020 to be 53, with the numbering continuing from there.

I solved this with the following logic.
First, write a function that creates a dataframe of dates from 2019-12-01 to 2020-01-31:
import pandas as pd

def create_date_table(start='2019-12-01', end='2020-01-31'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    # ISO week number (weeks start on Monday)
    df["Week_start_from_Monday"] = df.Date.dt.isocalendar().week
    # %U week number (weeks start on Sunday)
    df['Week_start_from_Sunday'] = df['Date'].dt.strftime('%U')
    return df
Run the function and observe the Dataframe
date_df=create_date_table()
date_df.head(n=40)
There are two week fields in the dataframe, Week_start_from_Monday and Week_start_from_Sunday. They differ in whether Monday or Sunday is counted as the first day of the week.
In this case, Week_start_from_Sunday is the one we need to focus on.
Now we write a function to add a column containing weeks continuing from last year, not reset to 00 when we enter a new year.
from datetime import datetime
from pandas import Timestamp

def add_continued_week_field(date: Timestamp, df_start_date: str = '2019-12-01') -> int:
    start_date = datetime.strptime(df_start_date, '%Y-%m-%d')
    year_of_start_date = start_date.year
    year_of_date = date.year
    week_of_date = date.strftime("%U")
    year_diff = year_of_date - year_of_start_date
    if year_diff == 0:
        continued_week = int(week_of_date)
    else:
        # Each elapsed year is counted as 52 weeks
        continued_week = year_diff * 52 + int(week_of_date)
    return continued_week
Let's apply the function add_continued_week_field to the dates' Dataframe.
date_df['Week_continue'] = date_df['Date'].apply(add_continued_week_field)
The dataframe now contains the new Week_continue field.

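As stated in converting a pandas date to week number, you can use df['Date'].dt.week to get week numbers (deprecated since pandas 1.1; df['Date'].dt.isocalendar().week is the replacement).
To let the numbering continue, you could add the last week number of the first year to the week values of the following year, something like this? I cannot test this right now...
week = df['Date'].dt.strftime('%U').astype(int)
years_elapsed = df['Date'].dt.year - df['Date'].dt.year.min()
# Last week number seen in the first year of the data
last = week[years_elapsed == 0].max()
# Later years continue counting from that week
df['Week_Number'] = week + last * years_elapsed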

You can do this with isoweek and isoyear.
I don't see how you arrive at the values you present with '%U' so I will assume that you want to map the week starting on Sunday 2019-12-29 ending on 2020-01-04 to 53, and that you want to map the following week to 54 and so on.
For weeks to continue past the year you need isoweek.
isocalendar() provides a tuple with isoweek in the second element and a corresponding unique isoyear in the first element.
But isoweek starts on Monday so we have to add one day so the Sunday is interpreted as Monday and counted to the right week.
2019 is subtracted so that years start from 0, then the year count is multiplied by 53 and the isoweek is added. Finally, 1 is subtracted so that the first week comes out as 53.
In [0]: s=pd.Series(["29/12/2019", "01/01/2020", "05/01/2020", "11/01/2020"])
dts = pd.to_datetime(s, dayfirst=True)
In [0]: (dts + pd.DateOffset(days=1)).apply(lambda x: (x.isocalendar()[0] -2019)*53 + x.isocalendar()[1] -1)
Out[0]:
0 53
1 53
2 54
3 54
dtype: int64
This of course assumes that all iso years have 53 weeks which is not the case, so instead you would want to compute the number of iso weeks per iso year since 2019 and sum those up.
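A minimal sketch of that last suggestion, using only the standard library. The function names and the base_year parameter are hypothetical. It relies on the fact that 28 December always falls in the last ISO week of its year, and it keeps the one-day shift used above so that weeks start on Sunday:

```python
from datetime import date, timedelta

def weeks_in_iso_year(year: int) -> int:
    # 28 December is always in the last ISO week of its ISO year
    return date(year, 12, 28).isocalendar()[1]

def continuous_week(d: date, base_year: int = 2019) -> int:
    # Shift by one day so that Sunday is treated as the start of the week
    shifted = d + timedelta(days=1)
    iso_year, iso_week, _ = shifted.isocalendar()
    # Sum the real number of ISO weeks in every ISO year before this one
    offset = sum(weeks_in_iso_year(y) for y in range(base_year, iso_year))
    return offset + iso_week

for d in [date(2019, 12, 29), date(2020, 1, 1), date(2020, 1, 5), date(2021, 1, 3)]:
    print(d, continuous_week(d))
```

With base_year=2019 this reproduces 53 and 54 for the dates shown above and keeps counting correctly across 2020, which has 53 ISO weeks.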

Maybe you are looking for this. I fixed an epoch; if you have dates earlier than 2019, you can choose another epoch.
import numpy as np
import pandas as pd

epoch = pd.Timestamp("2019-12-23")
# Test data:
df = pd.DataFrame({"Date": pd.date_range("2019-12-22", freq="1D", periods=25)})
df["Day_name"] = df.Date.dt.day_name()
# Calculation: ISO week up to the epoch, then whole weeks elapsed since the epoch plus 52
df["Week_Number"] = np.where(df.Date.le(epoch),
                             df.Date.dt.isocalendar().week.astype(int),
                             df.Date.sub(epoch).dt.days // 7 + 52)
df
Date Day_name Week_Number
0 2019-12-22 Sunday 51
1 2019-12-23 Monday 52
2 2019-12-24 Tuesday 52
3 2019-12-25 Wednesday 52
4 2019-12-26 Thursday 52
5 2019-12-27 Friday 52
6 2019-12-28 Saturday 52
7 2019-12-29 Sunday 52
8 2019-12-30 Monday 53
9 2019-12-31 Tuesday 53
10 2020-01-01 Wednesday 53
11 2020-01-02 Thursday 53
12 2020-01-03 Friday 53
13 2020-01-04 Saturday 53
14 2020-01-05 Sunday 53
15 2020-01-06 Monday 54
16 2020-01-07 Tuesday 54
17 2020-01-08 Wednesday 54
18 2020-01-09 Thursday 54
19 2020-01-10 Friday 54
20 2020-01-11 Saturday 54
21 2020-01-12 Sunday 54
22 2020-01-13 Monday 55
23 2020-01-14 Tuesday 55
24 2020-01-15 Wednesday 55

I got here wanting to know how to label consecutive weeks - I'm not sure if that's exactly what the question is asking but I think it might be. So here is what I came up with:
# Create dataframe with example dates
# It has a datetime index and a column with day of week (just to check that it's working)
dates = pd.date_range('2019-12-15','2020-01-10')
df = pd.DataFrame(dates.dayofweek,index=dates,columns=['dow'])
# Add column
# THESE ARE THE RELEVANT LINES
woy = df.index.isocalendar().week.to_numpy(dtype=int)  # weekofyear was removed in pandas 2.0
numbered = np.cumsum(np.diff(woy,prepend=woy[0])!=0)
# Append for easier comparison
df['week_num'] = numbered
df then looks like this:
dow week_num
2019-12-15 6 0
2019-12-16 0 1
2019-12-17 1 1
2019-12-18 2 1
2019-12-19 3 1
2019-12-20 4 1
2019-12-21 5 1
2019-12-22 6 1
2019-12-23 0 2
2019-12-24 1 2
2019-12-25 2 2
2019-12-26 3 2
2019-12-27 4 2
2019-12-28 5 2
2019-12-29 6 2
2019-12-30 0 3
2019-12-31 1 3
2020-01-01 2 3
2020-01-02 3 3
2020-01-03 4 3
2020-01-04 5 3
2020-01-05 6 3
2020-01-06 0 4
2020-01-07 1 4
2020-01-08 2 4
2020-01-09 3 4
2020-01-10 4 4

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would just be to create a new column that returns what the 2019 percentage change is for each corresponding "tdoy" (Trading Day of Year) using df.loc, and if I could figure this much out I could then create yet another column to do the simple difference between that year/day's percentage change to 2019's respective value. Below is what I try to use (and I've tried other variations) to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.

First step is to import the csv properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all years to be compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Ignore year filter above, create pivot table
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Create a loop going through the years/columns and create a new field for each year comparing to 2019.
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1

I also came up with my own answer more along the lines of what I was trying to originally accomplish. DataFrame I'll work with for the example. df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then map these keys with the tdoy column in the original DataFrame to create a column titled 2019 that has the corresponding 2019 percentage change value for that trading day
df['2019'] = df['tdoy'].map(perc19)
and then create a vs2019 column where I take the difference between 2019 and perc relative to the 2019 value, ((2019 - perc) / 2019), and square it, yielding
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
From here I can groupby in various ways and further calculate to find most similar trending percentage changes vs. the year I am comparing against (2019).
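The steps above can be sketched end to end on a few rows. This is a hedged reconstruction: the vs2019 formula is not shown in the post and is inferred here from the printed values, and the Series name `ref` is made up. Mapping with a Series indexed by tdoy behaves like the dictionary lookup above:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2016, 2016, 2019, 2019],
    "tdoy": [2, 3, 2, 3],
    "perc": [-0.020295, -0.010023, 0.012704, 0.010036],
})

# 2019 percentage change for each trading day of the year, indexed by tdoy
ref = df.loc[df["year"] == 2019].set_index("tdoy")["perc"]

# Look up the 2019 value for every row via its tdoy
df["2019"] = df["tdoy"].map(ref)

# Assumed formula: squared difference relative to the 2019 value
df["vs2019"] = ((df["2019"] - df["perc"]) / df["2019"]) ** 2

print(df)
```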

Pandas rolling apply to df where filter based on values in current row

I have a Pandas dataframe with a datetime column (that I've used as a DatetimeIndex) that has a categorical column, and a numerical column. I'd like to apply a complex function to the numerical column when the categorical column is the same as the current row, in a short (ten-day) window lagging the current row (non-inclusive).
As a contrived example:
import pandas as pd

names = ['steve', 'bob', 'harry', 'jeff'] * 5
df = pd.DataFrame(
    index=pd.date_range(start='2018-10-10', end='2018-10-29', freq='D'),
    data={'value': list(range(20)),
          'name': names
    }
)
produces a simple dataframe, to which I'd like to add another column (result) that calculates the number of rows * the sum of the values in 'value' (or something - just a formula that there's not a Pandas built-in function for). So for the dataframe above, I'd like the following:
value name result
2018-10-10 0 steve NaN
2018-10-11 1 bob NaN
2018-10-12 2 harry NaN
2018-10-13 3 jeff NaN
2018-10-14 4 steve 0
2018-10-15 5 bob 1
2018-10-16 6 harry 2
2018-10-17 7 jeff 3
2018-10-18 8 steve 8
2018-10-19 9 bob 12
2018-10-20 10 harry 16
2018-10-21 11 jeff 20
2018-10-22 12 steve 24
2018-10-23 13 bob 28
2018-10-24 14 harry 32
2018-10-25 15 jeff 36
2018-10-26 16 steve 40
2018-10-27 17 bob 44
2018-10-28 18 harry 48
2018-10-29 19 jeff 52
I can write my own function for this and use it in pandas.apply:
from datetime import timedelta

def rolling_apply(df, time, window_size=timedelta(days=10)):
    event_time = time
    event_name = df[df.index == time]['name'].iloc[0]
    # Rows with the same name in the window [event_time - window_size, event_time)
    return df[
        (df['name'] == event_name) &
        (df.index < event_time) &
        (df.index >= event_time - window_size)
    ]
df['result'] = df.apply(lambda x: rolling_apply(df, x.name)['value'].sum() * len(rolling_apply(df, x.name)), axis=1)
but performance gets pretty terrible pretty quickly as my data grows. pandas.rolling.apply seems sort of appropriate, but I can't quite make it fit what I want to do.
Any suggestions or help would be very much appreciated!
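One direction that may be worth trying, sketched here under assumptions rather than as a tested solution for the real data: group by the categorical column and use a time-based rolling window with closed='left', which covers the ten days before each row and excludes the row itself. The custom formula goes into apply. The variable `roll` is a made-up name, and the index is assumed to be a sorted DatetimeIndex with unique timestamps.

```python
import pandas as pd

names = ["steve", "bob", "harry", "jeff"] * 5
df = pd.DataFrame(
    index=pd.date_range(start="2018-10-10", end="2018-10-29", freq="D"),
    data={"value": list(range(20)), "name": names},
)

# Per-name window covering [t - 10 days, t), i.e. excluding the current row
roll = df.groupby("name")["value"].rolling("10D", closed="left")

# Custom formula: number of rows in the window times the sum of their values.
# raw=True passes a NumPy array to the function; empty windows give NaN.
result = roll.apply(lambda w: w.sum() * len(w), raw=True)

# Drop the 'name' level so the result aligns with the original DatetimeIndex
df["result"] = result.reset_index(level=0, drop=True)

print(df)
```

Because the assignment aligns on the index, duplicate timestamps would need a different way of joining the result back.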

Pandas/Python Pulling end of month rows from dataframe into separate dataframe

Currently I have a time series data frame as follows:
dfMain =
Date Portfolio Value
0 2016-07-01 1.000000e+06
1 2016-07-08 1.025168e+06
2 2016-07-15 1.028053e+06
3 2016-07-22 1.024184e+06
4 2016-07-29 1.022491e+06
5 2016-08-05 1.023241e+06
6 2016-08-12 1.030325e+06
7 2016-08-19 1.032742e+06
8 2016-08-26 1.032567e+06
9 2016-09-02 1.028614e+06
10 2016-09-09 9.930876e+05
11 2016-09-16 9.956875e+05
12 2016-09-23 1.010174e+06
13 2016-09-30 1.010388e+06
14 2016-10-07 1.004989e+06
15 2016-10-14 9.924929e+05
16 2016-10-21 9.969708e+05
17 2016-10-28 9.816373e+05
18 2016-11-04 9.563689e+05
19 2016-11-11 9.869579e+05
20 2016-11-18 9.936929e+05
21 2016-11-25 1.009625e+06
Given that the dataframe can vary (so I can't just pull specific rows as in the example), what would be the best way to pull the rows whose dates are closest to the end of each month? For example, index 4 would be pulled because it is the date closest to the end of its month.
Any tips would be greatly appreciated!

Group on the month number and find the last record:
df.Date = pd.to_datetime(df.Date, errors='coerce')
df.groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
If rows aren't sorted by Date, call sort_values first:
df.sort_values('Date').groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
Should work in any case.
If you have dates spanning multiple years, better to groupby on the year-month:
df.sort_values('Date').groupby([df.Date.dt.year, df.Date.dt.month]).last()
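As an alternative sketch of the same year-month grouping, a single monthly period key can stand in for the (year, month) pair. The toy values and the key name "Month" are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2016-07-29", "2016-07-01", "2016-08-26", "2017-07-28", "2016-08-05"]
    ),
    "Value": [4, 1, 3, 5, 2],
})

# One key per calendar month, so July 2016 and July 2017 stay separate
month_key = df["Date"].dt.to_period("M").rename("Month")

# Sort by date, then take the last row of each month
month_end = df.sort_values("Date").groupby(month_key).last()

print(month_end)
```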

You need to sort the dates and then find the last value for each group.
df['Date'] = pd.to_datetime(df['Date'])
grp = df.sort_values('Date').groupby(df['Date'].dt.month)
pd.DataFrame([grp.get_group(x).iloc[-1] for x in grp.groups])
Output:
Date Portfolio Value
4 2016-07-29 1022491.0
8 2016-08-26 1032567.0
13 2016-09-30 1010388.0
17 2016-10-28 981637.3
21 2016-11-25 1009625.0
