Python resample to only keep every 5th day by group - python

I have a dataframe of daily stock observations, keyed by date and PERMNO (identifier). I want to resample the dataframe so that it only contains observations for every 5th trading day of each stock. The dataframe looks something like the below:
[10610 rows x 5 columns]
PERMNO date RET gret cumret_5d
0 10001.0 2010-01-04 -0.004856 0.995144 NaN
1 10001.0 2010-01-05 -0.005856 0.994144 NaN
2 10001.0 2010-01-06 0.011780 1.011780 NaN
3 10001.0 2010-01-07 -0.033940 0.966060 NaN
4 10001.0 2010-01-08 0.038150 1.038150 3.888603e-03
5 10001.0 2010-01-11 0.015470 1.015470 2.439321e-02
6 10001.0 2010-01-12 -0.004760 0.995240 2.552256e-02
7 10001.0 2010-01-13 -0.003350 0.996650 1.018706e-02
8 10001.0 2010-01-14 -0.001928 0.998072 4.366128e-02
9 10001.0 2010-01-15 -0.007730 0.992270 -2.462285e-03
10 10002.0 2010-01-05 -0.011690 0.988310 NaN
11 10002.0 2010-01-06 0.011826 1.011826 NaN
12 10002.0 2010-01-07 -0.021420 0.978580 NaN
13 10002.0 2010-01-08 0.004974 1.004974 NaN
14 10002.0 2010-01-11 -0.023760 0.976240 -3.992141e-02
15 10002.0 2010-01-12 0.002028 1.002028 -2.659527e-02
16 10002.0 2010-01-13 0.009780 1.009780 -2.856358e-02
17 10002.0 2010-01-14 0.017380 1.017380 9.953183e-03
18 10002.0 2010-01-15 -0.008865 0.991135 -3.954383e-03
19 10002.0 2010-02-18 -0.006958 0.993042 1.318849e-02
The result I want to produce is:
[10610 rows x 5 columns]
PERMNO date RET gret cumret_5d
4 10001.0 2010-01-08 0.038150 1.038150 3.888603e-03
9 10001.0 2010-01-15 -0.007730 0.992270 -2.462285e-03
13 10002.0 2010-01-08 0.004974 1.004974 NaN
18 10002.0 2010-01-15 -0.008865 0.991135 -3.954383e-03
I.e., I want to keep observations for the dates (2010-01-08), (2010-01-15), (2010-01-22), ... continuing up until today. The problem is that not every stock contains the same dates (some may have their first trading day in the middle of a month). Further, every 5th trading day is not consistently every 7th calendar day, because of holidays.
I have tried using
crsp_daily = crsp_daily.groupby('PERMNO').resample('5D',on='date')
This just returned a resampler object rather than the dataframe I wanted:
Out:
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, origin=start_day]
Any ideas on how to solve this problem?

You could loop through the unique values of PERMNO and, for each subset, use .iloc[::5] to get every 5th row, then concat the resulting DataFrames together:
dfs = []
for val in crsp_daily['PERMNO'].unique():
    dfs.append(crsp_daily[crsp_daily['PERMNO'] == val].iloc[::5])
result = pd.concat(dfs)
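A vectorized sketch of the same per-group selection (not the shared calendar-date approach the question ultimately wants), assuming crsp_daily is already sorted by date within each PERMNO:
# Number the rows within each PERMNO (0, 1, 2, ...) and keep every 5th one
pos = crsp_daily.groupby('PERMNO').cumcount()
result = crsp_daily[pos % 5 == 0]   # use pos % 5 == 4 to keep positions 4, 9, ... instead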

For future reference, I solved it by:
def remove_nonrebalancing_dates(df, gap):
    # build a frame with one row per unique date (the groupby keys) plus a throwaway column
    count = pd.DataFrame(df.set_index('date').groupby('date'), columns=['date', 'tmp']).reset_index()
    del count['tmp']
    # 1-based position of each date
    count['index'] = count['index'] + 1
    # keep every gap-th date
    count = count[count['index'].isin(range(gap, len(count['index']) + 1, gap))]
    # keep only the observations that fall on those dates
    df = df[df['date'].isin(count['date'])]
    return df
A dataframe containing only every 5th trading day can then be obtained as:
df = remove_nonrebalancing_dates(df,5)

Related

Combining 2 dataframes of different lengths by nearest observed datetime index

I am currently trying to find a way to merge specific rows of df2 into df1 based on their datetime indices, in a way that avoids lookahead bias, so that I can add external features (df2) to my main dataset (df1) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought is to do this with nested loops and if statements, but that would be too slow since the dataframes I am working with each have over 30000 rows. Is there a faster way of doing this?
df1
index a b
2015-06-02 16:00:00 0 5
2015-06-05 16:00:00 1 6
2015-06-06 16:00:00 2 7
2015-06-11 16:00:00 3 8
2015-06-12 16:00:00 4 9
df2
index c d
2015-06-02 9:03:00 10 16
2015-06-02 15:12:00 11 17
2015-06-02 16:07:00 12 18
... ... ...
2015-06-12 15:29:00 13 19
2015-06-12 16:02:00 14 20
2015-06-12 17:33:00 15 21
df_combined
(because the rows of df2 around 06-05, 06-06 and 06-11 are hidden by the ..., I just put NaN as the row values to make the example easier to interpret)
index a b c d
2015-06-02 16:00:00 0 5 11 17
2015-06-05 16:00:00 1 NaN NaN NaN
2015-06-06 16:00:00 2 NaN NaN NaN
2015-06-11 16:00:00 3 NaN NaN NaN
2015-06-12 16:00:00 4 9 13 19
df_combined.loc[0, ['c', 'd']] and df_combined.loc[4, ['c', 'd']] are 11,17 and 13,19 respectively instead of 12,18 and 14,20 to avoid lookahead bias because in a live scenario, those values haven't been observed yet.
IIUC, you need merge_asof. Assuming your indices are ordered in time, use direction='backward'.
print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
# a b c d
# 2015-06-02 16:00:00 0 5 11 17
# 2015-06-05 16:00:00 1 6 12 18
# 2015-06-06 16:00:00 2 7 12 18
# 2015-06-11 16:00:00 3 8 12 18
# 2015-06-12 16:00:00 4 9 13 19
Note that the rows for 06-05, 06-06 and 06-11 are not NaN; they carry the last values of df2 available before those dates in your given data (the row for 2015-06-02 16:07:00).
Note: if your dates are actually in a column named index rather than being the index itself, then do:
print(pd.merge_asof(df1, df2, on='index', direction='backward'))
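If you also want to exclude df2 rows timestamped exactly at a df1 timestamp (depending on how strictly you read "haven't been observed yet"), merge_asof has an allow_exact_matches parameter; a minimal sketch:
import pandas as pd

# Sketch only: same backward merge, but df2 rows with exactly matching timestamps
# are skipped, so only strictly earlier observations are pulled in
combined = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                         direction='backward', allow_exact_matches=False)
print(combined)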

Loop through rows of dataframe using re.compile().split()

I have a dataframe that consists of 1 column and several rows. Each of these rows is constructed in the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...
The timestamps have this format: YYYY-MM-DD HH:MM:SS and the values are number with 2 decimals.
I would like to make a new dataframe that has the individual timestamps in one row and the related values in the next row.
I managed to get the expected result linewise with regex but not for the entire dataframe.
My code so far:
#input dataframe
data.head()
values
0 2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1 2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2 2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...
test = data['values'].iloc[0] #first row of data
row1 = re.compile("(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)
df_row1.head()
values
0 2020-05-12 10:00:00
1 12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2 2020-05-12 10:00:01
3 12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...
#trying the same for the entire dataframe
for row in data:
    df_new = re.compile("(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)
    print(df_new)
['values']
My question now is how can I loop through the rows of my dataframe and get the expected result?
If you want to first split the lines and then extract the values into columns, note that you can use str.extract. With named groups in your regular expression, it will automatically assign the column names of your dataframe:
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"
df = pd.DataFrame([
    {"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67"},
    {"value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69"},
])
df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)
# date time value_one value_two value_three
# 0 2020-05-12 10:00:00 12.07 13 11.56
# 0 2020-06-12 11:00:00 13.07 16 11.16
# 0 2020-05-12 10:00:01 11.49 17 5.67
# 1 2020-05-13 10:00:00 14.07 13 15.56
# 1 2020-05-16 10:00:02 11.51 18 5.69
If you do not know the number of values after the date and time, use split rather than a regular expression with a fixed number of groups. I would suggest something like this:
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
df = pd.DataFrame([
    {"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67"},
    {"value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69"},
])
df = df["value"].str.split(split_line).explode().reset_index()
df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)
# col_0 col_1 col_2 col_3 col_4 col_5 col_6
# 0 2020-05-12 10:00:00 12.07 13 11.56 NaN NaN
# 1 2020-06-12 11:00:00 13.07 16 11.16 NaN NaN
# 2 2020-05-12 10:00:01 11.49 17 5.67 NaN NaN
# 3 2020-05-13 10:00:00 14.07 13 14 15 15.56
# 4 2020-05-16 10:00:02 11.51 18 5.69 NaN NaN
You don't need to loop through the rows to get the result. Instead, you can use Series.str.split to split the given series around a delimiter (in this case a regular expression), and then use DataFrame.explode to transform each element of the resulting list-likes into separate rows.
Use:
data["values"] = data["values"].str.split(r'\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})')
data = data.explode("values")
data["values"] = data["values"].str.split(r'(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s+')
data = data.explode("values").reset_index(drop=True)
print(data)
The resulting dataframe data should look like:
values
0 2020-05-12 10:00:00
1 12.07 13 11.56
2 2020-05-12 10:00:01
3 11.49 17 5.67
4 2020-05-12 10:01:00
5 11.49 17 5.67
6 2020-05-12 10:01:01
7 12.07 13 11.56
8 2020-05-12 10:02:00
9 14.29 18 11.28
10 2020-05-12 10:02:01
11 13.77 18 7.43
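If you then want each timestamp next to its values in two columns (rather than on alternating rows), here is a rough follow-up sketch, assuming the strict timestamp/values alternation shown above:
import pandas as pd

# Hypothetical reshaping step (not part of the original answer): pair every
# timestamp row with the values row that follows it
timestamps = data["values"].iloc[0::2].reset_index(drop=True)
values = data["values"].iloc[1::2].reset_index(drop=True)
paired = pd.DataFrame({"timestamp": timestamps, "values": values})
print(paired)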

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort for merge_asof and then merge to nearest to create a trip lookup table
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I chose to leave NaNs as NaNs because I don't like filling in default values (you can always do that later if you prefer, but you can't easily distinguish things that existed from things that didn't if you put defaults in early).
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
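To match the exact output asked for in the question, you could then tidy this up yourself. A rough sketch (the flattened column names are my own assumption), replacing NaT with the end-of-data date and missing counts/sums with 0:
import pandas as pd

# Hypothetical clean-up of df_merge from above (not part of the original answer)
df_merge.columns = ["CustID", "DateSent", "NextDateSentOrEndOfData",
                    "TripsBetween", "TotalSpendBetween"]
df_merge["NextDateSentOrEndOfData"] = df_merge["NextDateSentOrEndOfData"].fillna(pd.Timestamp("2018-04-01"))
df_merge[["TripsBetween", "TotalSpendBetween"]] = df_merge[["TripsBetween", "TotalSpendBetween"]].fillna(0)
print(df_merge)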
This would be an easy case for merge_asof if I had been able to handle max_date with it, so instead I go the long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index())
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN

Adding extra days for each month in pandas

In a pandas df, I have dates from a given month in the first column and an Amount in the second column. How can I add the days of that month that are not already there to the first column, with a value of 0 in the second column?
df = pd.DataFrame({
    'Date': ['5/23/2019', '5/9/2019'],
    'Amount': np.random.choice([10000])
})
I would like the result to look like the following:
Expected Output
Date Amount
0 5/01/2019 0
1 5/02/2019 0
.
.
. 5/23/2019 1000
. 5/24/2019 0
Look at date_range from pandas.
I'm assuming that 5/31/2019 is not in your expected output, as the comment asks about, because you only want the range between the min and max dates?
I convert the Date column to a datetime type, pass the min and max dates to date_range, store that in a dataframe, and then do a left join.
df['Date'] = pd.to_datetime(df['Date'])
date_range = pd.DataFrame(pd.date_range(start=df['Date'].min(), end=df['Date'].max()), columns=['Date'])
final_df = pd.merge(date_range, df, how='left')
Date Amount
0 2019-05-09 10000.0
1 2019-05-10 NaN
2 2019-05-11 NaN
3 2019-05-12 NaN
4 2019-05-13 NaN
5 2019-05-14 NaN
6 2019-05-15 NaN
7 2019-05-16 NaN
8 2019-05-17 NaN
9 2019-05-18 NaN
10 2019-05-19 NaN
11 2019-05-20 NaN
12 2019-05-21 NaN
13 2019-05-22 NaN
14 2019-05-23 10000.0
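If you do want the full calendar month (5/01 through 5/31) with zeros instead of NaN, as in the expected output, here is a rough variant of the same idea, assuming the data covers a single month:
import pandas as pd

# Hedged sketch: extend the range to the whole calendar month and fill the gaps with 0
start = df['Date'].min().replace(day=1)           # first day of the month
end = df['Date'].max() + pd.offsets.MonthEnd(0)   # last day of the month
date_range = pd.DataFrame(pd.date_range(start=start, end=end), columns=['Date'])
final_df = pd.merge(date_range, df, how='left').fillna({'Amount': 0})
print(final_df)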

Drop nan rows in pandas that are not in the middle

I have a pandas dataframe which is indexed by time.
For example:
Time Value
2010-01-01 nan
2010-01-02 nan
2010-01-03 3
2010-01-04 4
2010-01-05 5
2010-01-06 3
2010-01-07 nan
2010-01-08 nan
2010-01-09 3
2010-01-10 3
2010-01-11 4
2010-01-12 5
2010-01-13 3
2010-01-14 nan
2010-01-15 nan
In this example, I would like to drop the first two and the last two rows, but not the rows with nan in the middle. Is there a way to do this?
You can use the index of the first valid value and the last valid value to filter the dataframe:
df.loc[df.Value.first_valid_index(): df.Value.last_valid_index()]
Result:
Value
Time
2010-01-03 3.0
2010-01-04 4.0
2010-01-05 5.0
2010-01-06 3.0
2010-01-07 NaN
2010-01-08 NaN
2010-01-09 3.0
2010-01-10 3.0
2010-01-11 4.0
2010-01-12 5.0
2010-01-13 3.0
Supposing data is your dataframe:
a, b = data.dropna().index[[0, -1]]
You could also consider selecting a specific column, e.g. using data['Value'] instead of data.
This way you get the starting and ending indices not containing NaN. Then you just have to get that slice (being careful to include that last row):
data[a:b+1]
Result:
Time Value
2010-01-03 3
2010-01-04 4
2010-01-05 5
2010-01-06 3
2010-01-07 nan
2010-01-08 nan
2010-01-09 3
2010-01-10 3
2010-01-11 4
2010-01-12 5
2010-01-13 3
Single-row solution following #unutbu's tip to use loc:
data.loc[slice(*data.dropna().index[[0, -1]])]
Using bfill and ffill
df[df.Value.ffill().notnull() & df.Value.bfill().notnull()]
Out[464]:
Time Value
2 2010-01-03 3.0
3 2010-01-04 4.0
4 2010-01-05 5.0
5 2010-01-06 3.0
6 2010-01-07 NaN
7 2010-01-08 NaN
8 2010-01-09 3.0
9 2010-01-10 3.0
10 2010-01-11 4.0
11 2010-01-12 5.0
12 2010-01-13 3.0
