Bucketing dates and checking if between date range in pandas


How can I check which category a date falls into, based on the dates in the dates field? I cannot use merge_asof because the pandas version at work is only v0.18.
d = {'buckets': ['1D', '1W', '1M'], 'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
date_buckets = pd.DataFrame(data=d)
  buckets       dates
0      1D  03-05-2018
1      1W  10-05-2018
2      1M  03-06-2018
So, for example, given the date 07-05-2018, how can I return 1W? I need to do this for hundreds of rows, so it needs to be efficient.
thanks,

You can use pandas.cut for binning values:
import pandas as pd
d = {'buckets': ['1D', '1W', '1M'],
     'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
df_bin = pd.DataFrame(data=d)
df_bin['dates'] = pd.to_datetime(df_bin['dates'], dayfirst=True)\
                    .dt.strftime('%Y%m%d').astype(int)
df = pd.DataFrame({'date': ['07-05-2018']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)\
               .dt.strftime('%Y%m%d').astype(int)
df['Tenor'] = pd.cut(df['date'],
                     bins=df_bin['dates'],
                     labels=df_bin['buckets'].iloc[1:])
print(df)
       date Tenor
0  20180507    1W

Here's one way that could easily be extended to a larger set of dates to match:
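# date_buckets['dates'] must hold datetimes (not strings) for date_range and reindex to work
date_buckets['dates'] = pd.to_datetime(date_buckets['dates'], format="%d-%m-%Y")
scalar_date = pd.DataFrame(index=[pd.to_datetime("07-05-2018", format="%d-%m-%Y")])
scalar_date.join(date_buckets
                 .set_index('dates')
                 .reindex(pd.date_range(date_buckets.dates.min(),
                                        date_buckets.dates.max()),
                          method='bfill'))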
# buckets
# 2018-05-07 1W
The idea here is to resize your date_buckets dataframe (using .reindex with method='bfill'), so that you can easily join it to a dataframe with your lookup dates.
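For many lookup dates at once, a sorted search over the bucket dates is another option. The sketch below is not from the answers above: it uses numpy.searchsorted, the lookup dates are made up for illustration, and it assumes that a date equal to a bucket date belongs to that bucket. It is written against a current pandas and has not been checked on v0.18.

```python
import numpy as np
import pandas as pd

date_buckets = pd.DataFrame({
    "buckets": ["1D", "1W", "1M"],
    "dates": pd.to_datetime(["03-05-2018", "10-05-2018", "03-06-2018"],
                            format="%d-%m-%Y"),
}).sort_values("dates")

# Hypothetical lookup dates; the last one lies beyond the final bucket.
lookup = pd.to_datetime(
    pd.Series(["07-05-2018", "03-05-2018", "11-05-2018", "04-06-2018"]),
    format="%d-%m-%Y",
)

edges = date_buckets["dates"].to_numpy()
names = date_buckets["buckets"].to_numpy(dtype=object)

# side="left" returns the position of the first bucket date >= lookup date.
pos = np.searchsorted(edges, lookup.to_numpy(), side="left")

# Dates after the last bucket date get None instead of an IndexError.
in_range = pos < len(edges)
labels = np.where(in_range, names[np.minimum(pos, len(edges) - 1)], None)
print(labels.tolist())
```

The bucket dates must be sorted for searchsorted to give meaningful positions, which is why the frame is sorted first.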

Related

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope, the question is clear and gave you all the details.

Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd
df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
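As an alternative to the merge on a lagged date, the lag can also be taken on the aggregated tables with shift and then joined back. This is only a sketch: the column suffixes `_byDate_lag1` and `_bySector_lag1` are names chosen here for illustration, and the dates are parsed so that sorting is chronological.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y")

# Per-date means, shifted down one row so each date sees the previous date.
by_date = df.groupby("date")[["Value1", "Value2"]].mean().sort_index()
by_date_lag1 = by_date.shift(1).add_suffix("_byDate_lag1")

# Per-(date, Sector) means, shifted within each sector.
by_sector = df.groupby(["date", "Sector"])[["Value1", "Value2"]].mean().sort_index()
by_sector_lag1 = (by_sector.groupby(level="Sector").shift(1)
                  .add_suffix("_bySector_lag1"))

# Join the lagged aggregates back onto the original rows.
out = (df.join(by_date_lag1, on="date")
         .join(by_sector_lag1, on=["date", "Sector"]))
print(out)
```

Like the merge-based version, this lags by one observed date rather than one calendar day, so gaps in the dates do not matter. Further lags follow by changing the shift argument.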

I found one solution, but it is inefficient (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?

How to interpolate for x date using Python

I have df with the endDate column, which I want to use as my x array and discountFactor column as y array. The goal is to interpolate discountFactor at some date within the range of x array, let's say 2020-12-31 using cubic spline method.
The source df below:
lst = [['6M', '2020-10-26', '2021-04-26', '-0.521684','1.002611'], ['1Y', '2020-10-26', '2021-10-26', '-0.534855','1.005377'],
['5Y', '2020-10-26', '2025-10-27','-0.495927','1.025184']]
df = pd.DataFrame(lst, columns =['tenor', 'startDate', 'endDate','ratePercent','discountFactor'], dtype = float)
Please see the following steps, which unfortunately lead to a df whose columns are all NaN, even on the dates where I had known discountFactor values:
# Converting date column strings to date objects
df['endDate'] = pd.to_datetime(df['endDate'], format='%Y-%m-%d')
df['startDate'] = pd.to_datetime(df['startDate'], format='%Y-%m-%d')
# Setting endDate column as my index
df.set_index("endDate", inplace= True)
# Creating daily dates range between start and end dates from the df
dates_range = pd.date_range(start='2020-10-26', end='2025-10-27')
idx = pd.DatetimeIndex(dates_range)
df = df.reindex(idx)
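One possible approach, sketched here under assumptions, is to skip the daily reindex and fit a spline to the known points directly. The sketch uses scipy.interpolate.CubicSpline (SciPy is an extra dependency not mentioned in the question) and converts dates to day counts. Note that 2020-12-31 lies before the earliest endDate in the sample, so evaluating there would be an extrapolation; the target date below is a hypothetical one chosen inside the range.

```python
import pandas as pd
from scipy.interpolate import CubicSpline

# Known curve points from the question (endDate -> discountFactor).
curve = pd.DataFrame({
    "endDate": pd.to_datetime(["2021-04-26", "2021-10-26", "2025-10-27"]),
    "discountFactor": [1.002611, 1.005377, 1.025184],
}).sort_values("endDate")

# CubicSpline needs numeric, strictly increasing x values, so express
# each date as the number of days since the first endDate.
origin = curve["endDate"].iloc[0]
x = (curve["endDate"] - origin).dt.days.to_numpy()
y = curve["discountFactor"].to_numpy()
spline = CubicSpline(x, y)

# Hypothetical target date between the first two endDates.
target = pd.Timestamp("2021-06-30")
interpolated = float(spline((target - origin).days))
print(interpolated)
```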

Filter particular date in a DF column

I want to filter particular date in a DF column.
My code:
df
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
It shows no match.
Note: the column holds a time as well as a date, and the time needs to be trimmed.
Thanks in advance.

The following script assumes that the 'Crawl Date' column contains strings:
import pandas as pd
import datetime
column_names = ["Crawl Date"]
df = pd.DataFrame(columns = column_names)
#Populate dataframe with dates
df.loc[0] = ['03-21-2020 23:45:57']
df.loc[1] = ['03-22-2020 23:12:33']
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
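# .dt.date yields datetime.date objects, so compare against a datetime.date rather than a Timestamp
date=pd.to_datetime("03-21-2020").date()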
df=df[df["Crawl Date"]==date]
Then df returns:
   Crawl Date
0  2020-03-21
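Another way to trim the time, sketched here as an assumption rather than taken from the answer, is to keep the column as datetimes and compare the normalized (midnight) values with a Timestamp. The format string is inferred from the sample values above.

```python
import pandas as pd

df = pd.DataFrame({"Crawl Date": ["03-21-2020 23:45:57",
                                  "03-22-2020 23:12:33"]})
df["Crawl Date"] = pd.to_datetime(df["Crawl Date"],
                                  format="%m-%d-%Y %H:%M:%S")

target = pd.Timestamp("2020-03-21")

# dt.normalize() sets the time part to midnight but keeps the datetime dtype,
# so the comparison with a Timestamp is like-for-like.
matches = df[df["Crawl Date"].dt.normalize() == target]
print(matches)
```

The original timestamps stay intact in the filtered frame, which can be useful if the time of day is needed later.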

adding dates to a pandas data frame

I currently have a df in pandas with a variable called 'Dates' that records the date a complaint was filed.
data = pd.read_csv("filename.csv")
Dates
Initially Received
07-MAR-08
08-APR-08
19-MAY-08
As you can see there are missing dates between when complaints are filed, also multiple complaints may have been filed on the same day. Is there a way to fill in the missing days while keeping complaints that were filed on the same day the same?
I tried creating a new df with datetime and merging the dataframes together,
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df = pd.DataFrame(data=days)
df.index = range(3653)
dates = pd.merge(days, data['Dates'], how='inner')
but I get the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.tseries.index.DatetimeIndex'>
Here are the first four rows of data

You were close, there's an issue with your input
First do:
df = pd.read_csv('filename.csv', skiprows = 1)
Then
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df_clean = df.reset_index()
df_clean['idx dates'] = pd.to_datetime(df_clean['Initially Received'])
df2 = pd.DataFrame(data=days, index = range(3653), columns=['full dates'])
dates = pd.merge(df2, df_clean, left_on='full dates', right_on = 'idx dates', how='left')

Create your date range, and use merge to outer join it to the original dataframe, preserving duplicates.
import pandas as pd
from io import StringIO
TESTDATA = StringIO(
"""Dates;fruit
05-APR-08;apple
08-APR-08;banana
08-APR-08;pear
11-APR-08;grapefruit
""")
df = pd.read_csv(TESTDATA, sep=';', parse_dates=['Dates'])
dates = pd.date_range(start='04-APR-2008', end='12-APR-2008').to_frame()
pd.merge(
df, dates, left_on='Dates', right_on=0,
how='outer').sort_values(by=['Dates']).drop(columns=0)
# Dates fruit
# 2008-04-04 NaN
# 2008-04-05 apple
# 2008-04-06 NaN
# 2008-04-07 NaN
# 2008-04-08 banana
# 2008-04-08 pear
# 2008-04-09 NaN
# 2008-04-10 NaN
# 2008-04-11 grapefruit
# 2008-04-12 NaN

Create pandas column of pd.date_range

I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?

You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
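import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
                   'n':[10,7]})
# End of each range (exclusive): start date plus n days
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
# >= so that the start date itself counts as part of the range, as it does in pd.date_range
df.loc[(tgt[0] >= df.date) & (tgt[0] < df.daterange_end),"target_date"] = 1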
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
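The interval check above handles a single target date. A minimal sketch of extending it to a list of targets with NumPy broadcasting follows; the second target date is made up for illustration, and each range is treated as half-open (start included, start + n days excluded), matching what pd.date_range with periods=n would contain.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2018-08-25", "2018-07-21"]),
                   "n": [10, 7]})

# Hypothetical list of target dates.
targets = pd.to_datetime(["2018-08-26", "2018-07-28"])

# Column vectors of range starts and exclusive ends, one row per df row.
start = df["date"].to_numpy()[:, None]
end = (df["date"] + pd.to_timedelta(df["n"], unit="D")).to_numpy()[:, None]

# Row vector of targets; broadcasting yields a (rows x targets) boolean grid.
t = targets.to_numpy()[None, :]
hits = (t >= start) & (t < end)

# 1 if any target falls inside the row's range, else 0.
df["target_date"] = hits.any(axis=1).astype(int)
print(df)
```

The grid has one cell per (row, target) pair, so memory grows with the product of the two lengths.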

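You should add axis=1 in apply, and pass n as periods (the second positional argument of pd.date_range is end, not periods):
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='D'), axis=1)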

I came up with a solution that works (but I'm sure there's a nicer way...)
import numpy as np
# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
df['target_date'] = np.where(df[col].isin(tgt),
1,
df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if not i in new_cols]]
