How to interpolate for an x date using Python

I have a df with an endDate column, which I want to use as my x array, and a discountFactor column as my y array. The goal is to interpolate discountFactor at some date within the range of the x array, say 2020-12-31, using the cubic spline method.
The source df below:
lst = [['6M', '2020-10-26', '2021-04-26', '-0.521684', '1.002611'],
       ['1Y', '2020-10-26', '2021-10-26', '-0.534855', '1.005377'],
       ['5Y', '2020-10-26', '2025-10-27', '-0.495927', '1.025184']]
df = pd.DataFrame(lst, columns=['tenor', 'startDate', 'endDate', 'ratePercent', 'discountFactor'], dtype=float)
Please see the following steps, which unfortunately lead to a df with all-NaN columns, even where I had known discountFactor values:
# Converting date column strings to date objects
df['endDate'] = pd.to_datetime(df['endDate'], format='%Y-%m-%d')
df['startDate'] = pd.to_datetime(df['startDate'], format='%Y-%m-%d')
# Setting endDate column as my index
df.set_index("endDate", inplace= True)
# Creating daily dates range between start and end dates from the df
dates_range = pd.date_range(start='2020-10-26', end='2025-10-27')
idx = pd.DatetimeIndex(dates_range)
df = df.reindex(idx)
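
For reference, a minimal sketch of how the cubic-spline value at 2020-12-31 could be computed once endDate is parsed; it uses scipy's CubicSpline on a numeric day axis (scipy and the variable names here are assumptions, not part of the original post). Note that 2020-12-31 actually falls before the earliest endDate, so with these three points the spline is extrapolating rather than interpolating:
import pandas as pd
from scipy.interpolate import CubicSpline

lst = [['6M', '2020-10-26', '2021-04-26', -0.521684, 1.002611],
       ['1Y', '2020-10-26', '2021-10-26', -0.534855, 1.005377],
       ['5Y', '2020-10-26', '2025-10-27', -0.495927, 1.025184]]
df = pd.DataFrame(lst, columns=['tenor', 'startDate', 'endDate', 'ratePercent', 'discountFactor'])
df['endDate'] = pd.to_datetime(df['endDate'], format='%Y-%m-%d')

# CubicSpline needs a numeric x axis, so express the end dates as days since the first end date
x = (df['endDate'] - df['endDate'].min()).dt.days.to_numpy()
y = df['discountFactor'].to_numpy()
spline = CubicSpline(x, y)

# evaluate at the target date (extrapolation here, since it precedes the first endDate)
target = (pd.Timestamp('2020-12-31') - df['endDate'].min()).days
print(float(spline(target)))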

Related

How do I distribute weekly average data into daily data so that every day of the week eventually has its own index, in Python

I have a text file which is as shown as below:
07-31-1995:1.179
08-07-1995:1.174
08-14-1995:1.172
08-21-1995:1.171
08-28-1995:1.163
09-04-1995:1.16
09-11-1995:1.15
09-18-1995:1.157
09-25-1995:1.156
10-02-1995:1.151
10-09-1995:1.14
10-16-1995:1.133
10-23-1995:1.125
This file is named bill.txt. I want to create a new file which will have data as shown in the following:
07-31-1995:1.179
08-01-1995:1.179
08-02-1995:1.179
08-03-1995:1.179
08-04-1995:1.179
08-05-1995:1.179
08-06-1995:1.179
08-07-1995:1.174
08-08-1995:1.174
08-09-1995:1.174
08-10-1995:1.174
08-11-1995:1.174
08-12-1995:1.174
08-13-1995:1.174
08-14-1995:1.172
08-15-1995:1.172
08-16-1995:1.172
08-17-1995:1.172
08-18-1995:1.172
08-19-1995:1.172
08-20-1995:1.172
08-21-1995:1.171
08-22-1995:1.171
.
.
.
I have tried creating a dataframe and using pandas, numpy, _datetime and also .ffill, resample and so on, but I have been getting errors again and again.
import pandas as pd

df = pd.read_csv('original_dates.txt', header=None)
df.columns = ['original']
# split the date from the trailing number
df[['date', 'number']] = df['original'].str.split(':', expand=True)
# parse the dates first so min/max are chronological, then build the full daily range
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')
dates = pd.date_range(df.date.min(), df.date.max())
# reindex to fill in the missing dates
df.set_index('date', inplace=True)
reindexed_df = df.reindex(dates)
# reset the index and return the date column to a string
reindexed_df.reset_index(inplace=True)
reindexed_df['date'] = reindexed_df['index'].dt.strftime('%m-%d-%Y')
# forward fill the missing numbers
reindexed_df['number'] = reindexed_df['number'].ffill()
# drop must be assigned back (or use inplace=True), otherwise it has no effect
reindexed_df = reindexed_df.drop(['original', 'index'], axis=1)
# join the columns back together
reindexed_df['new_col'] = reindexed_df[['date', 'number']].apply(lambda x: ':'.join(x), axis=1)
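To produce the new file the question asks for, the joined column can then be written out line by line (the output filename below is only an example, not from the original post):
# write one "date:number" line per day to a new text file (the filename is illustrative)
reindexed_df['new_col'].to_csv('bill_daily.txt', index=False, header=False)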

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector, Value2_bySector, Value1_byDate, Value2_byDate.
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically, in Value1_by_Date_lag1 the date "15/03" will contain the value "281.75", which is the value for the date "14/03" (a lag of 1 shift).
Basically, in Value1_bySector_lag1 the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value from the "14/03" and "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date; if not, you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
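If the data might not be sorted, a small sketch of that preliminary step (it assumes the dd/mm/yy format shown in the sample):
# parse the dd/mm/yy strings so the sort is chronological, then sort before building the lagged date
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')
df = df.sort_values('date').reset_index(drop=True)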
I found one inefficient solution (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    temp.date = temp.date.shift(-1-i)
    df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
    temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")[["Value1_bySector","Value2_bySector"]].shift(1+1)
    df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?

Data manipulation with date in DataFrame in Python Pandas?

I have a DataFrame like below:
df = pd.DataFrame({"data" : ["25.01.2020", and many more other dates...]})
df["data"] = pd.to_datetime(df["data"], format = "%d%m%Y")
And I have a series of special dates like below:
special_date = pd.Series(pd.to_datetime(["16.01.2020",
"27.01.2020",
and many more other dates...], dayfirst=True))
And I need to calculate 2 more columns in this DataFrame:
col1 = number of weeks to the next special date
col2 = number of weeks after the last special date
So I need results like below:
col1 = 1 because the next special date after 25.01 is 27.01, so it is the same week
col2 = 2 because the last special date before 25.01 is 16.01, so it is 2 weeks ago
*Please be aware that I have many more dates, so the code needs to work for more than only 2 special dates or only 1 date in df.
You can use broadcasting to create a matrix of time deltas and then calculate the minima for your new columns:
import numpy as np, pandas as pd
df = pd.DataFrame({'data': pd.to_datetime(["01.01.2020","25.01.2020","20.02.2020"], dayfirst=True)})
s = pd.Series(pd.to_datetime(["16.01.2020","27.01.2020","08.02.2020","19.02.2020"], dayfirst=True))
delta = (s.to_numpy()[:,None] - df['data'].to_numpy()).astype('timedelta64[D]') / np.timedelta64(1, 'D')
n = np.min( delta, 0, where=delta> 0, initial=np.inf)
p = np.min(-delta, 0, where=delta<=0, initial=np.inf)
df['next'] = np.ceil(n/7) #consider np.floor
df['prev'] = np.ceil(p/7)
Alternatively to using the where argument you could perform the steps by hand:
n = delta.copy(); n[delta<=0] = np.inf; n = np.abs(np.min(n,0))
p = delta.copy(); p[delta> 0] = -np.inf; p = np.abs(np.min(-p,0))
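For the three sample dates above, this should give roughly the following (inf marks rows with no special date on that side; the printout is my addition, not from the original answer):
print(df)
#         data  next  prev
# 0 2020-01-01   3.0   inf
# 1 2020-01-25   1.0   2.0
# 2 2020-02-20   inf   1.0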

Create pandas column of pd.date_range

I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
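If there are several target dates, the same start/end comparison can be broadcast over all of them at once; a rough sketch (the second target date below is made up for illustration):
# compare every row's (date, daterange_end) window against every target date at once
targets = pd.to_datetime([dt.datetime(2018, 8, 26), dt.datetime(2018, 7, 22)])
starts = df['date'].to_numpy()[:, None]                          # shape (n_rows, 1)
ends = pd.to_datetime(df['daterange_end']).to_numpy()[:, None]
hits = (targets.to_numpy() > starts) & (targets.to_numpy() < ends)
df['target_date'] = hits.any(axis=1).astype(int)                 # 1 if any target falls in the row's range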
You should add axis=1 in the apply, and pass n via the periods keyword (the second positional argument of pd.date_range is end, not periods):
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='d'), axis=1)
I came up with a solution that works (but I'm sure there's a nicer way...)
import numpy as np

# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
    df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
    df['target_date'] = np.where(df[col].isin(tgt), 1, df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if i not in new_cols]]

Bucketing dates and checking if between date range in pandas

How can I check what category a date falls into if it lies between the dates in the dates field? I cannot use merge_asof, as at work pandas is only v0.18.
d = {'buckets': ['1D', '1W', '1M'], 'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
date_buckets = pd.DataFrame(data=d)
buckets dates
0 1D 03-05-2018
1 1W 10-05-2018
2 1M 03-06-2018
So, for example, given the date 07-05-2018, how can I return 1W? I would need to do this for hundreds of rows, so it would need to be efficient.
thanks,
You can use pandas.cut for binning values:
import pandas as pd
d = {'buckets': ['1D', '1W', '1M'],
'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
df_bin = pd.DataFrame(data=d)
df_bin['dates'] = pd.to_datetime(df_bin['dates'], dayfirst=True)\
.dt.strftime('%Y%m%d').astype(int)
df = pd.DataFrame({'date': ['07-05-2018']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)\
.dt.strftime('%Y%m%d').astype(int)
df['Tenor'] = pd.cut(df['date'],
bins=df_bin['dates'],
labels=df_bin['buckets'].iloc[1:])
print(df)
date Tenor
0 20180507 1W
Here's one way that could easily be extended to a larger set of dates to match:
# parse the bucket dates first so min/max and the reindex work on real datetimes
date_buckets['dates'] = pd.to_datetime(date_buckets['dates'], dayfirst=True)
scalar_date = pd.DataFrame(index=[pd.to_datetime("07-05-2018", format="%d-%m-%Y")])
scalar_date.join(date_buckets. \
    set_index('dates'). \
    reindex(pd.date_range(date_buckets.dates.min(), \
                          date_buckets.dates.max()), \
            method='bfill'))
# buckets
# 2018-05-07 1W
The idea here is to resize your date_buckets dataframe (using .reindex with method='bfill'), so that you can easily join it to a dataframe with your lookup dates.
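As a quick illustration of that extension, the same join works for a whole set of lookup dates (this assumes date_buckets['dates'] has already been parsed to datetimes as above; the second lookup date is made up for illustration):
# join several lookup dates against the daily, back-filled bucket table
lookup_dates = pd.DataFrame(index=pd.to_datetime(["07-05-2018", "15-05-2018"], format="%d-%m-%Y"))
daily_buckets = date_buckets.set_index('dates').reindex(
    pd.date_range(date_buckets.dates.min(), date_buckets.dates.max()), method='bfill')
print(lookup_dates.join(daily_buckets))
#            buckets
# 2018-05-07      1W
# 2018-05-15      1M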
