I am required to calculate the 'time to maturity' by finding the difference between 'maturity' and 'trd_exctn_dt' as shown below.
If I had the following sample data:
cusip_id  trd_exctn_dt  maturity    time_to_maturity
0007TAA2  2015-01-26    2023-05-15  3031 days
0007TAA2  2015-03-26    2023-05-15  2972 days
0007TAA2  2015-05-01    2023-05-15  2936 days
0007TAA2  2015-07-27    2023-05-15  2849 days
My desired output would be:
cusip_id  trd_exctn_dt  maturity    time_to_maturity
0007TAA2  2015-05-01    2023-05-15  2936 days
For this specific cusip_id, the maturity date is in the 5th month, so I am looking for the trd_exctn_dt in the 5th month in order to calculate the time to maturity. However, I want to do this for several bond issues, where 'maturity' will not necessarily fall in the 5th month. For example, another bond issue may have a maturity date of 2023-11-06, in which case I would be looking for the trd_exctn_dt in the 11th month for that issue.
Any ideas on how I would do this would be much appreciated!
This solution assumes that you want every row where maturity month equals trd_exctn_dt month.
Code
import pandas as pd

df.columns = [c.strip() for c in df.columns]  # Remove whitespace from column names
# Convert both date columns to datetime
df['trd_exctn_dt'] = pd.to_datetime(df['trd_exctn_dt'])
df['maturity'] = pd.to_datetime(df['maturity'])
df['time_to_maturity'] = df['maturity'] - df['trd_exctn_dt']  # If you need to recalculate
df[df['trd_exctn_dt'].dt.month == df['maturity'].dt.month]  # Filter for the same month in both columns
Output
cusip_id trd_exctn_dt maturity time_to_maturity
2 0007TAA2 2015-05-01 2023-05-15 2936 days
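Note that because the comparison is row-wise, the same filter handles several bond issues at once: a bond maturing 2023-11-06 would keep only its trades executed in the 11th month.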
Related
Consider this sample data created by this code:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
print(f'randomlist length is {len(randomlist)}')
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng, 'Val': np.random.randn(len(rng))})
The desired output is a groupby on id, summing all values, but only within a particular date range of the Date column. Even more complicated than that, I want to see the total Val by id for dates in the following window:
For each id, start from the date one month later than its earliest date, and end one year after that starting date.
So, for example, if my data appeared this way:
id Date Val
0 684 2017-09-19 0.640472
1 684 2017-10-20 -0.732568
2 501 2017-08-21 -1.141365
3 501 2017-09-22 -0.283020
4 501 2017-09-23 0.725941
5 684 2017-09-24 0.56789
I would want the groupby to only consider the dates for id 684 between 2017-10-19 (i.e. one month after its earliest date) and 2018-10-19 (i.e. one year after that start date).
I have tried a straight groupby and Grouper to no avail; neither seems able to limit the consideration by date. Perhaps I am missing something easy? Thanks for taking a look.
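A minimal sketch of one way to do this (not from the original thread), assuming the test frame built above: derive each id's window from its earliest date, then sum Val inside that window (window_sum is an illustrative helper name).
def window_sum(g):
    # Window runs from one month after this id's earliest date through one year later.
    start = g['Date'].min() + pd.DateOffset(months=1)
    end = start + pd.DateOffset(years=1)
    return g.loc[g['Date'].between(start, end), 'Val'].sum()
result = test.groupby('id').apply(window_sum)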
I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute data from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the value for 05.01.2018
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the value for 06.01.2018
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the value for 07.01.2018
and so on... The day increments by 1, but each average covers the past 5 days, including the current date.
For a given day, there are 24 hours * 60 minutes = 1440 data points, so each average covers 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with Time in [DD.MM.YYYY] format (without hh:mm:ss) and Value being the average over the 5 days including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate, for each day, the average of the data from that day back over the past 5 days, with the result shown as above.
I tried iterating with a plain Python loop, but I wanted something better that pandas can do.
Perhaps this will work?
import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
We have some readily available sales data for certain periods, like 1 week, 1 month, ..., 1 year:
import pandas as pd

time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date 1 week, 1 month, ..., 1 year from now.
Then, I'd like to create another column, df['days_from_now'], holding how many days each pillar represents (1 week would be 7 days, 1 month around 30 days, ..., 1 year around 365 days).
The goal is then to use any day as input for a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are the sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from the original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta

def create_dates(df):
    df['date'] = [i.date() for i in
                  [d + delt for d, delt in zip([datetime.now()] * 4,
                                               [relativedelta(weeks=1), relativedelta(months=1),
                                                relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df

create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, 3 months, and 1 year from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
              [d + delt for d, delt in zip([datetime.now()] * 4,
                                           [relativedelta(weeks=1), relativedelta(months=1),
                                            relativedelta(months=3), relativedelta(years=1)])]]
This takes a list of today's date (datetime.now()) repeated 4 times, adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date part (i.date() for ...), and finally creates a new column from the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts those new dates from today's date. The result is a timedelta object, which pandas conveniently formats as "n days".
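As for the interpolation the question mentions, here is a hedged sketch that goes beyond the answer above: treat days_from_now as the x-axis and sales as the y-axis, and let np.interp do the linear interpolation (interpolate_sales is a hypothetical helper, not part of the original code).
import numpy as np

def interpolate_sales(df, target_date):
    # Days from today to the target date.
    days = (target_date - datetime.now().date()).days
    xs = [d.days for d in df['days_from_now']]  # e.g. 7, 30, 91, 365
    return np.interp(days, xs, df['sales'])

# e.g. a sales estimate for 4 October 2018 interpolates between the 3M and 1Y pillars:
# interpolate_sales(df, datetime(2018, 10, 4).date())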
I have a Pandas dataframe with an index of daily timesteps, just as below:
oldman.head()
Value
date
1992-01-01 1080.4
1992-01-02 1080.4
1992-01-03 1080.4
1992-01-04 1080.0
1992-01-05 1079.6
...
starting from 1992-01-01 and ending 2016-12-31. I want to extract weekly mean values for each year. However, my weeks should be defined in a special way: there should be 52 weeks in a 365-day year, but with the last week being 8 days long! The first week should start on January 1st of each year.
I am wondering how I am supposed to extract this kind of week from daily timestep data.
Thanks,
I modified COLDSPEED's solution a bit and added in the last week as 8 days. It's worth noting that in leap years that last "week" is actually 9 days. The following example will only work when you include whole years, because my function assumes the last row in the groupby is actually the last week of the year.
import pandas as pd

# Make some data
df = pd.DataFrame(index=pd.date_range("1992-1-1", "1992-12-31"))
df["value"] = 1
# Add a counting variable
df["count"] = 1

df = df.groupby(pd.Grouper(freq='Y'))\
       .resample('7D')\
       .sum()\
       .reset_index(level=0, drop=True)

def chop_last_week(df):
    # Fold the final short period into the previous week, then drop it.
    df1 = df.copy()
    df1.iloc[-2] += df1.iloc[-1]
    return df1.iloc[:-1]

df = df.groupby(df.index.year)\
       .apply(chop_last_week)\
       .reset_index(level=0, drop=True)

df["mean"] = df["value"] / df["count"]
df.tail(5)
It's not the cleanest solution but it runs quickly.
I am processing time-series data within a pandas dataframe. The datetime index is incomplete (i.e. some dates are missing).
I want to create a new column holding a datetime series at a 1 year offset, but containing only dates present in the original datetimeindex. The challenge is that in many cases the exact 1y match is not present in the index.
Index (Input) 1 year offset (Output)
1/2/2014 None
1/3/2014 None
1/6/2014 None
1/7/2014 None
1/9/2014 None
1/10/2014 None
1/2/2015 1/2/2014
1/5/2015 1/3/2014
1/6/2015 1/6/2014
1/7/2015 1/7/2014
1/8/2015 1/9/2014
1/9/2015 1/10/2014
The requirements are as follows:
Every date as of 1/2/2015 must have a corresponding offset date (no blanks)
Every date within the "offset date" group must also be present in the Index column (i.e. the introduction of new dates, like 1/8/2014, is not desired)
All offset dates must be ordered in an ascending way (the sequence of dates must be preserved)
What I have tried so far:
The DateOffset doesn't help, since it is insensitive to dates not present in the index.
The .shift method, data["1 year offset (Output)"] = data.Index.shift(365), doesn't help because the number of dates within the index differs across the years.
What I am trying to do now has several steps:
Apply the DateOffset method first to create "temp 1 year offset"
Remove single dates from "temp 1 year offset" that are not present in the datetimeindex (using set()) and replace those cells with NaN
Select dates in the datetimeindex whose "temp 1 year offset" is NaN and subtract one year
Map the dates from (3) to their closest date in the datetimeindex using argmin
The challenge here is that I am getting duplicate entries as well as a descending order of days in some cases. Those mess up the results in the following way (see the timedeltas between day n and day n+1):
Index (Input) 1 year offset (Output) Timedelta
4/17/2014 4/16/2014 1
4/22/2014 4/17/2014 1
4/23/2014 4/25/2014 8
4/24/2014 None
4/25/2014 4/22/2014 -3
4/28/2014 4/23/2014 1
4/29/2014 4/24/2014 1
4/30/2014 4/25/2014 1
In any case, this last approach seems like overkill given the simplicity of the underlying goal. Is there a faster and simpler way to do it?
How to group every date in an uneven pandas datetime series with the closest date one year ago in the same series?
This would be a way. However, look at this thread to properly handle adding 1 year when the year has 366 days:
Add one year in current date PYTHON
This code therefore needs some small modifications.
import pandas as pd
import datetime

df = pd.DataFrame(dict(dates=[
    '1/3/2014',
    '1/6/2014',
    '1/7/2014',
    '1/9/2014',
    '1/10/2014',
    '1/2/2015',
    '1/5/2015',
    '1/6/2015',
    '1/7/2015',
    '1/8/2015',
    '1/9/2015']))

# Convert column to datetime
df.dates = pd.to_datetime(df.dates)

# Store min(year) as a variable
minyear = min(df.dates).year

# Calculate the day with a timedelta of -365 days (might fail on leap years)
df['offset'] = [(i + datetime.timedelta(days=-365)).date()
                if i.year != minyear else None for i in df.dates]
df
Returns:
dates offset
0 2014-01-03 None
1 2014-01-06 None
2 2014-01-07 None
3 2014-01-09 None
4 2014-01-10 None
5 2015-01-02 2014-01-02
6 2015-01-05 2014-01-05
7 2015-01-06 2014-01-06
8 2015-01-07 2014-01-07
9 2015-01-08 2014-01-08
10 2015-01-09 2014-01-09
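One of the small modifications mentioned above could be snapping each raw offset to the nearest date actually present in the series. Here is a hedged sketch using pd.merge_asof; it goes beyond the answer above and does not by itself enforce the no-duplicates or ascending-order requirements (snapped_offset is an illustrative column name).
# Snap each raw offset to the nearest existing date in the series.
valid = df[df['offset'].notna()].copy()
valid['offset'] = pd.to_datetime(valid['offset'])
snapped = pd.merge_asof(
    valid.sort_values('offset'),
    df[['dates']].rename(columns={'dates': 'snapped_offset'}).sort_values('snapped_offset'),
    left_on='offset', right_on='snapped_offset',
    direction='nearest')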