Interpolate a date between two other dates to get a value - python

I have this pandas dataframe:
ISIN MATURITY PRICE
0 AR121489 Corp 29/09/2019 5.300
1 AR714081 Corp 29/12/2019 7.500
2 AT452141 Corp 29/06/2020 2.950
3 QJ100923 Corp 29/09/2020 6.662
My question is whether there is a way to interpolate a date in the "MATURITY" column and get the price for that date. For example, if I select the date 18/11/2019, the price on that date should be between 5.300 and 7.500. I don't know if what I am asking is possible, but thank you so much for taking the time to read this and trying to help me.

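If you want prices interpolated at a daily frequency, index the data by maturity date, reindex it onto a daily date range spanning your start and end dates, and then interpolate the gaps.
old_df["MATURITY"] = pd.to_datetime(old_df["MATURITY"], format="%d/%m/%Y")
new_df = old_df.set_index("MATURITY")[["PRICE"]]
# daily range from the first to the last maturity; days without a quote become NaN
new_df = new_df.reindex(pd.date_range(start="2019-09-29", end="2020-09-29", freq="D"))
new_df["PRICE"] = new_df["PRICE"].interpolate(method="linear")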

I would treat the dates as datetime objects and, for the interpolation, convert each one to a numeric time interval, i.e. either seconds or days elapsed since a reference such as 20XX-XX-XX 00:00:00. I would do the same for the dates at which you want output values. After that, the interpolation also works with NumPy's interp function.
matplotlib.dates also provides date2num and num2date, which are worth trying for this conversion.
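
A minimal sketch of that idea, assuming the question's four rows and using days since the earliest maturity as the numeric axis. The helper name price_on is hypothetical, and np.interp expects the maturities to be sorted in increasing order, as they are here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "MATURITY": ["29/09/2019", "29/12/2019", "29/06/2020", "29/09/2020"],
    "PRICE": [5.300, 7.500, 2.950, 6.662],
})
df["MATURITY"] = pd.to_datetime(df["MATURITY"], format="%d/%m/%Y")

def price_on(date_str):
    """Linearly interpolate PRICE at a day-first date string (hypothetical helper)."""
    target = pd.to_datetime(date_str, format="%d/%m/%Y")
    origin = df["MATURITY"].min()
    # Convert every date to "days since the first maturity".
    x = (df["MATURITY"] - origin).dt.days.to_numpy()
    y = df["PRICE"].to_numpy()
    return float(np.interp((target - origin).days, x, y))

price = price_on("18/11/2019")
print(price)
```

For 18/11/2019 the result lies between 5.300 and 7.500, as the question expects, because the date falls between the first two maturities.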

Related

Sorting dataframe rows by Day of Date wise

I have built my dataframe, but I want to sort it by date. For example, I want the data for 02.01.2016 to come right after 01.01.2016.
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': ['sum']})
df_data_2311 = pd.DataFrame(df_data_2311)
After running this, I got the below output. This dataframe has 2192 rows.
Wind Offshore in [MW]
sum
Date
01.01.2016 5249.75
01.01.2017 12941.75
01.01.2018 19020.00
01.01.2019 13723.00
01.01.2020 17246.25
... ...
31.12.2017 21322.50
31.12.2018 13951.75
31.12.2019 21457.25
31.12.2020 16491.25
31.12.2021 35683.25
Kindly let me know how I can sort this data chronologically by date.
You can use the sort_values function in pandas.
df_data_2311.sort_values(by=["Date"])
However, Date is the index of the grouped dataframe and its values are strings. To sort by the Date column chronologically, you first need to call reset_index() on the grouped dataframe and then convert the date values to datetime with pandas.to_datetime.
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': ['sum']}).reset_index()
df_data_2311["Date"] = pd.to_datetime(df_data_2311["Date"], format="%d.%m.%Y")
df_data_2311 = df_data_2311.sort_values(by=["Date"])
I recommend reviewing the pandas docs.
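
As a self-contained illustration of the same steps, here is a sketch on a few made-up rows (the values are invented for the example). It uses a plain sum rather than agg with a list, which is an assumption made to keep the column labels flat.

```python
import pandas as pd

# Invented sample rows with dd.mm.yyyy date strings.
df = pd.DataFrame({
    "Date": ["01.01.2017", "02.01.2016", "01.01.2016", "31.12.2016"],
    "Wind Offshore in [MW]": [10.0, 20.0, 30.0, 40.0],
})

# Aggregate per date, keeping Date as a regular column.
daily = df.groupby("Date", as_index=False)["Wind Offshore in [MW]"].sum()

# Strings sort lexicographically, so convert to real dates before sorting.
daily["Date"] = pd.to_datetime(daily["Date"], format="%d.%m.%Y")
daily = daily.sort_values("Date").reset_index(drop=True)

ordered = daily["Date"].dt.strftime("%d.%m.%Y").tolist()
print(ordered)
```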

The matplotlib chart changes when I change the index in python pandas dataframe

I have a dataset of S&P500 historical prices with the date, the price, and other data that I don't need to solve my problem.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so that the difference between the first and the second graph (which follows) is easier to see.
But then I tried to replace the default index (0, 1, 2, etc.) with the df["Date"] column of the DataFrame in order to show the dates on the x axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
I have the Date where it should be in the x axis but the problem is that the graph is different from the one before which is the right one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could cause the problem. If you look closely at the dates, you can see that month 10 is written as a float in the original dataset like this: for the year 1884, October is 1884.1. The problem occurs when you use pd.to_datetime() to transform the float Date series to a datetime. The month-10 date, when converted, becomes (continuing the example) 1884-01-01, which is the first month of the year, and this distorts the final plot.
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" to the Date (as a string) wherever its length is 6, in order to change, for example: 1884.1 ==> 1884.10
df["Date"] = df["Date"].astype(str)  # make sure Date is a string before measuring its length
df["len"] = df["Date"].str.len()
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I changed the "Date" to a Datetime with pd.to_datetime
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
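
The pitfall described in the UPDATE can be reproduced on three values. This is a small sketch with made-up sample floats; the two-decimal formatting is an alternative to the string padding above that achieves the same result.

```python
import pandas as pd

# Floats in the dataset's year.month convention: September, October, November 1884.
raw = pd.Series([1884.09, 1884.10, 1884.11])

# Converting the float to a string drops the trailing zero: 1884.10 becomes "1884.1".
naive = pd.to_datetime(raw.astype(str), format="%Y.%m")

# Formatting with exactly two decimals keeps "1884.10".
padded = pd.to_datetime(raw.map("{:.2f}".format), format="%Y.%m")

naive_months = naive.dt.month.tolist()
padded_months = padded.dt.month.tolist()
print(naive_months, padded_months)
```

In the naive conversion October is read as month 1, while the padded version keeps it as month 10.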
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
df["Date"] = pd.to_datetime(df["Date"].map("{:.2f}".format), format='%Y.%m')  # two decimals keep October as "1981.10" rather than "1981.1"
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index running from 0 to a little over 1800. Each observation is therefore evenly spaced on the x axis (at an interval of 1, which is the jump from one index value to the next). That's why the chart looks reasonable, even though the x-axis values are row numbers rather than dates.
Now when you set the Date (as float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982. You have evenly spaced values from 1981.01 to 1981.12, but nothing between 1981.12 and 1982.01. The second chart is therefore also plotted exactly as the data dictates: the twelve months of each year are squeezed together and followed by a long gap. Setting the index to a DatetimeIndex as described above should remove this issue, as Matplotlib will know how to evenly space the dates along the x-axis.
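
The uneven spacing is easy to see by differencing a few of these float dates. This is a sketch on made-up sample values, not the full dataset.

```python
import numpy as np

# October 1981 through February 1982 in the dataset's year.month float convention.
dates = np.array([1981.10, 1981.11, 1981.12, 1982.01, 1982.02])

# Distance between consecutive x positions when the floats are used as the axis.
gaps = np.round(np.diff(dates), 2)
print(gaps)
```

Consecutive months are 0.01 apart, but the step from December to January is 0.89, which is what stretches the line across each year boundary.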
I think your problem is that your Date is of float type, and using it as the x-axis does exactly what is expected for an array of the kind ([2012.01, 2012.02, ..., 2012.12, 2013.01....]). You might convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as float. It should be converted into pandas datetime64[ns]. This can be achieved using pandas pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df=df[['Date','Price']]
df.dropna(inplace=True)
#converting to pandas datetime format
#format with two decimals so that October (1981.1 as a float) stays "1981.10"
df['Date'] = df['Date'].astype(float).map('{:.2f}'.format)
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m')
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()

Grouping by week and product and summing over IDs in pandas

I have a pandas dataframe containing amongst others the columns Product_ID, Producttype and a Timestamp. It looks roughly like this:
df
ID    Product  Time
C561  PX       2017-01-01 00:00:00
T801  PT       2017-01-01 00:00:01
I already converted the Time column into the datetime format.
Now I would like to sum up the number of different IDs per Product in a particular week.
I already tried a for loop:
for data['Time'] in range(start='1/1/2017', end='8/1/2017'):
data.groupby('Product')['ID'].sum()
But range requires an integer.
I also thought about using pd.Grouper with freq="1W" but then I don't know how to combine it with both Product and ID.
Any help is greatly appreciated!
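
One possible way to combine pd.Grouper with the Product column is sketched below, assuming the column names from the sample above and that "number of different IDs" means a distinct count. The extra rows are invented for illustration.

```python
import pandas as pd

# Sample rows: the first two come from the question, the rest are invented.
data = pd.DataFrame({
    "ID": ["C561", "T801", "C561", "C999", "T801"],
    "Product": ["PX", "PT", "PX", "PX", "PT"],
    "Time": pd.to_datetime([
        "2017-01-01 00:00:00",
        "2017-01-01 00:00:01",
        "2017-01-03 10:00:00",
        "2017-01-04 12:00:00",
        "2017-01-05 09:30:00",
    ]),
})

# Group by calendar week (weeks ending on Sunday) and by product,
# then count the distinct IDs in each group.
weekly_counts = (
    data.groupby([pd.Grouper(key="Time", freq="W"), "Product"])["ID"].nunique()
)
print(weekly_counts)
```

The result is a Series indexed by (week end date, Product). Replacing nunique() with count() would count rows instead of distinct IDs.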

pandas: MultiIndex not showing when plotting DataFrame

I am plotting the following pandas MultiIndex DataFrame:
print(log_returns_weekly.head())
AAPL MSFT TSLA FB GOOGL
Date Date
2016 1 -0.079078 0.005278 -0.155689 0.093245 0.002512
2 -0.001288 -0.072344 0.003811 -0.048291 -0.059711
3 0.119746 0.082036 0.179948 0.064994 0.061744
4 -0.150731 -0.102087 0.046722 0.030044 -0.074852
5 0.069314 0.067842 -0.075598 0.010407 0.056264
with the first sub-index representing the year, and the second one the week from that specific year.
This is simply achieved via the pandas plot() method; however, as seen below, the x-axis will not be in a (year, week) format i.e. (2016, 1), (2016, 2) etc. Instead, it simply shows 'Date,Date' - does anyone therefore know how I can overcome this issue?
log_returns_weekly.plot(figsize=(8,8))
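You need to convert your MultiIndex to a single datetime index by attaching a day to each (year, week) pair, so that the first entry becomes a date such as 2016-01-04. Note that pd.datetime has been removed from recent pandas versions, and a week number cannot be passed where a month is expected. The version below assumes the weeks are ISO week numbers and uses the Monday of each week (it needs import datetime and Python 3.8+).
log1 = log_returns_weekly.copy()
log1.index = pd.to_datetime([datetime.date.fromisocalendar(int(y), int(w), 1) for y, w in log1.index])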
log1.plot()
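
If the goal is literally to see year and week labels on the axis, another option is to flatten the MultiIndex into strings. This is a sketch under that assumption rather than part of the answer above; the label format "2016-W01" and the level names are arbitrary choices.

```python
import pandas as pd

# Small stand-in for log_returns_weekly with a (year, week) MultiIndex.
index = pd.MultiIndex.from_tuples(
    [(2016, 1), (2016, 2), (2016, 3)], names=["Year", "Week"]
)
log_returns_weekly = pd.DataFrame(
    {"AAPL": [-0.079078, -0.001288, 0.119746]}, index=index
)

# Build one string label per (year, week) pair.
flat = log_returns_weekly.copy()
flat.index = [f"{year}-W{week:02d}" for year, week in flat.index]

labels = list(flat.index)
print(labels)
```

Calling flat.plot() then uses these strings as tick labels, at the cost of losing a true time axis.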

Changing frequency of pandas Period and PeriodIndex

I am importing some stock data that has annual report information into a pandas DataFrame. But the date for the annual report end date is an odd month (end of january) rather than end of year.
years = ['2017-01-31', '2016-01-31', '2015-01-31']
df = pd.DataFrame(data = years, columns = ['years'])
df
Out[357]:
years
0 2017-01-31
1 2016-01-31
2 2015-01-31
When I try to add in a PeriodIndex which shows the period of time the report data is valid for, it defaults to ending in December rather than inferring it from the date string
df.index = pd.PeriodIndex(df['years'], freq ='A')
df.index
Out[367]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-DEC]',
name='years', freq='A-DEC')
Note that the frequency should be 'A-JAN'.
I assume this means that the end date can't be inferred from PeriodIndex and the end date string I gave it.
I can change it using the asfreq method and anchored offsets, with "A-JAN" as the frequency string. But this changes all of the individual periods in the PeriodIndex at once rather than individually, and different years can have different reporting end dates for their annual report (in the case of a company that changed its reporting period).
Is there a way to interpret each date string and correctly set each period for each row in my pandas frame?
My end goal is to set a period column or index that has a frequency of 'annual' but with the period end date set to the date from the corresponding row of the years column.
** Expanding this question a bit further. Consider that I have many stocks with 3-4 years of annual financial data for each and all with varying start and end dates for their annual reporting frequencies (or quarterly for that matter).
Out[14]:
years tickers
0 2017-01-31 PG
1 2016-01-31 PG
2 2015-01-31 PG
3 2017-05-31 T
4 2016-05-31 T
5 2015-05-31 T
What I'm trying to get to is a column with proper Period objects that are configured with the proper end dates (from the years column), all with annual frequencies. I've thought about iterating through the years and using apply, map, or a lambda function with the pd.Period function. It may be that a PeriodIndex can't hold Period objects with varying end dates. Something like:
s = []
for row in df.years:
    s.append(pd.Period(row, freq='A'))
df['period'] = s
@KRkirov got me thinking. It appears the Period constructor is not smart enough to set the end date of the frequency by reading the date string. I was able to get the frequency end date right by building up an anchor string from the end date of the reporting period as follows:
# the years column must be datetime for the .dt accessor to work
df['years'] = pd.to_datetime(df['years'])
# return a month in 3 letter abbreviation format (eg. "JAN")
df['offset'] = df['years'].dt.strftime('%b').str.upper()
# now build up an anchor offset string (eg. "A-JAN" )
# for quarterly report (eg. "Q-JAN") for q report ending January for year
df['offset_strings'] = "A" + '-' + df.offset
Anchor strings are documented in the pandas docs here.
Then iterate through the rows of the DataFrame to construct each Period and put it in a list, and add the list of Period objects to a column.
ps = []
for i, r in df.iterrows():
    p = pd.Period(r['years'], freq=r['offset_strings'])
    ps.append(p)
df['period'] = ps
This returns a column of Period objects with the frequencies set correctly. Assigning it to the index produces a proper PeriodIndex:
df['period']
Out[40]:
0 2017
1 2016
2 2015
Name: period, dtype: object
df['period'][0]
Out[41]: Period('2017', 'A-JAN')
df.index = df.period
df.index
Out[43]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-JAN]',
name='period', freq='A-JAN')
Not pretty, but I could not find another way.
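
A more compact variant of the same idea is sketched below. It swaps the anchor strings for pd.offsets.YearEnd objects (my substitution, not the code above), which avoids depending on the spelling of the annual alias in a given pandas version. The sample rows mirror the PG and T example.

```python
import pandas as pd

df = pd.DataFrame({
    "years": pd.to_datetime(["2017-01-31", "2016-01-31", "2017-05-31", "2016-05-31"]),
    "tickers": ["PG", "PG", "T", "T"],
})

# One annual Period per row, anchored on the month of that row's report date.
df["period"] = [
    pd.Period(ts, freq=pd.offsets.YearEnd(month=ts.month)) for ts in df["years"]
]

# Each period should end on the reporting date it was built from.
period_ends = [p.end_time.normalize() for p in df["period"]]
print(df)
```

Because the rows carry different anchors, the column has object dtype; a single PeriodIndex can only hold one frequency.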
