Change data interval and add the average to the original data - python

I have a dataset that includes a country's temperature in 2020 and the projected temperature rise by 2050. I'm hoping to create a dataset that assumes linear growth of temperature between 2020 and 2050 for this country. Take the sample df as an example. The temperature in 2020 for country A is 5 degrees; by 2050, the temperature is projected to rise by 3 degrees. In other words, the temperature would rise by 0.1 degrees per year.
Country Temperature 2020 Temperature 2050
A 5 3
The desired output is df2
Country Year Temperature
A 2020 5
A 2021 5.1
A 2022 5.2
I tried to use resample, but it seems to only work when the frequency is within a year (month, quarter). I also tried interpolate, but neither works:
df = df.reindex(pd.date_range(start='20211231', end='20501231', freq='12MS'))
df2 = df.interpolate(method='linear')

You can use something like this:
import numpy as np
import pandas as pd

def interpolate(df, start, stop):
    # One row per year; only the first and last rows are known up front.
    a = np.empty((stop - start, df.shape[0]))
    a[1:-1] = np.nan
    a[0] = df[f'Temperature {start}']
    a[-1] = df[f'Temperature {stop}']
    # Year-start timestamps: 2021-01-01 through 2050-01-01 (30 rows).
    df2 = pd.DataFrame(a, index=pd.date_range(start=f'{start+1}', end=f'{stop}', freq='YS'))
    # Fill the gap between the two endpoints linearly.
    return df2.interpolate(method='linear')

df = pd.DataFrame([["A", 5, 3]], columns=["Country", "Temperature 2020", "Temperature 2050"])
# The 2050 column holds the projected rise, so convert it to an absolute temperature.
df["Temperature 2050"] += df["Temperature 2020"]
print(interpolate(df, 2020, 2050))
This will output:
2021-01-01 5.000000
2022-01-01 5.103448
2023-01-01 5.206897
2024-01-01 5.310345
2025-01-01 5.413793
2026-01-01 5.517241
2027-01-01 5.620690
2028-01-01 5.724138
2029-01-01 5.827586
2030-01-01 5.931034
2031-01-01 6.034483
2032-01-01 6.137931
2033-01-01 6.241379
2034-01-01 6.344828
2035-01-01 6.448276
2036-01-01 6.551724
2037-01-01 6.655172
2038-01-01 6.758621
2039-01-01 6.862069
2040-01-01 6.965517
2041-01-01 7.068966
2042-01-01 7.172414
2043-01-01 7.275862
2044-01-01 7.379310
2045-01-01 7.482759
2046-01-01 7.586207
2047-01-01 7.689655
2048-01-01 7.793103
2049-01-01 7.896552
2050-01-01 8.000000
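Note that the output above starts in 2021 and spreads the 3-degree rise over 29 steps (about 0.103 per year). If you want exactly the desired df2, with 0.1-degree steps starting from 2020, a minimal sketch with np.linspace (assuming the question's column names) builds the long table directly:
import numpy as np
import pandas as pd

df = pd.DataFrame([["A", 5, 3]], columns=["Country", "Temperature 2020", "Temperature 2050"])

years = np.arange(2020, 2051)  # 31 yearly points, 2020 through 2050
parts = []
for _, row in df.iterrows():
    temps = np.linspace(row["Temperature 2020"],
                        row["Temperature 2020"] + row["Temperature 2050"],  # the 2050 column is a rise, not a level
                        len(years))
    parts.append(pd.DataFrame({"Country": row["Country"], "Year": years, "Temperature": temps}))
df2 = pd.concat(parts, ignore_index=True)
print(df2.head(3))  # A 2020 5.0 / A 2021 5.1 / A 2022 5.2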

Related

Pandas: Annualized Returns

I have a dataframe with quarterly returns of financial entities, and I want to calculate 1-, 3-, 5- and 10-year annualized returns. The formula for calculating annualized returns is:
R = product(1+r)^(4/N) - 1
where r are the quarterly returns of an entity and N is the number of quarters.
For example, the 3-year annualized return is:
R_3yr = product(1+r)^(4/12) -1 = ((1+r1)*(1+r2)*(1+r3)*...*(1+r12))^(1/3) -1
where r1, r2, r3 ... r12 are the quarterly returns of the previous 11 quarters plus the current quarter.
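As a quick sanity check, the formula is easy to express in numpy (the quarterly returns below are made-up example values, not my data):
import numpy as np
r = np.array([0.010, 0.020, -0.005, 0.015])  # N = 4 made-up quarterly returns
R = np.prod(1 + r) ** (4 / len(r)) - 1       # R = product(1+r)^(4/N) - 1
print(R)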
I created code which produces the right results, but it is very slow because it loops through each row of the dataframe. The code below is an extract of my code for the 1-year and 3-year annualized returns (I applied the same concept for 5-, 7-, 10-, 15- and 20-year returns). r_qrt is the field with the quarterly returns:
import pandas as pd
import numpy as np

# Create a dataframe to which I append the results
df_final = pd.DataFrame()
columns = ['Date', 'Entity', 'r_qrt', 'R_1yr', 'R_3yr']

# Loop through the dataframe
for row in df.itertuples():
    R_1yr = np.nan  # 1-year annualized return
    R_3yr = np.nan  # 3-year annualized return
    # Calculate 1-yr annualized return
    date_previous_period = row.Date + pd.DateOffset(years=-1)
    temp_table = df.loc[(df['Date'] > date_previous_period) &
                        (df['Date'] <= row.Date) &
                        (df['Entity'] == row.Entity)]
    if temp_table['r_qrt'].count() >= 4:
        b = (1 + (temp_table.r_qrt))[-4:].product()
        R_1yr = (b - 1)
    # Calculate 3-yr annualized return
    date_previous_period = row.Date + pd.DateOffset(years=-3)
    temp_table = df.loc[(df['Date'] > date_previous_period) &
                        (df['Date'] <= row.Date) &
                        (df['Entity'] == row.Entity)]
    if temp_table['r_qrt'].count() >= 12:
        b = (1 + (temp_table.r_qrt))[-12:].product()
        R_3yr = ((b ** (1 / 3)) - 1)
    d = [row.Date, row.Entity, row.r_qrt, R_1yr, R_3yr]
    df_final = df_final.append(pd.Series(d, index=columns), ignore_index=True)
df_final looks as below (reporting only the 1-year return results for space reasons):
Date        Entity  r_qrt     R_1yr
2015-03-31  A       0.035719  NaN
2015-06-30  A       0.031417  NaN
2015-09-30  A       0.030872  NaN
2015-12-31  A       0.029147  0.133335
2016-03-31  A       0.022100  0.118432
2016-06-30  A       0.020329  0.106408
2016-09-30  A       0.017676  0.092245
2016-12-31  A       0.017304  0.079676
2015-03-31  B       0.034705  NaN
2015-06-30  B       0.037772  NaN
2015-09-30  B       0.036726  NaN
2015-12-31  B       0.031889  0.148724
2016-03-31  B       0.029567  0.143020
2016-06-30  B       0.028958  0.133312
2016-09-30  B       0.028890  0.124746
2016-12-31  B       0.030389  0.123110
I am sure there is a more efficient way to run the same calculations, but I have not been able to find it. My code is not efficient and takes more than 2 hours for large dataframes with long time series and many entities.
Thanks
See https://www.investopedia.com/terms/a/annualized-total-return.asp for the definition of annualized return.
data = [3, 7, 5, 12, 1]

def annualize_rate(data):
    retVal = 0
    accum = 1
    for item in data:
        print(1 + (item / 100))
        accum *= 1 + (item / 100)
    retVal = pow(accum, 1 / len(data)) - 1
    return retVal

print(annualize_rate(data))
Output:
0.05533402290765199
2015 (A and B):
data = [0.133335, 0.148724]
print(annualize_rate(data))
Output:
0.001410292043902306
2016 (A and B):
data = [0.079676, 0.123110]
print(annualize_rate(data))
Output:
0.0010139064424810051
You can store each year's annualized value, then use pct_change to get a 3-year result:
import pandas as pd

data = [0.05, 0.06, 0.07]
df = pd.DataFrame({'Annualized': data})
df['Percent_Change'] = df['Annualized'].pct_change().fillna(0)
amount = 1
returns_plus_one = df['Percent_Change'] + 1
cumulative_return = returns_plus_one.cumprod()
df['Cumulative'] = cumulative_return.mul(amount)
df['2item'] = df['Cumulative'].rolling(window=2).mean()
df['2item'].plot()
print(df)
For future reference of other users, this is the new version of the code that I implemented following Golden Lion's suggestion:
def compoundfunct(arr):
    return np.prod(1 + arr) ** (4 / len(arr)) - 1

# 1-yr annualized return
df["R_1Yr"] = df.groupby('Entity')['r_qrt'].rolling(4).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity', axis=1)

# 3-yr annualized return
df["R_3Yr"] = df.groupby('Entity')['r_qrt'].rolling(12).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity', axis=1)
The previous code took 36.4 sec for a dataframe of 5,640 rows. The new code is more than 10x faster; it took 2.8 sec.
One of the issues with this new code is that one has to make sure rows are sorted by group (Entity in my case) and date before running the calculations, otherwise results could be wrong. A pre-sort such as the line below guards against that.
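Sketch, assuming the Entity and Date column names used above:
df = df.sort_values(['Entity', 'Date']).reset_index(drop=True)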
Thanks,
S.

Resample a pandas dataframe with multiple variables

I have a dataframe in long format with data on a 15 min interval for several variables. If I apply the resample method to get the average daily value, I get the average of all variables mixed together for a given time interval (and not a separate average for speed and for distance).
Does anyone know how to resample the dataframe and keep the 2 variables?
Note: the code below contains an EXAMPLE dataframe in long format; my real example loads data from csv and has different time intervals and frequencies for the variables, so I cannot simply resample the dataframe in wide format.
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = dti)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.5 hours)
df['distance'] = df['speed'] * 0.25
df.reset_index(inplace=True)
df2 = df.melt(id_vars='index')
df3 = df2.resample('d', on='index').mean()
IIUC:
>>> df.groupby(df.index.date).mean()
speed distance
2015-01-01 29.562500 7.390625
2015-01-02 31.885417 7.971354
2015-01-03 30.895833 7.723958
2015-01-04 30.489583 7.622396
2015-01-05 28.500000 7.125000
... ... ...
2015-12-27 28.552083 7.138021
2015-12-28 29.437500 7.359375
2015-12-29 29.479167 7.369792
2015-12-30 28.864583 7.216146
2015-12-31 48.000000 12.000000
[365 rows x 2 columns]
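If the data has to stay in long format (because, as in your real case, the variables have different frequencies), a sketch along these lines groups the melted df2 from the question on both the variable name and the day, then pivots back to one column per variable:
df3 = (
    df2.groupby(['variable', pd.Grouper(key='index', freq='D')])['value']
       .mean()
       .unstack('variable')
)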

Calculating growth rates on specific level of multilevel index in Pandas

I have a dataset that I want to use to calculate the average quarterly growth rate, broken down by each year in the dataset.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
This gives me the quarterly growth rates, grouped by year. So basically I want the geometric mean of (1.162409, 1.659756, 1.250600) for 2014, and likewise for the quarterly growth rates of every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.
I don't know what your data looks like, so I'm going to make some random sample data:
dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
Order Date Sales
0 2016-11-27 82.458720
1 2014-08-24 66.790309
2 2017-01-01 75.387001
3 2016-06-24 9.272712
4 2015-12-17 48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Q-over-Q growth rate
q = (q / q.shift()).fillna(1)
# Y-over-Y growth rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
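If scipy is not available, the geometric mean can be sketched with plain numpy, since gmean(x) = exp(mean(log(x))); this assumes the q series from above, whose values are positive growth ratios:
import numpy as np
y_alt = np.exp(np.log(q).groupby(pd.Grouper(freq='Y')).mean()) - 1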

How to generate a rolling mean for a specific date range and location with pandas

I have a large data set with names of stores, dates and profits.
My data set is not the most organized but I now have it in this df.
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into one df by itself, until I realized it would be very time consuming to do one for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " + str(2018)]
where n is the date I want to get, so that function gets me just the dates I want.
What I really need is a way to get all the dates I want into a list and to build a df for each day. I tried doing this, but it did not work out:
def avg(n):
    dlist = []
    for i in n:
        dlist = df.loc[df['Date'] == "May" + " " + str(i) + " " + str(2018)]
        dlist = pd.DataFrame(dlist)
        dlist.append(i)
    return dlist

df2 = avg([21, 23, 24, 25])
My goal was to get all the dates (21, 23, 24, 25) of May, each into its own series or df.
But it was a total failure; I got this error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure if it's also possible to add a rolling average or mean to columns for each day of (21, 23, 24, 25), but that's where the analysis will conclude.
Desired output:
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df which I can save to a csv file at the end.
Rolling Mean:
The example data given in the question has dates in the format May 1 2018, which can't be used for rolling; rolling requires a datetime index.
Instead of string splitting the original Date column, it should be converted to datetime, using df.Date = pd.to_datetime(df.Date), which will give dates in the format 2018-05-01
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get a Day and Month column, if desired.
Given the original data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average: in order to have a 30-day rolling mean, there needs to be more than 30 days of data for each store (i.e. on the 31st day, you get the first mean, for the previous 30 days).
The following example will setup a dataframe consisting of every day in 2018, a random profit between 100 and 1001, and a random store, chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note: the first 30 days for each store will be NaN.
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
    item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put them inside a dict.
This alternative makes it easier to plot a more detailed graph with less code.
This uses the same extended sample dataframe created above.
df_dict = dict()
for store in df.Store.unique():
    df_dict[store] = df[['Date', 'Profit']][df.Store == store]
    df_dict[store].set_index('Date', inplace=True)
    df_dict[store]['Profit: 30-Day Rolling Mean'] = df_dict[store].rolling(30).mean()
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    v.reset_index(inplace=True)
    may_df[k] = v[v.Date.dt.month_name() == 'May']
    may_df[k].set_index('Date', inplace=True)
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
pandas.DataFrame.to_csv()
# may_df is a dict of dataframes, so save each store's frame to its own file
for store, frame in may_df.items():
    frame.reset_index().to_csv(f'{store}_may.csv', index=False)
A simple solution may be groupby(). Check out this example:
import pandas as pd

listt = [['a', 2, 3],
         ['b', 5, 7],
         ['a', 3, 9],
         ['a', 1, 3],
         ['b', 9, 4],
         ['a', 4, 7],
         ['c', 7, 2],
         ['a', 2, 5],
         ['c', 4, 7],
         ['b', 5, 5]]
my_df = pd.DataFrame(listt)
my_df.columns = ['Class', 'Day_1', 'Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: similarly, you can group your data by Date and get the average of your Profit, as in the sketch below.
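A minimal sketch, assuming the Store, Date and Profit columns from the question:
# Average profit per date, across all stores
daily_avg = df.groupby('Date')['Profit'].mean()

# Or the average per store and date
store_daily_avg = df.groupby(['Store', 'Date'])['Profit'].mean()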

Pandas: Calculate average of values for a time frame

I am working on a large datasets that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
...
31.12.2018 23:59:59.000, 6.3498
The data is minute data, from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.
The next average will be from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, which is the average for 06.01.2018.
The next average will be from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, which is the average for 07.01.2018.
and so on... We are incrementing the day by 1, but calculating the average over the past 5 days, including the current date.
For a given day, there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with the time in the format [DD.MM.YYYY] (without hh:mm:ss) and the Value being the average of the 5 days up to and including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
...
31.12.2018, 6.3498
The bottom line is to calculate, for each day, the average of the data over the past 5 days, with the average value shown as above.
I tried iterating with a Python loop, but I want something better that can be done with Pandas.
Perhaps this will work?
import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1-minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
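As an alternative, a time-based rolling window avoids hard-coding 7200 rows and also copes with gaps in the data. A sketch, assuming the original minute-level frame with Time and Value columns is still available under a name like df_min (df itself was overwritten above):
daily_5d_avg = (
    df_min.set_index('Time')['Value']
          .rolling('5D')    # every row within the trailing 5 calendar days
          .mean()
          .resample('D')    # keep the last rolling value of each calendar day
          .last()
)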
