append value from loop linear regression in a certain position in dataframe - python

I am trying to store values from a rolling-window linear regression. Each value should go into a specific position of my DataFrame: with a 2300 x 2300 df, the first regression value belongs in the first column at the 228th row, the second at the 229th row, and so on.
My code is below.
Any help is more than welcome.
df_rolling_tstat  # 2300 x 2300 dataframe
for i in range(len(Switz_fund_ret.iloc[1:, 1:2].columns)):
    s = Switz_fund_ret.loc[birth_date[i], :]
    start = s['contatore']
    e = Switz_fund_ret.loc[death_date[i], :]
    end = e['contatore']
    window = 12
    for j in range(end-start):
        roll_one = Switz_fund_ret[i].iloc[start+j:start+window+j]
        # market excess return while the fund was active
        roll_two = Switz_fund_ret[2308].iloc[start+j:start+window+j]
        # risk-free rate while the fund was active
        roll_three = Switz_fund_ret[2309].iloc[start+j:start+window+j]
        roll_excess_return_fund = roll_one - roll_three
        roll_two = sm.add_constant(roll_two)
        roll_y = np.array(roll_excess_return_fund, dtype=float)
        roll_x = np.array(roll_two, dtype=float)
        roll_model = sm.OLS(roll_y, roll_x).fit()
        roll_reg.append(roll_model)
        alpha_roll.append(roll_model.params[0])
        t_stat_roll.append(roll_model.tvalues[0])
        p_value_roll.append(roll_model.pvalues[0])
For instance, I would like to take roll_model.pvalues[0] from the first regression and put it in the first column of df at the 228th row. Then, for the second regression, I want to store roll_model.pvalues[0] at the 229th row, and so on.
Many thanks.
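One way to put each regression result at a given row is to assign into the target by position with .iloc (or by label with .loc) instead of appending to a list. The sketch below only illustrates that idea under stated assumptions: the data are synthetic, NumPy least squares stands in for statsmodels OLS, and the helper rolling_alpha and the column name fund_0 are hypothetical. The same assignment would work for roll_model.pvalues[0].

```python
import numpy as np
import pandas as pd


def rolling_alpha(fund_excess, market_excess, window):
    """Rolling OLS intercept, stored at the last row of each window (hypothetical helper)."""
    out = pd.Series(np.nan, index=fund_excess.index, dtype=float)
    for j in range(len(fund_excess) - window + 1):
        y = fund_excess.iloc[j:j + window].to_numpy(dtype=float)
        x = market_excess.iloc[j:j + window].to_numpy(dtype=float)
        X = np.column_stack([np.ones(window), x])  # constant + regressor
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        out.iloc[j + window - 1] = coef[0]  # positional write into the result
    return out


# Synthetic data (assumption): an exact linear relation with alpha = 0.5
rng = np.random.default_rng(0)
market = pd.Series(rng.normal(size=30))
fund = 0.5 + 2.0 * market

# Target frame, pre-filled with NaN; the fund is "alive" from row 5 onwards
results = pd.DataFrame(np.nan, index=range(30), columns=["fund_0"])
start = 5
alphas = rolling_alpha(fund.iloc[start:], market.iloc[start:], window=12)

# Label-aligned assignment: each value lands on the row it was computed for
results.loc[alphas.index, "fund_0"] = alphas
```

Here the first value lands at row start + window - 1; if your convention is a different row (for example the 228th), only the index used in the assignment changes.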

Related

How to save two variables in the same csv file with different time steps

I am trying to save time series of two variables that have different forecast steps. How can I modify the code below so that both variables are saved in the same csv file? One of them starts the cycle at forecast hour 000 and the other at 003.
When I try to save, the following error occurs: IndexError: index 112 is out of bounds for axis 0 with size 112. One variable has 112 time steps and the other has 114.
lat = GFS.variables['latitude'][:]
lon = GFS.variables['longitude'][:]
times = GFS['valid_time'][:]
time_cycle = radiation['valid_time'][:]
unit = GFS['time'].units
step = GFS['step']
for key, value in stations.iterrows():
    #print(key,value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station,".csv")
    #print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Find the latitude/longitude grid point closest to the station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[14:18]),int(unit[19:21]),int(unit[22:24]),int(unit[25:27]))
    # start with empty lists for every station
    date_range = list()
    step_data = list()
    rad_data = list()
    pblh_data = list()
    for index, time in enumerate(times):
        date_time = ref_date+datetime.timedelta(seconds=int(time))
        date_range.append(date_time)
        step_data.append(step[index].values)
        pblh_data.append(hpbl[index, min_index_lat, min_index_lon].values)
    # RAD has its own, shorter time axis (time_cycle)
    for index_rad, rad_time in enumerate(time_cycle):
        rad_data.append(radiation[index_rad, min_index_lat, min_index_lon].values)
    #print(date_range)
    df = pd.DataFrame(date_range, columns = ["Date-Time"])
    df["Date-Time"] = date_range
    df = df.set_index(["Date-Time"])
    df["Forecast ({})".format('valid time')] = step_data
    df["RAD ({})".format('W m**-2')] = rad_data
    df["PBLH ({})".format('m')] = pblh_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out,file_name), sep=';',encoding="utf-8", index=True)
    #df.to_parquet(os.path.join(dir_out,file_name),
    #              engine='auto',
    #              compression='default',
    #              write_index=True,
    #              overwrite=True,
    #              append=False)
print("\n!!Successfully saved all the time series to the output directory!!\n{}".format(dir_out))
That is, the PBLH variable has 114 time steps while the RAD variable has 112, but I would like to save both in the same csv file. How should I modify the loops over times (PBLH) and time_cycle (RAD) so that both variables end up in the same csv?
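One possible approach is to give each variable its own datetime index and let pandas align them, so the shorter series gets NaN where it has no value. The sketch below assumes made-up values and valid times (the variable names pblh, rad and the 3-hourly spacing are illustrative, not taken from the GFS files).

```python
import pandas as pd

# Hypothetical valid times: PBLH covers six steps, RAD starts two steps later
pblh_times = pd.date_range("2022-01-01 00:00", periods=6, freq=pd.Timedelta(hours=3))
rad_times = pblh_times[2:]

# Each variable is a Series indexed by its own valid times
pblh = pd.Series([100.0, 120.0, 150.0, 180.0, 200.0, 210.0],
                 index=pblh_times, name="PBLH (m)")
rad = pd.Series([50.0, 120.0, 200.0, 90.0],
                index=rad_times, name="RAD (W m**-2)")

# concat along columns aligns on the index; missing RAD steps become NaN
df = pd.concat([pblh, rad], axis=1)
df.index.name = "Date-Time"

# Same separator as in the question; returns the csv as a string when no path is given
csv_text = df.to_csv(sep=";")
```

In the original loop this would mean building one Series per variable from its own valid_time values rather than filling a single DataFrame from lists of different lengths.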

Calculating returns with short positions (backtest)

My goal is to write a function that returns a vector of portfolio returns for each period (i.e. day) from a pandas dataframe of security prices. For simplicity, let's assume that the initial weights are equally split between securities A and B. Prices are given by the following dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=20)
prices = pd.DataFrame({'A': np.linspace(20, 50, num=20),
                       'B': np.linspace(100, 200, num=20)},
                      index=dates)
Further, we assume that we open a short position in asset A and go long asset B.
Calculating discrete returns from a "zero-investment position" such as a short position (here in asset A), and then computing overall portfolio returns from the weighted returns of the individual assets, is not trivial. My attempt so far does not work correctly: the key problem is that the loss from the short position in asset A exceeds -100% on 2013-01-14. Before I post it, I would be grateful for any kind of help, whether theoretical or code.
You are forgetting asset “C”, as in collateral. No matter how generous your broker might be (not!), most exchanges and national regulatory organizations would require collateral. You might read about some wealthy hedge fund guy doing these trades with just long and short positions, but when it goes south you also read about the HF guy losing his art collection, which was the collateral.
Equity margin requirements in the USA would require 50% collateral to start, and at least 25% maintenance margin while the short trades were open. This is enforced by exchanges and regulatory authorities. Treasury bonds might have more favorable requirements, but even then the margin (collateral) is not zero.
Since a long/short doubles your risk (what happens if the long position goes down, and short position goes up?), your broker likely would require more margin than the minimum.
Add asset “C”, collateral, to your calculations and the portfolio returns become straightforward.
Thank you for your answer, @Stripedbass. Based on your comment, can the return process of a portfolio consisting of the two stocks be described by the following equations?
The terms with + and - are the market values of the long and short position respectively such that the difference of them represents the net value of the two positions. If we assume that we want to be "market neutral" at the beginning of the trade t=0, the net value is zero.
For t > 0 these net positions represent the unrealised gains or losses of the long and short position that were opened and have not yet been closed. The term C denotes the money that we actually hold. It consists of the initial collateral and the cumulative gains and losses from the stock positions.
The overall return per period from trading the two securities is then calculated as the simple return of the account V.
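One reading of the account described above, as a minimal sketch using the prices from the question. The assumptions are mine: one unit of collateral, long and short legs each sized to one unit of notional at t=0 (so the net position starts at zero), and no margin interest, fees, or rebalancing. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=20)
prices = pd.DataFrame({"A": np.linspace(20, 50, num=20),
                       "B": np.linspace(100, 200, num=20)},
                      index=dates)

collateral = 1.0  # assumed initial cash held in the account

# Market value of each leg, scaled so that both are worth 1 at t=0
long_value = prices["B"] / prices["B"].iloc[0]   # long asset B
short_value = prices["A"] / prices["A"].iloc[0]  # short asset A

# Account value: collateral plus the net unrealised gain/loss of the two legs
account = collateral + (long_value - short_value)

# Portfolio return per period is the simple return of the account
returns = account.pct_change()
```

Because the return is measured against the whole account rather than against the short leg alone, a single period's loss stays above -100% for as long as the account value remains positive.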
Based on this, you could define the following function and, for short positions, choose the option type='shares':
def weighted_return(type, df, weights):
    capital = 100
    # given the input dataframe contains return series
    if type == "returns":
        # create price indices
        df.fillna(0, inplace=True)
        df_price_index = pd.DataFrame(index=df.index, columns=df.columns, dtype=float)
        df_price_index.iloc[0] = 100 + df.iloc[0]
        for i in np.arange(1, len(df_price_index)):
            # compound all columns at once (avoids chained assignment)
            df_price_index.iloc[i] = df_price_index.iloc[i - 1] * (1 + df.iloc[i])
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df_price_index[stock].diff()) * ind_capital / df_price_index[stock].iloc[0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(axis=1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined.iloc[0] = np.sum(weights * df.iloc[0].values)
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    # given the input dataframe contains price series
    if type == "prices":
        n = 0
        ind_acc = []
        for stock in df.columns:
            ind_capital = capital * weights[n]
            moves = (df[stock].diff()) * ind_capital / df[stock].iloc[0]
            ind_acc.append(moves)
            n += 1
        pair_ind_accounts = pd.concat(ind_acc, axis=1)
        portfolio_acc = pair_ind_accounts.sum(axis=1).cumsum() + capital
        df_temp_returns_combined = portfolio_acc.pct_change()
        df_temp_returns_combined.iloc[0] = np.nan
        df_temp_returns_combined = pd.DataFrame(df_temp_returns_combined)
        df_temp_returns_combined.columns = ["combinedReturns"]
    # given the input dataframe contains price series and the strategy is long/short
    # (weights of 1 mark the long asset, -1 the short asset)
    if type == "shares":
        exposures = []
        for stock in df.columns:
            shares = 1 / df[stock].iloc[0]
            exposure = df[stock] * shares
            exposures.append(exposure)
        df_temp = pd.concat(exposures, axis=1)
        index_long = np.where(np.array(weights) == 1)
        index_short = np.where(np.array(weights) == -1)
        df_temp_account = pd.DataFrame(df_temp.iloc[:, index_long[0]].values - df_temp.iloc[:, index_short[0]].values) + 1
        df_temp_returns_combined = df_temp_account.pct_change()
        df_temp_returns_combined.columns = ["combinedReturns"]
        df_temp_returns_combined.index = df.index
    return pd.DataFrame(df_temp_returns_combined)

Python: How to iterate over rows and calculate value based on previous row

I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country':['USA','USA','USA','USA','USA'],
                     'Month':[6,7,8,9,10],
                     'Sales':[100,200,0,0,0],
                     'Recovery':[0,1,1.5,2.5,3]
                     })
This is how it looks:
Now, I want to add a "Predicted" column, resulting in this dataframe:
The first value 300 at row 3, is basically (200 * 1.5/1). This will be our base value going ahead, so next value i.e. 500 is basically (300 * 2.5/1.5) and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd
test = pd.DataFrame({'Country':['USA','USA','USA','USA','USA'],
                     'Month':[6,7,8,9,10],
                     'Sales':[100,200,0,0,0],
                     'Recovery':[0,1,1.5,2.5,3]
                     })
# start from the known sales; use floats so scaled values fit the column
test['Prediction'] = test['Sales'].astype(float)
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i-1, 'Recovery'] != 0:
        # scale the previous prediction by the change in recovery rate
        test.loc[i, 'Prediction'] = test.loc[i-1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i-1, 'Recovery']
The sequence you have is straight up just Recovery * base level (Sales = 200)
You can compute that sequence like this:
valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")
And then combine by index, insert column or concat:
pd.concat([test, prediction], axis=1)
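A small variation, sketched under the assumption that known sales should be kept where they exist and the scaled recovery rate used only for the months without sales. The names has_sales, base_sales and base_recovery are illustrative.

```python
import pandas as pd

test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})

has_sales = test['Sales'] > 0

# Base level: the last month with actual sales and its recovery rate
base_sales = test.loc[has_sales, 'Sales'].iloc[-1]
base_recovery = test.loc[has_sales, 'Recovery'].iloc[-1]

# Scale the base level by the recovery rate relative to the base month
scaled = test['Recovery'] * base_sales / base_recovery

# Keep actual sales where available, otherwise use the scaled value
test['Predicted'] = test['Sales'].where(has_sales, scaled)
```

For this data the column comes out as 100, 200, 300, 500, 600, which matches the row-by-row loop because each step there multiplies by the ratio of consecutive recovery rates.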

Using pandas to clean massive dataset

I'm fairly new to this. I have a dataset that is transposed awkwardly, and I want to get it back to our guy within the next week. I think I've gotten pretty close to finishing.
The problem I am facing is getting the data into one data frame. When I run the code and print from the for loop, I can see the chunks of values that need to be concatenated. However, I can't find a way to store all the values: when I try, I just get one chunk.
import pandas as pd
import numpy as np
df = pd.read_excel("DATA,h",
                   header = None,
                   dtype = object)
ranges = []
last_index = 0
def clean(df12,df13):
    df12 = df12.T
    df13 = df13.T
    value1 = pd.DataFrame(df12)
    value2 = pd.DataFrame(df13)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    final_value = pd.concat([value1, value2])
    return(final_value)
for i, row in df.iterrows():
    rows = df.iloc[i]
    if rows[9] == 'Member' or rows[9] == 'Non-Pledging Member':
        if last_index == 0:
            last_index = i
        else:
            ranges.append([last_index, i])
            last_index = i
    # df44 is overwritten on every pass, so only the last chunk survives
    df44 = clean(row,row)
    print(df44)
When I print rows from the for loop, I get all the values I need in the terminal, but if I store the result in a variable or dataframe, I just see one of those blocks of data. Does anyone know what's going on?
Data (there are 15k of these):
Proctor, Terry 206-915-3555 Member
620 33rd Ave E 16283
Seattle, WA 98112
What I am shooting for:
Proctor, Terry, 620 33rd Ave E, Seattle, WA, 98112, 206-915-3555, Member
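A likely reason only one block survives is that the variable is reassigned on every pass of the loop. A common pattern is to collect each block in a list and build the DataFrame once at the end. The sketch below assumes a simplified, hypothetical layout (three columns, membership status in the last one) and a made-up second record; the real sheet keeps the status in column 9.

```python
import pandas as pd

# Hypothetical raw layout: each record spans three rows
raw = pd.DataFrame([
    ["Proctor, Terry", "206-915-3555", "Member"],
    ["620 33rd Ave E", "16283", None],
    ["Seattle, WA 98112", None, None],
    ["Doe, Jane", "555-0100", "Non-Pledging Member"],   # made-up example row
    ["1 Example St", "00001", None],
    ["Springfield, XX 00000", None, None],
])

STATUS_COL = 2  # assumption: column holding the membership status

# A new record starts on every row that carries a membership status
starts = raw[STATUS_COL].isin(["Member", "Non-Pledging Member"])
record_id = starts.cumsum().rename("record")

# Collect one flat row per record, then build the result once
records = []
for _, block in raw.groupby(record_id):
    values = block.to_numpy().ravel()
    records.append([v for v in values if pd.notna(v)])

tidy = pd.DataFrame(records)
```

The columns come out in reading order (name, phone, status, street, number, city); reordering them and splitting "Seattle, WA 98112" into city, state, and ZIP would be a separate step.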

How do I avoid a loop with Python/Pandas to build an equity curve?

I am trying to build an equity curve in Python using Pandas. For those not in the know, an equity curve is a cumulative tally of investing profits/losses day by day. The code below works but it is incredibly slow. I've tried to build an alternate using Pandas .iloc and such but nothing is working. I'm not sure if it is possible to do this outside of a loop given how I have to reference the prior row(s).
for today in range(len(f1)): #initiate a loop that runs the length of the "f1" dataframe
    if today == 0: #if the index value is zero (aka first row in the dataframe) then...
        f1.loc[today,'StartAUM'] = StartAUM #Set initial assets
        f1.loc[today,'Shares'] = 0 #dummy placeholder for shares; no trading on day 1
        f1.loc[today,'PnL'] = 0 #dummy placeholder for P&L; no trading day 1
        f1.loc[today,'EndAUM'] = StartAUM #set ending AUM; should be beginning AUM since no trades
        continue #and on to the second row in the dataframe
    yesterday = today - 1 #used to reference the rows (see below)
    f1.loc[today,'StartAUM'] = f1.loc[yesterday,'EndAUM'] #today's starting assets are yesterday's ending assets
    f1.loc[today,'Shares'] = f1.loc[yesterday,'EndAUM']//f1.loc[yesterday,'Shareprice'] #today's shares to trade = yesterday's assets/yesterday's share price
    f1.loc[today,'PnL'] = f1.loc[today,'Shares']*f1.loc[today,'Outcome1'] #Our P&L should be the shares traded (see prior line) multiplied by the outcome for 1 share
    #Note Outcome1 came from the dataframe before this loop >> for the purposes here its value is irrelevant
    f1.loc[today,'EndAUM'] = f1.loc[today,'StartAUM']+f1.loc[today,'PnL'] #ending assets are starting assets + today's P&L
There is a good example here: http://www.pythonforfinance.net/category/basic-data-analysis/ and I know that there is an example in Wes McKinney's book Python for Data Analysis. You might be able to find it here: http://wesmckinney.com/blog/python-for-financial-data-analysis-with-pandas/
Have you tried using iterrows() to construct the for loop?
prev_end_aum = None #yesterday's ending AUM; None marks the first row
prev_price = None #yesterday's share price
for index, row in f1.iterrows():
    if prev_end_aum is None: #first row: no trading on day 1
        start_aum = StartAUM #Set initial assets
        shares = 0 #dummy placeholder for shares
        pnl = 0 #dummy placeholder for P&L
    else:
        start_aum = prev_end_aum #today's starting assets are yesterday's ending assets
        shares = prev_end_aum // prev_price #today's shares to trade = yesterday's assets/yesterday's share price
        pnl = shares * row['Outcome1'] #P&L is the shares traded multiplied by the outcome for 1 share
    end_aum = start_aum + pnl #ending assets are starting assets + today's P&L
    #iterrows() yields a copy of each row, so write the results back through f1.loc
    f1.loc[index, 'StartAUM'] = start_aum
    f1.loc[index, 'Shares'] = shares
    f1.loc[index, 'PnL'] = pnl
    f1.loc[index, 'EndAUM'] = end_aum
    prev_end_aum, prev_price = end_aum, row['Shareprice']
This may help because the values carried over from the previous row are kept in plain variables instead of being looked up with loc each time. Note that iterrows() returns a copy of each row, so assigning to row alone would not change f1.
See the pandas documentation on iterrows() for more details.
You need to vectorize the operations (don't iterate with for but rather compute whole column at once)
# fill the initial values
f1['StartAUM'] = StartAUM # Set initial assets
f1['Shares'] = 0 # dummy placeholder for shares; no trading on day 1
f1['PnL'] = 0 # dummy placeholder for P&L; no trading day 1
f1['EndAUM'] = StartAUM # set ending AUM; should be beginning AUM since no trades
#do the computations (vectorized)
f1['StartAUM'].iloc[1:] = f1['EndAUM'].iloc[:-1]
f1['Shares'].iloc[1:] = f1['EndAUM'].iloc[:-1] // f1['Shareprice'].iloc[:-1]
f1['PnL'] = f1['Shares'] * f1['Outcome1']
f1['EndAUM'] = f1['StartAUM'] + f1 ['PnL']
EDIT: this will not work correctly since StartAUM, EndAUM, Shares depend on each other and cannot be computed one without another. I didn't notice that before.
Can you try the following:
#import relevant modules
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
#download data into DataFrame and create moving averages columns
f1 = data.DataReader('AAPL', 'yahoo',start='1/1/2017')
StartAUM = 1000000
#populate DataFrame with starting values
f1['Shares'] = 0
f1['PnL'] = 0
f1['EndAUM'] = StartAUM
#Set shares held to be the previous day's EndAUM divided by the previous day's closing price
f1['Shares'] = f1['EndAUM'].shift(1) / f1['Adj Close'].shift(1)
#Set the day's PnL to be the number of shares held multiplied by the change in closing price from yesterday to today's close
f1['PnL'] = f1['Shares'] * (f1['Adj Close'] - f1['Adj Close'].shift(1))
#Set day's ending AUM to be previous days ending AUM plus daily PnL
f1['EndAUM'] = f1['EndAUM'].shift(1) + f1['PnL']
#Plot the equity curve
f1['EndAUM'].plot()
Does the above solve your issue?
The solution was to use the Numba package. It performs the loop task in a fraction of the time.
https://numba.pydata.org/
The arguments/dataframe can be passed to the numba module/function. I will try to write up a more detailed explanation with code when time permits.
Thanks to all
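As a rough idea of what the Numba approach could look like: the loop is moved into a function that works on plain NumPy arrays, and numba.njit compiles it. This is a sketch, not the original poster's code; the function name equity_curve and the sample numbers are assumptions, and the try/except lets the same code run (uncompiled) when Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    def njit(func):
        # Fallback: run the plain Python loop if Numba is unavailable
        return func


@njit
def equity_curve(start_aum, share_price, outcome):
    n = share_price.shape[0]
    start = np.empty(n)
    shares = np.empty(n)
    pnl = np.empty(n)
    end = np.empty(n)
    # Day 1: no trading, ending AUM equals starting AUM
    start[0] = start_aum
    shares[0] = 0.0
    pnl[0] = 0.0
    end[0] = start_aum
    for t in range(1, n):
        start[t] = end[t - 1]                       # yesterday's ending assets
        shares[t] = end[t - 1] // share_price[t - 1]  # whole shares at yesterday's price
        pnl[t] = shares[t] * outcome[t]             # outcome per share times shares
        end[t] = start[t] + pnl[t]
    return start, shares, pnl, end


# Made-up example data using the column names from the question
f1 = pd.DataFrame({"Shareprice": [10.0, 11.0, 12.0],
                   "Outcome1": [0.0, 1.0, -0.5]})
start, shares, pnl, end = equity_curve(
    1000.0,
    f1["Shareprice"].to_numpy(dtype=np.float64),
    f1["Outcome1"].to_numpy(dtype=np.float64),
)
f1["StartAUM"], f1["Shares"], f1["PnL"], f1["EndAUM"] = start, shares, pnl, end
```

The recurrence is the same as in the original loop; the speed-up comes from avoiding per-cell f1.loc access and, with Numba present, from compiling the loop.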
In case others come across this, you can definitely make an equity curve without loops.
Dummy up some data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13, 10)
# Some data to work with
np.random.seed(1)
stock = pd.DataFrame(
    np.random.randn(100).cumsum() + 10,
    index=pd.date_range('1/1/2020', periods=100, freq='D'),
    columns=['Close']
)
stock['ma_5'] = stock['Close'].rolling(5).mean()
stock['ma_15'] = stock['Close'].rolling(15).mean()
Holdings: simple long/short based on moving average crossover signals
longs = stock['Close'].where(stock['ma_5'] > stock['ma_15'], np.nan)
shorts = stock['Close'].where(stock['ma_5'] < stock['ma_15'], np.nan)
# Quick plot
stock.plot()
longs.plot(lw=5, c='green')
shorts.plot(lw=5, c='red')
EQUITY CURVE:
Identify which side (long/short) has the first holding (i.e. the first trade; in this case, short). For that side, keep the initial trade price and cumulatively sum the daily changes (there would normally be more NaNs in the series if you also have exit rules for when you are out of the market), then forward fill over the NaN values and fill any remaining NaNs with zeros. It's basically the same for the opposite side (in this case, long), except that you don't keep the starting price. The other important thing is to invert the short daily changes (i.e. negative changes should be positive to the PnL).
lidx = np.where(longs > 0)[0][0]
sidx = np.where(shorts > 0)[0][0]
startdx = min(lidx, sidx)
# For first holding side, keep first trade price, then calc daily change fwd and ffill nan's
# For second holding side, get cumsum of daily changes, ffill and fillna(0) (make sure short changes are inverted)
if lidx == startdx:
    lcurve = longs.diff() # get daily changes
    lcurve.iloc[lidx] = longs.iloc[lidx] # put back initial starting price (positional index)
    lcurve = lcurve.cumsum().ffill() # add daily changes/ffill to build curve
    scurve = -shorts.diff().cumsum().ffill().fillna(0) # get daily changes (make declines positive changes)
else:
    scurve = -shorts.diff() # get daily changes (make declines positive changes)
    scurve.iloc[sidx] = shorts.iloc[sidx] # put back initial starting price (positional index)
    scurve = scurve.cumsum().ffill() # add daily changes/ffill to build curve
    lcurve = longs.diff().cumsum().ffill().fillna(0) # get daily changes
Add the 2 long/short curves together to get the final equity curve
eq_curve = lcurve + scurve
# quick plot
stock.iloc[:, :3].plot()
longs.plot(lw=5, c='green', label='Long')
shorts.plot(lw=5, c='red', label='Short')
eq_curve.plot(lw=2, ls='dotted', c='orange', label='Equity Curve')
plt.legend()
