I am having an issue with this function. I want to perform a cross-sectional regression on 25 portfolios sorted on value and size, with 7 independent variables on the right-hand side of the equation.
import pandas as pd
import numpy as np
from linearmodels import FamaMacBeth
#creating a multi_index of independent variables
ind_var = pd.read_excel('FAMA_MACBETH.xlsx')
ind_var['date'] = pd.to_datetime(ind_var['date'])
# dropping our dependent variables
ind_var = ind_var.drop(['Mkt_rf', 'div_innovations', 'term_innovations',
'def_innovations', 'rf_innovations', 'hml_innovations',
'smb_innovations'],axis = 1)
ind_var = pd.DataFrame(ind_var.set_index('date').stack())
ind_var.columns = ['x']
x = np.asarray(ind_var)
len(x)  # 11600
# creating a multi_index of dependent variables
# reading in our data
dep_var = pd.read_excel('FAMA_MACBETH.xlsx')
dep_var['date'] = pd.to_datetime(dep_var['date'])
# dropping our independent variables
dep_var = dep_var.drop(['SMALL_LoBM', 'ME1_BM2', 'ME1_BM3', 'ME1_BM4',
'SMALL_HiBM', 'ME2_BM1', 'ME2_BM2', 'ME2_BM3', 'ME2_BM4', 'ME2_BM5',
'ME3_BM1', 'ME3_BM2', 'ME3_BM3', 'ME3_BM4', 'ME3_BM5', 'ME4_BM1',
'ME4_BM2', 'ME4_BM3', 'ME4_BM4', 'ME4_BM5', 'BIG_LoBM', 'ME5_BM2',
'ME5_BM3', 'ME5_BM4', 'BIG_HiBM'],axis = 1)
dep_var = pd.DataFrame(dep_var.set_index('date').stack())
dep_var.columns = ['y']
y = np.asarray(dep_var)
len(y)  # 3248
mod = FamaMacBeth(y, x)
res = mod.fit(cov_type='kernel', kernel='Parzen')
# ideally: output with t-stats and standard errors
I have tried numerous ways of getting this to work, and I am seriously considering switching to SAS at this point. Really, though, I would prefer to get this running with pandas.
I expect a cross-sectional regression output with standard errors and t-stats.
I got it to work in one go. See this site and run the lines of code for the OLS example below the sentence "Here the difference is presented using the canonical Grunfeld data on investment."
(Note that this line is important: etdata = data.set_index(['firm','year']), otherwise Python won't know the correct panel dimensions to run Fama-MacBeth on.)
Then run:
from linearmodels import FamaMacBeth
FamaMacBeth(etdata.invest,etdata[['value','capital']]).fit()
Note: I updated linearmodels to the latest version, which is what gave me access to the data.
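For reference, here is a minimal, self-contained sketch of that Grunfeld example. It is only a sketch under my own assumptions: it uses the copy of the Grunfeld data shipped with statsmodels, which may differ slightly from the data the linearmodels documentation uses.

import statsmodels.api as sm
from linearmodels import FamaMacBeth

# Grunfeld investment data: one row per firm-year
data = sm.datasets.grunfeld.load_pandas().data
data['year'] = data['year'].astype(int)   # an integer time index is safest for the panel

# the entity-time MultiIndex is what tells FamaMacBeth the panel dimensions
etdata = data.set_index(['firm', 'year'])

res = FamaMacBeth(etdata.invest, etdata[['value', 'capital']]).fit()
print(res)  # coefficients, standard errors and t-stats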
I am trying to solve a multi-objective optimization problem with 3 objectives and 2 decision variables using NSGA-II. The pymoo code for the NSGA-II algorithm and the termination criterion is given below. My pop_size is 100 and n_offsprings is 100, and the algorithm is iterated over 100 generations. I want to store the decision variable values considered in each generation, for all 100 generations, in a dataframe.
NSGA-II implementation in pymoo:
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation

algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)
from pymoo.factory import get_termination
termination = get_termination("n_gen", 100)

from pymoo.optimize import minimize
res = minimize(MyProblem(),
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)
What I have tried (My reference: stackoverflow question):
import pandas as pd
df2 = pd.DataFrame(algorithm.pop)
df2.head(10)
The result from the above code is blank, and when I run
print(df2)
I get
Empty DataFrame
Columns: []
Index: []
Glad you intend to use pymoo for your research. You have correctly enabled the save_history option, which means you can access the algorithm object of each generation.
To collect all solutions from the run, you can combine the offspring (algorithm.off) from each generation. Don't forget that Population objects contain Individual objects; with the get method you can retrieve X, F, or other values. See the code below.
import pandas as pd
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation, ZDT1
from pymoo.factory import get_termination
from pymoo.model.population import Population
from pymoo.optimize import minimize
problem = ZDT1()
algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)
termination = get_termination("n_gen", 10)
res = minimize(problem,
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)
all_pop = Population()
for algorithm in res.history:
    all_pop = Population.merge(all_pop, algorithm.off)

df = pd.DataFrame(all_pop.get("X"), columns=[f"X{i+1}" for i in range(problem.n_var)])
print(df)
Another way would be to use a callback and fill the data frame each generation, similar to what is shown here: https://pymoo.org/interface/callback.html
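If you go the callback route, a rough sketch could look like the following. It assumes the same pymoo 0.4.x module layout as the code above (pymoo.model.*); the XRecorder name and the gen/X1/X2 column layout are my own choices, not pymoo API.

import pandas as pd
from pymoo.model.callback import Callback

class XRecorder(Callback):
    # record the decision variables of the whole population at every generation
    def __init__(self):
        super().__init__()
        self.data["rows"] = []

    def notify(self, algorithm):
        for x in algorithm.pop.get("X"):
            row = {"gen": algorithm.n_gen}
            row.update({f"X{i+1}": v for i, v in enumerate(x)})
            self.data["rows"].append(row)

recorder = XRecorder()
# pass it to minimize exactly like above, e.g.:
# res = minimize(problem, algorithm, termination, seed=1,
#                callback=recorder, verbose=True)
# df = pd.DataFrame(recorder.data["rows"])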
I'm trying to write a block of code that will allow me to identify the risk contribution of assets in a portfolio. The covariance matrix is a 6x6 pandas dataframe.
My code is as follows:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = 'a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)[0,0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
When I try to run the code I get a KeyError, and if I remove the [0,0] from portfolio_variance I get output that doesn't seem to make sense.
Can somebody point me to my error(s)?
Three problems with your code:
First, you forgot to open the square brackets of the column list when creating data:
data = pd.DataFrame(np.random.randn(1000,6), columns = ['a','b','c','d','e','f'])
Second, you're using the two-dimensional indexing operator wrong: you can't say [0,0] on a DataFrame, you have to say [0][0].
And last, because you named the columns, you have to use those names when indexing, so it's actually ['a'][0]:
portfolio_variance = (weights*covariance*weights.T)['a'][0]
Final working code:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)['a'][0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
portfolio_variance = (weights*covariance*weights.T)
should be
portfolio_variance = (weights @ covariance @ weights.T)
This will give the portfolio variance, which should be a single number.
The same goes for the marginal risk; it should be
marginal_risk = covariance @ weights.T
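Putting the two answers together, a minimal end-to-end sketch might look like this (the random data is purely for illustration, and .values keeps the matrix algebra in NumPy); a useful sanity check is that the risk contributions sum to sigma:

import numpy as np
import pandas as pd

weights = np.array([.1, .2, .05, .25, .1, .3])
data = pd.DataFrame(np.random.randn(1000, 6), columns=['a', 'b', 'c', 'd', 'e', 'f'])
covariance = data.cov()

# portfolio variance w' Sigma w is a single number when @ (matrix multiply) is used
portfolio_variance = weights @ covariance.values @ weights
sigma = np.sqrt(portfolio_variance)

marginal_risk = covariance.values @ weights                 # Sigma w
risk_contribution = pd.Series(weights * marginal_risk / sigma,
                              index=covariance.columns)     # w_i * (Sigma w)_i / sigma

print(risk_contribution)
print(risk_contribution.sum(), "should equal sigma =", sigma)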
I am playing around with this code, which is for univariate linear mixed effects modelling. The data set denotes:
students as s
instructors as d
departments as dept
service as service
In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as:
y ~ 1 + (1|students) + (1|instructor) + (1|dept) + service
where 1 denotes an intercept term, (1|x) denotes a random effect for x, and x denotes a fixed effect.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import edward as ed
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from edward.models import Normal
from observations import insteval
# load the InstEval data and recode instructor/department ids as integer codes
data, metadata = insteval("~/data")
data = pd.DataFrame(data, columns=metadata['columns'])
data['dcodes'] = data['d'].astype('category').cat.codes
data['deptcodes'] = data['dept'].astype('category').cat.codes
train = data.sample(frac=0.8)
test = data.drop(train.index)
train.head()
s_train = train['s'].values
d_train = train['dcodes'].values
dept_train = train['deptcodes'].values
y_train = train['y'].values
service_train = train['service'].values
n_obs_train = train.shape[0]
s_test = test['s'].values
d_test = test['dcodes'].values
dept_test = test['deptcodes'].values
y_test = test['y'].values
service_test = test['service'].values
n_obs_test = test.shape[0]
n_s = max(s_train) + 1 # number of students
n_d = max(d_train) + 1 # number of instructors
n_dept = max(dept_train) + 1 # number of departments
n_obs = train.shape[0] # number of observations
# Set up placeholders for the data inputs.
s_ph = tf.placeholder(tf.int32, [None])
d_ph = tf.placeholder(tf.int32, [None])
dept_ph = tf.placeholder(tf.int32, [None])
service_ph = tf.placeholder(tf.float32, [None])
# Set up fixed effects.
mu = tf.get_variable("mu", [])
service = tf.get_variable("service", [])
sigma_s = tf.sqrt(tf.exp(tf.get_variable("sigma_s", [])))
sigma_d = tf.sqrt(tf.exp(tf.get_variable("sigma_d", [])))
sigma_dept = tf.sqrt(tf.exp(tf.get_variable("sigma_dept", [])))
# Set up random effects.
eta_s = Normal(loc=tf.zeros(n_s), scale=sigma_s * tf.ones(n_s))
eta_d = Normal(loc=tf.zeros(n_d), scale=sigma_d * tf.ones(n_d))
eta_dept = Normal(loc=tf.zeros(n_dept), scale=sigma_dept * tf.ones(n_dept))
yhat = (tf.gather(eta_s, s_ph) +
        tf.gather(eta_d, d_ph) +
        tf.gather(eta_dept, dept_ph) +
        mu + service * service_ph)
y = Normal(loc=yhat, scale=tf.ones(n_obs))
#Inference
q_eta_s = Normal(
    loc=tf.get_variable("q_eta_s/loc", [n_s]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_s/scale", [n_s])))
q_eta_d = Normal(
    loc=tf.get_variable("q_eta_d/loc", [n_d]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_d/scale", [n_d])))
q_eta_dept = Normal(
    loc=tf.get_variable("q_eta_dept/loc", [n_dept]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_dept/scale", [n_dept])))

latent_vars = {
    eta_s: q_eta_s,
    eta_d: q_eta_d,
    eta_dept: q_eta_dept}
data = {
    y: y_train,
    s_ph: s_train,
    d_ph: d_train,
    dept_ph: dept_train,
    service_ph: service_train}
inference = ed.KLqp(latent_vars, data)
This works fine in the univariate case for Linear mixed effects modelling. I am trying to extend this approach to the multivariate case. Any ideas are more than welcome.
There are a number of ways to fit linear mixed effects models in Python. It looks like you've adapted the TensorFlow approach, but if that is not a hard requirement then there are several other, potentially more convenient, options.
You can use Statsmodels' implementation of linear mixed effects models (MixedLM), which is conveniently contained within Python, although the syntax is a bit different from R's traditional lmer formula expressions. Since you are already using Python to split your data into training and test sets, you can also write a loop that calls the statsmodels fit on each split; a rough sketch follows.
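This is only a sketch under my own assumptions, not the tutorial's code: the crossed random intercepts for s, d and dept are expressed as variance components within a single dummy group, and train is the training DataFrame built in the question.

import numpy as np
import statsmodels.formula.api as smf

# crossed random intercepts for student, instructor and department,
# expressed as variance components inside one all-encompassing group
vc = {"students": "0 + C(s)", "instructors": "0 + C(dcodes)", "departments": "0 + C(deptcodes)"}

model = smf.mixedlm("y ~ service",
                    data=train,
                    groups=np.ones(train.shape[0]),  # one group => purely crossed effects
                    vc_formula=vc)
result = model.fit()
print(result.summary())

Be warned that with many students and instructors this parameterization can be slow compared to lme4.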
You can also install R and rpy2 on your local machine and call the lmer packages from your Python environment. This lets you keep your familiarity with working in R while doing everything else in Python. All you have to do is use the rmagic %%R (or %R for inline) in your cell block in Jupyter Notebooks to pass variables and models between Python and R. The latter would be useful if you are passing the train/test data you split in Python to R to run lmer and retrieve the parameters back in a loop.
Lastly, another option is Pymer4, which is a wrapper around rpy2 that lets you call lmer in R directly without having to deal with rmagic.
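For illustration, a minimal Pymer4 sketch of the question's model could look like the following; it assumes pymer4 and R's lme4 are installed, and again uses the train DataFrame from the question.

from pymer4.models import Lmer

# the lme4-style formula from the question, fitted through R's lmer via pymer4
model = Lmer("y ~ service + (1|s) + (1|d) + (1|dept)", data=train)
fixed = model.fit()   # fixed-effect estimates, SEs and t-values
print(fixed)
print(model.ranef)    # random-effect estimates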
I wrote a tutorial on how to use LMER with each of these methods, which also works on cloud setups like Google Colab. These methods will all allow you to run the multivariate approach you asked for using lmer in R, but from a Python environment.
I have a pandas data frame that contains several columns. I need to perform a multivariate linear regression. Before doing that, I would like to analyze the R, R2, adjusted R2 and p-value of each independent variable with respect to the dependent variable.
For the R and R2 I have no problem, since I can calculate the correlation matrix, select only the dependent variable, and then see the R coefficient between it and all the independent variables. I can then square these values to obtain the R2.
My problem is how to do the same for the adjusted R2 and the p-value.
In the end, what I want to obtain is something like this:
Variable    R        R2       ADJUSTED_R2   p_value
A           0.4193   0.1758   ...           ...
B           0.2620   0.0686   ...           ...
C           0.2535   0.0643   ...           ...
All the values are with respect to the dependent variable let's say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
import itertools

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression results using the statsmodels library and altering the result = model.rsquared part in the snippet below:
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-values, and use dir(model) to have a closer look at other model results (there is more available than what is shown here).
Now, this should get you going to obtain your desired results.
To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
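As a sketch of how such a loop could produce the table from the question (building on df_1 above, with y as the dependent variable), note that for a single predictor R is simply the signed square root of R2:

rows = []
for var in ['x1', 'x2', 'x3']:
    X = sm.add_constant(df_1[var])
    model = sm.OLS(df_1['y'], X).fit()
    rows.append({'Variable': var,
                 'R': np.sign(model.params[var]) * np.sqrt(model.rsquared),
                 'R2': model.rsquared,
                 'ADJUSTED_R2': model.rsquared_adj,
                 'p_value': model.pvalues[var]})

print(pd.DataFrame(rows))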
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model), you can see that not ALL regression results are available the same way that the p-values are via model.pvalues. To get Durbin-Watson, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.fittedvalues, axis=0).
This post has got more information on the issue.
Firstly, there are a few topics on this, but they involve deprecated packages with pandas, etc. Suppose I'm trying to predict a variable w with variables x, y and z. I want to run a multiple linear regression to try and predict w. There are quite a few solutions that will produce the coefficients, but I'm not sure how to use them. So, in pseudocode:
import numpy as np
from scipy import stats
w = np.array((1,2,3,4,5,6,7,8,9,10)) # Time series I'm trying to predict
x = np.array((1,3,6,1,4,6,8,9,2,2)) # The three variables to predict w
y = np.array((2,7,6,1,5,6,3,9,5,7))
z = np.array((1,3,4,7,4,8,5,1,8,2))
def model(w, x, y, z):
    # do something!
    return guess  # where guess is some 10-element array formed
                  # using multiple linear regression of x, y, z
guess = model(w,x,y,z)
r = stats.pearsonr(w,guess) # To see how good guess is
Hopefully this makes sense as I'm new to MLR. There is probably a package in scipy that does all this so any help welcome!
You can use the normal equation method.
Let your equation be of the form: ax + by + cz + d = w
Then
import numpy as np

# design matrix: columns for x, y, z plus a column of ones for the intercept d
x = np.asarray([[1,3,6,1,4,6,8,9,2,2],
                [2,7,6,1,5,6,3,9,5,7],
                [1,3,4,7,4,8,5,1,8,2],
                [1,1,1,1,1,1,1,1,1,1]]).T
y = np.asarray([1,2,3,4,5,6,7,8,9,10]).T

# normal equation: beta = (X'X)^-1 X'y, computed with the pseudo-inverse
a, b, c, d = np.linalg.pinv((x.T).dot(x)).dot(x.T.dot(y))
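To use those coefficients, you can form the fitted values and compare them with the targets, mirroring the guess/pearsonr step from the question (a sketch; the coefficient order follows the column order of x above):

from scipy import stats

guess = x @ np.array([a, b, c, d])   # fitted values
print(stats.pearsonr(y, guess))      # correlation between fit and targets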
I think I've worked it out now. If anyone could confirm that this produces the correct results, that'd be great!
import numpy as np
from scipy import stats
# What I'm trying to predict
y = [-6,-5,-10,-5,-8,-3,-6,-8,-8]
# Array that stores two predictors in columns
x = np.array([[-4.95,-4.55],[-10.96,-1.08],[-6.52,-0.81],[-7.01,-4.46],[-11.54,-5.87],[-4.52,-11.64],[-3.36,-7.45],[-2.36,-7.33],[-7.65,-10.03]])
# Fit linear least squares and get regression coefficients
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
print(beta_hat)
# To store my best guess
estimate = np.zeros(9)
for i in range(0, 9):
    # y = x1*b1 + x2*b2
    estimate[i] = beta_hat[0]*x[i, 0] + beta_hat[1]*x[i, 1]
# Correlation between best guess and real values
print(stats.pearsonr(estimate,y))
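For reference, two small sketches under the same variables as above: the loop can be replaced by a single matrix product, and np.linalg.lstsq as written fits no intercept, so appending a column of ones to x would add one.

# vectorized equivalent of the loop above
estimate = x @ beta_hat

# with an intercept: append a column of ones and refit
x_const = np.column_stack([x, np.ones(len(x))])
beta_hat_const = np.linalg.lstsq(x_const, y, rcond=None)[0]
print(beta_hat_const)  # [b1, b2, intercept]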