Multivariate linear mixed effects model in Python

I am playing around with this code, which is for univariate linear mixed effects modelling. The data set denotes:
students as s
instructors as d
departments as dept
service as service
In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as:
y ~ 1 + (1|students) + (1|instructor) + (1|dept) + service
where 1 denotes an intercept term, (1|x) denotes a random effect for x, and x denotes a fixed effect.
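As a rough illustration of what this formula describes, here is a minimal simulation sketch using only the standard library. It assumes each rating is an intercept plus a fixed service effect plus one random intercept per student, instructor, and department, with unit-variance noise (matching the scale of y in the code below). The function name, parameter values, and group sizes are made up for illustration and are not taken from the InstEval data.

```python
import random

def simulate_ratings(n_obs, n_students, n_instructors, n_depts,
                     mu=3.0, beta_service=-0.1,
                     sd_s=0.3, sd_d=0.5, sd_dept=0.1, seed=0):
    """Draw ratings from y = mu + beta*service + eta_s + eta_d + eta_dept + noise."""
    rng = random.Random(seed)
    # One random intercept per level of each grouping factor.
    eta_s = [rng.gauss(0.0, sd_s) for _ in range(n_students)]
    eta_d = [rng.gauss(0.0, sd_d) for _ in range(n_instructors)]
    eta_dept = [rng.gauss(0.0, sd_dept) for _ in range(n_depts)]
    rows = []
    for _ in range(n_obs):
        s = rng.randrange(n_students)
        d = rng.randrange(n_instructors)
        dept = rng.randrange(n_depts)
        service = rng.randint(0, 1)
        # Fixed effects plus the three random intercepts plus unit-variance noise.
        y = (mu + beta_service * service
             + eta_s[s] + eta_d[d] + eta_dept[dept]
             + rng.gauss(0.0, 1.0))
        rows.append({"s": s, "d": d, "dept": dept, "service": service, "y": y})
    return rows

rows = simulate_ratings(n_obs=1000, n_students=50, n_instructors=20, n_depts=5)
```

Each row carries the integer group codes that the model below looks up with tf.gather.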
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import edward as ed
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from edward.models import Normal
from observations import insteval
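# Load the data (this line is missing from the snippet; the path is an assumption).
data, metadata = insteval("~/data")
data = pd.DataFrame(data, columns=metadata['columns'])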
train = data.sample(frac=0.8)
test = data.drop(train.index)
train.head()
s_train = train['s'].values
d_train = train['dcodes'].values
dept_train = train['deptcodes'].values
y_train = train['y'].values
service_train = train['service'].values
n_obs_train = train.shape[0]
s_test = test['s'].values
d_test = test['dcodes'].values
dept_test = test['deptcodes'].values
y_test = test['y'].values
service_test = test['service'].values
n_obs_test = test.shape[0]
n_s = max(s_train) + 1 # number of students
n_d = max(d_train) + 1 # number of instructors
n_dept = max(dept_train) + 1 # number of departments
n_obs = train.shape[0] # number of observations
# Set up placeholders for the data inputs.
s_ph = tf.placeholder(tf.int32, [None])
d_ph = tf.placeholder(tf.int32, [None])
dept_ph = tf.placeholder(tf.int32, [None])
service_ph = tf.placeholder(tf.float32, [None])
# Set up fixed effects.
mu = tf.get_variable("mu", [])
service = tf.get_variable("service", [])
sigma_s = tf.sqrt(tf.exp(tf.get_variable("sigma_s", [])))
sigma_d = tf.sqrt(tf.exp(tf.get_variable("sigma_d", [])))
sigma_dept = tf.sqrt(tf.exp(tf.get_variable("sigma_dept", [])))
# Set up random effects.
eta_s = Normal(loc=tf.zeros(n_s), scale=sigma_s * tf.ones(n_s))
eta_d = Normal(loc=tf.zeros(n_d), scale=sigma_d * tf.ones(n_d))
eta_dept = Normal(loc=tf.zeros(n_dept), scale=sigma_dept * tf.ones(n_dept))
yhat = (tf.gather(eta_s, s_ph) +
        tf.gather(eta_d, d_ph) +
        tf.gather(eta_dept, dept_ph) +
        mu + service * service_ph)
y = Normal(loc=yhat, scale=tf.ones(n_obs))
# Inference
q_eta_s = Normal(
    loc=tf.get_variable("q_eta_s/loc", [n_s]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_s/scale", [n_s])))
q_eta_d = Normal(
    loc=tf.get_variable("q_eta_d/loc", [n_d]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_d/scale", [n_d])))
q_eta_dept = Normal(
    loc=tf.get_variable("q_eta_dept/loc", [n_dept]),
    scale=tf.nn.softplus(tf.get_variable("q_eta_dept/scale", [n_dept])))
latent_vars = {
    eta_s: q_eta_s,
    eta_d: q_eta_d,
    eta_dept: q_eta_dept}
data = {
    y: y_train,
    s_ph: s_train,
    d_ph: d_train,
    dept_ph: dept_train,
    service_ph: service_train}
inference = ed.KLqp(latent_vars, data)
This works fine for univariate linear mixed effects modelling. I am trying to extend this approach to the multivariate case. Any ideas are more than welcome.

There are several ways to fit linear mixed effects models in Python. It looks like you've adapted the TensorFlow approach, but if that is not a hard requirement, there are other, potentially more convenient options.
You can use the Statsmodels implementation of LMER, which is conveniently contained in Python, but the syntax is a bit different from the traditional formula expressions of R's LMER. It looks like you are using Python to split your data into training and test sets, so you can also write a loop to call the
You can also install R and rpy2 on your local machine and call the LMER packages from your Python environment. This lets you keep the R syntax you are familiar with while doing everything else in Python. In Jupyter notebooks, all you have to do is use the rmagic %%R cell magic (or %R for inline) to pass variables and models between Python and R. This is useful if you are passing the train/test data you split in Python to R to run lmer and retrieve the parameters in a loop.
Lastly, another option is Pymer4, a wrapper around rpy2 that lets you call LMER in R directly without having to deal with rmagic.
I wrote a tutorial on how to use LMER with each of these methods, which also works on cloud setups like Google Colab. All of these methods let you run the multivariate approach you asked for using LMER in R, but from a Python environment.

Related

Is pyomo kernel compatible with ROmodel?

I know that ROmodel works with pyomo.environ, but I haven’t been able to get it to work with pyomo.kernel. Admittedly, it says here that pyomo.kernel is not compatible with extension modules. Here’s my attempt at getting it to work with ROmodel:
import romodel as ro
import pyomo.kernel as pmo
import numpy as np
# Create Pyomo model using pyomo.kernel
m = pmo.block()
m.x = pmo.variable(value = 1, lb = 0, ub = 4)
# Add some regular constraints (not uncertain)
m.c1 = pmo.constraint(m.x <= 0)
m.c2 = pmo.constraint(0 <= m.x)
# Create Objective function
m.o = pmo.objective(-m.x)
# # solve deterministic model
solver = pmo.SolverFactory('ipopt') # other options gurobi, ipopt, cplex, glpk
solver.solve(m)
print('decision variable values are')
print('objective value is ', m.o.expr())
# create Robust model
from romodel.uncset import PolyhedralSet
# # Define polyhedral set
m.uncset_d = PolyhedralSet(mat=[[1]], rhs=[100]) # can't add this to block object because no attribute "parent"
upper_bound = 500
m.uncset_b = PolyhedralSet(mat=[[1]], rhs=[upper_bound])
m.d = ro.UncParam([0], nominal=[0.1], uncset=m.uncset_d)
m.b = ro.UncParam([0], nominal=[1], uncset=m.uncset_b)
m.unc_const_d = pmo.constraint(expr = m.x[0] + m.d[0] <= 1)
m.unc_const_b = pmo.constraint(expr = m.x[0] + m.b[0] <= 1)
I was excited to find pyomo.kernel, because I’m writing a robust optimization problem, and it would be nice to take advantage of the vectorized constraints in pyomo.kernel. However, if pyomo.kernel isn’t compatible with ROmodel, then this won’t work.
Is there either:
A way to get pyomo.kernel to integrate with ROmodel?
OR
Another Robust Optimization framework that extends Pyomo which offers vectorized constraints? I saw PyROS and RSOME. But I couldn’t tell if those offer vectorized constraints.

Tensorflow Probability Error: OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed

I am trying to estimate a model in tensorflow using NUTS by providing it a likelihood function. I have checked the likelihood function is returning reasonable values. I am following the setup here for setting up NUTS:
https://rlhick.people.wm.edu/posts/custom-likes-tensorflow.html
and some of the examples here for setting up priors, etc.:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Multilevel_Modeling_Primer.ipynb
My code is in a colab notebook here:
https://drive.google.com/file/d/1L9JQPLO57g3OhxaRCB29do2m808ZUeex/view?usp=sharing
I get the error: OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function. This is my first time using TensorFlow and I am quite lost interpreting this error. It would also be ideal if I could pass the starting parameter values as a single input (the example I am working from doesn't do it, but I assume it is possible).
Update
It looks like I had to change the position of the @tf.function decorator. The sampler now runs, but it gives me the same value for all samples for each of the parameters. Is it a requirement that I pass a joint distribution through the log_prob() function? I am clearly missing something. I can run the likelihood through BFGS optimization and get reasonable results (I've estimated the model via maximum likelihood with fixed parameters in other software). It looks like I need to define the function to return a joint distribution and call log_prob(). I can do this if I set it up as a logistic regression (logit choice model is logistically distributed in differences). However, I lose the standard closed form.
My function is as follows:
@tf.function
def mmnl_log_prob(init_mu_b_time,init_sigma_b_time,init_a_car,init_a_train,init_b_cost,init_scale):
    # Create priors for hyperparameters
    mu_b_time = tfd.Sample(tfd.Normal(loc=init_mu_b_time, scale=init_scale),sample_shape=1).sample()
    # HalfCauchy distributions are too wide for logit discrete choice
    sigma_b_time = tfd.Sample(tfd.Normal(loc=init_sigma_b_time, scale=init_scale),sample_shape=1).sample()
    # Create priors for parameters
    a_car = tfd.Sample(tfd.Normal(loc=init_a_car, scale=init_scale),sample_shape=1).sample()
    a_train = tfd.Sample(tfd.Normal(loc=init_a_train, scale=init_scale),sample_shape=1).sample()
    # a_sm = tfd.Sample(tfd.Normal(loc=init_a_sm, scale=init_scale),sample_shape=1).sample()
    b_cost = tfd.Sample(tfd.Normal(loc=init_b_cost, scale=init_scale),sample_shape=1).sample()
    # Define a heterogeneous random parameter model with MultivariateNormalDiag()
    # Use MultivariateNormalDiagPlusLowRank() to define nests, etc.
    b_time = tfd.Sample(tfd.MultivariateNormalDiag( # b_time
        loc=mu_b_time,
        scale_diag=sigma_b_time),sample_shape=num_idx).sample()
    # Definition of the utility functions
    V1 = a_train + tfm.multiply(b_time,TRAIN_TT_SCALED) + b_cost * TRAIN_COST_SCALED
    V2 = tfm.multiply(b_time,SM_TT_SCALED) + b_cost * SM_COST_SCALED
    V3 = a_car + tfm.multiply(b_time,CAR_TT_SCALED) + b_cost * CAR_CO_SCALED
    print("Vs",V1,V2,V3)
    # Definition of loglikelihood
    eV1 = tfm.multiply(tfm.exp(V1),TRAIN_AV_SP)
    eV2 = tfm.multiply(tfm.exp(V2),SM_AV_SP)
    eV3 = tfm.multiply(tfm.exp(V3),CAR_AV_SP)
    eVD = eV1 + eV2 + eV3
    print("eVs",eV1,eV2,eV3,eVD)
    l1 = tfm.multiply(tfm.truediv(eV1,eVD),tf.cast(tfm.equal(CHOICE,1),tf.float32))
    l2 = tfm.multiply(tfm.truediv(eV2,eVD),tf.cast(tfm.equal(CHOICE,2),tf.float32))
    l3 = tfm.multiply(tfm.truediv(eV3,eVD),tf.cast(tfm.equal(CHOICE,3),tf.float32))
    ll = tfm.reduce_sum(tfm.log(l1+l2+l3))
    print("ll",ll)
    return ll
The function is called as follows:
nuts_samples = 1000
nuts_burnin = 500
chains = 4
## Initial step size
init_step_size=.3
init = [0.,0.,0.,0.,0.,.5]
##
## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=mmnl_log_prob,
        step_size=init_step_size,
    )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=nuts_kernel,
        num_adaptation_steps=nuts_burnin,
        step_size_getter_fn=lambda pkr: pkr.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
    )
    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
        num_results=nuts_samples,
        current_state=init,
        kernel=adapt_nuts_kernel,
        num_burnin_steps=100,
        parallel_iterations=5)
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(init)
I have an answer to my question! It is simply a matter of different nomenclature. I need to define my model as a softmax function, which I knew was what I would call a "logit model", but it just wasn't clicking for me. The following blog post gave me the epiphany:
http://khakieconomics.github.io/2019/03/17/Putting-it-all-together.html
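To make the connection concrete: the ratio exp(V_j) / sum_k exp(V_k) used in the likelihood above is the softmax of the utilities, so the log-probability of the chosen alternative is V_chosen minus a log-sum-exp. Below is a minimal standard-library sketch of that identity for a single observation with availability flags. The function names and the toy utilities are made up for illustration, and this is not the TensorFlow Probability code that ended up working.

```python
import math

def choice_log_prob(utilities, available, chosen):
    """Log-probability of the chosen alternative under a multinomial logit.

    utilities: list of V_j, one per alternative
    available: list of 0/1 flags (unavailable alternatives get probability 0)
    chosen:    index of the chosen alternative (must be available)
    """
    vs = [v for v, a in zip(utilities, available) if a]
    v_max = max(vs)  # subtract the max for numerical stability
    log_denom = v_max + math.log(sum(math.exp(v - v_max) for v in vs))
    return utilities[chosen] - log_denom

def choice_prob_ratio(utilities, available, chosen):
    """Direct ratio form, as written in the question, for comparison."""
    ev = [math.exp(v) * a for v, a in zip(utilities, available)]
    return ev[chosen] / sum(ev)

V = [0.2, -0.5, 1.0]  # toy utilities for three alternatives
lp = choice_log_prob(V, [1, 1, 1], chosen=2)
```

The log-sum-exp form gives the same probabilities as the ratio form but does not overflow when utilities are large.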

Pyomo time-dependent model?

I'm quite new to Pyomo and I'm having a hard time figuring out how to create a time-dependent model and plot it on a graph. By time-dependent I mean a variable that takes a different value at each time step (from 1 to T in this case).
I used this very simple model, but when I run the script the output contains only one solution. How can I change that?
I also get an error related to the constraint function, but I'm not sure what's wrong:
(ValueError: Constraint 'constraint[1]' does not have a proper value. Found . at 0x7f202b540850>' Expecting a tuple or equation.)
I'd like to show how the value of x(t) varies in all timesteps.
Any help is appreciated.
from __future__ import division
from pyomo.environ import *
from pyomo.opt import SolverFactory
import sys
model = AbstractModel()
model.n = Param()
model.T = RangeSet(1, model.n)
model.a = Param(model.T)
model.b = Param(model.T)
model.x = Var(model.T, domain= NonNegativeReals)
data = DataPortal()
data.load(filename='N.csv', range='N', param=model.n)
data.load(filename='A.csv', range= 'A', param=model.a)
data.load(filename='B.csv', range= 'B', param=model.b)
def objective(model):
    return model.x
model.OBJ = Objective(rule=objective)
def somma(model):
    return model.a[t]*model.x[t] for t in model.T) >= model.b[t] for t in model.T
model.constraint = Constraint(model.T, rule=somma)
instance = model.create_instance(data)
opt = SolverFactory('glpk')
results = opt.solve(instance)
You can build up lists of the values you would like to plot like this:
T_plot = list(instance.T)
x_plot = [value(instance.x[t]) for t in T_plot]
and then use your favorite Python plotting package to make the plots. I usually use Matplotlib.

Is there a difference between Python Boruta and R Boruta?

I used the Boruta package in R and in Python on the same dataset, with all other steps and methods kept the same. But the feature selection results differ: 46 features are selected in R but only 20 in Python. What is the reason?
R
M_boruta <- Boruta::Boruta(is_churn ~ . -cust_id, data = Mobile, doTrace = 2)
print(M_boruta)
plot(M_boruta, xlab = "", xaxt = "n")
lz_2 <- lapply(1:ncol(M_boruta$ImpHistory),function(i)
  M_boruta$ImpHistory[is.finite(M_boruta$ImpHistory[,i]),i])
names(lz_2) <- colnames(M_boruta$ImpHistory)
Labels_2 <- sort(sapply(lz_2,median))
axis(side = 1,las=2,labels = names(Labels_2),
     at = 1:ncol(M_boruta$ImpHistory), cex.axis = 0.7)
M_boruta_attr <- getSelectedAttributes(M_boruta, withTentative = F)
M_boruta_df <- Mobile[ ,(names(Mobile) %in% M_boruta_attr)]
str(M_boruta_df)
Python
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
rfc = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced', max_depth=50)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2)
churn_gsm_bor_x = churn_gsm_bor.iloc[:,1:].values
churn_gsm_bor_y = churn_gsm_bor.iloc[:,0].values.ravel()
boruta_selector.fit(churn_gsm_bor_x, churn_gsm_bor_y)
print("=============BORUTA==============")
print(boruta_selector.n_features_)
print(boruta_selector.support_)
print(boruta_selector.ranking_)
churn_gsm_bor_x_filter=boruta_selector.transform(churn_gsm_bor_x)
print(churn_gsm_bor_x_filter)
This might be because the parameters you specify for the Random Forest classifier in Python differ from the default parameters you use in R (cf. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, or ranger in more recent versions of Boruta: https://cran.r-project.org/web/packages/ranger/ranger.pdf). I'd point out that you are also setting the maximum depth of the trees in the Python implementation higher than recommended (cf. https://github.com/scikit-learn-contrib/boruta_py) - personally I've found that this can have a large effect on how many features are selected.

Getting linearmodels FamaMacBeth function output

I am having an issue with this function. I want to perform a cross-sectional regression on 25 portfolios ranked on value and size. I have 7 independent variables on the right-hand side of the equation.
import pandas as pd
import numpy as np
from linearmodels import FamaMacBeth
#creating a multi_index of independent variables
ind_var = pd.read_excel('FAMA_MACBETH.xlsx')
ind_var['date'] = pd.to_datetime(ind_var['date'])
# dropping our dependent variables
ind_var = ind_var.drop(['Mkt_rf', 'div_innovations', 'term_innovations',
                        'def_innovations', 'rf_innovations', 'hml_innovations',
                        'smb_innovations'],axis = 1)
ind_var = pd.DataFrame(ind_var.set_index('date').stack())
ind_var.columns = ['x']
x = np.asarray(ind_var)
len(x)
11600
# creating a multi_index of dependent variables
# reading in our data
dep_var = pd.read_excel('FAMA_MACBETH.xlsx')
dep_var['date'] = pd.to_datetime(dep_var['date'])
# dropping our independent variables
dep_var = dep_var.drop(['SMALL_LoBM', 'ME1_BM2', 'ME1_BM3', 'ME1_BM4',
                        'SMALL_HiBM', 'ME2_BM1', 'ME2_BM2', 'ME2_BM3', 'ME2_BM4', 'ME2_BM5',
                        'ME3_BM1', 'ME3_BM2', 'ME3_BM3', 'ME3_BM4', 'ME3_BM5', 'ME4_BM1',
                        'ME4_BM2', 'ME4_BM3', 'ME4_BM4', 'ME4_BM5', 'BIG_LoBM', 'ME5_BM2',
                        'ME5_BM3', 'ME5_BM4', 'BIG_HiBM'],axis = 1)
dep_var = pd.DataFrame(dep_var.set_index('date').stack())
dep_var.columns = ['y']
y = np.asarray(dep_var)
len(y)
3248
mod = FamaMacBeth(y, x)
res = mod.fit(cov_type='kernel', kernel='Parzen')
I expect a cross-sectional regression output, ideally with standard errors and t-stats.
I have tried numerous methods of getting this to work, and I am seriously thinking of using SAS at this point, although I would prefer to get this running with pandas.
I got it to work in one go. See this site and run the lines of code for OLS below: "Here the difference is presented using the canonical Grunfeld data on investment."
(Note that this line is important: etdata = data.set_index(['firm','year']). Without it, linearmodels won't know the entity and time dimensions to run Fama-MacBeth on.)
Then run:
from linearmodels import FamaMacBeth
FamaMacBeth(etdata.invest,etdata[['value','capital']]).fit()
Note: I updated linearmodels to the latest version, which gave me access to the data.
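The same indexing requirement applies to the data in the question, where x and y have different lengths (11600 and 3248) and so cannot share an index. A minimal pandas sketch of reshaping wide returns into a series with an (entity, time) MultiIndex follows. The column names, values, and the regressor frame are made up for illustration, and building the actual regressors for the 25 portfolios is outside the scope of this sketch.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
# Toy wide-format returns: one column per portfolio (names are made up).
wide = pd.DataFrame({"date": dates,
                     "P1": [0.01, 0.02, -0.01],
                     "P2": [0.00, 0.03, 0.01]})

# Long format: one row per (portfolio, date), indexed as (entity, time).
long = wide.melt(id_vars="date", var_name="portfolio", value_name="ret")
y = long.set_index(["portfolio", "date"])["ret"].sort_index()

# Hypothetical portfolio-specific regressors on the SAME index as y,
# so that x and y have exactly one row per (portfolio, date) pair.
x = pd.DataFrame({"beta_a": [1.0, 1.1, 0.9, 0.5, 0.6, 0.4],
                  "beta_b": [0.2, 0.1, 0.3, 0.8, 0.7, 0.9]}, index=y.index)
```

With both objects on one index, FamaMacBeth(y, x).fit() can be called as in the Grunfeld example above.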
