Is there a way to estimate Poisson interaction effect in python statsmodels? - python

Does statsmodels in Python have a way to estimate an interaction effect with a 95% confidence interval? This would be a linear combination of the model's parameter estimates.
Given the example below, I would like to get the effect of being in arm 'b' among people in place 'there', which requires estimating the linear combination of model parameters:
beta_arm + delta_arm:place, together with the appropriate confidence interval.
I'm aware of mod.params and mod.conf_int(), but does statsmodels have other methods for linear combinations?
import random
import pandas as pd
import statsmodels.api as sm
import patsy
import numpy as np
cases = np.array([random.randint(0,10) for i in range(200)])
arm = [random.choice(['a', 'b']) for i in range(200)]
place = [random.choice(['here', 'there']) for i in range(200)]
df = pd.DataFrame({'arm': arm, 'place': place})
exog = patsy.dmatrix('arm + place + arm * place', df, return_type='dataframe')
mod = sm.GLM(endog=cases, exog=exog, family=sm.families.Poisson()).fit()
mod.summary()

Bollen's Delta Method is frequently used to get the confidence interval for the linear combination b1 * x + b2 * x * z.
I'm not sure how and to what extent Statsmodels incorporates the Delta Method.
If you want to go down the results.get_prediction route, just make sure all the other covariates (if any) are set to their sample or population means.
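Not part of the original answer, but one way to get a point estimate and confidence interval for a linear combination of the GLM coefficients is results.t_test with an explicit contrast vector. A minimal sketch against the question's model, assuming patsy produces the column names 'arm[T.b]' and 'arm[T.b]:place[T.there]' in this example (check exog.columns):
# sketch, not from the original answer: Wald CI for beta_arm + delta_arm:place
import numpy as np
contrast = np.zeros(len(mod.params))
contrast[exog.columns.get_loc('arm[T.b]')] = 1
contrast[exog.columns.get_loc('arm[T.b]:place[T.there]')] = 1
tt = mod.t_test(contrast)
print(tt.effect)       # point estimate of the linear combination (log scale)
print(tt.conf_int())   # 95% confidence interval (log scale)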

Related

Apply SciPy newton method to optimize a pandas dataframe Weibull sum

I'm a novice programmer, but I know my way around Excel. However, I'm trying to teach myself Python so that I can work with much larger datasets and, primarily, because I'm finding it really interesting and enjoyable.
I'm trying to figure out how to recreate the Excel goal-seek function (I believe SciPy's newton should be equivalent) within the script I have written. However, instead of defining a simple function f(x) to find the root of, I want to find the root of the sum of a dataframe column, and I have no idea how to approach this.
My code up until the goal seek part is as follows:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min
# need to use a gamma function later on, so import math
import math
%matplotlib inline
# create dataframe using lidar experimental data
df = pd.read_csv(r'C:\Users\Latitude\Documents\Coursera\Wind Resource\Project\Wind_Lidar_40and140.txt',
                 sep=' ',
                 header=None,
                 names=['Year', 'Month', 'Day', 'Hour', 'v_40', 'v_140'])
# add in columns for velocity cubed
df['v_40_cubed'] = df['v_40']**3
df['v_140_cubed'] = df['v_140']**3
# calculate mean wind speed, mean cubed wind speed, mean wind speed cubed
# use these to calculate energy patter factor, c and k
v_40_bar = df['v_40'].mean()
v_40_cubed_bar = df['v_40_cubed'].mean()
v_40_bar_cubed = v_40_bar ** 3
# energy pattern factor = epf
epf = v_40_cubed_bar / v_40_bar_cubed
# shape parameter = k
k_40 = 1 + 3.69/epf**2
# scale factor = c
# use imported math library to use gamma function math.gamma
c_40 = v_40_bar / math.gamma(1+1/k_40)
# create new dataframe from current, using bins of 0.25, and generate frequency for these bins
bins_1 = np.linspace(0,16,65,endpoint=True)
freq_df = df.apply(pd.Series.value_counts, bins=bins_1)
# tidy up the dataframe by dropping superfluous columns and adding in a % time column for
# frequency
freq_df_tidy = freq_df.drop(['Year','Month','Day','Hour','v_40_cubed','v_140_cubed'], axis=1)
freq_df_tidy['v_40_%time'] = freq_df_tidy['v_40']/freq_df_tidy['v_40'].sum()
# add in usable bin value for potential calculation of weibull
freq_df_tidy['windspeed_bin'] = np.linspace(0,16,64,endpoint=False)
# calculate weibull column and wind power density from the weibull fit
freq_df_tidy['Weibull_40'] = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c_40)/4
freq_df_tidy['Wind_Power_Density_40'] = 0.5 * 1.225 * freq_df_tidy['Weibull_40'] * freq_df_tidy['windspeed_bin']**3
# calculate wind power density from experimental data
df['Wind_Power_Density_40'] = 0.5 * 1.225 * df['v_40']**3
At this stage, the result from the Weibull fit, round(freq_df_tidy['Wind_Power_Density_40'].sum(), 2), gives 98.12.
The result from the experimental data, round(df['Wind_Power_Density_40'].mean(), 2), gives 101.14.
My aim now is to optimise the parameter c_40, which is used in the Weibull-based wind power density (98.12), so that round(freq_df_tidy['Wind_Power_Density_40'].sum(), 2) comes out close to the experimental wind power density (101.14).
Any help on this would be hugely appreciated. Apologies if I've included too much code; I wanted to provide as much detail as possible. From my research, I think the SciPy newton method should do the trick, but I can't figure out how to apply it here.
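As a sketch of the goal-seek idea described above (not from the original post): wrap the Weibull wind power density in a function of the scale factor c and let scipy.optimize.newton drive its difference from the experimental value (101.14) to zero. This reuses freq_df_tidy, k_40 and c_40 from the code above.
# sketch, not from the original post: goal-seek on the scale factor c
from scipy.optimize import newton
from scipy.stats import weibull_min

def wpd_difference(c, target=101.14):
    # recompute the Weibull-based wind power density for a trial scale factor c
    weibull = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c) / 4
    wpd = 0.5 * 1.225 * weibull * freq_df_tidy['windspeed_bin'] ** 3
    return wpd.sum() - target

c_40_optimised = newton(wpd_difference, x0=c_40)  # start from the provisional c_40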

Why statsmodels' ARIMA(1,0,0) is not equivalent to AutoReg(1)?

I am comparing the results from arima_model and ar_model. Here is what I can't understand:
Why are the resulting coefficients different? Is it because of the estimation method? (Different settings of the method property of fit() don't give identical results)
After getting the coefficients and backtesting the fitted results I match those of the AR(1) but not of ARIMA(1). Why?
What is ARIMA really doing in this simplest setting? Isn't it supposed to be able to reproduce AR?
import pandas_datareader as pdr
import datetime
aapl = pdr.get_data_yahoo('AAPL', start=datetime.datetime(2006,1,1), end=datetime.datetime(2020,6,30))
aapl = aapl.resample('M').mean()
aapl['close_pct_change'] = aapl['Close'].pct_change()
from statsmodels.tsa.arima_model import ARIMA
mod = ARIMA(aapl['close_pct_change'][1:], order=(1,0,0))
res1 = mod.fit(method='mle')
print(res1.summary())
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
mod = AutoReg(aapl['close_pct_change'][1:], 1)
res2 = mod.fit()
print(res2.summary())
fitted_check1 = res1.params[0] + res1.params[1]*aapl['close_pct_change'][1:].shift(1)
print(fitted_check1[1:] - res1.fittedvalues)
fitted_check2 = res2.params[0] + res2.params[1]*aapl['close_pct_change'][1:].shift(1)
print(fitted_check2[1:] - res2.fittedvalues)
Why are the resulting coefficients different? Is it because of the estimation method? (Different settings of the method property of fit() don't give identical results)
AutoReg estimates parameters using OLS, which is conditional (on the first observation) maximum likelihood. ARIMA implements full maximum likelihood and so uses the information in the first observation when estimating parameters. In very large samples the coefficients should be very close, and they are equal in their asymptotic limit. In practice they will always differ, although the difference should usually be minor.
After getting the coefficients and backtesting the fitted results I match those of the AR(1) but not of ARIMA(1). Why?
The two models use different representations. AutoReg(1)'s model is Y(t) = a + b Y(t-1) + eps(t). ARIMA(1,0,0) is specified as (Y(t) - c) = b * (Y(t-1) - c) + eps(t). If |b|<1, then in the large sample limit c = a / (1-b), although in finite samples this identity will not hold exactly.
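To make the correspondence concrete, here is a small sketch of converting between the two parameterizations (the coefficient values are hypothetical, not taken from the fit above):
# hypothetical values for illustration only
b = 0.4            # AR(1) coefficient, shared by both representations
a = 0.03           # AutoReg intercept in Y(t) = a + b*Y(t-1) + eps(t)
c = a / (1 - b)    # implied ARIMA mean in (Y(t) - c) = b*(Y(t-1) - c) + eps(t); 0.05 here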
What is ARIMA really doing in this simplest setting? Isn't it supposed to be able to reproduce AR?
No. ARIMA uses the statsmodels Statespace framework which can estimate a wide range of models using Gaussian MLE.
ARIMA is essentially a special case of SARIMAX and this notebook provides a good introduction.

Estimating Posterior in Python?

I'm new to Bayesian stats and I'm trying to estimate, in Python, the posterior that results from a Poisson likelihood and a gamma prior. The parameter I'm trying to estimate is the lambda parameter of the Poisson distribution. I think the posterior will take the form of a gamma distribution (conjugate prior?) but I don't want to rely on that. The only thing I'm given is the data (named "my_data"). Here's my code:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
x=np.linspace(1,len(my_data),len(my_data))
lambda_estimate=np.mean(my_data)
prior = scipy.stats.gamma.pdf(x, alpha, beta)  # the parameters don't matter for now
likelihood_temp = lambda yi, a: scipy.stats.poisson.pmf(yi, a)
likelihood = lambda y, a: np.log(np.prod([likelihood_temp(data, a) for data in my_data]))
posterior=likelihood(my_data,lambda_estimate) * prior
When I try to plot the posterior I get an empty plot. I plotted the prior and it looks fine, so I think the issue is the likelihood. I took the log because the data is fairly large and I didn't want things to get unstable. Can anyone point out the issues in my code? Any help would be appreciated.
In Bayesian statistics, one goal is to calculate the posterior distribution of the parameter (lambda), given the data and the prior, over a range of possible values for lambda. In your code you are calculating the prior over the array x, but you are using a single value of lambda to calculate the likelihood. The posterior and likelihood should be over x as well, something like:
posterior = [likelihood(my_data, lambda_i) for lambda_i in x] * prior
(assuming you are not taking the logs of the prior and likelihood)
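A slightly fuller sketch of the same idea, evaluating the posterior on a grid of lambda values and working in logs for numerical stability (the gamma prior parameters below are placeholders, not values from the question):
# sketch of a grid-based posterior; a=2, scale=1 are placeholder prior parameters
import numpy as np
from scipy import stats

lambdas = np.linspace(0.01, 20, 500)                     # candidate lambda values
log_prior = stats.gamma.logpdf(lambdas, a=2, scale=1)
log_lik = np.array([stats.poisson.logpmf(my_data, lam).sum() for lam in lambdas])
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())                 # unnormalised posterior
post /= np.trapz(post, lambdas)                          # normalise over the grid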
You might want to take a look at the PyMC3 library.
I would recommend you to have a look at the conjugate_prior module.
You could just type:
from conjugate_prior import GammaPoisson
model = GammaPoisson(prior_a, prior_b)
model = model.update(...)
credible_interval = model.posterior(lower_bound, upper_bound)

Solving linearised least squares using statsmodels

I'm trying to translate a simple linearised least squares problem to statsmodels, in order to learn how to use it for iterative least squares:
The (contrived) data comprise measurements of the time it takes for a ball to drop a given distance.
distance time
10 1.430
20 2.035
30 2.460
40 2.855
Using these measurements, I want to determine the acceleration due to gravity, using:
t = sqrt(2s/g)
This is (obviously) non-linear, but I can linearise it (F(x̄ + δx) = l0 + v, where x̄ is a provisional value), then use a provisional value for g (10) to calculate F(g), and iterate if necessary:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
measurements = pd.DataFrame({
    'distance': [10, 20, 30, 40],
    'time': [1.430, 2.035, 2.460, 2.855]
})
prov_g = 10
measurements['fg'] = measurements['distance'].apply(
    lambda d: ((2 * d) ** 0.5) * (prov_g ** -0.5))
measurements['A_matrix'] = measurements['distance'].apply(
    lambda d: -np.sqrt(d / 2) * (prov_g ** -1.5))
measurements['b'] = measurements['time'] - measurements['fg']
ATA = np.dot(measurements['A_matrix'], measurements['A_matrix'].T)
ATb = np.dot(measurements['A_matrix'].T, measurements['b'])
x = np.dot(ATA ** -1, ATb)
updated_g = prov_g + x
updated_g
>>> 9.807
What I can't figure out from the examples is how I can use statsmodels to do what I've just done manually (linearising the problem, then solving using matrix multiplication).
statsmodels is not directly of any help here, at least not yet.
I think your linearised non-linear least squares optimisation is essentially what scipy.optimize.leastsq does internally. It has several more user-friendly or extended wrappers, for example scipy.optimize.curve_fit or the lmfit package.
Statsmodels currently does not have a generic version of an equivalent iterative solver.
Statsmodels uses iteratively reweighted least squares as the optimizer in several models, such as GLM and RLM. However, those are model-specific implementations. In those cases statsmodels uses WLS (weighted least squares) to calculate the equivalent of your solution for the linear model when computing the next step.
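To make the pointer to curve_fit concrete, here is a minimal sketch applied to the question's data (not part of the original answer); curve_fit handles the linearisation and iteration internally and should land close to the 9.807 obtained manually above:
# sketch, not from the original answer: estimate g directly with curve_fit
import numpy as np
from scipy.optimize import curve_fit

def drop_time(distance, g):
    # model: t = sqrt(2*s/g)
    return np.sqrt(2 * distance / g)

popt, pcov = curve_fit(drop_time, measurements['distance'], measurements['time'], p0=[10])
print(popt[0])   # estimated g, close to the 9.807 above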

statsmodels - plotting the fitted distribution

The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
coef stderr
Intercept 2.9471 0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need two parameters r and p to evaluate the stats.nbinom(r,p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution. The Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has the Negative Binomial as a maximum likelihood model in discrete_model, which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
Aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
import numpy as np
from scipy import stats
import statsmodels.api as sm
# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
loglike_method = 'nb1' # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
print(dist0.mean())
print(res.params)
mu = res.predict() # use this for mean if not constant
mu = np.exp(res.params[0]) # shortcut, we just regress on a constant
alpha = res.params[1]
if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0
size = 1. / alpha * mu**Q
prob = size / (size + mu)
print('data generating parameters', n, p)
print('estimated params          ', size, prob)
#estimated distribution
dist_est = stats.nbinom(size, prob)
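The question also asked how to plot the PMF; here is a short sketch (not in the original answer) comparing the simulated data with the estimated distribution:
# sketch, not from the original answer: histogram of the data vs. estimated pmf
import matplotlib.pyplot as plt
xg = np.arange(y.max() + 1)
plt.hist(y, bins=np.arange(y.max() + 2) - 0.5, density=True, alpha=0.5, label='data')
plt.plot(xg, dist_est.pmf(xg), 'o-', label='estimated nbinom pmf')
plt.legend()
plt.show()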
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106
