I'm porting a Stata model to Python, and I'm seeing different results from Python and Stata for a linear regression on the same input data (available at https://drive.google.com/file/d/0B8PLy9yAUHvlcTI1SG5sdzdnaWc/view?usp=sharing).
The Stata code is:
reg growth time*
predict ghat
predict resid, residuals
And the result is (first 5 rows):
. list growth ghat resid
+----------------------------------+
| growth ghat resid |
|----------------------------------|
1. | 2.3527029 2.252279 .1004239 |
2. | 2.377728 2.214551 .163177 |
3. | 2.3547957 2.177441 .177355 |
4. | 3.0027488 2.140942 .8618064 |
5. | 3.0249328 2.10505 .9198825 |
In Python, the code is:
import pandas as pd
from sklearn.linear_model import LinearRegression

def linear_regression(df, dep_col, indep_cols):
    # note: the normalize= keyword was removed in scikit-learn 1.2;
    # on current versions, standardize the inputs separately instead
    lf = LinearRegression(normalize=True)
    lf.fit(df[indep_cols.split(' ')], df[dep_col])
    return lf

df = pd.read_stata('/tmp/python.dta')
lr = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
df['ghat'] = lr.predict(df['time time2 time3 time4 time5'.split(' ')])
df['resid'] = df.growth - df.ghat
df.head(5)['growth ghat resid'.split(' ')]
and the result is:
growth ghat resid
0 2.352703 3.026936 -0.674233
1 2.377728 2.928860 -0.551132
2 2.354796 2.833610 -0.478815
3 3.002749 2.741135 0.261614
4 3.024933 2.651381 0.373551
I also tried R, and got the same result as in Python. I cannot figure out the root cause: is it because the algorithm used in Stata is slightly different? From the source code I can tell that sklearn uses ordinary least squares, but I have no idea which method Stata uses.
Could anyone advise here?
---------- Edit 1 -----------
I tried specifying the data type in Stata as double, but Stata still produces the same result as with float. The Stata code that generates the variables is below:
gen double growth = .
foreach lag in `lags' {
    replace growth = ma_${metric}_per_`group' / l`lag'.ma_${metric}_per_`group' - 1 if nlag == `lag' & in_sample
}
gen double time = day - td(01jan2010) + 1
forvalues i = 2/5 {
    gen double time`i' = time^`i'
}
---------- Edit 2 -----------
It's confirmed that Stata does drop the time variable due to collinearity. The message was not seen before because our Stata code runs the model under the quietly prefix to suppress unwanted output, and per my investigation that note cannot be surfaced selectively. So it appears that I need to detect collinearity and remove the collinear column(s) in Python as well.
. reg growth time*,
note: time omitted because of collinearity
Source | SS df MS Number of obs = 381
-------------+------------------------------ F( 4, 376) = 126.10
Model | 37.6005042 4 9.40012605 Prob > F = 0.0000
Residual | 28.0291465 376 .074545602 R-squared = 0.5729
-------------+------------------------------ Adj R-squared = 0.5684
Total | 65.6296507 380 .172709607 Root MSE = .27303
------------------------------------------------------------------------------
growth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | 0 (omitted)
time2 | -.0098885 .0009231 -10.71 0.000 -.0117037 -.0080734
time3 | .0000108 1.02e-06 10.59 0.000 8.77e-06 .0000128
time4 | -4.40e-09 4.20e-10 -10.47 0.000 -5.22e-09 -3.57e-09
time5 | 6.37e-13 6.15e-14 10.35 0.000 5.16e-13 7.58e-13
_cons | 3322.727 302.7027 10.98 0.000 2727.525 3917.93
------------------------------------------------------------------------------
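To mirror that behaviour in Python, here is a minimal sketch (my own, not from the thread) that checks whether the design matrix, including the constant, is numerically full rank; the tolerances are NumPy's defaults and may differ from Stata's:

import numpy as np

X = df['time time2 time3 time4 time5'.split()].to_numpy(dtype=float)
Xc = np.column_stack([np.ones(len(X)), X])       # prepend the constant column
print(np.linalg.cond(Xc))                        # enormous => numerically fragile
print(np.linalg.matrix_rank(Xc), Xc.shape[1])    # rank < #columns => collinear

If the rank comes out below the number of columns, columns can be removed until the rank matches the column count, mirroring Stata's omission of time.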
The predictors are the 1st ... 5th powers of time which varies between 1627 and 2007 (presumably a calendar year, not that it matters). Even with modern software it would have been prudent to shift the origin of time to reduce the numerical strain, e.g. to work with powers of (time - 1800).
Anyway, redoing the regression shows that Stata drops the first predictor as collinear. What happens in Python and R? These are different reactions to a numerically tricky challenge.
(Fitting a quintic polynomial rarely has scientific value, but that may not be of concern here. The fitted curve based on powers 2 to 5 doesn't work very well for these data, which appear economic. It is at least consistent with that poor fit that the first 5 residuals are all positive, though that isn't true of all of them!)
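Following the centering suggestion, a quick illustrative check (my own snippet) of how much shifting the origin improves the conditioning of the design matrix:

import numpy as np

t = df['time'].to_numpy(dtype=float)
powers = np.arange(1, 6)
raw = t[:, None] ** powers               # powers of time, as used above
shifted = (t - 1800)[:, None] ** powers  # powers of (time - 1800)
print(np.linalg.cond(raw), np.linalg.cond(shifted))
# the shifted design should be better conditioned by many orders of magnitude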
In effect it is a wildcard issue: the Stata model fitted with time* actually uses time2, time3, ..., but not time, because Stata omits time as collinear. If the Python code is changed to lr = linear_regression(df, 'growth', 'time2 time3 time4 time5'), it will crank out the exact same result.
Edit
It appears Stata dropped the 1st independent variable. The fit can be visualized as follows:
import numpy as np
import matplotlib.pyplot as plt

lr1 = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
lr2 = linear_regression(df, 'growth', 'time2 time3 time4 time5')
pred_x1 = ((np.linspace(1620, 2000)[..., np.newaxis] ** np.array([1, 2, 3, 4, 5])) * lr1.coef_).sum(1) + lr1.intercept_
pred_x2 = ((np.linspace(1620, 2000)[..., np.newaxis] ** np.array([2, 3, 4, 5])) * lr2.coef_).sum(1) + lr2.intercept_
plt.plot(np.linspace(1620, 2000), pred_x1, label='Python/R fit')
plt.plot(np.linspace(1620, 2000), pred_x2, label='Stata fit')
plt.plot(df.time, df.growth, '+', label='Data')
plt.legend(loc=0)
And the residual sum of squares:
pred1 = (df.time.values[..., np.newaxis] ** np.array([1, 2, 3, 4, 5]) * lr1.coef_).sum(1) + lr1.intercept_
pred2 = (df.time.values[..., np.newaxis] ** np.array([2, 3, 4, 5]) * lr2.coef_).sum(1) + lr2.intercept_
print('Python fit RSS', ((pred1 - df.growth.values) ** 2).sum())
print('Stata fit RSS', ((pred2 - df.growth.values) ** 2).sum())
Python fit RSS 7.2062436549
Stata fit RSS 28.0291464826
Based on the work of Kuo et al. (2007) (Kuo, H.-I., Chen, C.-C., Tseng, W.-C., Ju, L.-F., & Huang, B.-W. (2007). Assessing impacts of SARS and Avian Flu on international tourism demand to Asia. Tourism Management; https://www.sciencedirect.com/science/article/abs/pii/S0261517707002191?via%3Dihub), I am measuring the effect of COVID-19 on tourism demand.
My panel data can be found here: https://www.dropbox.com/s/t0pkwrj59zn22gg/tourism_covid_data-total.csv?dl=0
I would like to use a first-difference transformation model (GMMDIFF) and treat lags of the dependent variable (tourism demand) as instruments for the lagged dependent variable. The dynamic, first-differenced version of the tourism demand model is:
Δy_it = η2 Δy_i,t-1 + η3 ΔS_it + Δu_it
where y is tourism demand, i refers to COVID-19 infected countries, t is time, S is the number of SARS cases, and u is the fixed-effects decomposition of the error term.
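As a minimal pandas sketch (my own illustration, reusing the column names Country, month_year, tourism_demand, and monthly cases that appear in the code below), the differences and lags should be taken within each country, never across the stacked panel:

import pandas as pd

df = pd.read_csv('tourism_covid_data-total.csv', parse_dates=['month_year'])
df = df.sort_values(['Country', 'month_year'])

df['d_y'] = df.groupby('Country')['tourism_demand'].diff()        # Δy_it
df['d_y_lag1'] = df.groupby('Country')['d_y'].shift(1)            # Δy_i,t-1
df['d_S'] = df.groupby('Country')['monthly cases'].diff()         # ΔS_it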
Up to now, using Python, I have managed to get some results using PanelOLS:
import pandas as pd
import numpy as np
from linearmodels import PanelOLS
import statsmodels.api as sm

tourism_covid_data = pd.read_csv('../Data/Data - Dec2021/tourism_covid_data-total.csv',
                                 header=0, parse_dates=['month_year'])
# Note: in a panel the lag should normally be taken within each entity, e.g.
# groupby('Country')['tourism_demand'].shift(1); a plain shift(1) leaks values
# across countries. PanelOLS also expects an entity/time MultiIndex.
tourism_covid_data['l.tourism_demand'] = tourism_covid_data['tourism_demand'].shift(1)
tourism_covid_data = tourism_covid_data.dropna()
exog = sm.add_constant(tourism_covid_data[['l.tourism_demand', 'monthly cases']])
mod = PanelOLS(tourism_covid_data['tourism_demand'], exog, entity_effects=True)
fe_res = mod.fit()
fe_res
I am trying to find a solution that uses GMM for my data; however, GMM seems not to be widely used in Python, and there are no similar questions on Stack Overflow. Any ideas on how I can proceed?
I just tried your data. I don't think it suits difference or system GMM, because it is a long panel with T (=48) >> N (=4). Anyway, pydynpd still produces results. In both cases I had to collapse the instrument matrix to reduce the issue of too many instruments.
Model 1: diff GMM; treating "monthly cases" as a predetermined variable
import pandas as pd
from pydynpd import regression

df = pd.read_csv("tourism_covid_data-total.csv")
df['monthly_cases'] = df['monthly cases']  # rename: the command string below cannot reference a name containing a space
command_str='tourism_demand L1.tourism_demand monthly_cases | gmm(tourism_demand, 2 6) gmm(monthly_cases, 1 2)| nolevel collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
The output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step difference GMM
Group variable: Country Number of obs = 184
Time variable: month_year Number of groups = 4
Number of instruments = 7
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.7657082 | 0.0266379 | 28.7450196 | 0.0000000 |
| monthly_cases | -182173.5644815 | 171518.4068348 | -1.0621225 | 0.2881801 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(5) = 3.940 Prob > Chi2 = 0.558
Arellano-Bond test for AR(1) in first differences: z = -1.04 Pr > z =0.299
Arellano-Bond test for AR(2) in first differences: z = 1.00 Pr > z =0.319
Model 2: diff GMM; treating the lag of "monthly cases" as an exogenous variable
command_str='tourism_demand L1.tourism_demand L1.monthly_cases | gmm(tourism_demand, 2 6) iv(L1.monthly_cases)| nolevel collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
Output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step difference GMM
Group variable: Country Number of obs = 184
Time variable: month_year Number of groups = 4
Number of instruments = 6
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.7413765 | 0.0236962 | 31.2866594 | 0.0000000 |
| L1.monthly_cases | -190277.2987977 | 164169.7711072 | -1.1590276 | 0.2464449 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(4) = 1.837 Prob > Chi2 = 0.766
Arellano-Bond test for AR(1) in first differences: z = -1.05 Pr > z =0.294
Arellano-Bond test for AR(2) in first differences: z = 1.00 Pr > z =0.318
Model 3: similar to Model 2, but a system GMM.
command_str='tourism_demand L1.tourism_demand L1.monthly_cases | gmm(tourism_demand, 2 6) iv(L1.monthly_cases)| collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
Output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step system GMM
Group variable: Country Number of obs = 188
Time variable: month_year Number of groups = 4
Number of instruments = 8
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.5364657 | 0.0267678 | 20.0414904 | 0.0000000 |
| L1.monthly_cases | -216615.8306112 | 177416.0961037 | -1.2209480 | 0.2221057 |
| _con | -10168.9640333 | 8328.7444649 | -1.2209480 | 0.2221057 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(5) = 1.876 Prob > Chi2 = 0.866
Arellano-Bond test for AR(1) in first differences: z = -1.06 Pr > z =0.288
Arellano-Bond test for AR(2) in first differences: z = 0.99 Pr > z =0.322
There is a Python package, pydynpd, that supports system and difference GMM for dynamic panel models:
https://github.com/dazhwu/pydynpd
Features include: (1) difference and system GMM, (2) one-step and two-step estimators, (3) robust standard errors including the one suggested by Windmeijer (2005), (4) Hansen over-identification test, (5) Arellano-Bond test for autocorrelation, (6) time dummies, (7) allows users to collapse instruments to reduce instrument proliferation issue, and (8) a simple grammar for model specification.
I have two questions on statsmodels GLM:
Is there a way to tell the GLM to automatically set the level with the most observations as the base (i.e. the level with parameter 0) for each factor? If not, is there a reason for this?
Is there a way to display or extract the names of the base levels (i.e. the levels with param = 0) from the GLM? I know the predict function works fine, but I am extracting the GLM output to use elsewhere and would love to automate this.
I know the workaround of using Treatment in the formula, e.g. instead of formula='y~C(x)' I can write formula='y~C(x, Treatment("abc"))'. I currently use this for question 2, and I suppose I could extend it to question 1 by chasing the data and the formula through a function that enhances the formula, but I was wondering whether there is a cleaner way to do this, or a feature in the pipeline that would make it possible.
Cheers SO
For anyone who might be interested, I implemented the workaround mentioned above with the following function. It works fine, but it has the disadvantage that you get into trouble if the levels contain brackets () or []. You can avoid that by handing over the treatment explicitly, so it's workable but not perfect.
import logging

import pandas as pd

log = logging.getLogger(__name__)

def add_treatment_to_formula(formula: str, df: pd.DataFrame, exposure: str):
    """ Little helper to add the Treatment field (this is the statsmodels terminology for the base level,
    i.e. the level that gets the parameter=0 (factor=1)) to GLM formulas. It sets it to the level with the
    largest exposure found in the df handed over.

    :param formula: Example: 'claimamount ~ age + C(postcode) + C(familystatus, Treatment(2))' will get turned into
        'claimamount ~ age + C(postcode, Treatment("12435")) + C(familystatus, Treatment(2))'. The familystatus already
        had a treatment, and age is not categorical in this example, so only the postcode gets transformed.
    :type formula: str
    :param df: DataFrame with at least the columns that show up on the right side of the formula (age, postcode,
        familystatus in the example) and the column containing the exposure as named in the exposure argument
    :type df: pd.DataFrame
    :param exposure: Name of the column in the df containing the exposure
    :type exposure: str
    :return: Formula with treatments added
    :rtype: str
    """
    l_yx = formula.split('~')
    if len(l_yx) > 2:
        log.error("This does not look like a formula, more than one ~ found.")
    l_xs = l_yx[1].split('+')
    l_xs_enh = []
    for x in l_xs:
        # if the treatment field is already set up, don't change it; also, if the
        # field is not categorical, don't change it
        if ('Treatment' in x) | ('C(' not in x):
            l_xs_enh.append(x)
        else:  # get the field with the largest exposure and set it as the treatment
            field = x[x.find('(') + 1:x.find(')')].strip()
            df_exposure = df.groupby(field)[exposure].sum().reset_index()
            treatment = df_exposure.loc[df_exposure[exposure] == max(df_exposure[exposure]), field].values[0]
            quotes = '"' if isinstance(treatment, str) else ''
            x_enh = f'C({field}, Treatment({quotes}{treatment}{quotes}))'
            l_xs_enh.append(x_enh)
    formula_enhanced = l_yx[0] + '~ ' + ' + '.join(l_xs_enh)
    return formula_enhanced
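A quick hypothetical usage example (the column names and values are invented for illustration):

# 'exposure' drives the choice of base level: postcode '12435' has the most
df = pd.DataFrame({
    'claimamount': [100, 50, 80, 120],
    'age': [30, 40, 35, 50],
    'postcode': ['12435', '12435', '99999', '12435'],
    'exposure': [1.0, 0.5, 0.8, 1.2],
})
print(add_treatment_to_formula('claimamount ~ age + C(postcode)', df, 'exposure'))
# -> 'claimamount ~ age + C(postcode, Treatment("12435"))' (up to whitespace)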
It's better to do this by categorizing the variable before feeding it into the GLM. This can be achieved with pd.Categorical, for example using a simulated dataset:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(123)
df = pd.DataFrame({'y':np.random.uniform(0,1,100),
'x':np.random.choice(['a','b','c','d'],100)})
Here d would be the reference level since it has the most observations:
df.x.value_counts()
d 28
b 27
c 26
a 19
If the order of the remaining levels after the reference is not important, you can simply do:
df['x'] = pd.Categorical(df['x'],df.x.value_counts().index)
The reference level is simply:
df.x.cat.categories[0]
'd'
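If you do want the remaining levels in a fixed (say alphabetical) order, with only the reference moved to the front, a small variation (my own sketch) works too:

ref = df['x'].value_counts().idxmax()    # level with the most observations
order = [ref] + sorted(c for c in df['x'].unique() if c != ref)
df['x'] = pd.Categorical(df['x'], categories=order)
print(df['x'].cat.categories)            # ['d', 'a', 'b', 'c']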
Regression on this:
model = smf.glm(formula = 'y ~ x',data=df).fit()
And you can see the reference is d:
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 96
Model Family: Gaussian Df Model: 3
Link Function: identity Scale: 0.059173
Method: IRLS Log-Likelihood: 1.5121
Date: Tue, 23 Feb 2021 Deviance: 5.6806
Time: 09:16:31 Pearson chi2: 5.68
No. Iterations: 3
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5108 0.046 11.111 0.000 0.421 0.601
x[T.b] -0.0953 0.066 -1.452 0.146 -0.224 0.033
x[T.c] 0.0633 0.066 0.956 0.339 -0.067 0.193
x[T.a] -0.0005 0.072 -0.007 0.994 -0.142 0.141
==============================================================================
Another option is to use the Treatment approach you pointed to; the first task is then to get the top (most frequent) level:
np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})
ref = df.x.describe().top  # the most frequent level, 'd'
# pass the reference into the formula so it is actually used in the fit
# (patsy's Treatment is available inside formulas without a manual import)
mod = smf.glm(f"y ~ C(x, Treatment(reference='{ref}'))", data=df).fit()
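To answer the second question for this approach, the base level can be recovered from the fitted parameter names; a sketch of my own, relying only on the fact that patsy names treatment-coded columns '[T.<level>]':

import re

present = set(re.findall(r'\[T\.(.+?)\]', ' '.join(mod.params.index)))
base_levels = set(df['x'].unique()) - present
print(base_levels)                       # {'d'} when 'd' is the reference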
This question refers to the book "Practical Statistics for Data Scientists", 2nd edition (O'Reilly), Chapter 3, section "Chi-Square Test".
The book provides an example of a chi-square test case: a website tries three different headlines, each shown to 1,000 visitors, and the result records the number of clicks on each headline.
The Observed data is the following:
Headline A B C
Click 14 8 12
No-click 986 992 988
The expected values, with each headline expected to receive an equal share of the 34 total clicks (34/3 ≈ 11.33), are the following:
Headline A B C
Click 11.33 11.33 11.33
No-click 988.67 988.67 988.67
The Pearson residual is defined as R = (Observed - Expected) / sqrt(Expected), which gives the following table:
Headline A B C
Click 0.792 -0.990 0.198
No-click -0.085 0.106 -0.021
The chi-square statistic is the sum of the squared Pearson residuals, Χ² = Σ R², which comes to 1.666.
So far so good.
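A quick numeric check of the tables above (my own snippet, not from the book):

import numpy as np

observed = np.array([[14, 8, 12], [986, 992, 988]])
expected = np.array([[34 / 3] * 3, [1000 - 34 / 3] * 3])
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(3))            # matches the Pearson residual table
print((residuals ** 2).sum())        # ~1.666, the chi-square statistic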
Now here comes the resampling part:
1. Assemble a box of 34 ones (clicks) and 2,966 zeros (no-clicks).
2. Shuffle the box, take three samples of 1,000, and count the ones (clicks) in each.
3. Find the squared differences between the shuffled counts and the expected counts, divide each by the expected count, and sum them (as the chi2 function below does).
4. Repeat steps 2 and 3 a few thousand times.
5. The p-value is how often the resampled sum of squared deviations exceeds the observed one.
The book provides the resampling test code in Python as follows:
(Can be downloaded from https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/code)
## Practical Statistics for Data Scientists (Python)
## Chapter 3. Statistical Experiments and Significance Testing
# > (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck
# Import required Python packages.
from pathlib import Path
import random
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import power
import matplotlib.pylab as plt
DATA = Path('.').resolve().parents[1] / 'data'
# Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.
CLICK_RATE_CSV = DATA / 'click_rates.csv'
...
## Chi-Square Test
### Chi-Square Test: A Resampling Approach
# Table 3-4
click_rate = pd.read_csv(CLICK_RATE_CSV)
clicks = click_rate.pivot(index='Click', columns='Headline', values='Rate')
print(clicks)
# Table 3-5
row_average = clicks.mean(axis=1)
pd.DataFrame({
    'Headline A': row_average,
    'Headline B': row_average,
    'Headline C': row_average,
})
# Resampling approach
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)
def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [34 / 3, 1000 - 34 / 3]
chi2observed = chi2(clicks.values, expected)
def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)
perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')
Now, I ran perm_fun(box) 2,000 times and obtained a resampled p-value of 0.4775.
However, when I ran perm_fun(box) 10,000 times and 100,000 times, I obtained a resampled p-value of about 0.84 both times. It seems to me the p-value should be around 0.84.
Why is stats.chi2_contingency showing such a smaller number?
The result I get for running 2000 times is:
Observed chi2: 1.6659
Resampled p-value: 0.8300
Observed chi2: 1.6659
p-value: 0.4348
And if I were to run it 10,000 times, the result is:
Observed chi2: 1.6659
Resampled p-value: 0.8386
Observed chi2: 1.6659
p-value: 0.4348
Software versions:
pandas.__version__: 0.25.1
numpy.__version__: 1.16.5
scipy.__version__: 1.3.1
statsmodels.__version__: 0.10.1
sys.version_info: 3.7.4
I ran your code trying 2000, 10000, and 100000 loops, and all three times I got close to 0.47. I did, however, get an error at this line that I had to fix:
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
Here perm_chi2 is a list and chi2observed is a float, so comparing them with > raises a TypeError in Python 3; I wonder how this code ever ran for you (perhaps whatever you did to fix it was the source of the discrepancy). In any case, changing it to the intended
resampled_p_value = sum([1*(x > chi2observed) for x in perm_chi2]) / len(perm_chi2)
allowed me to run it and get close to 0.47.
Make sure that when you change the number of iterations, you do so only by changing the 2000, none of the other numbers.
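Equivalently (my own variant), converting the list to a NumPy array keeps the comparison vectorized and sidesteps the TypeError:

import numpy as np

resampled_p_value = np.mean(np.array(perm_chi2) > chi2observed)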
I am trying to understand the results of mixed linear models provided by the Python statsmodels package, and I want to avoid pitfalls in my data analysis and interpretation. The questions come after the data loading/output code block.
Loading data and fitting model:
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = sm.datasets.get_rdataset("dietox", "geepack").data
md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
mdf = md.fit()
print(mdf.summary())
Mixed Linear Model Regression Results
========================================================
Model: MixedLM Dependent Variable: Weight
No. Observations: 861 Method: REML
No. Groups: 72 Scale: 11.3669
Min. group size: 11 Likelihood: -2404.7753
Max. group size: 12 Converged: Yes
Mean group size: 12.0
--------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept 15.724 0.788 19.952 0.000 14.179 17.268
Time 6.943 0.033 207.939 0.000 6.877 7.008
Group Var 40.394 2.149
========================================================
Q1. (a) What exactly is the Group Var coefficient (params)? I thought it was the variance of Group Var (cov_params), but the default output does not match the built-in method's output.
Q1. (b) What does the "Group Var" parameter (params) mean?
print "-----Parameters-----"
print mdf.params
print
print "-----Covariance matrix-----"
print mdf.cov_params()
-----Parameters-----
Intercept 15.723523
Time 6.942505
Group Var 3.553634
dtype: float64
-----Covariance matrix-----
Intercept Time Group Var
Intercept 0.621028 -0.007222 0.000052
Time -0.007222 0.001115 -0.000012
Group Var 0.000052 -0.000012 0.406197
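One clue for Q1, from the numbers above (my own check; worth confirming against the statsmodels docs): the Group Var entry in params looks like the random-effect variance divided by the residual Scale, because multiplying it back by Scale reproduces the value in the summary table:

print(mdf.params["Group Var"] * mdf.scale)   # 3.553634 * 11.3669 ~ 40.394
print(mdf.cov_re)                            # random-effects covariance, unscaled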
Q2. (a) What does the standard error (bse) for Group Var mean? Why is the Group Var estimate not reported in the default output? Is it not important?
Q2. (b) How is it different from the standard error of the random effects (bse_re)?
print "-----Standard errors-----"
print mdf.bse
print
print "-----Standard errors of random effects-----"
print mdf.bse_re
-----Standard errors-----
Intercept 0.788053
Time 0.033387
Group Var 0.637336
dtype: float64
-----Standard errors of random effects-----
Group Var 2.148771
dtype: float64
Q3. Why are t-values and p-values not reported for the random parameters in summary()?
print "-----t-values (or z-values?)-----"
print mdf.tvalues
print
print "-----p-values-----"
print mdf.pvalues
-----t-values (or z-values?)-----
Intercept 19.952366
Time 207.938608
Group Var 5.575760
dtype: float64
-----p-values-----
Intercept 1.429597e-88
Time 0.000000e+00
Group Var 2.464519e-08
dtype: float64
Reference: https://www.statsmodels.org/dev/mixed_linear.html
I need to do some multinomial regression in Julia. In R I get the following result:
library(nnet)
data <- read.table("Dropbox/scripts/timeseries.txt",header=TRUE)
multinom(y~X1+X2,data)
# weights: 12 (6 variable)
initial value 10985.024274
iter 10 value 10438.503738
final value 10438.503529
converged
Call:
multinom(formula = y ~ X1 + X2, data = data)
Coefficients:
(Intercept) X1 X2
2 0.4877087 0.2588725 0.2762119
3 0.4421524 0.5305649 0.3895339
Residual Deviance: 20877.01
AIC: 20889.01
Here is my data
My first attempt was using Regression.jl. The documentation is quite sparse for this package so I am not sure which category is used as baseline, which parameters the resulting output corresponds to, etc. I filed an issue to ask about these things here.
using DataFrames
using Regression
import Regression: solve, Options, predict
dat = readtable("timeseries.txt", separator='\t')
X = convert(Matrix{Float64},dat[:,2:3])
y = convert(Vector{Int64},dat[:,1])
ret = solve(mlogisticreg(X',y,3), reg=ZeroReg(), options=Options(verbosity=:iter))
the result is
julia> ret.sol
3x2 Array{Float64,2}:
-0.573027 -0.531819
0.173453 0.232029
0.399575 0.29979
but again, I am not sure what this corresponds to.
Next I tried the Julia wrapper to Python's SciKitLearn:
using ScikitLearn
@sk_import linear_model: LogisticRegression

model = ScikitLearn.fit!(LogisticRegression(multi_class="multinomial", solver="lbfgs"), X, y)
model[:coef_]
3x2 Array{Float64,2}:
-0.261902 -0.220771
-0.00453731 0.0540354
0.266439 0.166735
At first I had not figured out how to extract the coefficients from this model; updated above with model[:coef_]. These also don't look like the R results.
Any help trying to replicate R's results would be appreciated (using whatever package!).
Note that the predictor variables are just the discretized, time-lagged response, i.e.
julia> dat[1:3,:]
3x3 DataFrames.DataFrame
| Row | y | X1 | X2 |
|-----|---|----|----|
| 1 | 3 | 1 | 0 |
| 2 | 3 | 0 | 1 |
| 3 | 1 | 0 | 1 |
for row 2 you can see that the predictors (0, 1) mean the previous observation was a 3. Similarly (1, 0) means the previous observation was a 2, and (0, 0) means it was a 1.
Update:
For Regression.jl, it seems it does not fit an intercept by default (and calls it "bias" instead of an intercept). Adding this term gives results very similar to Python's (I'm not sure what the third column is, though):
julia> ret = solve(mlogisticreg(X',y,3, bias=1.0), reg=ZeroReg(), options=Options(verbosity=:iter))
julia> ret.sol
3x3 Array{Float64,2}:
-0.263149 -0.221923 -0.309949
-0.00427033 0.0543008 0.177753
0.267419 0.167622 0.132196
UPDATE:
Since the model coefficients are not identifiable, I should not expect them to be the same across these different implementations. However, the predicted probabilities should be the same, and in fact they are (using R, Regression.jl, or ScikitLearn).
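To see why the coefficients are not identifiable, note that the softmax is invariant to adding the same vector to every class's coefficients; a small self-contained demonstration in Python (my own, independent of the packages above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
W = rng.normal(size=(3, 2))          # one coefficient row per class
b = rng.normal(size=3)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

p1 = softmax(X @ W.T + b)
shift_w, shift_b = rng.normal(size=2), rng.normal()
p2 = softmax(X @ (W + shift_w).T + (b + shift_b))  # shifted parameters
print(np.allclose(p1, p2))           # True: same probabilities, different coefficients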