I need to do some multinomial regression in Julia. In R I get the following result:
library(nnet)
data <- read.table("Dropbox/scripts/timeseries.txt",header=TRUE)
multinom(y~X1+X2,data)
# weights: 12 (6 variable)
initial value 10985.024274
iter 10 value 10438.503738
final value 10438.503529
converged
Call:
multinom(formula = y ~ X1 + X2, data = data)
Coefficients:
(Intercept) X1 X2
2 0.4877087 0.2588725 0.2762119
3 0.4421524 0.5305649 0.3895339
Residual Deviance: 20877.01
AIC: 20889.01
Here is my data
My first attempt was with Regression.jl. The documentation for this package is quite sparse, so I am not sure which category is used as the baseline, which parameters the resulting output corresponds to, etc. I filed an issue to ask about these things here.
using DataFrames
using Regression
import Regression: solve, Options, predict
dat = readtable("timeseries.txt", separator='\t')
X = convert(Matrix{Float64},dat[:,2:3])
y = convert(Vector{Int64},dat[:,1])
ret = solve(mlogisticreg(X',y,3), reg=ZeroReg(), options=Options(verbosity=:iter))
the result is
julia> ret.sol
3x2 Array{Float64,2}:
-0.573027 -0.531819
0.173453 0.232029
0.399575 0.29979
but again, I am not sure what this corresponds to.
Next I tried the Julia wrapper for Python's scikit-learn:
using ScikitLearn
@sk_import linear_model: LogisticRegression
model = ScikitLearn.fit!(LogisticRegression(multi_class="multinomial", solver = "lbfgs"), X, y)
model[:coef_]
3x2 Array{Float64,2}:
-0.261902 -0.220771
-0.00453731 0.0540354
0.266439 0.166735
At first I had not figured out how to extract the coefficients from this model; the post has since been updated with them above. These also don't look like the R results.
Any help replicating R's results would be appreciated (using whatever package!).
Note that the predictor variables are just a dummy coding of the time-lagged response, i.e.
julia> dat[1:3,:]
3x3 DataFrames.DataFrame
| Row | y | X1 | X2 |
|-----|---|----|----|
| 1 | 3 | 1 | 0 |
| 2 | 3 | 0 | 1 |
| 3 | 1 | 0 | 1 |
For row 2 you can see that the predictors (X1, X2) = (0, 1) mean the previous observation was a 3. Similarly, (1, 0) means the previous observation was a 2, and (0, 0) means it was a 1.
Update:
For Regression.jl, it seems it does not fit an intercept by default (and it calls this a "bias" rather than an intercept). By adding this term we get results very similar to Python's (though I am not sure what the third column is).
julia> ret = solve(mlogisticreg(X',y,3, bias=1.0), reg=ZeroReg(), options=Options(verbosity=:iter))
julia> ret.sol
3x3 Array{Float64,2}:
-0.263149 -0.221923 -0.309949
-0.00427033 0.0543008 0.177753
0.267419 0.167622 0.132196
UPDATE:
Since the model coefficients are not identifiable, I should not expect them to be the same across these different implementations. However, the predicted probabilities should be the same, and in fact they are (using R, Regression.jl, or ScikitLearn).
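To see this concretely, here is a minimal Python sketch (my own cross-check, not from any of the packages above) that turns R's coefficient table into predicted probabilities; it assumes nnet's convention that class 1 is the baseline, so its score is fixed at 0:
import numpy as np
# rows taken from the R output above: (Intercept), X1, X2 for classes 2 and 3
coef = np.array([[0.4877087, 0.2588725, 0.2762119],
                 [0.4421524, 0.5305649, 0.3895339]])
def predict_proba(x1, x2):
    scores = np.concatenate(([0.0], coef @ np.array([1.0, x1, x2])))
    p = np.exp(scores)
    return p / p.sum()              # P(y=1), P(y=2), P(y=3)
print(predict_proba(1, 0))          # e.g. probabilities when (X1, X2) = (1, 0)
The same softmax idea applies to the coefficients from Regression.jl or ScikitLearn, modulo their different parameterizations.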
I need a quick way to linearly interpolate between the nearest points of a data frame, without adding new points to the data frame, for a lot of data (millions of points, no NaNs). The dataframe is sorted by its x values.
E.g. I have a dataframe with the following columns:
x | y
-----
0 | 1
1 | 2
2 | 3
...
I need a function that, for a given input x value, returns the linearly interpolated value between the nearest points, something like this:
calc_linear(df, input_col='x', input_val=1.5, output_col='y') will output 2.5, the interpolated y value for the given x.
Maybe there are some pandas functions for that?
Use numpy.interp:
import numpy as np
def calc_linear(df, input_val, input_col='x', output_col='y'):
    return np.interp(input_val, df[input_col], df[output_col])
y = calc_linear(df, 1.5)
print(y)
# Output
2.5
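If you would rather stay within pandas, here is a sketch of an equivalent approach (my own variant, assuming x is sorted as stated in the question): put x on the index, temporarily add the query point to a copy, and interpolate on the index values; the original dataframe is left untouched.
import pandas as pd
def calc_linear_pd(df, input_val, input_col='x', output_col='y'):
    s = df.set_index(input_col)[output_col]
    # the union adds the query point only to this temporary series, not to df
    s = s.reindex(s.index.union([input_val])).interpolate(method='index')
    return s.loc[input_val]
print(calc_linear_pd(pd.DataFrame({'x': [0, 1, 2], 'y': [1, 2, 3]}), 1.5))  # 2.5
Note that np.interp also assumes the x values are increasing, which matches the question's sorted dataframe.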
Based on the work of Kuo et al. (Kuo, H.-I., Chen, C.-C., Tseng, W.-C., Ju, L.-F., & Huang, B.-W. (2007). Assessing impacts of SARS and Avian Flu on international tourism demand to Asia. Tourism Management. Retrieved from: https://www.sciencedirect.com/science/article/abs/pii/S0261517707002191?via%3Dihub), I am measuring the effect of COVID-19 on tourism demand.
My panel data can be found here: https://www.dropbox.com/s/t0pkwrj59zn22gg/tourism_covid_data-total.csv?dl=0
I would like to use a first-difference transformation model (GMM-DIFF) and treat the lags of the dependent variable (tourism demand) as instruments for the lagged dependent variable. The dynamic, first-differenced version of the tourism demand model is:
Δy_it = η2 Δy_i,t-1 + η3 ΔS_it + Δu_it
where y is tourism demand, i refers to COVID-19 infected countries, t is time, S is the number of SARS cases, and u is the fixed-effects decomposition of the error term.
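For intuition, here is a small pandas sketch of the first-difference terms in that equation (an illustration only; the column names 'Country', 'month_year', 'tourism_demand' and 'monthly cases' are taken from the CSV used below):
import pandas as pd
df = pd.read_csv('tourism_covid_data-total.csv', parse_dates=['month_year'])
df = df.sort_values(['Country', 'month_year'])
df['d_y'] = df.groupby('Country')['tourism_demand'].diff()     # Δy_it
df['d_y_lag'] = df.groupby('Country')['d_y'].shift(1)          # Δy_i,t-1
df['d_S'] = df.groupby('Country')['monthly cases'].diff()      # ΔS_it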
So far, using Python, I have managed to get some results using PanelOLS:
import pandas as pd
import numpy as np
from linearmodels import PanelOLS
import statsmodels.api as sm
tourism_covid_data = pd.read_csv('../Data/Data - Dec2021/tourism_covid_data-total.csv', header=0, parse_dates=['month_year'])
tourism_covid_data['l.tourism_demand'] = tourism_covid_data['tourism_demand'].shift(1)
tourism_covid_data = tourism_covid_data.dropna()
# PanelOLS expects an entity/time MultiIndex
tourism_covid_data = tourism_covid_data.set_index(['Country', 'month_year'])
exog = sm.add_constant(tourism_covid_data[['l.tourism_demand', 'monthly cases']])
mod = PanelOLS(tourism_covid_data['tourism_demand'], exog, entity_effects=True)
fe_res = mod.fit()
fe_res
I am trying to find a solution and use GMM for my data; however, it seems that GMM is not widely used in Python, and there are no other similar questions on Stack Overflow. Any ideas on how I can proceed?
I just tried your data. I don't think it is a good fit for difference GMM or system GMM because it is a long panel with T (=48) >> N (=4). Anyway, pydynpd still produces results. In both cases I had to collapse the instrument matrix to mitigate the too-many-instruments problem.
Model 1: diff GMM, treating "monthly cases" as a predetermined variable
import pandas as pd
from pydynpd import regression
df = pd.read_csv("tourism_covid_data-total.csv") #, index_col=False)
df['monthly_cases'] = df['monthly cases']   # pydynpd's command string needs a column name without spaces
command_str='tourism_demand L1.tourism_demand monthly_cases | gmm(tourism_demand, 2 6) gmm(monthly_cases, 1 2)| nolevel collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
The output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step difference GMM
Group variable: Country Number of obs = 184
Time variable: month_year Number of groups = 4
Number of instruments = 7
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.7657082 | 0.0266379 | 28.7450196 | 0.0000000 |
| monthly_cases | -182173.5644815 | 171518.4068348 | -1.0621225 | 0.2881801 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(5) = 3.940 Prob > Chi2 = 0.558
Arellano-Bond test for AR(1) in first differences: z = -1.04 Pr > z =0.299
Arellano-Bond test for AR(2) in first differences: z = 1.00 Pr > z =0.319
Model 2: diff GMM, treating the lag of "monthly cases" as an exogenous variable
command_str='tourism_demand L1.tourism_demand L1.monthly_cases | gmm(tourism_demand, 2 6) iv(L1.monthly_cases)| nolevel collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
Output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step difference GMM
Group variable: Country Number of obs = 184
Time variable: month_year Number of groups = 4
Number of instruments = 6
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.7413765 | 0.0236962 | 31.2866594 | 0.0000000 |
| L1.monthly_cases | -190277.2987977 | 164169.7711072 | -1.1590276 | 0.2464449 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(4) = 1.837 Prob > Chi2 = 0.766
Arellano-Bond test for AR(1) in first differences: z = -1.05 Pr > z =0.294
Arellano-Bond test for AR(2) in first differences: z = 1.00 Pr > z =0.318
Model 3: similar to Model 2, but a system GMM.
command_str='tourism_demand L1.tourism_demand L1.monthly_cases | gmm(tourism_demand, 2 6) iv(L1.monthly_cases)| collapse '
mydpd = regression.abond(command_str, df, ['Country', 'month_year'])
Output:
Warning: system and difference GMMs do not work well on long (T>=N) panel data
Dynamic panel-data estimation, two-step system GMM
Group variable: Country Number of obs = 188
Time variable: month_year Number of groups = 4
Number of instruments = 8
+-------------------+-----------------+---------------------+------------+-----------+
| tourism_demand | coef. | Corrected Std. Err. | z | P>|z| |
+-------------------+-----------------+---------------------+------------+-----------+
| L1.tourism_demand | 0.5364657 | 0.0267678 | 20.0414904 | 0.0000000 |
| L1.monthly_cases | -216615.8306112 | 177416.0961037 | -1.2209480 | 0.2221057 |
| _con | -10168.9640333 | 8328.7444649 | -1.2209480 | 0.2221057 |
+-------------------+-----------------+---------------------+------------+-----------+
Hansen test of overid. restrictions: chi(5) = 1.876 Prob > Chi2 = 0.866
Arellano-Bond test for AR(1) in first differences: z = -1.06 Pr > z =0.288
Arellano-Bond test for AR(2) in first differences: z = 0.99 Pr > z =0.322
There is a Python package that supports difference and system GMM for dynamic panel models:
https://github.com/dazhwu/pydynpd
Features include: (1) difference and system GMM, (2) one-step and two-step estimators, (3) robust standard errors including the one suggested by Windmeijer (2005), (4) Hansen over-identification test, (5) Arellano-Bond test for autocorrelation, (6) time dummies, (7) allows users to collapse instruments to reduce instrument proliferation issue, and (8) a simple grammar for model specification.
I want to ask a quick question related to regression analysis in Python pandas.
So, assume that I have the following datasets:
Group Y X
1 10 6
1 5 4
1 3 1
2 4 6
2 2 4
2 3 9
My aim is to run a regression with Y as the dependent and X as the independent variable. The issue is that I want to run this regression by Group and write the coefficients to a new data set. So the results should look like:
Group Coefficient
1 0.25 (let's assume that the coefficient is 0.25)
2 0.30
I hope I have explained my question clearly.
Many thanks in advance for your help.
I am not sure which type of regression you need, but this is how you do an OLS (ordinary least squares) regression:
import pandas as pd
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.  # add a constant term
    result = sm.OLS(Y, X).fit()
    return result.params
#This is what you need
df.groupby('Group').apply(regress, 'Y', ['X'])
You can define your regression function and pass parameters to it as mentioned.
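If you want the output exactly in the Group / Coefficient layout from the question, here is a small variant of the above (a sketch; it keeps only the slope on X and drops the intercept):
import pandas as pd
import statsmodels.api as sm
# toy data from the question
df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2],
                   'Y': [10, 5, 3, 4, 2, 3],
                   'X': [6, 4, 1, 6, 4, 9]})
def regress(data, yvar, xvars):
    X = sm.add_constant(data[xvars])
    return sm.OLS(data[yvar], X).fit().params
params = df.groupby('Group').apply(regress, 'Y', ['X'])
coefs = params['X'].rename('Coefficient').reset_index()   # columns: Group, Coefficient
print(coefs)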
I'm porting a Stata model to Python and I am seeing different results between Python and Stata for a linear regression on the same input data (available at https://drive.google.com/file/d/0B8PLy9yAUHvlcTI1SG5sdzdnaWc/view?usp=sharing).
The Stata code is as follows:
reg growth time*
predict ghat
predict resid, residuals
And the result is (first 5 rows):
. list growth ghat resid
+----------------------------------+
| growth ghat resid |
|----------------------------------|
1. | 2.3527029 2.252279 .1004239 |
2. | 2.377728 2.214551 .163177 |
3. | 2.3547957 2.177441 .177355 |
4. | 3.0027488 2.140942 .8618064 |
5. | 3.0249328 2.10505 .9198825 |
In Python, the code is:
import pandas as pd
from sklearn.linear_model import LinearRegression
def linear_regression(df, dep_col, indep_cols):
    lf = LinearRegression(normalize=True)
    lf.fit(df[indep_cols.split(' ')], df[dep_col])
    return lf
df = pd.read_stata('/tmp/python.dta')
lr = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
df['ghat'] = lr.predict(df['time time2 time3 time4 time5'.split(' ')])
df['resid'] = df.growth - df.ghat
df.head(5)['growth ghat resid'.split(' ')]
and the result is:
growth ghat resid
0 2.352703 3.026936 -0.674233
1 2.377728 2.928860 -0.551132
2 2.354796 2.833610 -0.478815
3 3.002749 2.741135 0.261614
4 3.024933 2.651381 0.373551
I also tried R and got the same result as in Python. I could not figure out the root cause: is the algorithm used in Stata slightly different? I can tell from the source code that sklearn uses ordinary least squares, but I have no idea what Stata uses.
Could anyone advise here?
---------- Edit 1 -----------
I tried specifying the data type in Stata as double, but Stata still produces the same result as with float. The Stata code for generating the variables is below:
gen double growth = .
foreach lag in `lags' {
replace growth = ma_${metric}_per_`group' / l`lag'.ma_${metric}_per_`group' - 1 if nlag == `lag' & in_sample
}
gen double time = day - td(01jan2010) + 1
forvalues i = 2/5 {
gen double time`i' = time^`i'
}
---------- Edit 2 -----------
It's confirmed that Stata does drop the time variable due to collinearity. The message was not seen before because our Stata code runs in quiet mode to suppress unwanted messages; as far as I can tell, this cannot be disabled in Stata. So it appears I need to detect collinearity and remove the collinear column(s) in Python as well.
. reg growth time*,
note: time omitted because of collinearity
Source | SS df MS Number of obs = 381
-------------+------------------------------ F( 4, 376) = 126.10
Model | 37.6005042 4 9.40012605 Prob > F = 0.0000
Residual | 28.0291465 376 .074545602 R-squared = 0.5729
-------------+------------------------------ Adj R-squared = 0.5684
Total | 65.6296507 380 .172709607 Root MSE = .27303
------------------------------------------------------------------------------
growth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | 0 (omitted)
time2 | -.0098885 .0009231 -10.71 0.000 -.0117037 -.0080734
time3 | .0000108 1.02e-06 10.59 0.000 8.77e-06 .0000128
time4 | -4.40e-09 4.20e-10 -10.47 0.000 -5.22e-09 -3.57e-09
time5 | 6.37e-13 6.15e-14 10.35 0.000 5.16e-13 7.58e-13
_cons | 3322.727 302.7027 10.98 0.000 2727.525 3917.93
------------------------------------------------------------------------------
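As a quick diagnostic on the Python side (a sketch, assuming df is the dataframe loaded in the Python code above), the near-collinearity that makes Stata omit time shows up as a huge condition number of the design matrix:
import numpy as np
X = df['time time2 time3 time4 time5'.split()].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])   # include the constant term
print(np.linalg.cond(X))                    # enormous, signalling near-collinearity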
The predictors are the 1st to 5th powers of time, which varies between 1627 and 2007 (presumably a calendar year, not that it matters). Even with modern software it would have been prudent to shift the origin of time to reduce the numerical strain, e.g. to work with powers of (time - 1800).
Anyway, redoing the regression shows that Stata drops the first predictor as collinear. What happens in Python and R? These are different reactions to a numerically tricky challenge.
(Fitting a quintic polynomial rarely has scientific value, but that may not be of concern here. The fitted curve based on powers 2 to 5 doesn't work very well for these data, which appear economic. It makes more sense that the first 5 residuals are all positive, but that isn't true of them all!)
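If it helps, here is a sketch of that origin shift applied to the question's Python setup (the new column names are hypothetical; the powers are rebuilt from time - 1800 rather than taken from the precomputed time2..time5 columns):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_stata('/tmp/python.dta')
t = df['time'] - 1800                               # shift the origin of time
X = np.column_stack([t**p for p in range(1, 6)])    # powers 1..5 of the shifted time
fit = LinearRegression().fit(X, df['growth'])
df['ghat_shifted'] = fit.predict(X)
df['resid_shifted'] = df['growth'] - df['ghat_shifted']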
It is a wildcard issue. In your Stata code, time* will match time2, time3, ... but not time. If the Python code is changed to lr = linear_regression(df, 'growth', 'time2 time3 time4 time5'), it will crank out exactly the same result.
Edit
It appears Stata dropped the 1st independent variable. The fit can be visualized as follows:
import numpy as np
import matplotlib.pyplot as plt

lr1 = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
lr2 = linear_regression(df, 'growth', 'time2 time3 time4 time5')
pred_x1 = ((np.linspace(1620, 2000)[..., np.newaxis]**np.array([1,2,3,4,5]))*lr1.coef_).sum(1) + lr1.intercept_
pred_x2 = ((np.linspace(1620, 2000)[..., np.newaxis]**np.array([2,3,4,5]))*lr2.coef_).sum(1) + lr2.intercept_
plt.plot(np.linspace(1620, 2000), pred_x1, label='Python/R fit')
plt.plot(np.linspace(1620, 2000), pred_x2, label='Stata fit')
plt.plot(df.time, df.growth, '+', label='Data')
plt.legend(loc=0)
And the residual sum of squares:
pred1 = (df.time.values[..., np.newaxis]**np.array([1,2,3,4,5])*lr1.coef_).sum(1) + lr1.intercept_
pred2 = (df.time.values[..., np.newaxis]**np.array([2,3,4,5])*lr2.coef_).sum(1) + lr2.intercept_
print('Python fit RSS', ((pred1 - df.growth.values)**2).sum())
print('Stata fit RSS', ((pred2 - df.growth.values)**2).sum())
Python fit RSS 7.2062436549
Stata fit RSS 28.0291464826
I'm using RandomForestRegressor (from the great scikit-learn library in Python) for my project. It gives me good results, but I think I can do better. When passing features to the fit(...) function, is it better to encode categorical features as binary features?
example:
instead of:
===========
continent |
===========
1 |
===========
2 |
===========
3 |
===========
2 |
===========
make something like:
===========================
is_europe | is_asia | ...
===========================
1 | 0 |
===========================
0 | 1 |
===========================
Because it works as a tree, maybe the second option is better, or will it work the same with the first option?
Thanks a lot!
Binarizing categorical variables is highly recommended and is expected to outperform the model without this transform. If scikit-learn treats continent = [1, 2, 3, 2] as numeric values (a continuous [quantitative] variable rather than a categorical [qualitative] one), it imposes an artificial order constraint on that feature. For example, suppose continent=1 means is_europe, continent=2 means is_asia, and continent=3 means is_america; then it implies that is_asia is always in between is_europe and is_america when examining the relation of the continent feature to your response variable y, which is not necessarily true and may reduce the model's effectiveness. In contrast, turning it into dummy variables has no such problem, and scikit-learn will treat each binary feature separately.
To binarize your categorical variables in scikit-learn, you can use LabelBinarizer.
from sklearn.preprocessing import LabelBinarizer
# your data
# ===========================
continent = [1, 2, 3, 2]
continent_dict = {1:'is_europe', 2:'is_asia', 3:'is_america'}
print(continent_dict)
{1: 'is_europe', 2: 'is_asia', 3: 'is_america'}
# processing
# =============================
binarizer = LabelBinarizer()
# fit on the categorical feature
continent_dummy = binarizer.fit_transform(continent)
print(continent_dummy)
[[1 0 0]
[0 1 0]
[0 0 1]
[0 1 0]]
If you process your data in pandas, then its top-level function pandas.get_dummies also helps.
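For example, a sketch using the same toy data as above (recent pandas versions print the dummies as True/False rather than 0/1):
import pandas as pd
# map the codes to labels, then one-hot encode with get_dummies
continent = pd.Series([1, 2, 3, 2]).map({1: 'is_europe', 2: 'is_asia', 3: 'is_america'})
print(pd.get_dummies(continent))
#    is_america  is_asia  is_europe
# 0           0        0          1
# 1           0        1          0
# 2           1        0          0
# 3           0        1          0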