Random Forest discrepancy between R and Matlab & Python


I apply the random forest algorithm in three different programming languages to the same pseudo sample dataset (1000 obs, binary 1/0 dependent variable, 10 numeric explanatory variables):
Matlab 2015a (same for 2012a) using the "TreeBagger" command (part of the Statistics and Machine Learning Toolbox)
R using the "randomForest" package: https://cran.r-project.org/web/packages/randomForest/index.html
Python using the "RandomForestClassifier" from sklearn.ensemble: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
I also try to keep all model parameters identical across programming languages (no. of trees, bootstrap sampling of the whole sample, no. of variables randomly sampled as candidates at each split, criterion to measure the quality of a split).
While Matlab and Python produce basically the same results (i.e. probabilities), the R results are very different.
What could be the reason for the difference between the results produced by R on the one hand, and by Matlab & Python on the other?
I guess there's some default model parameter that differs in R which I'm not aware of or which is hard-coded in the underlying randomForest package.
The exact code I ran looks as follows:
Matlab:
b = TreeBagger(1000,X,Y, 'FBoot',1, 'NVarToSample',4, 'MinLeaf',1, 'Method', 'classification','Splitcriterion', 'gdi')
[~,scores,~] = predict(b,X);
Python:
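import pandas as pd
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
scores_fit = clf.fit(X, Y)
# class probabilities predicted for the training data itself
scores = pd.DataFrame(clf.predict_proba(X))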
R:
results.rf <- randomForest(X,Y, ntree=1000, type = "classification", sampsize = length(Y),replace=TRUE,mtry=4)
scores <- predict(results.rf, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)

When you call predict on a randomForest object in R without providing a dataset, it returns the out-of-bag predictions. In your other methods, you are passing in the training data again. I suspect that if you do this in the R version, your probabilities will be similar:
scores <- predict(results.rf, X, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
Also note that if you want unbiased probabilities, the R approach of returning OOB predictions is the best approach when predicting on training data.
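The same distinction can be reproduced on the Python side. The following is a minimal sketch, not the asker's code: it assumes a synthetic dataset from make_classification and uses scikit-learn's oob_score=True option, which exposes out-of-bag class probabilities in oob_decision_function_. Comparing those with predict_proba(X) on the training data shows how far the two kinds of "training set" scores can drift apart.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 1000 x 10 sample described in the question (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200, max_features=4, bootstrap=True,
    oob_score=True, random_state=0,
)
clf.fit(X, y)

# In-sample probabilities: every tree votes, including trees that saw the row.
in_sample = clf.predict_proba(X)[:, 1]
# Out-of-bag probabilities: only trees that did NOT see the row vote.
oob = clf.oob_decision_function_[:, 1]

in_sample_acc = np.mean((in_sample > 0.5) == y)
oob_acc = np.mean((oob > 0.5) == y)
mean_abs_gap = np.mean(np.abs(in_sample - oob))

print("in-sample accuracy:", in_sample_acc)
print("out-of-bag accuracy:", oob_acc)
print("mean |in-sample - OOB| probability gap:", mean_abs_gap)
```

The in-sample scores are typically much more confident than the out-of-bag ones, which is the kind of gap the question describes between R and the other two tools.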

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus when I use the built-in predict_proba().
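import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education","Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"])
)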
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., probability that Women with Bachelor's Agree with the statement). When I try this, however, I get much different results than I receive from .predict_proba() and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, in place of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?

I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
P(Agree) = e^1.31590 / (e^1.31590 + e^logit_Disagree + e^logit_Unsure)
= 0.737007424626824
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function
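The softmax step described above can be checked in a few lines. This is a minimal sketch rather than the poster's actual calculation: it assumes the iris dataset as a stand-in for the survey data and a recent scikit-learn, where a default LogisticRegression on three classes fits a multinomial model. The logits are built by hand from coef_ and intercept_, passed through a softmax, and compared with predict_proba().

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def softmax(logits):
    """Row-wise normalized exponential; each row sums to 1."""
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Stand-in 3-class dataset (assumption; the survey data is not available).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One logit per class: intercept plus the weighted sum of the predictors.
logits = X @ clf.coef_.T + clf.intercept_
by_hand = softmax(logits)
builtin = clf.predict_proba(X)

matches = np.allclose(by_hand, builtin)
print("hand-calculated probabilities match predict_proba:", matches)
```

Applying p = e^logit/(1+e^logit) to each class separately is the binary (one-vs-rest) formula, which is why those probabilities do not sum to 1 for a multinomial fit.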

Python PLSRegression : obtaining the latent variables scores using loadings

In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variables scores from the X array using x_scores_.
I would like to extract the loadings to calculate the latent variables scores for a new array W. Intuitively, what I would do is: scores = W*loadings (matrix multiplication). I tried this using each of x_loadings_, x_weights_, and x_rotations_ as loadings, since I could not figure out which array was the right one (there is little info on the sklearn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these works (I tried using the X array and I cannot obtain the same scores as in the x_scores_ array).
Any help with this?

Actually, I just had to better understand the fit() and transform() methods of sklearn. I need to use transform(W) to obtain the latent variables scores of the W array:
1. fit(): generates the model parameters from the training data
2. transform(): uses the parameters generated by fit() to transform a particular dataset

Logistic regression coefficient meaning

I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0,1). I manually chose coefficients for the inputs and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also, my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
x[:,1]=np.random.rand(size)*2
x[:,2]=np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1+x[:,2]*o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and of the regressor (the most relevant parts of it):
for j in range(bin_size):
    xs = x[i]
    y_real = y[i]
    z = np.dot(self.coeff,xs)
    h = sigmoid(z)
    dc+= (h-y_real)*xs
self.coeff-= dc * (learning_rate/n)

What intercept was learned? The result should not really be a surprise: your y is a 3rd-degree polynomial, while your model has only two coefficients; three coefficients plus the y-intercept would be needed to model the response variable from the predictors.
Furthermore, the values may differ because of SGD, for example.
I am not really sure, but the coefficients could be different and still return the correct y for a finite set of points. What are the metrics on each model? Do those differ?
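One way to test whether a fitting procedure can reproduce known coefficients is to run it on data where the generating model is exactly logistic. The sketch below is an illustration under assumptions that differ from the question's code: the binary labels are drawn as Bernoulli samples with probability sigmoid(2*x1 - 3*x2), with no extra noise inside the sigmoid, and C=1e6 is used so that the L2 penalty scikit-learn applies by default (C=1.0) becomes negligible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
true_coef = np.array([2.0, -3.0])    # same values as o1, o2 in the question

# Uniform inputs on [0, 2) and [0, 3), as in the question's snippet.
X = np.column_stack([rng.random(n) * 2, rng.random(n) * 3])

# Labels drawn from the logistic model itself (assumption, see lead-in).
p = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = rng.binomial(1, p)

# Large C makes the default L2 regularization practically irrelevant.
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
est_coef = clf.coef_.ravel()

print("true coefficients:     ", true_coef)
print("estimated coefficients:", est_coef)
print("estimated intercept:   ", clf.intercept_[0])
```

If the estimates land close to the true values here but not on the original dataset, the difference is more likely to come from how the labels were generated or from regularization settings than from the optimizer.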

low SVM accuracy on train and test sets in python

I'm porting some matlab/octave scripts for support vector machines (SVMs) to python but I'm getting poor accuracy in one of two scripts with the sklearn method.
ex6_spam.py loads some data and trains a spam-detecting model.
In matlab, the SVM code provided, svmTrain.m, (see below for snippets) gives me ~99% accuracy in both the training and the test sets.
In python, sklearn.svm.SVC().fit() is giving me ~56% if I just use their linear kernel, and ~44% if I precompute the Gram matrix for a linear kernel. (The data and code - ex6_spam.py - are here.)
The odd thing, too, is that the exact same piece of code used in ex6.py gives me proper classification of 2D data points. Its behavior there is almost identical to the matlab/octave script.
I'm not doing much in ex6_spam.py - I load a training set:
mat = scipy.io.loadmat('spamTrain.mat')
X = mat["X"]
y = mat["y"]
I feed it to sklearn.svm.SVC().fit():
C = 0.1
model = svmt.svmTrain(X, y, C, "linear")
# this results in
# clf = svm.SVC(C = C, kernel=kernelFunction, tol=tol, max_iter=max_passes, verbose=2)
# return clf.fit(X, y)
and then I make a prediction:
p = model.predict(X)
The matlab/octave equivalent is
load('spamTrain.mat');
C = 0.1;
model = svmTrain(X, y, C, @linearKernel); # see the link to svmTrain.m above
p = svmPredict(model, X);
However, the results are wildly different. Any ideas why? I haven't had the chance to run it on a different computer, but maybe that's a possible reason?

SciKit-learn for data driven regression of oscillating data

Long time lurker first time poster.
I have data that roughly follows a y=sin(time) distribution, but also depends on other variables than time. In terms of correlations, since the target y-variable oscillates there is almost zero statistical correlation with time, but y obviously depends very strongly on time.
The goal is to predict the future values of the target variable. I want to avoid using an explicit assumption of the model, and instead rely on data driven models and machine learning, so I have tried using regression methods from sklearn.
I have tried the following methods (the parameters were blindly copied from examples and other threads):
LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                         "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                         "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                          n_estimators=250, max_depth=3,
                          learning_rate=.1, min_samples_leaf=9,
                          min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                  n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
The results fall into two different categories of failure:
1. The time field has no effect, probably due to the absence of correlation caused by the oscillatory behaviour of the target variable. However, secondary effects from other variables allow a modest predictive capability for future time ranges (these other variables have a simple correlation with the target variable).
2. When applying predict() to the training time range, the prediction is near perfect with respect to the observations, but when given the future time range (for which data was not used in training) the predicted value stays constant.
Below is how I performed the training and testing:
import datetime
import numpy as np
import pandas as pd

# weather_df (the raw weather table) and model (any sklearn regressor) are defined elsewhere
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df['Days'] = (weather_df.index-datetime.datetime(2005,1,1)).days
# .loc replaces the deprecated .ix indexer
ts = pd.DataFrame({'Temperature':weather_df['Mean TemperatureC'].loc[:'2015-1-1'],
                   'Humidity':weather_df[' Mean Humidity'].loc[:'2015-1-1'],
                   'Visibility':weather_df[' Mean VisibilityKm'].loc[:'2015-1-1'],
                   'Wind':weather_df[' Mean Wind SpeedKm/h'].loc[:'2015-1-1'],
                   'Time':weather_df['Days'].loc[:'2015-1-1']
                   })
start_test = datetime.datetime(2012,1,1)
ts_train = ts[ts.index < start_test]
ts_test = ts
# one row per sample, one column per feature (Humidity, Time), both taken from ts_train
data_train = np.column_stack([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)
model.fit(data_train, data_target)
data_test = np.column_stack([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test)
ts_test['Pred'] = pred
Is there a regression model I could/should use for this problem, and if so what would be appropriate options and parameters?
(Also, my treatment of the time objects in sklearn is far from elegant, so I am gladly taking advice there.)

Here is my guess about what is happening in your two types of results:
.days does not convert your index into a form that repeats itself between your train and test samples, so it becomes a unique value for every date in your dataset.
As a consequence, your models either ignore days (1st result) or overfit on the days feature (2nd result), which makes them perform badly on your test data.
Suggestion:
If your dataset is large enough (it looks like it goes from 2005), try using dayofyear or weekofyear instead, so that your model will have something generalizable from the date information.
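The effect of this suggestion can be seen on synthetic data. The sketch below is an illustration only: it assumes a made-up seasonal temperature series with a one-year period, and a RandomForestRegressor as the model. It compares a raw "days since start" feature with the day-of-year feature (DatetimeIndex.dayofyear in pandas) when predicting dates after the training range.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
dates = pd.date_range("2005-01-01", "2014-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
days_since = (dates - dates[0]).days.to_numpy()

# Synthetic seasonal temperature with noise (assumption, not the weather data).
temp = 10 + 10 * np.sin(2 * np.pi * (doy - 100) / 365.25) + rng.normal(0, 1.0, len(dates))

train = np.asarray(dates < pd.Timestamp("2012-01-01"))
test = ~train


def future_r2(feature):
    """Fit on the training years, return R^2 on the later (unseen) years."""
    X = feature.reshape(-1, 1)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train], temp[train])
    return model.score(X[test], temp[test])


r2_raw = future_r2(days_since)   # every future value lies outside the training range
r2_doy = future_r2(doy)          # future values repeat values seen in training

print("R^2 with raw day count:", r2_raw)
print("R^2 with day of year:  ", r2_doy)
```

With the raw day count, a tree-based model can only return the value of its last training leaf for any future date, which matches the "prediction stays constant" behaviour described in the question.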

Agree with @zemekeneng that time should be taken modulo the corresponding periods, like 24 hours, 12 months, etc.
Beyond that, I'd like to remind you to use prior knowledge when selecting features or models. Since you already know that your data is highly likely to follow sin(x), this should be used even in a data-driven approach.
We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7!, so these terms should be used as features. None of the models that you used may have included these features. One way to do it would be to create these higher-order features yourself and concatenate them to your other features. Then a linear model with regularization may give you reasonable results.
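A small sketch of the polynomial-feature idea, under stated assumptions: x is already folded into a single period [-pi, pi] (the modulo step above), the target is a noise-free sin(x), and Ridge with a very small alpha stands in for "a linear model with regularization". The odd powers x, x^3, x^5, x^7 are built by hand and passed to the linear model.

```python
import numpy as np
from sklearn.linear_model import Ridge

# One period of the oscillation; time is assumed to be folded into [-pi, pi] already.
x = np.linspace(-np.pi, np.pi, 500)
y = np.sin(x)

# Hand-built odd-power features suggested by the Taylor expansion of sin(x).
features = np.column_stack([x, x**3, x**5, x**7])

# alpha is kept tiny because the high powers have very large scale (assumption).
model = Ridge(alpha=1e-6).fit(features, y)
max_err = np.max(np.abs(model.predict(features) - y))

print("fitted coefficients:", model.coef_)
print("max abs error on [-pi, pi]:", max_err)
```

The approximation is only good inside the range covered by the features, so without folding time into one period first the polynomial terms grow without bound and the fit breaks down.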
