I've trained a Random Forest model (a regressor in this case) using scikit-learn (Python), and I would like to plot the error rate on a validation set based on the number of estimators used. In other words, is there a way to predict using only a portion of the estimators in your RandomForestRegressor?
Using predict(X) will give you the predictions based on the mean of every single tree's result. Is there a way to limit the number of trees used? Or, alternatively, to get the individual output of each single tree in the forest?
Thanks to cohoz I've figured out how to do it.
I've written a couple of functions, which turned out to be handy while plotting the learning curve of the random forest regressor on the test set.
## Error metric
import numpy as np

def rmse(train, test):
    return np.sqrt(np.mean((test - train) ** 2))
## Print test set error
## Input: the RandomForestRegressor, test set features and test set known values
def rfErrCurve(rf_model, test_X, test_y):
    p = []
    for i, tree in enumerate(rf_model.estimators_):
        p.insert(i, tree.predict(test_X))
        # error of the ensemble averaged over the first i+1 trees
        print(rmse(np.mean(p, axis=0), test_y))
Once trained, you can access the individual trees via the estimators_ attribute of the random forest object.
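If you want the actual plot rather than printed values, something like the following should work. This is only a rough sketch (assuming matplotlib is available and using the rmse function above; plotRfErrCurve is just a name I picked): it averages the predictions of the first i trees and plots the error against the number of trees.
import matplotlib.pyplot as plt

def plotRfErrCurve(rf_model, test_X, test_y):
    # one row of predictions per tree, shape (n_trees, n_samples)
    tree_preds = np.array([tree.predict(test_X) for tree in rf_model.estimators_])
    # running mean over the first 1, 2, ..., n trees
    cum_mean = np.cumsum(tree_preds, axis=0) / np.arange(1, len(tree_preds) + 1)[:, None]
    errors = [rmse(pred, test_y) for pred in cum_mean]
    plt.plot(range(1, len(errors) + 1), errors)
    plt.xlabel("Number of trees")
    plt.ylabel("RMSE on test set")
    plt.show()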
I am trying to run a Random Forest Classifier on an imbalanced dataset (~1:4).
I am using the method from imblearn as follows:
from imblearn.ensemble import BalancedRandomForestClassifier
rf = BalancedRandomForestClassifier(n_estimators=1000, random_state=42, class_weight='balanced', sampling_strategy='not minority')
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The split into training and test sets is performed within a cross-validation approach using RepeatedStratifiedKFold from scikit-learn.
However, I wonder if the test set needs to be balanced as well in order to obtain sensible accuracy scores (sensitivity, specificity etc.). I hope you can help me with this.
Many thanks!
From the imblearn docs:
A balanced random forest randomly under-samples each bootstrap sample
to balance it.
If you are okay with random undersampling as your balancing method, then the classifier is doing that for you "under the hood". In fact, that's the point of using imblearn in the first place, to handle class imbalance. If you were using a straight random forest, like the out-of-the-box version from sklearn, then I would be more concerned about dealing with class imbalance on the front end.
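As for scoring: you can leave the test folds imbalanced and use metrics that remain meaningful under class imbalance. A minimal sketch of what that could look like (assuming features and labels hold the full dataset; 'recall' is sensitivity, and for a binary problem 'balanced_accuracy' is the mean of sensitivity and specificity):
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
rf = BalancedRandomForestClassifier(n_estimators=1000, random_state=42)
scores = cross_validate(rf, features, labels, cv=cv,
                        scoring=["recall", "balanced_accuracy", "roc_auc"])
print(scores["test_recall"].mean())             # sensitivity
print(scores["test_balanced_accuracy"].mean())  # (sensitivity + specificity) / 2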
I'm trying to figure out how to configure a neural network using Neupy. The problem is that I can't seem to find many options for a GRNN, only the sigma value as described here:
There is a parameter, y_i, that I want to be able to adjust, but there doesn't seem to be a way to do it in the package. I'm parsing through the code, but I'm not a developer, so I have trouble following all the steps; maybe a more experienced set of eyes can find a way to tweak that parameter.
Thanks
From the link that you've provided it looks like y_i is the target variable; in your case it's your training target. In the neupy code it's used during prediction: https://github.com/itdxer/neupy/blob/master/neupy/algorithms/rbfn/grnn.py#L140
GRNN uses lazy learning, which means that it doesn't really train; it just re-uses all your training data for each prediction. The self.target_train variable is just a copy of the targets you passed during the training phase. You can update this value before making a prediction:
from neupy import algorithms
grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)
grnn.target_train = modify_grnn_algorithm(grnn.target_train)  # modify_grnn_algorithm is your own adjustment function
predicted = grnn.predict(x_test)
Or you can use the GRNN code directly for prediction instead of the default predict function:
import numpy as np
from neupy import algorithms
from neupy.algorithms.rbfn.utils import pdf_between_data
grnn = algorithms.GRNN(std=0.1)
grnn.train(x_train, y_train)
# In this part of the code you can do any modifications you want
ratios = pdf_between_data(grnn.input_train, x_test, grnn.std)
predicted = (np.dot(grnn.target_train.T, ratios) / ratios.sum(axis=0)).T
I am right now trying to make a simple program on random forests: taking two sequences to train and predict, and plotting the final random forest curve.
But I am unable to do it, as I can't understand what kind of sequences I should take, or how to plot the random forest result on a graph the way we used to do in R.
I have tried this so far -
import numpy as np
from pylab import *
test=np.random.rand(1000,10)
print (test)
train=np.random.rand(1000,5)
print (train)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100,n_jobs=10)
rfc.fit(test, train)
Kindly see the code; it would be a great help if you could correct it and also show me how to plot the random forest result.
I am expecting your kind reply as soon as possible.
In R, I did this -
# simulate the data
train=rnorm(1,1000,.2)
predict=rnorm(1100,1200,.5)
df=data.frame(train, predict)

# run the randomForest implementation
library(randomForest)
rf1 <- randomForest(predict~., data=df, mtry=2, ntree=500, importance=TRUE)
importance(rf1,type=1)

# run the party implementation
library(party)
cf1 <- cforest(predict~.,data=df,control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
varimp(cf1,conditional=TRUE)

# plots
plot(rf1, log = "y")
What is the expected meaning of the train and test variables?
The documentation of RandomForestClassifier.fit says that for a classifier you need to pass class labels as the second argument (named y in the documentation). These can be either integer values (one integer per possible class) or a list of string labels.
Also, fit is expected to be called with training data only (training set input features and training set labels), so passing a variable named test is really confusing.
Please start by following one of the tutorials of scikit-learn to understand how to train a classifier with that library:
http://scikit-learn.org/stable/documentation.html
then read the documentation of the random forests in particular:
http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
If you want to compute the variable importances, read this section in particular:
http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
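To connect this back to the R example above, here is a minimal sketch on synthetic data (not the sequences from the question); the bar plot of feature_importances_ is roughly the counterpart of importance()/varimp() in R:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# simulate a labelled dataset: X holds the features, y holds the class labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rfc.fit(X_train, y_train)            # features first, labels second
print(rfc.score(X_test, y_test))     # accuracy on held-out data

# plot the variable importances
plt.bar(range(X.shape[1]), rfc.feature_importances_)
plt.xlabel("Feature index")
plt.ylabel("Importance")
plt.show()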
I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and what I have seen elsewhere, instead of simply conducting cross-validation on all of the training data it is advised to split it into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and then the hyperparameter alpha is tuned on the basis of results from cross-validation on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another split of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LassoCV

X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)  # outer splitting into 5 folds
lasso = LassoCV(cv=3)         # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. At the same time, this is a reasonable way of implementing such objects, since we can also be interested in only fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.
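For that second use case, fitting a hyperparameter-optimized LassoCV on its own, a small sketch (reusing X and y from the snippet above) would be:
lasso = LassoCV(cv=3)
lasso.fit(X, y)
print(lasso.alpha_)  # penalty chosen by the inner cross-validation
print(lasso.coef_)   # coefficients of the model refit on all of X, y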
I am using python to do a bit of machine learning.
I have a Python ndarray with 2000 entries. Each entry holds information about a subject and ends with a boolean telling me whether they are a vampire or not.
Each entry in the array looks like this:
[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
My goal is to be able to give a probability that a new subject is a vampire given the data shown above for the subject.
I have used sklearn to do some machine learning for me:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, Y)
print(clf.predict(W))
Where W is an array of data for the new subject. The script I have written returns booleans, but I would like it to return probabilities. How can I modify it?
If you are using DecisionTreeRegressor() then you may use the score function to determine the coefficient of determination R^2 of the prediction.
Please find the below link to the documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
Also, you can list the cross-validation scores (using 10 folds) as below:
from sklearn import tree
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, Y)
print(cross_val_score(clf, X, Y, cv=10))
print(clf.predict(W))
Which gives an output something similar to this,
array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
Use a DecisionTreeClassifier instead of a regressor, and use the predict_proba method. Alternatively, you could use logistic regression (also available in scikit-learn).
The basic idea is this:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict_proba(W))
You want to use a classifier that gives you a probability. Also, make sure the data points in your testing array W are not replicates of any of your training data: if a point matches your training data exactly, the tree thinks it is definitely a vampire or definitely not a vampire, so it will give you 0 or 1.
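If you would rather try the logistic regression route mentioned above, the pattern is the same; here is a rough sketch, reusing X, Y and W from the question:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf = clf.fit(X, Y)            # Y holds the boolean IS_VAMPIRE labels
print(clf.predict_proba(W))    # one probability per class for each row of W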
You're using a regressor, but you probably want to use a classifier.
You'll also want a classifier that can give you posterior probabilities, like a decision tree or logistic regression. Other classifiers may give you a score (some kind of confidence measure), which may also work for your needs.