I have a data set comprising a vector of features, and a target - either 1.0 or 0.0 (representing two classes). If I fit a RandomForestRegressor and call its predict function, is it equivalent to using RandomForestClassifier.predict_proba()?
In other words if the target is 1.0 or 0.0 does RandomForestRegressor output probabilities?
I think so, and the results I am getting suggest so, but I would like to get a second opinion...
Thanks
Weasel
There is a major conceptual difference between the two, stemming from the different tasks they address:
Regression: continuous (real-valued) target variable.
Classification: discrete target variable (classes).
For a general classification method, the term "probability of an observation belonging to class X" may not even be defined, since some classification methods (kNN, for example) do not deal with probabilities.
However, for Random Forest (and some other classification methods), classification is reduced to regressing the distribution of class probabilities, and the predicted class is then taken as the argmax of those computed "probabilities". In your case, if you feed in the same input, you get the same result. And yes, it is OK to treat the values returned by RandomForestRegressor as probabilities.
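As a quick sanity check, here is a minimal sketch on synthetic data (not the asker's data) comparing the two estimators. With a 0/1 target, RandomForestRegressor.predict usually returns values close to RandomForestClassifier.predict_proba()[:, 1], since both average per-tree estimates of the class-1 fraction in each leaf; the split criteria differ, though, so the two are not guaranteed to match exactly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=500, random_state=0)

reg = RandomForestRegressor(random_state=0).fit(X, y.astype(float))
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Both arrays live in [0, 1]; the largest difference is typically small
print(np.abs(reg.predict(X) - clf.predict_proba(X)[:, 1]).max())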
I am trying to use roc_auc_score from sklearn to evaluate two models. The first one is Random Forest, and the second one is kNN classifier. It's a binary classification problem. But I have a problem since the output types of these two models are different.
For the Random Forest, the output is a probability, something like below.
[0.1, 0.4, 0.6, ..., 0.9]
For the kNN classifier, the output is binary, like
[0,0,1,...,1]
The ground truth is binary values.
I am wondering if I directly use
from sklearn.metrics import roc_auc_score
roc_auc_score(labels, predictions)
Are the ROC AUC values from these two models comparable, given that the first output is continuous and the other is discrete? Or should I convert the first one to binary as well, e.g. elements below 0.5 become 0 and those above 0.5 become 1?
KNN Classifier has a method predict_proba which should return probabilities - see the documentation.
Unfortunately you cannot directly use roc_auc_score here, as it would complain about the continuous values. This makes sense because ROC AUC scoring is meant for classification problems. If you want to handle such continuous values, you must update your predictions so that they only contain integer values representing labels. One useful strategy for a binary classification problem is to set all predictions (assuming they are between 0 and 1) above 0.5 to 1 and the rest to 0, because in the end a score larger than 0.5 means your method favors one label over the other, and vice versa.
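A minimal sketch on synthetic data (not the asker's), combining the two suggestions above: take class-1 probabilities from the kNN model via predict_proba, and binarise the continuous Random Forest scores at 0.5 before scoring.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)
knn = KNeighborsClassifier().fit(X, y)

knn_prob = knn.predict_proba(X)[:, 1]                       # kNN can also output probabilities
rf_labels = (rf.predict_proba(X)[:, 1] > 0.5).astype(int)   # threshold continuous scores at 0.5

print(roc_auc_score(y, knn_prob), roc_auc_score(y, rf_labels))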
I'm trying to distill the predictions of another classifier model, "C" using xgboost. Thus, instead of labels, I have the probabilities predicted by C for the samples being positive.
I've tried doing the most obvious thing, using the probabilities output by C as if they were labels
distill_model = XGBClassifier(learning_rate=0.1, max_depth=10, n_estimators=100)
distill_model.fit(X, probabilities)
but it seems that in that case XGBoost just translates each distinct probability value into its own class. So if C outputs 72 distinct values, XGBoost treats them as 72 different classes. I've tried changing the objective function to multi:softmax/multi:softprob but that didn't help.
Any suggestions?
There is probably an xgboost-specific method with a custom loss, but a generic solution is to split each training row into two rows, one with each label, and assign each row the original probability of that label as its weight, as sketched below.
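A minimal sketch of that row-splitting trick with stand-in data (X and probabilities here are random placeholders for your features and C's outputs): every row appears once with label 1 weighted by its probability and once with label 0 weighted by the complement.
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 5)               # placeholder features
probabilities = np.random.rand(100)      # placeholder soft labels from C

X_soft = np.vstack([X, X])                                     # duplicate every row
y_soft = np.concatenate([np.ones(len(X)), np.zeros(len(X))])   # one copy per label
w_soft = np.concatenate([probabilities, 1.0 - probabilities])  # weight = probability mass of that label

distill_model = XGBClassifier(learning_rate=0.1, max_depth=10, n_estimators=100)
distill_model.fit(X_soft, y_soft, sample_weight=w_soft)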
I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the model and not just look at accuracy, as I intend to use the probability values to build a KS table later on.
I used the readily available Titanic data set to start practicing.
Following are some of the steps I did:
# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerId']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two models, I notice that for the Random Forest output a significant chunk has a probability score of 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with a probability score of 0, my KS table goes for a toss.
Welcome to SO! Since you don't seem to have an issue with code execution specifically, or with totally incorrect outputs, this looks more appropriate for Cross Validated, where you can find answers to questions of statistical concern.
In fact, I'd suggest that answers to this question might give you some good insights into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do is very different, so a direct comparison of the model parameters is not necessarily appropriate.
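A minimal sketch on synthetic data (not the Titanic set) that makes the contrast concrete: a default Random Forest averages the 0/1 votes of fully grown trees, so on its own training set many probabilities can land exactly on 0 or 1, whereas GradientBoostingClassifier maps an additive score through a sigmoid and essentially never returns exactly 0 or 1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=10)

rf_prob = RandomForestClassifier(random_state=10).fit(X, y).predict_proba(X)[:, 1]
gbm_prob = GradientBoostingClassifier(random_state=10).fit(X, y).predict_proba(X)[:, 1]

# Share of training-set probabilities that are exactly 0 or 1 for each model
print("RF :", np.mean((rf_prob == 0) | (rf_prob == 1)))
print("GBM:", np.mean((gbm_prob == 0) | (gbm_prob == 1)))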
I was wondering if it is possible to use dynamic weights in sklearn's VotingClassifier. Overall I have 3 labels: 0 = Other, 1 = Spam, 2 = Emotion. By dynamic weights I mean the following:
I have 2 classifiers. The first one is a Random Forest which performs best on spam detection. The other is a CNN which is superior for topic detection (good distinction between Other and Emotion). What I would like is a VotingClassifier that gives a higher weight to the RF when it assigns the label "Spam/1".
Is VotingClassifier the right way to go?
Best regards,
Stefan
I think VotingClassifier only accepts static weights, one per estimator. However, you may solve the problem by assigning class weights via the class_weight parameter of the random forest estimator, calculating the class weights on your training set.
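A minimal sketch of that idea with placeholder data and a stand-in second model (a LogisticRegression standing in for the CNN, which would need a scikit-learn compatible wrapper): VotingClassifier takes one static weight per estimator, while class_weight biases the forest itself toward the Spam label.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Placeholder 3-class data: 0 = Other, 1 = Spam, 2 = Emotion
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

rf = RandomForestClassifier(class_weight={0: 1.0, 1: 2.0, 2: 1.0},  # favour Spam
                            random_state=0)
other = LogisticRegression(max_iter=1000)  # stand-in for the CNN

voter = VotingClassifier(estimators=[('rf', rf), ('cnn', other)],
                         voting='soft',    # average predicted probabilities
                         weights=[1, 1])   # only static per-estimator weights
voter.fit(X, y)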
I am using sklearn.lda for a classification purpose and was a little puzzled about the score function that prints the mean classification error.
Is it determined by leave one out - jackknife?
How do I interpret the result? It's only a float value without much documentation.
Thanks in advance,
EL
The score method takes samples X and their true labels y and compares its own predictions with y. It returns the mean accuracy, which is always a single figure. For example,
lda = LDA().fit(X, y)
print(lda.score(X, y))
will print the accuracy of the classifier on its own training set.
Every classifier has a score method, which usually (though not necessarily) returns mean accuracy. The method is used by the GridSearchCV model selection algorithm to determine the quality of the classifier if you don't explicitly give it a scoring argument.
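For instance, here is a minimal sketch (using the current LinearDiscriminantAnalysis import path rather than the old sklearn.lda module) showing that GridSearchCV falls back to the estimator's own score method, i.e. mean accuracy here, when no scoring argument is given:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(LinearDiscriminantAnalysis(),
                    param_grid={'solver': ['svd', 'lsqr']})
grid.fit(X, y)
print(grid.best_score_)  # mean cross-validated accuracy from the default score method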