Ensembling with dynamic weights - python

I was wondering if it is possible to use dynamic weights in sklearn's VotingClassifier. Overall I have 3 labels: 0 = Other, 1 = Spam, 2 = Emotion. By dynamic weights I mean the following:
I have 2 classifiers. The first is a Random Forest, which performs best on spam detection. The other is a CNN, which is superior for topic detection (it distinguishes well between Other and Emotion). What I would like is a VotingClassifier that gives a higher weight to the RF when it assigns the label "Spam/1".
Is VotingClassifier the right way to go?
Best regards,
Stefan

I think VotingClassifier only accepts different static weights for each estimator. However, you may be able to work around the problem by assigning class weights through the class_weight parameter of the random forest estimator, calculating the class weights on your training set.
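A minimal sketch of that idea (assuming X_train and y_train exist; the extra factor on the Spam class is purely illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# X_train, y_train are assumed; labels are 0 = Other, 1 = Spam, 2 = Emotion.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
weight_map = dict(zip(classes, weights))
weight_map[1] *= 2  # illustrative: push the forest harder on Spam; tune on validation data

rf = RandomForestClassifier(class_weight=weight_map, random_state=0)
rf.fit(X_train, y_train)

The per-estimator weights of VotingClassifier itself remain static, so any class-specific emphasis has to live inside the individual estimators, as above.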

Related

Is Random Forest regression good for this kind of regression problem?

I am working on vehicle occupancy prediction and I am very new to this. I have used random forest regression to predict the occupancy values.
Jupyter notebook_Random forest
I have around 48M rows and I have used all the data to predict the occupancy. The population and occupancy values were normalized because of their large magnitudes, and I made the predictions on the normalized data. I am sure the model is not good; how can I interpret the results from the RMSE and MAE? Also, the plot shows that the occupancy is not predicted well. Am I approaching the prediction of vehicle occupancy in a correct way?
Kindly help me with the following:
Is Random Forest regression a good method to approach this problem?
How can I improve the model results?
How do I interpret the results from the outcome?
Is Random Forest regression a good method to approach this problem?
-> The model is just a tool and can of course be used. However, no one can answer whether it is suitable or not, because we have not studied the distribution of the data. I suggest you also try logistic regression, support vector machine regression, etc.
How can I improve the model results?
-> I have several suggestions on how to improve: 1. Do not standardize without confirming whether the y column has extreme values. 2. When calculating RMSE and MAE, use the original y values (see the sketch below). 3. Deeply understand the business logic and add new features. 4. Read up on data processing and feature engineering.
How do I interpret the results from the outcome?
-> Bad results do not necessarily mean no value. You need to compare whether the model is better than the existing methods and whether it has produced more economic value. For example, error is loss and accuracy is gain.
Hope this helps.
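As a sketch of suggestion 2 (y_scaler, y_test_scaled and y_pred_scaled are hypothetical names for the fitted target scaler and the scaled true/predicted test values):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Undo the normalization before scoring, so errors are in original units.
y_true = y_scaler.inverse_transform(y_test_scaled.reshape(-1, 1)).ravel()
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # now interpretable as occupancy error
mae = mean_absolute_error(y_true, y_pred)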
You were recommended an XGBoost-based regressor, so you could also try a LightGBM-based one: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
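A minimal sketch (assuming the same X_train, y_train, X_test, y_test; the hyperparameters are illustrative, and early stopping uses the callback API of recent LightGBM versions):

import lightgbm as lgb
from lightgbm import LGBMRegressor

model = LGBMRegressor(n_estimators=1000, learning_rate=0.05, random_state=34)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          eval_metric='rmse',
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
preds = model.predict(X_test)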
You are getting an RMSE of 0.002175863553610834, which is really close to zero, so we can say that you have a good model. I don't think the model needs further improvement. If you still want to improve it, you could switch the algorithm to XGBoost and use regularization and early stopping to avoid overfitting.
from xgboost import XGBRegressor

# L1/L2 regularization plus early stopping on a held-out eval set.
model = XGBRegressor(n_estimators=3000, learning_rate=0.01, reg_alpha=2,
                     reg_lambda=1, n_jobs=-1, random_state=34, verbosity=0)
evalset = [(X_train, y_train), (X_test, y_test)]
# Note: newer xgboost versions move eval_metric and early_stopping_rounds
# from fit() to the XGBRegressor constructor.
model.fit(X_train, y_train, eval_metric='rmse', eval_set=evalset,
          early_stopping_rounds=5)
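If you want to see what early stopping did, the fitted sklearn wrapper keeps the evaluation history (a sketch reusing the names above, per the xgboost sklearn API):

results = model.evals_result()
train_rmse = results['validation_0']['rmse']  # per-round RMSE on the train set
test_rmse = results['validation_1']['rmse']   # per-round RMSE on the test set
print(model.best_iteration)                   # round selected by early stopping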

My classifier gives 1.0 accuracy on the WHOLE test data set (even on wrong photos)

Have:
Dataset: 115 color images of size 256x256; all photos belong to ONE class (a cartoon person).
Classifiers: KNN and Random Forest Classifier.
Comment: I wanted to make a classifier that recognizes ONE cartoon person in a photo, so I collected a dataset, digitized it and passed it to the fit method of the classifiers. At first I chose SGDClassifier, but it only works with 2 or more classes in the dataset, so I then chose KNN and Random Forest Classifier.
Problem: when I test my trained classifiers, I get a score of 1.0 on EVERY photo. I tested the target object, another object (a different cartoon person) and a photo of a black screen, and they all got a score of 1.0 anyway.
Can somebody help me please? :( I have been stuck on this for 2 days already and don't see a way to solve it by myself. I looked at many solutions, but none of them worked in my case.
Dataset:
The shape of my dataset numpy array is (115, 196608), and (for example) one image in my dataset numpy array looks like this:
The dataset is a 2D array, because the classifiers only take 1D or 2D arrays.
Code: it's not the full code, just an example
import numpy
import cv2

train_data_values = numpy.array([*115 photos*])   # shape: (115, 196608)
train_data_labels = numpy.array([*115 labels*])
# In fact, all my labels equal "1"; there is no other value.

# Trying KNN
from sklearn.neighbors import KNeighborsClassifier
KNN_clf = KNeighborsClassifier(n_neighbors=16, weights='distance')
KNN_clf.fit(train_data_values, train_data_labels)
test_im = cv2.imread(DATASET_IMAGES_DIRECTORY + "\\test\\" + "test2.png")
KNN_clf.predict_proba(test_im.reshape(1, 3*256*256))  # Returns array([[1.]])

# Trying Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier()
RF_clf.fit(train_data_values, train_data_labels)
test_im = cv2.imread(DATASET_IMAGES_DIRECTORY + "\\test\\" + "test.png")
RF_clf.predict_proba(test_im.reshape(1, 3*256*256))  # Returns array([[1.]])
Comment: I looked at the images in my numpy dataset because I thought they might be badly digitized, but NO, they can easily be reconstructed from array to image.
P.S. The parameters for the KNN classifier are arbitrary; I ran a grid search for the best parameters, but again there were 1.0 scores everywhere.
All classifiers learn their scores from their training data, and the scores of most classifiers (including random forest and KNN) have a probabilistic meaning: they are tuned to reflect the probability distribution of the training data as well as possible.
So if your training data consists 100% of a single class, then the classifier will learn that any sample belongs to this class with 100% probability, and will predict this class with absolute confidence.
The lesson: to use any classifier, you need at least two classes; otherwise the prediction will be more or less meaningless. My recommendation is to add negative samples, that is, samples without your target person, including:
images with other persons from your cartoon and from other cartoons
images with background only and without persons
images with some non-animated objects
There are a few exceptions, such as OneClassSVM, that are (presumably) capable of producing meaningful scores when trained on a single class. But whether they work adequately on your data is something you will never know until you test them with data from several different classes.
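A minimal sketch of the one-class route, reusing the variable names from the question (the hyperparameters here are illustrative, not tuned):

from sklearn.svm import OneClassSVM

# train_data_values: the (115, 196608) array of flattened images, all one class.
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc_svm.fit(train_data_values)

# predict() returns +1 for "looks like the training class" and -1 for outliers;
# decision_function() gives a continuous novelty score instead of a flat 1.0.
label = oc_svm.predict(test_im.reshape(1, 3*256*256))
score = oc_svm.decision_function(test_im.reshape(1, 3*256*256))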

Getting probability values for random forest and Gradient Boosting in python

I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the model, not just look at accuracy, as I intend to use the probability values to create a KS table later on.
I used the readily available Titanic data set to start practicing.
Here are some of the steps I took:
import pandas as pd

# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerId']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()

# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two different models, I notice that for the Random Forest output a significant chunk has a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with '0' as the probability score, my KS table goes for a toss.
Welcome to SO! Since you don't seem to have an issue with code execution specifically, or totally incorrect outputs, this looks more appropriate for Cross Validated, where you can find answers to questions of statistical concern.
In fact, I'd suggest that answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do with them is very different, so a direct comparison of the model parameters is not necessarily appropriate.
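To see the difference concretely, you can compare the two predict_proba distributions on synthetic data. A minimal sketch (default settings; exact numbers will vary with dataset and sklearn version):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

rf = RandomForestClassifier(random_state=10).fit(X, y)
gbm = GradientBoostingClassifier(random_state=10).fit(X, y)

# RF probabilities are averages of per-tree votes, so on training data many
# samples land on exactly 0.0 or 1.0; GBM probabilities come from a logistic
# transform of an additive score and rarely hit the extremes exactly.
rf_p = rf.predict_proba(X)[:, 1]
gbm_p = gbm.predict_proba(X)[:, 1]
print((rf_p == 0).mean(), (gbm_p == 0).mean())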

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with a ratio of 1:10.
I chose to train a Random Forest Classifier on this problem.
I split the dataset 70%-30% randomly into a training set and a test set.
After careful reading of the different options for tackling the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn), I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests: one on the whole X_train, one on an undersampled training set X_und, and one on an oversampled set X_sm. X_und was generated by randomly dropping rows of X_train labelled 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn's GridSearchCV, I tweaked the three models to get the best parameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15],
              'max_features': [0.5, 'sqrt', 'log2'],
              'max_depth': [10, 15, 20],
              'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 5}, 'balanced'],
              'criterion': ['entropy', 'gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                    verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
The best score was obtained with

rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
                             max_features=0.5, n_jobs=-1, oob_score=True,
                             class_weight={0: 1, 1: 5})

trained on the whole X_train, giving this classification report on the test set:
              precision    recall  f1-score   support

           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668

   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857
With the sampling methods I got similar results, but no better: precision went down with undersampling, and I got almost the same result with oversampling.
For undersampling:

              precision    recall  f1-score   support

           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668

For SMOTE:

              precision    recall  f1-score   support

           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668
I played with the class_weight parameter to give more weight to the 1s, and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running GridSearchCV to tweak the forests, I used different scorings, focusing especially on f1 and roc_auc, hoping to decrease the false negatives. I got great scores with the SMOTE oversampling, but this model did not generalize well on the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of false negatives and increase the recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it has no built-in scoring string for sklearn's cross-validation tools such as GridSearchCV (see the sketch after this list).
I selected only the most important features, but this did not change the result; on the contrary, it got worse. I noticed that the feature importances obtained from training an RF after SMOTE were completely different from those of the normal sample.
I don't know exactly what to do with the oob_score other than considering it a free validation score obtained while training the forests. With the oversampling I get the highest oob_score = 0.9535, but this is natural since the training set is balanced in that case; the problem is still that it does not generalize well to the test set.
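For the record, any metric can be wrapped into a GridSearchCV-compatible scorer with make_scorer; a minimal sketch reusing param_grid and sss from the snippet above:

from sklearn.metrics import cohen_kappa_score, make_scorer

kappa_scorer = make_scorer(cohen_kappa_score)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                    verbose=1, n_jobs=-1, scoring=kappa_scorer)
grid.fit(X_train, y_train)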
Right now I have run out of ideas, so I would like to know whether I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

using RandomForestClassifier.predict_proba vs RandomForestRegressor.predict

I have a data set comprising a vector of features and a target that is either 1.0 or 0.0 (representing two classes). If I fit a RandomForestRegressor and call its predict function, is it equivalent to using RandomForestClassifier.predict_proba()?
In other words, if the target is 1.0 or 0.0, does RandomForestRegressor output probabilities?
I think so, and the results I am getting suggest so, but I would like to get a second opinion...
Thanks
Weasel
There is a major conceptual difference between the two, based on the different tasks being addressed:
Regression: continuous (real-valued) target variable.
Classification: discrete target variable (classes).
For a general classification method, the term "probability of an observation being class X" may not be defined, as some classification methods, kNN for example, do not deal with probabilities.
However, for Random Forest (and some other classification methods), classification is reduced to regression of the class probability distribution. The predicted class is then taken as the argmax of the computed "probabilities". In your case, you feed in the same input and you get the same result. And yes, it is OK to treat the values returned by RandomForestRegressor as probabilities.
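A quick empirical check of this claim (a sketch on synthetic data; the two outputs track each other closely, though the default split criteria differ, so they are not guaranteed to be bit-identical):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=500, random_state=0)  # y is 0/1

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y.astype(float))

p_clf = clf.predict_proba(X)[:, 1]   # averaged per-tree class fractions
p_reg = reg.predict(X)               # averaged per-tree leaf means of the 0/1 target

print(np.corrcoef(p_clf, p_reg)[0, 1])  # typically very close to 1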
