I am using the scikit-learn library for python for a classification problem. I used RandomForestClassifier and a SVM (SVC class). However while the rf achieves about 66% precision and 68% recall the SVM only gets up to 45% each.
I did a GridSearch for the parameters C and gamma for the rbf-SVM and also considered scaling and normalization in advance. However I think the gap between rf and SVM is still too large.
What else should I consider to get an adequate SVM performance?
I thought it should be possible to get at least up to equal results.
(All the scores are obtained by cross-validation on the very same test and training sets.)
As EdChum said in the comments there is no rule or guarantee that any model always perform best.
The SVM with RBF kernel model makes the assumption that the optimal decision boundary is smooth and rotation invariant (once you fix a specific feature scaling that is not rotation invariant).
The Random Forest does not make the smoothness assumption (it's a piece wise constant prediction function) and favors axis aligned decision boundaries.
The assumptions made by the RF model might just better fit the task.
BTW, thanks for having grid searched C and gamma and checked the impact of feature normalization before asking on stackoverflow :)
Edit to get some more insight, it might be interesting to plot the learning curves for the 2 models. It might be the case that the SVM model regularization and kernel bandwidth cannot deal with overfitting good enough while the ensemble nature of RF works best for this dataset size. The gap might get closer if you had more data. The learning curves plot is a good way to check how your model would benefit from more samples.
Related
I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv = 5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
, the score is in the 60s. My questions are:
What does the significance (or meaning) of the score that is produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score when evaluated via cross_val_predict and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevent ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small amount of the story. knowing what data your classifying and it's general form is pretty vital, and accuracy doesn't tell us a lot about how innaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and another class is 100% accurate? are the classes both 75% accurate?
what is the class balance? (is there more of one class than the other)?
how much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
these plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged Y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high dimensional space. In the image above, this is a very easy classification problem as each class exists in it's own island. The more these islands mix together, the harder classification will be.
did a grid search to find the best parameters
hot take, but don't use grid search, random search is better. (source Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
ok, answering more directly
these accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. you can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.
I have my university project and i'm given a dataset which almost all features have a very weak (only 1 feature has moderate correlation with the target) correlation with the target. It's distribution is not normal too. I already tried to apply simple model linear regression it caused underfitting, then i applied simple random forest regressor but it caused overfitting but when i applied random forest regressor with optimization with randomsearchcv it took time so long. Is there any way to get decent model with not-so-good dataset without underfitting or overfitting? or it's just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using python, sci-kit learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the maximum number of a samples a tree must have in order to split (or similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF.
Underfitting on linear regression
Add more parameters. If you have variables a, b, ... etc. adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc. may help. If you add enough polynomial features you should be able to overfit -- although that doesn't necessarily mean it will have a good fit on the train set (RMSE value).
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing (or if the software is automatically applying) your regression, try reducing the regularization parameter.
I am trying to create a binary classification model for imbalance dataset using Random Forest - 0- 84K, 1- 16K. I have tried using class_weights = 'balanced', class_weights = {0:1, 1:5}, downsampling and oversampling but none of these seem to work. My metrics are usually in the below range:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
there are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, your weight method balances them enough), then consider expanding your forest, either with deeper trees or more numerous trees.
Try other methods like SVM, or ANN, and see how they compare.
Try Stratified sampling for the dataset so that you can get the constant ration being taken in account for both the test and the training dataset. And then use the class weight balanced which you have already used. If you want the accuraccy improved there are tons other ways.
1) First be sure that the dataset being provided is accurate or verified.
2) You can increase the accuracy by playing with threshold of the probability (if in binary classification if its >0.7 confident then do a prediction else wise don't , the draw back in this approach would be NULL values or mostly being not predicting as algorithm is not confident enough, but for a business model its a good approach because people prefer less False Negatives in their model.
3) Use Stratified Sampling to equally divide the training and the testing dataset, so that constant ration is being divided. rather than train_test_splitting : stratified sampling will return you the indexes for training and testing . You can play with the (cross_validation : different iteration)
4) For the confusion matrix, have a look at the precision score per class and see which class is showing more( I believe if you apply threshold limitation it would solve the problem for this.
5) Try other classifiers , Logistic, SVM(linear or with other kernel) : LinearSVC or SVC , NaiveBayes. As per seen in most cases of Binary classification Logistc and SVC seems to be performing ahead of other algorithms. Although try these approach first.
6) Make sure to check the best parameters for the fitting such as choice of Hyper Parameters (using Gridsearch with couple of learning rates or different kernels or class weights or other parameters). If its textual classification are you applying CountVectorizer with TFIDF (and have you played with max_df and stop_words removal) ?
If you have tried these, then possibly be sure about the algorithm first.
I am making an application for multilabel text classification .
I've tried different machine learning algorithm.
No doubt the SVM with linear kernel gets the best results.
I have also tried to sort through the algorithm Radom Forest and the results I have obtained have been very bad, both the recall and precision are very low.
The fact that the linear kernel to respond better result gives me an idea of the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
The ensemble of the random forest performs well across many domains and types of data. They are excellent at reducing error from variance and don't over fit if trees are kept simple enough.
I would expect a forest to perform comparably to a SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross validation of some kind. At minimum measuring performance on unseen data using a test/training regimen than i could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data and likelier you'll see the RF performing more comparably.
Good luck.
It is very hard to answer this question without looking at the data in question.
SVM does have a history of working better with text classification - but machine learning by definition is context dependent.
Consider the parameters by which you are running the random forest algorithm. What are your number and depth of trees, are you pruning branches? Are you searching a larger parameter space for SVMs therefore are more likely to find a better optimum.
I am using sklearn.svr with the RBF kernel on an 80k-size dataset with 20+ variables. I was wondering how to choose the termination parameter tol. I ask because the regression does not seem to converge for certain combinations of C and gamma (2+ days before I give up). Interestingly, it converges after less than 10 minutes for certain combinations with an average run-time of approximately an hour.
Is there some sort of rule of thumb for setting this parameter? Perhaps a relationship to the standard deviation or expected value of the forecast?
Mike's answer is correct: subsampling for grid searching parameter is probably the best strategy to train SVR on medium-ish dataset sizes. SVR is not scalable so don't waste your time doing a grid search on the full dataset. Try on 1000 random sub samples, then 2000 and then 4000. Each time find the optimal values for C and gamma and try to guess how they evolve whenever you double the size of the dataset.
Also you can approximate the true SVR solution with the Nystroem kernel approximation and a linear regressor model such as SGDRegressor, LinearRegression, LassoCV or ElasticNetCV. RidgeCV is likely not to improve upon LinearRegression in the n_samples >> n_features regime.
Finally, do not forget to scale your input data by putting a MinMaxScaler or a StandardScaler before the SVR model in a Pipeline.
I would also try GradientBoostingRegressor models (although completely unrelated to SVR).
You really shouldn't use SVR on large data sets: its training algorithm takes between quadratic and cubic time. sklearn.linear_model.SGDRegressor can fit a linear regression on such datasets without trouble, so try that instead. If linear regression won't hack it, transform your data with a kernel approximation before feeding it to SGDRegressor to get a linear-time approximation of an RBF-SVM.
You may have seen the scikit learn documentation for the RBF function. Considering what C and gamma actually do and the fact that the SVR training time is at worst quadratic in the number of samples, I would try training first on a small subset of the data. By first getting a result for all parameter settings and then scaling up the amount of training data used, you might find you actually only need a small sample of the data to get results very close to the full set.
This is the advice I was given by my MSc project supervisor recently, as I had the exact same problem. I found that out of a set of 120k examples with 250 features I only needed around 3000 samples to get within 2% of the error of the full set models.
Sorry this isn't answering your question directly, but I thought it might help.