I have a university project where I was given a dataset in which almost all features have a very weak correlation with the target (only one feature has a moderate correlation), and the target's distribution is not normal either. I already tried a simple linear regression, which underfit; a plain random forest regressor overfit; and tuning the random forest with RandomizedSearchCV took far too long. Is there any way to get a decent model from a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to attack this route first, since you mention that your data is not strongly correlated with the target. It's typically easier to fix overfitting than underfitting, so that helps, too. (A short sketch follows the list below.)
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (or similarly, the minimum number of samples a leaf must have).
Try increasing the number of trees in the RF.
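For concreteness, here is a minimal sketch of a more constrained forest in scikit-learn. The hyperparameter values are placeholders to tune, and X, y stand in for your own features and target:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# A more constrained forest: shallower trees, larger leaves, more estimators.
rf = RandomForestRegressor(
    n_estimators=500,       # more trees lowers the variance of the ensemble
    max_depth=6,            # shallow trees are less able to memorize noise
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must contain at least 10 samples
    random_state=0,
)
print(cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error").mean())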
Underfitting on linear regression
Add more parameters. If you have variables a, b, ... etc., adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc., may help. If you add enough polynomial features you should be able to overfit -- although a good fit on the training set (RMSE) does not necessarily mean a good fit on held-out data. (See the sketch after this list.)
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the product of, or the ratio between, two other variables may be a good indicator.
If your regression is regularized (explicitly, or automatically by the software), try reducing the regularization parameter.
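As a rough sketch of the polynomial-features idea, assuming X, y are your features and target (the degree and alpha values are placeholders to tune):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Expand the inputs with polynomial terms, then fit a lightly regularized linear model.
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    Ridge(alpha=0.1),   # reduce alpha further if the model still underfits
)
print(cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error").mean())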
I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering with n_clusters set to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by KMeans:
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv=5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
the score is in the 60s. My questions are:
What is the significance (or meaning) of the score produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score evaluated via cross_val_score and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Good work keeping the code minimal, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevant ideas, then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small part of the story. Knowing what data you're classifying and its general form is pretty vital, and accuracy alone doesn't tell us much about how the inaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and the other 100% accurate, or are both classes 75% accurate?
What is the class balance (is there more of one class than the other)?
How much overlap do these classes have?
I recommend profiling your training and testing sets, and maybe running your data through t-SNE to get an idea of class overlap in your vector space.
These plots will give you an idea of how much overlap your two classes have. In essence, t-SNE maps a high-dimensional X to a 2-D X while attempting to preserve proximity. You can then plot your flagged y values as color and the 2-D X values as points on a grid to get an idea of how tightly packed your classes are in high-dimensional space. When each class sits on its own island, classification is very easy; the more these islands mix together, the harder classification will be.
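A minimal sketch of that kind of plot, assuming X_train and y_train are the arrays from your question (t-SNE's other parameters are left at their defaults):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the training features to 2-D while trying to preserve local structure,
# then color each point by its class label to see how much the classes overlap.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_train)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap="coolwarm", s=10)
plt.title("t-SNE projection colored by class")
plt.show()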
did a grid search to find the best parameters
Hot take, but don't use grid search; random search is better (source: Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
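For example, a randomized search over the regularization strength of your logistic regression might look like this (the distribution bounds and n_iter are placeholders; X_train, y_train come from your question):

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample C from a log-uniform distribution instead of a fixed grid.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)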
I tried using KMeans clustering with n_clusters set to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the boundary between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
OK, answering your questions more directly:
These accuracy scores mean either that your model isn't complex enough to learn the problem, or that there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster structure based on location in a vector space, but you already have labelled data (y_train), so you already know which clusters the data should belong to. Try modifying X_train in some way to get better segregation, or try a more complex model. You can use things like k-means or t-SNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data; see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.
Often-times stakeholders don't want a black-box model that's good at predicting; they want insights about features to have a better understanding about their business, and so they can explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed)?
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Fit a default XGBoost classifier on a synthetic dataset and plot its feature importances.
X, y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()
Update:
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see whether a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?
Update2:
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the purpose of interpreting complex machine learning algorithms. Interpreting a linear regression model is not as complicated as interpreting Support Vector Machine, Random Forest or Gradient Boosting Machine models; this is where Partial Dependence Plots can come into use. For some statistical explanation you can refer here and More Advance. Some of the algorithms have methods for finding variable importance, but they do not express whether a variable is positively or negatively affecting the model.
tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively correlated, i.e., does a change in feature X cause the prediction y to increase or decrease?
1. Predictive power
Your feature importances show you which features retain the most information and are the most significant. Power could imply what causes the biggest change; you would have to check by plugging in dummy values to see their overall impact, much like you would do with linear regression coefficients.
2. Correlation/Dependence
As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (shallow decision trees, usually with only one split).
In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step it calculates a vector for the "direction" of the weak learner, so you in principle know the direction of the influence from it (but keep in mind it may appear many times in one tree, in multiple steps of the additive model).
But to cut to the chase: you could just fix all your features apart from f19, make predictions for a range of f19 values, and see how the predicted response changes.
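A quick, hand-rolled sketch of that idea, reusing xgb, X and plt from the snippet in the question (holding every other feature at its mean is a simplification; a true partial dependence plot averages over the data instead):

import numpy as np

# Sweep feature 19 over its observed range while all other features stay at their means,
# then look at how the predicted probability of class 1 moves.
grid = np.linspace(X[:, 19].min(), X[:, 19].max(), 50)
X_ref = np.tile(X.mean(axis=0), (len(grid), 1))
X_ref[:, 19] = grid
probs = xgb.predict_proba(X_ref)[:, 1]

plt.plot(grid, probs)
plt.xlabel("f19")
plt.ylabel("predicted probability of class 1")
plt.show()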
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.
The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes, we do. Feature importance is not some magical object; it is a well-defined mathematical criterion. Its exact definition depends on the particular model (and/or some additional choices), but it is always an object that tells you "why". The "why" is usually the most basic thing possible and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for a feature to be used on a decision path when a randomly selected training data point is pushed through the trees. So it gives a "why" in a proper, mathematical sense.
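As a minimal illustration (not the asker's exact setup), scikit-learn exposes a closely related, impurity-based criterion as feature_importances_; here it is read off a forest fitted on the synthetic data from the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Fit a forest and print its impurity-based importances for the top features.
X, y = make_classification(random_state=68)
forest = RandomForestClassifier(random_state=0).fit(X, y)
for idx in forest.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")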
I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into ordinary linear regression. alpha is basically the penalty imposed on the size of the weights: the larger alpha, the more the weights are shrunk toward zero. You can also force the weights to be strictly positive. Check this out here.
Run it with a small degree and perform a cross-validation to check how good it fits.
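A rough sketch of that cross-validation, assuming input_data and Y_ are the arrays from the snippets above (the degree range and alpha are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Cross-validate a Lasso on polynomial expansions of increasing degree and
# keep the degree that scores best on held-out folds.
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree=degree), Lasso(alpha=0.5, positive=True))
    print(degree, cross_val_score(model, input_data, Y_, cv=5).mean())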
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
You should also take a look at this question. This explains how you can curve fit.
ANOVA (analysis of variance) partitions the variance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogenous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
I am making an application for multilabel text classification.
I've tried different machine learning algorithms.
Without doubt the SVM with a linear kernel gets the best results.
I have also tried the Random Forest algorithm, but the results have been very bad: both recall and precision are very low.
The fact that the linear kernel gives better results suggests that the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
Random forest ensembles perform well across many domains and types of data. They are excellent at reducing error from variance and don't overfit if the trees are kept simple enough.
I would expect a forest to perform comparably to a SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross-validation of some kind, or at minimum measuring performance on unseen data with a train/test split, then I could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data, and you'll likely see the RF performing more comparably.
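As a minimal sketch of such a check, assuming X is your (e.g. TF-IDF) feature matrix and Y a binary indicator matrix of labels, as is usual for multilabel problems:

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hold out 20% of the data and compare both models on it.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
for name, clf in [("linear SVM", OneVsRestClassifier(LinearSVC())),
                  ("random forest", RandomForestClassifier())]:
    clf.fit(X_train, Y_train)
    print(name, f1_score(Y_test, clf.predict(X_test), average="micro"))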
Good luck.
It is very hard to answer this question without looking at the data in question.
SVM does have a history of working better with text classification - but machine learning by definition is context dependent.
Consider the parameters with which you are running the random forest algorithm. What are your number and depth of trees? Are you pruning branches? Are you searching a larger parameter space for the SVMs, and therefore more likely to find a better optimum?
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform cross-validation (a scikit-learn sketch follows the steps):
Take your data, let's call it X
Split X into K batches of equal sizes
For each i=1 to K:
Train your classifier on all batches but i'th
Test on i'th
Return the average result
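A minimal sketch of this in scikit-learn, assuming clf is your classifier and X, y are your data and labels, with K=5:

from sklearn.model_selection import cross_val_score

# Runs the train-on-all-but-one-fold / test-on-the-held-out-fold loop and averages the scores.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())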
One more important aspect: if your classifier uses any parameters, constants, thresholds etc. that are not trained but rather supplied by the user, you cannot just select the ones giving the best results in the above procedure. Choosing them has to be automated somehow inside the "Train your classifier on all batches but i'th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes (a quick sketch for computing the two errors follows the list):
Training error is low but much lower than the testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
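A quick way to compute the two errors and place yourself among these outcomes, assuming a scikit-learn style classifier clf and data X, y:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Compare accuracy on the data the model was trained on with accuracy on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))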
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
See lejlot's post for details.
Choose a simpler model.
Linear classifiers have high bias because the model must be linear, but correspondingly lower variance in the optimal solution. This means that you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase/decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight; in this case, the closer C gets to zero, the greater the weight applied to regularization. See the scikit-learn example after the next paragraph.)
Regardless of the specifics, regularization works by reducing the variance of a model by biasing it toward solutions with small weights.
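For example, scikit-learn's LogisticRegression uses the 1/C convention mentioned above, so a smaller C means stronger regularization:

from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength: small C = heavy regularization.
weakly_regularized = LogisticRegression(C=100.0)
strongly_regularized = LogisticRegression(C=0.01)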
Finally, boosting is a method of training that mysteriously/magically resists overfitting. I'm not sure anyone has fully explained why, but it is a process of combining high-bias, low-variance simple learners into a lower-bias model. It's pretty slick.