My target is continuous value like house price. I am training a regression tree on it. I use Gradientboostregressor in scklearn, Python.
My target value(house price) has a L shape distribution: house price on high end is like 10 times higher than price on low end. My regression tree model under-predict high values, and over-predict low values.
Anything I can do to improve model prediction? I tried to model log(price), then exp(prediction),but not work well.
Thank you very much.
Couple of things you can try:
1) Are there features that capture high price? Things like lat/long, square footage etc.
2) How large is your test set? Is it representative of the validation set?
Also, there is a number of posts analyzing this exact problem on US data. For example, this post from kaggle for some useful features that can work:
https://www.kaggle.com/erick5/predicting-house-prices-with-machine-learning
A single decision tree often does not work very well. Of course you could try to optimize the tree. But I think it is better to change to random forest or gradient boosting if you want to work with trees.
Related
Im working on a Kaggle machine learning proyect (https://www.kaggle.com/c/house-prices-advanced-regression-techniques) and my target variable is "SalePrice" of a particular house.
After plotting the data I can see that my target variable doesnt show a normal distribution and have positive skewness
So I (kind of) normalize it by taking the log of it
When I run my predictions using my regressors later, am I going to predict the log of the sale price? In this case, what should I do?
Is it okay just to do the inverse transformation or is it mathematical or statisticaly wrong?
Short answer: yes your model will predict the log of the house price, and there's nothing wrong with taking the exponential of that (mathematically speaking) to get back to the actual house price.
Forgetting statistics for a moment, if you have taken the log of your house price for your training data, your algorithm doesn't "know" that, it's just a different set of numbers, and your regressor is going to just fit to a different curve.
One practical problem is, when you take the exponential of your output, you better be quite confident in the precision of the predicted number. A small difference in your log-predicted-house prices, after being exponentiated, could make a large difference in your output. Is your model going to be able to achieve that level of precision?
I have my university project and i'm given a dataset which almost all features have a very weak (only 1 feature has moderate correlation with the target) correlation with the target. It's distribution is not normal too. I already tried to apply simple model linear regression it caused underfitting, then i applied simple random forest regressor but it caused overfitting but when i applied random forest regressor with optimization with randomsearchcv it took time so long. Is there any way to get decent model with not-so-good dataset without underfitting or overfitting? or it's just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using python, sci-kit learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the maximum number of a samples a tree must have in order to split (or similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF.
Underfitting on linear regression
Add more parameters. If you have variables a, b, ... etc. adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc. may help. If you add enough polynomial features you should be able to overfit -- although that doesn't necessarily mean it will have a good fit on the train set (RMSE value).
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing (or if the software is automatically applying) your regression, try reducing the regularization parameter.
I am new to Machine Learning
I have a dataset which has highly unbalanced classes(dominated by negative class) and contains more than 2K numeric features and the target is [0,1]. I have trained a logistics regression though I am getting an accuracy of 89% but from confusion matrix, it was found the model True positive is very low. Below are the scores of my model
Accuracy Score : 0.8965989500114129
Precision Score : 0.3333333333333333
Recall Score : 0.029545454545454545
F1 Score : 0.05427974947807933
How I can increase my True Positives? Should I be using a different classification model?
I have tried the PCA and represented my data in 2 components, it increased the model accuracy up to 90%(approx) however True Positives was decreased again
There are several ways to do this :
You can change your model and test whether it performs better or not
You can Fix a different prediction threshold : here I guess you predict 0 if the output of your regression is <0.5, you could change the 0.5 into 0.25 for example. It would increase your True Positive rate, but of course, at the price of some more False Positives.
You can duplicate every positive example in your training set so that your classifier has the feeling that classes are actually balanced.
You could change the loss of the classifier in order to penalize more False Negatives (this is actually pretty close to duplicating your positive examples in the dataset)
I'm sure many other tricks could apply, here is just my favorite short-list.
I'm assuming that your purpose is to obtain a model with good classification accuracy on some test set, regardless of the form of that model.
In that case, if you have access to the computational resources, try Gradient-Boosted Trees. That's a ensemble classifier using multiple decision trees on subsets of your data, then a voting ensemble to make predictions. As far as I know, it can give good results with unbalanced class counts.
SciKitLearn has the function sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a package too, if you have access.
2k features might be difficult for the SKL algorithm - I don't know I've never tried.
What is the size of your dataset?How many rows are we talking here?
Your dataset is not balanced and so its kind of normal for a simple classification algorithm to predict the 'majority-class' most of the times and give you an accuracy of 90%. Can you collect more data that will have more positive examples in it.
Or, just try oversampling/ under-sampling. see if that helps.
You can also use penalized version of the algorithm to impose penalty, whenever a wrong class is predicted. That may help.
You can try many different solutions.
If you have quite a lot data points. For instance you have 2k 1s and 20k 0s. You can try just dump those extra 0s only keep 2k 0s. Then train it. And also you can try to use different set of 2k 0s and same set of 2k 1s. To train multiple models. And make decision based on multiple models.
You also can try adding weights at the output layer. For instance, you have 10 times 0s than 1s. Try to multiply 10 at the 1s prediction value.
Probably you also can try to increase dropout?
And so on.
the goal:
Hey guys, I'm trying to create a classification model in Python to predict when a bike-share station will have too much relative inflow or outflow per hour.
what we're workin with:
The first 5 rows of my dataframe (over 200,000 rows in all) look like this, and I've assigned values 0, 1, 2 in the 'flux' column - 0 if no significant action, 1 if too much inflow, 2 if too much outflow.
And I'm thinking of using the station_name (over 300 stations), hour of day, and day of week as the predictor variables to classify 'flux'.
the model choice:
What should I go with? Naive Bayes? KNN? Random Forest? anything else that would be a good fit? GDMs? SVMs?
fyi: the baseline prediction of always 0 is pretty high at 92.8%. unfortunately the accuracy of logistic regression and decision tree is right on par w that and doesn't improve it much. and KNN just takes forever....
Recommendations from those more experienced with machine learning in dealing with a classification question like this?
The Azure machine learning team has an article on how to choose algorithms which could help even if you aren't using AzureML. From that article:
How large is your training data? If your training set is small, and
you're going to train a supervised classifier, then machine learning
theory says you should stick to a classifier with high bias/low
variance, such as Naive Bayes. These have an advantage over low
bias/high variance classifiers such as kNN since the latter tends to
overfit. But low bias/high variance classifiers are more appropriate
if you have a larger training set because they have a smaller
asymptotic error - in these cases a high bias classifier isn't
powerful enough to provide an accurate model. There are theoretical
and empirical results that indicate that Naive Bayes does well in such
circumstances. But note that having better data and good features
usually can give you a greater advantage than having a better
algorithm. Also, if you have a very large dataset classification
performance may not be affected as much by the algorithm you use, so
in that case it's better to choose your algorithm based on such things
as its scalability, speed, or ease of use.
Do you need to train incrementally or in a batched mode? If you have a
lot of data, or your data is updated frequently, you probably want to
use Bayesian algorithms that update well. Both neural nets and SVMs
need to work on the training data in batch mode.
Is your data exclusively categorical or exclusively numeric or a
mixture of both kinds? Bayesian works best with categorical/binomial
data. Decision trees can't predict numerical values.
Do you or your audience need to understand how the classifier works?
Bayesian or decision trees are more easily explained. It's much harder
to see or explain how neural networks and SVMs classify data.
How fast does your classification need to be generated? Decision trees
can be slow when the tree is complex. SVMs, on the other hand,
classify more quickly since they only need to determine which side of
the "line" your data is on.
How much complexity does the problem present or require? Neural nets
and SVMs can handle complex non-linear classification.
Now, regarding your comment about "fyi: the baseline prediction of always 0 is pretty high at 92.8%": there are anomaly detection algorithms - meaning that the classification is highly unbalanced, with one classification being an "anomaly" that occurs very rarely, like credit card fraud detection (true fraud is hopefully a very small percentage of your total dataset). In Azure Machine Learning, we use one-class support vector machine (SVM) and PCA-based anomaly detection algorithms. Hope that helps!
Just use anything different from average accuracy for model evaluation in case of such unbalanced data: precision/recall/f1/confusion matrix:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Try different models and choose best according to chosen metrics on test set.
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my clasifier spits out a .tsv containing :
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen
The most simple way to check your classifier "efficiency" is to perform a cross validation:
Take your data, lets call them X
Split X into K batches of equal sizes
For each i=1 to K:
Train your classifier on all batches but i'th
Test on i'th
Return the average result
One more important aspect - if your classifier uses any parameters, some constants, thresholds etc. which are not trained, but rather given by the user you cannot just select the ones giving the best results in the above procedure. This has to be somehow automatized in the "Train your classifier on all batches but i'th". In other words - you cannot use the testing data to fit any parameters to your model. Once done this, there are four possible outcomes:
Training error is low but is much lower than testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
linear classifiers have a high bias because the model must be linear but lower variance in the optimal solution because of the high bias. This means that you wouldn't expect to see much difference in the final model given a large number of random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant will increase / decrease the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the close C gets to zero the greater weight is applied to regularization)
Regardless of the specifics, regularization works by reducing the variance in a model by biasing it to solutions with low regularization weight.
Finally, boosting is a method of training that mysteriously/magically does not overfit. Not sure if anyone has discovered why, but it is a process of combining high bias low variance simple learns into a high variance low bias model. Its pretty slick.