I'm working on a Kaggle machine learning project (https://www.kaggle.com/c/house-prices-advanced-regression-techniques) and my target variable is the "SalePrice" of a particular house.
After plotting the data I can see that my target variable is not normally distributed and has positive skewness,
so I (kind of) normalize it by taking its log.
When I run my predictions using my regressors later, am I going to predict the log of the sale price? If so, what should I do?
Is it okay just to apply the inverse transformation, or is that mathematically or statistically wrong?
Short answer: yes, your model will predict the log of the house price, and there is nothing wrong (mathematically speaking) with taking the exponential of that prediction to get back to the actual house price.
Forgetting statistics for a moment: if you have taken the log of the house prices in your training data, your algorithm doesn't "know" that. It's just a different set of numbers, and your regressor will simply fit a different curve.
One practical problem is that when you take the exponential of your output, you had better be quite confident in the precision of the predicted number. A small difference in the predicted log price, once exponentiated, can make a large difference in the final price. Is your model going to be able to achieve that level of precision?
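If it helps, here is a minimal sketch of the round trip (assuming the competition's train.csv with its SalePrice column; the two feature columns are just examples, and np.log1p/np.expm1 are used so zero values are handled safely):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")                 # the Kaggle training file
X = train[["GrLivArea", "OverallQual"]]          # example numeric features, for illustration only
y = np.log1p(train["SalePrice"])                 # train on log(1 + price)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = Ridge().fit(X_tr, y_tr)

log_preds = model.predict(X_val)                 # predictions on the log scale
price_preds = np.expm1(log_preds)                # invert the transform to get prices back

# Error measured on the original price scale, where small log errors get amplified
val_mae = np.mean(np.abs(np.expm1(y_val) - price_preds))
print(f"validation MAE on the price scale: {val_mae:.0f}")
```

Note that errors evaluated on the original price scale are amplified for expensive houses after back-transforming, which is exactly the precision issue above.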
Related
My target is a continuous value, a house price. I am training a regression tree on it, using GradientBoostingRegressor in scikit-learn (Python).
My target value (house price) has an L-shaped distribution: prices at the high end are roughly 10 times higher than prices at the low end. My regression tree model under-predicts high values and over-predicts low values.
Is there anything I can do to improve the model's predictions? I tried modelling log(price) and then taking exp(prediction), but it did not work well.
Thank you very much.
A couple of things you can try:
1) Are there features that capture high prices, such as latitude/longitude, square footage, etc.?
2) How large is your test set? Is it representative of the validation set?
Also, there are a number of posts analyzing this exact problem on US data. For example, see this Kaggle post for some useful features that can work:
https://www.kaggle.com/erick5/predicting-house-prices-with-machine-learning
A single decision tree often does not work very well. Of course you could try to tune the tree, but I think it is better to switch to a random forest or gradient boosting if you want to work with trees.
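As a rough sketch of that comparison (using a built-in scikit-learn dataset as a stand-in for your house-price data), you could cross-validate a single tree against the ensemble models:

```python
from sklearn.datasets import fetch_california_housing   # stand-in dataset, for illustration
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    # cross_val_score returns negative MSE by convention; flip the sign for readability
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE = {scores.mean():.3f}")
```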
This could be more of a theoretical question than a code-related one. In my current job I find myself estimating/predicting (the latter more opportunistically) the water level of a given river in Africa.
The point is that I am developing a simplistic multiple regression model that takes more than 15 years of historical water levels and precipitation (from different locations) to generate water level estimates.
I am not that used to working with machine learning, or whatever the correct name is. I am more used to modelling data and generating fits (the current data can be described very well with asymmetric Gaussians and sigmoid functions combined with low-order polynomials).
So the point is: once I have a multiple regression model, my colleagues advised me not to use fitted data for the estimation but all the raw data instead. Since they couldn't explain the reason for that, I attempted to use the fitted data as raw inputs (in my defense, a median of all the fitting models has a very low deviation error, i.e. nice fits). What I don't understand is why I should use only the raw data, which could be noisy and inaccurate and take into account factors that are not directly related (biasing the regression?). What is the advantage of that?
My lack of theoretical knowledge in the field is what makes me wonder about this. Should I always use all the raw data to determine the variables of my multiple regression, or can I use the fitted values (i.e. a median of the different fitting models for each historical year)?
Thanks a lot!
Here are my 2 cents.
I think your colleagues are saying that because it would be better for the model to learn the correlations between the raw data and the actual rainfall.
In the field you will start with the raw data, so being able to predict directly from it is very useful. Any processing you do on top of the raw data is work you will have to repeat every time you want to make a prediction.
However, if a simpler model (asymmetric Gaussians and sigmoid functions combined with low-order polynomials) describes the data well, then I would recommend doing that, as long as your (y_pred - y_true) ** 2 is very small.
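To make that comparison concrete, here is a minimal sketch (with made-up data and column names; the rolling median is just a stand-in for your own fitting models) that scores a regression trained on raw inputs against one trained on smoothed/fitted inputs, using held-out squared error:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: precipitation from three locations plus the water level target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["rain_a", "rain_b", "rain_c", "level"])

X_raw = df[["rain_a", "rain_b", "rain_c"]]
# Stand-in for "fitted" inputs: a rolling median smooths each raw series.
X_fit = X_raw.rolling(window=7, min_periods=1).median()
y = df["level"]

for name, X in {"raw inputs": X_raw, "smoothed/fitted inputs": X_fit}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: held-out MSE = {mse:.3f}")
```

Whichever version gives the lower held-out error on your real data is the one worth keeping.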
I have a problem that I have been treating as a classification problem. I am trying to predict whether a machine will pass or fail a particular test based on a number of input features.
What I am really interested in is actually whether a new machine is predicted to pass or fail the test. It can pass or fail the test by having certain signatures (such as speed, vibration, etc.) go out of range.
Therefore, I could either:
1) Treat it as a pure regression problem: try to predict the actual values of speed, vibration, etc.
2) Treat it as a pure classification problem: for each observation, feed in whether it passed or failed as the label, and try to predict this in the tool I am making.
3) Treat it as a pseudo problem: predict the actual value and come up with some measure of how confident I am that it is a pass or fail, based on the distance from the pass/fail threshold.
To be clear: I am working on a real problem. I am not interested in getting a super-precise prediction of a certain value, just whether a machine is predicted to pass or fail (and, as a bonus extension, how likely that prediction is to be true).
I have been working with a classification model, as I only have a couple of hundred observations, and some previous research suggested this might be the best way to treat the problem. However, I am now wondering whether this is the right thing to do.
What would you do!?
Many thanks.
Without having the data and running both classification and regression, a comparison is hard because the metric you use for each family is different.
For example, comparing the RMSE of a regression with the F1 score (or accuracy) of a classification model would be an apples-to-oranges comparison.
It would be ideal if you could train a good regression model (low RMSE), because that would give you more information than the original pass/fail question. From my past experience with industrial customers:
First, train all three models you have mentioned, then present the outcomes to your customer and let them give you more direction on which models/outputs are most meaningful to them.
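One way to make that comparison fair is to score both approaches with the same classification metric: train a regressor on the continuous signature, apply the known pass/fail threshold to its predictions, and compare its F1 score against a direct classifier. A minimal sketch with synthetic data and a hypothetical vibration threshold:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data: X are the machine features, vib is a continuous signature,
# and a machine "fails" when vib exceeds a known threshold.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
vib = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)
THRESHOLD = 1.0
fail = (vib > THRESHOLD).astype(int)

X_tr, X_te, vib_tr, vib_te, fail_tr, fail_te = train_test_split(X, vib, fail, random_state=0)

# Option 2: classify pass/fail directly.
clf = RandomForestClassifier(random_state=0).fit(X_tr, fail_tr)
f1_clf = f1_score(fail_te, clf.predict(X_te))

# Option 3: regress the signature, then apply the known threshold.
reg = RandomForestRegressor(random_state=0).fit(X_tr, vib_tr)
f1_reg = f1_score(fail_te, (reg.predict(X_te) > THRESHOLD).astype(int))

print(f"direct classifier F1: {f1_clf:.3f}, thresholded regressor F1: {f1_reg:.3f}")
```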
I have a dataset, which you can find (the updated file) here, containing many different characteristics of office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm, trained on the dataset above, to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms in the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict this continuous variable. Surface area and number of workers had a correlation with the target variable of between 0.3 and 0.4, so I assumed them to be good features and included them in the training of the model. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing to do in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case English). After a bit of tinkering, I found out that the language being used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function which (after using Google Translate) tells you what the purpose of the building is. Intuitively, this should have a strong correlation with power consumption: industrial buildings tend to use more power than normal households. After translation, I found that the main types were Residential, Office, Accommodation and Meeting. This feature thus has to be encoded as a nominal variable to train the model.
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you with some code to perform this regression task. In such tasks it is very important to understand the data and then perform feature engineering.
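In the meantime, here is a rough sketch of the encoding step (the file name and translated column names are hypothetical; the point is one-hot encoding building_function while passing the numeric columns through):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names after translating the Dutch headers to English.
df = pd.read_csv("buildings_translated.csv")
X = df[["surface_area", "num_workers", "building_function"]]
y = df["kwh"]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["building_function"])],
    remainder="passthrough",   # keep the numeric columns unchanged
)

model = Pipeline([("prep", preprocess), ("reg", RandomForestRegressor(random_state=0))])
model.fit(X, y)
```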
I had trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained the model, and now I am getting an accuracy of only 87%. What could be the reason? And should I stick with the data that is not normalised, or should I switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots, so this point would be labelled red. Let's also assume the correct label is red; therefore this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common scale (in your case, think of it as making the features more comparable). Assume in the above example that normalizing the ?'s features shrinks its y value. This would place the question mark below its current position, surrounded by more blue dots. Therefore, your algorithm would label it blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times it helps eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up for the dataset you are currently working with doesn't mean you will get the same results on a different dataset.
Long story short: instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the training data very well but does not work well at all on new data. The first model might have memorized more of the data due to some characteristic of that data, but that is not a good thing. You need to check your prediction accuracy on a set of data different from what you trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
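A minimal sketch of that check (using a built-in scikit-learn dataset as a stand-in for your data), comparing held-out accuracy with and without standardization:

```python
from sklearn.datasets import load_breast_cancer        # stand-in dataset, for illustration
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Accuracy is measured only on the held-out test split, never on the training data.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

print("unscaled test accuracy:", raw_knn.score(X_te, y_te))
print("scaled test accuracy:  ", scaled_knn.score(X_te, y_te))
```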
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to differ from the distances with unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen with unnormalized features, hence the difference in accuracy.
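A tiny made-up example of that effect: when one feature has a much larger range, it dominates the raw Euclidean distance, and standardizing can change which point is the nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy data for illustration: feature 1 has a much larger range than feature 0.
X = np.array([[1.0, 1000.0],
              [5.0, 1005.0],
              [1.1, 1200.0]])
query = np.array([[4.9, 1000.0]])

# Unscaled: feature 1 dominates the distance, so row 0 is the nearest neighbor.
nn_raw = NearestNeighbors(n_neighbors=1).fit(X)
print(nn_raw.kneighbors(query, return_distance=False))

# Standardized: both features contribute comparably, and row 1 becomes the nearest neighbor.
scaler = StandardScaler().fit(X)
nn_std = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X))
print(nn_std.kneighbors(scaler.transform(query), return_distance=False))
```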