How to avoid float values in regression models

How to avoid float values in regression models - python

I am trying to predict wine quality (ranges from 1 to 10) using regression models such as linear,SGDRegressor, ridge,lasso.
dataset:http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Independent values:volatile acidity,residual sugar,free sulfur dioxide,total sulfur dioxide,alchohol
Dependent:Quality
Linear model
regr = linear_model.LinearRegression(n_jobs=3)
regr.fit(x_train, y_train)
predicted = regr.predict(x_test)
predicted values for LinearRegression
array([ 5.33560542, 5.47347404, 6.09337194, ..., 5.67566813,
5.43609198, 6.08189 ])
predicted values are in float instead of (1,2,3...10)
I tried to round predicted values using numpy
predicted = np.round(regr.predict(x_test))` but my accuracy gone down with this attempt.
SGDRegressor model.
from sklearn import linear_model
np.random.seed(0)
clf = linear_model.SGDRegressor()
clf.fit(x_train, y_train)
redicted = np.floor(clf.predict(x_test))
predicted output values for SGDRegressor:
array([ -2.77685458e+12, 3.26826414e+12, 4.18655713e+11, ...,
4.72375220e+12, -7.08866307e+11, 3.95571514e+12])
Here I am unable to convert the output values into integers.
Could someone please let me know the best way to predict the wine quality using these regression models.

You are doing a regression and therefore the output is continuous in nature.
The thing you should note is that your mini-project on predicting wine quality is not a classification problem. The response variable y, the wine quality, has intrinsic order which means a score of 6 is strictly better than a score of 5. It is NOT categorical variable where different numbers just represent different groups where groups are non-comparable.

Related

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used the sklearn.linear_model.Ridge as my baseline and after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought, maybe the linear model is too simplistic so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.Kernel_Ridge) but they would take too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit if I increased the n_components parameter which I had to to get any meaningful increase in the accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!

This is a great question but as it often happens there is no simple answer to complex problems. Regression is not a simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes couple university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regressions:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs etc. Many of these features are also moving targets, as they tend to change over time. Even 0.12 R2 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why people still use (linear) regression?
it is the simplest and fastest model. There are a lot of implications for real-time systems and statistical analysis, so it does matter
often it is used as a baseline model. Before trying a fancy neural network architecture, it would be helpful to know how much we improve comparing to a naive method.
sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables
To summarize, regression is definitely not the ultimate tool in most cases, but this is usually the cheapest solution to try first
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. regression error predicted_value - true_value. Then, for each feature you make a scatter plot, where horizontal axis is feature value and vertical axis is the error value. Ideally, residuals have normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as different residuals mean depending on feature value. Other problems include but not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is an great read on regression diagnostics for beginners.

I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.datasets import load_boston
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# get only 2 fields and convert non-numerics to numerics
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
from sklearn.ensemble import RandomForestClassifier
features = df_new.columns.values
clf = RandomForestClassifier()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to rowk with numeric data so let's try lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat, is performing slightly better, but this too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you to predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did make a mistake, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

How can we calculate accuracy for the Random forest classifier if we are using 4 label classification?

I am trying to predict the quality attributes of the product which have been sold over last decade.
Based on the likes/dislikes i have kept the 4 label for the product
Labels are: bad , good, very good,very bad
I have downloaded the last decade data and categorized the samples in these 4 labels. When i put the input in random forest classifier , it is giving the valid result and giving the feature importance:
Here is the code for same:
classifier = RandomForestClassifier(
n_estimators=100, n_jobs=6, oob_score=True, random_state=50,
max_features="auto", min_samples_leaf=50
)
'''
classifier = RandomForestClassifier(
n_estimators=100, n_jobs=6, oob_score=True, random_state=50#, max_depth=3
)
I just want to understand , how we can calculate the accuracy of the model as it has 4 labels.

There are a few accuracies you can check to assess model quality; the first is the overall model accuracy (how many did it get right). For this you can simply use the sklearn accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
Of course this is not giving you enough information about which class is being miss-classified and to what (eg. it might be more acceptable to categorize very good as good rather than bad). For this you need a confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
You probably also want to look into recall and precision as they will help understand the matrix and quantify it.
What you can also do since your labels are ranked, is to convert them to int values and adress the problem with a regression instead of a classification (then convert the outputs back to ints). This way the model will get an understanding of order, therefore you get an ordinal classification.
EDIT:
Just in case the answer is not clear, you get y_pred in the following way:
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_val)

Calculate evaluation metrics using cross_val_predict sklearn

In the sklearn.model_selection.cross_val_predict page it is stated:
Generate cross-validated estimates for each input data point. It is
not appropriate to pass these predictions into an evaluation metric.
Can someone explain what does it mean? If this gives estimate of Y (y prediction) for every Y (true Y), why can't I calculate metrics such as RMSE or coefficient of determination using these results?

It seems to be based on how samples are grouped and predicted. From the user guide linked in the cross_val_predict docs:
Warning Note on inappropriate usage of cross_val_predict
The result of
cross_val_predict may be different from those obtained using
cross_val_score as the elements are grouped in different ways. The
function cross_val_score takes an average over cross-validation folds,
whereas cross_val_predict simply returns the labels (or probabilities)
from several distinct models undistinguished. Thus, cross_val_predict
is not an appropriate measure of generalisation error.
The cross_val_score seems to say that it averages across all of the folds, while the cross_val_predict groups individual folds and distinct models but not all and therefore it won't necessarily generalize as well. For example, using the sample code from the sklearn page:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=3)
print("Cross Val Prediction score:{}".format(mean_squared_error(y,y_pred)))
print("Cross Val Score:{}".format(np.mean(cross_val_score(lasso, X, y, cv=3, scoring = make_scorer(mean_squared_error)))))
Cross Val Prediction score:3993.771257795029
Cross Val Score:3997.1789145156217

Just to add a little more clarity, it is easier to understand the difference if you consider a non-linear scoring function such as Maximum-Absolute-Error instead of something like a mean-absolute error.
cross_val_score() would compute the maximum-absolute-error on each off the 3-folds (assuming 3 fold cross-validator) and report the aggregate (say mean?) over 3 such scores. That is, something like mean of (a, b, c) where a , b, c are the max-abs-errors for the 3 folds respectively. I guess it is safe to conclude the returned value as the max-absolute-error of your estimator in the average or general case.
with cross_val_predict() you would get 3-sets of predictions corresponding to 3-folds and taking the maximum-absolute-error over the aggregate (concatenation) of these 3-sets of predictions is certainly not the same as above. Even if the predicted values are identical in both the scenarios, what you end up with here is max of (a, b,c ). Also, max(a,b,c) would be an unreasonable and overly pessimistic characterization of the max-absolute-error score of your model.

Seaborn Regplot and Scikit-Learn Logistic Models Calculated Differently?

I'm using both the Scikit-Learn and Seaborn logistic regression functions -- the former for extracting model info (i.e. log-odds, parameters, etc.) and the later for plotting the resulting sigmoidal curve fit to the probability estimations.
Maybe my intuition is incorrect for how to interpret this plot, but I don't seem to be getting results as I'd expect:
#Build and visualize a simple logistic regression
ap_X = ap[['TOEFL Score']].values
ap_y = ap['Chance of Admit'].values
ap_lr = LogisticRegression()
ap_lr.fit(ap_X, ap_y)
def ap_log_regplot(ap_X, ap_y):
plt.figure(figsize=(15,10))
sns.regplot(ap_X, ap_y, logistic=True, color='green')
return None
ap_log_regplot(ap_X, ap_y)
plt.xlabel('TOEFL Score')
plt.ylabel('Probability')
plt.title('Logistic Regression: Probability of High Chance by TOEFL Score')
plt.show
Seems alright, but then I attempt to use the predict_proba function in Scikit-Learn to find the probabilities of Chance to Admit given some arbitrary value for TOEFL Score (in this case 108, 104, and 112):
eight = ap_lr.predict_proba(108)[:, 1]
four = ap_lr.predict_proba(104)[:, 1]
twelve = ap_lr.predict_proba(112)[:, 1]
print(eight, four, twelve)
Where I get:
[0.49939019] [0.44665597] [0.55213799]
To me, this seems to indicate that a TOEFL Score of 112 gives an individual a 55% chance of being admitted based on this data set. If I were to extend a vertical line from 112 on the x-axis to the sigmoid curve, I'd expect the intersection at around .90.
Am I interpreting/modeling this correctly? I realize that I'm using two different packages to calculate the model coefficients but with another model using a different data set, I seem to get correct predictions that fit the logistic curve.
Any ideas or am I completely modeling/interpreting this inaccurately?

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('log: ', metrics.accuracy_score(y_test, y_pred))
you can easily find model accuracy like this and decide which model you can use for your application data.

After some searching, Cross-Validated provided the correct answer to my question. Although it already exists on Cross-Validated, I wanted to provide this answer on Stack Overflow as well.
Simply put, Scikit-Learn automatically adds a regularization penalty to the logistic model that shrinks the coefficients. Statsmodels does not add this penalty. There is apparently no way to turn this off so one has to set the C= parameter within the LogisticRegression instantiation to some arbitrarily high value like C=1e9.
After trying this and comparing the Scikit-Learn predict_proba() to the sigmoidal graph produced by regplot (which uses statsmodels for its calculation), the probability estimates align.
Link to full post: https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels

Unable to obtain accuracy score for my linear

I am working on my regression model based on the IMDB data, to predict IMDB value. On my linear-regression, i was unable to obtain the accuracy score.
my line of code:
metrics.accuracy_score(test_y, linear_predicted_rating)
Error :
ValueError: continuous is not supported
if i were to change that line to obtain the r2 score,
metrics.r2_score(test_y,linear_predicted_rating)
i was able to obtain r2 without any error.
Any clue why i am seeing this?
Thanks.
Edit:
One thing i found out is test_y is panda data frame whereas the linear_predicted_rating is in numpy array format.

metrics.accuracy_score is used to measure classification accuracy, it can't be used to measure accuracy of regression model because it doesn't make sense to see accuracy for regression - predictions rarely can equal the expected values. And if predictions differ from expected values by 1%, the accuracy will be zero, though these predictions are great
Here are some metrics for regression: http://scikit-learn.org/stable/modules/classes.html#regression-metrics

NOTE: Accuracy (e.g. classification accuracy) is a measure for classification, not regression so we can't calculate accuracy for a regression model. For regression, one of the matrices we've to get the score (ambiguously termed as accuracy) is R-squared (R2).
You can get the R2 score (i.e accuracy) of your prediction using the score(X, y, sample_weight=None) function from LinearRegression as follows by changing the logic accordingly.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
r2_score = regressor.score(x_test,y_test)
print(r2_score*100,'%')
output (a/c to my model)
86.23%

The above is R squared value and not the accuracy :
# R squared value
metrics.explained_variance_score(y_test, predictions)

What does your variables look like. Code below works well.
from sklearn import metrics
test_y, linear_predicted_rating = [1,2,3,4], [1,2,3,5]
metrics.accuracy_score(test_y, linear_predicted_rating)

You can not predict the accuracy of regression model,however you can analyze your model using Mean absolute error ,Mean squared error ,Root mean squared error,Max error,median error R-square etc.
for reference
you can go this to gain more knowledge

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.