SVM regression ruined by adding polynomial features - python

I'm trying to get the feel for SVM regression with a toy example. I generated random numbers between 1 and 100 as the predictors, then took their log and added gaussian noise to create the target variables. Popping this data into sklearn's SVR module produces a reasonable looking model:
However, when I augment the training data by throwing in the squares of the original predictor variables, everything goes haywire:
I understand that the RBF kernel does something analogous to taking powers of the original features, so throwing in the second feature is mostly redundant. However, is it really the case that SVMs are this bad at handling feature redundancy? Or am I doing something wrong?
Here is the code I used to generate these graphs:
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
# change to highest_power=2 to get the bad model
def create_design_matrix(x_array, highest_power=1):
    return np.array([[x**k for k in range(1, highest_power + 1)] for x in x_array])
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0,0.2,N)
model = SVR(C=1.0, epsilon=0.1)
print(model)
X = create_design_matrix(x_array)
#print X
#print y_array
model = model.fit(X, y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro')
plt.plot(test_x, test_y)
plt.show()
I'd appreciate any help with this mystery!

It looks like your model is picking up on outliers too heavily, which is a symptom of error from variance. This makes sense, because adding polynomial features increases the variance of a model. You should try adjusting the bias-variance tradeoff by tuning the hyperparameters with cross-validation. The parameters to modify would be C, epsilon, and gamma. The gamma parameter is incredibly important when using an RBF kernel, so I'd start there.
Manually fiddling with these parameters (which is not recommended - see below) gave me the following model:
The parameters used here were C=5, epsilon=0.1, gamma=2**-15.
Choosing these parameters is really a task for a proper model selection framework. I prefer simulated annealing + cross validation. The best scikit-learn currently has is random grid search + crossval. Shameless plug for a simulated annealing module I helped with: https://github.com/skylergrammer/SimulatedAnnealing
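As a rough illustration of cross-validated parameter selection, here is a minimal sketch using scikit-learn's GridSearchCV on data generated like the question's. The grid values, the sample size of 200 and the fold count are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
x = rng.uniform(1, 100, 200)
y = np.log(x) + rng.normal(0, 0.2, 200)
X = np.column_stack([x, x ** 2])  # original feature plus its square

# Hypothetical grid; the ranges are illustrative only.
param_grid = {
    "C": [1.0, 5.0],
    "epsilon": [0.1],
    "gamma": [2.0 ** -15, 2.0 ** -10, 2.0 ** -5, 1.0],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

After fitting, search.best_estimator_ holds an SVR refit on all of the data with the selected parameters.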
Note: Polynomial features are actually products of all combinations of size d (with replacement), not just the squares of features. In the second degree case, since you only have a single feature, these are equivalent. Scikit-learn has a class that'll calculate these though: sklearn.preprocessing.PolynomialFeatures
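To make the note above concrete, here is a small sketch of what PolynomialFeatures produces for a single sample with two features. The values 2 and 3 are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with two features, a=2 and b=3

# include_bias=False drops the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(X)
print(expanded)  # columns: a, b, a^2, a*b, b^2
```

The cross term a*b is what distinguishes full polynomial features from simply squaring each column.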

Related

Getting feature importances out of an Adaboosted linear regression

I have the following code:
modelClf = AdaBoostRegressor(base_estimator=LinearRegression(), learning_rate=2, n_estimators=427, random_state=42)
modelClf.fit(X_train, y_train)
While trying to interpret and improve the results, I wanted to see the feature importances, however I get an error saying that linear regressions don't really do that kind of thing.
Alright, sounds reasonable, so I tried using .coef_ since it should work for linear regressions, but that, in turn, turned out to be incompatible with the AdaBoost regressor.
Is there any way to find the feature importances, or is it impossible when AdaBoost is used on a linear regression?
Issue 12137 suggests adding support for this using coef_, although a choice needs to be made about how to normalize negative coefficients. There's also the question of when coefficients are really good representatives of importance (you should at least scale your data first). And then there's the question of when adaptive boosting helps a linear model in the first place.
One way to do this quickly is to modify the LinearRegression class:
class MyLinReg(LinearRegression):
    @property
    def feature_importances_(self):
        return self.coef_  # assuming one output
modelClf = AdaBoostRegressor(base_estimator=MyLinReg(), ...)
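A self-contained sketch of the same idea on synthetic data follows. The class name AbsCoefLinReg is hypothetical, and taking the absolute value of the coefficients is just one possible answer to the normalization question raised above, assuming a single output and comparably scaled features.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression


class AbsCoefLinReg(LinearRegression):
    """Hypothetical subclass: exposes |coef_| as feature_importances_."""

    @property
    def feature_importances_(self):
        # Assumption: single-output regression, features on similar scales.
        return np.abs(self.coef_)


rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 0.5 * X[:, 1] + rng.normal(0, 1.0, 200)  # only the second feature matters

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn releases.
model = AdaBoostRegressor(AbsCoefLinReg(), n_estimators=10, random_state=0)
model.fit(X, y)

# AdaBoost averages the per-estimator importances, weighted by estimator weight.
importances = model.feature_importances_
print(importances)
```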
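I checked with the code below, and there is an attribute for feature importance (note that this uses AdaBoostRegressor's default base estimator rather than LinearRegression):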
import pandas as pd
import random
from sklearn.ensemble import AdaBoostRegressor
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X = df[['x1','x2']].values
y = df['y'].values
regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X, y)
regr.feature_importances_
Output: you can see that feature 2 is more important, since y is simply half of it (the data was created that way).

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used the sklearn.linear_model.Ridge as my baseline and, after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought that maybe the linear model was too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit if I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
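For reference, the Nystroem approach described above might look roughly like the following sketch on synthetic data. The data, gamma, alpha and n_components values are assumptions for illustration, not the settings used on the Airbnb dataset.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, 1000)  # clearly non-linear target

# n_components controls the cost/accuracy trade-off mentioned above;
# gamma and alpha are illustrative values, not tuned.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0, n_components=50, random_state=0),
    Ridge(alpha=0.01),
)
model.fit(X, y)
r2 = model.score(X, y)  # R^2 on the training data
print(r2)
```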
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
This is a great question, but as often happens, there is no simple answer to a complex problem. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes a couple of university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regression:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs etc. Many of these features are also moving targets, as they tend to change over time. Even an R2 of 0.12 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why do people still use (linear) regression?
It is the simplest and fastest model. This has a lot of implications for real-time systems and statistical analysis, so it does matter.
It is often used as a baseline model. Before trying a fancy neural network architecture, it is helpful to know how much we improve compared to a naive method.
Sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables.
To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. the regression error predicted_value - true_value. Then, for each feature, you make a scatter plot where the horizontal axis is the feature value and the vertical axis is the error value. Ideally, residuals have a normal distribution and do not depend on the feature value. In other words, errors are more often small than large, and look similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The two examples above can be described as the residual mean varying with the feature value. Other problems include, but are not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
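The residual check described above can also be done numerically instead of by eye. The sketch below is an assumption-laden illustration: the data is synthetic and quadratic, and binned_residual_means is a hypothetical helper, not a library function. It averages the residuals in bins of the feature, so a mean that drifts away from zero across bins signals a missed non-linearity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x ** 2 + rng.normal(0, 0.5, 500)  # the true relation is quadratic


def binned_residual_means(features, x, y, n_bins=6):
    """Fit a linear model and return the mean residual per bin of x."""
    model = LinearRegression().fit(features, y)
    residuals = model.predict(features) - y  # predicted - true, as in the text
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    which = np.digitize(x, edges[1:-1])  # bin index 0 .. n_bins - 1
    return np.array([residuals[which == b].mean() for b in range(n_bins)])


linear_only = binned_residual_means(x[:, None], x, y)
with_square = binned_residual_means(np.column_stack([x, x ** 2]), x, y)
print(np.round(linear_only, 2))  # systematic pattern across bins
print(np.round(with_square, 2))  # close to zero in every bin
```

With only the linear feature the bin means are negative at the edges and positive in the middle; adding the squared feature flattens them.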
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is a great read on regression diagnostics for beginners.
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
# from sklearn.datasets import load_boston  # unused here, and removed in scikit-learn 1.2
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# keep only the neighbourhood field and one-hot encode it (non-numeric to numeric)
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
# price is a continuous target, so a regressor is used here rather than a classifier
from sklearn.ensemble import RandomForestRegressor
features = df_new.columns.values
clf = RandomForestRegressor()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to work with numeric data, so let's try the lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One-hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat performs slightly better, but this, too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

model selection for GaussianMixture by using GridSearch

I'd like to use the function GaussianMixture by scikit-learn, and I have to perform model selection.
I want to do it by using GridSearchCV, and I would like to use for the selection the BIC and the AIC.
Both these values are implemented into GaussianMixture(), but I don't know how to insert them into the definition of my custom scorer, since the function
make_scorer(score_func, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)
that I am using to create my custom scorer takes as input a function score_func, that has to be defined as
score_func(y, y_pred, **kwargs)
Can someone help me?
Using the BIC/AIC is an alternative to using cross validation. GridSearchCV selects models using cross validation. To perform model selection using the BIC/AIC we have to do something a little different. Let's take an example where we generate samples from two Gaussians, and then try to fit them using scikit-learn.
import numpy as np
X1 = np.random.multivariate_normal([0.,0.],[[1.,0.],[0.,1.]],10000)
X2 = np.random.multivariate_normal([10.,10.],[[1.,0.],[0.,1.]],10000)
X = np.vstack((X1,X2))
np.random.shuffle(X)
Method 1: Cross-validation
Cross validation involves splitting the data into pieces. One then fits the model on some of the pieces ('training') and tests how well it performs on the remaining pieces ('validating'). This guards against over-fitting. Here we will use two-fold cross validation, where we split the data in half.
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
#check 1->4 components
tuned_parameters = {'n_components': np.array([1,2,3,4])}
#construct grid search object that uses 2 fold cross validation
clf = GridSearchCV(GaussianMixture(),tuned_parameters,cv=2)
#fit the data
clf.fit(X)
#plot the number of Gaussians against their rank
plt.scatter(clf.cv_results_['param_n_components'],
            clf.cv_results_['rank_test_score'])
We can see that 2-fold cross validation favours two Gaussian components, as we expect.
Method 2: BIC/AIC
Instead of using cross-validation, we can evaluate the BIC using the best-fit model given each number of Gaussians. We then choose the model that has the lowest BIC. The procedure would be identical if one used the AIC (although it is a different statistic, and can provide different answers: but your code structure would be identical to below).
bic = np.zeros(4)
n = np.arange(1,5)
models = []
#loop through each number of Gaussians and compute the BIC, and save the model
for i,j in enumerate(n):
    #create mixture model with j components
    gmm = GaussianMixture(n_components=j)
    #fit it to the data
    gmm.fit(X)
    #compute the BIC for this model
    bic[i] = gmm.bic(X)
    #add the best-fit model with j components to the list of models
    models.append(gmm)
After carrying out this procedure, we can plot the number of Gaussians against the BIC.
plt.plot(n,bic)
We can see that the BIC is minimised for two Gaussians, so the best model according to this method also has two components.
Because I took 10000 samples from two very well-separated Gaussians (i.e. the distance between their centres is much larger than either of their dispersions), the answer was very clear-cut. This is not always the case, and often neither of these methods will confidently tell you which number of Gaussians to use, but rather some sensible range.
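If you still want to drive the search through GridSearchCV, as the question asks, one option is to pass a callable as the scoring argument instead of using make_scorer. The sketch below is only an illustration: neg_bic is a hypothetical helper name, the sample sizes are reduced for speed, and note that the BIC is then evaluated on each held-out fold, which mixes the two methods above rather than reproducing Method 2.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), 500),
    rng.multivariate_normal([10.0, 10.0], np.eye(2), 500),
])
rng.shuffle(X)  # shuffle rows so both folds contain both clusters


def neg_bic(estimator, X, y=None):
    # GridSearchCV maximises the score, so return the negative BIC.
    return -estimator.bic(X)


search = GridSearchCV(
    GaussianMixture(random_state=0),
    {"n_components": [1, 2, 3, 4]},
    scoring=neg_bic,
    cv=2,
)
search.fit(X)
best_n = search.best_params_["n_components"]
print(best_n)
```

The same pattern works for the AIC by returning -estimator.aic(X) instead.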

Linear Regression with quadratic terms

I've been looking into machine learning recently and now making my first steps with scikit and linear regression.
Here is my first sample
from sklearn import linear_model
import numpy as np
X = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
y = [2,4,6,8,10,12,14,16,18,20]
clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.predict([[11]]))
==> 22
The output is as expected 22 (apparently scikit comes up with 2x as the hypothesis function). But when I create a slightly more complicated example with y = [1,4,9,16,25,36,49,64,81,100] my code just creates crazy output. I assumed linear regression would come up with a quadratic function (x^2) but instead I don't know what is going on. The output for 11 is now: 99. So I guess my code tries to find some kind of linear function to map all the examples.
In the tutorial on linear regression that I did there were examples of polynomial terms, so I assumed scikits implementation would come up with a correct solution. Am I wrong? If so, how do I teach scikit to consider quadratic, cubic, etc... functions?
LinearRegression fits a linear model to data. In the case of one-dimensional X values like you have above, the result is a straight line (i.e. y = a + b*x). In the case of two-dimensional values, the result is a plane (i.e. z = a + b*x + c*y). So you can't expect a linear regression model to perfectly fit a quadratic curve: it simply doesn't have enough model complexity to do that.
That said, you can cleverly transform your input data in order to fit a quadratic curve with a linear regression model. Consider the 2D case above:
z = a + b*x + c*y
Now let's make the substitution y = x^2. That is, we add a second dimension to our data which contains the quadratic term. Now we have another linear model:
z = a + b*x + c*x^2
The result is a model that is quadratic in x, but still linear in the coefficients! That is, we can solve it easily via a linear regression: this is an example of a basis function expansion of the input data. Here it is in code:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10)[:, None]
y = np.ravel(x) ** 2
p = np.array([1, 2])
model = LinearRegression().fit(x ** p, y)
model.predict([11 ** p])  # predict() expects a 2D array: one row per sample
# approximately [121.]
This is a bit awkward, though, because the model requires 2D input to predict(), so you have to transform the input manually. If you want this transformation to happen automatically, you can use e.g. PolynomialFeatures in a pipeline:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(x, y).predict([[11]])  # one sample with one feature
# approximately [121.]
This is one of the beautiful things about linear models: using basis function expansion like this, they can be very flexible, while remaining very fast! You could think about adding columns with cubic, quartic, or other terms, and it's still a linear regression. Or for periodic models, you might think about adding columns of sines, cosines, etc. In the extreme limit of this, the so-called "kernel trick" allows you to effectively add an infinite number of new columns to your data, and end up with a model that is very powerful – but still linear and thus still relatively fast! For an example of this type of estimator, take a look at scikit-learn's KernelRidge.
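As an illustration of the periodic case mentioned above, here is a minimal sketch that adds sine and cosine columns and fits them with a plain LinearRegression. The signal, its amplitude of 3 and its phase of 0.5 are made-up values for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
t = np.linspace(0, 4 * np.pi, 200)
y = 3.0 * np.sin(t + 0.5) + rng.normal(0, 0.1, t.size)  # phase 0.5 is "unknown"

# sin(t + b) = cos(b) * sin(t) + sin(b) * cos(t), so sine and cosine
# columns let a linear model recover both amplitude and phase.
features = np.column_stack([np.sin(t), np.cos(t)])
model = LinearRegression().fit(features, y)

# The two coefficients are A*cos(b) and A*sin(b); their norm is the amplitude A.
amplitude = np.hypot(*model.coef_)
r2 = model.score(features, y)
print(amplitude, r2)
```

The model is non-linear in t but still linear in its coefficients, which is why ordinary least squares can fit it.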

Impossible to use sum in a dataframe while similar code works

I am taking a dataquest.io course and I observed something strange (but could not get an answer there). I am wondering why I can't reuse a code snippet that worked before in a situation that uses the same kind of data and should not cause an exception.
The lesson first teaches you to fit a regressor on a training set and predict on the same values, then calculate the MSE.
Then it shows that this would overfit and proposes a randomization process to avoid that. The problem is that, apart from the random splitting, the generated dataframes are very similar, but if I try to calculate my MSE on the final results the same way, it fails, and I have to change the code for an alternative.
Here are both codes:
First code
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Initialize the linear regression class.
regressor = LinearRegression()
# We're using 'value' as a predictor, and making predictions for 'next_day'.
# The predictors need to be in a dataframe.
# We pass in a list when we select predictor columns from "sp500" to
# force pandas not to generate a series.
# (?) I could not figure out why it is not necessary for "to_predict"
predictors = sp500[["value"]]
to_predict = sp500["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(predictors)
print(next_day_predictions)
MSE_frame=(next_day_predictions-to_predict)**2
#(?) can math.pow(frame_difference, 2) be used on a dataframe?
mse=MSE_frame.sum()/len(MSE_frame.index)
______________________________________________________________________________
Second code
import numpy as np
import random
# Set a random seed to make the shuffle deterministic.
np.random.seed(1)
random.seed(1)
#(?) are there any difference between both of these statements? Are they
# both necessary or just one out of two?
# Randomly shuffle the rows in our dataframe
sp500 = sp500.loc[np.random.permutation(sp500.index)]
# Select 70% of the dataset to be training data
highest_train_row = int(sp500.shape[0] * .7)
train = sp500.loc[:highest_train_row,:]
# Select 30% of the dataset to be test data.
test = sp500.loc[highest_train_row:,:]
regressor = LinearRegression()
regressor.fit(train[["value"]], train["next_day"])
predictions = regressor.predict(test[["value"]])
mse = sum((predictions - test["next_day"]) ** 2) / len(predictions)
regressor = LinearRegression()
predictors = train[["value"]]
to_predict = train["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(test[["value"]])
print(next_day_predictions)
sqr=(next_day_predictions-test["next_day"])**2
# The mistake was here: I was passing test[["next_day"]], which was not done in the first code. Stupid me.
mse=sum(sqr)/len(sqr.index)
#or
mse=sqr.sum()/len(sqr.index)
# This is the line which failed while it was identical to what was
#done before.
** It is worth noting that the two mse expressions don't yield exactly the same result: they are identical for the first ten decimals, but a comparison with == doesn't give True.
So, the problem was there:
sqr=(next_day_predictions-test["next_day"])**2
I originally wrote
sqr=(next_day_predictions-test[["next_day"]])**2
thus selecting with a list, which yields a DataFrame instead of a Series; this was not done in the first code.
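The difference between the two selections can be seen in a small sketch. The three-row frame below is a hypothetical stand-in for the test split of sp500, and the predictions array is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test split of sp500.
test = pd.DataFrame({"value": [1.0, 2.0, 3.0], "next_day": [1.5, 2.5, 2.0]})
predictions = np.array([1.0, 2.0, 3.0])  # 1-D array, like regressor.predict() returns

as_series = test["next_day"]    # single label: a Series of shape (3,)
as_frame = test[["next_day"]]   # list of labels: a DataFrame of shape (3, 1)

# Subtracting the Series from the 1-D array works element-wise.
sqr = (predictions - as_series) ** 2
mse = sqr.sum() / len(sqr)
print(type(as_series).__name__, type(as_frame).__name__, mse)
```

The one-column DataFrame has a second axis, so arithmetic against a 1-D array of predictions no longer lines up element by element the way it does with the Series.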
