I've been looking into machine learning recently and am now taking my first steps with scikit-learn and linear regression.
Here is my first sample:
from sklearn import linear_model
import numpy as np
X = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
y = [2,4,6,8,10,12,14,16,18,20]
clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.predict([[11]]))  # predict() expects a 2D array: one row per sample
==> [22.]
The output is 22, as expected (apparently scikit-learn comes up with 2x as the hypothesis function). But when I create a slightly more complicated example with y = [1,4,9,16,25,36,49,64,81,100], my code produces output I can't make sense of. I assumed linear regression would come up with a quadratic function (x^2), but the output for 11 is now 99. So I guess my code tries to find some kind of linear function to fit all the examples.
In the tutorial on linear regression that I did, there were examples with polynomial terms, so I assumed scikit-learn's implementation would come up with a correct solution. Am I wrong? If so, how do I teach scikit-learn to consider quadratic, cubic, etc. functions?
LinearRegression fits a linear model to data. In the case of one-dimensional X values like you have above, the result is a straight line (i.e. y = a + b*x). In the case of two-dimensional values, the result is a plane (i.e. z = a + b*x + c*y). So you can't expect a linear regression model to perfectly fit a quadratic curve: it simply doesn't have enough model complexity to do that.
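To see where the 99 in the question comes from, here is a minimal sketch (the variable names are mine) that fits a straight line to the quadratic targets and prints the fitted slope, intercept, and the prediction for 11:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same inputs as in the question, but with quadratic targets.
X = np.arange(1, 11).reshape(-1, 1)
y = np.ravel(X) ** 2

model = LinearRegression().fit(X, y)

# The best straight line through these points is y = -22 + 11*x.
slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict([[11]])[0]
print(slope, intercept, prediction)
```

The least-squares line evaluates to 99 at x = 11, which is exactly the "crazy" output: it is the best a straight line can do on this data.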
That said, you can cleverly transform your input data in order to fit a quadratic curve with a linear regression model. Consider the 2D case above:
z = a + b*x + c*y
Now let's make the substitution y = x^2. That is, we add a second dimension to our data which contains the quadratic term. Now we have another linear model:
z = a + b*x + c*x^2
The result is a model that is quadratic in x, but still linear in the coefficients! That is, we can solve it easily via a linear regression: this is an example of a basis function expansion of the input data. Here it is in code:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10)[:, None]
y = np.ravel(x) ** 2
p = np.array([1, 2])
model = LinearRegression().fit(x ** p, y)
model.predict([11 ** p])  # wrap in a list so the input is 2D: one row, two columns
# array([121.])
This is a bit awkward, though, because the model requires 2D input to predict(), so you have to transform the input manually. If you want this transformation to happen automatically, you can use e.g. PolynomialFeatures in a pipeline:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(x, y).predict([[11]])
# array([121.])
This is one of the beautiful things about linear models: using basis function expansion like this, they can be very flexible, while remaining very fast! You could think about adding columns with cubic, quartic, or other terms, and it's still a linear regression. Or for periodic models, you might think about adding columns of sines, cosines, etc. In the extreme limit of this, the so-called "kernel trick" allows you to effectively add an infinite number of new columns to your data, and end up with a model that is very powerful – but still linear and thus still relatively fast! For an example of this type of estimator, take a look at scikit-learn's KernelRidge.
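As a rough sketch of the periodic case, the snippet below adds sine and cosine columns and fits them with an ordinary LinearRegression. The data is synthetic and made up for illustration, and the variable names are my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic periodic data (made up): y = 1 + 3*sin(x) + 0.5*cos(x)
x = np.linspace(0, 4 * np.pi, 200)
y = 1.0 + 3.0 * np.sin(x) + 0.5 * np.cos(x)

# Basis function expansion: one column per basis function.
features = np.column_stack([np.sin(x), np.cos(x)])

periodic_model = LinearRegression().fit(features, y)
print(periodic_model.intercept_, periodic_model.coef_)
```

Because the targets are noise-free, the fit should recover the intercept and the two amplitudes used to generate the data, even though the model is still linear in its coefficients.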
Related
I have the following code:
modelClf = AdaBoostRegressor(base_estimator=LinearRegression(), learning_rate=2, n_estimators=427, random_state=42)
modelClf.fit(X_train, y_train)
While trying to interpret and improve the results, I wanted to see the feature importances, however I get an error saying that linear regressions don't really do that kind of thing.
Alright, sounds reasonable, so I tried using .coef_ instead, since it should work for linear regressions, but it turned out to be incompatible with the AdaBoost regressor.
Is there any way to find the feature importances, or is it impossible when AdaBoost is used on a linear regression?
Issue #12137 suggests adding support for this using coef_, although a choice needs to be made about how to normalize negative coefficients. There's also the question of when coefficients are really good representatives of importance (you should at least scale your data first). And then there's the question of when adaptive boosting helps a linear model in the first place.
One way to do this quickly is to modify the LinearRegression class:
class MyLinReg(LinearRegression):
    @property
    def feature_importances_(self):
        return self.coef_  # assuming one output

modelClf = AdaBoostRegressor(base_estimator=MyLinReg(), ...)
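For a runnable variant of this idea, here is a sketch that normalizes the absolute coefficients, which is only one possible answer to the "negative coefficients" question and is my own assumption. The class name AbsCoefLinReg and the synthetic data are hypothetical; the features are drawn on the same scale so that coefficients are comparable:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression


class AbsCoefLinReg(LinearRegression):
    """Hypothetical helper: exposes normalized |coef_| as feature_importances_."""

    @property
    def feature_importances_(self):
        weights = np.abs(self.coef_)  # assuming one output
        return weights / weights.sum()


# Synthetic data (made up): only the second feature drives the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions (base_estimator vs. estimator).
boosted = AdaBoostRegressor(AbsCoefLinReg(), n_estimators=20, random_state=42)
boosted.fit(X, y)
importances = boosted.feature_importances_
print(importances)
```

AdaBoostRegressor averages the per-estimator importances using the estimator weights, so the second feature should dominate here.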
Checked with below code, there is an attribute for feature importance:
import pandas as pd
import random
from sklearn.ensemble import AdaBoostRegressor
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X = df[['x1','x2']].values
y = df['y'].values
regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X, y)
regr.feature_importances_
In the output you can see that feature 2 (x2) is more important, since y is nothing but half of it (the data was created that way).
There seem to be two options for OLS fits in Python: the sklearn one and the statsmodels one. I have a preference for statsmodels because it gives the error on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from sklearn to log-transform my target. It would seem that I need to choose between getting the error on my fit coefficients in statsmodels and being able to transform my target in sklearn. Is there a good way to do both of these at the same time in either system?
In statsmodels it would be done like this:
import statsmodels.api as sm
X = sm.add_constant(X)
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())
This returns the fit with the coefficients and the error on them.
For sklearn you can use the TransformedTargetRegressor:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
# the fitted inner model lives in regressor_; the wrapper itself has no coef_
print('Coefficients: \n', regr.regressor_.coef_)
But there is no way to get the error on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?
EDIT
I found a good example for the special case I care about here
https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm
Just to add a lengthy comment here, I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.
Example:
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets
create some sample data:
df = pd.DataFrame(datasets.load_iris().data)
df.columns = datasets.load_iris().feature_names
X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']
Sklearn first:
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
    print(coef)
# -0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887
Statsmodels on transformed outcome:
X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()
print(ols_trans.params)
#const -0.458678
#sepal length (cm) 0.356758
#sepal width (cm) -0.296294
#dtype: float64
You see that in both cases, the coefficients are identical. That is, using the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS with the transformed outcome. TransformedTargetRegressor does not translate the coefficients back into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). This discussion here points in a similar direction: back-transforming betas is infeasible in most/many cases.
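The same point can be checked without statsmodels. The following sketch uses synthetic data that I made up for illustration and compares TransformedTargetRegressor with a plain LinearRegression fitted by hand on the transformed target:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target (made up for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(100, 2))
y = np.exp(1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.05, size=100))

regr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
regr.fit(X, y)

# Fit a plain LinearRegression on the transformed target by hand.
manual = LinearRegression().fit(X, np.log1p(y))

# Coefficients live in the transformed space; only predictions are mapped back.
same_coefs = np.allclose(regr.regressor_.coef_, manual.coef_)
same_preds = np.allclose(regr.predict(X), np.expm1(manual.predict(X)))
print(same_coefs, same_preds)
```

Both checks should come out true: the wrapper only applies func before fitting and inverse_func after predicting.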
What to do instead?
If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values where you vary the regressors or the coefficients. So, let me give you an example: if your goal is to say what a one-standard-error increase for sepal length does to the untransformed outcome, you can create the predicted values as fitted as well as the predicted values for the 1-sigma scenario (either by tampering with the coefficient, or by tampering with the corresponding column in X).
Example:
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] += 0.018 # this is one sigma
# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
    return inv_func(data.dot(coeffs))
# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)
Then you can say e.g. about the mean shift in the untransformed outcome:
variant1.mean()-original.mean()
#0.2523083548367202
In short, scikit-learn cannot help you calculate coefficient standard errors. However, if you opt to use it, you can calculate the errors yourself. In the question "Python scikit learn Linear Model Parameter Standard Error", @grisaitis provided a great answer explaining the main concepts behind it.
If you only want a plug-and-play function that works with scikit-learn, you can use this:
def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
                        y_true: 'np.ndarray', X: 'np.ndarray'):
    """Function that calculates the standard deviation of the coefficients of
    a linear regression.

    Parameters
    ----------
    reg : sklearn.linear_model.LinearRegression
        LinearRegression object which has been fitted
    y_true : np.ndarray
        array containing the target variable
    X : np.ndarray
        array containing the features used in the regression

    Returns
    -------
    beta_std
        Standard deviation of the regression coefficients
    """
    y_pred = reg.predict(X)  # get predictions
    errors = y_true - y_pred  # calculate residuals
    sigma_sq_hat = np.var(errors)  # estimate of the residual variance (ddof=0)
    sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T @ X)  # coefficient covariance
    return np.sqrt(np.diagonal(sigma_beta_hat))  # diagonal to recover variances
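As a hedged variant of the same calculation, the sketch below makes two assumptions of my own: the model was fitted with an intercept, so a column of ones is added to the design matrix, and the residual variance is divided by the degrees of freedom rather than by n. The function name ols_std_errors and the data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def ols_std_errors(reg, X, y):
    """Standard errors for [intercept, coefficients] of a fitted LinearRegression.

    Assumes reg was fitted with fit_intercept=True on the same X and y.
    """
    n_samples, n_features = X.shape
    # Design matrix with an explicit column of ones for the intercept.
    design = np.column_stack([np.ones(n_samples), X])
    residuals = y - reg.predict(X)
    # Unbiased residual variance: divide by n minus the number of fitted parameters.
    sigma_sq = residuals @ residuals / (n_samples - n_features - 1)
    covariance = sigma_sq * np.linalg.inv(design.T @ design)
    return np.sqrt(np.diag(covariance))


# Synthetic data (made up for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=50)

reg = LinearRegression().fit(X, y)
std_errors = ols_std_errors(reg, X, y)
print(std_errors)  # [se(intercept), se(slope)]
```

For a single feature, the slope entry should agree with the textbook formula sqrt(s^2 / sum((x - mean(x))^2)), which is a quick way to sanity-check the result.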
I have developed the code below for starting a project for svm method:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
housing = load_boston()
df = pd.DataFrame(np.c_[housing['data'], housing['target']],
columns= np.append(housing['feature_names'], ['target']))
features = df.columns.tolist()
label = features[-1]
features = features[:-1]
x_train = df[features].iloc[:400]
y_train = df[label].iloc[:400]
x_test = df[features].iloc[400:]
y_test = df[label].iloc[400:]
svr = svm.SVR(kernel='rbf')
svr.fit(x_train, y_train)
y_pred = svr.predict(x_test)
print(mean_absolute_error(y_pred, y_test))
Now I want to use my customized rbf kernel which is:
def my_rbf(feat, lbl):
    #feat = feat.values
    #lbl = lbl.values
    ans = np.array([])
    gamma = 0.000005
    for i in range(len(feat)):
        ans = np.append(ans, np.exp(-gamma * np.dot(feat[i]-lbl[i], feat[i]-lbl[i])))
    return ans
Then I changed the estimator to svm.SVR(kernel=my_rbf), but I get plenty of errors however I modify it. I also tried a simple function like np.dot(feat-lbl,feat-lbl), which worked fine in SVR.fit, but svr.predict raised an error saying that the shape of the input matrix has to be [n_samples_test, n_samples_train].
I'm stuck on these errors. Can anyone help me make this code work?
The custom kernel my_rbf you coded uses both X (features) and y (labels). You cannot evaluate this function during prediction, as you have no access to labels. The custom kernel is flawed.
Background
The RBF function is defined as below (from wiki):
K(x, x') = exp(-||x - x'||^2 / (2*sigma^2))
where x and x' are two feature (X) vectors and sigma is a free width parameter.
Let H(X) be a function which transforms a vector X to another space (normally of very, very high dimension). SVM needs to calculate the dot product between all pairs of the transformed feature vectors (i.e. all H(X)'s). So if H(X1) . H(X2) = K(X1, X2), then K is called the kernel function or kernelization of H. So instead of transforming the points X1 and X2 to very high dimensions and calculating the dot product there, K calculates it directly from X1 and X2.
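To make the H and K relationship concrete, here is a small sketch using a kernel whose feature map is easy to write down. It uses the degree-2 polynomial kernel on 2D inputs rather than the RBF kernel (whose H is infinite-dimensional); the function names are my own:

```python
import numpy as np


def feature_map(x):
    """Explicit map H for the kernel K(x, z) = (x . z)**2 on 2D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])


def poly_kernel(x, z):
    """Computes H(x) . H(z) without ever building H(x) or H(z)."""
    return np.dot(x, z) ** 2


x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

explicit = np.dot(feature_map(x1), feature_map(x2))
via_kernel = poly_kernel(x1, x2)
print(explicit, via_kernel)
```

Both routes give the same number, but the kernel only ever touches the original 2D vectors. Note that both arguments are feature vectors; labels never appear.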
Conclusion
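my_rbf is not a valid kernel function simply because it uses labels (Ys). It should operate only on the feature vectors.
According to this source, the RBF function I was looking for takes two feature matrices X and X' as inputs and outputs a matrix of shape [n_samples_X, n_samples_X'], as explained more thoroughly in the docs. During fit both arguments are the training features; during predict scikit-learn passes the test features first, which gives the [n_samples_test, n_samples_train] shape mentioned in the error. It is something like this: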
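import numpy as np
from sklearn.svm import SVR

def my_kernel(X,Y):
    # kernel matrix with one row per sample in X and one column per sample in Y
    K = np.zeros((X.shape[0],Y.shape[0]))
    for i,x in enumerate(X):
        for j,y in enumerate(Y):
            K[i,j] = np.exp(-1*np.linalg.norm(x-y)**2)
    return K

clf=SVR(kernel=my_kernel)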
which gives results equal to:
clf=SVR(kernel="rbf",gamma=1)
In terms of speed, it is not as efficient as the default RBF kernel of the SVM library. Static typing of the indexes with Cython, together with memory views for the NumPy arrays, could speed it up a little.
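Before reaching for Cython, a plain NumPy rewrite without Python loops may already help. This is a sketch under my own assumptions (the function name rbf_kernel_vectorized and the synthetic data are hypothetical), and it compares the result against the built-in kernel with gamma=1:

```python
import numpy as np
from sklearn.svm import SVR


def rbf_kernel_vectorized(X, Y, gamma=1.0):
    """RBF kernel matrix of shape (len(X), len(Y)) without Python loops."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, evaluated for all pairs at once.
    sq_dists = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Synthetic data (made up for illustration).
rng = np.random.RandomState(0)
X_train = rng.uniform(-2, 2, size=(80, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=80)
X_test = rng.uniform(-2, 2, size=(20, 2))

custom = SVR(kernel=rbf_kernel_vectorized).fit(X_train, y_train)
builtin = SVR(kernel="rbf", gamma=1).fit(X_train, y_train)
max_diff = np.max(np.abs(custom.predict(X_test) - builtin.predict(X_test)))
print(max_diff)
```

The printed difference between the two sets of predictions should be negligible; I have not benchmarked how the speed compares with the built-in kernel.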
I'm trying to get the feel for SVM regression with a toy example. I generated random numbers between 1 and 100 as the predictors, then took their log and added gaussian noise to create the target variables. Popping this data into sklearn's SVR module produces a reasonable looking model:
However, when I augment the training data by throwing in the squares of the original predictor variables, everything goes haywire:
I understand that the RBF kernel does something analogous to taking powers of the original features, so throwing in the second feature is mostly redundant. However, is it really the case that SVMs are this bad at handling feature redundancy? Or am I doing something wrong?
Here is the code I used to generate these graphs:
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
# change to highest_power=2 to get the bad model
def create_design_matrix(x_array, highest_power=1):
    return np.array([[x**k for k in range(1, highest_power + 1)] for x in x_array])
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0,0.2,N)
model = SVR(C=1.0, epsilon=0.1)
print(model)
X = create_design_matrix(x_array)
#print X
#print y_array
model = model.fit(X, y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro')
plt.plot(test_x, test_y)
plt.show()
I'd appreciate any help with this mystery!
It looks like your model's picking up on outliers too heavily, which is a symptom of error from variance. This makes sense, because adding polynomial features increases the variance of a model. You should try tweaking the bias-variance tradeoff via cross validation by tweaking parameters. The parameters to modify would be C, epsilon, and gamma. The gamma parameter's incredibly important when using an RBF kernel, so I'd start there.
Manually fiddling with these parameters (which is not recommended - see below) gave me the following model:
The parameters used here were C=5, epsilon=0.1, gamma=2**-15.
Choosing these parameters is really a task for a proper model selection framework. I prefer simulated annealing + cross validation. The best scikit-learn currently has is random grid search + crossval. Shameless plug for a simulated annealing module I helped with: https://github.com/skylergrammer/SimulatedAnnealing
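As a starting point with what scikit-learn ships, here is a sketch of a cross-validated grid search over C, epsilon, and gamma on data generated like in the question. The grid values are my own guesses, not tuned recommendations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Toy data as in the question: log(x) plus noise, with x and x**2 as features.
rng = np.random.RandomState(0)
x = rng.uniform(1, 100, 200)
y = np.log(x) + rng.normal(0, 0.2, 200)
X = np.column_stack([x, x ** 2])

# Hypothetical search grid; the ranges are guesses, not tuned values.
param_grid = {
    "C": [1, 5, 25],
    "epsilon": [0.05, 0.1, 0.2],
    "gamma": [2.0 ** -20, 2.0 ** -15, 2.0 ** -10],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

RandomizedSearchCV takes the same estimator and can sample from distributions instead of a fixed grid, which is closer to the random search mentioned above.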
Note: Polynomial features are actually products of all combinations of size d (with replacement), not just the squares of features. In the second degree case, since you only have a single feature, these are equivalent. Scikit-learn has a class that'll calculate these though: sklearn.preprocessing.PolynomialFeatures
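A quick sketch of what that class produces for two features (the sample values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a = 2 and b = 3.
X = np.array([[2.0, 3.0]])

# Degree-2 expansion: columns are 1, a, b, a^2, a*b, b^2.
expanded = PolynomialFeatures(degree=2).fit_transform(X)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

Note the a*b cross term, which you would miss by only appending squares.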
Using scikit-learn, the basic idea (with regression, for example) is to predict some "y" given a data vector "x" after having fit a model. Typical code would look like this (adapted from here):
from sklearn.svm import SVR
import numpy as np
n_samples, n_features = 10, 5
np.random.seed(0)
y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)
clf = SVR(C=1.0, epsilon=0.2)
clf.fit(X[:-1], y[:-1])
prediction = clf.predict(X[-1:])  # slice keeps the input 2D: one row, n_features columns
print('prediction:', prediction[0])
print('actual:', y[-1])
My question is: Is it possible to fit some model (perhaps not SVR) given "x" and "y", and then predict "x" given "y". In other words, something like this:
clf = someCLF()
clf.fit(x[:-1], y[:-1])
prediction = clf.predict(y[-1])
#where predict would return the data vector that could produce y[-1]
No. Many different vectors X may lead to the same result Y, but not vice versa.
If you need to predict the data you originally used as X, consider swapping the roles of X and Y.
Not possible in scikit, no.
You're asking about a generative or joint model of x and y. If you fit such a model you can do inference about the distribution p(x, y), or either of the conditional distributions p(x | y) or p(y | x). Naive Bayes is the most popular generative model, but you won't be able to do the kind of inferences above with scikit's version. It will also produce bad estimates for anything other than trivial problems. Fitting good joint models is much harder than conditional models of one variable given the rest.
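For a feel of what such a joint model looks like in the simplest case, here is a sketch that assumes x and y are jointly Gaussian. That is a strong assumption of mine, and the data is synthetic. It fits p(x, y) from samples and then reads off the conditional mean of x given y:

```python
import numpy as np

# Synthetic data (made up): scalar x and y with y = 2*x + noise.
rng = np.random.RandomState(0)
x = rng.normal(loc=1.0, scale=2.0, size=5000)
y = 2.0 * x + rng.normal(scale=0.1, size=5000)

# Fit a joint Gaussian p(x, y): just the sample mean and covariance.
mean = np.array([x.mean(), y.mean()])
cov = np.cov(np.vstack([x, y]))


def conditional_mean_x_given_y(y_value):
    """E[x | y] for a bivariate Gaussian: mu_x + cov_xy / var_y * (y - mu_y)."""
    return mean[0] + cov[0, 1] / cov[1, 1] * (y_value - mean[1])


x_guess = conditional_mean_x_given_y(4.0)
print(x_guess)
```

With y = 4 the estimate lands near x = 2, as the data-generating rule suggests. With a vector-valued x or a non-Gaussian relationship this becomes much harder, which is the point made above.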