Using scikit-learn, the basic idea (with regression, for example) is to predict some "y" given a data vector "x" after having fit a model. Typical code would look like this (adapted from here):
from sklearn.svm import SVR
import numpy as np
n_samples, n_features = 10, 5
np.random.seed(0)
y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)
clf = SVR(C=1.0, epsilon=0.2)
clf.fit(X[:-1], y[:-1])
prediction = clf.predict(X[-1:])  # predict expects a 2D array, so keep the last row two-dimensional
print('prediction:', prediction[0])
print('actual:', y[-1])
My question is: Is it possible to fit some model (perhaps not SVR) given "x" and "y", and then predict "x" given "y"? In other words, something like this:
clf = someCLF()
clf.fit(x[:-1], y[:-1])
prediction = clf.predict(y[-1])
#where predict would return the data vector that could produce y[-1]
No. Many different vectors (X) may lead to the same result (Y), so the mapping cannot simply be inverted.
If what you really need is to predict the data you originally used as X, you might consider swapping the roles of X and Y.
Not possible in scikit, no.
You're asking about a generative or joint model of x and y. If you fit such a model you can do inference about the joint distribution p(x, y), or either of the conditional distributions p(x | y) or p(y | x). Naive Bayes is the most popular generative model, but you won't be able to do the kind of inference above with scikit-learn's version, and it will also produce poor estimates for anything other than trivial problems. Fitting good joint models is much harder than fitting conditional models of one variable given the rest.
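For intuition only, here is a minimal sketch (plain NumPy, not a scikit-learn API) of the joint-model idea: fit a single Gaussian to the stacked (x, y) data and condition on y to get a distribution over x. The data here is synthetic, and the approach is only reasonable when a joint Gaussian is a sensible model for your problem.

import numpy as np

np.random.seed(0)
n_samples, n_features = 200, 3
X = np.random.randn(n_samples, n_features)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(n_samples)

Z = np.column_stack([X, y])          # joint data (x, y)
mu = Z.mean(axis=0)
Sigma = np.cov(Z, rowvar=False)

# Partition the mean and covariance into x and y blocks
mu_x, mu_y = mu[:-1], mu[-1]
S_xy = Sigma[:-1, -1]
S_yy = Sigma[-1, -1]

# Conditional Gaussian mean: E[x | y] = mu_x + S_xy * (y - mu_y) / S_yy
y_query = y[-1]
x_given_y_mean = mu_x + S_xy * (y_query - mu_y) / S_yy
print(x_given_y_mean)   # one of many x vectors consistent with y_query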
# import sklearn and necessary libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
# Apply sklearn logistic regression on the given data X and labels Y
X_skl = np.vstack((df1, df2))  # 10000 x 2 array of features
Y_skl = Y                      # 10000 x 1 array of labels
N = len(Y_skl)                 # number of samples
LogR = LogisticRegression()
LogR.fit(X_skl,Y_skl)
Y_skl_hat = LogR.predict(X_skl)
# Calculate the accuracy
# Check the number of points where Y_skl is not equal to Y_skl_hat
error_count_skl = 0  # Count the number of misclassified points
for i in range(N):
    if Y_skl[i] != Y_skl_hat[i]:
        error_count_skl += 1
# Calculate the accuracy
Accuracy = 100 * (N - error_count_skl) / N
print("Accuracy(%):")
print(Accuracy)
Output:
Accuracy(%):
99.48
Hello,
I'm trying to apply a logistic regression model to an array X (of size 10000 x 2) and labels Y (10000 x 1)
using the sklearn library in Python. I'm completely lost because I've never used this library before. Can anyone help me with the coding?
Edited:
Sorry for the vague question; the goal is to find the training accuracy using the entire dataset X. Above is what I came up with. Can anyone take a look and see if it makes sense?
To calculate accuracy you can simply use this sklearn method.
sklearn.metrics.accuracy_score(y_true, y_pred)
In your case
sklearn.metrics.accuracy_score(Y_skl, Y_skl_hat)
If you want more details, take a look at the sklearn documentation for accuracy_score.
Also, you should train your model on some data and test it on other data to check whether the model generalizes and to avoid overfitting.
To split your data into train and test datasets you could use:
sklearn.model_selection.train_test_split
If you want more details, take a look at the sklearn documentation for train_test_split.
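Putting both suggestions together, a short sketch might look like this (assuming the X_skl and Y_skl arrays from your question; the 0.25 test fraction is just an illustrative choice):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out part of the data so the reported accuracy is not purely a training score
X_train, X_test, y_train, y_test = train_test_split(
    X_skl, Y_skl, test_size=0.25, random_state=0)

LogR = LogisticRegression()
LogR.fit(X_train, y_train)

print("Train accuracy (%):", 100 * accuracy_score(y_train, LogR.predict(X_train)))
print("Test accuracy (%): ", 100 * accuracy_score(y_test, LogR.predict(X_test)))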
I have the following code:
modelClf = AdaBoostRegressor(base_estimator=LinearRegression(), learning_rate=2, n_estimators=427, random_state=42)
modelClf.fit(X_train, y_train)
While trying to interpret and improve the results, I wanted to see the feature importances, however I get an error saying that linear regressions don't really do that kind of thing.
Alright, sounds reasonable, so I tried using .coef_, since it should work for linear regressions, but that in turn turned out to be incompatible with the AdaBoost regressor.
Is there any way to find the feature importances, or is it impossible when AdaBoost is used with a linear regression?
Issue12137 suggests adding support for this using coef_, although a choice needs to be made about how to normalize negative coefficients. There's also the question of when coefficients are really good representatives of importance (you should at least scale your data first), and the question of when adaptive boosting helps a linear model in the first place.
One way to do this quickly is to modify the LinearRegression class:
class MyLinReg(LinearRegression):
    @property
    def feature_importances_(self):
        return self.coef_  # assuming one output
modelClf = AdaBoostRegressor(base_estimator=MyLinReg(), ...)
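For completeness, a hypothetical follow-up, reusing the parameters and X_train/y_train from your question (base_estimator mirrors your code; newer scikit-learn versions call this parameter estimator):

# Sketch only: the AdaBoost parameters are copied from the question
modelClf = AdaBoostRegressor(base_estimator=MyLinReg(), learning_rate=2,
                             n_estimators=427, random_state=42)
modelClf.fit(X_train, y_train)
# AdaBoostRegressor averages the base estimators' feature_importances_,
# so this now works (negative coefficients still muddy the interpretation)
print(modelClf.feature_importances_)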
Checked with the code below; there is an attribute for feature importance:
import pandas as pd
import random
from sklearn.ensemble import AdaBoostRegressor
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X = df[['x1','x2']].values
y = df['y'].values
regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X, y)
regr.feature_importances_
Output: You can see that feature 2 is more important, as y is simply half of it (the data was created that way).
I want to use GPR to predict RSS from a deployed access point (AP). Since GPR gives the mean RSS and its variance too, GPR could be very useful in positioning and navigation systems. I read published journal articles related to GPR and got theoretical insight into it. Now, I want to implement it with real data (RSS). In my system, the inputs and corresponding outputs (observations) are:
X: 2D cartesian coordinates points
y: an array of RSS (-dBm) at the corresponding coordinates
After searching online, I found that I can use the sklearn library (in Python). I installed sklearn and successfully tested the sample code. The sample Python scripts are for 1D GPR. Since my input sets are 2D coordinates, I wanted to modify the sample code. I found that other people have also tried to do the same, for example: How to correctly use scikit-learn's Gaussian Process for a 2D-inputs, 1D-output regression?, How to make a 2D Gaussian Process Using GPML (Matlab) for regression?, and Is kriging suitable for high dimensional regression problems?.
The expected (predicted) values should be similar to y, but the values I got are very different. The testbed where I want to predict the RSS is 16 x 16 square meters, and I want to predict the RSS at one-meter intervals. I assume that the Gaussian Process predictor in the sample code is trained with gradient descent. I want to optimize the hyperparameters (theta, trained using y and X) with the Firefly algorithm instead.
In order to use my own data (2D input), which lines of the code am I supposed to edit? Similarly, how can I implement the Firefly algorithm (I've installed a firefly algorithm package using pip)?
Please help me with your kind suggestions and comments.
Thank you very much.
I have simplified the code a bit to illustrate potential issues:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
x_train = np.array([[0,0],[2,0],[4,0],[6,0],[8,0],[10,0],[12,0],[14,0],[16,0],[0,2],
[2,2],[4,2],[6,2],[8,2],[10,2],[12,2],[14,2],[16,2]])
y_train = np.array([-54,-60,-62,-64,-66,-68,-70,-72,-74,-60,-62,-64,-66,
-68,-70,-72,-74,-76])
# This is a test set?
x1min = 0
x1max = 16
x2min = 0
x2max = 16
x1 = np.linspace(x1min, x1max)
x2 = np.linspace(x2min, x2max)
x_test = np.array([x1, x2]).T
gp = GaussianProcessRegressor()
gp.fit(x_train, y_train)
# predict on training data
y_pred_train = gp.predict(x_train)
print('Avg MSE: ', ((y_train - y_pred_train)**2).mean()) # MSE is 0
# predict on test (?) data
y_pred_test = gp.predict(x_test)
# it is unclear how good this result is without y_test (i.e., held-out labeled test samples)
The expected (predicted) values should be similar to y.
Here, I have renamed y to y_train for clarity. After fitting the GP and predicting on x_train, we see that the model perfectly predicts the training samples, which is possibly what you meant. I am not sure if you mistakenly wrote lowercase x which I call x_test (instead of uppercase X which I call x_train) in the question. If we predict on x_test, we cannot really know how good the prediction is without the corresponding y_test values. So, this basic example is working as I would expect.
It also appears you are trying to create a grid for x_test; however, the current code does not do that: x1 and x2 are paired element-wise, so every test point has x1 equal to x2. If you want a grid, take a look at np.meshgrid, as in the sketch below.
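A minimal sketch of that grid, assuming the gp object fitted above and the 16 x 16 m testbed from your question (return_std also gives the predictive standard deviation you are interested in):

import numpy as np

# 1 m spacing over the 16 x 16 m testbed
x1 = np.arange(0, 17, 1)
x2 = np.arange(0, 17, 1)
g1, g2 = np.meshgrid(x1, x2)
x_test = np.column_stack([g1.ravel(), g2.ravel()])   # shape (289, 2)

# mean RSS and its standard deviation at every grid point
y_pred, y_std = gp.predict(x_test, return_std=True)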
I've been looking into machine learning recently and am now taking my first steps with scikit-learn and linear regression.
Here is my first sample:
from sklearn import linear_model
import numpy as np
X = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
y = [2,4,6,8,10,12,14,16,18,20]
clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.predict([[11]]))  # predict expects a 2D array
==> 22
The output is, as expected, 22 (apparently scikit-learn comes up with 2x as the hypothesis function). But when I create a slightly more complicated example with y = [1,4,9,16,25,36,49,64,81,100], my code just produces crazy output. I assumed linear regression would come up with a quadratic function (x^2), but instead I don't know what is going on. The output for 11 is now 99, so I guess my code tries to find some kind of linear function to map all the examples.
In the tutorial on linear regression that I did there were examples of polynomial terms, so I assumed scikit-learn's implementation would come up with a correct solution. Am I wrong? If so, how do I teach scikit-learn to consider quadratic, cubic, etc. functions?
LinearRegression fits a linear model to data. In the case of one-dimensional X values like you have above, the result is a straight line (i.e. y = a + b*x). In the case of two-dimensional values, the result is a plane (i.e. z = a + b*x + c*y). So you can't expect a linear regression model to perfectly fit a quadratic curve: it simply doesn't have enough model complexity to do that.
That said, you can cleverly transform your input data in order to fit a quadratic curve with a linear regression model. Consider the 2D case above:
z = a + b*x + c*y
Now let's make the substitution y = x^2. That is, we add a second dimension to our data which contains the quadratic term. Now we have another linear model:
z = a + b*x + c*x^2
The result is a model that is quadratic in x, but still linear in the coefficients! That is, we can solve it easily via a linear regression: this is an example of a basis function expansion of the input data. Here it is in code:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10)[:, None]
y = np.ravel(x) ** 2
p = np.array([1, 2])
model = LinearRegression().fit(x ** p, y)
model.predict([11 ** p])  # predict expects a 2D array
# [121]
This is a bit awkward, though, because the model requires 2D input to predict(), so you have to transform the input manually. If you want this transformation to happen automatically, you can use e.g. PolynomialFeatures in a pipeline:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(x, y).predict([[11]])
# [121]
This is one of the beautiful things about linear models: using basis function expansion like this, they can be very flexible, while remaining very fast! You could think about adding columns with cubic, quartic, or other terms, and it's still a linear regression. Or for periodic models, you might think about adding columns of sines, cosines, etc. In the extreme limit of this, the so-called "kernel trick" allows you to effectively add an infinite number of new columns to your data, and end up with a model that is very powerful – but still linear and thus still relatively fast! For an example of this type of estimator, take a look at scikit-learn's KernelRidge.
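If you want to see the kernel idea in action, here is a quick sketch with KernelRidge on the same x and y as above (the RBF kernel and its gamma/alpha values are arbitrary illustrative choices, not tuned):

from sklearn.kernel_ridge import KernelRidge

# The RBF kernel corresponds to an (implicit) infinite basis expansion of x
krr = KernelRidge(kernel='rbf', gamma=0.1, alpha=1.0)
krr.fit(x, y)
print(krr.predict(x))  # an approximate, smooth fit to the quadratic training data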
I'm trying to get the feel for SVM regression with a toy example. I generated random numbers between 1 and 100 as the predictors, then took their log and added Gaussian noise to create the target variables. Popping this data into sklearn's SVR module produces a reasonable-looking model:
However, when I augment the training data by throwing in the squares of the original predictor variables, everything goes haywire:
I understand that the RBF kernel does something analogous to taking powers of the original features, so throwing in the second feature is mostly redundant. However, is it really the case that SVMs are this bad at handling feature redundancy? Or am I doing something wrong?
Here is the code I used to generate these graphs:
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
# change to highest_power=2 to get the bad model
def create_design_matrix(x_array, highest_power=1):
return np.array([[x**k for k in range(1, highest_power + 1)] for x in x_array])
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0,0.2,N)
model = SVR(C=1.0, epsilon=0.1)
print(model)
X = create_design_matrix(x_array)
#print X
#print y_array
model = model.fit(X, y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro')
plt.plot(test_x, test_y)
plt.show()
I'd appreciate any help with this mystery!
It looks like your model's picking up on outliers too heavily, which is a symptom of error from variance. This makes sense, because adding polynomial features increases the variance of a model. You should try tweaking the bias-variance tradeoff via cross validation by tweaking parameters. The parameters to modify would be C, epsilon, and gamma. The gamma parameter's incredibly important when using an RBF kernel, so I'd start there.
Manually fiddling with these parameters (which is not recommended - see below) gave me the following model:
The parameters used here were C=5, epsilon=0.1, gamma=2**-15.
Choosing these parameters is really a task for a proper model selection framework. I prefer simulated annealing + cross validation. The best scikit-learn currently has is random grid search + crossval. Shameless plug for a simulated annealing module I helped with: https://github.com/skylergrammer/SimulatedAnnealing
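For reference, a rough sketch of the random-search-plus-cross-validation route in scikit-learn, using the X and y_array from your question (the parameter ranges below are guesses for illustration, not tuned values):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'C': np.logspace(-2, 3, 20),
    'epsilon': np.logspace(-3, 0, 20),
    'gamma': np.logspace(-15, 3, 19, base=2),  # includes the 2**-15 used above
}
search = RandomizedSearchCV(SVR(), param_dist, n_iter=50, cv=5, random_state=0)
search.fit(X, y_array)
print(search.best_params_)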
Note: Polynomial features are actually products of all combinations of size d (with replacement), not just the squares of features. In the second degree case, since you only have a single feature, these are equivalent. Scikit-learn has a class that'll calculate these though: sklearn.preprocessing.PolynomialFeatures
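To illustrate the note, a tiny example of what PolynomialFeatures produces for a two-feature input (the values are arbitrary):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])
# degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
print(PolynomialFeatures(degree=2).fit_transform(X_demo))
# [[1. 2. 3. 4. 6. 9.]]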