In my problem there are four features (X): a, b, c, d, and two dependent variables (Y): e and f. I have a data set containing values for all of these variables. How can I predict the values of e and f for new a, b, c, d values using Support Vector Regression with scikit-learn in Python?
I'm very new to ML and would really appreciate some guidance, since I found it very difficult to follow the scikit-learn documentation on SVR.
This is what I have done so far with the help of an example in the sklearn documentation.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR

train = pd.read_csv('/Desktop/test.csv')
X = train.iloc[:, 4]
y = train.iloc[:, 4:5]
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
y_rbf = svr_rbf.fit(X, y).predict(X)
lw = 2
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_rbf, color='navy', lw=lw, label='RBF model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
This gives the error,
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I'm assuming that your target variables need to be independently predicted here, so correct me if I'm wrong. I've slightly modified the sklearn doc example to illustrate what you need to do. Please do consider scaling your data before performing the regression.
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt
n_samples, n_features = 10, 4 # your four features a,b,c,d are the n_features
np.random.seed(0)
y_e = np.random.randn(n_samples)
y_f = np.random.randn(n_samples)
# your input array should be formatted like this.
X = np.random.randn(n_samples, n_features)
#dummy parameters - use grid search etc to find best params
svr_rbf = svm.SVR(kernel='rbf', C=1e3, gamma=0.1)
# Fit and predict for one target, do the same for the other
y_pred_e = svr_rbf.fit(X, y_e).predict(X)
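Since scaling was mentioned above, here is a minimal sketch (reusing the dummy X, y_e, y_f from the example) of wrapping the SVR in a pipeline with StandardScaler and fitting one pipeline per target; the hyperparameters are placeholders, not tuned values.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One pipeline per target so each SVR keeps its own fitted state
pipe_e = make_pipeline(StandardScaler(), svm.SVR(kernel='rbf', C=1e3, gamma=0.1))
pipe_f = make_pipeline(StandardScaler(), svm.SVR(kernel='rbf', C=1e3, gamma=0.1))
y_pred_e = pipe_e.fit(X, y_e).predict(X)
y_pred_f = pipe_f.fit(X, y_f).predict(X)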
Assuming your data file has 6 columns, with the feature values in the first 4 columns and the targets (which you call 'dependents') in the final 2 columns, then I think you need to do this instead:
train = pd.read_csv('/Desktop/test.csv')
X = train.iloc[:, 0:4]   # columns 0-3: the four features a, b, c, d
y = train.iloc[:, 4:6]   # columns 4-5: the two targets e, f
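From there, a rough sketch of fitting one SVR per target and predicting e and f for new a, b, c, d values might look like this (the hyperparameters and the new sample values are just placeholders):
svr_e = SVR(kernel='rbf', C=1e3, gamma=0.1).fit(X, y.iloc[:, 0])   # target e
svr_f = SVR(kernel='rbf', C=1e3, gamma=0.1).fit(X, y.iloc[:, 1])   # target f

new_abcd = [[1.0, 2.0, 3.0, 4.0]]   # one new sample of a, b, c, d (made-up values)
e_pred = svr_e.predict(new_abcd)
f_pred = svr_f.predict(new_abcd)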
I've trained a logistic regression model like this:
reg = LogisticRegression(random_state = 40)
cvreg = GridSearchCV(reg, param_grid={'C':[0.05,0.1,0.5],
'penalty':['none','l1','l2'],
'solver':['saga']},
cv = 5)
cvreg.fit(X_train, y_train)
Now, to show the feature importances, I've tried this code, but I don't get the names of the coefficients in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The coefficients are:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])
The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.
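If you do have to define the names yourself, a minimal sketch could look like this (the feature names below are made up and should match the column order of your X_train):
import matplotlib.pyplot as plt

features = ['feat_a', 'feat_b', 'feat_c', 'feat_d']  # hypothetical names, one per column of X_train
importance = cvreg.best_estimator_.coef_[0]

plt.bar(features, importance)
plt.xticks(rotation=45)
plt.show()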
I have run a KNN model. Now I want to plot the residuals against the predicted values. Every example I found on different websites shows that I first have to run a linear regression model, but I couldn't understand how to do this. Can anyone help? Thanks in advance.
Here is my model:
import numpy as np
from sklearn import neighbors

# df is the full DataFrame, loaded earlier
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

x_train = train.iloc[:, [2, 5]].values
y_train = train.iloc[:, 4].values
x_validate = validate.iloc[:, [2, 5]].values
y_validate = validate.iloc[:, 4].values
x_test = test.iloc[:, [2, 5]].values
y_test = test.iloc[:, 4].values

clf = neighbors.KNeighborsRegressor(n_neighbors=6)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_validate)
Residuals are nothing but how much your predicted values differ from the actual values, i.e. actual values minus predicted values. In your case the predictions were made on the validation set, so it's residuals = y_validate - y_pred. For a residual-vs-predicted plot, put the predicted values on the x-axis and the residuals on the y-axis:
import matplotlib.pyplot as plt
residuals = y_validate - y_pred
plt.scatter(y_pred, residuals)
plt.show()
What is the question? The residuals are simply y_validate - y_pred. Now use seaborn's regplot.
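A minimal sketch of that suggestion with seaborn, reusing y_validate and y_pred from the question (regplot draws the scatter plus a fitted line; axis labels added for clarity):
import seaborn as sns
import matplotlib.pyplot as plt

residuals = y_validate - y_pred
sns.regplot(x=y_pred, y=residuals)
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()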
I am getting this error
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
while executing this code
# SVR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
# Load dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting the SVR to the data set
regressor = SVR(kernel = 'rbf', gamma = 'auto')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
You need to understand how SVM works. Your training data is a matrix of shape (n_samples, n_features), which means the SVM operates in a feature space of n_features dimensions. It therefore cannot predict a value for a scalar input: predict always expects a 2D array of shape (n_samples, n_features), even when n_features is 1. So if your data set has 5 feature columns, you predict values for row vectors of 5 columns. See the example below.
import numpy as np
from sklearn.svm import SVR

# Data: 200 instances of 5 features each
X = np.random.randint(1, 100, size=(200, 5))
y = np.random.randint(0, 2, size=200)

reg = SVR()
reg.fit(X, y)

y_test = np.array([[0, 1, 2, 3, 4]])  # Input to .predict must be 2-dimensional
reg.predict(y_test)
The same applies to the linear and polynomial regression models from that exercise (lin_reg, lin_reg_2 and poly_reg are assumed to have been fitted earlier): pass a 2D array instead of a scalar.
# Predicting a new result with Linear Regression
X_test = np.array([[6.5]])
print(lin_reg.predict(X_test))
# Predicting a new result with Polynomial Regression
print(lin_reg_2.predict(poly_reg.fit_transform(X_test)))
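The same rule fixes the SVR call from the question itself: wrap the single value in a 2D array before calling predict. A minimal sketch:
# Predicting a new result with SVR: predict expects shape (n_samples, n_features)
y_pred = regressor.predict(np.array([[6.5]]))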
It's an old exercise about prediction using regression on the Gapminder data. It uses a "prediction space" to compute the predictions.
Q1. Why should I create a "prediction space"? What is it used for?
Q2. What is the point of computing the predictions over the "prediction space"?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
The data looks like this:
Country,Year,life,population,income,region
Afghanistan,1800,28.211,3280000,603.0,South Asia
Slovak Republic,1960,70.47800000000001,4137224,8693.0,Europe & Central Asia
# Create arrays for features and target variable
y = df.life.values
X_fertility = df.fertility.values

# Reshape X_fertility and y
y = y.reshape(-1, 1)
X_fertility = X_fertility.reshape(-1, 1)
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
I believe that you are taking a course from DataCamp.
I stumbled upon this too; the answer is that prediction_space and y_pred are used to construct the regression line in the graph.
NOTE: for those reading this who don't understand what I'm talking about, the code snippet in the question is missing the graph-plotting code:
# Plot regression line
import matplotlib.pyplot as plt
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
Together with y_pred it serves as a baseline, so you can inspect the residuals and compute the R^2 value.
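As a small illustration of that last point, the R^2 value can be read off directly with the regressor's score method; a rough sketch reusing the fitted reg and X_fertility from the question:
# R^2 of the linear fit on the original data (not on the prediction space)
print(reg.score(X_fertility, y))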
As we know, in the logistic regression algorithm we predict 1 when the predicted probability sigmoid(X*theta) is greater than 0.5. I want to raise the precision, so I want to change the predict function to predict 1 only when this probability is greater than 0.7, or some other value above 0.5.
If I wrote the algorithm myself I could easily do it, but with the sklearn package I have no idea what to do.
Can anyone give me a hand?
To explain the question more clearly, here is the predict function written in Octave:
p = sigmoid(X*theta);
for i = 1:size(p)(1)
  if p(i) >= 0.6
    p(i) = 1;
  else
    p(i) = 0;
  endif;
endfor
The LogisticRegression predictor object from sklearn has a predict_proba method which outputs the probability that an input example belongs to a certain class. You can use this method together with your own threshold to get the functionality you desire.
An example:
from sklearn import linear_model
import numpy as np
np.random.seed(1337) # Seed random for reproducibility
X = np.random.random((10, 5)) # Create sample data
Y = np.random.randint(2, size=10)
lr = linear_model.LogisticRegression().fit(X, Y)
prob_example_is_one = lr.predict_proba(X)[:, 1]
my_threshold = 0.7  # Our custom probability threshold
predictions_above_threshold = prob_example_is_one > my_threshold
Here's the docstring for predict_proba:
Probability estimates.
The returned estimates for all classes are ordered by the
label of classes.
For a multi_class problem, if multi_class is set to be "multinomial"
the softmax function is used to find the predicted probability of
each class.
Else use a one-vs-rest approach, i.e calculate the probability
of each class assuming it to be positive using the logistic function.
and normalize these values across all the classes.
Parameters
----------
X : array-like, shape = [n_samples, n_features]
Returns
-------
T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model,
where classes are ordered as they are in ``self.classes_``.
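To check whether the higher cutoff actually improves precision, one could compare both cutoffs with sklearn's precision_score; a quick sketch reusing lr, X, Y and prob_example_is_one from the example above (0.7 is just the example value):
from sklearn.metrics import precision_score

default_preds = lr.predict(X)                           # implicit 0.5 cutoff
custom_preds = (prob_example_is_one > 0.7).astype(int)  # custom 0.7 cutoff

# zero_division=0 avoids a warning if no sample is predicted positive
print(precision_score(Y, default_preds, zero_division=0))
print(precision_score(Y, custom_preds, zero_division=0))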
This works for both binary and multi-class classification:
from sklearn.linear_model import LogisticRegression
import numpy as np
#X = some training data
#y = labels for training data
#X_test = some test data
clf = LogisticRegression()
clf.fit(X, y)

threshold = 0.7  # chosen probability cutoff
probabilities = clf.predict_proba(X_test)
# argmax picks the first class whose probability exceeds the threshold (class 0 if none do)
predictions = clf.classes_[np.argmax(probabilities > threshold, axis=1)]
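For the plain binary case, a slightly simpler variant of the same idea works as well (a sketch; the threshold value is just an example):
threshold = 0.7  # example probability cutoff
proba_positive = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
predictions = (proba_positive >= threshold).astype(int)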