Error when calculating predicted values of polynomial regression in Python

I am trying to calculate predicted values after running a polynomial regression in Python using the following code:
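import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(0)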
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
X = X_train.reshape(-1, 1)
X_predict = np.linspace(0, 10, 100)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X)
model = LinearRegression()
reg_poly = model.fit(X_train_poly, y_train)
y_predict = model.predict(X_predict)
After running it I get the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0.1010101 0.2020202 0.3030303 0.4040404 0.50505051 ......
Reshape your data either using array.reshape(-1, 1)
if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried reshaping the array as was said in the error message, so the last line of code would be:
y_predict = model.predict(X_predict.reshape(-1,1))
But as a result I got this error:
ValueError: shapes (100,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Can someone please explain what I am doing wrong?

You did not prepare the prediction data the same way you prepared the training data. In particular, X_predict was never passed through PolynomialFeatures.
The number of features passed to predict has to match the number used for training: the model was fitted on X_train_poly, which has 3 columns (1, x, x^2), so X_predict needs the same transformation. Reshape it to 2D and transform it with the already fitted poly object (transform is enough here; there is no need to fit again):
y_predict = model.predict(poly.transform(X_predict.reshape(-1, 1)))
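For reference, here is a minimal end-to-end sketch of the corrected workflow, reusing the question's data generation. The name X_predict_poly is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Fit the polynomial expansion on the training data only
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
model = LinearRegression().fit(X_train_poly, y_train)

# Apply the same, already fitted expansion to the prediction grid
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
X_predict_poly = poly.transform(X_predict)
y_predict = model.predict(X_predict_poly)

print(X_train_poly.shape[1], X_predict_poly.shape, y_predict.shape)
```

Both arrays now have 3 columns, so the matrix product inside predict lines up.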

Related

Linear Regression fitting

I have first done a train/test split then fitted that data to a LinearRegression model shown below
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.4, random_state = 101)
Log_m = LinearRegression()
Log_m.fit(X_train,y_train)
predictions = Log_m.predict(X_test)
I have been given another test data frame and wanted to fit that to the Log_m model which has been created. So I did
predictions_t = Log_m.predict(fin_df1_t)
But I get the error message :
ValueError: shapes (1450,262) and (282,) not aligned: 262 (dim 1) != 282 (dim 0)
These are the shapes of dataframes
fin_df1_t (1450,262)
X_test (556,282)
X_train (834,282)
y_test (556,)
y_train (834,)
The new test data has 262 feature columns, while X_train and X_test have 282, so predict will always raise this error. The data you pass to predict must have the same feature columns as the data the model was fitted on.
X_train and X_test, for example, both have 282 columns, which is why predicting on X_test works.
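The thread does not say why the new frame has fewer columns. One common cause is one-hot encoding each frame separately, so categories absent from the new data produce no column. Under that assumption, a sketch of aligning the new frame to the training columns with DataFrame.reindex could look like this (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical training features, already one-hot encoded
X_train = pd.DataFrame({"age": [25, 40], "city_A": [1, 0], "city_B": [0, 1]})

# Hypothetical new data: no row had city B, so that column is missing,
# and the column order differs
new_data = pd.DataFrame({"city_A": [1, 1], "age": [33, 51]})

# Reorder to the training layout and fill missing indicator columns with 0
aligned = new_data.reindex(columns=X_train.columns, fill_value=0)
print(aligned.shape)
```

Filling with 0 only makes sense for indicator columns; if the missing columns are real measurements, the preprocessing itself needs to be revisited.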

Predicted array has a strange shape

The dataset contains 3 columns: comment, parent_comment and label (0 or 1). I am trying to predict label for y_test, but I get an error:
Found input variables with inconsistent numbers of samples: [2, 758079]
Somehow the shape of y_predict is (2,0).
Why is that and how to fix it?
When I do
y_test = y_test_["comment"]
everything is fine.
For some reason y_predict has shape (2,) while y_true has the expected shape (252694,).
x_train, y_test_ = train_test_split(df1, random_state=17)
y_test = y_test_[["comment", "parent_comment"]]
y_true = y_test_["label"]
tf_idf = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 2), max_features=700000, min_df=0.01)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
random_state=17, verbose=1)
# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
('logit', logit)])
tfidf_logit_pipeline.fit(x_train[['comment',"parent_comment"]], x_train["label"])
y_predict = tfidf_logit_pipeline.predict(y_test)
accuracy_score(y_predict, y_true)
I think the problem lies in your train/test split, which is not fully shown here. According to https://datascience.stackexchange.com/questions/20199/train-test-split-error-found-input-variables-with-inconsistent-numbers-of-sam, X and y must have the same number of samples when you split.
A quick check is to compare:
X.shape
Y.shape
That post suggests two ways to fix the error:
Remove the extra list from inside np.array() when defining X, or remove the extra dimension afterwards with X = X.reshape(X.shape[1:]).
Transpose X with X = X.transpose() so that X and Y have the same number of samples.
Update:
Following your last comment: check the shapes of y_predict and y_true; they probably do not match.
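A possible explanation for the count of 2, not confirmed in the thread: scikit-learn's text vectorizers iterate over their input, and iterating over a DataFrame yields its column labels rather than its rows. A two-column frame would then look like 2 documents. The sketch below uses a made-up toy frame and, as one assumed workaround, joins the two text columns into a single Series before vectorizing.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real dataset
df = pd.DataFrame({
    "comment": ["great point", "totally wrong", "nice one", "bad take"],
    "parent_comment": ["what do you think", "is this right", "my idea", "my opinion"],
    "label": [1, 0, 1, 0],
})

# Iterating over a DataFrame gives column names, so a vectorizer sees 2 "documents"
as_documents = list(df[["comment", "parent_comment"]])
print(as_documents)

# Assumed workaround: one string per row, so there is one document per sample
text = df["comment"] + " " + df["parent_comment"]
features = TfidfVectorizer().fit_transform(text)
print(features.shape[0], len(df))
```

With one document per row, the number of feature rows matches the number of labels.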

ValueError: Expected 2D array, got 1D array instead persists after converting 1D array to 2D

I looked into other answers but still cannot understand why the problem persists.
Classic machine learning practice with Iris dataset.
Code:
dataset=load_iris()
X = np.array(dataset.data)
y = np.array(dataset.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
prediction=model.predict(X_test)
The shapes of all arrays:
X shape: (150, 4)
y shape: (150,)
X_train: (105, 4)
X_test: (45, 4)
y_train: (105,)
y_test (45,)
prediction: (45,)
When I try to print model.score(y_test, prediction), I get the error.
I tried converting y_test and prediction into 2D arrays with .reshape(-1,1), and I get another error: query data dimension must match training data dimension.
It's not only about the solution, it's about understanding what's wrong.
It's often useful to look at the signature and docstrings of the functions you use. model.score, for example, has
Parameters
----------
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
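in its docstring, which tells you exactly what kind of input it expects.
Here you would call model.score(X_test, y_test).
model.score does both steps itself: it predicts from X_test and then compares the predictions with y_test. Passing y_test as the first argument makes the model treat the labels as feature rows, which is why the shapes do not match.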
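As a small sketch of that point, the score returned by the classifier can be compared with accuracy_score computed from explicit predictions. The random_state value is an assumption added here for reproducibility; the variable names score_from_model and score_from_metric are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = KNeighborsClassifier().fit(X_train, y_train)
prediction = model.predict(X_test)

# score() takes features and true labels, and predicts internally
score_from_model = model.score(X_test, y_test)
# accuracy_score() takes true labels and ready-made predictions
score_from_metric = accuracy_score(y_test, prediction)

print(score_from_model, score_from_metric)
```

For a classifier, both calls report mean accuracy, so the two numbers agree.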

How to convert pandas.core.series.Series type to 2D array?

I am trying to train a model using KNNClassifier. I split the data as follows:
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.30, random_state=42)
y_train= y_train.astype('int')
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
When I try to test it, I get a value error.
pre = neigh.predict(y_test)
Expected 2D array, got 1D array instead:
array=[0. 1. 1. ... 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
My y_test is of type pandas.core.series.Series
So how do I convert pandas.core.series.Series to array of 2D to make this testing work?
I have tried to convert y_test to dataframe and then to array, but I get another value error and I am stuck.
y_test = pd.DataFrame(y_test)
y_test = y_test.as_matrix().reshape(-1,1)
pre = neigh.predict(y_test)
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 1 while Y.shape[1] == 6038
You need to pass X_test to predict, not y_test.
X_test holds the independent variables (features) used to test the model, and y_test holds the actual target values, which are compared with the predicted values afterwards.
Example:
pre = neigh.predict(X_test)
To measure accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pre)
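As for the literal question of turning a Series into a 2D array (which is not what predict needs here), a minimal sketch follows. Note that as_matrix() was removed in pandas 1.0, so to_numpy() is used instead; the sample values are made up.

```python
import pandas as pd

y_test = pd.Series([0, 1, 1], name="label")

# Series -> 1D NumPy array -> single-column 2D array
as_column = y_test.to_numpy().reshape(-1, 1)

# Alternatively, a one-column DataFrame is already 2D
as_frame = y_test.to_frame()

print(as_column.shape, as_frame.shape)
```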

Python sklearn polynomial preprocessing and dimensional problems

I am experimenting with fitting polynomial transformations of degree 1 to 3 to the original data, producing 100 predicted values for each degree. I 1) reshaped the original data, 2) applied fit_transform to the training set and to the prediction space (of data features), 3) obtained a linear prediction on the prediction space, and 4) collected the predictions into an array, using the following code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
pred_data = np.linspace(0,10,100).reshape(-1,1)
results = []
for i in [1, 2, 3] :
    poly = PolynomialFeatures(degree = i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    pred_data = poly.fit_transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data)
    results.append(pred)
results
However, I did not get what I wanted. Python did not return an array of shape (3, 100) as I was expecting; instead I received this error message:
ValueError: shapes (100,10) and (4,1) not aligned: 10 (dim 1) != 4 (dim 0)
This seems to be a dimension problem resulting either from the "reshape" or from the "fit_transform" step. I am confused, as this was supposed to be a straightforward test. Could anyone enlighten me on this? It would be much appreciated.
First, as I suggested in a comment, you should call only transform() on test data (pred_data in your case), never fit_transform().
But even if you do that, an error still occurs. It is caused by this line:
pred_data = poly.fit_transform(pred_data)
Here you replace the original pred_data with its transformed version. The first iteration of the loop works, but in the second and third iterations the input is no longer valid, because the transform needs the original pred_data of shape (100,1) defined above the for loop:
pred_data = np.linspace(0,10,100).reshape(-1,1)
Give the transformed array a different name inside the loop and everything works.
for i in [1, 2, 3] :
    poly = PolynomialFeatures(degree = i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    # Changed here
    pred_data_poly1 = poly.transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data_poly1)
    results.append(pred)
results
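To see why overwriting pred_data breaks later iterations, the column counts can be inspected directly. This is a small illustrative sketch; the names step1, step2 and expected are introduced here and do not come from the original code.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

pred_data = np.linspace(0, 10, 100).reshape(-1, 1)

# Degree 1 on a single feature: bias column plus x
step1 = PolynomialFeatures(degree=1).fit_transform(pred_data)

# Degree 2 applied to the already expanded array: it now sees 2 input features
step2 = PolynomialFeatures(degree=2).fit_transform(step1)

# Degree 2 applied to the original single feature, which is what the model expects
expected = PolynomialFeatures(degree=2).fit_transform(pred_data)

print(step1.shape, step2.shape, expected.shape)  # (100, 2) (100, 6) (100, 3)
```

A model fitted on 3 columns cannot predict on 6, hence the "shapes not aligned" error once pred_data has been overwritten.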
