Predicted array has a strange shape - python

The dataset contains 3 columns: comment, parent_comment and label (0 or 1). I am trying to predict the label for y_test, but I get an error:
Found input variables with inconsistent numbers of samples: [2, 758079]
For some reason y_predict has the shape (2,), while y_true has the expected shape (252694,).
Why is that, and how do I fix it?
When I do
y_test = y_test_["comment"]
everything is fine.
Here is my code:
x_train, y_test_ = train_test_split(df1, random_state=17)
y_test = y_test_[["comment", "parent_comment"]]
y_true = y_test_["label"]
tf_idf = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 2), max_features=700000, min_df=0.01)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                           random_state=17, verbose=1)
# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
                                 ('logit', logit)])
tfidf_logit_pipeline.fit(x_train[['comment',"parent_comment"]], x_train["label"])
y_predict = tfidf_logit_pipeline.predict(y_test)
accuracy_score(y_predict, y_true)

I think the problem is in your train/test split. According to https://datascience.stackexchange.com/questions/20199/train-test-split-error-found-input-variables-with-inconsistent-numbers-of-sam, X and y need to have the same length (number of samples) for splitting.
A quick check for this is:
X.shape
Y.shape
The linked post suggests the following to fix the error:
Remove the extra list from inside of np.array() when defining X, or remove the extra dimension afterwards with the following command: X = X.reshape(X.shape[1:]).
Transpose X by running X = X.transpose() to get an equal number of samples in X and Y.
Update:
Following your last comment: check the shapes of y_predict and y_true; they probably do not match.
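Another possible cause, not confirmed in the thread, is worth checking: iterating over a pandas DataFrame yields its column names, so a TfidfVectorizer that receives a two-column frame sees two "documents" (the strings "comment" and "parent_comment"), which would explain the 2 in the error message. The sketch below uses a made-up three-row frame; toy_df and the joined text Series are assumptions for illustration, not the asker's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the asker's df1.
toy_df = pd.DataFrame({
    "comment": ["great point", "no way", "sure thing"],
    "parent_comment": ["what do you think", "is it true", "can you help"],
    "label": [0, 1, 0],
})

# Iterating over a DataFrame yields column names, not rows.
docs_seen = list(toy_df[["comment", "parent_comment"]])
print(docs_seen)  # ['comment', 'parent_comment']

# So the vectorizer produces 2 rows regardless of how many rows the frame has.
bad_matrix = TfidfVectorizer().fit_transform(toy_df[["comment", "parent_comment"]])

# One possible fix: join the text columns into a single Series of strings.
text = toy_df["comment"] + " " + toy_df["parent_comment"]
good_matrix = TfidfVectorizer().fit_transform(text)

print(bad_matrix.shape[0], good_matrix.shape[0])  # 2 3
```

Passing a single column, as in y_test_["comment"], works for the same reason: a Series iterates over its values, one string per row.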

Related

ValueError: `axis` must be fewer than the number of dimensions (1)

I would like to show the results of my trained model in a confusion matrix, but when I run my method I get an error on this line of code, ytrue = np.argmax(y_test, axis=1).tolist():
raise ValueError(f"axis must be fewer than the number of dimensions ({ndim})")
ValueError: axis must be fewer than the number of dimensions (1)
What is wrong in my code?
df = pd.read_csv("data.csv")
X = df.drop(['Rin'], axis=1)
y = df['Rin']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
model = Sequential()
model.add
...
model.save('tfmodel.h5')
model.load_weights('tfmodel.h5')
yhat = model.predict(X_test)
ytrue = np.argmax(y_test, axis=1).tolist()
yhat = np.argmax(yhat, axis=1).tolist()
print(confusion_matrix(ytrue, yhat))
From the comments:
The targets are 1-D, so there is no need to call argmax on them; they are not a probability matrix.
After changing the targets to integers with ytrue = y_test.astype(int).tolist(), the issue was resolved.
(Paraphrased from Nicolas Gervais)
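A minimal sketch of the point made in the comments, using small made-up arrays (y_test and yhat_proba here are hypothetical stand-ins for the asker's data and model output): argmax along axis=1 is only meaningful for the 2-D matrix of predicted probabilities, while the 1-D targets are already class labels.

```python
import numpy as np

# Hypothetical stand-ins: 1-D integer targets and a 2-D matrix of class probabilities.
y_test = np.array([0, 2, 1, 2])
yhat_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])

# argmax along axis=1 picks the most probable class for each row.
yhat = np.argmax(yhat_proba, axis=1).tolist()

# The targets are already class labels, so they are used as they are.
ytrue = y_test.astype(int).tolist()

# Asking for axis=1 on a 1-D array fails with an axis error (a ValueError subclass in NumPy).
try:
    np.argmax(y_test, axis=1)
    axis_error_raised = False
except ValueError:
    axis_error_raised = True

print(ytrue, yhat, axis_error_raised)  # [0, 2, 1, 2] [0, 2, 1, 0] True
```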

Error when calculating predicted values of polynomial regression python

I am trying to calculate predicted values after running a polynomial regression in Python using the following code:
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
X = X_train.reshape(-1, 1)
X_predict = np.linspace(0, 10, 100)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X)
model = LinearRegression()
reg_poly = model.fit(X_train_poly, y_train)
y_predict = model.predict(X_predict)
After running it I get the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0.1010101 0.2020202 0.3030303 0.4040404 0.50505051 ......
Reshape your data either using array.reshape(-1, 1)
if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried reshaping the array as was said in the error message, so the last line of code would be:
y_predict = model.predict(X_predict.reshape(-1,1))
But as a result I got this error:
ValueError: shapes (100,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Can someone please explain what I am doing wrong?
You forgot to prepare the data for prediction in the same way you prepared the training data for the model. In particular, you did not transform X_predict with PolynomialFeatures.
The number of features in the data you predict on has to match the number used for training (3 columns for degree 2, not 1), so you need to apply to X_predict the same steps that produced X_train_poly. Because poly is already fitted on the training data, calling transform is enough; fit_transform gives the same result here, since PolynomialFeatures only learns the number of input features. Therefore your line should look like:
y_predict = model.predict(poly.transform(X_predict.reshape(-1, 1)))
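The shapes involved can be checked with a short sketch. It follows the question's setup but skips train_test_split for brevity and uses a local RandomState (both are simplifications, not the asker's exact code). It calls transform on the prediction grid, which reuses the already fitted poly and yields the same columns that fit_transform would.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 15) + rng.randn(15) / 5
y = np.sin(x) + x / 6 + rng.randn(15) / 10

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(x.reshape(-1, 1))   # columns: 1, x, x^2
model = LinearRegression().fit(X_train_poly, y)

X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
X_predict_poly = poly.transform(X_predict)            # same 3 columns as in training
y_predict = model.predict(X_predict_poly)

print(X_train_poly.shape, X_predict_poly.shape, y_predict.shape)  # (15, 3) (100, 3) (100,)
```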

what is the problem with my code linreg.predict() not giving out right answer?

The question given to me was:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
I tried the problem myself and failed. Then I found another person's code on GitHub that was very similar to mine, but it worked.
So what is the difference between my code and the other person's code?
Here is some basic code prior to my question:
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
Here is my approach
pred=np.linspace(0,10,100).reshape(100,1)
k=np.zeros((4,100))
for count,i in enumerate([1,3,6,9]):
    poly = PolynomialFeatures(degree=i)
    X_poly = poly.fit_transform(X_train.reshape(-1,1))
    linreg = LinearRegression()
    linreg.fit(X_poly,y_train.reshape(-1,1))
    pred = poly.fit_transform(pred.reshape(-1,1))
    t=linreg.predict(pred)
    #print(t) #used for debugging
    print("### **** ####") #used for debugging
    k[count,:]=t.reshape(1,-1)
print(k)
Here is the code that works
result = np.zeros((4, 100))
for i, degree in enumerate([1, 3, 6, 9]):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_train.reshape(11,1))
    linreg = LinearRegression().fit(X_poly, y_train)
    y=linreg.predict(poly.fit_transform(np.linspace(0,10,100).reshape(100,1)))
    result[i, :] = y
print(result)
My approach raised an error:
13 print("### **** ####")
---> 14 k[count,:]=t.reshape(1,-1)
15
16
ValueError: could not broadcast input array from shape (200) into shape (100)
while the other code worked fine.
The difference lies in the argument for linreg.predict. You are overwriting your pred variable with the result of poly.fit_transform, which changes its shape from (100,1) to (100,2) in the first iteration of the loop (degree 1 adds a bias column). In the second iteration, pred.reshape(-1,1) flattens those 200 values into shape (200,1), so t contains 200 predictions and no longer fits into a row of k, resulting in the error you are facing.
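The shape drift can be reproduced without any model. The sketch below repeats only the overwriting pattern from the failing loop for the first two degrees, then shows that a separately named grid (grid is a name introduced here for illustration) keeps 100 rows for every degree.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

pred = np.linspace(0, 10, 100).reshape(100, 1)
shapes = [pred.shape]

for degree in [1, 3]:
    # Same pattern as the failing loop: the transformed array replaces the grid.
    pred = PolynomialFeatures(degree=degree).fit_transform(pred.reshape(-1, 1))
    shapes.append(pred.shape)

print(shapes)  # [(100, 1), (100, 2), (200, 4)]

# Keeping the original grid under its own name avoids the drift.
grid = np.linspace(0, 10, 100).reshape(-1, 1)
fixed_shapes = [PolynomialFeatures(degree=d).fit_transform(grid).shape
                for d in [1, 3, 6, 9]]
print(fixed_shapes)  # [(100, 2), (100, 4), (100, 7), (100, 10)]
```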

ValueError: Expected 2D array, got 1D array instead persists after converting 1D array to 2D

I looked into other answers but still cannot understand why the problem persists.
This is classic machine learning practice with the Iris dataset.
Code:
dataset=load_iris()
X = np.array(dataset.data)
y = np.array(dataset.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
prediction=model.predict(X_test)
The shapes of all arrays:
X shape: (150, 4)
y shape: (150,)
X_train: (105, 4)
X_test: (45, 4)
y_train: (105,)
y_test (45,)
prediction: (45,)
When I try to print model.score(y_test, prediction), I get the error.
I tried to convert y_test and prediction into 2D arrays by using .reshape(-1,1) and I get another error: query data dimension must match training data dimension.
It's not only about the solution, it's about understanding what's wrong.
It is often useful to look at the signature and docstrings of the functions you use. model.score, for example, has
Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.
in its docstring, which shows the exact kind of input it expects.
In your call, the 1-D y_test lands in the X slot, which is why you get the "Expected 2D array" error; reshaping it to (45, 1) then fails because the model was trained on 4 features.
Here you would call model.score(X_test, y_test).
model.score does both the prediction from X_test and the comparison with y_test.
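As a quick check of that statement, the sketch below compares model.score(X_test, y_test) with accuracy_score applied to explicit predictions. The random_state=0 argument is an addition for reproducibility and is not in the question's code.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# random_state is an assumption added here so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = KNeighborsClassifier().fit(X_train, y_train)
prediction = model.predict(X_test)

# score takes the features and the true labels, and predicts internally.
score_from_model = model.score(X_test, y_test)

# The same number, computed from the explicit predictions.
score_from_metric = accuracy_score(y_test, prediction)

print(score_from_model == score_from_metric)  # True
```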

Python sklearn polynomial preprocessing and dimensional problems

I am experimenting with fitting polynomial transformations of degree 1 to 3 to the original data, using 100 predicted values each. I 1) reshaped the original data, 2) applied fit_transform to the training set and to the prediction space (of data features), 3) obtained a linear prediction on the prediction space, and 4) exported the predictions into an array, using the following code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
pred_data = np.linspace(0,10,100).reshape(-1,1)
results = []
for i in [1, 2, 3] :
    poly = PolynomialFeatures(degree = i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    pred_data = poly.fit_transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data)
    results.append(pred)
results
However, I did not get what I wanted. Python did not return an array of shape (3, 100) as I was expecting; instead, I received an error message:
ValueError: shapes (100,10) and (4,1) not aligned: 10 (dim 1) != 4 (dim 0)
This seems to be a dimension problem resulting either from the reshape or from the fit_transform step. I am confused, as this was supposed to be a straightforward test. Could anyone enlighten me on this? It would be much appreciated.
Thank you.
First, as I suggested in the comments, you should always call just transform() on test data (pred_data in your case).
But even if you do that, a different error occurs. The error is due to this line:
pred_data = poly.fit_transform(pred_data)
Here you are replacing the original pred_data with the transformed version. The first iteration of the loop works, but the second and third iterations fail, because they need the original pred_data of shape (100,1) defined in this line above the for loop:
pred_data = np.linspace(0,10,100).reshape(-1,1)
Change the name of the variable inside the loop to something else and everything works.
for i in [1, 2, 3] :
    poly = PolynomialFeatures(degree = i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    # Changed here
    pred_data_poly1 = poly.transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data_poly1)
    results.append(pred)
results
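Note that results in the loop above is a list of three arrays of shape (100, 1), not yet the (3, 100) array the question asks for. One way to get there, shown as a sketch that rebuilds the question's data (stacked is a name introduced here, and hoisting train_test_split out of the loop is a simplification), is to stack the columns and transpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 100
x = (np.linspace(0, 10, n) + np.random.randn(n) / 5).reshape(-1, 1)
y = (np.sin(x[:, 0]) + n / 6 + np.random.randn(n) / 10).reshape(-1, 1)

# The split does not depend on the degree, so it is done once here.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
pred_data = np.linspace(0, 10, 100).reshape(-1, 1)

results = []
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree)
    linreg = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    # pred_data itself is never overwritten; each prediction has shape (100, 1).
    results.append(linreg.predict(poly.transform(pred_data)))

# Put the three (100, 1) columns side by side, then transpose: one row per degree.
stacked = np.hstack(results).T
print(stacked.shape)  # (3, 100)
```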
