I would like to show my trained model in a confusion matrix, but when I run my method I'm getting an error on this line: ytrue = np.argmax(y_test, axis=1).tolist():
raise ValueError(f"axis must be fewer than the number of dimensions ({ndim})")
ValueError: axis must be fewer than the number of dimensions (1)
What is wrong in my code?
df = pd.read_csv("data.csv")
X = df.drop(['Rin'], axis=1)
y = df['Rin']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
model = Sequential()
model.add
...
model.save('tfmodel.h5')
model.load_weights('tfmodel.h5')
yhat = model.predict(X_test)
ytrue = np.argmax(y_test, axis=1).tolist()
yhat = np.argmax(yhat, axis=1).tolist()
print(confusion_matrix(ytrue, yhat))
From the comments:
Targets are 1D, so there is no need to call argmax on them; they're not a probability matrix.
After changing the targets to integers with ytrue = y_test.astype(int).tolist(), the issue was resolved.
(Paraphrased from Nicolas Gervais)
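A minimal sketch of the corrected evaluation, continuing from the code above (the shapes in the comments are assumptions based on the question):
import numpy as np
from sklearn.metrics import confusion_matrix
# model.predict returns one probability row per sample, shape (n_samples, n_classes),
# so argmax over axis=1 picks the predicted class for each row.
yhat = np.argmax(model.predict(X_test), axis=1).tolist()
# y_test is already 1D (one integer label per sample), so no argmax is needed.
ytrue = y_test.astype(int).tolist()
print(confusion_matrix(ytrue, yhat))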
Related
I have seen that many others on Stack Overflow have posted about this same problem, but I haven't been able to figure out how to apply those solutions to my example.
I have been working on a model that predicts an outcome of either 0 or 1 from a dataset containing 16 features. Everything has seemed to work fine (accuracy evaluation, epoch completion, etc.).
As mentioned, my training features include 16 different variables, but when I pass in a list of 16 values separate from the training dataset to try and make an individual prediction (of either 0 or 1), I get this error:
ValueError: Layer sequential_11 expects 1 input(s), but it received 16 input tensors.
Here is my code -
y = datas.Result
X = datas.drop(columns = ['Date', 'home_team', 'away_team', 'home_pitcher', 'away_pitcher', 'Result'])
X = X.values.astype('float32')
y = y.values.astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# split validation off the training portion so it doesn't overlap the test set
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size = 0.2)
model=keras.Sequential([
keras.layers.Dense(32, input_shape = (16,)),
keras.layers.Dense(20,activation=tf.nn.relu),
keras.layers.Dense(2,activation='softmax')
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['acc'])
history = model.fit(X_train,y_train,epochs=20, validation_data=(X_validation, y_validation))
# all variables within the features list are single values, e.g. .351, 11, .991, etc.
features = [t1_pqm,t2_pqm,t1_elo,t2_elo,t1_era,t2_era,t1_bb9,t2_bb9,t1_fip,t2_fip,t1_ba,t2_ba,t1_ops,t2_ops,t1_so,t2_so]
prediction = model.predict(features)
The model expects an input of shape (None, 16), but features has shape (16,) (a 1D list). The easiest solution is to make it a NumPy array with the right shape, (1, 16):
features = np.array([[t1_pqm,t2_pqm,t1_elo,t2_elo,t1_era,t2_era,t1_bb9,t2_bb9,t1_fip,t2_fip,t1_ba,t2_ba,t1_ops,t2_ops,t1_so,t2_so]])
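From there, predict returns one softmax row per sample; reading off the 0/1 class with argmax is a small assumption about how the question interprets the output:
import numpy as np
prediction = model.predict(features)  # shape (1, 2): one softmax row for the single sample
predicted_class = int(np.argmax(prediction, axis=1)[0])  # 0 or 1
print(predicted_class)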
The dataset contains 3 columns: comment, parent_comment and label (0 or 1). I try to predict the label for y_test, but I get an error:
Found input variables with inconsistent numbers of samples: [2, 758079]
Somehow the shape of y_predict is (2,).
Why is that, and how do I fix it?
When I do
y_test = y_test_["comment"]
everything is fine.
For some reason y_predict has shape (2,), while y_true has the expected shape (252694,).
x_train, y_test_ = train_test_split(df1, random_state=17)
y_test = y_test_[["comment", "parent_comment"]]
y_true = y_test_["label"]
tf_idf = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 2), max_features=700000, min_df=0.01)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
random_state=17, verbose=1)
# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
('logit', logit)])
tfidf_logit_pipeline.fit(x_train[['comment',"parent_comment"]], x_train["label"])
y_predict = tfidf_logit_pipeline.predict(y_test)
accuracy_score(y_predict, y_true)
I think the problem is in your train/test split. If you check this: https://datascience.stackexchange.com/questions/20199/train-test-split-error-found-input-variables-with-inconsistent-numbers-of-sam — you need X and y to have the same length for splitting.
A check-up for this is:
X.shape
Y.shape
Following that post, to fix this error:
Remove the extra list from inside np.array() when defining X, or remove the extra dimension afterwards with: X = X.reshape(X.shape[1:]).
Transpose X by running X = X.transpose() to get an equal number of samples in X and Y.
Update:
From your last comment: check the shapes of y_predict and y_true; they are probably not matching.
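The shape (2,) is most likely because iterating over a pandas DataFrame yields its column names, so passing the two-column y_test to the pipeline makes the TfidfVectorizer see just two "documents". A minimal sketch of a consistent fit/predict flow, assuming a single text column is used throughout (as the question notes, using only 'comment' works; names follow the question):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split once; rows of the features and labels stay aligned.
train_df, test_df = train_test_split(df1, random_state=17)
# Fit and predict on the same single text column, so the vectorizer
# sees one document per row in both phases.
tfidf_logit_pipeline.fit(train_df["comment"], train_df["label"])
y_predict = tfidf_logit_pipeline.predict(test_df["comment"])
y_true = test_df["label"]
print(y_predict.shape, y_true.shape)  # shapes must match before scoring
print(accuracy_score(y_true, y_predict))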
I am trying to calculate predicted values after running a polynomial regression in Python using the following code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
X = X_train.reshape(-1, 1)
X_predict = np.linspace(0, 10, 100)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X)
model = LinearRegression()
reg_poly = model.fit(X_train_poly, y_train)
y_predict = model.predict(X_predict)
After running it I get the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0.1010101 0.2020202 0.3030303 0.4040404 0.50505051 ......
Reshape your data either using array.reshape(-1, 1)
if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried reshaping the array as the error message said, so the last line of code became:
y_predict = model.predict(X_predict.reshape(-1,1))
But as a result I got this error:
ValueError: shapes (100,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Can someone please explain what I am doing wrong?
You forgot to prepare the data for your prediction the same way you prepared the training data for the model. In particular, you forgot to transform X_predict with PolynomialFeatures.
Since the shape of the data you predict on has to exactly match the shape used for training, you need to recreate for X_predict everything you did for X_train_poly. Therefore your line should look like:
y_predict = model.predict(poly.fit_transform(X_predict.reshape(-1, 1)))
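Since poly was already fitted on the training data, poly.transform is enough here; for PolynomialFeatures the result is the same either way, because the transform is deterministic. A short restatement of the fix:
X_predict_poly = poly.transform(X_predict.reshape(-1, 1))  # shape (100, 3): bias, x, x^2
y_predict = model.predict(X_predict_poly)                  # shape (100,)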
I looked into other answers, but I still cannot understand why the problem persists.
Classic machine learning practice with Iris dataset.
Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
dataset = load_iris()
X = np.array(dataset.data)
y = np.array(dataset.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
prediction=model.predict(X_test)
The shapes of all arrays:
X shape: (150, 4)
y shape: (150,)
X_train: (105, 4)
X_test: (45, 4)
y_train: (105,)
y_test: (45,)
prediction: (45,)
Trying to print model.score(y_test, prediction), I get the error.
I tried to convert y_test and prediction into 2D arrays using .reshape(-1, 1), and I got another error: query data dimension must match training data dimension.
It's not only about the solution; it's about understanding what's wrong.
It's often useful to look at the signature and docstrings of the functions you use. model.score, for example, has
Parameters
----------
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
in the docstrings to show you the exact type of input you should give it.
Here you would do model.score(X_test, y_test)
model.score will do both the prediction from X_test and the comparison with y_test.
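A small sketch of the equivalence (standard scikit-learn behavior):
from sklearn.metrics import accuracy_score
# For classifiers, model.score predicts from X internally and returns
# the mean accuracy against the true labels y.
print(model.score(X_test, y_test))
# Equivalent two-step version:
prediction = model.predict(X_test)
print(accuracy_score(y_test, prediction))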
I want to split my dataset into training and test sets based on years. The idea is to put the rows with years ranging from 2009-2017 in the train set and the 2018 data in the test set. Splitting the data was easy for the most part, but my models are throwing a lot of indexing errors.
X = ((df[df['Year'] < 2018]))
X_train = np.array(X.drop(['Usage'], 1))
X_test = np.array(X['Usage'])
y =((df[df['Year'] > 2017]))
y_train = np.array(y.drop(['Usage'], 1))
y_test = np.array(y['Usage'])
This is how I plan on splitting the data. The Usage column is my forecast column and contains continuous values. Applying a simple RandomForestRegressor() gave me this error in return:
ValueError: Number of labels=14495 does not match number of samples=382772
Aditya, my regressor model was pretty basic, but I'm attaching the code anyway. The columns being passed in X are as follows: X = ['Cust_Id', 'Usage', 'Plan_Group', 'Contract_Type', 'Cust_Status', 'Premise_Zip', 'Year', 'Month']
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluate predictions
print(model.score(X_test, y_test))
# accuracy = accuracy_score(y_test, (y_pred < 0.5).astype(int))
For most of the algorithms in the sklearn stack, you have a standard notation:
X, capital letter, is usually an array (even if there is only one feature) and represents each data point in vector form.
y, small letter, is usually a vector that denotes the labels, e.g. a class label or the value of a regression target.
You created X and y both as dataframes filtered by the Year attribute. Instead, keep the features in X and the target in y, and split each by year into train and test:
X = df.drop(['Usage'], axis=1)  # features: everything except the target
y = df['Usage']                  # target
X_train = X[X['Year'] < 2018]
X_test = X[X['Year'] > 2017]
y_train = y[X['Year'] < 2018]
y_test = y[X['Year'] > 2017]
And then you train on the basis of X_train and y_train
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
This is not the best way, though. I'll come back to edit the answer, but this should be enough to get you going for now.
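One short follow-up on the evaluation lines in the question (standard scikit-learn behavior, not specific to this answer):
# model.score on a regressor returns R^2 of the predictions against y_test;
# accuracy_score only makes sense for discrete class labels, so the
# commented-out line in the question would not apply to a continuous target.
print(model.score(X_test, y_test))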