I'm learning the basics of machine learning in Python (scikit-learn), and when I tried to implement the K-nearest neighbors algorithm I got this error: ValueError: Found input variables with inconsistent numbers of samples: [426, 143]. I have no idea how to deal with it.
This is my code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = load_breast_cancer()
X_train, y_train, X_test, y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target,
                                                    random_state=0)
clf = KNeighborsClassifier(n_neighbors = 6)
clf.fit(X_train, y_train)
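train_test_split returns a list in the order X_train, X_test, y_train, y_test.
You've assigned the return values to the wrong variables, so you are fitting with the training data and the test data instead of the training data and the training labels. That is why the error reports 426 samples (the training rows) against 143 (the test rows).
It should be
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=0)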
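To make the ordering concrete, here is a minimal sketch using the same dataset and split settings as the question. The shape printouts and the test_accuracy variable are additions for illustration only; the shapes show where the 426 and 143 in the error message come from.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# train_test_split returns the pieces grouped per input array:
# both halves of the data first, then both halves of the target.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

# Features and labels of the same subset must have the same number of rows.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

If the first number in each printed pair differs between a feature array and its label array, the unpacking order is the first thing to check.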
Load the popular digits dataset from the sklearn.datasets module and assign it to the variable digits.
Split digits.data into two sets named X_train and X_test. Also, split digits.target into two sets, Y_train and Y_test.
Hint: Use the train_test_split() function from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from the X_train set and Y_train labels, with default parameters. Name the model svm_clf.
Evaluate the model's accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the output below.
(1347, 64)
(450, 64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?
You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
See the train_test_split documentation for the stratify parameter.
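As a rough illustration of what stratify=y changes, the sketch below repeats the split from the question with stratification and compares the per-digit class fractions in the two subsets. The names train_fractions and test_fractions are introduced here for illustration. The printed score is not guaranteed to match the one in the question: it depends on the scikit-learn version, because the default gamma of SVC changed from 'auto' to 'scale' in version 0.22.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=30, stratify=y
)

# With stratify=y, each digit makes up nearly the same fraction of both subsets.
train_fractions = np.bincount(y_train) / len(y_train)
test_fractions = np.bincount(y_test) / len(y_test)
print(np.round(train_fractions, 3))
print(np.round(test_fractions, 3))

svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test, y_test))
```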
I am trying to create a machine learning program in scikit-learn. I am using a CSV file to store data, and have decided to use a pandas DataFrame to import and format this data. I cannot figure out how to fit the model to this DataFrame.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight from the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]".
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
You're assigning the return values in the wrong order. It should be:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
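You should correct the order of X_train, X_test, y_train and y_test like this (note that this version also passes test_size=0.2, which holds out 20% of the rows for testing, whereas the question's train_size=0.2 keeps only 20% for training):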
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.
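The following sketch shows the corrected unpacking with a pandas DataFrame. The file awd.csv is not available, so the data here is made up: 50 rows with the same column names, a size inferred from the 10 and 40 in the error message. With train_size=0.2 as in the question, the training set gets 10 rows and the test set 40.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for awd.csv: 50 rows with the same column names.
rng = np.random.default_rng(0)
ages = np.repeat(np.arange(1, 26), 2)
weights = 15 + 5 * ages + rng.normal(0, 3, size=ages.size)
data = pd.DataFrame({"Ages": ages, "Weights": weights})

X = data.loc[:, ["Ages"]]   # 2-D DataFrame of features
y = data.loc[:, "Weights"]  # 1-D Series of targets

# Both halves of X come first, then both halves of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, train_size=0.2
)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"Test set score: {lr.score(X_test, y_test):.3f}")
print(f"Training set score: {lr.score(X_train, y_train):.3f}")
```

Selecting the feature with a list of column names keeps X two-dimensional, which is the shape scikit-learn estimators expect.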
When I apply StandardScaler to my training data after train_test_split, the score for the training data is OK but the score for my test data makes no sense.
I tried LinearRegression(normalize=True) and it also made the score of my test data go crazy.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
print(lr.score(X_train_sc, y_train))
print(lr.score(X_test_sc, y_test))
Results are:
0.961269156232134
-1.5466488732709964e+19
Why??? Please, help!
Note: If I do not run the StandardScaler, then both my scores make perfect sense.
This is my code. Models 1 to 4 were run in the cells above without any errors.
I will also show my train/test split.
[Screenshots in the original post: train/test split and SMOTE code; error message]
from sklearn.ensemble import VotingClassifier
estimators=[('lr', model1,('RF1', model2), ('nn', model3),('RF2',model4))]
model5 = VotingClassifier(estimators, voting='hard')
model5.fit(X_train_nns,y_train_nns)
ypred_en=model5.predict(X_test)
model5.score(X_test, y_test)
This is how I split and oversampled the data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,
random_state=42)
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_train_nns, y_train_nns = sm.fit_sample(X_train, y_train.values.ravel())
Can anyone tell me what I am doing wrong here?
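No answer is recorded for this question, and the error itself was only posted as an image. One thing that stands out in the code as posted is the estimators list: its parentheses are misplaced, so the list holds a single five-element tuple instead of four (name, estimator) pairs. The sketch below shows the list in the form VotingClassifier expects. It is an illustration under assumptions, not a confirmed diagnosis: the data is synthetic and model1 to model4 are stand-ins, since the post does not show the original models. Separately, recent imbalanced-learn releases name the resampling method fit_resample rather than fit_sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and stand-in models; the post does not show model1..model4.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

model1 = LogisticRegression(max_iter=1000)
model2 = RandomForestClassifier(n_estimators=20, random_state=0)
model3 = KNeighborsClassifier()
model4 = RandomForestClassifier(n_estimators=20, random_state=1)

# Each entry is its own (name, estimator) pair, closed before the next one starts.
estimators = [("lr", model1), ("RF1", model2), ("nn", model3), ("RF2", model4)]

model5 = VotingClassifier(estimators, voting="hard")
model5.fit(X_train, y_train)
ensemble_accuracy = model5.score(X_test, y_test)
print(ensemble_accuracy)
```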
I am applying a KNN model to the Breast Cancer Wisconsin data, but every time I run the code I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [559, 140]
import numpy as np
import pandas as pd
from sklearn import preprocessing,cross_validation,neighbors
df=pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
X=np.array(df.drop(['class'],1))
y=np.array(df['class'])
X_train, y_train, X_test, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy=clf.score(X_test, y_test)
print(accuracy)
example=np.array([4,2,1,1,1,2,3,2,1])
example=example.reshape(-1,1)
prediction=clf.predict(example)
print(prediction)
The output of cross_validation.train_test_split, as per the documentation, should be X_train, X_test, y_train, y_test. Change that line in your code to:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
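For readers on current scikit-learn, note that the sklearn.cross_validation module was removed in version 0.20, and train_test_split now lives in sklearn.model_selection. The sketch below applies the corrected unpacking with the newer import. The CSV file is not available, so the data is made up: 699 rows (the 559 + 140 from the error message), nine integer features, and labels 2 or 4. It also reshapes the single example to one row with nine features using reshape(1, -1); the question's reshape(-1, 1) would instead produce nine rows of one feature each, which does not match a model trained on nine features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up stand-in for the CSV: 699 rows, 9 integer features, labels 2 or 4.
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(699, 9))
y = np.where(X.mean(axis=1) > 5.5, 4, 2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# One sample with nine features has to be a single row: shape (1, 9), not (9, 1).
example = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1]).reshape(1, -1)
prediction = clf.predict(example)
print(prediction)
```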