Why am I getting this ValueError in a KNN model? - python

I am applying a KNN model to the Breast Cancer Wisconsin data, but every time I run the code I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [559, 140]
import numpy as np
import pandas as pd
from sklearn import preprocessing,cross_validation,neighbors
df=pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
X=np.array(df.drop(['class'],1))
y=np.array(df['class'])
X_train, y_train, X_test, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy=clf.score(X_test, y_test)
print(accuracy)
example=np.array([4,2,1,1,1,2,3,2,1])
example=example.reshape(-1,1)
prediction=clf.predict(example)
print(prediction)

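As per the documentation, cross_validation.train_test_split returns its outputs in the order X_train, X_test, y_train, y_test. Your code unpacks them as X_train, y_train, X_test, y_test, so fit receives the training features (559 rows) together with the test features (140 rows). Change that line in your code to: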
X_train,X_test,y_train,y_test=cross_validation.train_test_split(X,y,test_size=0.2)
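A minimal sketch of the corrected flow, assuming a recent scikit-learn where train_test_split is imported from sklearn.model_selection (the sklearn.cross_validation module was removed in version 0.20). The synthetic nine-feature data is only a stand-in for the CSV file, sized at 699 rows so that the split reproduces the 559 and 140 from the error message. The sketch also reshapes the single example with reshape(1, -1): reshape(-1, 1) would turn it into nine one-feature rows, and predict would then fail on the feature count.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the breast cancer CSV: 699 rows, 9 features (assumption).
X, y = make_classification(n_samples=699, n_features=9, random_state=0)

# Order matters: both X splits come first, then both y splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)

# A single sample must be a 2-D array with shape (1, n_features).
example = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1]).reshape(1, -1)
prediction = clf.predict(example)
print(prediction)
```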

Related

How do I return the result of each cross validation prediction

I have a task that requires me to analyse a model, but I need the output predictions for each cross-validation step and the data that the cross-validation used in that step.
Here is my code:
results= cross_validate(MLPClassifier, X_train, y_train, cv=5,return_estimator = True)
This did not work. I also tried:
results= cross_val_predict(MLPClassifier, X_train, y_train, cv=5)
Neither worked. The second call did give me a set of predictions, but it has the same shape as y_train (the labels), whereas I expected something smaller, say 10% of the size of y_train.
Also I'm unsure how to obtain the data used for each cross validation step.
How about using one of the Cross Validation iterators?
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier
X, y = make_classification(n_samples=1000, random_state=0)
datasets = {} # [(X_train, y_train), (X_test, y_test)]
results = {}
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for idx, (train_index, test_index) in enumerate(ss.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    datasets[f"train_{idx}"] = X_train, y_train
    datasets[f"test_{idx}"] = X_test, y_test
    model = MLPClassifier(random_state=0).fit(X_train, y_train)
    results[f"accuracy_{idx}"] = model.score(X_test, y_test)
results
Output:
{'accuracy_0': 0.968,
'accuracy_1': 0.924,
'accuracy_2': 0.94,
'accuracy_3': 0.944,
'accuracy_4': 0.964}
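On the shape question: cross_val_predict returns one prediction per input row, because with a partitioning splitter such as KFold every row lands in the test fold exactly once. That is why the result matches y_train in length. The sketch below is one possible way to get per-fold predictions out of it, under two assumptions of mine: the estimator is passed as an instance (MLPClassifier(...), not the class), and an explicit KFold with a fixed random_state is reused to recover which rows were held out in each fold. The dataset size and max_iter value are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# A fixed random_state makes the splitter yield the same folds on every call.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Pass an estimator instance, not the MLPClassifier class itself.
model = MLPClassifier(max_iter=500, random_state=0)
preds = cross_val_predict(model, X, y, cv=cv)

# Pair each fold's held-out row indices with the predictions made for them.
fold_predictions = [(test_idx, preds[test_idx]) for _, test_idx in cv.split(X)]
for fold, (test_idx, fold_preds) in enumerate(fold_predictions):
    print(fold, len(test_idx), fold_preds[:5])
```

Each entry of fold_predictions covers one fifth of the rows, so X[test_idx] and y[test_idx] give the data that belonged to that step.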

How do I properly fit a scikit-learn model using a pandas DataFrame?

I am trying to create a machine learning program in scikit-learn. I am using a CSV file to store data, and have decided to use a pandas DataFrame to import and format this data. I cannot figure out how to fit the model with this DataFrame.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
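You're assigning the return values in the wrong order. train_test_split returns X_train, X_test, y_train, y_test, so the line should be:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
Alternatively, correct the order and hold out 20% of the rows for testing. Note that this version replaces train_size=0.2 with test_size=0.2, so the model trains on 80% of the data instead of 20%:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.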
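A minimal sketch of fitting from a DataFrame with the corrected order. The data here is a made-up stand-in for awd.csv, assumed to have 50 rows because the error message reports 10 and 40 samples and train_size=0.2 was used. The feature matrix is selected with a list of column names so it stays two-dimensional, while the target can remain a Series.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for awd.csv: 50 rows with the same column names.
data = pd.DataFrame({"Ages": range(1, 51)})
data["Weights"] = 15 + 3 * data["Ages"]

X = data[["Ages"]]   # double brackets keep X two-dimensional
y = data["Weights"]  # a Series is fine for the target

# train_size=0.2 keeps only 20% of the rows for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)

lr = LinearRegression().fit(X_train, y_train)
test_score = lr.score(X_test, y_test)
print(f"Test set score: {round(test_score, 3)}")
```

With the original unpacking order, y_train would have received the 40-row X_test, which is where the pair [10, 40] in the message comes from.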

StandardScaling ruins the test data score for my linear regression

When I apply StandardScaler to my train data after train_test_split, the score for the train data is ok but the score for my test data makes no sense.
I tried LinearRegression(normalize=True) and it also made the score of my test data go crazy.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
print(lr.score(X_train_sc, y_train))
print(lr.score(X_test_sc, y_test))
Results are:
0.961269156232134
-1.5466488732709964e+19
Why??? Please, help!
Note: If I do not run the StandardScaler, then both my scores make perfect sense.

I am trying to use an ensemble method in sklearn, but fitting the model fails with ValueError: too many values to unpack (expected 2)

This is my code. Models 1 to 4 ran in the cells above without any errors.
I will also show my test train split.
from sklearn.ensemble import VotingClassifier
estimators=[('lr', model1,('RF1', model2), ('nn', model3),('RF2',model4))]
model5 = VotingClassifier(estimators, voting='hard')
model5.fit(X_train_nns,y_train_nns)
ypred_en=model5.predict(X_test)
model5.score(X_test, y_test)
This is how I split and oversampled it:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_train_nns, y_train_nns = sm.fit_sample(X_train, y_train.values.ravel())
Can anyone tell me what I am doing wrong here?
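One likely cause, offered as a sketch and not a confirmed diagnosis: VotingClassifier expects a list of separate (name, estimator) pairs, but the estimators list above has a misplaced parenthesis, so it contains a single five-element tuple that cannot be unpacked into two values. The models below are hypothetical stand-ins for model1 and model2, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# Hypothetical stand-ins for model1 and model2 from the question.
model1 = LogisticRegression(max_iter=1000)
model2 = RandomForestClassifier(n_estimators=50, random_state=0)

# Each entry is its own (name, estimator) pair, closed before the next one starts.
estimators = [("lr", model1), ("RF1", model2)]
model5 = VotingClassifier(estimators, voting="hard")
model5.fit(X_train, y_train)
score = model5.score(X_test, y_test)
print(score)
```

Applied to the original line, that would read estimators=[('lr', model1), ('RF1', model2), ('nn', model3), ('RF2', model4)].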

"Inconsistent numbers of samples" - scikit - learn

I'm learning some basics of machine learning in Python (scikit-learn), and when I tried to implement the K-nearest neighbors algorithm I got an error: ValueError: Found input variables with inconsistent numbers of samples: [426, 143]. I have no idea how to deal with it.
This is my code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = load_breast_cancer()
X_train, y_train, X_test, y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target,
                                                    random_state=0)
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
train_test_split returns its outputs in the order X_train, X_test, y_train, y_test.
You've assigned the return values to the wrong variables, so you are fitting with the training data and the test data instead of the training data and the training labels.
It should be:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=0)
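To see where the two numbers in the message come from, it can help to print the shape of each returned array in order. This is a small sketch using the same call as the question; the names list is just my labelling of the documented return order.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
splits = train_test_split(cancer.data, cancer.target,
                          stratify=cancer.target, random_state=0)

# Label each returned array by its position in the documented return order.
names = ["X_train", "X_test", "y_train", "y_test"]
shapes = {name: arr.shape for name, arr in zip(names, splits)}
for name, shape in shapes.items():
    print(name, shape)
```

The first two arrays have 426 and 143 rows. With the question's unpacking order those two were passed to fit as features and labels, hence [426, 143].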
