KNN : Found input variables with inconsistent numbers of samples: [20, 499] - python
Full replit here: https://repl.it/#JacksonEnnis/KNNPercentage
I am trying to use the KNN tool from scikit-learn to make some predictions.
I have two functions, recurse() and predict(). recurse() is intended to iterate through every possible combination of features, while predict() is supposed to do the actual prediction:
def predict(self, data, answers):
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split as tts
    import numpy as np
    if len(data) > 1:
        print("length before transposition {}".format(len(data)))
        #n_data = np.transpose(data)
        #print("length after transposition {}".format(len(n_data)))
        knn = KNeighborsClassifier(n_neighbors=1)
        xTrain, xTest, yTrain, yTest = tts(data, answers)
        print("xTrain data: {}".format(len(xTrain)))
        knn.fit(xTrain, yTrain)
        print(knn.score(xTest, yTest))
def recurse(self, data):
    self.predict(data, self.y)
    if len(data) > 0:
        self.recurse(self.rLeft(data))
    if len(data) > 1:
        self.recurse(self.rMid(data))
    if len(data) > 2:
        self.recurse(self.rRight(data))
However, when I run the program, it reports a problem with the train/test split line. I have checked the samples in each feature, as well as the answers, and found that they are all the same length, so I am unsure why this is happening.
Traceback (most recent call last):
  File "main.py", line 12, in <module>
    best = Config(apple)
  File "/home/runner/Config.py", line 13, in __init__
    self.predict(self.features, self.y)
  File "/home/runner/Config.py", line 45, in predict
    xTrain, xTest, yTrain, yTest = tts(data, answers)
  File "/home/runner/.local/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 2096, in train_test_split
    arrays = indexable(*arrays)
  File "/home/runner/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 230, in indexable
    check_consistent_length(*result)
  File "/home/runner/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 205, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [20, 499]
You have your axes reversed: scikit-learn expects each array to carry samples along the first axis, so array.shape[0] must be the same for every array you pass. Your data holds 20 features of 499 samples each (shape (20, 499)) while answers has 499 entries, hence [20, 499]. Transpose data before splitting. I recommend you check out the scikit-learn docs for more examples.
tts(np.array(data).T, answers)
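As a sanity check, here is a minimal self-contained sketch (with made-up arrays matching the shapes implied by the error message) showing why the transpose is needed:

import numpy as np
from sklearn.model_selection import train_test_split as tts

# Hypothetical stand-ins: 20 features, each a list of 499 samples,
# i.e. the feature-major layout the question's data is in.
data = [list(range(499)) for _ in range(20)]   # shape (20, 499)
answers = [i % 2 for i in range(499)]          # 499 labels

# tts(data, answers) would raise: inconsistent numbers of samples [20, 499].
# Transposing puts the samples on the first axis: shape (499, 20).
xTrain, xTest, yTrain, yTest = tts(np.array(data).T, answers)
print(len(xTrain), len(yTrain))  # matching sample counts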
Related
Sklearn can't convert string to float
I'm using Sklearn as a machine learning tool, but every time I run my code, it gives this error:

Traceback (most recent call last):
  File "C:\Users\FakeUserMadeUp\Desktop\Python\Machine Learning\MachineLearning.py", line 12, in <module>
    model.fit(X_train, Y_train)
  File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 166, in fit
    X, y, validate_separately=(check_X_params, check_y_params)
  File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\base.py", line 578, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 746, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 1993, in __array__
    return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'Paris'

Here is the code, and down below there's my dataset. (I've tried multiple different datasets; also, this dataset is a txt because I made it myself and am too dumb to convert it to csv.)

import pandas as pd
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.model_selection import train_test_split as tts

city_data = pd.read_csv('TimeZoneTable.txt')
X = city_data.drop(columns=['Country'])
Y = city_data['Country']
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=0.2)
model = dtc()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
print(Y_test)
print(predictions)

Dataset:

CityName,Country,Latitude,Longitude,TimeZone
Moscow,Russia,55.45'N,37.37'E,3
Vienna,Austria,48.13'N,16.22'E,2
Barcelona,Spain,41.23'N,2.11'E,2
Madrid,Spain,40.25'N,3.42'W,2
Lisbon,Portugal,38.44'N,9.09'W,1
London,UK,51.30'N,0.08'W,1
Cardiff,UK,51.29'N,3.11'W,1
Edinburgh,UK,55.57'N,3.11'W,1
Dublin,Ireland,53.21'N,6.16'W,1
Paris,France,48.51'N,2.21'E,2
Machine learning algorithms, and random forests in particular, work exclusively with numeric input. If you want to improve your model, it is even recommended to normalize your features, generally to the range [-1, 1], and therefore to use decimal numbers, hence the expectation of a float. In your case, your dataframe seems to contain exclusively string entries. As Dilara Gokay said, you first need to transform your strings into floats; the standard tool for this is a one-hot encoder (OneHotEncoder in scikit-learn). I'll let you follow this tutorial if you don't know how to do it.
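A minimal sketch of that approach, assuming the TimeZoneTable.txt layout shown in the question (the string columns are one-hot encoded, the already-numeric TimeZone column is passed through; parsing the coordinate strings into real numbers would be better still):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Assumes the column names from the question's TimeZoneTable.txt.
city_data = pd.read_csv('TimeZoneTable.txt')
X = city_data.drop(columns=['Country'])
Y = city_data['Country']

# One-hot encode the string columns; leave TimeZone as-is.
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'),
      ['CityName', 'Latitude', 'Longitude'])],
    remainder='passthrough')

model = make_pipeline(preprocess, DecisionTreeClassifier())
model.fit(X, Y)  # no more "could not convert string to float"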
Error MultiLabelBinarizer when importing strings from a csv to a fit() function to train a model with scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('coords.csv', sep=';')  # load the csv file
x = df.iloc[1:, 1:]  # feature values
y = df.iloc[1:, 0]   # target value
y = y.apply(lambda y: y.encode())
print(x)
print(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1234)
print(x_train)
print(y_train)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

pipelines = {
    'lr': make_pipeline(StandardScaler(), LogisticRegression()),
    'rc': make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}

fit_models = {}
for algo, pipeline in pipelines.items():
    model = pipeline.fit(x_train, y_train)
    fit_models[algo] = model

print(fit_models)
print(fit_models['lr'].predict(x_test))
print(fit_models['rc'].predict(x_test))
print(fit_models['rf'].predict(x_test))
print(fit_models['gb'].predict(x_test))

I was having a problem when trying to load strings from a csv file, because it tells me:

Traceback (most recent call last):
  File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
    model = pipeline.fit(x_train, y_train)
  File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\optimize.py", line 243, in _check_optimize_result
    ).format(solver, result.status, result.message.decode("latin1"))
AttributeError: 'str' object has no attribute 'decode'

And when I add y = y.apply(lambda y: y.encode()), because I thought I needed to transform the strings to bytes, I get this:

Traceback (most recent call last):
  File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
    model = pipeline.fit(x_train, y_train)
  File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 335, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1345, in fit
    check_classification_targets(y)
  File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 169, in check_classification_targets
    y_type = type_of_target(y)
  File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 263, in type_of_target
    raise ValueError('You appear to be using a legacy multi-label data'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

How do I make it so that the data framed in red in the Excel screenshot of the csv (the targets) is saved in the variable y, and the data framed in blue (the features x1, y1, z1, v1, x2, y2, z2, v2, ..., x501, y501, z501, v501) is saved in the variable x?
Try this:

df = pd.read_csv('testing.csv', sep=';', header=1)
x = df.iloc[:, 1:]  # feature values
y = df.iloc[:, 0]   # target value
#y = y.apply(lambda y: y.encode())
print(x)
print(y)
...
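A small self-contained sketch (a hypothetical in-memory file stands in for the csv) of what this changes: header=1 makes pandas use the file's second row as the column names, so the first data row no longer needs to be skipped with iloc[1:, ...], and the y.encode() workaround can be dropped, since scikit-learn classifiers accept string labels directly.

import io
import pandas as pd

# Hypothetical stand-in for the csv: a junk first row, then the real header.
raw = "junk;junk;junk\nclass;x1;y1\nA;0.1;0.2\nB;0.3;0.4\n"

df = pd.read_csv(io.StringIO(raw), sep=';', header=1)
print(df.columns.tolist())  # ['class', 'x1', 'y1']: the second file row became the header
x = df.iloc[:, 1:]  # every remaining row is data, no trimming needed
y = df.iloc[:, 0]   # string labels are fine as classification targets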
Cross validation inconsistent numbers of samples error (Python)
I am trying to make a classification using the cross validation method and an SVM classifier. In my data file, the last column contains my classes (which are 0, 1, 2, 3, 4, 5) and the rest (except the first column) is the numeric data that I want to use to predict these classes.

from sklearn import svm
from sklearn import metrics
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

filename = "Features.csv"
dataset = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=range(1, 39))
x = dataset[:, 0:36]
y = dataset[:, 36]
print("len(x): " + str(len(x)))
print("len(y): " + str(len(x)))

skf = StratifiedKFold(n_splits=10, shuffle=False, random_state=42)
modelsvm = svm.SVC()

expected = y
print("len(expected): " + str(len(expected)))

predictedsvm = cross_val_score(modelsvm, x, y, cv=skf)
print("len(predictedsvm): " + str(len(predictedsvm)))

svm_results = metrics.classification_report(expected, predictedsvm)
print(svm_results)

And I am getting such an error:

len(x): 2069
len(y): 2069
len(expected): 2069
C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.
  FutureWarning
len(predictedsvm): 10
Traceback (most recent call last):
  File "C:/Users/MyComp/PycharmProjects/GG/AR.py", line 54, in <module>
    svm_results = metrics.classification_report(expected, predictedsvm)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 1929, in classification_report
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Python\Python37\lib\site-packages\sklearn\metrics\_classification.py", line 81, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 257, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [2069, 10]

Process finished with exit code 1

I don't understand how my data count in y goes down to 10 when I am trying to predict it using CV. Can anyone help me with this, please?
You are misunderstanding the output of cross_val_score. As per the documentation, it returns an "array of scores of the estimator for each run of the cross validation," not actual predictions; because you have 10 folds, you get 10 scores. classification_report, on the other hand, expects the true values and the predicted values, so to use it you need to fit a model on the data and predict with it. If you're happy with the scores from cross_val_score, you can train that model on the data, or you can use GridSearchCV to do this all in one sweep.
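One hedged sketch of getting per-sample predictions that classification_report can consume: scikit-learn's cross_val_predict (a sibling of cross_val_score, and an alternative to the fit-then-predict route suggested above) returns one out-of-fold prediction per sample, so its output lines up with y. Toy data stands in for the question's Features.csv:

from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy stand-in mirroring the question: 2069 samples, 36 numeric
# features, classes 0-5. Substitute the real x and y here.
x, y = make_classification(n_samples=2069, n_features=36, n_classes=6,
                           n_informative=10, random_state=0)

# shuffle=True also silences the FutureWarning about random_state.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
modelsvm = svm.SVC()

# One out-of-fold prediction per sample, so the lengths match y.
predictedsvm = cross_val_predict(modelsvm, x, y, cv=skf)
print(len(predictedsvm))  # 2069, not 10
print(metrics.classification_report(y, predictedsvm))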
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I am trying to predict using SVM but I receive the error AttributeError: 'numpy.ndarray' object has no attribute 'lower' when executing the line text_clf.fit(X_train,y_train) of my code. How do I fix this and get the probability that my prediction is correct using SVM? I am predicting the first column (gold) of my input file based on the values of the remaining columns. My input file dataExtended.txt has the form:

gold,T-x-T,T-x-N,T-x-U,T-x-NT,T-x-UT,T-x-UN,T-x-UNT,N-x-T,N-x-N,N-x-U,N-x-NT,N-x-UT,N-x-UN,N-x-UNT,U-x-T,U-x-N,U-x-U,U-x-NT,U-x-UT,U-x-UN,U-x-UNT,NT-x-T,NT-x-N,NT-x-U,NT-x-NT,NT-x-UT,NT-x-UN,NT-x-UNT,UT-x-T,UT-x-N,UT-x-U,UT-x-NT,UT-x-UT,UT-x-UN,UT-x-UNT,UN-x-T,UN-x-N,UN-x-U,UN-x-NT,UN-x-UT,UN-x-UN,UN-x-UNT,UNT-x-T,UNT-x-N,UNT-x-U,UNT-x-NT,UNT-x-UT,UNT-x-UN,UNT-x-UNT,T-T-x,T-N-x,T-U-x,T-NT-x,T-UT-x,T-UN-x,T-UNT-x,N-T-x,N-N-x,N-U-x,N-NT-x,N-UT-x,N-UN-x,N-UNT-x,U-T-x,U-N-x,U-U-x,U-NT-x,U-UT-x,U-UN-x,U-UNT-x,NT-T-x,NT-N-x,NT-U-x,NT-NT-x,NT-UT-x,NT-UN-x,NT-UNT-x,UT-T-x,UT-N-x,UT-U-x,UT-NT-x,UT-UT-x,UT-UN-x,UT-UNT-x,UN-T-x,UN-N-x,UN-U-x,UN-NT-x,UN-UT-x,UN-UN-x,UN-UNT-x,UNT-T-x,UNT-N-x,UNT-U-x,UNT-NT-x,UNT-UT-x,UNT-UN-x,UNT-UNT-x,x-T-T,x-T-N,x-T-U,x-T-NT,x-T-UT,x-T-UN,x-T-UNT,x-N-T,x-N-N,x-N-U,x-N-NT,x-N-UT,x-N-UN,x-N-UNT,x-U-T,x-U-N,x-U-U,x-U-NT,x-U-UT,x-U-UN,x-U-UNT,x-NT-T,x-NT-N,x-NT-U,x-NT-NT,x-NT-UT,x-NT-UN,x-NT-UNT,x-UT-T,x-UT-N,x-UT-U,x-UT-NT,x-UT-UT,x-UT-UN,x-UT-UNT,x-UN-T,x-UN-N,x-UN-U,x-UN-NT,x-UN-UT,x-UN-UN,x-UN-UNT,x-UNT-T,x-UNT-N,x-UNT-U,x-UNT-NT,x-UNT-UT,x-UNT-UN,x-UNT-UNT,callersAtLeast1T,CalleesAtLeast1T,callersAllT,calleesAllT,CallersAtLeast1N,CalleesAtLeast1N,CallersAllN,CalleesAllN,childrenAtLeast1T,parentsAtLeast1T,childrenAtLeast1N,parentsAtLeast1N,childrenAllT,parentsAllT,childrenAllN,ParentsAllN,ParametersatLeast1T,FieldMethodsAtLeast1T,ReturnTypeAtLeast1T,ParametersAtLeast1N,FieldMethodsAtLeast1N,ReturnTypeN,ParametersAllT,FieldMethodsAllT,ParametersAllN,FieldMethodsAllN,ClassGoldN,ClassGoldT,Inner,Leaf,Root,Isolated,EmptyCallers,EmptyCallees,EmptyCallersCallers,EmptyCalleesCallees,Program,Requirement,MethodID
T,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,1,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,2,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,3,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,4,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,5,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,6,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,7,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,8,1
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,1,3
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,2,3
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,3,3
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,4,3
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,5,3

Here is my full reproducible code:

# Make Predictions with Naive Bayes On The Iris Dataset
from sklearn.cross_validation import train_test_split
from sklearn import metrics
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import seaborn as sns
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

data = pd.read_csv('dataExtended.txt', sep=',')
row_count, column_count = data.shape

# Printing the dataset shape
print("Dataset Length: ", len(data))
print("Dataset Shape: ", data.shape)
print("Number of columns ", column_count)

# Printing the dataset observations
print("Dataset: ", data.head())

data['gold'] = data['gold'].astype('category').cat.codes
data['Program'] = data['Program'].astype('category').cat.codes

# Building phase: separating the target variable
X = data.values[:, 1:column_count]
Y = data.values[:, 0]

# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

# Create a svm Classifier
svclassifier = svm.LinearSVC()
print('Before fitting')
svclassifier.fit(X_train, y_train)
predicted = svclassifier.predict(X_test)

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf.fit(X_train, y_train)

Traceback leading to the error:

Traceback (most recent call last):
  File "<ipython-input-9-8e85a0a9f81c>", line 1, in <module>
    runfile('C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/Paper4SVM.py', wdir='C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python')
  File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/Paper4SVM.py", line 53, in <module>
    text_clf.fit(X_train,y_train)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
You cannot use TF-IDF-related methods on numeric data; the method is exclusively for use with text data, hence it uses string methods such as .lower(), which are by default applicable to strings, and that is exactly where your error comes from. This is already apparent from the documentation:

fit(self, raw_documents, y=None)
    Learn vocabulary and idf from training set.
    Parameters
    raw_documents : iterable
        An iterable which yields either str, unicode or file objects.

I am afraid that your rationale, as explained in the comments:

    I'm just trying to get the probability that each prediction is correct and TF-IDF seems to be the only way to do so when using SVM

is extremely weak. For starters, there is no such thing as "the probability that each prediction is correct" - I take it that you mean probabilistic predictions, in contrast to hard class predictions (see Predict classes or class probabilities?).

To get to your actual requirement: in contrast to LinearSVC, which you are using here, SVC does indeed provide a predict_proba method, which should do the job (see the docs and the instructions therein). Notice that LinearSVC is not actually an SVM - see the answer to Under what parameters are SVC and LinearSVC in scikit-learn equivalent? for details.

In short: forget about TF-IDF and switch from LinearSVC to SVC.
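A minimal sketch of that switch (toy data stands in for the question's numeric feature matrix so the snippet runs on its own; note that probability=True is required for predict_proba and slows training, since SVC fits an internal calibration step):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy stand-in data; substitute the question's X_train/y_train.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(probability=True)  # probability=True enables predict_proba
clf.fit(X_train, y_train)

print(clf.predict(X_test)[:5])        # hard class predictions
print(clf.predict_proba(X_test)[:5])  # per-class probability estimates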
What does ValueError: Found input variables with inconsistent numbers of samples: [75, 1] signify in Python?
I was going through the "Writing Our First Classifier - Machine Learning Recipes #5" machine learning video on YouTube. I followed along with the example, but am not sure why I am not able to get the code running. Note: this isn't the final code for the KNN classifier; it is the initial testing phase.

#implementing KNN Classifier without using import statement
import random

class ScrappyKNN():
    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = random.choice(self.Y_train)
            predictions.append(label)
            return predictions

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

clf = ScrappyKNN()
clf.fit(X_train, Y_train)
predictions_result = clf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, predictions_result))

I am getting the error "ValueError: Found input variables with inconsistent numbers of samples: [75, 1]". I believe there is some size inconsistency in the lists, as the training and testing data sets are split out of the 150 samples into 75 each (I have used test_size=0.5). I am really stuck on this one. Could you kindly tell me what this error means? I searched through similar answers on Stack Overflow but unfortunately can't make out what causes it. I'm new to using Python for machine learning. Can someone kindly help me out?

Here is the full stacktrace:

/Users/joyjitchatterjee/anaconda3/envs/machinelearning/bin/python /Users/joyjitchatterjee/PycharmProjects/untitled1/ml_5.py
Traceback (most recent call last):
  File "/Users/joyjitchatterjee/PycharmProjects/untitled1/ml_5.py", line 36, in <module>
    print(accuracy_score(Y_test,predictions_result))
  File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 176, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 71, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "/Users/joyjitchatterjee/anaconda3/envs/machinelearning/lib/python3.6/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [75, 4]

Process finished with exit code 1

[Screenshot of code]
The indentation of the final return statement is wrong in your code: return predictions sits inside the for loop, so predict() returns after the first iteration and hands back a single-element list, which is why the lengths passed to accuracy_score don't match. It should be

def predict(self, X_test):
    predictions = []
    for row in X_test:
        label = random.choice(self.Y_train)
        predictions.append(label)
    return predictions