Sklearn: Cross validation for grouped data - python
I am trying to implement a cross-validation scheme on grouped data. I was hoping to use the GroupKFold method, but I keep getting an error. What am I doing wrong?
The code (slightly different from the one I used: I had different data and therefore a larger n_splits, but everything else is the same):
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.grid_search import GridSearchCV
from xgboost import XGBRegressor
#generate data
x=np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13])
y= np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])
group=np.array([1,0,1,1,2,2,2,1,1,1,2,0,0,2])
#grid search
gkf = GroupKFold( n_splits=3).split(x,y,group)
subsample = np.arange(0.3,0.5,0.1)
param_grid = dict( subsample=subsample)
rgr_xgb = XGBRegressor(n_estimators=50)
grid_search = GridSearchCV(rgr_xgb, param_grid, cv=gkf, n_jobs=-1)
result = grid_search.fit(x, y)
the error:
Traceback (most recent call last):
File "<ipython-input-143-11d785056a08>", line 8, in <module>
result = grid_search.fit(x, y)
File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 813, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 566, in _fit
n_folds = len(cv)
TypeError: object of type 'generator' has no len()
changing the line
gkf = GroupKFold( n_splits=3).split(x,y,group)
to
gkf = GroupKFold( n_splits=3)
does not work either. The error message is then:
'GroupKFold' object is not iterable
The split method of GroupKFold is a generator: it yields the (train, test) index pairs one at a time, so it has no length. Call list on the result of split to collect all pairs in a list whose length can be computed:
gkf = list(GroupKFold( n_splits=3).split(x,y,group))
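For reference, here is a minimal sketch of two ways to hand grouped splits to a grid search. It assumes a scikit-learn version in which GridSearchCV lives in sklearn.model_selection and its fit method accepts groups. Ridge and the alpha grid are stand-ins (not from the question) for XGBRegressor and subsample, chosen only to avoid an extra dependency. x is reshaped to 2-D because scikit-learn estimators expect a feature matrix.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

# Data from the question; x becomes a column vector (n_samples, 1).
x = np.arange(14).reshape(-1, 1)
y = np.array([1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7])
group = np.array([1, 0, 1, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 2])

# Stand-in grid (assumption): replace with the real estimator and parameters.
param_grid = {"alpha": [0.1, 1.0]}

# Option 1: materialise the generator into a list of (train, test) index pairs.
splits = list(GroupKFold(n_splits=3).split(x, y, group))
search_with_list = GridSearchCV(Ridge(), param_grid, cv=splits)
search_with_list.fit(x, y)

# Option 2: pass the splitter itself and give the groups to fit.
search_with_groups = GridSearchCV(Ridge(), param_grid, cv=GroupKFold(n_splits=3))
search_with_groups.fit(x, y, groups=group)
```

With either option, no group appears in both the training and the test part of a split.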
Related
Sklearn can't convert string to float
I'm using Sklearn as a machine learning tool, but every time I run my code, it gives this error:
Traceback (most recent call last):
File "C:\Users\FakeUserMadeUp\Desktop\Python\Machine Learning\MachineLearning.py", line 12, in <module>
model.fit(X_train, Y_train)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 942, in fit
X_idx_sorted=X_idx_sorted,
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 166, in fit
X, y, validate_separately=(check_X_params, check_y_params)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\base.py", line 578, in _validate_data
X = check_array(X, **check_X_params)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'Paris'
Here is the code, and down below there's my dataset. (I've tried multiple different datasets. Also, this dataset is a txt because I made it myself and am too dumb to convert it to csv.)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.model_selection import train_test_split as tts
city_data = pd.read_csv('TimeZoneTable.txt')
X = city_data.drop(columns=['Country'])
Y = city_data['Country']
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size = 0.2)
model = dtc()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
print(Y_test)
print(predictions)
Dataset:
CityName,Country,Latitude,Longitude,TimeZone
Moscow,Russia,55.45'N,37.37'E,3
Vienna,Austria,48.13'N,16.22'E,2
Barcelona,Spain,41.23'N,2.11'E,2
Madrid,Spain,40.25'N,3.42'W,2
Lisbon,Portugal,38.44'N,9.09'W,1
London,UK,51.30'N,0.08'W,1
Cardiff,UK,51.29'N,3.11'W,1
Edinburgh,UK,55.57'N,3.11'W,1
Dublin,Ireland,53.21'N,6.16'W,1
Paris,France,48.51'N,2.21'E,2
Machine learning algorithms in scikit-learn, including tree-based models such as the decision tree used here, work exclusively with numeric input. To improve a model it is often also recommended to normalize the data (for example to the range [-1, 1]), which means working with decimal numbers; hence the expectation of a float. In your case, most columns of the dataframe contain string entries. As Dilara Gokay said, you first need to transform your strings into numbers; for categorical columns, one way to do so is what is called a one-hot encoder. You can follow this tutorial if you don't know how to do it.
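As an illustration, here is a rough sketch of one way to make the posted dataset numeric. Two things are assumptions, not part of the answer above: the helper to_signed, which reads a coordinate such as 3.42'W as a signed decimal number (negative for S and W), and the use of pandas.get_dummies as the one-hot encoding step. The data is read from an in-memory string only so that the example is self-contained.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

raw = """CityName,Country,Latitude,Longitude,TimeZone
Moscow,Russia,55.45'N,37.37'E,3
Vienna,Austria,48.13'N,16.22'E,2
Barcelona,Spain,41.23'N,2.11'E,2
Madrid,Spain,40.25'N,3.42'W,2
Lisbon,Portugal,38.44'N,9.09'W,1
London,UK,51.30'N,0.08'W,1
Cardiff,UK,51.29'N,3.11'W,1
Edinburgh,UK,55.57'N,3.11'W,1
Dublin,Ireland,53.21'N,6.16'W,1
Paris,France,48.51'N,2.21'E,2
"""


def to_signed(coord):
    """Hypothetical helper: "3.42'W" -> -3.42 (assumes the value'H layout)."""
    value, hemisphere = coord.split("'")
    sign = -1.0 if hemisphere in ("S", "W") else 1.0
    return sign * float(value)


city_data = pd.read_csv(io.StringIO(raw))
city_data["Latitude"] = city_data["Latitude"].map(to_signed)
city_data["Longitude"] = city_data["Longitude"].map(to_signed)

# One-hot encode the remaining string column; dtype=float keeps X fully numeric.
X = pd.get_dummies(city_data.drop(columns=["Country"]), columns=["CityName"], dtype=float)
Y = city_data["Country"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, Y)  # no "could not convert string to float" error any more
```

The target Y can stay as strings: scikit-learn classifiers accept string class labels, only the features have to be numeric.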
MultiLabelBinarizer error when importing strings from a csv to a fit() function to train a model with scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('coords.csv',sep=';') # Load the csv file
x = df.iloc[1:,1:] #features values
y = df.iloc[1:,0] #target value
y = y.apply(lambda y: y.encode())
print(x)
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1234)
print(x_train)
print(y_train)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
pipelines = {
    'lr':make_pipeline(StandardScaler(), LogisticRegression()),
    'rc':make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf':make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb':make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}
fit_models = {}
for algo, pipeline in pipelines.items():
    model = pipeline.fit(x_train, y_train)
    fit_models[algo] = model
print(fit_models)
print(fit_models['lr'].predict(x_test))
print(fit_models['rc'].predict(x_test))
print(fit_models['rf'].predict(x_test))
print(fit_models['gb'].predict(x_test))
I was having a problem when trying to load strings from a csv file. It tells me:
Traceback (most recent call last):
File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
model = pipeline.fit(x_train, y_train)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\optimize.py", line 243, in _check_optimize_result
).format(solver, result.status, result.message.decode("latin1"))
AttributeError: 'str' object has no attribute 'decode'
And when I add y = y.apply(lambda y: y.encode()), because I thought I needed to transform strings to bytes, I get this:
Traceback (most recent call last):
File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
model = pipeline.fit(x_train, y_train)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1345, in fit
check_classification_targets(y)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 169, in check_classification_targets
y_type = type_of_target(y)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 263, in type_of_target
raise ValueError('You appear to be using a legacy multi-label data'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
How do I load the csv so that the data framed in red in the following Excel screenshot (the targets) are saved in the variable y, and the data framed in blue (the features x1, y1, z1, v1, x2, y2, z2, v2, ..., x501, y501, z501, v501) are saved in the variable x?
Try this:
df = pd.read_csv('testing.csv',sep=';',header=1)
x = df.iloc[:,1:] #features values
y = df.iloc[:,0] #target value
#y = y.apply(lambda y: y.encode())
print(x)
print(y)
...
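To illustrate why the encode() step is not needed, here is a small sketch on made-up data. The column names, the labels happy and sad, and the layout (label in the first column, header on the first row) are all assumptions, since the real coords.csv is not shown; adjust the header argument to wherever the header row sits in your file.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real file: label first, then numeric features.
raw = """class;x1;y1;x2;y2
happy;0.1;0.2;0.3;0.4
happy;0.2;0.1;0.4;0.3
sad;0.9;0.8;0.7;0.6
sad;0.8;0.9;0.6;0.7
"""

df = pd.read_csv(io.StringIO(raw), sep=";")
x = df.iloc[:, 1:]  # feature values: every column after the first
y = df.iloc[:, 0]   # target values: plain strings, left unencoded

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(x, y)
predictions = model.predict(x)  # predictions come back as the same string labels
```

scikit-learn classifiers accept string class labels directly, so converting them to bytes only changes what the target looks like to the library.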
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I am trying to predict using SVM but I receive the error AttributeError: 'numpy.ndarray' object has no attribute 'lower' when executing line text_clf.fit(X_train,y_train) of my code. How to fix this and get the probability that my prediction is correct using SVM? I am predicting the first column (gold) of my input file based on the values of the remaining columns. My input file dataExtended.txtis under the form: gold,T-x-T,T-x-N,T-x-U,T-x-NT,T-x-UT,T-x-UN,T-x-UNT,N-x-T,N-x-N,N-x-U,N-x-NT,N-x-UT,N-x-UN,N-x-UNT,U-x-T,U-x-N,U-x-U,U-x-NT,U-x-UT,U-x-UN,U-x-UNT,NT-x-T,NT-x-N,NT-x-U,NT-x-NT,NT-x-UT,NT-x-UN,NT-x-UNT,UT-x-T,UT-x-N,UT-x-U,UT-x-NT,UT-x-UT,UT-x-UN,UT-x-UNT,UN-x-T,UN-x-N,UN-x-U,UN-x-NT,UN-x-UT,UN-x-UN,UN-x-UNT,UNT-x-T,UNT-x-N,UNT-x-U,UNT-x-NT,UNT-x-UT,UNT-x-UN,UNT-x-UNT,T-T-x,T-N-x,T-U-x,T-NT-x,T-UT-x,T-UN-x,T-UNT-x,N-T-x,N-N-x,N-U-x,N-NT-x,N-UT-x,N-UN-x,N-UNT-x,U-T-x,U-N-x,U-U-x,U-NT-x,U-UT-x,U-UN-x,U-UNT-x,NT-T-x,NT-N-x,NT-U-x,NT-NT-x,NT-UT-x,NT-UN-x,NT-UNT-x,UT-T-x,UT-N-x,UT-U-x,UT-NT-x,UT-UT-x,UT-UN-x,UT-UNT-x,UN-T-x,UN-N-x,UN-U-x,UN-NT-x,UN-UT-x,UN-UN-x,UN-UNT-x,UNT-T-x,UNT-N-x,UNT-U-x,UNT-NT-x,UNT-UT-x,UNT-UN-x,UNT-UNT-x,x-T-T,x-T-N,x-T-U,x-T-NT,x-T-UT,x-T-UN,x-T-UNT,x-N-T,x-N-N,x-N-U,x-N-NT,x-N-UT,x-N-UN,x-N-UNT,x-U-T,x-U-N,x-U-U,x-U-NT,x-U-UT,x-U-UN,x-U-UNT,x-NT-T,x-NT-N,x-NT-U,x-NT-NT,x-NT-UT,x-NT-UN,x-NT-UNT,x-UT-T,x-UT-N,x-UT-U,x-UT-NT,x-UT-UT,x-UT-UN,x-UT-UNT,x-UN-T,x-UN-N,x-UN-U,x-UN-NT,x-UN-UT,x-UN-UN,x-UN-UNT,x-UNT-T,x-UNT-N,x-UNT-U,x-UNT-NT,x-UNT-UT,x-UNT-UN,x-UNT-UNT,callersAtLeast1T,CalleesAtLeast1T,callersAllT,calleesAllT,CallersAtLeast1N,CalleesAtLeast1N,CallersAllN,CalleesAllN,childrenAtLeast1T,parentsAtLeast1T,childrenAtLeast1N,parentsAtLeast1N,childrenAllT,parentsAllT,childrenAllN,ParentsAllN,ParametersatLeast1T,FieldMethodsAtLeast1T,ReturnTypeAtLeast1T,ParametersAtLeast1N,FieldMethodsAtLeast1N,ReturnTypeN,ParametersAllT,FieldMethodsAllT,ParametersAllN,FieldMethodsAllN,ClassGoldN,ClassGoldT,Inner,Leaf,Root,Isolated,EmptyCallers,EmptyCallee
s,EmptyCallersCallers,EmptyCalleesCallees,Program,Requirement,MethodID T,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,1,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,2,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,3,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,4,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,5,1 
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,6,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,7,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,8,1 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,1,3 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,2,3 
N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,3,3 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,chess,4,3 N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,chess,5,3 Here is my full reproducible code: # Make Predictions with Naive Bayes On The Iris Dataset from sklearn.cross_validation import train_test_split from sklearn import metrics import pandas as pd import numpy as np import seaborn as sns; sns.set() from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score from sklearn.metrics import classification_report import seaborn as sns from sklearn import svm from sklearn.svm import LinearSVC from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.pipeline import Pipeline data = pd.read_csv( 'dataExtended.txt', sep= ',') row_count, column_count = data.shape # Printing the dataswet shape print ("Dataset Length: ", len(data)) print ("Dataset Shape: ", data.shape) print("Number of columns ", column_count) # Printing the dataset obseravtions print ("Dataset: ",data.head()) 
data['gold'] = data['gold'].astype('category').cat.codes
data['Program'] = data['Program'].astype('category').cat.codes
# Building Phase Separating the target variable
X = data.values[:, 1:column_count]
Y = data.values[:, 0]
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
#Create a svm Classifier
svclassifier = svm.LinearSVC()
print('Before fitting')
svclassifier.fit(X_train, y_train)
predicted = svclassifier.predict(X_test)
text_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])
text_clf.fit(X_train,y_train)
Traceback leading to error:
Traceback (most recent call last):
File "<ipython-input-9-8e85a0a9f81c>", line 1, in <module>
runfile('C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/Paper4SVM.py', wdir='C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python')
File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "C:\Users\mouna\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/mouna/ownCloud/Mouna Hammoudi/dumps/Python/Paper4SVM.py", line 53, in <module>
text_clf.fit(X_train,y_train)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 248, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
**fit_params_steps[name])
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1381, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\mouna\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
You cannot use TF-IDF-related methods for numeric data. The method is exclusively for use with text data, so it calls string methods such as .lower() on every input; your inputs are rows of a numeric numpy array, which have no such method, hence the error. This is already apparent from the documentation:
fit(self, raw_documents, y=None)
Learn vocabulary and idf from training set.
Parameters
raw_documents: iterable
An iterable which yields either str, unicode or file objects.
I am afraid that your rationale, as explained in the comments:
I'm just trying to get the probability that each prediction is correct and TF-IDF seems to be the only way to do so when using SVM
is extremely weak. For starters, there is no such thing as "the probability that each prediction is correct". I take it that you mean probabilistic predictions, in contrast to hard class predictions (see Predict classes or class probabilities?).
To get to the point of your actual requirement: in contrast to LinearSVC, which you are using here, SVC does indeed provide a predict_proba method, available when the model is created with probability=True, which should do the job (see the docs and the instructions therein).
Notice that LinearSVC is not the same model as SVC with a linear kernel; see the answer in Under what parameters are SVC and LinearSVC in scikit-learn equivalent? for details.
In short, forget about TF-IDF and switch to SVC instead of LinearSVC.
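A minimal sketch of that switch follows. It runs on synthetic numeric data from make_classification (an assumption, standing in for dataExtended.txt) and assumes a scikit-learn version in which SVC accepts probability=True.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic numeric features and binary labels; no text, so no TfidfVectorizer.
X, Y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100
)

# probability=True enables predict_proba (at the cost of a slower fit).
clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)    # hard class predictions
proba = clf.predict_proba(X_test)  # one column per class, ordered as clf.classes_
```

Each row of proba holds the estimated class probabilities for one test sample and sums to 1.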
matplotlib error: x and y must be the same size
How can I fix the 'ValueError: x and y must be the same size' error? The idea of the code is to apply a multivariate linear regression model to data from different temperature and NO sensors, in order to train the model, see how the results correlate with each other, and look at the prediction as a whole.
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
# Names of the files
filename = 'NORM_AC_HAE.csv'
file = 'NORM_NABEL_HAE_lev1.csv'
# Read the data
data=pd.read_csv(filename)
data_other=pd.read_csv(file)
col = ['Aircube.009.0.no.we.aux.ch6', 'Aircube.009.0.sht.temperature.ch1']
X = data.loc[:, col]
Y = data_other.loc[:,'NO.ppb']
# Fitting the Linear Regression to the training set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, train_size = 0.6, random_state = np.random.seed(0))
mlr = LinearRegression()
mlr.fit(X_train, y_train)
# Visualization of the test set results
plt.figure(2)
plt.scatter(y_test, X_test) #The VALUE ERROR appears here
The error is:
Traceback (most recent call last):
File "C:\Users\andre\Desktop\UV\4o\TFG\EMPA\dataset_Mila\MLR_no_temp_hae_no.py", line 65, in <module>
plt.scatter(y_test, X_test)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\pyplot.py", line 2864, in scatter
is not None else {}), **kwargs)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\__init__.py", line 1810, in inner
return func(ax, *args, **kwargs)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\axes\_axes.py", line 4182, in scatter
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
[Finished in 6.9s]
X_test.shape = [36648 rows x 2 columns]
Both data arguments in plt.scatter (here y_test and X_test) must be 1-dimensional arrays; from the docs:
x, y : array_like, shape (n, )
Here you attempt to pass a 2-dimensional matrix for X_test, hence the error about different sizes. You cannot get a scatter plot of a matrix against an array/vector; what you could do is produce two separate scatter plots, one for each column in your X_test:
plt.figure(2)
plt.scatter(y_test, X_test.iloc[:,0].values)
plt.figure(3)
plt.scatter(y_test, X_test.iloc[:,1].values)
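The same idea can be sketched as a loop over the columns, drawing one panel per feature. The random data and the column names no_aux and temperature below are placeholders (assumptions) for the real sensor columns, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data with the same layout as in the question:
# a two-column feature frame and a one-dimensional target.
rng = np.random.default_rng(0)
X_test = pd.DataFrame(
    {"no_aux": rng.normal(size=50), "temperature": rng.normal(size=50)}
)
y_test = pd.Series(rng.normal(size=50), name="NO.ppb")

# One scatter plot per feature column, each against the same target.
fig, axes = plt.subplots(1, X_test.shape[1], figsize=(8, 3))
for ax, column in zip(axes, X_test.columns):
    ax.scatter(y_test, X_test[column])  # both arguments are 1-D with equal length
    ax.set_xlabel(y_test.name)
    ax.set_ylabel(column)
fig.tight_layout()
```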
python, sklearn: 'dict' object is not callable using GridSearchCV and SVC
I'm trying to use GridSearchCV to optimize the parameters for the classifier svm.SVC (both from sklearn).
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import numpy as np
X_train = np.array([[1,2],[3,4],[5,6],[2,3],[9,4],[4,5],[2,7],[1,0],[4,7],[2,9]])
Y_train = np.array([0,1,0,1,0,0,1,1,0,1])
X_test = np.array([[2,4],[5,3],[7,1],[2,4],[6,4],[2,7],[9,2],[7,5],[1,6],[0,3]])
Y_test = np.array([1,0,0,0,1,0,1,1,0,0])
parameters = {'kernel':['rbf'],'C':np.linspace(10,100,10)}
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
clf1.fit(X_train, Y_train)
cm = confusion_matrix(Y_test, clf1.predict(X_test))
bp = clf1.best_params_
The output shows it completing GridSearchCV, but then it throws the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 479, in runfile
execfile(filename, namespace)
File "I:\setup\Desktop\Stats\FinalProject.py", line 112, in <module>
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
TypeError: 'dict' object is not callable
When I run the code you posted:
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
import numpy as np
X_train = np.array([[1,2],[3,4],[5,6]])
Y_train = np.array([0,1,0])
X_test = np.array([[2,4],[5,3],[7,1]])
Y_test = np.array([1,0,0])
parameters = {'kernel':['rbf'],'C':np.linspace(10,100,10)}
clf1 = GridSearchCV(SVC(), parameters, verbose = 10)
clf1.fit(X_train, Y_train)
cm = confusion_matrix(Y_test, clf1.predict(X_test))
bp = clf1.best_params_
I get this error:
File "C:\Anaconda\lib\site-packages\sklearn\svm\base.py", line 447, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
The training data consist of 3 samples, and GridSearchCV breaks the data into 3 folds (BTW, you can control this with the parameter cv), e.g.:
fold1 = [1,2] , label1 = 0
fold2 = [3,4] , label2 = 1
fold3 = [5,6] , label3 = 0
Now, in some iteration, it takes the first and the third folds to train on, and the second fold is used for validation. Please note that these training folds contain only one type of label (the label 0), hence the error it prints.
If I create the data in this manner:
X, Y = datasets.make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=2, n_classes=2)
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X,Y, test_size =0.2)
it runs just fine. I guess you have some other problem, but for the code you posted, this is the error it produces.
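For readers on a recent scikit-learn, the same experiment can be sketched with the current module layout. This assumes a version in which GridSearchCV and train_test_split live in sklearn.model_selection; the sample size, random_state values, stratify and cv=3 are choices made here, not taken from the posts above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Enough samples of both classes that every training fold sees both labels.
X, Y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=2,
    n_classes=2, random_state=0,
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y
)

parameters = {"kernel": ["rbf"], "C": np.linspace(10, 100, 10)}
clf1 = GridSearchCV(SVC(), parameters, cv=3)  # cv sets the number of folds
clf1.fit(X_train, Y_train)

cm = confusion_matrix(Y_test, clf1.predict(X_test), labels=[0, 1])
bp = clf1.best_params_
```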