I'm quite new to the ML python environment, I need to plot the precision/recall graph, as stated in this post: [https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html][1] you need to compute the y_score :
# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)
So the question is: how can I compute the score using Multinomial NaiveBayes or LearningTree? In my code I have:
print("MultinomialNB - countVectorizer")
xTrain, xTest, yTrain, yTest=countVectorizer(db)
classifier = MultinomialNB()
model = classifier.fit(xTrain, yTrain)
yPred = model.predict(xTest)
print("confusion Matrix of MNB/ cVectorizer:\n")
print(confusion_matrix(yTest, yPred))
print("\n")
print("classificationReport Matrix of MNB/ cVectorizer:\n")
print(classification_report(yTest, yPred))
elapsed_time = time.time() - start_time
print("elapsed Time: %.3fs" %elapsed_time)
Plot function:
def plotLearningAlgorithm(yTest,yScore,algName):
precision, recall, _ = precision_recall_curve(yTest, yScore)
plt.step(recall, precision, color='b', alpha=0.2,
where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall'+ algName +'curve: AP={0:0.2f}'.format(average_precision))
Error with plot:
<ipython-input-43-d07c3365bfc2> in MultinomialNaiveBayesOPT()
11 yPred = model.predict(xTest)
12
---> 13 plotLearningAlgorithm(yTest,model.predict_proba(xTest),"MultinomialNB - countVectorizer")
14
15 print("confusion Matrix of MNB/ cVectorizer:\n")
<ipython-input-42-260aac9918f2> in plotLearningAlgorithm(yTest, yScore, algName)
1 def plotLearningAlgorithm(yTest,yScore,algName):
2
----> 3 precision, recall, _ = precision_recall_curve(yTest, yScore)
4
5 step_kwargs = ({'step': 'post'}
/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/ranking.py in precision_recall_curve(y_true, probas_pred, pos_label, sample_weight)
522 fps, tps, thresholds = _binary_clf_curve(y_true, probas_pred,
523 pos_label=pos_label,
--> 524 sample_weight=sample_weight)
525
526 precision = tps / (tps + fps)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
398 check_consistent_length(y_true, y_score, sample_weight)
399 y_true = column_or_1d(y_true)
--> 400 y_score = column_or_1d(y_score)
401 assert_all_finite(y_true)
402 assert_all_finite(y_score)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
758 return np.ravel(y)
759
--> 760 raise ValueError("bad input shape {0}".format(shape))
761
762
ValueError: bad input shape (9000, 2)
Where db contains my dataset already divided between train set and test set.
Any suggestions?
Solution:
def plot_pr(y_pred,y_true,l):
precision, recall, thresholds = precision_recall_curve(y_true, y_pred,pos_label=l)
return precision,recall
def plotPrecisionRecall(xTest,yTest,yPred,learningName,model):
yPred_probability = model.predict_proba(xTest)
yPred_probability = yPred_probability[:,1];
no_skill_probs = [0 for _ in range(len(yTest))]
ns_precision,ns_recall,_=precision_recall_curve(yTest,no_skill_probs,pos_label="L")
precision, rec= plot_pr(yPred_probability,yTest,"L");
plt.title(learningName)
plt.plot(ns_recall,ns_precision,linestyle='--',label='No Skill')
plt.plot(rec,precision,Label='Skill')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
So as It turns out the y_Pred needed to be transformed with:
yPred_probability = yPred_probability[:,1];
So big thank you to #ignoring_gravity to providing me to the right solution, I've also printed the no-skill line to extra readability to the graph.
What they call y_score is just the predicted probabilities outputted by your ML algorithm.
In multinomial nb and in a decision tree (I suppose that's what you mean by LearningTree?), you can do this with the method .predict_proba:
classifier = MultinomialNB()
model = classifier.fit(xTrain, yTrain)
yPred = model.predict_proba(xTest)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# countvectorizer is not used for train and test split, instead use train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.33, random_state=42) # here x is going to be your textual data, whereas y will be your target
countMatrix_train = vect.fit_transform(train_x) # you have to fit with your train data
countMatrix_test = vect.transform(test_x) # now have to transform( and not fit_transform) according to your train data
classifier = MultinomialNB()
classifier.fit(countMatrix_train, train_y)
ypred = classifier.predict(countMatrix_test) # this will give you class for your test data, now use this for making classification report
Related
I wrote a function to find the best combination of given dataframe features, f1 score, and auc score using LogisticRegression. The problem is that when I try to pass a list of dataframes combinations, using itertools combinations, LogisticRegression doesn't recognize each combination as its own X variable/ dataframe.
I'm starting with a dataframe of 10 feature columns and 10k rows. When I run the below code I get a "ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input".
def find_best_combination(X, y):
#initialize variables
best_f1 = 0
best_auc = 0
best_variables = []
# get all possible combinations of variables
for i in range(1, X.shape[1]):
for combination in combinations(X.columns, i):
X_subset = X[list(combination)]
logreg = LogisticRegression()
logreg.fit(X_subset, y)
y_pred = logreg.predict(X_subset)
f1 = f1_score(y, y_pred)
auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
# evaluate performance on current combination of variables
if f1> best_f1 and auc > best_auc:
best_f1 = f1
best_auc = auc
best_variables = combination
return best_variables, best_f1, best_auc
and the error
C:\Users\katurner\Anaconda3\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- IBE1273_01_11.0
- IBE1273_01_6.0
- IBE7808
- IBE8439_2.0
- IBE8557_7.0
- ...
warnings.warn(message, FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\2\ipykernel_15932\895415673.py in <module>
----> 1 best_combo = ml.find_best_combination(X,lg_y)
2 best_combo
~\Documents\Arcadia\modeling_library.py in find_best_combination(X, y)
176 # print(y_test)
177 f1 = f1_score(y, y_pred)
--> 178 auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
179 # evaluate performance on current combination of variables
180 if f1> best_f1 and auc > best_auc:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py in predict_proba(self, X)
1309 )
1310 if ovr:
-> 1311 return super()._predict_proba_lr(X)
1312 else:
1313 decision = self.decision_function(X)
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in _predict_proba_lr(self, X)
459 multiclass is handled by normalizing that over all classes.
460 """
--> 461 prob = self.decision_function(X)
462 expit(prob, out=prob)
463 if prob.ndim == 1:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
427 check_is_fitted(self)
428
--> 429 X = self._validate_data(X, accept_sparse="csr", reset=False)
430 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
431 return scores.ravel() if scores.shape[1] == 1 else scores
~\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
598
599 if not no_val_X and check_params.get("ensure_2d", True):
--> 600 self._check_n_features(X, reset=reset)
601
602 return out
~\Anaconda3\lib\site-packages\sklearn\base.py in _check_n_features(self, X, reset)
398
399 if n_features != self.n_features_in_:
--> 400 raise ValueError(
401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input.
I'm xpecting the function to return a combination of best_variables, and accociated best_f1, best_auc.
I've also tried running the function using train, test, split. When I add train, test, split into the below code the function does run but returns "[], 0, 0" for best_variables, best_f1, best_auc.
def find_best_combination(X, y):
#initialize variables
best_f1 = 0
best_auc = 0
best_variables = []
# get all possible combinations of variables
for i in range(1, X.shape[1]):
for combination in combinations(X.columns, i):
X_subset = X[list(combination)]
X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, stratify=y, random_state=73)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
# evaluate performance on current combination of variables
if f1> best_f1 and auc > best_auc:
best_f1 = f1
best_auc = auc
best_variables = combination
return best_variables, best_f1, best_auc
I'm not sure what's going on under the hood of train, test, split that enables the function to iterate through and not error like before.
I hope this explains it enough. Thanks in advance for any help.
I am trying to plot a decision boundary plot using some variables. I know that I need to split the data into test and training data and then plot the decision boundary, that can be done using the x_train, x_test line, however I seem to encounter a problem. Below I will show my errors as well.
def LogReg2D0732():
GDP = np.array(GDPArr) # Temp
Support = np.array(SupportArr) #Income
Score = np.array(list(map(ScoreConvert, ScoreArr))) #Sales
X = GDP
X = np.vstack((X,Support)) # var x samples
y = Score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# use sci-kit learn
clf = LogisticRegression(fit_intercept=True, C = 1e15) # instantiate
clf.fit(X.T, y) # do the fit
print("Coef ", clf.intercept_, clf.coef_) # print some attributes
xx, yy = np.mgrid[np.min(GDP):np.max(GDP):0.01, np.min(Support):np.max(Support):0.01]
gridxy = np.c_[xx.ravel(), yy.ravel()]
probs = clf.predict_proba(gridxy)[:,1].reshape(xx.shape)
f, ax = plt.subplots(figsize=(12, 8))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu", vmin=0, vmax=1)
ax_c = f.colorbar(contour) # for the probability range
ax_c.set_ticks([0, 1/4, 1/2, 3/4, 1]) # and values
idx = np.where(y==1); idx = np.reshape(idx,np.shape(idx)[1]) # tuple>list
y1 = X[:,idx]
idx = np.where(y==0); idx = np.reshape(idx,np.shape(idx)[1]) # tuple>list
y0 = X[:,idx]
ax.scatter(y1[0,:], y1[1,:], c='cyan') # light blue
ax.scatter(y0[0,:], y0[1,:], c='pink')
plt.title('GDP against Social support')
plt.xlabel('GDP')
plt.ylabel('Social Support')
plt.savefig('LR2D.svg')
plt.show()
# compute AUC
probsa = clf.predict_proba(X.T)
probsa = probsa[:, 1] # only the positive samples
auc = roc_auc_score(y, probsa)
print("Area under curve:", auc)
#fpr, tpr, thresholds = roc_curve(y, probsa)
fpr, tpr, thresholds = roc_curve(y, probsa, drop_intermediate=False)
plt.plot(fpr, tpr, color='sienna', label='ROC')
plt.plot([0, 1], [0, 1], color='seagreen', linestyle='dotted') # no skill
plt.xlabel('False Positive Rate, Specificity')
plt.ylabel('True Positive Rate, Sensitivity')
plt.title('Receiver Operating Characteristic (ROC) Plot')
plt.legend()
plt.savefig('LRauc.svg')
plt.show()
LogReg2D0732()#Logistic Regression
ValueError Traceback (most recent call last)
<ipython-input-21-31361159fc28> in <module>
48 plt.savefig('LRauc.svg')
49 plt.show()
---> 50 LogReg2D0732()#Logistic Regression
<ipython-input-21-31361159fc28> in LogReg2D0732()
8 y = Score
9
---> 10 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
11
12 # use sci-kit learn
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2116 raise TypeError("Invalid parameters passed: %s" % str(options))
2117
-> 2118 arrays = indexable(*arrays)
2119
2120 n_samples = _num_samples(arrays[0])
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
246 """
247 result = [_make_indexable(X) for X in iterables]
--> 248 check_consistent_length(*result)
249 return result
250
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
210 if len(uniques) > 1:
211 raise ValueError("Found input variables with inconsistent numbers of"
--> 212 " samples: %r" % [int(l) for l in lengths])
213
214
ValueError: Found input variables with inconsistent numbers of samples: [2, 156]
I also do not quite understand the error which says inconsistent number of samples[2,156]. Do let me know how to solve this issue.
The error is happening during the creation of the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The error means the shape of the arrays is not the same. Check this link for similar issues.
I have a dataset with X.shape (104481, 34) and y.shape (104481,), and I want to train an SVM model on it.
The steps I do are (1) Split data, (2) Scale data, and (3) Train SVM:
(1) Split data:
Function:
from sklearn.model_selection import train_test_split
def split_data(X,y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = split_data_set.split_data(X,y)
The 4 classes are the following. The data set is quite imbalanced, but that is an issue for later.
y_train.value_counts()
out:
Status_9_Substatus_8 33500
Other 33500
Status_62_Substatus_7 2746
Status_62_Substatus_30 256
Name: Status, dtype: int64
y_test.value_counts()
out:
Status_9_Substatus_8 16500
Other 16500
Status_62_Substatus_7 1352
Status_62_Substatus_30 127
Name: Status, dtype: int64
(2) Scale data:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape)
print(y_train.shape)
(3) Train and predict with SVM:
svm_method.get_svm_model(X_train_scaled, X_test_scaled, y_train, y_test)
Calling this method:
def get_svm_model(X_train, X_test, y_train, y_test):
print('Loading...')
print('Training...')
svm, y_train_pred, y_test_pred = train_svm_model(X_train,y_train, X_test)
print('Training Complete')
print('Plotting Confusion Matrix...')
performance_measure.plot_confusion_matrix(y_test,y_test_pred, normalize=True)
print('Plotting Performance Measure...')
performance_measure.get_performance_measures(y_test, y_test_pred)
return svm
Which calls this method:
def train_svm_model(X_train,y_train, X_test):
#
svm = SVC(kernel='poly', gamma='auto', random_state=12)
# Fitting the model
svm.fit(X_train, y_train)
# Predicting values
y_train_pred = svm.predict(X_train)
y_test_pred = svm.predict(X_test)
return svm, y_train_pred, y_test_pred
The resulting '''Output''' is this screenshot.
What is strange is that there are samples from all four classes present (since I used the stratify parameter when calling train_test_split), however, it looks like some of the classes disappear?
The SVM and confusion matrix functions worked well with a toy data set:
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.DataFrame(data.target)
y = np.array(y)
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
get_svm_model(X_train, X_test, y_train, y_test)
Any idea what is going on here?
Thanks in advance.
The CM code:
def plot_confusion_matrix(y_true, y_pred,
normalize=False,
title=None,
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if not title:
if normalize:
title = 'Normalized confusion matrix'
else:
title = 'Confusion matrix, without normalization'
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Only use the labels that appear in the data
#classes = classes[unique_labels(y_true, y_pred)]
classes = unique_labels(y_pred)
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
ax.figure.colorbar(im, ax=ax)
# We want to show all ticks...
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=classes, yticklabels=classes,
title=title,
ylabel='True label',
xlabel='Predicted label')
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
ax.text(j, i, format(cm[i, j], fmt),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()
plt.show()
return ax
Your confusion matrix is not zero:
If we look at this on the x-axis you have the predicted label and on the y-axis the true labels, lets have a look at the third row from the top:
0.94: 0.94 of the true label: Status_62_Substatus_7 are predicted as class other, which is wrong
0.00 of the same true label are predicted also wrong
0.00 of the same true label are predicted wrong (this should be the correct predict value (higher is better)
0.06 are predicted again wrong
Is your problem is so imbalanced you just have 0 predictions for two labels.
I'm using anaconda-navigator -> python3.6
when I run the following code I get this error:
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.grid_search import GridSearchCV
def fit_model(X, y):
""" Performs grid search over the 'max_depth' parameter for a
decision tree regressor trained on the input data [X, y]. """
# Create cross-validation sets from the training data
# sklearn version 0.18: ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)
# sklearn versiin 0.17: ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None)
cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)
# TODO: Create a decision tree regressor object
regressor = DecisionTreeRegressor()
# TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
params = {'max_depth':range(1,10)}
# TODO: Transform 'performance_metric' into a scoring function using 'make_scorer'
scoring_fnc = make_scorer(performance_metric)
# TODO: Create the grid search cv object --> GridSearchCV()
# Make sure to include the right parameters in the object:
# (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
grid = GridSearchCV(regressor, params, scoring_fnc, cv=cv_sets)
# Fit the grid search object to the data to compute the optimal model
grid = grid.fit(X, y)
# Return the optimal model after fitting the data
return grid.best_estimator_`
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)
# Produce the value for 'max_depth'
print ("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))
Here are the error messages:
ValueError Traceback (most recent call last)
<ipython-input-12-05857a84a7c5> in <module>()
1 # Fit the training data to the model using grid search
----> 2 reg = fit_model(X_train, y_train)
3
4 # Produce the value for 'max_depth'
5 print ("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))
<ipython-input-11-2c0c19498236> in fit_model(X, y)
26 # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
27
---> 28 grid = GridSearchCV(regressor, params, scoring_fnc, cv=cv_sets)
29
30 # Fit the grid search object to the data to compute the optimal model
~/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py in __init__(self, estimator, param_grid, scoring, fit_params, n_jobs, iid, refit, cv, verbose, pre_dispatch, error_score)
819 refit, cv, verbose, pre_dispatch, error_score)
820 self.param_grid = param_grid
--> 821 _check_param_grid(param_grid)
822
823 def fit(self, X, y=None):
~/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py in _check_param_grid(param_grid)
349 if True not in check:
350 raise ValueError("Parameter values for parameter ({0}) need "
--> 351 "to be a sequence.".format(name))
352
353 if len(v) == 0:
ValueError: Parameter values for parameter (max_depth) need to be a sequence.
grid_search.py checks this:
check = [isinstance(v, k) for k in (list, tuple, np.ndarray)]
It seems like you can't use a range. I would try this:
params = {'max_depth': np.arange(1,10)}
or without numpy:
params = {'max_depth': [x for x in range(1,10)]}
I want to run several regression types (Lasso, Ridge, ElasticNet and SVR) on a dataset with around 5,000 rows and 6 features. Linear regression. Use GridSearchCV for cross validation. The code is extensive but here are some critical parts:
def splitTrainTestAdv(df):
y = df.iloc[:,-5:] # last 5 columns
X = df.iloc[:,:-5] # Except for last 5 columns
#Scaling and Sampling
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
return X_train, X_test, y_train, y_test
def performSVR(x_train, y_train, X_test, parameter):
C = parameter[0]
epsilon = parameter[1]
kernel = parameter[2]
model = svm.SVR(C = C, epsilon = epsilon, kernel = kernel)
model.fit(x_train, y_train)
return model.predict(X_test) #prediction for the test
def performRidge(X_train, y_train, X_test, parameter):
alpha = parameter[0]
model = linear_model.Ridge(alpha=alpha, normalize=True)
model.fit(X_train, y_train)
return model.predict(X_test) #prediction for the test
MODELS = {
'lasso': (
linear_model.Lasso(),
{'alpha': [0.95]}
),
'ridge': (
linear_model.Ridge(),
{'alpha': [0.01]}
),
)
}
def performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train):
print("# Tuning hyper-parameters for %s" % feature)
print()
model, param_grid = MODELS[model_name]
gs = GridSearchCV(model, param_grid, n_jobs= 1, cv=5, verbose=1, scoring='%s_weighted' % feature)
gs.fit(X_train, y_train)
print("Best parameters set found on development set:")
print(gs.best_params_)
print()
print("Grid scores on development set:")
print()
for params, mean_score, scores in gs.grid_scores_:
print("%0.3f (+/-%0.03f) for %r"
% (mean_score, scores.std() * 2, params))
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
y_true, y_pred = y_test, gs.predict(X_test)
print(classification_report(y_true, y_pred))
soil = pd.read_csv('C:/training.csv', index_col=0)
soil = getDummiedSoilDepth(soil)
np.random.seed(2015)
soil = shuffleData(soil)
soil = soil.drop('Depth', 1)
X_train, X_test, y_train, y_test = splitTrainTestAdv(soil)
scores = ['precision', 'recall']
for score in scores:
for model in MODELS.keys():
print '####################'
print model, score
print '####################'
performParameterSelection(model, score, X_test, y_test, X_train, y_train)
You can assume that all required imports are done
I am getting this error and do not know why:
ValueError Traceback (most recent call last)
in ()
18 print model, score
19 print '####################'
---> 20 performParameterSelection(model, score, X_test, y_test, X_train, y_train)
21
<ipython-input-27-304555776e21> in performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train)
12 # cv=5 - constant; verbose - keep writing
13
---> 14 gs.fit(X_train, y_train) # Will get grid scores with outputs from ALL models described above
15
16 #pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\metrics\classification.pyc in _check_targets(y_true, y_pred)
90 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
91 "multilabel-sequences"]):
---> 92 raise ValueError("{0} is not supported".format(y_type))
93
94 if y_type in ["binary", "multiclass"]:
ValueError: continuous-multioutput is not supported
I am still very new to Python and this error puzzles me. This should not because I have 6 features, of course. I tried to follow standard buil-in functions.
Please, help
First let's replicate the problem.
First import the libraries needed:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.grid_search import GridSearchCV
Then create some data:
df = pd.DataFrame(np.random.rand(5000,11))
y = df.iloc[:,-5:] # last 5 columns
X = df.iloc[:,:-5] # Except for last 5 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
Now we can replicate the error and also see options which do not replicate the error:
This runs OK
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1)
print gs.fit(X_train, y_train)
This does not
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1, scoring='recall')
gs.fit(X_train, y_train)
and indeed the error is exactly as you have above; 'continuous multi-output is not supported'.
If you think about the recall measure, it is to do with binary or categorical data - about which we can define things like false positives and so on. At least in my replication of your data, I used continuous data and recall simply is not defined. If you use the default score it works, as you can see above.
So you probably need to look at your predictions and understand why they are continuous (i.e. use a classifier instead of regression). Or use a different score.
As an aside, if you run the regression with only one set of (column of) y values, you still get an error. This time it says more simply 'continuous output is not supported', i.e. the issue is using recall (or precision) on continuous data (whether or not it is multi-output).
The end goal is to evaluate the performance of the model, you can use the model.evaluate method:
_,accuracy = model.evaluate(our_data_feat, new_label2,verbose=0.0)
print('Accuracy:%.2f'%(accuracy*100))
This will give you the accuracy value.
Make sure you have single series for the dependent variable. Properly split your data in train_test_split.