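I'm using CatBoost for my model, and the following code raises the error shown below:
import numpy as np
from catboost import Pool, CatBoostClassifier, cv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split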
#Split data
train = data[:split]
test = data[split:]
# Get variables for a model
x = train.drop(["Survived"], axis=1)
y = train["Survived"]
#Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
cat_features = np.where(x.dtypes != float)[0]
cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy')
cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)
pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
...which raises:
---> 23 cv_scores = cv(pool, cat.get_params(), fold_count=10)
CatBoostError: catboost/libs/metrics/metric.cpp:5069: Cannot calc
metric which requires logits for absolute values.
Help would be appreciated. Thanks.
You need to add the parameter loss_function='Logloss' inside CatBoostClassifier(). This issue is described here and was supposedly fixed, but it has now appeared again. I will reopen the issue because this is clearly a bug.
I'm trying to do some machine learning. The error appears on line 14. Any help is hugely appreciated.
# Data to be predicted
X_to_be_predicted = my_df[my_df.Survived.isnull()]
X_to_be_predicted = X_to_be_predicted.drop(['Survived'], axis = 1)
# X_to_be_predicted[X_to_be_predicted.Age.isnull()]
# X_to_be_predicted.dropna(inplace = True) # 417 x 27
#Training data
train_data = my_df
train_data = train_data.dropna()
feature_train = train_data['Survived']
label_train = train_data.drop(['Survived'], axis = 1)
##Gaussian
clf = GaussianNB()
x_train, x_test, y_train, y_test = train_test_split(label_train, feature_train, test_size=0.2)
clf.fit(x_train, np.ravel(y_train))
print("NB Accuracy: " + repr(round(clf.score(x_test, y_test) * 100, 2)) + "%")
result_rf = cross_val_score(clf, x_train, y_train, cv=10, scoring = 'accuracy')
print('The cross validated score for Random forest is:', round(result_rf.mean()*100,2))
y_pred = cross_val_predict(clf,x_train, y_train, cv=10)
sns.heatmap(confusion_matrix(y_train, y_pred), annot=True, fmt = '3.0f', cmap = "summer")
plt.title('Confusion_matrix for NB', y=1.05, size=15)
Please, what does this mean? I have been trying to deploy my model, but this keeps coming up when I run it.
Please specify the error you are getting. I think the error comes from this line:
clf.fit(x_train, np.ravel(y_train))
Try this instead: clf.fit(x_train, y_train)
You have already assigned the inputs to x_train and y_train, so all that remains is to fit the model on the training data.
I hope it helps.
I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new test data is a 2-D array with 2 values, just like the test data, but I still get a ValueError.
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
    'max_depth': 4,
    'eta': 0.3,
    'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the test below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
print(model.predict(test_individual))
Any idea what I'm doing wrong?
Seems like you are missing out on the basics of using the train_test_split function of sklearn.
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
The call above expects its first argument (train) to hold all the features used for training and its second argument (test) to hold the target.
Try fixing that first.
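For reference, train_test_split returns its pieces in the order train features, test features, train labels, test labels. A minimal sketch on made-up data (the array contents are placeholders, not the Kaggle data from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 feature columns, binary target.
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1] * 5)

# Order of the returned values: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

If the names are unpacked in the swapped order used in the question, the variable called X_train ends up holding the 20% test portion.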
I am running a random forest model with the scikit-learn library, and when assessing its accuracy using mean absolute error I get a ValueError:
"Found input variables with inconsistent numbers of samples".
The pandas dataframe I used comes from a real estate csv file that gets modified after I've added some missing values. Please see the code below.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
X_train_plus = X_train.copy()
X_valid_plus = X_train.copy()
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull()
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull()
my_imputer = SimpleImputer()
imp_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imp_X_valid_plus = pd.DataFrame(my_imputer.fit_transform(X_valid_plus))
imp_X_train_plus.columns = X_train_plus.columns
imp_X_valid_plus.columns = X_valid_plus.columns
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion="mae", random_state=0)
model_4 = RandomForestRegressor(n_estimators=100, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
def score_model(model, imp_X_train_plus, y_train, imp_X_valid_plus, y_valid):
    model.fit(imp_X_train_plus, y_train)
    pred = model.predict(imp_X_valid_plus)
    return mean_absolute_error(y_valid, pred)
for i in range(0, len(models)):
    mae = score_model(models[i], imp_X_train_plus, y_train, imp_X_valid_plus, y_valid)
    print("Model %d with extended imputed has a MAE: %d" % (i+1, mae))
I expect the output to be something similar to this:
"Model 1 with extended imputed has a MAE: 345237"
But instead I get the following ValueError when the return statement calls mean_absolute_error inside the score_model function:
"ValueError: Found input variables with inconsistent numbers of samples: [2716, 10864]"
I think the mistake might be in the imp_X_train_plus and imp_X_valid_plus variables; however, I have run a very similar model that uses the same dataframes and it works fine.
It might be because of your third line: X_valid_plus = X_train.copy().
Did you mean to do: X_valid_plus = X_valid.copy()?
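A small sketch of why that one line would produce the error, using made-up data (the column name, values, and the fake_predict helper are assumptions for illustration only). Copying X_train into the validation frame yields one prediction per training row, which no longer matches the length of y_valid:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one feature column.
X = pd.DataFrame({"area": np.arange(10, dtype=float)})
y = pd.Series(np.arange(10, dtype=float) * 2.0)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

wrong_valid = X_train.copy()  # the bug: 8 rows, but y_valid has 2
right_valid = X_valid.copy()  # the fix: 2 rows, same as y_valid


def fake_predict(frame):
    """Stand-in for model.predict: one prediction per input row."""
    return frame["area"].to_numpy() * 2.0


try:
    mean_absolute_error(y_valid, fake_predict(wrong_valid))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

mae = mean_absolute_error(y_valid, fake_predict(right_valid))
print(mismatch_raised, mae)
```

The two numbers in the original message, [2716, 10864], are consistent with this pattern: 2716 validation labels against 10864 predictions made from a copy of the training frame.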
I want to run several regression models (Lasso, Ridge, ElasticNet and SVR) on a dataset with around 5,000 rows and 6 features, and use GridSearchCV for cross-validation. The code is extensive, but here are some critical parts:
def splitTrainTestAdv(df):
    y = df.iloc[:,-5:] # last 5 columns
    X = df.iloc[:,:-5] # Except for last 5 columns
    #Scaling and Sampling
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
    return X_train, X_test, y_train, y_test
def performSVR(x_train, y_train, X_test, parameter):
    C = parameter[0]
    epsilon = parameter[1]
    kernel = parameter[2]
    model = svm.SVR(C = C, epsilon = epsilon, kernel = kernel)
    model.fit(x_train, y_train)
    return model.predict(X_test) #prediction for the test
def performRidge(X_train, y_train, X_test, parameter):
    alpha = parameter[0]
    model = linear_model.Ridge(alpha=alpha, normalize=True)
    model.fit(X_train, y_train)
    return model.predict(X_test) #prediction for the test
MODELS = {
'lasso': (
linear_model.Lasso(),
{'alpha': [0.95]}
),
'ridge': (
linear_model.Ridge(),
{'alpha': [0.01]}
),
)
}
def performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train):
    print("# Tuning hyper-parameters for %s" % feature)
    print()
    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs= 1, cv=5, verbose=1, scoring='%s_weighted' % feature)
    gs.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print(gs.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in gs.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    y_true, y_pred = y_test, gs.predict(X_test)
    print(classification_report(y_true, y_pred))
soil = pd.read_csv('C:/training.csv', index_col=0)
soil = getDummiedSoilDepth(soil)
np.random.seed(2015)
soil = shuffleData(soil)
soil = soil.drop('Depth', 1)
X_train, X_test, y_train, y_test = splitTrainTestAdv(soil)
scores = ['precision', 'recall']
for score in scores:
    for model in MODELS.keys():
        print '####################'
        print model, score
        print '####################'
        performParameterSelection(model, score, X_test, y_test, X_train, y_train)
You can assume that all required imports are done.
I am getting this error and do not know why:
ValueError Traceback (most recent call last)
in ()
18 print model, score
19 print '####################'
---> 20 performParameterSelection(model, score, X_test, y_test, X_train, y_train)
21
<ipython-input-27-304555776e21> in performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train)
12 # cv=5 - constant; verbose - keep writing
13
---> 14 gs.fit(X_train, y_train) # Will get grid scores with outputs from ALL models described above
15
16 #pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
C:\Users\Tony\Anaconda\lib\site-packages\sklearn\metrics\classification.pyc in _check_targets(y_true, y_pred)
90 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
91 "multilabel-sequences"]):
---> 92 raise ValueError("{0} is not supported".format(y_type))
93
94 if y_type in ["binary", "multiclass"]:
ValueError: continuous-multioutput is not supported
I am still very new to Python and this error puzzles me. It should not be because I have 6 features, of course. I tried to follow the standard built-in functions.
Please help.
First let's replicate the problem.
First import the libraries needed:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.grid_search import GridSearchCV
Then create some data:
df = pd.DataFrame(np.random.rand(5000,11))
y = df.iloc[:,-5:] # last 5 columns
X = df.iloc[:,:-5] # Except for last 5 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
Now we can replicate the error and also see options which do not replicate the error:
This runs OK
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1)
print gs.fit(X_train, y_train)
This does not
gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1, scoring='recall')
gs.fit(X_train, y_train)
and indeed the error is exactly as you have above; 'continuous multi-output is not supported'.
If you think about the recall measure, it is to do with binary or categorical data - about which we can define things like false positives and so on. At least in my replication of your data, I used continuous data and recall simply is not defined. If you use the default score it works, as you can see above.
So you probably need to look at your predictions and understand why they are continuous (i.e. use a classifier instead of regression). Or use a different score.
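As one possible "different score", a regression metric such as mean absolute error accepts continuous targets. The sketch below is only an illustration: it uses the current scikit-learn import path (sklearn.model_selection) rather than the older modules above, random data standing in for the real dataset, and 'neg_mean_absolute_error' as an arbitrary example of a regression scorer:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 6)  # 6 features, random placeholder values
y = rng.rand(200, 5)  # 5 continuous target columns

gs = GridSearchCV(
    linear_model.Lasso(),
    {"alpha": [0.95]},
    cv=5,
    scoring="neg_mean_absolute_error",  # a regression scorer
)
gs.fit(X, y)

print(gs.best_params_)
print(gs.best_score_)  # negated MAE, so closer to 0 is better
```

The score is negated because GridSearchCV always maximizes, so a smaller error shows up as a larger (less negative) number.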
As an aside, if you run the regression with only one set of (column of) y values, you still get an error. This time it says more simply 'continuous output is not supported', i.e. the issue is using recall (or precision) on continuous data (whether or not it is multi-output).
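The target type named in the error message can be inspected directly. A hedged sketch with placeholder data, using scikit-learn's type_of_target helper and calling recall_score by hand to show the same kind of ValueError:

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.utils.multiclass import type_of_target

rng = np.random.RandomState(0)
y_multi = rng.rand(20, 5)         # several continuous columns
y_single = y_multi[:, 0]          # one continuous column
y_binary = np.array([0, 1] * 10)  # class labels

kinds = [type_of_target(t) for t in (y_multi, y_single, y_binary)]
print(kinds)

# Recall is a classification metric, so continuous input is rejected.
try:
    recall_score(y_single, y_single)
    raised = False
except ValueError:
    raised = True

print(raised, recall_score(y_binary, y_binary))
```

Only the label-like array is something recall or precision can be computed on; both continuous variants are rejected, whether single or multi-output.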
If the end goal is to evaluate the performance of the model, you can use the model.evaluate method:
_,accuracy = model.evaluate(our_data_feat, new_label2,verbose=0.0)
print('Accuracy:%.2f'%(accuracy*100))
This will give you the accuracy value.
Make sure you have a single series for the dependent variable, and split your data properly in train_test_split.