I'm trying to learn the basics of XGBoost and devises a script that splits some data I found on Kaggle about Corona virus outbreaks in China. The code and model work, but some some reason when I use the model to make a new prediction I get a "ValueError: feature_names mismatch." The new test data has a 2-d array with 2 values, just like the test data, but I still get a value error.
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
'max_depth':4,
'eta':0.3,
'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the terst below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
print(model.predict(test_individual))
Any idea what I'm doing wrong?
Seems like you are missing out on the basics of using the train_test_split function of sklearn.
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
The line above expects the train to have all the features to be used for training, while the test expects the target feature.
Try fixing that first.
Related
I'm new in this field and I'm currently working with gene expression data. I have to do a classification where my data are Counts under matrix form. The features are the genes and the Samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following :
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now I actually know how to use Loocv on Python as I know how to use OVO by looking online. But I dont get what is mneant to be done here. I tried an attempt and results came out quite similar but im pretty sure I'm doing a horrible mistake somewhere. Please dont flame me I need help , here down below my interpretation (I copied this from internet and added Ovo instead of only svm):
#Function for training
def loocv(train_X,train_y):
# define X and y
X = train_X
y = train_y
# define LOOCV
loo = LeaveOneOut()
loo.get_n_splits(X)
# define true and predict list
y_true,y_pred = [],[]
# run
for train_index, test_index in loo.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = SVC(kernel='linear',random_state=0)
ovo_classifier = OneVsOneClassifier(model)
ovo_classifier.fit(X_train,y_train)
yhat = ovo_classifier.predict(X_test)
y_true.append(y_test[0])
y_pred.append(yhat[0])
return y_true,y_pred,ovo_classifier
Validation :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
I've built a machine learning model from 2 data frames called df_test and df_train for naive Bayes, I run it with this code in my pycharm, but when I run it with this model, it returns:
ValueError: Found input variables with inconsistent numbers of samples: [164309, 109541].
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns = ['Interest_Rate']), df_test, test_size=1.0,random_state=109) # 70% training and 30% test
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
Where have I gone wrong?
You want 70-30 split of your data but here you created 100% test data. Change test_size to 0.3(30%) instead of 1.0(100%).
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns = ['Interest_Rate']), df_test, test_size=0.3,random_state=109) # 70% training and 30% test
I am trying to do the machine learning practice problem of Heart disease , dataset from kaggle.
Then i tried to split data into train set and test set and after that combing models into single function and predicting,this error shows up in jupyter notebook .
Here's my code:
# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
Spliting
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
Prediction function
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
"KNN": KNeighborsClassifier(),
"Random Forest": RandomForestClassifier()}
# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
"""
Fits and evaluates given machine learning models.
models : a dict of differetn Scikit-Learn machine learning models
X_train : training data (no labels)
X_test : testing data (no labels)
y_train : training labels
y_test : test labels
"""
# Set random seed
np.random.seed(42)
# Make a dictionary to keep model scores
model_scores = {}
# Loop through models
for name, model in models.items():
# Fit the model to the data
model.fit(X_train, y_train)
# Evaluate the model and append its score to model_scores
model_scores[name] = model.score(X_test, y_test)
return model_scores
And when i run this code , that error shows up
model_scores = fit_and_score(models=models,
X_train=X_train,
X_test=X_test,
y_train=y_train,
y_test=y_test)
model_scores
This is error
Your X_train, y_train, or both, seem to have entries that are not float numbers.
At some point in the code, try using
X_train = X_train.astype(float)
y_train = y_train.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)
Either this will work and the error will go away, or one of the conversions will fail, at which point you will need to decide how (or if) you want to encode the data as a float.
I'm working on building a model using XGBoost to predict corona virus infections based on province and region codes. dataset: https://www.kaggle.com/sudalairajkumar/covid19-in-italy.
I have split the data, but when I try to set up the model, I get the following error:
XGBoostError: [16:16:15] C:/Users/Administrator/workspace/xgboost-
win64_release_1.0.0/src/objective/multiclass_obj.cu:115:
SoftmaxMultiClassObj: label must be in [0, num_class).
Code is as follows:
train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['TotalPositiveCases'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test,
test_size=0.30, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
'max_depth':4,
'eta':0.3,
'objective': 'multi:softmax',
'num_class': 3}
epochs = 10
model = xgb.train(param, train, epochs)
the model attribute is where I get the error
This error comes when there are more labels in your target features than mentioned in the num_class parameter.
You should check if your target has more features than your num_class parameters or what you can do is to print the target.unique() as there might be some nulls or NAN in the data.
I'm using catboost for my model and I'm getting the error as below to the following code:
from catboost import Pool, CatBoostClassifier, cv
#Split data
train = data[:split]
test = data[split:]
# Get variables for a model
x = train.drop(["Survived"], axis=1)
y = train["Survived"]
#Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
cat_features = np.where(x.dtypes != float)[0]
cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy')
cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)
pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
...which raises:
---> 23 cv_scores = cv(pool, cat.get_params(), fold_count=10)
CatBoostError: catboost/libs/metrics/metric.cpp:5069: Cannot calc
metric which requires logits for absolute values.
Help would be appreciated. Thanks.
You need to add the parameter loss_function='Logloss' inside the CatBoostClassifier(). This issue is described here and was supposedly fixed but now it appeared again. I will reopen this issue because this is clearly a bug.