XGBoost feature name error - Python

This question has probably been asked many times in different forms. However, my problem is that when I use XGBClassifier() with production-like data, I get a feature name mismatch error. I am hoping someone can tell me what I am doing wrong. Here is my code; by the way, the data is completely made up:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
import xgboost as xgb
data = {"Age":[44,27,30,38,40,35,70,48,50,37],
"BMI":["25-29","35-39","30-35","40-45","45-49","20-25","<19",">70","50-55","55-59"],
"BP":["<140/90",">140/90",">140/90",">140/90","<140/90","<140/90","<140/90",">140/90",">140/90","<140/90"],
"Risk":["No","Yes","Yes","Yes","No","No","No","Yes","Yes","No"]}
df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
labelencoder = LabelEncoder()
def encoder_X(columns):
    for i in columns:
        X.iloc[:, i] = labelencoder.fit_transform(X.iloc[:, i])
encoder_X([1,2])
y = labelencoder.fit_transform(y)
onehotencoder = OneHotEncoder(categorical_features = [1, 2])
X = onehotencoder.fit_transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
model = xgb.XGBClassifier()
model.fit(X_train, y_train, verbose = True)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {0}%".format(accuracy*100))
So far so good, no error. The accuracy score is 100%, but that's because it is a made-up data set, so I am not worried about that.
When I try to classify a new dataset with the model, I get a "feature name mismatch" error:
proddata = {"Age":[65,50,37],
"BMI":["25-29","35-39","30-35"],
"BP":["<140/90",">140/90",">140/90"]}
prod_df = pd.DataFrame(proddata)
def encoder_prod(columns):
    for i in columns:
        prod_df.iloc[:, i] = labelencoder.fit_transform(prod_df.iloc[:, i])
encoder_prod([1,2])
onehotencoder = OneHotEncoder(categorical_features = [1, 2])
prod_df = onehotencoder.fit_transform(prod_df).toarray()
predictions = model.predict(prod_df)
After this I get the error below:
predictions = model.predict(prod_df)
Traceback (most recent call last):
File "<ipython-input-24-456b5626e711>", line 1, in <module>
predictions = model.predict(prod_df)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\sklearn.py", line 526, in predict
ntree_limit=ntree_limit)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\core.py", line 1044, in predict
self._validate_features(data)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\core.py", line 1288, in _validate_features
data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
expected f6, f11, f12, f9, f7, f8, f10 in input data
I know this is happening as a result of the OneHotEncoding when it is fit and transformed to an array, though I might be wrong.
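For example, a quick check with pd.get_dummies (used here purely for illustration, with the made-up BMI values from above) shows how the column counts diverge when an encoder is re-fit on the production data:
import pandas as pd

# ten distinct BMI bands in training -> ten dummy columns
train_bmi = pd.Series(["25-29", "35-39", "30-35", "40-45", "45-49",
                       "20-25", "<19", ">70", "50-55", "55-59"])
# only three distinct bands in production -> three dummy columns
prod_bmi = pd.Series(["25-29", "35-39", "30-35"])
print(pd.get_dummies(train_bmi).shape[1])  # 10
print(pd.get_dummies(prod_bmi).shape[1])   # 3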
If this is a result of the OneHotEncoding, can I just skip OneHotEncoding, since LabelEncoder() already encodes the categorical values?
Thank you so much for any help and feedback.
PS: The XGBoost version is 0.7:
xgboost.__version__
Out[37]: '0.7'

It seems the encoder needs to be saved after it has been fitted. I used joblib from sklearn; Jason from https://machinelearningmastery.com/ gave me the idea of saving the encoder. Below is an edited version:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib
import xgboost as xgb
data = {"Age":[44,27,30,38,40,35,70,48,50,37],
"BMI":["25-29","35-39","30-35","40-45","45-49","20-25","<19",">70","50-55","55-59"],
"BP":["<140/90",">140/90",">140/90",">140/90","<140/90","<140/90","<140/90",">140/90",">140/90","<140/90"],
"Risk":["No","Yes","Yes","Yes","No","No","No","Yes","Yes","No"]}
df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
labelencoder = LabelEncoder()
def encoder_X(columns):
    for i in columns:
        X.iloc[:, i] = labelencoder.fit_transform(X.iloc[:, i])
encoder_X([1,2])
y = labelencoder.fit_transform(y)
onehotencoder = OneHotEncoder(categorical_features = [1, 2])
onehotencoder.fit(X)
enc = joblib.dump(onehotencoder, "encoder.pkl") # save the fitted encoder
X = onehotencoder.transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
model = xgb.XGBClassifier()
model.fit(X_train, y_train, verbose = True)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {0}%".format(accuracy*100))
And now, we can use the fitted encoder to transform the prod data:
proddata = {"Age":[65,50,37],
"BMI":["25-29","35-39","30-35"],
"BP":["<140/90",">140/90",">140/90"]}
prod_df = pd.DataFrame(proddata)
def encoder_prod(columns):
    for i in columns:
        prod_df.iloc[:, i] = labelencoder.fit_transform(prod_df.iloc[:, i])
encoder_prod([1,2])
enc = joblib.load("encoder.pkl")
prod_df = enc.transform(prod_df).toarray()
predictions = model.predict(prod_df)
results = [round(val) for val in predictions]
It seems to be working for this example, and I'll try this method at work on a larger dataset.
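One caveat: encoder_prod above still re-fits the LabelEncoder on the production data, so the integer codes only line up here because the production categories happen to appear in the same relative order as in training. A safer sketch (hypothetical, replacing encoder_X and encoder_prod) persists the per-column label encoders the same way:
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import joblib

# training time: one encoder per categorical column, all persisted
label_encoders = {}
for i in [1, 2]:
    le = LabelEncoder()
    X.iloc[:, i] = le.fit_transform(X.iloc[:, i])
    label_encoders[i] = le
joblib.dump(label_encoders, "label_encoders.pkl")

# prediction time: transform only (never fit); an unseen category
# raises a ValueError instead of silently getting a wrong code
label_encoders = joblib.load("label_encoders.pkl")
for i in [1, 2]:
    prod_df.iloc[:, i] = label_encoders[i].transform(prod_df.iloc[:, i])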
Please, let me know what you think.
Thanks

Related

I'm not sure what needs to be reshaped in my data

I'm trying to use a LinearRegression() algorithm to predict the price of a house.
Here's my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data.csv')
df = df.drop(columns=['date', 'street', 'city', 'statezip', 'country'])
X = df.drop(columns=['price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
pred.reshape((-1, 1))
acc = lr.score(pred, y_test)
However, I keep on getting this error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've tried to reshape all the attributes in my data, but the only thing I'm able to reshape is pred, and I still get the same error after doing that.
How should I fix this error?
Thanks in advance.
Based on the documentation of sklearn.linear_model.LinearRegression.score:
score(X, y, sample_weight=None)
Returns the R^2 score of self.predict(X) w.r.t. y.
You need to pass the features X (not the predictions) as the first argument, like below:
lr.fit(X_train, y_train)
acc = lr.score(X_test, y_test)
print(acc)
Or you can use sklearn.metrics.r2_score:
from sklearn.metrics import r2_score
acc = r2_score(y_test, pred)
print(acc)
Example:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
acc = lr.score(X_test, y_test)
print(acc)
# Or
from sklearn.metrics import r2_score
acc = r2_score(y_test, pred)
print(acc)
Output:
0.8888888888888888
0.8888888888888888

How to evaluate the effect of different methods of handling missing values?

I am a total beginner and I am trying to compare different methods of handling missing data. To evaluate the effect of each method (drop rows with missing values, drop columns with missingness over 40%, impute with the mean, impute with KNN), I compare the LDA and LogReg accuracy on the training set for a dataset with 10% missing values and one with 20% missing values against the results on the original complete dataset. Unfortunately, I get pretty much the same results even between the complete dataset and the dataset with 20% missingness. I don't know what I am doing wrong.
from numpy import nan
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#dataset = read_csv('telecom_churn_rev10.csv')
dataset = read_csv('telecom_churn_rev20.csv')
dataset = dataset.replace(nan, 0)
values = dataset.values
X = values[:,1:11]
y = values[:,0]
dataset.fillna(dataset.mean(), inplace=True)
#dataset.fillna(dataset.mode(), inplace=True)
print(dataset.isnull().sum())
imputer = SimpleImputer(missing_values = nan, strategy = 'mean')
transformed_values = imputer.fit_transform(X)
print('Missing: %d' % isnan(transformed_values).sum())
model = LinearDiscriminantAnalysis()
cv = KFold(n_splits = 3, shuffle = True, random_state = 1)
result = cross_val_score(model, X, y, cv = cv, scoring = 'accuracy')
print('Accuracy: %.3f' % result.mean())
#print('Accuracy: %.3f' % result.mode())
print(dataset.describe())
print(dataset.head(20))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test,y_pred)
from sklearn import metrics
# make predictions on X
expected = y
predicted = classifier.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
# make predictions on X test
expected = y_test
predicted = classifier.predict(X_test)
# summarize the fit of the model
print(metrics.confusion_matrix(expected, predicted))
print(metrics.classification_report(expected, predicted))
You replace all your missing values with 0 on this line: dataset = dataset.replace(nan, 0). After that line you have a full dataset without missing values, so the later .fillna() and SimpleImputer() calls have nothing left to impute; every method ends up being evaluated on the same zero-filled data, which is why all the results look identical.
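To actually measure the effect of each strategy, the imputers have to see the NaNs. Here is a minimal sketch of a fair comparison, assuming your CSV layout (target in column 0, features in columns 1 to 10):
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

dataset = read_csv('telecom_churn_rev20.csv')  # keep the NaNs intact
values = dataset.values
X, y = values[:, 1:11], values[:, 0]

cv = KFold(n_splits=3, shuffle=True, random_state=1)
for strategy in ['mean', 'median', 'most_frequent']:
    # the imputer is re-fit inside every CV fold through the pipeline
    pipe = make_pipeline(SimpleImputer(missing_values=nan, strategy=strategy),
                         LinearDiscriminantAnalysis())
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
    print('%s: %.3f' % (strategy, scores.mean()))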

how to predict the output value using logistic regression?

I successfully built a classification model that predicts a single binary output with an ANN, using the pandas and sklearn libraries. Now I want my model to predict another feature that is not binary: the input columns are (0,1,4,6,7,8,11,12,13,14) and the output column is (15) of my data set. A typical example of the input data is [4096,0.07324,1.7,20,5.2,64,0.142,0.5,35,30,584.232], where some values are float. How can I predict 584.232 from the first ten numbers using logistic regression?
thank you all.
import numpy as np
import pandas as pd

dataset = pd.read_csv("DataSet.csv")
X = dataset.iloc[:, [0,1,4,6,7,8,11,12,13,14]].values
y = dataset.iloc[:, 15].values
To avoid a type error, I converted the input values to float in the following way:
dataset['ColumnsName'] = dataset['ColumnsName'].astype(float)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelEncoder_X_delay_1 = LabelEncoder()
X[:, 1] = labelEncoder_X_delay_1.fit_transform(X[:, 1])
labelEncoder_X_delay_2 = LabelEncoder()
X[:, 2] = labelEncoder_X_delay_2.fit_transform(X[:, 2])
# normalizing the input
X = X.T
X = X / np.amax(X, axis=1)
X = X.T
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# fitting logestic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
but after running the code up to this point, it gives this error:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Traceback (most recent call last):
File "<ipython-input-5-f18c8875152f>", line 3, in <module>
classifier.fit(X_train, y_train)
File "C:\Users\ali\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1528, in fit
check_classification_targets(y)
File "C:\Users\ali\anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 169, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
I have already converted the predefined columns from string to float!
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("DataSet.csv")
X = dataset.iloc[:, [0,1,4,6,7,8,11,12,13,14]].values
y = dataset.iloc[:, 15].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelEncoder_X_delay_1 = LabelEncoder()
X[:, 1] = labelEncoder_X_delay_1.fit_transform(X[:, 1])
labelEncoder_X_delay_2 = LabelEncoder()
X[:, 2] = labelEncoder_X_delay_2.fit_transform(X[:, 2])
# normalizing the input
X = X.T
X = X / np.amax(X, axis=1)
X = X.T
# splitting the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Activation Function
model = Sequential()
model.add(Dense(6, input_dim=9, activation= "relu"))
model.add(Dense(6, activation= "relu"))
model.add(Dense(6, activation= "relu"))
model.add(Dense(1))
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# fitting logestic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
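The traceback is the key: LogisticRegression is a classifier, so it refuses a continuous target such as 584.232. Predicting a continuous value requires a regressor instead; a minimal sketch, assuming the X_train/X_test/y_train/y_test split already made above:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)  # continuous predictions, e.g. 584.232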

Logistic regression gives me a precision of 0.55. What's wrong with my code?

I copied columns from data frame z, made dummies, and am trying to predict click, a 0/1 variable, with balanced train and test sets. Where did I go wrong?
df = z[['user_state', 'device_maker', 'day_of_week', 'device_area_zscore', 'Age_zscore', 'consumption_zscore', 'click']].copy()
day_dummy = pd.get_dummies(df["day_of_week"])
state_dummy = pd.get_dummies(df["user_state"])
maker_dummy = pd.get_dummies(df["device_maker"])
combined_df = pd.concat([df, day_dummy, state_dummy, maker_dummy], axis=1)
click_rows = combined_df[combined_df.click == 1]
no_click_rows = combined_df[combined_df.click == 0]
no_click_rows = no_click_rows.sample(frac=1, replace=False, random_state=1)
final_df = pd.concat([click_rows, no_click_rows], axis = 0)
final_df = final_df.reset_index(drop=True)
from sklearn.model_selection import train_test_split
final_df = final_df.drop(['user_state', 'device_maker', 'day_of_week'], axis = 1)
x_train, x_test, y_train, y_test = train_test_split(final_df.drop(['click'], axis = 1), final_df['click'], test_size=0.2, random_state=2)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train,y_train)
predictions = logmodel.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
Following are my suggestions:
You are missing the stratify argument in the train_test_split function. It ensures the target distribution is similar across the train/val/test data.
Logistic regression does not do well at detecting non-linear patterns in data. Try a tree-based model such as RandomForestClassifier.
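A minimal sketch of both suggestions, assuming the final_df built in the question (note also that no_click_rows.sample(frac=1, replace=False) only shuffles the no-click rows; it does not downsample them, so the classes may be less balanced than intended):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# stratify keeps the click/no-click ratio identical in train and test
x_train, x_test, y_train, y_test = train_test_split(
    final_df.drop(['click'], axis=1), final_df['click'],
    test_size=0.2, random_state=2, stratify=final_df['click'])

rf = RandomForestClassifier(n_estimators=100, random_state=2)
rf.fit(x_train, y_train)
print(classification_report(y_test, rf.predict(x_test)))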

Classification with one file with entirely the training and another file with entirely test

I am trying to build a classifier where one file is entirely the training data and another file is entirely the test data. Is that possible? I tried:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = ',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])
y = df['label']
X_T = vect.fit_transform(df_test['data_test'])
y_t = df_test['label']
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
tf_transformer = TfidfTransformer(use_idf=False).fit(X)
X_train_tf = tf_transformer.transform(X)
X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape
tf_transformer = TfidfTransformer(use_idf=False).fit(X_T)
X_train_tf_teste = tf_transformer.transform(X_T)
X_train_tf_teste.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf_teste = tfidf_transformer.fit_transform(X_T)
X_train_tfidf_teste.shape
#RegLog
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, y_pred, labels = y))
print("F-score")
print(f1_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print("cross validation")
scores = cross_validation.cross_val_score(clf, X, y, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
I set test_size to zero because I do not want any partition of those files, and I applied Count and TFIDF to both the training and the test file.
My output error:
Traceback (most recent call last):
File "classif.py", line 34, in
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
ValueError: too many values to unpack (expected 2)
The error you are getting from train_test_split is clearly indicated and solved by @Alexis. Once again, I also suggest not using train_test_split here, as it does nothing except shuffle, which you have already done.
But I want to highlight another important point: if you keep your train and test files separate, don't fit the vectorizers separately. That will create different columns for the train and test files. Example:
cv = CountVectorizer()
train=['Hi this is stack overflow']
cv.fit(train)
cv.get_feature_names()
Output:
['hi', 'is', 'overflow', 'stack', 'this']
test=['Hi that is not stack overflow']
cv.fit(test)
cv.get_feature_names()
Output:
['hi', 'is', 'not', 'overflow', 'stack', 'that']
Hence, fitting them separately results in a column mismatch. So you should either merge the train and test files first and then fit_transform the vectorizer on the combined data, or, if you don't have the test data in advance, transform the test data with the vectorizer fitted on the train data, which will simply ignore words not present in the train data.
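Applied to the code in the question, that could look like this sketch, reusing the df and df_test already loaded there:
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])    # learn the vocabulary on the train file only
X_T = vect.transform(df_test['data_test'])  # reuse it; unseen words are ignored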
So first, for the error you get, just write the code as follows and it should work:
X_train, _, y_train, _ = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, _, y_test, _ = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
The function returns four sets, in the order X_train, X_test, y_train, y_test, and expects four variables to receive them; putting _ just lets everyone know that you don't care about those outputs.
Second, I don't really know why you are doing this manipulation. If you want to shuffle the data, this is not the best way to do it, and you have already shuffled it earlier with reindex.
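In fact, since each file is already entirely train or entirely test, no split is needed at all; a minimal sketch:
# no split needed: one file is all train, the other is all test
X_train, y_train = X, y
X_test, y_test = X_T, y_t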
