XGBoost difference in train and test features after converting to DMatrix - python

I'm just wondering how the following case is possible:
def fit(self, train, target):
    xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
    self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the training dataset as a csr_matrix with 5233 columns, but after converting it to a DMatrix I got 5322 features.
Later, at the predict step, I got an error caused by the above discrepancy :(
def predict(self, test):
    if not self.model:
        return -1
    xgtest = xgb.DMatrix(test)
    return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee that my train/test datasets are converted to DMatrix correctly?
Is there any way to do in Python something similar to this R snippet?
# get the same columns for the test/train sparse matrices
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
              (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # feature_names seems to be an immutable structure, so new items can't be added!
            self.xgtest.feature_names.append(item)

Another possibility is that a feature level occurs exclusively in the training data and not in the test data (or vice versa). This happens most often after one-hot encoding, which produces a large matrix with one column per level of each categorical feature. In your case, it looks like "f5232" is exclusive to either the training or the test data. In either case, scoring the model will likely throw an error (in most implementations of ML packages) because:
If exclusive to training: the model object will have a reference to this feature in the model equation. While scoring, it will throw an error saying it cannot find this column.
If exclusive to test (less likely, as test data is usually smaller than training data): the model object will NOT have a reference to this feature in the model equation. While scoring, it will throw an error saying it received a column that the model equation does not contain. This is also less likely because most implementations are cognizant of this case.
Solutions:
The best "automated" solution is to keep only those columns, which are common to both training and test post one hot encoding.
For adhoc analysis if you can not afford to drop the level of feature because of its importance then do stratified sampling to ensure that all level of feature gets distributed to training and test data.
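In Python, the same "keep common columns" idea might look like this (a minimal sketch assuming both sets are pandas DataFrames after one-hot encoding; train_df, test_df, and target are illustrative names):
import pandas as pd
import xgboost as xgb

# One-hot encode train and test separately (illustrative).
X_train = pd.get_dummies(train_df)
X_test = pd.get_dummies(test_df)

# Keep only the columns present in both sets, in the same order.
common_cols = X_train.columns.intersection(X_test.columns)
X_train = X_train[common_cols]
X_test = X_test[common_cols]

# Both DMatrix objects are now built over an identical feature set.
dtrain = xgb.DMatrix(X_train, label=target)
dtest = xgb.DMatrix(X_test)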

This situation can happen after one-hot encoding. For example:
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

ar = np.array([
    [1, 2],
    [1, 0]
])
enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)

b = np.array([[1, 0]])
b2 = enc.transform(b)

xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)

print(b2.shape)         # (1, 3)
print(xgb_b.num_col())  # 2
So, when you have an all-zero column in a sparse matrix, DMatrix drops that column (I think because the column is useless to XGBoost).
Usually, I add a fake row to the matrix that contains 1 in every column, so that no column is entirely zero; a sketch follows.
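A rough sketch of that fake-row workaround, continuing the example above (my own illustration, not part of the original answer):
from scipy import sparse

# Append a row of ones so no column of the sparse matrix is all-zero;
# DMatrix then keeps the full width. Remember to drop the prediction
# for this padding row afterwards.
ones_row = sparse.csr_matrix(np.ones((1, b2.shape[1])))
b2_padded = sparse.vstack([b2, ones_row], format='csr')
xgb_b = xgb.DMatrix(b2_padded)
print(xgb_b.num_col())  # 3, matching the training width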

Such an issue occurred for me when the RandomUnderSampler (RUS) method returned an np.array rather than a Pandas DataFrame with column names.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)
I resolved the issue with this:
X_rus = pd.DataFrame(X_rus, columns = X_train.columns)
Basically, I took the output of the RUS method and created a Pandas DataFrame out of it, with the column names taken from the original X_train data that was the input to the RUS method.
This can be generalized to any similar problem where XGBoost expects to read column names but cannot: just create a Pandas DataFrame and assign the column names accordingly, as in the sketch below.
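For instance, a generic version of that pattern (the variable names here are just placeholders) might be:
import pandas as pd
import xgboost as xgb

# Wrap any bare ndarray in a DataFrame that reuses the original column
# names, so the DMatrix feature names match between train and predict.
X_resampled_df = pd.DataFrame(X_rus, columns=X_train.columns)
dtrain = xgb.DMatrix(X_resampled_df, label=y_rus)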

Related

CatBoost Post-Training Feature Information

I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier, Pool
train_pool = Pool(pd.DataFrame({'size': [1, 1, 2, 1],
                                'shape': ['square', 'square', 'square', 'circle']}),
                  [1, 1, 0, 1],
                  feature_names=['size', 'shape'],
                  cat_features=['shape'])
model = CatBoostClassifier(iterations=2,
                           cat_features=['shape'],
                           ctr_leaf_count_limit=1)
model.fit(train_pool, plot=False)
I would now like to run a function on the model object to obtain the following:
Numerical feature size has minimum value 0 and max value 1 (this should be part of CatBoost's split logic for numerical features)
Categorical Feature shape has the following training values:
values=['square', None].
Notice that circle is not in values, because ctr_leaf_count_limit=1 would have selected the most frequently occurring value, which in this case is 'square'. I've put None here because I'm fairly sure CatBoost will assign None to any unseen classes.
Next, I've chosen the above data example to make sure that CatBoost decides to split on shape=='square'. Ideally I'd like to see an array used_values=['square'] which emphasizes that there was at least one split on this square value.
It's important to emphasize that I want to operate on the model object only. Obviously, one can get some of these details by running functions on top of the training data. My motivation is to make doubly sure that I completely understand the training range of inputs into the model, and what it may do to them in preprocessing.
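One way to start inspecting this from the model object alone (a sketch; the JSON keys named below are assumptions about CatBoost's JSON model export and may vary between versions) is to save the model as JSON and read its feature metadata:
import json

# Export the trained model to JSON and inspect what it recorded about
# its features ('features_info', 'float_features', and 'borders' are
# assumed key names from CatBoost's JSON model format).
model.save_model('model.json', format='json')
with open('model.json') as f:
    model_json = json.load(f)

print(model_json['features_info']['float_features'])            # numeric split borders
print(model_json['features_info'].get('categorical_features'))  # categorical metadata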

Why is sklearn.metrics support value changing every time?

I'm working on training a supervised-learning Keras model to categorize data into one of 3 categories. After training, I run this:
import numpy as np
import pandas
import sklearn.metrics
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx = scaler.fit_transform(x_orig)
# one-hot encode labels
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# get class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred_bool))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per this link, support is the number of occurrences of each class in y_true, which is the true labels.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
My problem: on each run, support is different. I'm using the same data, and support for each class always adds up to the same total (which is different from the total in the file, something I also don't understand), but the number per class differs.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up the same, but each class differs.
Since my data isn't changing between runs, I'd expect support to be identical every time. Am I wrong? If not, what's going on? I've tried googling, but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from that match the numbers from precision_recall_fscore_support.
Sidenote: unrelated to above question, but I couldn't google-fu an answer to this one either, I hope that's ok to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?
Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason is that without seeding, np.random.shuffle produces a different ordering each time, so when you feed the array into the model it returns a different result. Seeding lets you shuffle the array the same way on every run, producing reproducible results.
Alternatively, skip the shuffle and feed the same array into the model each time. Either method will ensure reproducibility within the model.
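As a standalone illustration of the effect (toy data, not the asker's):
import numpy as np

a = np.arange(10)
np.random.seed(42)   # fix the RNG state
np.random.shuffle(a)

b = np.arange(10)
np.random.seed(42)   # reset to the same state
np.random.shuffle(b)

print(np.array_equal(a, b))  # True: the same ordering on every run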

Using Naive Bayes for spam detection

I have two files of e-mails, some spam and some ham. I'm trying to train a classifier using Naive Bayes and then test it on a test set, but I'm still trying to figure out how to do that:
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = DataFrame()
train = data.sample(frac=0.8, random_state=20)
test = data.drop(train.index)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)

classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)

testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think this is the right way to do it, and in addition, the last line gives me an error.
ValueError: dimension mismatch
The idea behind CountVectorizer is that it creates a function that maps word counts to identical places in an array. For example, a b a c might become [2, 1, 1]. When you call fit_transform, it creates the index mapping a -> 0, b -> 1, c -> 2 and then applies that mapping to create the vector of counts. Here you call fit_transform to create a count vectorizer for your training set and then again for your testing set. Some words may be in your testing data but not your training data, and these get added. To expand on the earlier example, your test set might be d a b, which would create a vector with dimension 4 to account for d. This is most likely why the dimensions don't match.
To fix this, don't use fit_transform the second time; replace:
vectorizer.fit_transform(test['message'].values)
with:
vectorizer.transform(test['message'].values)
It is important to build your vectorizer from your training data only, not from all of your data, even though using everything is tempting in order to avoid missing features. This makes your tests more realistic, since a model in real use will encounter unknown words.
This is no guarantee your overall approach will work, but it is the likely source of the dimensionality issue. A corrected end-to-end sketch follows.
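Putting it together, a corrected pipeline (assuming, as in the question, that data is a DataFrame with 'message' and 'class' columns):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = data.sample(frac=0.8, random_state=20)
test = data.drop(train.index)

# Fit the vocabulary on the training messages only.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)

classifier = MultinomialNB()
classifier.fit(counts, train['class'].values)

# transform (not fit_transform) reuses the training vocabulary, so the
# dimensions match what the classifier was fitted on.
testing_set = vectorizer.transform(test['message'].values)
predictions = classifier.predict(testing_set)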

Predicting a single value in classification network

I am currently trying to get into machine learning and neural networks, but my lack of programming skills is kind of hindering me at the moment. I am following an online tutorial in which these lines of code are used to evaluate the created model:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))
predictions[0]

final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]

from sklearn.metrics import classification_report
print(classification_report(y_test, final_preds))
This works very well for me and tells me the precision I achieved on the 10 inputs I chose from X_test. Unfortunately, I can't figure out how to predict a particular single value from X_test, or maybe even a manually entered value that has the same dimensions as an element of X_test.
X_test is a pandas.core.frame.DataFrame with 15 columns and thousands of rows, so I would find it helpful to be able to predict or evaluate one particular value.
If I missed any essential information, that I should have included, let me know. Thanks in advance!
Why don't you just take sections of the X_test dataframe, or pass in single values as a dataframe with a single row?
Sectioning a dataframe:
temp = X_test[i:i+1]
to test with the ith row, use temp now instead of X_test.
Or create a new dataframe with the required data:
temp = pandas.DataFrame(data, columns=X_test.columns)
where data is given as a nested list (iterable), e.g. [[a1, a2, a3, ..., a15]].
Again, use temp instead of X_test in your code, as in the sketch below.
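For example, a sketch of predicting just one row with the question's own estimator setup (i is a hypothetical row index):
# Build a one-row DataFrame and run a single prediction through it.
i = 0
temp = X_test[i:i+1]
pred_fn = tf.estimator.inputs.pandas_input_fn(x=temp, batch_size=1, shuffle=False)
single_pred = list(model.predict(input_fn=pred_fn))[0]
print(single_pred['class_ids'][0])  # predicted class for row i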

Training a sklearn LogisticRegression classifier without all possible labels

I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifier on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier:
from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
This produces, for each row, a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
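A quick toy check of that padding logic (made-up classes and probabilities standing in for clf.classes_ and the predict_proba output):
import numpy

all_classes = numpy.array(sorted(['a', 'b', 'c', 'd']))
trained_classes = numpy.array(['b', 'd'])   # stand-in for clf.classes_
prob = numpy.array([[0.3, 0.7]])            # stand-in predict_proba output

new_prob = numpy.zeros((prob.shape[0], all_classes.size))
new_prob[:, all_classes.searchsorted(trained_classes)] = prob
print(new_prob)  # [[0.  0.3 0.  0.7]]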
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class (label) order
    prob_per_class = sorted(prob_per_class)
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.
