SelectKBest based on (estimated) amount of features - python

I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that arranges all input strings in one (or more) of ~50 categories. For each of these categories, I'm gonna train a new classifier, which solves the actual task.
The reason for this two-layer approach is training performance and memory issues (a classifier which is supposed to separate >1k classes does not perform very well...).
This is what my pipeline looks like for each of these "subclassifiers":
pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3,8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])
Now to my problem: I'm using SelectKBest to limit the model size to a reasonable amount, but for the subclassifiers, there is sometimes not enough input data available so I don't even get to the 10k feature limit, which causes
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/", line 300, in fit
self._check_params(X, y)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/", line 405, in _check_params
% self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.
I don't know how many features I will have without applying the CountVectorizer, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).
Any thoughts on this?

I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:
class SelectAtMostKBest(SelectKBest):
    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # set k to "all" (skip feature selection), if less than k features are available
            self.k = "all"
I tried to add this snippet to his answer, but the edit request was rejected, so there you are...
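The subclass above overrides _check_params, a private method whose behaviour may differ between scikit-learn versions. As a rough alternative sketch that avoids private methods, a small wrapper can clamp k when fit is called. The class name ClampedSelectKBest and the random toy data are made up for illustration and are not part of scikit-learn or of the original answer.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, chi2


class ClampedSelectKBest(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: like SelectKBest, but never asks for more
    features than the input actually has."""

    def __init__(self, score_func=chi2, k=10):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y):
        # Clamp k at fit time only; self.k is left untouched so the
        # estimator can be cloned and reused on wider inputs.
        k = min(self.k, X.shape[1])
        self.selector_ = SelectKBest(self.score_func, k=k).fit(X, y)
        return self

    def transform(self, X):
        return self.selector_.transform(X)


# Toy data (assumption): non-negative values, as chi2 requires.
rng = np.random.RandomState(0)
y = np.array([0, 1] * 10)
X_narrow = rng.randint(1, 6, size=(20, 5)).astype(float)   # fewer than k features
X_wide = rng.randint(1, 6, size=(20, 30)).astype(float)    # more than k features

narrow_out = ClampedSelectKBest(chi2, k=10).fit(X_narrow, y).transform(X_narrow)
wide_out = ClampedSelectKBest(chi2, k=10).fit(X_wide, y).transform(X_wide)
print(narrow_out.shape, wide_out.shape)
```

Because k is only clamped inside fit, the same pipeline definition can be reused for every subclassifier regardless of how many features its vectorizer produces.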

I think the cleanest option would be to subclass SelectKBest and fall back to an identity transformation in your implementation if k exceeds the number of input features; otherwise just call the super implementation.

You could use SelectPercentile, which is more meaningful if you don't have a fixed number of features.
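A minimal sketch of that suggestion, assuming random toy data and an arbitrary percentile of 50: the number of kept features scales with the input width, so no fixed k has to be known in advance.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.RandomState(0)
y = np.array([0, 1] * 10)

kept = {}
for n_features in (5, 30, 200):
    X = rng.rand(20, n_features)            # toy non-negative data (assumption)
    selector = SelectPercentile(chi2, percentile=50).fit(X, y)
    kept[n_features] = selector.transform(X).shape[1]
    print(n_features, "->", kept[n_features])
```

Note that, unlike SelectKBest with a fixed k, this does not cap the absolute model size: a very wide input still yields a proportionally wide output.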


How to get k-fold cross validation final model with sklearn

Once I have iterated over each training combination given by the k-fold split, I can estimate the mean and standard deviation of the models' performance, but I actually get k different models (each with its own fitted parameters). How do I get the final, whole model? Is it a matter of averaging the parameters?
I'm not showing code because this is a general question, so I'll write down the logic only:
splitting dataset according to the k-fold theory (let's say k = 5)
iterations: training from the first to the fifth model
getting 5 different models with, let's say, the following parameters:
model_1 = [p10, p11, p12] \
model_2 = [p20, p21, p22] |
model_3 = [p30, p31, p32] > param_matrix
model_4 = [p40, p41, p42] |
model_5 = [p50, p51, p52] /
What about model_final: [pf0, pf1, pf2]?
Too trivial solution 1:
model_final = mean(param_matrix, axis=0)
Too trivial solution 2:
model_final = the one of the five that reaches the highest performance
(could be an overfit rather than the optimal one)
First of all, the purpose of (K-fold) cross-validation is model checking, not model building.
In your question, you said that every fold of your program ends up with different parameters; this may not be the best way to work.
One way to proceed is to evaluate every model (each one with different parameters) using K-fold internally (for example with GridSearchCV); if you see that you obtain similar values of accuracy or other metrics in each split, then you are not overfitting.
Apply this methodology to every model you have, and choose the one that gives the best results. Of course, there is always a possibility of overfitting, but K-fold reduces it.
Finally, once you have checked with cross-validation that you obtain similar metrics for every split and you have chosen the model parameters, train your model on all of your training data; you will then obtain one single model.
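The two steps described above can be sketched roughly as follows. The synthetic dataset, the choice of LogisticRegression and its settings are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the real training set (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 1: 5-fold CV only estimates how this configuration performs.
# cross_val_score fits clones, so `model` itself stays unfitted.
model = LogisticRegression(C=1.0, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: mean=%.3f std=%.3f" % (scores.mean(), scores.std()))

# Step 2: the final model is the same configuration refit once on all
# training data; the five fold models are discarded, not averaged.
final_model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
print(final_model.coef_.shape)
```

The per-fold parameters never need to be combined; the cross-validation scores describe how well the single refit model can be expected to generalize.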

Using Naive Bayes for spam detection

I have two files of e-mails; some are spam and some are ham. I'm trying to train a classifier using Naive Bayes and then test it on a test set, but I'm still trying to figure out how to do that.
df = DataFrame()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)
testing_set = vectorizer.fit_transform(test['message'].values)
predictions = classifier.predict(testing_set)
I don't think it's the right way to do that and in addition to that, the last line is giving me an error.
ValueError: dimension mismatch
The idea behind CountVectorizer is that it maps each word to a fixed position in an array and stores that word's count there. For example, a b a c might become [2, 1, 1]. When you call fit_transform, it first creates that index mapping (a -> 0, b -> 1, c -> 2) and then applies it to create the vector of counts. Here you call fit_transform on your training set and then again on your testing set. The second call discards the mapping learned from the training data and builds a new one from the test data alone. To expand on the earlier example, your test set might be d a b; refitting produces a mapping over d, a and b, so the columns no longer correspond to the ones the classifier was trained on, and in general the number of columns changes too. This is likely why the dimensions don't match.
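To make the mapping idea concrete, here is a toy re-implementation in plain Python. It is only an illustration of the principle, not CountVectorizer's actual code; the helper names fit_vocabulary and transform are made up.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Assign each new word the next free column index."""
    vocab = {}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def transform(docs, vocab):
    """Count words using a fixed vocabulary; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(token, 0) for token in vocab])
    return rows


train_vocab = fit_vocabulary(["a b a c"])
train_rows = transform(["a b a c"], train_vocab)

# Reusing the training vocabulary keeps the columns aligned; "d" is dropped.
test_rows = transform(["d a b"], train_vocab)

# Refitting on the test set yields a different mapping (d -> 0, a -> 1, b -> 2).
refit_vocab = fit_vocabulary(["d a b"])

print(train_vocab, train_rows, test_rows, refit_vocab)
```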
To fix this, don't use fit_transform the second time. Replace:
testing_set = vectorizer.fit_transform(test['message'].values)
with:
testing_set = vectorizer.transform(test['message'].values)
It is important to build your vectorizer from your training data only, not from all of your data, which is tempting as a way to avoid missing features. This makes your tests more realistic, since in real use the model will encounter unknown words.
This is no guarantee your approach will work but this is likely the source of the dimensionality issue.
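For reference, a small end-to-end sketch of the corrected workflow. The messages and labels are invented toy data standing in for the asker's train and test frames.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data (assumption), in place of train['message'] / train['class'].
train_messages = ["win money now", "cheap money offer",
                  "meeting at noon", "lunch at noon tomorrow"]
train_classes = ["spam", "spam", "ham", "ham"]
test_messages = ["win cheap money", "noon meeting tomorrow"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_messages)    # learn the vocabulary here only

classifier = MultinomialNB()
classifier.fit(counts, train_classes)

testing_set = vectorizer.transform(test_messages)    # reuse it: same columns
predictions = classifier.predict(testing_set)

print(counts.shape[1] == testing_set.shape[1])
print(list(predictions))
```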

Passing the argument from a previous step in sklearn pipelines

I have a dataset with numerical and categorical features on which I am trying to fit a classifier. My idea was to preprocess the categorical data first using Pandas such that my dataset can be written as (to borrow MATLAB's concatenation notation)
X_train = [ X_train_num, X_train_cat ]
X_test = [ X_test_num, X_test_cat ].
To deal with numerical data, I did the following:
# define concatenation of arrays so we can assemble the various parts
# that are preprocessed differently in the pipelines
def concat(a1, a2):
    return np.concatenate((a1, a2), axis=1)
# pipeline to preprocess, reassemble, and fit our models
trainPipeline = Pipeline([
    ('preprocessing', numPipeline), # scale numerical data
    ('assembling', FunctionTransformer(concat, kw_args={'a2' : X_train[nominalFeatures]})), # wrong, but how?
    ('classifying', LogisticRegression())
])
The issue here is that when I pass X_train to the pipeline, it only extracts X_train_num to scale it in the first step, which is why I need to reassemble X_train_num_scaled with X_train_cat = X_train[nominalFeatures] together in the second step. The code above will obviously not work when I use X_test as an input for prediction unless I find a way to access the initial input from the first step and use that in the concatenation step.
I have tried to look at trainPipeline.steps[0] and down the list for the initial variable name but found nothing that could help me. What am I missing?
As @Vivek Kumar states, you should use FeatureUnion to construct that pipeline. It concatenates the outputs of several transformers so that the model can train on the combined data. So, in your case the pipeline should look like the following:
def concat(a1, a2):
    return np.concatenate((a1, a2), axis=1)
subpipe = Pipeline(
    [('concat', FunctionTransformer(concat, kw_args={'a2': X_train[nominalFeatures]})),
     ('preproc', numPipeline())])
union = FeatureUnion(
    [('prep_data', subpipe),
     ('raw_data', FunctionTransformer(concat, kw_args={'a1': X_train_num}))])
pipe = Pipeline(
    [('union', union),
     ('logreg', LogisticRegression())])
Then you should be able to call pipe.predict(X_test), provided X_test is already preprocessed.
Quick check: I applied the numPipeline() function to X_train[nominalFeatures] and left X_train_num as it is. I hope that is what you want.
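One way to avoid baking X_train into the pipeline at all is to let each branch of the FeatureUnion select its own columns from whatever input it receives. The following is a sketch under assumptions: the column positions, the select_columns helper and the random data are hypothetical, and StandardScaler stands in for the asker's numPipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

NUM_COLS = [0, 1]   # hypothetical positions of the numerical columns
CAT_COLS = [2, 3]   # hypothetical positions of the already-encoded categorical columns


def select_columns(X, cols):
    """Return only the requested columns of the array passed through the pipeline."""
    return X[:, cols]


union = FeatureUnion([
    # Branch 1: pick the numerical columns, then scale them.
    ('num', Pipeline([
        ('select', FunctionTransformer(select_columns, kw_args={'cols': NUM_COLS})),
        ('scale', StandardScaler()),
    ])),
    # Branch 2: pass the categorical columns through unchanged.
    ('cat', FunctionTransformer(select_columns, kw_args={'cols': CAT_COLS})),
])
pipe = Pipeline([('union', union), ('logreg', LogisticRegression())])

rng = np.random.RandomState(0)
X_train = np.hstack([rng.normal(50, 10, size=(40, 2)), rng.randint(0, 2, size=(40, 2))])
y_train = (X_train[:, 0] > 50).astype(int)
X_test = np.hstack([rng.normal(50, 10, size=(10, 2)), rng.randint(0, 2, size=(10, 2))])

pipe.fit(X_train, y_train)
test_predictions = pipe.predict(X_test)
test_features = pipe.named_steps['union'].transform(X_test)
print(test_predictions.shape, test_features.shape)
```

Since both branches read from the pipeline's own input, the same object works for X_train during fit and for X_test during predict.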

Make RandomForestClassifier pick a variable for sure during training

This is a bit of a newbie question.
I'd like to train a Random Forest using a RandomForestClassifier from sklearn. I have a few variables, but out of these variables, I'd like the algorithm to pick a variable (let's call it SourceID) for sure in every single tree that it trains.
How do I do that? I don't see any parameters in the classifier that would help in this case.
Any help would be appreciated!
So here's the scenario I have..
If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept would be heavily dependent on Concept A, which has already been assigned. For example, after assigning "Newton's first law of motion", there's a great possibility that "Newton's second law of motion" may be assigned. Quite often, the choice of concepts to be assigned after, say, Concept A is limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.
If I let the random forest do its job of picking variables at random, then there will be a few trees which will not have the variable for Concept A, in which case, the prediction may not make much sense, which is why I'd like to force this variable into selection. Better yet, it'd be great if this variable is chosen as the first variable in each tree to split on.
Does this make things clear? Is random forest not a candidate at all for this job?
There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.
So, it isn't too difficult to create this ourselves manually for trees that are forced to use a specific set of features. I've written a class to do this below. This does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. This is meant to give you a flavor of how to build it yourself:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FixedFeatureRFC:
    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        # accept an existing RandomState, or build one from None / an int seed
        if isinstance(random_state, np.random.RandomState):
            self.random_state = random_state
        else:
            self.random_state = np.random.RandomState(random_state)

    def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
        """
        feats_fixed: indices of features (columns of X) to be
                     always used to train each estimator
        max_features: number of features that each estimator will use,
                      including the fixed features.
        bootstrap_frac: size of bootstrap sample that each estimator will use.
        """
        self.estimators = []
        self.feats_used = []
        self.n_classes = np.unique(y).shape[0]
        if feats_fixed is None:
            feats_fixed = []
        if max_features is None:
            max_features = X.shape[1]
        n_samples = X.shape[0]
        n_bs = int(bootstrap_frac*n_samples)
        feats_fixed = list(feats_fixed)
        feats_all = range(X.shape[1])
        random_choice_size = max_features - len(feats_fixed)
        feats_choosable = set(feats_all).difference(set(feats_fixed))
        feats_choosable = np.array(list(feats_choosable))
        for i in range(self.n_estimators):
            # draw the remaining (non-fixed) features without replacement
            chosen = self.random_state.choice(feats_choosable,
                                              size=random_choice_size,
                                              replace=False)
            feats = feats_fixed + list(chosen)
            # draw the bootstrap sample of rows with replacement
            bs_sample = self.random_state.choice(n_samples,
                                                 size=n_bs,
                                                 replace=True)
            dtc = DecisionTreeClassifier(random_state=self.random_state)
            dtc.fit(X[bs_sample][:,feats], y[bs_sample])
            self.estimators.append(dtc)
            self.feats_used.append(feats)

    def predict_proba(self, X):
        # average the class probabilities over all trees
        out = np.zeros((X.shape[0], self.n_classes))
        for i in range(self.n_estimators):
            out += self.estimators[i].predict_proba(X[:,self.feats_used[i]])
        return out / self.n_estimators

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score(self, X, y):
        return (self.predict(X) == y).mean()
Here is a test script to see if the class above works as intended:
import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC

rs = np.random.RandomState(1234)
BC = load_breast_cancer()
X, y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8
print("n_features =", X.shape[1])
fixed = [0,4,21]
maxf = 10
ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)
for feats in ffrfc.feats_used:
    assert len(feats) == maxf
    for f in fixed:
        assert f in feats
print(ffrfc.score(X[~train], y[~train]))
The output is:
n_features = 30
None of the assertions failed, indicating that the features we have chosen to be fixed were used in each random feature subsample and that the size of each feature subsample was the required max_features size. The high accuracy on the held-out data indicates that the classifier is working properly.
I do not believe there is a way to do this in scikit-learn at the moment. You could use max_features=None, which removes all randomness from the feature selection.
If you can switch packages, R's ranger has options, among them always.split.variables, which may be what you are looking for: they let you define the probability for the random choices or always include certain features in addition to the random choices.
This works against the overall design of random forest: it reduces the randomness, which may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem to choose this option. As @Michal alluded to, proceed carefully here.

sklearn GaussianNB - bad results, [nan] probabilities

I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some Bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare down the feature space to a more reasonable size for some of the Bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from the GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].
I've pasted some relevant code; I don't know if this is enough to go on, but I'm hoping that I'm just overlooking something obvious in using the sklearn API. I have a couple of different feature sets that I've tried pushing through it, also with the same results. The same thing happens on the training set and with cross-validation. Any thoughts? Could it be that my feature space is simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.
class GNBLearner(BaseLearner):
    def __init__(self, featureCount):
        self.gnb = GaussianNB()
        self.featureCount = featureCount

    def train(self, instances, params):
        X = np.zeros( (len(instances), self.featureCount) )
        Y = [0]*len(instances)
        for i, inst in enumerate(instances):
            # NOTE: the iterable was lost from the original post; "inst.features"
            # is a placeholder for the instance's (index, value) pairs
            for idx,val in inst.features:
                X[i,idx-1] = val
            Y[i] = inst.c
        self.gnb.fit(X, Y)

    def test(self, instances, params):
        X = np.zeros( (len(instances), self.featureCount) )
        for i, inst in enumerate(instances):
            # same placeholder as in train()
            for idx,val in inst.features:
                X[i,idx-1] = val
        return self.gnb.predict(X)

    def conf_mtx(self, res, test_set):
        # rows: actual class, columns: predicted class (labels are -1 / 1)
        conf = [[0,0],[0,0]]
        for r, x in zip(res, test_set):
            print("pred: %d, act: %d" % (r, x.c))
            conf[(x.c+1)//2][(r+1)//2] += 1
        return conf
GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, uses tf-idf weighting using the built-in TfidfTransformer.
Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.
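A minimal sketch of the suggested setup, using tf-idf features with MultinomialNB. The four documents and their -1 / 1 labels are invented toy data, and TfidfVectorizer is used here in place of the asker's own tf-idf code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus (assumption); labels follow the question's -1 / 1 convention.
docs = [
    "the match went to extra time",
    "the striker scored twice in the match",
    "the recipe needs butter and flour",
    "bake the flour mixture with butter",
]
labels = [-1, -1, 1, 1]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

new_docs = ["the striker scored in extra time", "mix butter and flour"]
proba = model.predict_proba(new_docs)
print(np.isnan(proba).any())
print(proba.shape)
```

Swapping MultinomialNB for BernoulliNB in the pipeline is a one-line change if you want to compare the two.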

