sklearn GaussianNB - bad results, [nan] probabilities - python

I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare down the feature space to a more reasonable size for some of the bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from the GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].
I've pasted some relevant code; I don't know if this is enough to go on, but I'm hoping that I'm just overlooking something obvious in using the sklearn api. I have a couple different feature sets that I've tried pushing through it, also with the same results. Same thing too using the training set and with cross-validation. Any thoughts? Could it be that my feature space simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.
class GNBLearner(BaseLearner):
def __init__(self, featureCount):
self.gnb = GaussianNB()
self.featureCount = featureCount
def train(self, instances, params):
X = np.zeros( (len(instances), self.featureCount) )
Y = [0]*len(instances)
for i, inst in enumerate(instances):
for idx,val in inst.data:
X[i,idx-1] = val
Y[i] = inst.c
self.gnb.fit(X, Y)
def test(self, instances, params):
X = np.zeros( (len(instances), self.featureCount) )
for i, inst in enumerate(instances):
for idx,val in inst.data:
X[i,idx-1] = val
return self.gnb.predict(X)
def conf_mtx(self, res, test_set):
conf = [[0,0],[0,0]]
for r, x in xzip(res, test_set):
print "pred: %d, act: %d" % (r, x.c)
conf[(x.c+1)/2][(r+1)/2] += 1
return conf

GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, uses tf-idf weighting using the built-in TfidfTransformer.
Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.

Related

What is the state of the art way of doing regression with probability in pytorch

All regression examples I find are examples where you predict a real number and unlike with classification you dont the the confidence the model had when predicting that number. I have done in reinforcement learning another way the output is instead the mean and std and then you sample from that distribution. Then you know how confident the model is at predicting every value. Now I cant find how to do this using supervised learning in pytorch. The problem is that I dont understand how to perform sample from the distribution the get the actual value while training or what sort of loss function I should use, not sure how for example MSE or L1Smooth would work.
Is there any example ot there where this is done in pytorch in a robust and state of the art way?
The key point is that you do not need to sample from the NN-produced distribution. All you need is to optimize the likelihood of the target value under the NN distribution.
There is an example in the official PyTorch example on VAE (https://github.com/pytorch/examples/tree/master/vae), though for multidimensional Bernoulli distribution.
Since PyTorch 0.4, you can use torch.distributions: instantiate distribution distro with outputs of your NN and then optimize -distro.log_prob(target).
EDIT: As requested in a comment, a complete example of using the torch.distributions module.
First, we create a heteroscedastic dataset:
import numpy as np
import torch
X = np.random.uniform(size=300)
Y = X + 0.25*X*np.random.normal(size=X.shape[0])
We build a trivial model, which is perfectly able to match the generative process of our data:
class Model(torch.nn.Module):
def __init__(self):
super().__init__()
self.mean_coeff = torch.nn.Parameter(torch.Tensor([0]))
self.var_coeff = torch.nn.Parameter(torch.Tensor([1]))
def forward(self, x):
return torch.distributions.Normal(self.mean_coeff * x, self.var_coeff * x)
mdl = Model()
optim = torch.optim.SGD(mdl.parameters(), lr=1e-3)
Initialization of the model makes it always produce a standard normal, which is a poor fit for our data, so we train (note it is a very stupid batch training, but demonstrates that you can output a set of distributions for your batch at once):
for _ in range(2000): # epochs
dist = mdl(torch.from_numpy(X).float())
obj = -dist.log_prob(torch.from_numpy(Y).float()).mean()
optim.zero_grad()
obj.backward()
optim.step()
Eventually, the learned parameters should match the values we used to construct the Y.
print(mdl.mean_coeff, mdl.var_coeff)
# tensor(1.0150) tensor(0.2597)

Make RandomForestClassifier pick a variable for sure during training

This is a bit of a newbie question.
I'd like to train a Random Forest using a RandomForestClassifier from sklearn. I have a few variables, but out of these variables, I'd like the algorithm to pick a variable (let's call it SourceID) for sure in every single tree that it trains.
How do I do that? I don't see any paramters in the classifier that would help in this case.
Any help would be appreciated!
TIA.
EDIT
So here's the scenario I have..
If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept would be heavily dependant on Concept A which has already been assigned. For example - after assigning "Newton's first law of motion", there's a great possibility that "Newton's second law of motion" may be assigned. Quite often, the choice of concepts to be assigned after, say, Concept A, are limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.
If I let the random forest do its job of picking variables at random, then there will be a few trees which will not have the variable for Concept A, in which case, the prediction may not make much sense, which is why I'd like to force this variable into selection. Better yet, it'd be great if this variable is chosen as the first variable in each tree to split on.
Does this make things clear? Is random forest not a candidate at all for this job?
There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.
So, it isn't too difficult to create this ourselves manually for trees that are forced to use a specific set of features. I've written a class to do this below. This does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. This is meant to give you a flavor of how to build it yourself:
FixedFeatureRFC.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier
class FixedFeatureRFC:
def __init__(self, n_estimators=10, random_state=None):
self.n_estimators = n_estimators
if random_state is None:
self.random_state = np.random.RandomState()
def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
"""
feats_fixed: indices of features (columns of X) to be
always used to train each estimator
max_features: number of features that each estimator will use,
including the fixed features.
bootstrap_frac: size of bootstrap sample that each estimator will use.
"""
self.estimators = []
self.feats_used = []
self.n_classes = np.unique(y).shape[0]
if feats_fixed is None:
feats_fixed = []
if max_features is None:
max_features = X.shape[1]
n_samples = X.shape[0]
n_bs = int(bootstrap_frac*n_samples)
feats_fixed = list(feats_fixed)
feats_all = range(X.shape[1])
random_choice_size = max_features - len(feats_fixed)
feats_choosable = set(feats_all).difference(set(feats_fixed))
feats_choosable = np.array(list(feats_choosable))
for i in range(self.n_estimators):
chosen = self.random_state.choice(feats_choosable,
size=random_choice_size,
replace=False)
feats = feats_fixed + list(chosen)
self.feats_used.append(feats)
bs_sample = self.random_state.choice(n_samples,
size=n_bs,
replace=True)
dtc = DecisionTreeClassifier(random_state=self.random_state)
dtc.fit(X[bs_sample][:,feats], y[bs_sample])
self.estimators.append(dtc)
def predict_proba(self, X):
out = np.zeros((X.shape[0], self.n_classes))
for i in range(self.n_estimators):
out += self.estimators[i].predict_proba(X[:,self.feats_used[i]])
return out / self.n_estimators
def predict(self, X):
return self.predict_proba(X).argmax(axis=1)
def score(self, X, y):
return (self.predict(X) == y).mean()
Here is a test script to see if the class above works as intended:
test.py
import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC
rs = np.random.RandomState(1234)
BC = load_breast_cancer()
X,y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8
print "n_features =", X.shape[1]
fixed = [0,4,21]
maxf = 10
ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)
for feats in ffrfc.feats_used:
assert len(feats) == maxf
for f in fixed:
assert f in feats
print ffrfc.score(X[~train], y[~train])
The output is:
n_features = 30
0.983739837398
None of the assertions failed, indicating that the features we have chosen to be fixed were used in each random feature subsample and that the size of each feature subsample was the required max_features size. The high accuracy on the held-out data indicates that the classifier is working properly.
I do not believe there is a way in scikit now. You could use max_features=None which removes all randomness of feature selections.
If you can switch packages, R's Ranger (https://cran.r-project.org/web/packages/ranger/ranger.pdf) has options split.select.weights and always.split.variables which may be what you are looking for. Define the probability for the random choices or always include these features in addition to the random choices.
This works against the overall design of random forest, reducing the randomness which may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem to choose this option. As #Michal alluded to, proceed carefully here.

SelectKBest based on (estimated) amount of features

I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that arranges all input strings in one (or more) of ~50 categories. For each of these categories, I'm gonna train a new classifier, which solves the actual task.
The reason for this two-layer approach is training performance and memory issues (a classifier which is supposed to separate >1k classes does not perform very well...).
This is what my pipeline looks like for each of these "subclassifiers"
pipeline = Pipeline([
('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3,8), max_df=0.1)),
('tfidf', TfidfTransformer(norm='l2')),
('feat', SelectKBest(chi2, k=10000)),
('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])
Now to my problem: I'm using SelectKBest to limit the model size to a reasonable amount, but for the subclassifiers, there is sometimes not enough input data available so I don't even get to the 10k feature limit, which causes
(...)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
self._check_params(X, y)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
% self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.
I don't know how many features I will have without applying the CountVectorizer, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step, if there are less than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).
Any thoughts on this?
I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:
class SelectAtMostKBest(SelectKBest):
def _check_params(self, X, y):
if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
# set k to "all" (skip feature selection), if less than k features are available
self.k = "all"
I tried to add this snipped to his answer but the request was rejected so there you are...
I think the cleanest option would be to subclass SelectKBest and fallback to identity transformation in your implementation, if k exceeds the number of input features, otherwise just call the super implementation.
You could use SelectPercentile, which is more meaningful if you don't have a fixed number of features.

Features in sklearn logistic regression

I have some problem with adding own features to sklearn.linear_model.LogisticRegression. But anyway lets see some example code:
from sklearn.linear_model import LogisticRegression, LinearRegression
import numpy as np
#Numbers are class of tag
resultsNER = np.array([1,2,3,4,5])
#Acording to resultNER every row is another class so is another features
#but in this way every row have the same features
xNER = np.array([[1.,0.,0.,0.,-1.,1.],
[1.,0.,1.,0.,0.,1.],
[1.,1.,1.,1.,1.,1.],
[0.,0.,0.,0.,0.,0.],
[1.,1.,1.,0.,0.,0.]])
#Assing resultsNER to y
y = resultsNER
#Create LogReg
logit = LogisticRegression(C=1.0)
#Learn LogReg
logit.fit(xNER,y)
#Some test vector to check wich class will be predict
xPP = np.array([1.,1.,1.,0.,0.,1.])
#linear = LinearRegression()
#linear.fit(x, y)
print "expected: ", y
print "predicted:", logit.predict(xPP)
print "decision: ",logit.decision_function(xNER)
print logit.coef_
#print linear.predict(x)
print "params: ",logit.get_params(deep=True)
Code above is clear and easy. So I have some classes which I called 1,2,3,4,5(resultsNER) they are related to some classes like "data", "person", "organization" etc. So for each class I make custom features which return true or false, in this case one and zero numbers. Example: if token equals "(S|s)unday", it is data class. Mathematically it is clear. I have token for each class features I test it. Then I look which class have the max value of sum of features (that’s why return number not boolean) and pick it up. In other words I use argmax function. Of course in summarization each feature have alpha coefficients. In this case it is multiclass classification, so I need to know how to add multiclass features to sklearn.LogisticRegression.
I need two things, alphas coefficients and add my own features to Logistic Regression. The most important for me is how to add to sklearn.LogisticRegression my own features functions for each class.
I know I can compute coefficients by gradient descent. But I think when I use fit(x,y) the LogisticRegression use some algorithm to compute coefficients witch I can get by attribute
.coef_ .
So in the end my main question is how to add custom features for different classes in my example classes 1,2,3,4,5 (resultNER).
Not quite sure about your question, but few thing that might help you:
You can use predict_proba function to estimate probabilities for each class:
>>> logit.predict_proba(xPP)
array([[ 0.1756304 , 0.22633999, 0.25149571, 0.10134168, 0.24519222]])
If you want features to have some weights (is this the thing you're calling alpha?), you do it not in learning algorithm but on preprocessing phase. I your case you can use an array of coefficients:
>>> logit = LogisticRegression(C=1.0).fit(xNER,y)
>>> logit.predict(xPP)
array([3])
>>> alpha = np.array([[0.2, 0.2, 1, 1, 0.3, 1]])
>>> logit = LogisticRegression(C=1.0).fit(alpha*xNER,y)
>>> logit.predict(alpha*xPP)
array([2])

Training a sklearn LogisticRegression classifier without all possible labels

I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifer on held out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
prob_per_class = (zip(clf.classes_, prob)
+ zip(classes_not_trained, repeat(0.)))
produces a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
prob_per_class = zip(clf.classes_, prob) + zip(classes_not_trained, repeat(0.))
# put the probabilities in class order
prob_per_class = sorted(prob_per_class)
new_prob.append(i[1] for i in prob_per_class)
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.

Categories

Resources