I'm having a problem adding my own features to sklearn.linear_model.LogisticRegression. But first, let's look at some example code:
from sklearn.linear_model import LogisticRegression
import numpy as np

# Numbers are the class labels of the tags
resultsNER = np.array([1, 2, 3, 4, 5])
# According to resultsNER, every row is a different class, so each should have
# its own features, but this way every row has the same feature set
xNER = np.array([[1., 0., 0., 0., -1., 1.],
                 [1., 0., 1., 0., 0., 1.],
                 [1., 1., 1., 1., 1., 1.],
                 [0., 0., 0., 0., 0., 0.],
                 [1., 1., 1., 0., 0., 0.]])
# Assign resultsNER to y
y = resultsNER
# Create the logistic regression model
logit = LogisticRegression(C=1.0)
# Train it
logit.fit(xNER, y)
# A test vector to check which class will be predicted
xPP = np.array([[1., 1., 1., 0., 0., 1.]])

print("expected: ", y)
print("predicted:", logit.predict(xPP))
print("decision: ", logit.decision_function(xNER))
print(logit.coef_)
print("params: ", logit.get_params(deep=True))
The code above is clear and easy. I have some classes, which I've called 1, 2, 3, 4, 5 (resultsNER); they correspond to classes like "date", "person", "organization", etc. For each class I write custom features which return true or false, in this case one and zero. Example: if a token matches "(S|s)unday", it belongs to the date class. Mathematically it is clear: for each class I have feature functions, and I test the token against them. Then I look at which class has the maximum sum of feature values (that's why they return numbers, not booleans) and pick it; in other words, I use an argmax function. Of course, in the summation each feature has an alpha coefficient. This is multiclass classification, so I need to know how to add multiclass features to sklearn.LogisticRegression.
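To make this concrete, here is a minimal sketch of the scoring rule I described (classify, feature_funcs and alphas are just illustrative names, not from any library):

import numpy as np

# Sketch of the argmax scoring described above: every class has its own
# hand-written feature functions and alpha coefficients; the predicted
# class is the one with the largest weighted feature sum.
def classify(token, feature_funcs, alphas):
    scores = {}
    for c, funcs in feature_funcs.items():
        values = np.array([f(token) for f in funcs])  # each f returns 1 or 0
        scores[c] = np.dot(alphas[c], values)         # weighted sum
    return max(scores, key=scores.get)                # argmax over classes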
I need two things: the alpha coefficients, and a way to add my own features to the logistic regression. The most important part for me is how to add my own per-class feature functions to sklearn.LogisticRegression.
I know I can compute the coefficients by gradient descent myself, but I think that when I call fit(x, y), LogisticRegression already uses some algorithm internally to compute the coefficients, which I can then retrieve from the .coef_ attribute.
So in the end, my main question is: how do I add custom features for the different classes, in my example classes 1, 2, 3, 4, 5 (resultsNER)?
I'm not quite sure about your question, but here are a few things that might help you:
You can use the predict_proba method to estimate probabilities for each class:
>>> logit.predict_proba(xPP)
array([[ 0.1756304 , 0.22633999, 0.25149571, 0.10134168, 0.24519222]])
If you want features to have some weights (is this the thing you're calling alpha?), you do it not in the learning algorithm but in the preprocessing phase. In your case you can use an array of coefficients:
>>> logit = LogisticRegression(C=1.0).fit(xNER,y)
>>> logit.predict(xPP)
array([3])
>>> alpha = np.array([[0.2, 0.2, 1, 1, 0.3, 1]])
>>> logit = LogisticRegression(C=1.0).fit(alpha*xNER,y)
>>> logit.predict(alpha*xPP)
array([2])
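If you want this weighting to travel with the model instead of being applied by hand, one option (just a sketch, using the standard FunctionTransformer and make_pipeline utilities, and xNER, y, xPP from the question) is to put it in a pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

alpha = np.array([0.2, 0.2, 1, 1, 0.3, 1])

# The transformer scales every feature column by its alpha before the
# classifier sees it, both at fit time and at predict time.
model = make_pipeline(
    FunctionTransformer(lambda X: X * alpha),
    LogisticRegression(C=1.0),
)
model.fit(xNER, y)
print(model.predict(xPP))

Anything you pass to fit or predict then gets scaled by alpha automatically.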
There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)

    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, a RandomForestRegressor will never exceed the range of the target variables it was trained on: simply put probabilities in and you will get probabilities out. Neural networks with an appropriate output activation function (a sigmoid, or a rescaled tanh as in the example above) also work well with probabilities, but if you want to use those, there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
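As a rough sketch of the random forest alternative mentioned above (reusing x and p from the example):

from sklearn.ensemble import RandomForestRegressor

# A forest's predictions are averages of training targets, so they can
# never leave the [0, 1] range the model was trained on.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(x, p.ravel())
print(forest.predict([[-10], [0.0], [1]]))  # stays within [0, 1]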
This is a bit of a newbie question.
I'd like to train a random forest using RandomForestClassifier from sklearn. I have a few variables, but out of these, I'd like the algorithm to pick one particular variable (let's call it SourceID) for sure in every single tree that it trains.
How do I do that? I don't see any parameters in the classifier that would help in this case.
Any help would be appreciated!
TIA.
EDIT
So here's the scenario I have..
If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept is heavily dependent on Concept A, which has already been assigned. For example, after assigning "Newton's first law of motion", there's a great possibility that "Newton's second law of motion" will be assigned. Quite often, the choice of concepts to be assigned after, say, Concept A, is limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.
If I let the random forest do its job of picking variables at random, then there will be a few trees that don't have the variable for Concept A, in which case the prediction may not make much sense. That's why I'd like to force this variable into selection. Better yet, it would be great if this variable were chosen as the first variable to split on in each tree.
Does this make things clear? Is random forest not a candidate at all for this job?
There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.
So, it isn't too difficult to create this ourselves manually for trees that are forced to use a specific set of features. I've written a class to do this below. This does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. This is meant to give you a flavor of how to build it yourself:
FixedFeatureRFC.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class FixedFeatureRFC:
    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        if random_state is None:
            self.random_state = np.random.RandomState()
        else:
            self.random_state = np.random.RandomState(random_state)

    def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
        """
        feats_fixed: indices of features (columns of X) to be
                     always used to train each estimator

        max_features: number of features that each estimator will use,
                      including the fixed features.

        bootstrap_frac: size of bootstrap sample that each estimator will use.
        """
        self.estimators = []
        self.feats_used = []
        self.n_classes = np.unique(y).shape[0]

        if feats_fixed is None:
            feats_fixed = []
        if max_features is None:
            max_features = X.shape[1]

        n_samples = X.shape[0]
        n_bs = int(bootstrap_frac * n_samples)

        feats_fixed = list(feats_fixed)
        feats_all = range(X.shape[1])

        random_choice_size = max_features - len(feats_fixed)

        feats_choosable = set(feats_all).difference(set(feats_fixed))
        feats_choosable = np.array(list(feats_choosable))

        for i in range(self.n_estimators):
            # the fixed features plus a random draw of the remaining ones
            chosen = self.random_state.choice(feats_choosable,
                                              size=random_choice_size,
                                              replace=False)
            feats = feats_fixed + list(chosen)
            self.feats_used.append(feats)

            # bootstrap sample of the rows
            bs_sample = self.random_state.choice(n_samples,
                                                 size=n_bs,
                                                 replace=True)

            dtc = DecisionTreeClassifier(random_state=self.random_state)
            dtc.fit(X[bs_sample][:, feats], y[bs_sample])
            self.estimators.append(dtc)

    def predict_proba(self, X):
        # average the class probabilities over all trees
        out = np.zeros((X.shape[0], self.n_classes))
        for i in range(self.n_estimators):
            out += self.estimators[i].predict_proba(X[:, self.feats_used[i]])
        return out / self.n_estimators

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score(self, X, y):
        return (self.predict(X) == y).mean()
Here is a test script to see if the class above works as intended:
test.py
import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC

rs = np.random.RandomState(1234)
BC = load_breast_cancer()
X, y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8

print("n_features =", X.shape[1])

fixed = [0, 4, 21]
maxf = 10

ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)

for feats in ffrfc.feats_used:
    assert len(feats) == maxf
    for f in fixed:
        assert f in feats

print(ffrfc.score(X[~train], y[~train]))
The output is:
n_features = 30
0.983739837398
None of the assertions failed, indicating that the features we have chosen to be fixed were used in each random feature subsample and that the size of each feature subsample was the required max_features size. The high accuracy on the held-out data indicates that the classifier is working properly.
I do not believe there is a way to do this in scikit-learn right now. You could use max_features=None, which removes all randomness from the feature selection.
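For example, a minimal sketch:

from sklearn.ensemble import RandomForestClassifier

# Every split now considers all features, so SourceID is always available;
# the trees then differ only through their bootstrap samples.
clf = RandomForestClassifier(n_estimators=100, max_features=None)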
If you can switch packages, R's ranger (https://cran.r-project.org/web/packages/ranger/ranger.pdf) has the options split.select.weights and always.split.variables, which may be what you are looking for: you can define the probability of each feature being considered for the random choices, or always include certain features in addition to the random choices.
This works against the overall design of random forests, reducing the randomness, which may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem to choose this option. As @Michal alluded to, proceed carefully here.
I've been learning how to use machine-learning classifiers lately, and I started wondering whether there is anything in sklearn that can take either an array or a distribution in each i,j cell as training data. Does such a classification algorithm exist in scikit-learn? If so, how is it used? If not, can someone provide some insight into any algorithms that are known to handle this type of data?
Somebody asked kind of a similar question: https://stats.stackexchange.com/questions/178109/linear-regression-problem-with-multi-dimensional-vectors-instead-of-scalar-value#comment443880_178109 but it was for Regression and it was also never answered.
I tried using just a RandomForestClassifier but it didn't like the array instead of the scalar. If it's more a Bayesian problem, I would be keen on using PyMC3 but I don't even know what algorithms to look at to even start the process.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Create 2 distinguishable classes, 20 rows each, where each column holds
# an array of fixed size (`attr_0`=3, `attr_1`=5, `attr_2`=2)
class_A = np.vstack([[np.random.normal(loc=0, scale=1, size=3),
                      np.random.normal(loc=5, scale=1, size=5),
                      np.random.normal(loc=10, scale=1, size=2)] for k in range(20)])
class_B = np.vstack([[np.random.normal(loc=15, scale=1, size=3),
                      np.random.normal(loc=20, scale=1, size=5),
                      np.random.normal(loc=30, scale=1, size=2)] for k in range(20)])

# Merge them
Ar_data = np.concatenate([class_A, class_B], axis=0)
X = pd.DataFrame(Ar_data, columns=["attr_0", "attr_1", "attr_2"])

# Create target vector
y = np.array(20*[0] + 20*[1])

# Test data
X_test = [np.random.normal(loc=0, scale=1, size=3),
          np.random.normal(loc=5, scale=1, size=5),
          np.random.normal(loc=10, scale=1, size=2)]
X_test
# [array([-0.15510844,  0.04567395, -0.66192602]),
#  array([ 4.5412568 ,  4.32526163,  4.56558114,  5.48178697,  5.2559264 ]),
#  array([ 9.17293292, 10.19746434])]
I tried fitting a RandomForestClassifier but it didn't work :(
Mod_rf = RandomForestClassifier()
Mod_rf.fit(X,y)
# ValueError: setting an array element with a sequence.
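One workaround: sklearn estimators require a plain 2-D numeric matrix, so the per-cell arrays need to be flattened into ordinary columns before fitting. A sketch, assuming the fixed per-column sizes above (the make_class helper is mine, for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rebuild each row as one flat 10-column vector (3 + 5 + 2) instead of a
# row of variable-length arrays.
def make_class(locs_sizes, n=20):
    return np.vstack([np.hstack([np.random.normal(loc=loc, scale=1, size=s)
                                 for loc, s in locs_sizes])
                      for _ in range(n)])

class_A = make_class([(0, 3), (5, 5), (10, 2)])
class_B = make_class([(15, 3), (20, 5), (30, 2)])
X_flat = np.concatenate([class_A, class_B], axis=0)   # shape (40, 10)
y = np.array(20*[0] + 20*[1])

Mod_rf = RandomForestClassifier()
Mod_rf.fit(X_flat, y)   # works: plain 2-D numeric input

This loses the grouping of the columns into attributes, but tree-based models generally cope well with that.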
I'm building a basic speaker recognizer with the GMM toolkit from sklearn. I have 3 classes, and for each class I have a classifier. In the testing stage, the GMM of the speaker with the highest probability should be selected, and the program should return the predicted class for each test sample. I want to vary the number of mixture components; in this example code I set n_components=4.
If I use 4 mixture components the output of my classifier will either be 0, 1, 2 or 3. If I use 3 mixture components, it will be 0, 1 or 2. I have the feeling that the classifier returns the predicted mixture component instead of the whole GMM. But I want it to predict the class: 1, 2 or 3.
Here is my code:
import numpy as np
from sklearn.mixture import GMM

# set path
path = "path"

class_names = [1, 2, 3]
covs = ['spherical', 'diag', 'tied', 'full']

training_data = {1: np.loadtxt(path + "/01_train_debug.data"),
                 2: np.loadtxt(path + "/02_train_debug.data"),
                 3: np.loadtxt(path + "/03_train_debug.data")}

print("Training models")
models = {}
for c in class_names:
    # make a GMM for each of the classes in class_names
    models[c] = dict((covar_type, GMM(n_components=4,
                                      covariance_type=covar_type,
                                      init_params='wmc', n_init=1, n_iter=20))
                     for covar_type in covs)

for cov in covs:
    for c in class_names:
        models[c][cov].fit(training_data[c])

# define test set
test01 = np.loadtxt(path + "/01_test_debug.data")
test02 = np.loadtxt(path + "/02_test_debug.data")
test03 = np.loadtxt(path + "/03_test_debug.data")
testing_data = {1: test01, 2: test02, 3: test03}

probs = {}
print("Calculating Probabilities")
for c in class_names:
    probs[c] = {}
    for cov in covs:
        probs[c][cov] = {}
        for p in class_names:
            probs[c][cov][p] = models[p][cov].predict(testing_data[c])

for c in class_names:
    print(c)
    for cov in covs:
        print(" ", cov, end=" ")
        for p in class_names:
            print(p, probs[c][cov][p], end=" ")
        print()
Is my assumption from above correct or do I have a logical error in my code?
Is there a way to solve this in sklearn?
Thanks in advance for your help!
(At first I thought the keys of your models dict were covariance types, but the first-level keys are class names and the second-level keys are covariance types. I misread your code, sorry.)
Edit: if you want the per-sample likelihood of the data under a fitted GMM model, you should use the score_samples method. The predict method does not return probabilities but component assignments instead.
Also, GMM is by default an unsupervised model. If you want to build a supervised model out of a bunch of GMMs, you should probably wrap them in an estimator class that implements the fit / predict API, so that you can estimate its accuracy via cross-validation and tune the hyperparameter values by grid search. Pull request #2468 implements something like this. If it's merged in time, it might get included in the next scikit-learn release (0.15, which should come out early 2014).
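For illustration, here is a minimal sketch of such a wrapper using the modern API (GaussianMixture has since replaced GMM; the GMMClassifier name and structure are mine, not from sklearn):

import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GaussianMixture per class; predict the class whose mixture
    assigns the highest per-sample log-likelihood."""

    def __init__(self, n_components=4, covariance_type='full'):
        self.n_components = n_components
        self.covariance_type = covariance_type

    def fit(self, training_data):
        # training_data: dict mapping class label -> (n_samples, n_features)
        self.models_ = {
            c: GaussianMixture(n_components=self.n_components,
                               covariance_type=self.covariance_type).fit(X)
            for c, X in training_data.items()
        }
        return self

    def predict(self, X):
        classes = sorted(self.models_)
        # score_samples gives the log-likelihood of each sample under a GMM
        loglik = np.column_stack([self.models_[c].score_samples(X)
                                  for c in classes])
        return np.array(classes)[loglik.argmax(axis=1)]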
I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifier on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is: what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly usable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
This produces, for each row, a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for the learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    new_prob.append([p for cls, p in prob_per_class])
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.