This is a bit of a newbie question.
I'd like to train a Random Forest using a RandomForestClassifier from sklearn. I have a few variables, but out of these variables, I'd like the algorithm to pick a variable (let's call it SourceID) for sure in every single tree that it trains.
How do I do that? I don't see any paramters in the classifier that would help in this case.
Any help would be appreciated!
TIA.
EDIT
So here's the scenario I have..
If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept would be heavily dependant on Concept A which has already been assigned. For example - after assigning "Newton's first law of motion", there's a great possibility that "Newton's second law of motion" may be assigned. Quite often, the choice of concepts to be assigned after, say, Concept A, are limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.
If I let the random forest do its job of picking variables at random, then there will be a few trees which will not have the variable for Concept A, in which case, the prediction may not make much sense, which is why I'd like to force this variable into selection. Better yet, it'd be great if this variable is chosen as the first variable in each tree to split on.
Does this make things clear? Is random forest not a candidate at all for this job?
There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.
So, it isn't too difficult to create this ourselves manually for trees that are forced to use a specific set of features. I've written a class to do this below. This does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. This is meant to give you a flavor of how to build it yourself:
FixedFeatureRFC.py
import numpy as np
from sklearn.tree import DecisionTreeClassifier
class FixedFeatureRFC:
def __init__(self, n_estimators=10, random_state=None):
self.n_estimators = n_estimators
if random_state is None:
self.random_state = np.random.RandomState()
def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
"""
feats_fixed: indices of features (columns of X) to be
always used to train each estimator
max_features: number of features that each estimator will use,
including the fixed features.
bootstrap_frac: size of bootstrap sample that each estimator will use.
"""
self.estimators = []
self.feats_used = []
self.n_classes = np.unique(y).shape[0]
if feats_fixed is None:
feats_fixed = []
if max_features is None:
max_features = X.shape[1]
n_samples = X.shape[0]
n_bs = int(bootstrap_frac*n_samples)
feats_fixed = list(feats_fixed)
feats_all = range(X.shape[1])
random_choice_size = max_features - len(feats_fixed)
feats_choosable = set(feats_all).difference(set(feats_fixed))
feats_choosable = np.array(list(feats_choosable))
for i in range(self.n_estimators):
chosen = self.random_state.choice(feats_choosable,
size=random_choice_size,
replace=False)
feats = feats_fixed + list(chosen)
self.feats_used.append(feats)
bs_sample = self.random_state.choice(n_samples,
size=n_bs,
replace=True)
dtc = DecisionTreeClassifier(random_state=self.random_state)
dtc.fit(X[bs_sample][:,feats], y[bs_sample])
self.estimators.append(dtc)
def predict_proba(self, X):
out = np.zeros((X.shape[0], self.n_classes))
for i in range(self.n_estimators):
out += self.estimators[i].predict_proba(X[:,self.feats_used[i]])
return out / self.n_estimators
def predict(self, X):
return self.predict_proba(X).argmax(axis=1)
def score(self, X, y):
return (self.predict(X) == y).mean()
Here is a test script to see if the class above works as intended:
test.py
import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC
rs = np.random.RandomState(1234)
BC = load_breast_cancer()
X,y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8
print "n_features =", X.shape[1]
fixed = [0,4,21]
maxf = 10
ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)
for feats in ffrfc.feats_used:
assert len(feats) == maxf
for f in fixed:
assert f in feats
print ffrfc.score(X[~train], y[~train])
The output is:
n_features = 30
0.983739837398
None of the assertions failed, indicating that the features we have chosen to be fixed were used in each random feature subsample and that the size of each feature subsample was the required max_features size. The high accuracy on the held-out data indicates that the classifier is working properly.
I do not believe there is a way in scikit now. You could use max_features=None which removes all randomness of feature selections.
If you can switch packages, R's Ranger (https://cran.r-project.org/web/packages/ranger/ranger.pdf) has options split.select.weights and always.split.variables which may be what you are looking for. Define the probability for the random choices or always include these features in addition to the random choices.
This works against the overall design of random forest, reducing the randomness which may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem to choose this option. As #Michal alluded to, proceed carefully here.
Related
I'm writting a class on Python, where I'm trying to automatically pick up a value of num_features_to_select in CatBoostClassifier().select_features(). Right now, function uses enumeration of num_features_to_select values.
Code:
def CatBoost(X_var=df.drop(columns=['status']), y_var=df[['creation_date','status']]):
from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
from sklearn.model_selection import train_test_split
from datetime import datetime, timedelta # подключаем библиотеку datetime для работы с датами
import os
os.environ['OPENBLAS_NUM_THREADS'] = '10'
valid_time_border = X_var['creation_date'].max()-timedelta(days=7)
X_train, X_test, y_train, y_test = train_test_split(X_var[X_var['creation_date']<=valid_time_border]\
.drop(columns=['creation_date']),\
y_var[y_var['creation_date']<=valid_time_border]['status'],\
test_size=0.3)
X_valid = X_var[X_var['creation_date']>valid_time_border].drop(columns=['creation_date'])
y_valid = y_var[y_var['creation_date']>valid_time_border]['status']
best_accurancy = 0
mas_num_features_to_select = [10,20,30,40,50,60]
for i in mas_num_features_to_select:
# Определяем все переменные
predict_columns = X_train.columns.to_list()
# определяем категориальные переменные
cat_features_num = np.where(np.isin(X_train[X_train.columns].dtypes, ['bool', 'object']))[0]
train_pool = Pool(X_train, y_train, cat_features=cat_features_num, feature_names=list(predict_columns))
test_pool = Pool(X_test, y_test, cat_features=cat_features_num, feature_names=list(predict_columns))
model = CatBoostClassifier(iterations=round(200), eval_metric='AUC', thread_count = 10)
summary = model.select_features(
train_pool,
eval_set=test_pool,
features_for_select=predict_columns,
num_features_to_select=i,
steps=15,
algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
shap_calc_type=EShapCalcType.Regular,
train_final_model=False,
logging_level='Silent',
plot=False
)
predict_columns = summary['selected_features_names']
model.fit(X_train, y_train)
y_pred = model.predict(X_valid) # предсказываем новые данные
mislabel = np.sum((y_valid!=y_pred)) # считаем неправильно посчитанные значения
accurancy = 1 - mislabel/len(y_pred)
print(accurancy)
if accurancy > best_accurancy:
best_accurancy = accurancy
best_predict_columns = predict_columns
print('Лучшая точность предсказания: '+str(best_accurancy))
print('Лучшие фичи:')
print(best_predict_columns)
return(best_predict_columns)
I can't find any information about methods which afford to use built in function of automatic feature selection. Is it even possible using CatBoost?
Use the summary dictionary output to find your best point. If you want an interactive plot to define it, you can use the following:
import matplotlib.pyplot as plt
line = plt.plot(summary["loss_graph"]["removed_features_count"], summary["loss_graph"]["loss_values"], picker=True)
x = plt.ginput(n=1, timeout=30, show_clicks=True)
print(x)
If I understand your question correctly, you're looking for a way of using select_features to determine how many and which features to include in the model such that performance is maintained/improved while eliminating the maximum number of features. Sadly, your approach seems to be the best for an automated function. CatBoost does not return the features from the iteration with the best performance, only the features remaining after pruning down to the number of features specified in num_features_to_select by iterating steps number of times.
If you can compromise and add a manual step, you can set plot=True and see at which number of features the loss value is minimized, such as in CatBoost's documentation here:
If you set steps to the number of features, features will be removed one by one, and you can see the loss for the removal of each feature. You could then manually select the number of features to match that iteration. It would be nice if CatBoost had a "train_best_model" parameter instead of just a "train_final_model" parameter! I don't know if theres a way to capture what this function logs to stdout or outputs in the plot, but that contains the loss value, and would allow you set the value.
Edit: I thought of one more approach that is still a form of iterating over num_features_to_select parameter, but may be interesting.
Set train_final_model=True, steps=1, and num_features_to_select to the width of your dataset
Iteratively subtract 1 from num_features_to_select
At the end of each loop, test the performance of the model
Stop if negative performance change exceeds a threshold (e.g., -5% or -2%)
This may take a while, depending on how long the training takes, but would automatically pick the num_features_to_select as you desire.
If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy,Bernoulli
import theano.tensor as tt
with Model() as varying_slope:
mu_beta = Normal('mu_beta', mu=0., sd=1e5)
sigma_beta = HalfCauchy('sigma_beta', 5)
a = Normal('a', mu=0., sd=1e5)
betas = Normal('b',mu=mu_beta,sd=sigma_beta,shape=(n_features,n_site))
y_hat = a + tt.dot(X_shared,betas[:,site_shared])
y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500
x = tt.matrix('X',dtype='float64')
new_site = tt.vector('new_site',dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1,X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1,1))
_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
size=n_samples,
more_replacements={X_shared: x,site_shared:new_site})
sample_proba = theano.function([x,new_site,n_samples], _sample_proba)
pred_test = sample_proba(test_X.reshape(1,-1),np.array(site_to_predict).reshape(-1),samples)
but what is the correct way to sample from the posterior predictive distribution if we have a new unseen site ?
I'm just copying my answer from the pymc discourse thread if someone by chance runs into this question or another one like it here.
First of all, beware of the centered hierarchical parametrization 1 you are using, it may lead to divergences and difficulties while fitting.
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, we will aim to get that.
The way in which we always recommend out of sample posterior predictive checks is to use theano.shared's. I’ll use a different approach, inspired in the functional API that is being the core design idea for pymc4. The are many differences I wont go into between pymc3 and the skeleton of pymc4, but one thing that I started to use more were factory functions to get the Model instances. Instead of trying to define things inside the model with theano.shared's, I just create a new model with the new data and draw posterior predictive samples from it. I just recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you use have to extract from the trace the hierarchical part which is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site, and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a part of the training trace. Here’s a self-contained example
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt
def model_factory(X, y, site_shared, n_site, n_features=None):
if n_features is None:
n_features = X.shape[-1]
with pm.Model() as model:
mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
sigma_beta = pm.HalfCauchy('sigma_beta', 5)
a = pm.Normal('a', mu=0., sd=1)
b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
betas = mu_beta + sigma_beta * b
y_hat = a + tt.dot(X, betas[:, site_shared])
pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
return model
# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)
# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
y=np.empty(ntrain_obs, dtype=np.int32),
site_shared=train_site_shared,
n_site=ntrain_site) as train_y_generator:
train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
y=np.empty(ntest_obs, dtype=np.int32),
site_shared=test_site_shared,
n_site=ntest_site) as test_y_generator:
new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
y=train_Y,
site_shared=train_site_shared,
n_site=ntrain_site) as train_model:
train_trace = pm.sample()
# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
y=new_site_Y,
site_shared=test_site_shared,
n_site=ntrain_site) as test_model:
# We first have to extract the learnt global effect from the train_trace
df = pm.trace_to_dataframe(train_trace,
varnames=['mu_beta', 'sigma_beta', 'a'],
include_transformed=True)
# We have to supply the samples kwarg because it cannot be inferred if the
# input trace is not a MultiTrace instance
ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
samples=len(df))
plt.figure()
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
The posterior predictive I get looks like this:
Of course, I don’t know what kind of data to concretely put as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code, you should replace that with your actual data.
All regression examples I find are examples where you predict a real number and unlike with classification you dont the the confidence the model had when predicting that number. I have done in reinforcement learning another way the output is instead the mean and std and then you sample from that distribution. Then you know how confident the model is at predicting every value. Now I cant find how to do this using supervised learning in pytorch. The problem is that I dont understand how to perform sample from the distribution the get the actual value while training or what sort of loss function I should use, not sure how for example MSE or L1Smooth would work.
Is there any example ot there where this is done in pytorch in a robust and state of the art way?
The key point is that you do not need to sample from the NN-produced distribution. All you need is to optimize the likelihood of the target value under the NN distribution.
There is an example in the official PyTorch example on VAE (https://github.com/pytorch/examples/tree/master/vae), though for multidimensional Bernoulli distribution.
Since PyTorch 0.4, you can use torch.distributions: instantiate distribution distro with outputs of your NN and then optimize -distro.log_prob(target).
EDIT: As requested in a comment, a complete example of using the torch.distributions module.
First, we create a heteroscedastic dataset:
import numpy as np
import torch
X = np.random.uniform(size=300)
Y = X + 0.25*X*np.random.normal(size=X.shape[0])
We build a trivial model, which is perfectly able to match the generative process of our data:
class Model(torch.nn.Module):
def __init__(self):
super().__init__()
self.mean_coeff = torch.nn.Parameter(torch.Tensor([0]))
self.var_coeff = torch.nn.Parameter(torch.Tensor([1]))
def forward(self, x):
return torch.distributions.Normal(self.mean_coeff * x, self.var_coeff * x)
mdl = Model()
optim = torch.optim.SGD(mdl.parameters(), lr=1e-3)
Initialization of the model makes it always produce a standard normal, which is a poor fit for our data, so we train (note it is a very stupid batch training, but demonstrates that you can output a set of distributions for your batch at once):
for _ in range(2000): # epochs
dist = mdl(torch.from_numpy(X).float())
obj = -dist.log_prob(torch.from_numpy(Y).float()).mean()
optim.zero_grad()
obj.backward()
optim.step()
Eventually, the learned parameters should match the values we used to construct the Y.
print(mdl.mean_coeff, mdl.var_coeff)
# tensor(1.0150) tensor(0.2597)
I have some problem with adding own features to sklearn.linear_model.LogisticRegression. But anyway lets see some example code:
from sklearn.linear_model import LogisticRegression, LinearRegression
import numpy as np
#Numbers are class of tag
resultsNER = np.array([1,2,3,4,5])
#Acording to resultNER every row is another class so is another features
#but in this way every row have the same features
xNER = np.array([[1.,0.,0.,0.,-1.,1.],
[1.,0.,1.,0.,0.,1.],
[1.,1.,1.,1.,1.,1.],
[0.,0.,0.,0.,0.,0.],
[1.,1.,1.,0.,0.,0.]])
#Assing resultsNER to y
y = resultsNER
#Create LogReg
logit = LogisticRegression(C=1.0)
#Learn LogReg
logit.fit(xNER,y)
#Some test vector to check wich class will be predict
xPP = np.array([1.,1.,1.,0.,0.,1.])
#linear = LinearRegression()
#linear.fit(x, y)
print "expected: ", y
print "predicted:", logit.predict(xPP)
print "decision: ",logit.decision_function(xNER)
print logit.coef_
#print linear.predict(x)
print "params: ",logit.get_params(deep=True)
Code above is clear and easy. So I have some classes which I called 1,2,3,4,5(resultsNER) they are related to some classes like "data", "person", "organization" etc. So for each class I make custom features which return true or false, in this case one and zero numbers. Example: if token equals "(S|s)unday", it is data class. Mathematically it is clear. I have token for each class features I test it. Then I look which class have the max value of sum of features (that’s why return number not boolean) and pick it up. In other words I use argmax function. Of course in summarization each feature have alpha coefficients. In this case it is multiclass classification, so I need to know how to add multiclass features to sklearn.LogisticRegression.
I need two things, alphas coefficients and add my own features to Logistic Regression. The most important for me is how to add to sklearn.LogisticRegression my own features functions for each class.
I know I can compute coefficients by gradient descent. But I think when I use fit(x,y) the LogisticRegression use some algorithm to compute coefficients witch I can get by attribute
.coef_ .
So in the end my main question is how to add custom features for different classes in my example classes 1,2,3,4,5 (resultNER).
Not quite sure about your question, but few thing that might help you:
You can use predict_proba function to estimate probabilities for each class:
>>> logit.predict_proba(xPP)
array([[ 0.1756304 , 0.22633999, 0.25149571, 0.10134168, 0.24519222]])
If you want features to have some weights (is this the thing you're calling alpha?), you do it not in learning algorithm but on preprocessing phase. I your case you can use an array of coefficients:
>>> logit = LogisticRegression(C=1.0).fit(xNER,y)
>>> logit.predict(xPP)
array([3])
>>> alpha = np.array([[0.2, 0.2, 1, 1, 0.3, 1]])
>>> logit = LogisticRegression(C=1.0).fit(alpha*xNER,y)
>>> logit.predict(alpha*xPP)
array([2])
I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare down the feature space to a more reasonable size for some of the bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from the GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].
I've pasted some relevant code; I don't know if this is enough to go on, but I'm hoping that I'm just overlooking something obvious in using the sklearn api. I have a couple different feature sets that I've tried pushing through it, also with the same results. Same thing too using the training set and with cross-validation. Any thoughts? Could it be that my feature space simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.
class GNBLearner(BaseLearner):
def __init__(self, featureCount):
self.gnb = GaussianNB()
self.featureCount = featureCount
def train(self, instances, params):
X = np.zeros( (len(instances), self.featureCount) )
Y = [0]*len(instances)
for i, inst in enumerate(instances):
for idx,val in inst.data:
X[i,idx-1] = val
Y[i] = inst.c
self.gnb.fit(X, Y)
def test(self, instances, params):
X = np.zeros( (len(instances), self.featureCount) )
for i, inst in enumerate(instances):
for idx,val in inst.data:
X[i,idx-1] = val
return self.gnb.predict(X)
def conf_mtx(self, res, test_set):
conf = [[0,0],[0,0]]
for r, x in xzip(res, test_set):
print "pred: %d, act: %d" % (r, x.c)
conf[(x.c+1)/2][(r+1)/2] += 1
return conf
GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, uses tf-idf weighting using the built-in TfidfTransformer.
Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.