I built an XGBoost model using scikit-learn and I am pretty happy with it. As a fine-tuning step to avoid overfitting, I'd like to enforce monotonicity on some features, but that is where I start running into difficulties...
As far as I can tell, there is no documentation in scikit-learn about XGBoost (which I confess really surprises me, given that this situation has lasted for several months). The only documentation I found is directly on http://xgboost.readthedocs.io
On this website, I found out that monotonicity can be enforced using the "monotone_constraints" option.
I tried to use it in scikit-learn but I got the error message "TypeError: __init__() got an unexpected keyword argument 'monotone_constraints'".
Do you know a way to do this?
Here is the code I wrote in Python (using Spyder):
grid = {'learning_rate': 0.01, 'subsample': 0.5, 'colsample_bytree': 0.5,
        'max_depth': 6, 'min_child_weight': 10, 'gamma': 1,
        'monotone_constraints': monotonic_indexes}
# 'monotone_constraints' ~ = "(1,-1)"
m07_xgm06 = xgb.XGBClassifier(n_estimators=2000, **grid)
m07_xgm06.fit(X_train_v01_oe, Label_train, early_stopping_rounds=10, eval_metric="logloss",
              eval_set=[(X_test1_v01_oe, Label_test1)])
In order to do this using the xgboost sklearn API, you need to upgrade to xgboost 0.81. They fixed the ability to set parameters controlled via kwargs as part of this PR:
https://github.com/dmlc/xgboost/pull/3791
The XGBoost scikit-learn API currently (0.6a2) doesn't support monotone_constraints. You can use the native Python API instead. Take a look at the example.
This code in the example can be removed:
params_constr['updater'] = "grow_monotone_colmaker,prune"
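For reference, here is a minimal sketch of the native Python API route; the data, parameter values, and constraint string are purely illustrative, not taken from the question:

import numpy as np
import xgboost as xgb

# Toy data: two features, binary labels (illustrative only)
X = np.random.rand(200, 2)
y = (X[:, 0] - X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'eta': 0.01,
    # increasing in the first feature, decreasing in the second
    'monotone_constraints': '(1,-1)',
}
bst = xgb.train(params, dtrain, num_boost_round=100)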
How would you expect monotone constraints to work for a general classification problem where the response might have more than 2 levels? All the examples I've seen relating to this functionality are for regression problems. If your classification response only has 2 levels, try switching to regression on an indicator variable and then choose an appropriate score threshold for classification.
This feature appears to work as of the latest xgboost / scikit-learn, provided that you use an XGBRegressor rather than an XGBClassifier and set monotone_constraints via kwargs.
The syntax is like this:
params = {
    'monotone_constraints': '(-1,0,1)'
}
normalised_weighted_poisson_model = XGBRegressor(**params)
In this example, there is a negative constraint on column 1 in the training data, no constraint on column 2, and a positive constraint on column 3. It is up to you to keep track of which is which - you cannot refer to columns by name, only by position, and you must specify an entry in the constraint tuple for every column in your training data.
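Since the constraints are positional, one way to keep track of them is to build the string from the training frame's columns. A minimal sketch; the column names and the constraint_for mapping below are hypothetical, not from the answer:

import pandas as pd
from xgboost import XGBRegressor

# Hypothetical training data and constraint directions; any column not
# listed in constraint_for defaults to 0 (unconstrained).
X_train = pd.DataFrame({'price':   [1.0, 2.0, 3.0, 4.0],
                        'quality': [3.0, 1.0, 4.0, 2.0],
                        'region':  [0, 1, 0, 1]})
y_train = [10.0, 8.0, 15.0, 9.0]

constraint_for = {'price': -1, 'quality': 1}
monotone = '(' + ','.join(str(constraint_for.get(c, 0)) for c in X_train.columns) + ')'
# monotone == '(-1,1,0)'

model = XGBRegressor(monotone_constraints=monotone).fit(X_train, y_train)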
I'm dealing with this warning:
[20:16:09] WARNING: ../src/learner.cc:541:
Parameters: { scale_pos_weight } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
while training an XGBoost model in Python.
I've been researching it, and it seems to be related to the classification type (binary or multiclass). The thing is that I'm doing binary classification on imbalanced data (6483252 negative / 70659 positive), so I need to set that parameter to account for the imbalance during training, but I don't understand why I'm getting that warning :(
This is how I'm initializing and training the XGBoost:
param = {'n_jobs': -1, 'random_state': 5, 'booster': 'gbtree', 'seed': 5,
         'objective': 'binary:hinge', 'scale_pos_weight': ratio}
param['eval_metric'] = ['auc', 'aucpr', 'rmse', 'error']
xgb_clf = xgb.XGBClassifier(**param)
xgb_clf.fit(dtrain, y_train)
dtrain is a pandas DataFrame and y_train is a pandas Series with the labels (0, 1).
Thanks!
There are two possible fixes:
1. Your training set is multiclass, in which case the parameter is not valid.
2. An n_jobs problem in the implementation (set n_jobs to 0).
These are the most common causes.
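For comparison, here is a minimal sketch of a binary setup in which scale_pos_weight is actually consumed; the data is synthetic, and computing the ratio from the class counts is only an assumption about how the question's ratio was defined:

import numpy as np
import xgboost as xgb

# Synthetic, heavily imbalanced stand-in for the question's data
X_train = np.random.rand(1000, 5)
y_train = np.zeros(1000, dtype=int)
y_train[:50] = 1

neg, pos = np.bincount(y_train, minlength=2)
ratio = neg / pos  # negatives per positive

clf = xgb.XGBClassifier(objective='binary:logistic',  # illustrative objective
                        scale_pos_weight=ratio,
                        n_jobs=-1,
                        random_state=5)
clf.fit(X_train, y_train)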
I have a trained ShareBoost model. How do I obtain the model's weight parameters/vectors?
I tried to get the individual linear machines and extract their weight vectors, but unlike the linear SVM they do not seem to have a get_w() method.
Also, even though the C++ ShareBoost class is a subclass of CMachine, the Parameters object obtained from m_parameters (see docs) does not appear to have the parameters available.
The following code is what I have tried.
num_machines = shareboost.get_num_machines()
# num_machines = 2
lm0 = shareboost.get_machine(0)
p0 = lm0.m_parameters
# The following method does not exist
p0.get_parameter(0)
In case you are using the C++ API, you could get the weight vector the following way:
auto lm = (CLinearMachine*)shareboost->get_machine(0);
lm->get_w();
Since you are using the Python API, this is currently only possible if you are using the new API of shogun (which is only available in the develop branch at the moment):
lm0 = shareboost.get_machine(0)
weights = lm0.get_real_vector("w")
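Putting this together, a small sketch that collects the weight vector of every machine; it assumes shareboost is an already-trained ShareBoost instance and that the develop-branch parameter API mentioned above is available:

# Collect the "w" vector of each underlying linear machine
all_weights = []
for i in range(shareboost.get_num_machines()):
    machine = shareboost.get_machine(i)
    all_weights.append(machine.get_real_vector("w"))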
See more examples of how to use the new API here:
http://shogun.ml/examples/nightly/examples/binary/linear_support_vector_machine.html
In R, mclust has an argument 'modelNames' where you can define which model to fit. I wish to do univariate modeling with unequal variances, which corresponds to modelNames <- 'V' in mclust, using mixture.GMM in Python. However, the only thing I can find to tweak is the covariance_type. Nonetheless, when I run the same data through R's mclust and sklearn's mixture.GMM, I get different fits despite the same number of fitted components. What could I change in mixture.GMM to indicate I am fitting a univariate model with variable variance?
mclust code:
function(x){Mclust(ma78[x,],G=2,modelNames="V",verbose=FALSE)}
GMM code:
gmm = GMM(n_components = 2).fit(data)
With univariate data, the covariance can either be equal or unique (variable). With Mclust these options are modelNames = "E" or "V", respectively.
With sklearn, they appear to be covariance_type = "tied" or "full". Possibly, something like this for a variable-variance Gaussian mixture model:
gmm = mixture.GaussianMixture(n_components = 2, covariance_type='full').fit(data)
Even using Mclust or sklearn alone, there can be instances where you do not get the same parameter values across different runs; this is because the estimates can depend on the initial values. One way to avoid this is to use a larger number of starts, if such an option is available.
I found the answer on stats.stackexchange. The only thing you have to do is reshape your data with data.reshape(-1, 1) before you pass it into sklearn.mixture.GaussianMixture.
Andreas
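Putting the two answers together, a minimal sketch for univariate data; the array is illustrative, and n_init is optional, included only to address the initialisation sensitivity mentioned above:

import numpy as np
from sklearn import mixture

data = np.array([1.2, 0.8, 1.1, 5.3, 4.9, 5.6])  # illustrative univariate sample

# sklearn expects a 2-D array, hence the reshape; covariance_type='full'
# gives each component its own variance (analogous to mclust modelNames="V").
gmm = mixture.GaussianMixture(n_components=2,
                              covariance_type='full',
                              n_init=10).fit(data.reshape(-1, 1))
print(gmm.means_.ravel(), gmm.covariances_.ravel())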
I have created a sequential model in CNTK and pass this model into a loss function like the following:
ce = cross_entropy_with_softmax(model, labels)
As mentioned here, and since I have a multilabel classifier, I want to use a proper loss function. The problem is that I cannot find any proper documentation for these loss functions in Python. Is there any suggestion or sample code for this requirement?
I should note that I found these alternatives (logistic and weighted logistic) in the BrainScript language, but not in Python.
"my data has more than one label (three label) and each label has more than two values (30 different values)"
Do I understand right, you have 3 network outputs and associated labels, and each one is a 1-in-30 classifier? Then it seems you can just add three cross_entropy_with_softmax() values. Is that what you want?
E.g. if the model function returns a triple (ending in something like return combine([z1, z2, z3])), then your criterion function that you pass to Trainer could look like this (if you don't use Python 3, the syntax is a little different):
from cntk import Function, Trainer, cross_entropy_with_softmax
from cntk.layers.typing import Tensor, SparseTensor

@Function
def my_criterion(input: Tensor[input_dim], labels1: SparseTensor[30],
                 labels2: SparseTensor[30], labels3: SparseTensor[30]):
    z1, z2, z3 = my_model(input).outputs
    loss = cross_entropy_with_softmax(z1, labels1) + \
           cross_entropy_with_softmax(z2, labels2) + \
           cross_entropy_with_softmax(z3, labels3)
    return loss

learner = ...
trainer = Trainer(None, my_criterion, learner)

# in the minibatch loop:
input_mb, L1_mb, L2_mb, L3_mb = my_next_minibatch()
trainer.train_minibatch(my_criterion.argument_map(input_mb, L1_mb, L2_mb, L3_mb))
Update (based on comments below): If you are using a sequential model then you are probably interested in taking a sum over all positions in the sequence of the loss at each position. cross_entropy_with_softmax is appropriate for the per-position loss and CNTK will automatically compute the sum of the loss values over all positions in the sequence.
Note that the terminology "multilabel" is non-standard here, as it typically refers to problems with multiple binary labels. The wiki page you link to refers to that case, which is different from what you are doing.
Original answer (valid for the actual multilabel case): You will want to use binary_cross_entropy or weighted_binary_cross_entropy. (We decided to rename Logistic when porting this to Python.) At the time of this writing these operations only support {0,1} labels. If your labels are in (0,1) then you will need to define the loss yourself, for example:
import cntk as C
# negative Bernoulli log-likelihood (note the leading minus, so the trainer minimizes it)
my_bce = -(label * C.log(model) + (1 - label) * C.log(1 - model))
Currently, most operators are in the cntk.ops package and documented here. The only exception is the sequence-related operators, which reside in cntk.ops.sequence.
We have plans to restructure the operator space (without breaking backwards compatibility) to increase discoverability.
For your particular case, cross_entropy_with_softmax seems to be a reasonable choice, and you can find its documentation with examples here. Please also check out this Jupyter Notebook for a complete example.
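As a small self-contained illustration of cross_entropy_with_softmax (the shapes and values below are made up for the example):

import numpy as np
import cntk as C

num_classes = 30
z = C.input_variable(num_classes)       # network output (logits)
labels = C.input_variable(num_classes)  # one-hot target

loss = C.cross_entropy_with_softmax(z, labels)

# Evaluate the loss on a single dummy sample
logits = np.zeros((1, num_classes), dtype=np.float32)
target = np.zeros((1, num_classes), dtype=np.float32)
target[0, 3] = 1
print(loss.eval({z: logits, labels: target}))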
Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got an error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np

rng = np.random.RandomState(0)
boston = datasets.load_boston()
X = boston.data
y = boston.target
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -999
label_propagation.LabelSpreading().fit(X, y_30)
It fails with "ValueError: Unknown label type: 'continuous'" on the label_propagation.LabelSpreading().fit(X, y_30) line.
How should I solve the problem? Thanks a lot.
It looks like an error in the documentation; the code itself is clearly classification-only (see the beginning of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the check_classification_targets call and use a "regression-like" method, but it would not be true regression, since you would never "propagate" any value that is not encountered in the training set; you would simply treat each regression value as a class identifier. And you would be unable to use the value -1, since it is a codename for "unlabeled"...
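If you still want an approximation along those lines, one option is to discretize the target into bins and treat the bin indices as class labels. A hedged sketch; the bin edges are arbitrary, and note that -1 (not -999, as in the question) is the value scikit-learn expects for unlabeled points:

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)
boston = datasets.load_boston()
X, y = boston.data, boston.target

# Discretize the continuous target into five bins (arbitrary choice)
bins = np.percentile(y, [20, 40, 60, 80])
y_binned = np.digitize(y, bins)

# Mark ~30% of the points as unlabeled with -1, which is what LabelSpreading expects
y_semi = np.copy(y_binned)
y_semi[rng.rand(len(y)) < 0.3] = -1

model = LabelSpreading().fit(X, y_semi)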