get probability from xgb.train() - python

I am new to Python and machine learning. I have searched the internet for my question and tried the solutions people suggested, but I still cannot get it to work. I would really appreciate any help.
I am working on my first XGBoost model. I tuned the parameters using xgb.XGBClassifier and would now like to enforce monotonicity on the model variables. It seems I have to use xgb.train() to enforce monotonicity, as shown in my code below.
The Booster returned by xgb.train() has a predict() method but no predict_proba(). So how can I get probabilities from xgb.train()?
I have tried using 'objective':'multi:softprob' instead of 'objective':'binary:logistic' and then score = bst_constr.predict(dtrain), but the score does not seem right to me.
Thank you so much.
params_constr={
'base_score':0.5,
'learning_rate':0.1,
'max_depth':5,
'min_child_weight':100,
'n_estimators':200,
'nthread':-1,
'objective':'binary:logistic',
'seed':2018,
'eval_metric':'auc'
}
params_constr['monotone_constraints'] = "(1,1,0,1,-1,-1,0,0,1,-1,1,0,1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,)"
dtrain = xgb.DMatrix(X_train, label = y_train)
bst_constr = xgb.train(params_constr, dtrain)
X_test['score']=bst_constr.predict_proba(X_test)[:,1]
AttributeError: 'Booster' object has no attribute 'predict_proba'

As I understand it, you want the probability of each class at prediction time. There are two options.
You are using the XGBoost native API. With 'objective':'binary:logistic', bst_constr.predict already returns the probability of the positive class, so call bst_constr.predict(xgb.DMatrix(X_test)) instead of bst_constr.predict_proba. For more than two classes, set 'objective':'multi:softprob' together with 'num_class', and predict returns one probability per class.
XGBoost also provides a scikit-learn API. In that case, instantiate the model with bst_constr = xgb.XGBClassifier(**params_constr) and train it with bst_constr.fit(). You can then call bst_constr.predict_proba to obtain what you want. See the Scikit-Learn API section of the XGBoost documentation for more details.
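For the native API route, here is a minimal sketch of a predict_proba-style wrapper. It assumes a Booster trained with 'binary:logistic', whose predict returns one positive-class probability per row. The helper name predict_proba_like and the stand-in _StubBooster are hypothetical, introduced only so the snippet runs without a trained model.

```python
import numpy as np


def predict_proba_like(booster, data):
    """Return an (n_rows, 2) array [P(class 0), P(class 1)].

    Assumes `booster.predict(data)` yields the positive-class probability
    for each row, as a Booster trained with 'binary:logistic' does.
    """
    p1 = np.asarray(booster.predict(data), dtype=float).ravel()
    return np.column_stack([1.0 - p1, p1])


class _StubBooster:
    """Hypothetical stand-in for a trained Booster, for illustration only."""

    def predict(self, data):
        # Pretend these are the model's positive-class probabilities.
        return np.array([0.10, 0.80, 0.55])


proba = predict_proba_like(_StubBooster(), data=None)
score = proba[:, 1]  # the same column the question tried to read
print(proba)
```

With the real model, the call would look like predict_proba_like(bst_constr, xgb.DMatrix(X_test))[:, 1].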

Related

Query regarding the probabilities obtained from Logistic regression

I am implementing a classification task which is a 985 class classification problem.
I have trained my model and predicted the class of X_test data.
I am using logistic regression. When I call clf.predict(X_test[0]), I get the correct class.
But when I look at the probabilities from clf.predict_proba(X_test[0]), the correct class does not have the highest probability; another class does. I don't understand why this is happening. I have checked other inputs and the same thing happens for them too.
This is really hard to troubleshoot without an example to replicate. However, I suspect that there may be an indexing problem. Try restarting the notebook kernel if you're using a notebook, and check for indexing problems.
Also, if you could post more details or examples of this happening, it would help.
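One indexing issue worth ruling out: the columns of predict_proba follow clf.classes_, not the raw label values, so the argmax column has to be mapped back through classes_. The sketch below uses synthetic data (the labels 10, 20, 30 are an arbitrary assumption) to show the check. Note that predict and predict_proba expect 2-D input, hence X[:1] rather than X[0].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters with non-contiguous labels.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
labels = np.array([10, 20, 30])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
y = np.repeat(labels, 30)

clf = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[:1]                       # keep the sample 2-D: shape (1, n_features)
proba = clf.predict_proba(sample)[0]
col = int(np.argmax(proba))          # column index, NOT a class label
label_from_proba = clf.classes_[col]

print("predict:          ", clf.predict(sample)[0])
print("argmax column:    ", col)
print("classes_[argmax]: ", label_from_proba)
```

If the argmax column is compared directly against the label, the two will appear to disagree whenever the labels are not exactly 0, 1, 2, ... in sorted order.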

Statsmodels Mixed Linear Model predictions

I am estimating a Mixed Linear Model using the statsmodels MixedLM package in Python. After fitting the model, I now want to make predictions but am struggling to understand the 'predict' method.
The statsmodels documentation (http://www.statsmodels.org/dev/generated/statsmodels.regression.mixed_linear_model.MixedLM.predict.html) suggests that the predict method takes an array containing the parameters of the model that has been estimated. How can I retrieve this array?
y = raw_data['dependent_var']
X = raw_data[['var1', 'var2', 'var3']]
groups = raw_data['person_id']
model = sm.MixedLM(endog=y, exog=X, groups=groups)
result = model.fit()
I know I am a few months late, but it's worth answering in case someone else has the same question. The required params are available in the result object as result.fe_params:
model.predict(result.fe_params, exog=xtest)
or, using the result object directly,
result.predict(exog=xtest)
To answer user11806155's question: to make predictions purely from the fixed effects, you can do
model.predict(result.fe_params, exog=xtest)
To make predictions from the random effects, change the parameters to those of a particular group (e.g. "group1"):
model.predict(result.random_effects["group1"], exog=xtest)
I assume the features in the test data must be in the same order as the parameters you pass to the model. You can add the two predictions together to get the prediction for a specific group.
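The arithmetic behind "add them together" can be sketched with plain NumPy. This assumes a random-intercept-only model, so each group's random effect is a single number added to the fixed-effects prediction. All values below are made up; fe_params and random_intercepts merely stand in for result.fe_params and result.random_effects.

```python
import numpy as np

# Hypothetical fitted values (stand-ins for result.fe_params and
# result.random_effects from a random-intercept model).
fe_params = np.array([2.0, -1.0, 0.5])             # one coefficient per exog column
random_intercepts = {"group1": 0.3, "group2": -0.4}

# New rows to predict for, columns in the same order as the training exog.
xtest = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, 4.0]])

fixed_part = xtest @ fe_params                      # population-level prediction
group1_pred = fixed_part + random_intercepts["group1"]

print(fixed_part)
print(group1_pred)
```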

How to get predictions out of tensorflow model after you've used tf.group on your optimizers

I'm trying to write something similar to Google's wide and deep learning after running into difficulties doing multi-class classification (12 classes) with the sklearn API. I've tried to follow the advice in a couple of posts and used tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work, but I am trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model learns to weight the logistic and deep models differently, but I don't know how to get these weights out so that I can form the right combination of the two models' predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?
tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., weights), you should pass it to session.run() explicitly, either in the same call as the training step, or in a separate session.run() invocation. You can pass a list of values to session.run(), for example, your tf.group() expression, as well as a Tensor whose value you would like to compute.
Hope that helps!

Cannot assign class_weight to RandomForestClassifier in Scikit Learn

I recently started using the scikit-learn package to implement Random Forests on my data set. I am trying to build a multi-class model with RandomForestClassifier. However, I think my classes are imbalanced, so I want to use the class_weight="auto" parameter:
RFC = RandomForestClassifier(n_estimators = int(trees),class_weight="auto").fit(X_train, y_train)
However, when I try to run it, I get
__init__() got an unexpected keyword argument 'class_weight'
I checked other questions, since I thought I was using the wrong notation, but they all seem to use class_weight="auto" in the same way.
Note: the RF works without the class_weight parameter. I just want to try to improve my results because I think the data is imbalanced.
Thanks (if I did something wrong with the formatting or the question, I will edit it; this is my first question here).
I made the mistake of checking the wrong version list. I run in IPython, and although I had updated the package on the server, the update did not reach the IPython environment. Every time I checked the version with conda, the IPython environment was not active.
I updated it and it worked, thanks.
Sorry, but thanks for looking into it.
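A quick way to catch this kind of mismatch is to ask the running interpreter itself which executable and package version it sees. The sketch below uses only the standard library (Python 3.8+); the helper name installed_version is hypothetical.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package` in THIS interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run this inside the same IPython/Jupyter session that raised the error.
print("interpreter: ", sys.executable)
print("scikit-learn:", installed_version("scikit-learn"))
```

If the path or version differs from what conda reports in your shell, the session is running in a different environment. Also note that in recent scikit-learn releases the option is spelled class_weight="balanced"; "auto" is no longer accepted.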

No Support Vector Attribute

The project I am currently working on uses the sklearn svm.SVC class, and at one point the code instantiates the following:
self.classifier = OneVsRestClassifier(SVC(kernel = 'linear', probability = True))
After fitting the classifier, I try to inspect its support_vectors_ or support_ attributes. However, I get the following error:
'SVC' object has no attribute 'support_vectors_'
I tried changing the kernel to 'poly' or 'rbf', but this does not fix the error. Why is this happening? Shouldn't any linear SVM have something (i.e. 'None' at the least) for this attribute? I am using sklearn version 0.15.1 if that helps.
Thanks!
Assuming you obtained the error message by trying to evaluate
self.classifier.estimator.support_vectors_
observe that OneVsRestClassifier clones your estimator as many times as there are classes and fits as many of them to your data. They can be found in the estimators_ variable of the ovr. Try
self.classifier.estimators_[0].support_vectors_
That will give you the support vectors for the first OVR problem.
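A small self-contained sketch of the same idea, using the iris data set as an assumed stand-in for your data: the estimator attribute holds the unfitted template, while the fitted per-class clones live in estimators_.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One fitted binary SVC per class; each has its own support vectors.
for cls, est in zip(ovr.classes_, ovr.estimators_):
    print(cls, est.support_vectors_.shape)
```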
