I have a set of coefficients from a trained model but I don't have access to the model itself or training dataset. I'd like to create an instance of H2OGeneralizedLinearEstimator and set the coefficients manually to use the model for prediction.
The first thing I tried was (this is an example to reproduce the error):
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.frame import H2OFrame
h2o.init()
# creating some test dataset
test = {"x":[0,1,2], "y":[0,0,1]}
df = H2OFrame(python_obj=test)
glm = H2OGeneralizedLinearEstimator(family='binomial', model_id='logreg')
# setting the coefficients
glm.coef = {'Intercept':0, 'x':1}
# predict
glm.predict(test_data=df)
This throws an error:
H2OResponseError: Server error
water.exceptions.H2OKeyNotFoundArgumentException: Error: Object
'logreg' not found in function: predict for argument: model
I also tried to set glm.params keys based on the keys of a similar trained model:
for key in trained.params.keys():
    glm.params.__setitem__(key, trained.params[key])
but this doesn't populate glm.params (glm.params = {}).
It looks like you want to use the function makeGLMModel.
This is further described in the documentation, and I will repost here for your convenience:
Modifying or Creating a Custom GLM Model
In R and python, the makeGLMModel call can be used to create an H2O model from given coefficients. It needs a source GLM model trained on the same dataset to extract the dataset information. To make a custom GLM model from R or python:
R: call h2o.makeGLMModel. This takes a model, a vector of coefficients, and (optional) decision threshold as parameters.
Python: H2OGeneralizedLinearEstimator.makeGLMModel (static method) takes a model, a dictionary containing coefficients, and (optional) decision threshold as parameters.
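If no source model is available to pass to makeGLMModel, the scores of a binomial GLM can still be reproduced outside H2O, because the prediction is the logistic function applied to the linear predictor. The following is a minimal sketch under that assumption and does not use H2O's API: predict_proba is a hypothetical helper, the coefficients are the ones from the question, and it assumes all features are numeric and the coefficients are on the original (non-standardized) feature scale.

```python
import math

def predict_proba(coefs, row):
    """Return P(y = 1) for one row of a binomial GLM with a logit link.

    coefs: dict of coefficient name -> value, including 'Intercept'.
    row:   dict of feature name -> numeric value.
    """
    # Linear predictor: intercept plus the weighted sum of the features.
    eta = coefs.get("Intercept", 0.0)
    for name, value in row.items():
        eta += coefs[name] * value
    # The inverse logit link maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-eta))

# Coefficients and x values taken from the question's example.
coefs = {"Intercept": 0, "x": 1}
probs = [predict_proba(coefs, {"x": x}) for x in [0, 1, 2]]
print([round(p, 3) for p in probs])  # [0.5, 0.731, 0.881]
```

Categorical features would need to be expanded into indicator columns that match the coefficient names before this sketch applies.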
Related
I have been following along with this really helpful XGBoost tutorial on Medium (code used towards bottom of article): https://medium.com/analytics-vidhya/random-forest-and-xgboost-on-amazon-sagemaker-and-aws-lambda-29abd9467795.
To-date, I've been able to get data appropriately formatted for ML purposes, a model created based on training data, and then test data fed through the model to give useful results.
Whenever I leave and come back to work more on the model or feed in new test data however, I find I need to re-run all model creation steps in order to make any further predictions. Instead I would like to just call my already created model endpoint based on the Image_URI and feed in new data.
Current steps performed:
Model Training
xgb = sagemaker.estimator.Estimator(containers[my_region],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(eta=0.06,
                        alpha=0.8,
                        lambda_bias=0.8,
                        gamma=50,
                        min_child_weight=6,
                        subsample=0.5,
                        silent=0,
                        early_stopping_rounds=5,
                        objective='reg:linear',
                        num_round=1000)
xgb.fit({'train': s3_input_train})
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')
Evaluation
test_data_array = test_data.drop([ 'price','id','sqft_above','date'], axis=1).values #load the data into an array
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)
from sklearn.metrics import r2_score
print("R2 score : %.2f" % r2_score(test_data['price'],predictions_array))
It seems that this particular line:
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
needs to be rewritten so that it references the model location instead of xgb_predictor.
I have tried the following
trained_model = sagemaker.model.Model(
model_data='s3://{}/{}/output/xgboost-2020-11-10-00-00/output/model.tar.gz'.format(bucket_name, prefix),
image_uri='XXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
role=role) # your role here; could be different name
trained_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
and then replaced
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
with
trained_model.serializer = csv_serializer # set the serializer type
predictions = trained_model.predict(test_data_array).decode('utf-8') # predict!
but I get the following error:
AttributeError: 'Model' object has no attribute 'predict'
That's a good question :) I agree, many of the official tutorials tend to show the full train-to-invoke pipeline and don't emphasize enough that each step can be done separately. In your specific case, when you want to invoke an already-deployed endpoint, you can either (A) use the invoke API call in one of the numerous SDKs (for example the CLI or boto3), or (B) instantiate a predictor with the high-level Python SDK, either the generic sagemaker.predictor.Predictor class or its XGBoost-specific child, sagemaker.xgboost.model.XGBoostPredictor, as illustrated below:
from sagemaker.xgboost.model import XGBoostPredictor
predictor = XGBoostPredictor(endpoint_name='your-endpoint')
predictor.predict('<payload>')
Similar question: How to use a pretrained model from s3 to predict some data?
Note:
If you want the model.deploy() call to return a predictor, your model must be instantiated with a predictor_cls. This is optional: you can also deploy a model first and then invoke it as a separate step with the above technique.
Endpoints create charges even if you don't invoke them; they are charged per uptime. So if you don't need an always-on endpoint, don't hesitate to shut it down to minimize costs.
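Option (A) from the answer above can be sketched roughly as follows. This is an illustration, not the author's code: rows_to_csv, parse_predictions and invoke_csv_endpoint are hypothetical helpers, "your-endpoint" is a placeholder, and the sketch assumes the endpoint accepts text/csv and returns comma- or newline-separated numbers. The runtime client is passed in so the helpers themselves have no AWS dependency.

```python
def rows_to_csv(rows):
    """Serialize rows of numbers as CSV text, one line per row."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)

def parse_predictions(text):
    """Parse a comma- or newline-separated response body into floats."""
    tokens = text.replace("\n", ",").split(",")
    return [float(t) for t in tokens if t.strip()]

def invoke_csv_endpoint(runtime_client, endpoint_name, rows):
    """Call an already-deployed endpoint; no training or deploy step runs."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=rows_to_csv(rows),
    )
    # The response body is a stream of bytes that must be read and decoded.
    return parse_predictions(response["Body"].read().decode("utf-8"))

# Usage (requires AWS credentials and a running endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   preds = invoke_csv_endpoint(runtime, "your-endpoint", test_data_array)
```

Because the endpoint name is the only link to the trained model, this can run in a fresh session without re-creating the estimator.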
I'm using sklearn's linear implementation of the SVM classifier, LinearSVC.
I don't use it directly; I wrap it with CalibratedClassifierCV to get probabilities at prediction time, like:
model = CalibratedClassifierCV(LinearSVC(random_state=0))
After fitting the model, I tried to get coef_ to print the top features, following the post Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib, but I got this error:
coef = classifier.coef_.ravel()
AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'
How can I get the coefficients when I wrap the classifier with a calibrator? I'm not tied to this approach, so if there is another way to get the feature importances, it would be welcome.
coef_ is not an attribute of CalibratedClassifierCV; it is an attribute of the base_estimator, which is a LinearSVC in your case. You can access the fitted base estimators via calibrated_classifiers_, which is a list of the fitted calibrated models (its length depends on your cv value). Here is some sample code you can adapt to your needs.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0))
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
[<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57c18>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0aec080>]
In this case my cv is three, so three models were built. I would simply loop through them and take the average of their coefficients.
coef_avg = 0
for i in model.calibrated_classifiers_:
    coef_avg = coef_avg + i.base_estimator.coef_
coef_avg = coef_avg/len(model.calibrated_classifiers_)
array([[ 0.16464871, 0.45680981, -0.77801375, -0.4170196 ],
[ 0.1238834 , -0.89117967, 0.35451826, -0.89231957],
[-0.83826029, -0.9237139 , 1.30772955, 1.67592916]])
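To get from the averaged coefficients to the "top features" the question asks about, one option is to rank features by absolute coefficient value for each class. The sketch below is only an illustration: top_features is a hypothetical helper, the matrix is copied from the output above, and the names are the iris feature names. Comparing raw coefficients this way assumes the features are on comparable scales.

```python
# Averaged coefficients from the output above:
# one row per class, one column per feature.
coef_avg = [
    [0.16464871, 0.45680981, -0.77801375, -0.4170196],
    [0.1238834, -0.89117967, 0.35451826, -0.89231957],
    [-0.83826029, -0.9237139, 1.30772955, 1.67592916],
]
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

def top_features(coef_row, names, k=2):
    """Return the k (name, coefficient) pairs with the largest magnitude."""
    ranked = sorted(zip(names, coef_row), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]

# Print the two most influential features for each class.
for class_index, row in enumerate(coef_avg):
    print(class_index, top_features(row, feature_names))
```

The sign of each coefficient is kept in the result, so it still shows whether a feature pushes towards or away from the class.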
Note: Starting from sklearn version 0.24, CalibratedClassifierCV constructor exposes an ensemble argument, that, if set to False (assuming cv is not set to "prefit"), makes CalibratedClassifierCV expose only one calibrated classifier trained using all training data. This means we no longer need to loop over all calibrated_classifiers_ at prediction time:
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# Returns a list with one element, [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>]
(using an example above, given by Parthasarathy)
I train my classifier using DeepPavlov, and when I call the trained model on a sample, it returns only one class label, but I want the probabilities of every class. I could not find any function parameters that would allow me to get the probabilities.
Has anyone encountered this problem? Thanks!
from deeppavlov import configs, train_model
model = train_model(configs.classifiers.intents_snips)
model(['Some sentence'])
I want the output to be something like an np.array whose length is the number of classes, but the current output is a single label such as ['PlayMusic'].
You can change the chainer.out parameter of your config to ["y_pred_probas"] before inferring, but you will most likely also need to update train.metrics if you want to train your model on the same config.
Alternatively you can call your model like
model.compute(['Some sentence'], targets=["y_pred_probas"])
To get the class indexes, you can run
dict(model['classes_vocab'])
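Once the probabilities and the class indexes are both available, they can be paired up per label. The sketch below uses made-up values and a hypothetical helper, label_probabilities; it assumes the vocabulary dictionary maps each label to its index in the probability vector, as the answer implies, which is worth checking against your own model's output.

```python
def label_probabilities(probas, classes_vocab):
    """Pair each class label with its probability.

    probas:        probabilities for one sentence, ordered by class index.
    classes_vocab: dict of label -> index (assumed layout).
    """
    return {label: float(probas[index]) for label, index in classes_vocab.items()}

# Hypothetical values for illustration only.
classes_vocab = {"PlayMusic": 0, "GetWeather": 1, "BookRestaurant": 2}
probas = [0.9, 0.07, 0.03]

scores = label_probabilities(probas, classes_vocab)
best = max(scores, key=scores.get)  # label with the highest probability
print(best, scores[best])
```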
I have a trained ShareBoost model. How do I obtain the model's weight parameters/vectors?
I tried to get the individual linear machines and extract the individual weight vectors but unlike the linear svm it does not seem to have a get_w() method.
Also, even though the C++ ShareBoost class is a subclass of CMachine, the Parameters object obtained from m_parameters (see docs) does not appear to have the parameters available.
The following code is what I have tried.
num_machines = shareboost.get_num_machines()
# num_machines = 2
lm0 = shareboost.get_machine(0)
p0 = lm0.m_parameters
# The following method does not exist
p0.get_parameter(0)
If you are using the C++ API, you can get the weight vector the following way:
auto lm = (CLinearMachine*)shareboost->get_machine(0);
lm->get_w();
Since you are using the Python API, this is currently only possible with the new API of shogun (which is only available in the develop branch at the moment):
lm0 = shareboost.get_machine(0)
weights = lm0.get_real_vector("w")
See some more examples of how to use the new API:
http://shogun.ml/examples/nightly/examples/binary/linear_support_vector_machine.html
Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got an error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np
rng=np.random.RandomState(0)
boston = datasets.load_boston()
X=boston.data
y=boston.target
y_30=np.copy(y)
y_30[rng.rand(len(y))<0.3]=-999
label_propagation.LabelSpreading().fit(X,y_30)
It shows that "ValueError: Unknown label type: 'continuous'" in the label_propagation.LabelSpreading().fit(X,y_30) line.
How should I solve the problem? Thanks a lot.
This looks like an error in the documentation; the code itself is clearly classification-only (see the beginning of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the check_classification_targets call and use a "regression-like" method, but it would not be true regression: you would never propagate any value that is not encountered in the training set, because each regression value is simply treated as a class identifier. You would also be unable to use the value -1, since it is the code for "unlabeled".
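One possible workaround, which the answer does not propose and which is still classification rather than regression, is to bin the continuous target into a few classes and mark unlabeled samples with -1 (not -999 as in the question). The sketch below is a plain-Python illustration with made-up values; to_class_targets and the bin edges are hypothetical.

```python
from bisect import bisect_right

UNLABELED = -1  # marker that label propagation treats as "no label"

def to_class_targets(y, is_labeled, bin_edges):
    """Map continuous targets to bin indexes, using -1 for unlabeled samples.

    bin_edges must be sorted; a value v falls into bin i when
    bin_edges[i - 1] <= v < bin_edges[i].
    """
    targets = []
    for value, labeled in zip(y, is_labeled):
        if labeled:
            targets.append(bisect_right(bin_edges, value))
        else:
            targets.append(UNLABELED)
    return targets

# Made-up target values; two samples are treated as unlabeled.
y = [12.5, 24.0, 31.2, 50.0, 18.9]
is_labeled = [True, False, True, True, False]
bin_edges = [20.0, 35.0]  # three bins: below 20, 20 to 35, 35 and above

y_binned = to_class_targets(y, is_labeled, bin_edges)
print(y_binned)  # [0, -1, 1, 2, -1]
```

The resulting integer list can be passed as y to LabelSpreading().fit(X, y_binned); the predictions are then bin indexes, so the resolution is limited by the chosen bin edges.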