Semi-supervised learning for regression by scikit-learn - python

Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got an error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np
rng = np.random.RandomState(0)
boston = datasets.load_boston()
X = boston.data
y = boston.target
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -999
label_propagation.LabelSpreading().fit(X, y_30)
It raises "ValueError: Unknown label type: 'continuous'" on the label_propagation.LabelSpreading().fit(X, y_30) line.
How should I solve the problem? Thanks a lot.

It looks like an error in the documentation; the code itself is clearly classification-only (beginning of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the check_classification_targets call and use a "regression-like" method, but it would not be true regression, since you would never propagate any value that is not already present in the training set; you would simply treat each regression value as a class identifier. You would also be unable to use the value -1, since it is the code for "unlabeled"...
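If you just need something that runs, one crude work-around along those lines is to discretize the target into bins and let LabelSpreading propagate the bin labels, with -1 marking the unlabeled points. A minimal sketch (not true regression, and assuming the Boston data from the question):
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
import numpy as np
rng = np.random.RandomState(0)
boston = datasets.load_boston()
X, y = boston.data, boston.target
# Discretize the continuous target into 10 bins; each bin index acts as a class label.
bins = np.linspace(y.min(), y.max(), 11)
y_binned = np.digitize(y, bins[1:-1])
# Mark ~30% of the samples as unlabeled with -1 (the value LabelSpreading reserves for "no label").
y_semi = np.copy(y_binned)
y_semi[rng.rand(len(y)) < 0.3] = -1
model = LabelSpreading().fit(X, y_semi)
# Map predicted bin indices back to bin midpoints for a crude "regression" output;
# only the bin midpoints can ever be predicted.
midpoints = (bins[:-1] + bins[1:]) / 2
y_pred = midpoints[model.predict(X)]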

Related

AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'

I'm using sklearn's linear SVM classifier, LinearSVC.
I don't use it directly; I wrap it with CalibratedClassifierCV to get probabilities at prediction time, like:
model = CalibratedClassifierCV(LinearSVC(random_state=0))
After fitting the model, I tried to get the coef_ to print the top features, following this post: Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib, but I got this error:
coef = classifier.coef_.ravel()
AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'
How can I get the coef_ when I wrap the classifier in a calibrator? I'm not wedded to this approach, so if there is another way to get the feature importances it would be welcome.
coef_ is not an attribute of CalibratedClassifierCV; however, it is an attribute of the base_estimator, which is a LinearSVC in your case. You can access your base estimators via calibrated_classifiers_, a list of the fitted models (its length depends on the number of models fit, i.e. on your cv value). Here is sample code you can adapt to your needs.
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0))
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>,
#  <sklearn.calibration._CalibratedClassifier at 0x7f15d0c57c18>,
#  <sklearn.calibration._CalibratedClassifier at 0x7f15d0aec080>]
In this case my cv is three, so three models are built; I simply loop through them and take the average.
coef_avg = 0
for i in model.calibrated_classifiers_:
    coef_avg = coef_avg + i.base_estimator.coef_
coef_avg = coef_avg / len(model.calibrated_classifiers_)
# array([[ 0.16464871,  0.45680981, -0.77801375, -0.4170196 ],
#        [ 0.1238834 , -0.89117967,  0.35451826, -0.89231957],
#        [-0.83826029, -0.9237139 ,  1.30772955,  1.67592916]])
Note: starting from sklearn version 0.24, the CalibratedClassifierCV constructor exposes an ensemble argument which, if set to False (assuming cv is not set to "prefit"), makes CalibratedClassifierCV expose only one calibrated classifier, trained on all the training data. This means we no longer need to loop over all calibrated_classifiers_ at prediction time:
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# Returns a list with one element, [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>]
(using the example above, given by Parthasarathy)
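With ensemble=False there is only one calibrated classifier, so its coefficients can be read directly. A small sketch; note that the attribute holding the fitted inner model is called base_estimator in older sklearn versions and estimator in newer ones, so the lookup below tries both:
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
# Only one calibrated classifier is fitted when ensemble=False.
calibrated = model.calibrated_classifiers_[0]
# Attribute name depends on the sklearn version.
inner = getattr(calibrated, "estimator", None) or getattr(calibrated, "base_estimator", None)
coef = inner.coef_  # shape (3, 4) for the iris data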

PySAL OLS Model: AttributeError: 'OLS' object has no attribute 'predict'

I have divided my data into training and validation samples and have successfully fit my model with three types of linear models. What I cannot figure out is how to apply the model to the validation sample to evaluate the fit. When I attempt to apply the model to the holdout sample (sorry, I know this isn't a reproducible example, but I think the issue is pretty clear; I'm just putting this snippet here for completeness. Please be gentle!):
valid = validation.loc[:, x + [ "sale_amt"]]
holdout1 = m1.predict(valid)
I get the following error message:
AttributeError                            Traceback (most recent call last)
      9 valid = validation.loc[:, x + ["sale_amt"]]
---> 10 holdout1 = m1.predict(valid)
AttributeError: 'OLS' object has no attribute 'predict'
Other Python OLS regression packages have a predict method, but PySAL doesn't seem to. I realize that the coefficients (betas) are available and will pursue applying them to my validation data directly, but I was hoping there is a simple answer that I just missed.
I apologize if it is bad form to answer my own question, but I did come up with a solution. I contacted Daniel Arribas-Bel, one of the PySAL developers, and he helped guide me to the result I was seeking. Note that my PySAL OLS object is named m1, and my validation dataframe is called 'validation':
m1 = ps.model.spreg.OLS(...)
m1.intercept = m1.betas[0] # Get the intercept from the betas array
m1.coefficients = m1.betas[1:len(m1.betas)] # Get the coefficients from the betas array
validation['predicted_price'] = m1.intercept + validation.loc[:, x].dot( m1.coefficients)
Note that this is the method I would use for a non-spatial model adapted for the KNN model I built in PySAL and might not be technically fully correct for a spatial model. Caveat emptor.
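To actually evaluate the fit on the holdout, you can compare those manual predictions with the observed sale_amt. A short sketch under the same assumptions (m1 is the fitted PySAL OLS, validation is the holdout dataframe, and x is the list of predictor column names in the order used to fit m1):
import numpy as np
# Flatten the betas so the dot product returns a plain Series.
intercept = float(m1.betas[0])
coefficients = np.asarray(m1.betas[1:]).ravel()
validation['predicted_price'] = intercept + validation.loc[:, x].dot(coefficients)
# Simple holdout diagnostics: RMSE and R^2 against the observed values.
resid = validation['sale_amt'] - validation['predicted_price']
rmse = np.sqrt((resid ** 2).mean())
r2 = 1 - (resid ** 2).sum() / ((validation['sale_amt'] - validation['sale_amt'].mean()) ** 2).sum()
print(f"holdout RMSE: {rmse:.2f}, holdout R^2: {r2:.3f}")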

Creating an H2OGeneralizedLinearEstimator instance from existing coefficients

I have a set of coefficients from a trained model but I don't have access to the model itself or training dataset. I'd like to create an instance of H2OGeneralizedLinearEstimator and set the coefficients manually to use the model for prediction.
The first thing I tried was (this is an example to reproduce the error):
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.frame import H2OFrame
h2o.init()
# creating some test dataset
test = {"x":[0,1,2], "y":[0,0,1]}
df = H2OFrame(python_obj=test)
glm = H2OGeneralizedLinearEstimator(family='binomial', model_id='logreg')
# setting the coefficients
glm.coef = {'Intercept':0, 'x':1}
# predict
glm.predict(test_data=df)
This throws an error:
H2OResponseError: Server error
water.exceptions.H2OKeyNotFoundArgumentException: Error: Object
'logreg' not found in function: predict for argument: model
I also tried to set glm.params keys based on the keys of a similar trained model:
for key in trained.params.keys():
    glm.params.__setitem__(key, trained.params[key])
but this doesn't populate glm.params (glm.params = {}).
It looks like you want to use the function makeGLMModel.
This is further described in the documentation, and I will repost here for your convenience:
Modifying or Creating a Custom GLM Model
In R and python, the makeGLMModel call can be used to create an H2O model from given coefficients. It needs a source GLM model trained on the same dataset to extract the dataset information. To make a custom GLM model from R or python:
R: call h2o.makeGLMModel. This takes a model, a vector of coefficients, and (optional) decision threshold as parameters.
Python: H2OGeneralizedLinearEstimator.makeGLMModel (static method) takes a model, a dictionary containing coefficients, and an (optional) decision threshold as parameters.
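A rough Python sketch of what the documentation describes, assuming you can train (or already have) a source GLM on a frame with the same columns; the coefficient dictionary is then swapped in via makeGLMModel (exact keyword names may differ slightly between h2o versions, so treat this as an outline):
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.frame import H2OFrame
h2o.init()
# A small frame with the same columns the hand-made coefficients refer to.
df = H2OFrame(python_obj={"x": [0, 1, 2], "y": [0, 0, 1]})
df["y"] = df["y"].asfactor()
# Source model: only needed so h2o can pick up the dataset/column metadata.
source = H2OGeneralizedLinearEstimator(family="binomial")
source.train(x=["x"], y="y", training_frame=df)
# Build a new GLM from the given coefficients and use it for prediction.
custom = H2OGeneralizedLinearEstimator.makeGLMModel(model=source, coefs={"Intercept": 0.0, "x": 1.0})
custom.predict(test_data=df)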

sklearn mixture.GMM in python using univariate GMM

In R, mclust has an argument modelNames where you can define which model to implement. I wish to do univariate modeling with variable variances, which is modelNames <- 'V' in mclust, using mixture.GMM in python. However, the only thing I can find to tweak is covariance_type. Nonetheless, when I run the same data through R and through mixture.GMM under sklearn, I get different fits despite the same number of fitted components. What could I change in mixture.GMM to indicate I am using a univariate, variable-variance model?
mclust code:
function(x){Mclust(ma78[x,],G=2,modelNames="V",verbose=FALSE)}
GMM code:
gmm = GMM(n_components = 2).fit(data)
With univariate data, the covariance can either be equal or unique (variable). With Mclust these options are modelNames = "E" or "V", respectively.
With sklearn, they appear to be covariance_type = "tied" or "full". Possibly something like this for a variable-variance Gaussian mixture model:
gmm = mixture.GaussianMixture(n_components = 2, covariance_type='full').fit(data)
Even using Mclust or sklearn alone, there can be instances where you don't get the same parameter values across runs; this is because the estimates can depend on the initial values. One way to mitigate this is to use a larger number of starts, if such an option is available.
Found the answer on stats.stackexchange. The only thing you have to do is reshape your data with data.reshape(-1, 1) before you pass it into sklearn.mixture.GaussianMixture.
Andreas
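Putting the two answers together, a minimal sketch for univariate data with unequal (variable) component variances, reshaping the 1-D array into a column before fitting:
import numpy as np
from sklearn import mixture
rng = np.random.RandomState(0)
# Univariate sample drawn from two components with different variances.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 3, 200)])
# sklearn expects a 2-D array of shape (n_samples, n_features).
X = data.reshape(-1, 1)
# covariance_type='full' lets each component keep its own variance (roughly Mclust's modelNames="V");
# covariance_type='tied' forces a shared variance (roughly modelNames="E").
gmm = mixture.GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.means_.ravel())
print(gmm.covariances_.ravel())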

how to enforce Monotonic Constraints in XGBoost with ScikitLearn?

I built an XGBoost model using scikit-learn and I am pretty happy with it. As fine-tuning to avoid overfitting, I'd like to ensure monotonicity of some features, but there I start facing some difficulties...
As far as I understand, there is no documentation in scikit-learn about xgboost (which I confess I am really surprised about, knowing that this situation has lasted for several months). The only documentation I found is directly on http://xgboost.readthedocs.io
On this website, I found out that monotonicity can be enforced using the "monotone_constraints" option.
I tried to use it in scikit-learn but got the error message "TypeError: __init__() got an unexpected keyword argument 'monotone_constraints'".
Do you know a way to do it?
Here is the code I wrote in python (using spyder):
grid = {'learning_rate': 0.01, 'subsample': 0.5, 'colsample_bytree': 0.5,
        'max_depth': 6, 'min_child_weight': 10, 'gamma': 1,
        'monotone_constraints': monotonic_indexes}
# 'monotone_constraints' ~= "(1,-1)"
m07_xgm06 = xgb.XGBClassifier(n_estimators=2000, **grid)
m07_xgm06.fit(X_train_v01_oe, Label_train, early_stopping_rounds=10, eval_metric="logloss",
              eval_set=[(X_test1_v01_oe, Label_test1)])
In order to do this using the xgboost sklearn API, you need to upgrade to xgboost 0.81. They fixed the ability to set parameters controlled via kwargs as part of this PR:
https://github.com/dmlc/xgboost/pull/3791
The XGBoost scikit-learn API currently (0.6a2) doesn't support monotone_constraints. You can use the native Python API instead. Take a look at the example.
This code in the example can be removed:
params_constr['updater'] = "grow_monotone_colmaker,prune"
How would you expect monotone constraints to work for a general classification problem where the response might have more than two levels? All the examples I've seen relating to this functionality are for regression problems. If your classification response only has two levels, try switching to regression on an indicator variable and then choosing an appropriate score threshold for classification.
This feature appears to work as of the latest xgboost / scikit-learn, provided that you use an XGBRegressor rather than an XGBClassifier and set monotone_constraints via kwargs.
The syntax is like this:
params = {
    'monotone_constraints': '(-1,0,1)'
}
normalised_weighted_poisson_model = XGBRegressor(**params)
In this example, there is a negative constraint on column 1 in the training data, no constraint on column 2, and a positive constraint on column 3. It is up to you to keep track of which is which - you cannot refer to columns by name, only by position, and you must specify an entry in the constraint tuple for every column in your training data.
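A small end-to-end sketch of that syntax on made-up data (feature layout and values are purely illustrative), assuming a reasonably recent xgboost:
import numpy as np
from xgboost import XGBRegressor
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
# The target decreases with column 1, ignores column 2, and increases with column 3.
y = -2 * X[:, 0] + 3 * X[:, 2] + rng.normal(0, 0.1, 500)
params = {
    # One entry per training column, by position: -1 decreasing, 0 unconstrained, +1 increasing.
    'monotone_constraints': '(-1,0,1)'
}
model = XGBRegressor(n_estimators=200, **params)
model.fit(X, y)
# Predictions are now guaranteed to be non-increasing in the first column
# and non-decreasing in the third, whatever noise is in the data.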
